Azure Service Fabric: On State Management

Prologue

Making the right design decision is a mix of experience, knowledge, and a bit of art. The phrase “right design decision” itself requires understanding that design is not an isolated abstraction; it has a context that spans requirements, the environment, and the people who are about to write the software.

Service Fabric employs a complex set of procedures that ensure that your data is consistent and resilient (e.g., if a cluster node goes down).

Understanding what goes on under the hood of your application platform is key to making better design decisions. Today I want to attempt to describe how state is managed in Service Fabric, the different choices you have, and their implications for the performance and availability of your Service Fabric application.

I strongly recommend going through Service Fabric: Partitions before reading this one.

Replication Process & Copy Process

Service Fabric employs two distinct processes to ensure consistency across a replica set (the set of replicas in a given partition of a stateful service).

Replication Process: used when writing state data in your service.

Copy Process: used to build a new replica when an existing Active Secondary fails, and to recover an Active Secondary that was down for a while.

Both processes are described below.

Replication Process

So your Primary goes online, starts listening to requests, and is about to write data using one of the reliable collection classes. Here is a breakdown of events:

  1. The Primary wraps the write operation (which can be an add, update, or delete) and puts it in a queue. This operation is scoped to a transaction. At the other end of the queue, your Active Secondaries pump out the operations as described in the next step. No data has been written yet to the Primary's data store.
  2. The Active Secondaries pick up the operation(s), apply them locally, and acknowledge the Replicator one by one.
  3. The Replicator delivers a callback to the Primary once the write has succeeded.
  4. The Primary can then apply the data locally and return to the original caller.
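The four steps above can be condensed into a toy model. All names here are hypothetical and the real replicator is asynchronous and pipelined; this sketch only shows the ordering guarantee: the Primary applies a write locally only after a quorum has acknowledged it.

```python
import math

class ToySecondary:
    """An Active Secondary: applies replicated operations locally."""
    def __init__(self):
        self.store = {}

    def apply(self, key, value):
        # step 2: apply locally, then (implicitly) acknowledge
        self.store[key] = value

class ToyPrimary:
    """A Primary that applies a write locally only after a write quorum
    (floor(n/2) + 1, counting the Primary itself) has acknowledged it."""
    def __init__(self, secondaries):
        self.secondaries = secondaries
        n = len(secondaries) + 1                 # replica set size
        self.quorum = math.floor(n / 2) + 1
        self.store = {}

    def write(self, key, value):
        # step 1: the operation is queued; secondaries pump it out
        acks = 1                                 # the Primary counts itself
        for s in self.secondaries:
            s.apply(key, value)
            acks += 1
        # step 3: the replicator calls back only if a quorum acknowledged
        if acks < self.quorum:
            raise RuntimeError("write quorum not reached")
        # step 4: only now does the Primary apply locally and return
        self.store[key] = value
        return value
```

With a replica set of 3 (one Primary, two secondaries), the quorum is 2, so a single secondary acknowledgement is enough for the write to complete.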

Diving Even Deeper

  • The Replicator will not acknowledge a successful write to the Primary until the operation is “Quorum Written”. A quorum is floor(n/2) + 1, where n is the number of replicas in the set; in other words, more than half of the replicas in the set have successfully applied the operation.
  • Each operation is logically sequenced. The replication process is out of band (from the perspective of the Primary): at any point in time the Primary maintains an internal list of operations waiting to be sent to replicas, operations sent but not yet acknowledged, and operations sent and acknowledged.
  • There is no relation between the data structure semantics (queue vs. dictionary) and the actual storage/replication. All data is replicated and stored the same way.
  • Given the following pseudo code:

Create transaction TX
Update dictionary D (key = K to value of “1”)
commit TX
Create another transaction TX1
Read Dictionary D (Key = K)

What value should you expect when you read the dictionary the second time? Most probably whatever value K had before the update. Remember: changes are not applied locally until they have been replicated to the quorum.

Let us bend our minds even more

  • Active Secondaries keep pumping data out of the replication queue; they don't depend on the Replicator to notify them of new data. This means different Active Secondaries can be at different stages of replication. In the ideal case, a quorum is aligned with the Primary. This affects how you do your reads, as we will describe later.
  • Service Fabric maintains a sort of configuration version for each replica set. This version changes as the formation of the replica set changes (i.e., which replica is currently the Primary). This information is communicated to the replicas.
  • You can run into a situation where an Active Secondary is ahead of the Primary: for example, a Primary that failed to apply changes locally before dying, without notifying the caller of successful writes. Or worse, Service Fabric lost the write quorum (for example, many nodes of the cluster failed at the same time) and the surviving replicas (one of which will be elected Primary) are the slow ones that were not catching up. The details of how this is recovered and communicated to your code (and what control you will have over recovery) are something the Service Fabric team is still working on. Rolling back, loss acceptance, and restore from backup are all techniques that can be used here.

What About My Reads?

The default behavior is to direct reads to the Primary replica to ensure consistent reads (i.e., reads from the latest, most trusted version of the data). However, you can enable listening on your secondaries and redirect reads to them (more on how here).
As you can tell from the discussion above, reads from secondaries can be out-of-sync reads, or better put, ghost data. The current service interfaces do not allow you to know how far ahead or behind a given replica is from the Primary.

The Copy Process

The scenarios above discuss what happens when writes flow from the Primary to the Active Secondaries. But what happens when Service Fabric needs to bring a Secondary up to speed (i.e., synchronize it)? The Secondary in question can be new (e.g., Service Fabric replacing a dead one on a dead node, or moving an existing one to a new node due to resource rebalancing). Obviously, using the replication mechanism for this is overkill. Instead, Service Fabric allows the Secondary to communicate its marker (the last sequence number it applied internally) to the Primary; the Primary can then send back the delta operations for the Secondary to apply. Once the Secondary is brought up to speed, it follows the normal replication process.
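The marker handshake can be sketched as follows (hypothetical shapes; the real protocol streams operations rather than returning a list):

```python
def catch_up(primary_log, secondary_marker):
    """Copy Process sketch: the Secondary reports the last sequence number
    it applied, and the Primary answers with the delta of operations past
    that marker.

    primary_log: ordered list of (sequence_number, operation) tuples.
    """
    return [(seq, op) for seq, op in primary_log if seq > secondary_marker]
```

For example, with a log of sequences 1..3 and a Secondary whose marker is 1, only operations 2 and 3 are shipped; after applying them, the Secondary rejoins normal replication.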

The Fabric Replicator

Up until this moment we have talked about a “Replicator” without further details. Service Fabric ships with the “Fabric Replicator” as the default replicator; it is surfaced to you partially in the “StateManager” member of your stateful service, which holds a reference to the replicator. If you tinker inside the code you will notice that there is a process that couples your service to a state provider/manager (in the IStatefulServiceReplica interface), meaning in theory you can build your own replicator, which, as you can tell, is quite complex.

So What Does This Mean For Me?

Below are various discussion points on design aspects related to replicas and replica sets.

How Fast are My Writes?

You have to understand that, at best, a write takes S/A + N + D/A + Ack, where:
  • S/A is the time for serialization and local calls to the replicator at the Primary.
  • N is the network time to the slowest replica in the quorum (i.e., the slowest of the fastest responders, which happen to be the most up-to-date secondaries).
  • D/A is the time for deserialization and replicator calls at the secondary.
  • Ack is the acknowledgement time on the way back.
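As a toy model (all numbers hypothetical), the governing network time is that of the secondary whose acknowledgement completes the quorum:

```python
def write_latency(s_a, secondary_network_times, d_a, ack, quorum):
    """Rough lower bound on a quorum write.

    s_a     : serialization + local replicator calls at the Primary
    secondary_network_times : one-way network time to each Active Secondary
    d_a     : deserialization + replicator calls at a secondary
    ack     : acknowledgement time on the way back
    quorum  : floor(replica_set_size / 2) + 1, counting the Primary
    """
    acks_needed = quorum - 1        # the Primary counts toward the quorum
    # The write completes when the (quorum - 1)-th fastest secondary acks,
    # so that secondary's network time is the governing one.
    governing = sorted(secondary_network_times)[acks_needed - 1]
    return s_a + governing + d_a + ack
```

With 3 replicas (quorum 2), only the fastest secondary gates the write; with 5 replicas (quorum 3), the 2nd-fastest of the 4 secondaries does.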

Large # of Replicas vs Small # of Replicas

It is really tempting to go with a large number of replicas, but keep in mind the “How fast are my writes?” discussion above and the fact that the write quorum is floor(n/2) + 1. The larger your replica set, the larger your quorum, and the slower your writes will be. I would go with the smallest possible replica set per partition.
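A quick way to see this is to tabulate the quorum for growing replica set sizes; every write must wait for that many acknowledgements:

```python
import math

def write_quorum(replica_set_size):
    """floor(n/2) + 1: the number of replicas (counting the Primary)
    that must apply a write before it is acknowledged."""
    return math.floor(replica_set_size / 2) + 1

# Growing the replica set grows the acks every write must wait for:
growth = {n: write_quorum(n) for n in (3, 5, 7, 9)}
```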

But Really, When Should I Go With a Larger Replica Set?

If you have a system where reads far outnumber writes, and the impact of ghost reads is negligible to your business, then it might make sense to go with a larger number of replicas (and allow reads on secondaries).
As a side note, the replica listening address resolution mechanism (more here) does not keep track of your calls, so you probably want to do simple load balancing (for example, round-robin across all secondary replicas in a partition).
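A naive client-side round-robin over a partition's secondary addresses might look like this (addresses hypothetical; a real client would refresh the list through service resolution whenever the replica set changes):

```python
import itertools

class SecondaryRoundRobin:
    """Cycles through a partition's secondary replica addresses so that
    read traffic is spread evenly across them."""
    def __init__(self, addresses):
        self._cycle = itertools.cycle(addresses)

    def next_address(self):
        return next(self._cycle)
```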

Large Objects (Data Fragment) vs. Small Objects

How big should your objects (those entries in reliable dictionaries and queues) be? Well, consider the following: the sequential nature of replication, and the fact that replicas live on different machines (network overhead). You will end up with “smaller is better”.
A data fragment that is 500 KB is not only slow to replicate to the quorum; it also slows the line of pending replications to replicas trying to catch up. On top of that, it slows down the Copy Process, and hence the recovery and resource rebalancing processes.
Consider techniques for breaking your data down into smaller chunks (writes) and writing them in one transaction.
Also note that reliable collections are kept in memory (on the Primary and on Active Secondaries) as well; that is how the data structure semantics (queue vs. dictionary) are maintained on top of the replicator. That means your entire state lives in the memory of your cluster nodes.
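The chunking advice above can be sketched as a plain helper (names hypothetical): split one large value into fragments, each replicated as its own smaller operation, and write all fragments under a single transaction so the set stays atomic.

```python
def chunk(value: bytes, chunk_size: int):
    """Split one large value into smaller fragments; each fragment is then
    written (and replicated) as its own operation inside one transaction."""
    return [value[i:i + chunk_size] for i in range(0, len(value), chunk_size)]
```

For example, a 500 KB value chunked at 64 KB yields eight operations, each of which clears the replication queue far faster than one 500 KB operation would.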

Epilog

As you can tell, the mechanics of Service Fabric internals are quite complex, yet an understanding of how it works internally is important to building high-performance Service Fabric applications.

till next time

@khnidk

Service Fabric: Hosting, Activation & Isolation

Prologue

If you try to map Service Fabric to your old-style Application Server platform, you will probably end up (after stripping it down to bare metal) defining Service Fabric as an App Server that runs on a multi-server cluster and provides:

  1. Process Management: including availability, partitioning, health and recovery.
  2. Application Management: including isolation, version management and updates.
  3. State Management: including resilience, transaction support and partitioning.

The above, as an informal definition (for a formal definition check here), acts as the basis for interesting patterns such as Actor Models and Microservices. Understanding how your code runs on top of a Service Fabric cluster is key to efficient solution design and architecture.

This post focuses on #1 (Process Management/Model) & #2 (Application Management).

Before we begin: to keep the discussion focused, I consciously decided not to include the following topics. We will be able to cover them in later posts:

  1. Actor Framework.
  2. Service Groups.
  3. Application/Service Package Resources, Configuration or Secret Stores (only focused on Service Package Code).
  4. Host Setup

Note: While I mostly use PowerShell below, the .NET API can also be used, unless stated otherwise.

Overview Of Service Fabric Related Concepts

We start by understanding the concepts and how they relate to each other. Instead of trying to describe them in the abstract, I put together an entity-model-like diagram that puts them in perspective.

[Diagram: ActivationHosting (click to enlarge)]

Let us attempt to describe the monstrous diagram above.

On The Design Time Side

  • An Application Package is described by ApplicationManifest.xml, which defines 1) a Type Name, an identifier used to allow you to create instances of applications, and 2) a Version. For example, the first line of ApplicationManifest.xml declares “HostingActivationApp” as the Application Type Name & “1.0.0.0” as the version:
    <ApplicationManifest ApplicationTypeName="HostingActivationApp" ApplicationTypeVersion="1.0.0.0" xmlns="http://schemas.microsoft.com/2011/01/fabric" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    
  • An Application Package can import multiple Service Packages, each described by a ServiceManifest.xml that declares 1) a Name and 2) a Version, as follows:
    <ServiceManifest Name="HostingActivationPkg"
                     Version="1.0.0.0"
                     xmlns="http://schemas.microsoft.com/2011/01/fabric"
                     xmlns:xsd="http://www.w3.org/2001/XMLSchema"
                     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    
  • Each Service Package defines:
    • Service Types (not to be confused with the “.NET class name”). This name allows you to reference a service for activation irrespective of the actual class that implements it. The type also defines whether it is a Stateful or a Stateless service.
    • Code Packages (with a Name & Version), which define where your service will run. Think of a Code Package as a host definition, a container definition, or a “worker process definition”. The host can be explicit, i.e., an EXE program (typically a console application created by VS.NET for you), or implicit, where Service Fabric runs your code in a worker process created for you. For this post we will focus only on explicit hosts.
    • Note that a Service Package does not define how the service will be partitioned, the number of instances per service, or how many times the service will be activated in an application.
    • The example below defines two Service Types (a stateful service named “ComputeSvc” and a stateless service named “GatewaySvc”). It also defines one Code Package with an EXE host named “HostingActivationSvcs.exe”:
      <ServiceTypes>
          <StatefulServiceType ServiceTypeName="ComputeSvc" HasPersistedState="true"  />
      
          <!-- added another service type to my package-->
          <StatelessServiceType ServiceTypeName="GatewaySvc" />
        </ServiceTypes>
      
        <!-- Code package is your service executable. -->
        <CodePackage Name="Gw.Code" Version="1.0.0.0">
          <EntryPoint>
            <ExeHost>
              <!--
                  this is the host where replicas will be activated,
                  typically Service Fabric runs one instance per node per application instance
               -->
              <Program>HostingActivationSvcs.exe </Program>
            </ExeHost>
          </EntryPoint>
        </CodePackage>
      
  • Back to the Application Package. After importing one or more Service Packages, it can declare:
    • Default Services: each defines a Service Name (used in resolution by Uri, for example fabric:/<App Instance Name>/<Service Name>), a “Service Type”, and a partitioning scheme (or an instance count for stateless services). Default Services are activated automatically once an application instance is created.
    • Service Templates: similar to Default Services, but not activated when an application instance is created; instead, you have to activate them manually via the API or PowerShell. Because of this, a template does not have a Service Name (one has to be provided during activation).
    • The following example of an Application Manifest imports “HostingActivationPkg” by Name & Version and defines:
      • 1 Service Template (unnamed).
      • 3 Default Services (named Worker1, Worker2 & Gw1), which will be activated once an application instance is created.
        
         <ServiceManifestImport>
            <ServiceManifestRef ServiceManifestName="HostingActivationPkg" ServiceManifestVersion="1.0.0.0" />
            <ConfigOverrides />
          </ServiceManifestImport>
        
            <!-- -->
          <ServiceTemplates>
        
            <StatelessService   ServiceTypeName="GatewaySvc" InstanceCount="4">
              <UniformInt64Partition PartitionCount="3" LowKey="-9223372036854775808" HighKey="9223372036854775807"/>
            </StatelessService>
          </ServiceTemplates>
        
            <DefaultServices>
              <!-- this is the initial backend worker staticly defined-->
            <Service Name="Worker1">
              <StatefulService ServiceTypeName="ComputeSvc" TargetReplicaSetSize="3" MinReplicaSetSize="3">
                <UniformInt64Partition PartitionCount="3" LowKey="-9223372036854775808" HighKey="9223372036854775807" />
              </StatefulService>
            </Service>
        
            <!-- this is an additional worker, initially we expect to offload here,
                 notice I am using a different partitioning scheme-->
            <Service Name="worker2">
        
              <StatefulService ServiceTypeName="ComputeSvc" TargetReplicaSetSize="3" MinReplicaSetSize="3">
                <NamedPartition>
                  <Partition Name="London"/>
                  <Partition Name="Paris"/>
                  <Partition Name="Newyork"/>
                </NamedPartition>
              </StatefulService>
            </Service>
        
            <!-- Staticly defining another statless service for the gateway-->
            <Service Name="Gw1">
              <StatelessService ServiceTypeName="GatewaySvc" InstanceCount="-1">
                <SingletonPartition/>
              </StatelessService>
            </Service>
        
          </DefaultServices>
        

On The Runtime Side

  • Application Packages are copied to the cluster using the Copy-ServiceFabricApplicationPackage cmdlet. Interestingly enough, you don't need to provide a name or version, just a path for the Image Store service (more on this in later posts). This means you could copy the package multiple times (which would be a mistake; that is why the diagram shows it as a 1:1 relation).
  • Then you have to register your Application Type once and only once using the Register-ServiceFabricApplicationType cmdlet, passing in the location where you copied the Application Package on the cluster.
  • Once an Application Type is registered, I can create multiple Application Instances. Application Instances are created using the New-ServiceFabricApplication cmdlet, passing in:
    • an Application Instance Name, in the form “fabric:/<arbitrary name>”
    • the Application Type Name
    • the Application Type Version
  • Application Instances are the unit of management within a Service Fabric cluster. Each:
    • Can be upgraded separately.
    • Will have its own hosts (as discussed in Activation & Isolation below). Application Instances (even of the same Application Type) do not share hosts.

Note:

When you deploy your Service Fabric VS.NET project, VS.NET runs the “Package” msbuild target on *.sfproj and produces a package saved in <Solution>/<sfproj>/pkg/<build type>/, then runs through the above cmdlets via the scripts available at <solution>/<sfproj>/scripts/Create-FabricApplication.ps1 & Deploy-FabricApplication.ps1. VS.NET creates the first Application Instance with Instance Name = “Application Type Name”, which tends to be confusing.

Activation & Isolation

For the sake of the next discussion, we will assume that Fault Domain, Update Domain & Node are the same thing and will just refer to them as the Node or Cluster Node. We will also completely ignore Placement Constraints and the “TargetReplicaSetSize” & “MinReplicaSetSize” values.

In previous posts I referred to your code running in a process. Let us attempt to detail that now. Imagine having an Application Type T registered in a cluster of 3 nodes, with one Service Type S defined in the Default Services section as S1, which has Code Package C with 1 named partition and 3 replicas. If I create a new Application Instance I1, my cluster will look like this:

 

ClusterView1

Service Fabric does not like to host replicas of the same partition on the same node. Because I have 3 replicas of S1, it will decide to create 3 hosts (one per node). This is typically done by running your code (your EXE process).

Now, remember that I can choose to activate services in an Application Instance (using either Service Templates or Service Types). My new services can follow a different partitioning scheme. The following example activates S2 with 3 uniform partitions and 3 replicas each, which will result in a different view of the cluster.

New-ServiceFabricService -Stateful `
                         -PartitionSchemeUniformInt64 `
                         -LowKey 1 `
                         -HighKey 3 `
                         -PartitionCount 3  `
                         -ApplicationName "fabric:/I1" `
                         -ServiceName "fabric:/I1/S2" `
                         -ServiceTypeName "S" `
                         -HasPersistedState `
                         -TargetReplicaSetSize 3 `
                         -MinReplicaSetSize 3

Now my cluster will look like this:

ClusterView2

Because S2 is activated within an existing Application Instance, Service Fabric will not create new hosts; instead, S2 will be activated within the same host processes already running. That means multiple services within the same Application Instance, and their replicas, will share the same Windows Server process, which means they share memory, the handle table, the address space, and so on. In this Service Fabric preview version, CLR-based services also share the AppDomain.

If I have other Service Types that run in the same Code Package (as in my sample above), they will be activated within the same existing hosts. A Service Type can explicitly define a different Code Package (for example, by using a different Service Package); in that case, different host instances running different Code Packages (processes) will be instantiated on the cluster nodes to run instances of that Service Type within the same Application Instance.
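The activation rule described so far can be condensed into a sketch (names hypothetical): Service Fabric creates roughly one host process per Application Instance, per Code Package, per node hosting replicas of that package. Replicas sharing an instance and a package share those processes; different Application Instances never do.

```python
def expected_host_processes(app_instances, code_packages, nodes):
    """Enumerate the host processes the activation rule implies: one per
    (Application Instance, Code Package, node) combination. Replicas of
    different services that share an instance and a package land in the
    same process; different Application Instances get their own."""
    return {(app, pkg, node)
            for app in app_instances
            for pkg in code_packages
            for node in nodes}
```

For instance, one application instance with one code package spread over 3 nodes means 3 host processes; creating a second instance of the same type doubles that to 6.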

How will my cluster look if I create a new instance I2 of my Application Type T (for example, using the PowerShell below)?

New-ServiceFabricApplication -ApplicationName "fabric:/I2" `
                             -ApplicationTypeName "T" `
                             -ApplicationTypeVersion "1.0.0.0"

ClusterView3

Because Service Fabric treats Application Instances as isolation boundaries, it will create a new instance of your Code Package on the cluster nodes. Remember S2, which was activated manually on I1 above (not via the Default Services section)? It will not be activated in the I2 application instance.

Similarly, I can change the Application Type Version in my package, then copy and register the same Application Type with a different Application Type Version on the cluster. This allows me to:

  1. Create Application Instances (isolated in their Code Package instances) from the new Application Type Version.
  2. Upgrade an existing Application Instance to the new Application Type Version.

A Few “Gotchas” to Think About

  1. You will not be able to create partitions with more replicas than cluster nodes, as Service Fabric really doesn't like 2 replicas of the same partition on the same node.
  2. For the same reason, for stateless services you will not be able to set the number of instances higher than the number of cluster nodes.
  3. If a replica fails, the entire Windows process instance is scrapped and recreated (possibly on a different node; if the process contained a Primary, an Active Secondary on another node is promoted to Primary) with all the replicas/instances that were running inside it. In essence, unhandled exceptions are truly evil for your replicas.
  4. Don't assume that certain replicas will exist together in the same Windows process instance. It is good practice not to share process-wide resources among replicas (such as thread pools, static variables, etc.) unless you really understand the implications.
  5. Because all replicas share the process, the GC runs for all of them at once, and certain principles of optimized CLR memory management apply heavily here. It is possible for one replica to hog process resources and affect the performance of other replicas running in the same process.
  6. Although I did not talk about it, you can remove activated services (check the Remove-ServiceFabricService cmdlet) from an Application Instance. This means replicas can be taken down from a running process; hence the practice of not sharing anything process-wide.
  7. Service Fabric Explorer is an excellent tool; however, it groups the display by partition, not by node, which might be confusing if you are trying to learn about these concepts. A combination of SysInternals Process Explorer & DebugView will be needed if you want to drill into this topic. My sample code makes a few OutputDebugString Win32 API calls (via Debug.WriteLine) to clarify the relationship between Windows processes & replicas.
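Gotchas 1 and 2 amount to a simple precondition you can check before creating a service (a toy validation sketch; names hypothetical):

```python
def placement_is_valid(node_count, replica_set_size=None, instance_count=None):
    """Check gotchas 1 and 2: Service Fabric refuses to place two replicas
    (or two stateless instances) of the same partition on one node, so
    neither count may exceed the node count. An instance_count of -1
    means "one instance on every node" and always fits."""
    if replica_set_size is not None and replica_set_size > node_count:
        return False
    if instance_count is not None and instance_count > node_count:
        return False
    return True
```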

Making The Right Isolation Decision

  • For applications that require “data-level” isolation, partitions work really well; however, keep in mind that your code might run in the same process. As an example, the same Application Instance might support multiple departments (within the same organization) using partitions. Keep in mind that the data is still shared on the same disk (as separate files). If you are dealing with an undefined number of departments, you can activate services within the same application as needed.
  • For applications that require “data-level” and “process-level” isolation, use different Application Instances. An example of this is a SaaS application where each customer gets a dedicated application instance; each instance will have its own data and processes. As far as I can tell, this model is used by Azure Event Hubs and Azure DocumentDB.

Factory Based Activation

If you noticed, nothing in the manifest files ties a “Service Type” as a concept to the actual .NET class that implements it. Service Fabric expects your Code Package to register its “Service Types” with their “.NET types” inside your code. This happens as follows (usually in your Main() method):

using (FabricRuntime fabricRuntime = FabricRuntime.Create())
{
    // static registration for both service types
    fabricRuntime.RegisterServiceType("ComputeSvc", typeof(ComputeSvc));
    fabricRuntime.RegisterServiceType("GatewaySvc", typeof(GatewaySvc));
}

The above registers Service Types named “ComputeSvc” and “GatewaySvc” with their respective .NET types (which happen to share the same names).
For those of you who are using dependency injection (or abusing it; I may need to write a post just ranting about that) and want to inject dependencies into your service .NET type instances, you can use factories as follows:

    // factories for both Stateful and Stateless services
    class StatefulSvcFactory : IStatefulServiceFactory
    {
        public IStatefulServiceReplica CreateReplica(string serviceTypeName, Uri serviceName, byte[] initializationData, Guid partitionId, long replicaId)
        {
            Debug.WriteLine("****** Stateful Factory");

            // this factory works only with one type of services.
            // the service Uri gives you the hosting app name, which you can use in SaaS models where the app name maps to data location, resource pool, user identities, etc.
            ComputeSvc svc = new ComputeSvc();

            // if i have a super duper DI i can assign it here (or use the constructor).
            // svc.MySuperDuperDI = CreateDI();

            return svc;

        }
    }

    class StatelessSvcFactory : IStatelessServiceFactory
    {
        public IStatelessServiceInstance CreateInstance(string serviceTypeName, Uri serviceName, byte[] initializationData, Guid partitionId, long instanceId)
        {
            // just like above only stateless
            Debug.WriteLine("****** Stateless Factory");
            return new GatewaySvc();
        }
    }

// inside my Main() method

using (FabricRuntime fabricRuntime = FabricRuntime.Create())
{
    fabricRuntime.RegisterStatefulServiceFactory("ComputeSvc", new StatefulSvcFactory());
    fabricRuntime.RegisterStatelessServiceFactory("GatewaySvc", new StatelessSvcFactory());
}

Instead of explicitly specifying the .NET type, I am passing in a factory which can create an instance of the service (as a replica or as an instance). It also gives me a chance to learn which Application Instance is being used (this can be extended in highly isolated environments such as SaaS).

Running It Step By Step

I have published the sample code used in the discussion above at https://github.com/khenidak/ServiceFabricSeries/tree/master/HostingActivation. You can run the walkthrough using do-it.ps1; if you want to keep your dev machine clean, run un-do-it.ps1 after you are done.

Extras! After you run do-it.ps1, try playing around with the Application Type Version and use the cmdlets to copy, register, and create a new application instance using the new Application Type Version.

Epilog

I would strongly recommend playing around with the concepts above with Process Explorer & DebugView at hand. While most of the time you do not have to worry about the underlying platform details, understanding its internal dynamics is key to a good solution architecture. I hope the above clarified some Service Fabric concepts.

Till Next Time!

Questions / comments

@khnidk