Using Service Fabric Partitioning to Concurrently Receive From Event Hub Partitions

Prologue

The classic tale: you have an Azure Event Hub with a number of partitions, and you run your receiver process in a singleton Azure Web Job or a single-instance Azure Worker Role cloud service; both will use the “Event Processor Host” to pump events into your process. The host externalizes offset state storage to Azure Blob Storage (to survive processor crashes and restarts). This works fine, yet it has the following problem: you cannot scale it out easily. You have to create more separate cloud services or web jobs, each handling some of your Event Hub partitions, which complicates your upgrade and deployment processes.

This post is about solving this problem using a different approach.

Service Fabric Azure Event Hub Listener

Azure Service Fabric partitions are, in essence, logical processes (for the purists, I am talking about primary replicas). They can each exist separately on a server, or coexist on a node, depending on how big your cluster is and your placement approach.

Using this fundamental knowledge we can create a service that receives events from Azure Event Hub as follows:

  • If the number of Service Fabric partitions is less than the number of Event Hub partitions, then each Service Fabric partition should receive from multiple Event Hub partitions at once.
  • If the number of Service Fabric partitions is equal to the number of Event Hub partitions, then each Service Fabric partition should receive from one Event Hub partition.

Because we have absolute control over how the above mapping is done, we can also add a 1:1 mapping (irrespective of the number of partitions on either side), or even custom logic where some Service Fabric service partitions do the receiving while the rest do something else.
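
For illustration, here is a minimal sketch of how such a mapping could be computed. The method and parameter names are mine, not necessarily the repo's, and it assumes 0-based service partition ordinals and at least as many Event Hub partitions as Service Fabric partitions:

using System.Collections.Generic;

// Hypothetical helper: decide which Event Hub partitions this Service Fabric
// partition should pump from, using a simple modulo assignment.
static IEnumerable<string> AssignEventHubPartitions(
    int servicePartitionOrdinal, int servicePartitionCount, int eventHubPartitionCount)
{
    for (int eventHubPartition = 0; eventHubPartition < eventHubPartitionCount; eventHubPartition++)
        if (eventHubPartition % servicePartitionCount == servicePartitionOrdinal)
            yield return eventHubPartition.ToString();
}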

The second piece of the new tale is state. Service Fabric has its own state via stateful services, so I can save a few network I/Os by using them to store the state – reliably – inside the cluster via a Reliable Dictionary. The state management for our receiver is pluggable, meaning you can replace it with whatever custom storage you want (one of the scenarios I had in mind is running the listener in a stateless service and using another backing stateful service).
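
As a rough sketch of what such a pluggable store could look like (the interface name and members here are illustrative, not the repo's actual contract):

using System.Threading.Tasks;

// Hypothetical checkpoint store contract; a Reliable Dictionary-backed implementation
// keeps offsets inside the cluster, while a custom one could use any external storage.
public interface ICheckpointStore
{
    Task<string> GetOffsetAsync(string eventHubPartitionId);
    Task SaveOffsetAsync(string eventHubPartitionId, string offset);
}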

The third piece of the new tale is usage. The Event Processor Host is great, but it requires a few interfaces to be implemented (because it is a general purpose component). Ours is a highly specialized, single purpose component, hence we can simplify the entire process to implementing just one interface (one that represents what happens to received events and when to store the offset).
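
To give a feel for how small the surface area can be, a handler interface in this spirit might look like the following (illustrative names only; check the repo for the real interface):

using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.ServiceBus.Messaging; // EventData

// Hypothetical single-purpose handler: receive a batch of events and tell the
// listener whether the current offset should be persisted.
public interface IEventDataHandler
{
    Task HandleEventsAsync(string eventHubPartitionId, IEnumerable<EventData> events);
    Task<bool> ShouldCheckpointAsync(string eventHubPartitionId);
}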

The last piece of our new tale is that the entire thing is wrapped in a Service Fabric ICommunicationListener implementation, allowing you to use it just like any other Service Fabric listener.
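
Wiring it up then looks like any other listener. A sketch, where EventHubCommunicationListener and MyEventHandler are hypothetical stand-ins for the types in the repo:

using System.Collections.Generic;
using Microsoft.ServiceFabric.Services.Communication.Runtime;

protected override IEnumerable<ServiceReplicaListener> CreateServiceReplicaListeners()
{
    // The Event Hub listener plugs in exactly like an HTTP or remoting listener.
    return new[]
    {
        new ServiceReplicaListener(ctx =>
            new EventHubCommunicationListener(ctx, this.StateManager, new MyEventHandler()))
    };
}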

The code for the above is on github here https://github.com/khenidak/service-fabric-eventhub-listener

till next time

@khnidk

Azure for SaaS Providers

So I kicked off yet another side project. Azure for SaaS Providers is a project that aims to build reference literature for those who are using Azure to build SaaS solutions. This is a long-term project (probably ending by Feb 2016), and I will be adding to this content frequently.

While the primary target audience is SaaS providers, the information can also be useful for solutions that need to support massive scale, and the core concepts can be used to support private cloud hosting or even different cloud providers.

The content is in OneNote format, hosted here: http://1drv.ms/1Ni2geR (no, you don’t need to install OneNote).

Below is the content introduction page for a quick read:

What is this?

This is a collection of literature and reference materials to help those planning to use, or currently using, Azure to build or host SaaS solutions. While the primary audience is those who plan to use Azure in a SaaS context, there is nothing stopping you from using this for other applications, specifically applications with high scale requirements: typically applications that require partitioning, isolation, and containerization.

Why?

Frankly speaking, because there is not a lot of literature out there covering this topic, and what does cover it is either too narrow, too old, or both. The purpose is to free SaaS vendors’ developers to build more great features and fewer platform components (for architects, this represents a wealth of options to consider while thinking about your SaaS solutions running on Azure).

What should I expect here?

Design options with tradeoffs, diagrams, code samples, code components and reference to external sources.

Why use this format?

  1. SaaS vendors come in all shapes and colors, and there is no one solution that fits all. Hence the recipes approach: you mix, match, and modify recipes as you see fit.
  2. Microsoft Azure is currently (and will always be) in a fluid state. New services/components will be added and existing ones will be updated. This format allows us to revisit each recipe without having to change a lot of other recipes.
  3. OneNote is easy to use, free, and has a web frontend that can be viewed irrespective of the device you choose to view this content on.

What if I want to contribute?

Please do reach out and let us have a discussion on which areas you can cover.

Questions/comments

@khnidk

 

Azure Service Fabric: On State Management

Prologue

Making the right design decision is a mix of experience, knowledge, and a bit of art. The phrase “right design decision” in itself requires the understanding that a design is not an isolated abstraction; it has a context that spans the requirements and the environment all the way to the people who are actually about to write the software.

Service Fabric employs a complex set of procedures that ensure your data is consistent and resilient (i.e. if a cluster node goes down).

Understanding (knowing) what goes on under the hood of your application platform is key to making better design decisions. Today I want to attempt to describe how state is managed in Service Fabric, the different choices you have, and their implications on the performance and availability of your Service Fabric application.

I strongly recommend going through Service Fabric: Partitions before reading this one.

Replication Process & Copy Process

Service Fabric employs two distinct processes to ensure consistency across a replica set (the set of replicas in a given partition of a stateful service).

Replication Process: used when writing state data in your service.

Copy Process: used to build a new replica when an existing Active Secondary fails. Also used to recover an Active Secondary that was down for a while.

Both processes are described below.

Replication Process

So your Primary goes online, starts listening to requests, and is about to write data using one of the reliable collection classes. Here is a breakdown of events:

  1. The Primary wraps the write operation (which can be an add, update, or delete) and puts it in a queue. This operation is synchronized with a transaction. At the other end of the queue your Active Secondary(s) pump out the operations, as described in the next step. No data has been written yet to the Primary data store.
  2. Active Secondary(s) pick up the operation(s), apply them locally, and acknowledge the Replicator one by one.
  3. The Replicator delivers a callback to the Primary when the write has succeeded.
  4. The Primary then applies the data locally and returns to the original caller.

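In service code, the sequence above is hidden behind a transaction commit; a minimal sketch, assuming a reliable dictionary named "counters":

using Microsoft.ServiceFabric.Data;
using Microsoft.ServiceFabric.Data.Collections;

var counters = await this.StateManager
    .GetOrAddAsync<IReliableDictionary<string, long>>("counters");

using (ITransaction tx = this.StateManager.CreateTransaction())
{
    await counters.AddOrUpdateAsync(tx, "events-received", 1, (key, current) => current + 1);
    // CommitAsync does not complete until the operation has been acknowledged by a quorum.
    await tx.CommitAsync();
}
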
Diving Even Deeper

  • The replicator will not acknowledge the Primary with a successful write until the operation has been “quorum written”. A quorum is floor(n/2)+1, where n is the number of replicas in the set; as in, more than half of the replicas in the set have successfully applied the write (for example, with 5 replicas the quorum is 3).
  • Each operation is logically sequenced. The replication process is out of band (from the perspective of the Primary); at any point in time the Primary maintains an internal list that indicates operations waiting to be sent to the replicator, operations sent to the replicator and not yet acked, and operations sent to the replicator and acked.
  • There is no relation between the data structure semantics (queue vs dictionary) and the actual storage/replication. All data are replicated and stored the same way.
  • Given the following pseudo code

Create transaction TX
Update dictionary D (set key K to value “1”)
Commit TX
Create another transaction TX1
Read dictionary D (key K)

What value should you expect when you read the dictionary the second time? Most probably whatever value K had before updating it, because changes are not applied locally until they have been replicated to the quorum.

Let us bend our minds even more

  • Active Secondary(s) keep pumping data out of the replication queue. They don’t depend on the replicator to notify them of new data, so different Active Secondary(s) can be at different stages of replication. In a perfect case scenario, a quorum is aligned with the Primary. This affects how you do your reads, as we will describe later.
  • Service Fabric maintains – sort of – a configuration version for each replica set. This version changes as the formation of the replica set changes (i.e. which replica is currently the Primary). This information is communicated to the replicas.
  • You can run into a situation where an Active Secondary is ahead of the Primary: for example, a Primary failed to apply changes locally before dying and did not notify the caller of successful writes. Or worse, Service Fabric lost the write quorum (for example, many nodes of the cluster failed at the same time) and the surviving replicas (one of which will be elected Primary) are the slow ones that were not catching up. The details of how this is recovered and communicated to your code (and the control you will have over recovery) are part of what the Service Fabric team is working on. Rolling back, loss acceptance, and restoring from backup are all techniques that can be used for this.

What About My Reads?

The default behavior is to direct reads to the Primary replica to ensure consistent reads (i.e. reads from the latest, most trusted version of the data). However, you can enable listening on your Secondary(s) and redirect reads to them (more on how to here).
As you can tell from the above discussion, reads from Secondary(s) can be out-of-sync reads, or, a better expression, ghost data. The current service interfaces do not allow you to know how far ahead or behind the current replica is relative to the Primary.
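
Enabling reads on secondaries is an opt-in on the listener itself; for example (MyHttpListener here is a placeholder for your own ICommunicationListener):

using System.Collections.Generic;
using Microsoft.ServiceFabric.Services.Communication.Runtime;

protected override IEnumerable<ServiceReplicaListener> CreateServiceReplicaListeners()
{
    return new[]
    {
        // listenOnSecondary opens this endpoint on Active Secondary(s) as well,
        // so clients can be routed there for (possibly stale) reads.
        new ServiceReplicaListener(ctx => new MyHttpListener(ctx),
                                   name: "readEndpoint",
                                   listenOnSecondary: true)
    };
}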

The Copy Process

The scenarios above describe what happens when writes flow from the Primary to the Active Secondary(s). But what happens when Service Fabric needs to bring a Secondary up to speed (i.e. synchronize it)? The Secondary in question can be new (i.e. Service Fabric is replacing a dead one on a dead node, or moving an existing one to a new node due to resource rebalancing). Obviously, using the replication mechanism for this is overkill. Instead, Service Fabric lets the Secondary communicate its marker (the last sequence number it applied internally) to the Primary; the Primary can then send back the delta operations for the Secondary to apply. Once the Secondary is brought up to speed, it follows the normal replication process.

The Fabric Replicator

Up until this moment we have talked about a “Replicator” without further details. Service Fabric comes with the “Fabric Replicator” as the default replicator; this is surfaced to you partially in the “StateManager” member of your stateful service, which holds a reference to the replicator. If you tinker inside the code you will notice that there is a process that couples your service to a state provider/manager (via the IStatefulServiceReplica interface), meaning that in theory you can build your own replicator, which, as you can tell, is quite a complex undertaking.

So What Does This Mean For Me?

Below are various discussion points on multiple design aspects when it comes to replicas/replica sets.

How Fast are My Writes?

You have to understand that the fastest a write can complete is S/A + time-on-slowest-network (R) + D/A + Ack, where:
  • S/A is the time of serialization & the local calls to the replicator at the Primary.
  • Time-on-slowest-network (R) is the network time to the slowest replica that counts toward the quorum (or the slowest of the fastest responders, i.e. the most up-to-date Secondary(s)).
  • D/A is the time of deserialization & the calls to the replicator at the Secondary.
  • Ack is the acknowledgement time on the way back.
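
To make that concrete with purely made-up, illustrative numbers: if serialization and the local replicator calls cost 0.5 ms, the slowest quorum member is 2 ms away on the network, deserialization and apply on that Secondary cost another 0.5 ms, and the acknowledgement takes 1 ms to come back, then the best case for a single replicated write is roughly 4 ms, no matter how fast the Primary's local apply is.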

Large # of Replicas vs Small # of Replicas

It is really tempting to go with a large number of replicas, but keep in mind the “How fast are my writes?” discussion above and the fact that the write quorum is floor(r/2)+1. The larger your replica set, the larger your quorum, and the slower your writes will be. I would go with the smallest possible replica set per partition.

But Really, When Should I Go With a Larger # of Replicas in My Replica Set?

If you have a system where reads far outnumber writes, and the impact of ghost reads is negligible to your business, then it might make sense to go with a larger number of replicas (and allow reads on Secondary(s)).
As a side note, the replica listening address resolution mechanism (more here) does not keep track of your calls, so you probably want to do simple load balancing yourself (for example, round robin across all secondary replicas in a partition), as sketched below.
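
A sketch of what that client-side round robin could look like against an Int64-partitioned service (the service URI, partition key, and counter handling are illustrative):

using System;
using System.Fabric;
using System.Linq;
using System.Threading;
using Microsoft.ServiceFabric.Services.Client;

// roundRobinCounter is a long field shared across calls on the client.
var resolver = ServicePartitionResolver.GetDefault();
ResolvedServicePartition partition = await resolver.ResolveAsync(
    new Uri("fabric:/MyApp/MyStatefulService"),
    new ServicePartitionKey(42),            // Int64 key that maps to the target partition
    CancellationToken.None);

// Resolution returns all replica endpoints; picking one is up to you.
var secondaries = partition.Endpoints
    .Where(e => e.Role == ServiceEndpointRole.StatefulSecondary)
    .ToList();
var next = secondaries[(int)(Interlocked.Increment(ref roundRobinCounter) % secondaries.Count)];
// next.Address holds the listener address to call.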

Large Objects (Data Fragment) vs. Small Objects

How big should your objects (those entries in reliable dictionaries and queues) be? Well, consider the following: the sequential nature of replication, and the fact that replicas live on different machines (network overhead). You will end up with “smaller is better”.
A data fragment that is 500 KB big is not only slow to replicate to the quorum; it also slows down the line of pending replications to replicas (trying to catch up). On top of that, it slows the Copy Process, hence slowing down the recovery and resource rebalancing processes.
Consider techniques for breaking your data down into smaller chunks (writes) and writing them in one transaction, as sketched below.
Also note that reliable collections are kept in memory (on the Primary and Active Secondary(s)) as well (that is how the data structure semantics, queue vs dictionary, are maintained on top of the replicator). That means your entire state lives in the memory of your cluster nodes.
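
A sketch of that chunking idea, assuming an IReliableDictionary<string, byte[]> named chunkedData, a list of byte[] pieces named chunks, and a documentId of your choosing:

using System;
using Microsoft.ServiceFabric.Data;

using (ITransaction tx = this.StateManager.CreateTransaction())
{
    // Write many small values instead of one large one.
    for (int i = 0; i < chunks.Count; i++)
        await chunkedData.SetAsync(tx, $"{documentId}/chunk/{i}", chunks[i]);

    // Store how many chunks make up the document so readers can reassemble it.
    await chunkedData.SetAsync(tx, $"{documentId}/chunkCount", BitConverter.GetBytes(chunks.Count));

    // One atomic commit, but every replicated operation stays small.
    await tx.CommitAsync();
}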

Epilogue

As you can tell, the mechanics of Service Fabric internals are quite complex, yet an understanding of how it works internally is important to building high-performance Service Fabric applications.

till next time

@khnidk