The Case on Latency, Fairness & Throughput for Connected Clients


The rant below is some of the challenges and possible solutions that you will encounter when developing a server side application that offer services to connected clients. While I will talk mostly about state-ful/session connection (as in TCP/IP sockets and the likes) the below to a high degree also applies on session-less connection (such as REST/Web API type of implantation).

A couple of things we need to note before going forward. Performance & Scale are different things. Performance is how fast you can respond to a request. Scale is how many of those can you respond to concurrently. They are related because typically at large number of concurrent connections & requests, your response times tend to increase.

For the sake of discussion, we will assume that we are building a game server that powers an action game such as Halo or the likes.

What is Under the Hood? Threads, Thread Pools and .NET’s TPL

So clients’ connections come in, in your code you need to assign compute resource to respond to connection’s requests. Irrespective of Windows, Linux or even other systems. There will be a message loop in place for each client that picks fragments of (typically a byte array) out of the wire buffer into your code. Each client gets a copy of that loop (or one loop goes through the clients). Fragments are picked, routed into your code, some execution happens, then a response is sent to the original client.

    • Threads: Each client connection gets a thread, the thread will perform the message pump for that particular client and you are done. As the client connects the thread is created, as the client disconnects the thread is destroyed. Easy? yes. too easy that it makes questionable. Here is why

Threads are an expensive compute resource, you don’t have a lot of them per process (obviously depends on your CPU/MEM and how big your kernel memory is, since it carries the handle table including thread handles). The maximum number of connected clients will be tied by the maximum number of your threads you can create in-process (and this usually tend to be small number).

The other problem is, your clients will come in all shapes and colors, some will be active, some will not be very active/idle. Not very active clients will have their idle threads (which executes just the message loop against empty buffer) eating away your compute resources.

It cannot be all that wrong, right? yes. Threads are really good if you are trying to achieve the fastest possible response times for an expected - better, fixed - number of clients. example: 2 servers sharing data (backup servers, warm stand by ones etc).  But they are problematic if you want to support the maximum number of connected clients, fairly (more on this later).

Best way to think about it if your topology is more like snowflake (few clients connected to each server, each server is connected to one or more servers) then threads are better, if you are doing a star like topology they are not.

  • Thread Pools (and .NET TPL): Because .NET TPL is built on top of Thread Pool we will just group the 2 approaches in one discussion. Here is how it works. As connections come in, you create the connection object and then string the pull (aka BCL’s/.NET Socket.Receive or win32 recv function) into a continues call where one call upon finishing queue the next call in thread pool via direct call or via await construct inside some loop). This will scale well. Resources are distributed well on connected clients (and are not mapped or dedicated to them). 

On Fairness Across Conneted Clients

Are We Fair?

Fair is when all connected clients’ requests are treated equally, irrespective of how active they are.

The thread per connected client approach, is fair/er because each client gets a dedicated thread. The OS Kernel will ensure equal time scheduled per threads. It does not scale well but it is fair. The thread pool approach while scalable it is not fair and will yield skewed performance numbers. Some clients will get faster responses others won’t.

Here is why (hint: The keyword here is “queue”):

Thread pools offer a queue (one and only one) where all your work items are queued, threads on the other end of the queue pick them up one by one and execute them, with a call back to notify completion (or the await .NET construct in TPL). Active clients will be served faster than *not very* active clients because they will have more work items in that queue. You might have heard ”The harder you hit the faster it will respond across all your calls” behavior in some of cloud PaaS services in Aws or Azure. It is largely attributed to this. Everything is placed in queue irrespective of fairness to connected clients.

Additionally, it has that phantom lock/release behavior. Consider this, One active client with 20 not very active clients on 20 thread thread-pool. There will be a condition where all threads in the thread pool are busy executing read from inactive clients while we have one client ready with a request waiting at the end of the queue. You will see this as idle CPU + longer response time then busy CPU with less response times even when the total number of request did not change.

Typically, we solve these problems by using very short timeout on the receive call. Or we use smaller I/O frames. Those solutions will not stop condition above from happening, but they will resolve it much faster than longer timeout on receive calls.

Do I Need Fairness Across Connected Clients?

If the situation above applies on the server side application, you are developing then start by asking yourself. Does it really worth it? The answer really is in what you promised your clients, if the maximum response time (for the most unfairly treated connected client) is within your SLA then you don’t need fairness. Ensuring that you are within SLA is a bit of art and a bit science (also known as performance testing). As you can tell you will have to test multiple load patterns (not just stress the system until it fails).

As an analogy, this is a lot like waiting for few minutes (minimum standard wait time) to get your espresso because the shop is full of other clients (who are not ordering anything). Some applications with super small latency cannot just afford that, gaming is one of them.

Where Can I Apply Fairness?

If you are reading thus far I assume you still need fairness and you are looking for possible solutions. Let us start. A typical application in this context perform 3 major things

  • Receive Requests
  • Execute Requests (i.e. route data fragments into your code)
  • Sends Response.

Each of this area (can/will) need fairness, depending on your requirements. I am obviously assuming that we will go with the thread pool route, not the thread per connected client route.

Using Time Slices

Each execution unit (thread pool work item) will get a specified amount of time that it has to be allowed to execute in. if it does not then it should time out. It is important that this is not razor sharp time allocation, it will vary as we will discuss it. This is fairly simple to implement; the complexity is to leave in-memory objects in a non-corrupt state. Some APIs are easier than others when it comes to that. For example, the Socket.Receive can be called with a timeout when it returns copy the received bytes into an array outside the about-to-be-terminated-task scope. Some will require addition work; for example, consider ExecuteRequestAsync below (as sample of #2 above)

async Task ExecuteRequstAsync(byte[] frame)


// do something that takes along time.


Such an opaque method is very hard to put a timeout on, because if you terminate the call you might be in a situation where your data structures are in a corrupt state.

A better approach is:

Task ExecuteRequstAsync(byte[] frame, CancellationToken ct)


// do step 0: execute any previous uncompleted work.

if (ct.IsCancellationRequested)


// retain state for uncomplete work (maybe an external queue)

return null; // don’t throw exceptions because from the perspective of the caller, the call did succeed.


// do something step 1

if (ct.IsCancellationRequested)


// retain state for uncomplete work (maybe an external queue)

return null; // don’t throw exceptions because from the perspective of the caller, the call did succeed.


// do step 2

if (ct.IsCancellationRequested)


// retain state for uncomplete work (maybe an external queue)

return null; // don’t throw exceptions because from the perspective of the caller, the call did succeed.


// do step 3

return null;


And here is how I can call it

ExecuteRequstAsync(data, new CancellationTokenSource(timeoutInMilisecond).token); //Token is created to be canceled after timeout

This way the method choses when and how it can return (leaving the in memory objects in a non-corrupt state). Most of the execution will use a little more time than the allocated timeout, yet overall it is easier to implement in safe fashion. As you do your checks for IsCancelRequested you can sign off the rest of the method to another Task (or use Task.ContinueWith calling another async method) which thread pool will put at the end of the queue (keep in mind that does not mean in-order execusion). The most difficult challenge is to ensure that ExecuteRequestAsync is called accumulatively (i.e. all state changes are applied in accumulative in-order fashion).

An alternative to this, is differed execution approach. A deferred execution is represented by an object that has a queue, and an execution loop that, instead of calling ExecuteRequestAsync directly you en-queue the call to the deferred execution. Each connected client is represented by a deferred execution instance timeout can still be used with retry (assuming that ExecuteRequestAsync is idempotent).

Using A Quantum

So far we refer to work items and execution as roughly the same thing. This approach depends on separating them. A work item is a task that connected client need to execute (i.e. one of the receive, execute, or send tasks). A quantum represents the number of work items that you will execute for a certain connected client before moving to the next. This assumes that all work items are executed in roughly the same time. This fairness model depends on having a queue per connected client (outside the queue which the thread pool controls). The easiest way to do this in .NET world is to implement your own TaskScheduler. For native you will have to implement queue and the scheduler from scratch. I have implemented the same pattern in .NET for a web socket server (that ensures fairness on the send side only) here I also ranted about this here

Epilog – Fixed Time Slice and Fixed Quantum are Bad

In a typical solution like the above you will start with a value (for either this or that). And then improve on them. If you see too many timeouts reported on your tasks (or the number of remaining tasks in queue for the second approach is too high) then you need to increase the size of your value. Remember that a time out or (quantum expiry) is somehow like context switching that require dropping/acquiring locks, unwinding stacks you don’t too many of these happening in your process it consumes CPU and does not directly contribute to responding to connected clients. If you are feeling brave you can use a watch dog that dynamically adjust the value in the runtime and/or per connected client.

till next time @khnidk