I know… I know, these days we’re mostly talking about being stateless and there are valid reasons for it. But when we’re building Cloud Services where performance is critical, we need to think about state.
Let’s reflect on what it means to be stateful for a moment. Being stateful means that we keep data close to our processes. What it doesn’t mean is that we should forget about persistence. That being said, let’s go over some of the reasons why much of the community is all about statelessness.
- Scalability: A stateless service is easy to scale. It removes many of the challenges associated with load balancing.
- Simplicity: State is hard to synchronize and maintain across instances. By not keeping any state, memory management becomes much easier: once a process completes, it releases the memory it used, leaving nothing behind.
- Availability: It’s easy to spin up a new instance of a service to satisfy demand, because there isn’t much logic required to set up an initial state. Requests can be handled by any instance without introducing side-effects.
These are compelling reasons to build Cloud Services that don’t rely on state. But there’s a price to pay for their simplicity. For one, the size of provisioned resources matters a lot on the cloud, and statelessness is the number one reason why our VMs’ CPUs sit underutilized: storing the application state on remote services like SQL Database, Windows Azure Storage or MongoDB introduces latency into our processes. It also creates resource contention. At this point, it’s important to note that the contention probably won’t be on the data provider’s side. Remember, VMs have limited resources: bandwidth, RAM, disk and CPU are always limited in capacity.
If our processes are completely stateless, they’ll need to make a huge number of requests to data providers in order to complete each task. Each network communication is bound by the laws of physics. We need to account for network latency, the physical distance a request must travel, the network topology, the time it takes for the data provider to respond, the time it takes to serialize the response and the time it takes to deserialize the response. This is a high-level picture, but it gives us a good mental model of the process. Imagine going through this dance 100 times per task. Each network-bound operation can add anywhere from 4 to 200 milliseconds to the time it takes to complete a task. At the high end, this accumulated latency alone translates to an execution time of roughly 20 seconds per task.
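As a back-of-the-envelope sketch of that math (the call count and the worst-case 200 ms figure are the assumed numbers from above):

```python
# Rough latency accumulation for a fully stateless task (assumed numbers).
NETWORK_CALLS_PER_TASK = 100    # remote reads needed to complete one task
LATENCY_PER_CALL_MS = 200       # worst case of the 4-200 ms range

total_ms = NETWORK_CALLS_PER_TASK * LATENCY_PER_CALL_MS
print(f"Accumulated latency per task: {total_ms / 1000:.0f} seconds")  # 20 seconds
```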
Let’s take this a step further, if every process takes 20 seconds and each VM instance is capable of executing 4 processes in parallel, it means that our Cloud Service would have a processing throughput of 12 tasks per minute.
In .NET, the default is 2 threads per CPU. Applied to our example, that means we need 2 CPUs per VM in order to execute 4 processes in parallel. In the Windows Azure world, this translates to a Medium (A2) VM that costs approximately $120/month (as of February 2014; for up-to-date pricing please visit windowsazure.com). Doubling the service’s throughput means doubling its cost of operation, or rewriting it so that each task executes faster. Let’s face it, 24 tasks per minute for $240/month isn’t mind-blowing…
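The throughput figure can be sketched out the same way (task time, thread count and pricing are the assumptions from the paragraphs above):

```python
# Throughput of one Medium (A2) VM under the assumptions above.
SECONDS_PER_TASK = 20           # from the accumulated-latency estimate
PARALLEL_TASKS_PER_VM = 4       # 2 CPUs x 2 threads per CPU (.NET default)
VM_COST_PER_MONTH = 120         # approximate Medium (A2) price, Feb 2014

throughput_per_minute = (60 / SECONDS_PER_TASK) * PARALLEL_TASKS_PER_VM
print(throughput_per_minute)                             # 12.0 tasks per minute
print(2 * throughput_per_minute, 2 * VM_COST_PER_MONTH)  # doubled: 24.0 tasks/min, $240
```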
Getting more bang for each dollar spent is all about details. In this case, we have a few options. The first and cheapest to set up is caching. Caching brings our data into RAM, removing the initial read time from cold storage. Furthermore, caching query results reduces the overall execution time by removing the time required to execute queries and shape the results for our applications. The second option, which should be combined with caching, is to change our implementation so that our service becomes stateful. This brings its own variety of challenges, but when you look at it from a numbers perspective, it makes sense. By bringing data closer to the process, we drastically reduce the latency caused by communicating with external services.
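A minimal read-through cache can look like the sketch below. This is an illustration, not a production cache; `fetch` stands in for whatever remote call (SQL Database, Windows Azure Storage, MongoDB…) would otherwise run on every request:

```python
import time

class TtlCache:
    """Tiny read-through cache sketch with per-entry expiry."""

    def __init__(self, ttl_seconds=60):
        self.ttl = ttl_seconds
        self.entries = {}  # key -> (value, expiry timestamp)

    def get(self, key, fetch):
        entry = self.entries.get(key)
        now = time.monotonic()
        if entry is not None and entry[1] > now:
            return entry[0]                        # hit: served from RAM, no network hop
        value = fetch(key)                         # miss: pay the remote latency once
        self.entries[key] = (value, now + self.ttl)
        return value
```

The first `get` for a key pays the full network round trip; every call within the TTL window is served from memory, which is exactly where the latency savings come from.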
The goal behind creating stateful Cloud Services is to consciously move away from a database-centric / data-store-centric / data-at-rest-centric model. When state resides in memory, it can still be persisted elsewhere for durability. Virtual Machines are great examples of stateful services that leverage durable persistence.
Stateful services remember their state between consecutive invocations. Therefore, we don’t need to reconstruct the service’s internal state every time it needs to perform an action.
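As a sketch of what remembering state between invocations means in code (the running-total domain is hypothetical, purely for illustration):

```python
class RunningTotalService:
    """Stateful worker sketch: keeps an aggregate in memory between calls."""

    def __init__(self, snapshot=0):
        # State is restored from durable persistence at startup, not rebuilt
        # from scratch by replaying every past request.
        self.total = snapshot

    def record(self, amount):
        self.total += amount     # no remote read needed to know the current total
        return self.total

    def snapshot(self):
        return self.total        # value to persist elsewhere for durability
```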
By employing techniques like caching, making the service stateful, and reducing the number of network requests, we could potentially reach a sub-second processing time for each task. Achieving this would bring our overall throughput from 12 tasks/minute to 240 tasks/minute. That’s twenty times the original throughput! Achieving the same throughput by adding more instances of the original Cloud Service would cost approximately $2,400/month.
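Spelled out with the same assumptions as before (a 1-second task time stands in for "sub-second"):

```python
# Improved scenario: roughly 1 s per task, same Medium (A2) VM.
SECONDS_PER_TASK = 1
PARALLEL_TASKS_PER_VM = 4
VM_COST_PER_MONTH = 120

improved_throughput = (60 / SECONDS_PER_TASK) * PARALLEL_TASKS_PER_VM  # tasks/minute
speedup = improved_throughput / 12               # vs. the original 12 tasks/minute
scale_out_cost = speedup * VM_COST_PER_MONTH     # cost of matching it by adding VMs
print(improved_throughput, speedup, scale_out_cost)  # 240.0 20.0 2400.0
```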
It’s important to strike the right balance between statefulness and statelessness. Front-facing API services should be stateless with lots of caching, because each transaction is short and contained. Backend services, often hidden behind queues, should be stateful, because we want these to be as efficient as possible. Our goal is to exploit the provisioned resources to their limits. Scaling out can be challenging and must be considered before attempting any scale-up operations. In the end, we will favor larger, context-ful (bulky) messages over many smaller, frequent messages in order to reduce network-induced latency. We will also favor heavy caching and accept the added complexity of state management. Furthermore, we will work with stateful in-memory models to take advantage of the available resources, and implement mechanisms to leverage durable persistence.
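The preference for bulky messages can be sketched with a toy overhead model (the 50 ms per-request figure is an assumption, not a measurement):

```python
# Toy model: a fixed per-request network overhead dominates small messages.
PER_REQUEST_OVERHEAD_MS = 50

def send_individually(items):
    return len(items) * PER_REQUEST_OVERHEAD_MS   # one round trip per item

def send_batched(items):
    return PER_REQUEST_OVERHEAD_MS                # one round trip for the whole batch

items = list(range(100))
print(send_individually(items))  # 5000 ms of network overhead
print(send_batched(items))       # 50 ms of network overhead
```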
Building a stateful service also lets us build rich, flexible models and promotes better boundaries between our services. This sort of service requires a little more design effort, but it pays off.