Businesses need to be agile to compete in today’s global economy. Programmers use various tools and techniques in order to meet this business requirement. The challenge is great and quite complex. Going too fast without the right approach can lead to ephemeral success.
I believe that Microservices give us the agility and architectural patterns that empower us to scale and create value at a far greater pace for the business compared to using a traditional tiered architectures approach.
Forget about 3-tier architectures, they just doesn’t scale. Stateless services need to rebuild their internal state for every call, and they can generate tremendous pressure on data stores. Consequently, this generates back pressure that bubbles up through the layers of our solution and reaches out to the edge. Back pressure then translates into unavailable services. The key is Data Locality and Stateful Services.
The Microservice architectural style is an approach to developing a single application as a suite of small services, each running in its own process and communicating with lightweight mechanisms, often an HTTP resource API. These services are built around business capabilities and independently deployable by fully automated deployment machinery. There is a bare minimum of centralized management of these services, which may be written in different programming languages and use different data storage technologies… read more
Getting to Know Actors
Today’s Internet-scale services are built using Microservices. Service Fabric is a next-generation middleware platform used for building enterprise-class, Tier-1 services. This Microservices platform allows us to build scalable, highly available, reliable, and easy to manage solutions. It addresses the significant challenges in developing and managing stateful services. The Reliable Actors API is one of two high-level frameworks provided by Service Fabric, and it is based on the Actor pattern. This API gives us an asynchronous, single-threaded programming model that simplifies our code while still providing the advantages of scalability and reliability guarantees offered by Service Fabric.
In this post, we will run through the what, when and how of various aspects of Service Fabric and Reliable Actors.
What Are Actors
Actors are isolated, independent units of logic and state within a single-threaded execution model. Concurrency is achieved by having many actors executing simultaneously and more specifically independent from each other.
When to Use Actors
Consider Actors when the domain involves a large number of small, independent, and isolated units of state and logic. Best practices around Actors include reduced interaction with external components, including querying state across multiple Actors. Use Actors for low latency scenarios that can be modeled with practices like Domain Driven Development (DDD).
Actors are a good fit for
- Highly available services
- Scalable services
- Computation on nonstatic data
- Session-based interactive applications
- Distributed graph processing
- Data analytics and workflows
Actors are NOT a good fit for
- Distributed transactions
- Control over concurrency, partitioning and communication
Actors provide us with flexibility when it comes to designing our solutions. Through the use of Domain Driven Development (DDD) and other similar practices, it is possible for us to leverage the business to build Actors and drive their interactions.
Domain-driven design (DDD) is an approach to developing software for complex needs by deeply connecting the implementation to an evolving model of the core business concepts.
The most significant complexity of an app or service is usually not technical. It is in the domain itself, the activity or business of the user. This is why I like Actors. They allow me to model real-world scenarios and to communicate with the business. By breaking down the problem space into multiple Actors, we are able to benefit from independent deployability and scalability. Focusing on a specific challenge allows, us to think of an Actor as a Microservice. And by leveraging many Microservices to build our solutions, we are able to reduce complexity on many levels. For example, by segregating data, we force external entities to interact with data through an Actor’s interface. Consequently, we benefit from lower accidental coupling between services. To put this in perspective, think of a database that you’ve worked on recently. How many different pieces of code had dependencies on the same table and fields? If you changed this highly responsible table, how much of the solution would break? Would it break from the database to the user interface (UI)?
At first, Actors may seem limited by design. That’s because we’ve forgotten about GRASP (a.k.a General Responsibility Assignment Software Principles).
The different patterns and principles used in GRASP are: controller, creator, indirection, information expert, high cohesion, low coupling, polymorphism, protected variations, and pure fabrication.
High cohesion and low coupling, are by far the most important principles when it comes to designing Actors because they save us from anemic domain models. By providing a simple programming model, supported by a platform that enforces single thread execution, Actors make it hard and painful for us to ignore best practices.
Over the years, our craft has refined itself, and we have shared many patterns, principles and practices that help us deal with the complexities that arise in the real-world. SOLID (a.k.a single responsibility, open-closed, Liskov substitution, interface segregation and dependency inversion), GOF (a.k.a Gang of Four Patterns) and Cloud Design Patterns contain proven solutions to known problems. Being familiar with these helps us identify and deal with complexities before they become technical debt, a bug, or worse, a feature.
So, why Actors? Because they give us the ability to be granular in our approach to design. An actor can be an order, a user or an IoT (Internet of Things) device. We can represent physical things or concepts as well as their interactions. By removing much of the complexity related to building highly available distributed systems, Actors allow us to focus on the business and to communicate using business terms. This results in better communication, understanding, and more value for our customers.
Actors on Service Fabric
Because the Actor Service itself is a Reliable Service, all of the application model, lifecycle, packaging, deployment, upgrade, backup/restore and scaling concepts of Reliable Services apply the same way to Actor services.
Reliable Services in Service Fabric is different from services you may have written before. Because Service Fabric provides reliability, availability, consistency, and scalability.
• Reliability – Your service will stay up even in unreliable environments where your machines may fail or hit network issues.
• Availability – Your service will be reachable and responsive. (This doesn’t mean that you can’t have services that can’t be found or reached from outside.)
• Scalability – Services are decoupled from specific hardware, and they can grow or shrink as necessary through the addition or removal of hardware or virtual resources. Services are easily partitioned (especially in the stateful case) to ensure that independent portions of the service can scale and respond to failures independently.
• Consistency – Any information stored in this service can be guaranteed to be consistent.
The way this works, is that Actors are hosted inside Actor Services. Each Actor Service instance is a Partition. And each Service Fabric node can host many Partitions.
Service Fabric automatically distributes Partitions over multiple nodes, resulting in even resource utilization across all nodes. When a node is removed from the cluster, Service Fabric rebalances itself by moving service instances from node to node.
This type of dynamicity usually leads to interesting technical challenges like finding specific Actor instances. Fortunately, Actors have an Actor ID that uniquely identifies then within a cluster.
Reliable Actors are considered stateful because each Actor instance maps to a unique ID. This means that repeated calls to the same Actor ID will be routed to the same Actor instance. This is in contrast to a stateless system in which client calls are not guaranteed to be routed to the same server every time. For this reason, Actor Services are always stateful services.
Actor IDs can be randomly generated or can be Custom ID values (GUIDs, strings or Int64s). Behind the scenes, Actor IDs are hashed to an Int64 which is then used by the Actor Service to distribute Actor Instances uniformly across its Partitions.
When explicitly using an Int64 as an Actor ID, the Int64 is mapped to a Partition without further hashing. This behavior can be leveraged to control Partition placement for Actor instances. NOTE: doing so results in an unbalanced distribution of actors across the Actor Service Partitions.
An Actor instance is found using its Actor ID and its application name. The application name identifies the Service Fabric application that the Actor’s Service instance is deployed. This is important because we can deploy many instances of the same Service Fabric application under different names.
Keep in mind that it is expected that interacting with Actor instances requires network communication, including serialization and deserialization of method call data. This translates to overhead and latency.
The Actor Proxy allows us to communicate with Actor instances. It performs the necessary resolution to locate the Actor instance and opens a communication channel (a.k.a Remoting) with it. It also retries to locate the Actor instance in the case of communication failure and in the case of failovers. As a result, messages delivery is best effort. Due to retry mechanisms, Actors may receive duplicate messages. Since this is possible, it is important to build our Actors with Idempotence in mind.
Idempotence is the property of certain operations in mathematics and computer science, that can be applied multiple times without changing the result beyond the initial application. The concept of idempotence arises in a number of places in abstract algebra and functional programming.
An Actor is defined through its interface. This interface inherits from IActor and defines Task returning methods. At this point, it’s important to note that methods are used to define the types of messages that can be understood and processed by our Actor. Therefore, it is not possible to define properties. Furthermore, methods cannot be overloaded and must not have out, ref or optional parameters. This is due to the fact that an Actor’s interface is used by the Actor Proxy to communicate over Remoting with our Actor instance.
.NET Remoting allows an application to make an object (termed remotable object) available across remoting boundaries, which includes different processes or even different computers connected by a network. It does so, by making a reference of a remotable object available to a client (calling) application, which then instantiates and uses a remotable object as if it were a local object. However, the actual code execution happens at the server-side (remotly). Read more about .NET Framework Remoting.
At the client side, the remoting infrastructure creates a proxy that stands-in as a pseudo-instantiation of the remotable object. It does not implement the functionality of the remotable object, but presents a similar interface. As such, the remoting infrastructure needs to know the public interface of the remotable object beforehand. Any method calls made against the object, including the identity of the method and any parameters passed, are serialized to a byte stream and transferred over a communication protocol-dependent Channel to a recipient proxy object at the server side (“marshalled”), by writing to the Channel’s transport sink. At the server side, the proxy reads the stream off the sink and makes the call to the remotable object on the behalf of the client. The results are serialized and transferred over the sink to the client, where the proxy reads the result and hands it over to the calling application object.
Service Fabric Reliable Actors handles additional complexities like ensuring that Actors are single-threaded. This is done via a turn-based access model, that greatly simplifies our development efforts, because there’s no need for complex synchronization mechanisms when manipulating data.
Service Fabric applications must be designed with special considerations for the single-threaded access nature of each Actor instance.
The Actor runtime enforces turn-base concurrency across methods, timers and callbacks, by acquiring a per-actor lock at the beginning of a turn and by releasing it at the end of the turn. This makes it possible for an Actor instance to become a throughput bottleneck. Instances can also deadlock when there is a circular request between them while an external request is made to these Actor instances concurrently. When this scenario takes place, the Actor runtime automatically times out the calls and throws an exception to the caller. This breaks the possible deadlock. Finally, each Actor instance deals with the caught exception and is now able to receive new calls.
The Actors runtime, by default, allows logical call context-based reentrancy. This allows for actors to be reentrant (processing a new message even if the Actor instance is locked) if they are in the same call context chain.
As part of the message processing, if Actor 3 calls Actor 1, the message is reentrant, so it will be allowed. Any other messages that are part of a different call context will be blocked on Actor 1 until it is finished processing.
As mentioned above, the Service Fabric Actor runtime provides concurrency guarantees for method invocations that are done in response to a client request, as well as for a timer and for reminder callbacks. However, if the Actor code directly invokes these methods outside of the mechanisms provided by the Actor runtime, then it cannot provide any concurrency guarantees. For example, a method that is invoked from a Task that is not associated with the Task returned by the Actor methods, or if the method is invoked from a thread that the Actor creates on its own, then the runtime cannot provide concurrency guarantees. Both examples listed above often emerge from trying to background processing. The best practice in this specific scenario is to use Actor Timers and Actor Reminders that respect the Actor’s turn-based concurrency mechanisms.
Service Fabric Actors are virtual, meaning that their lifetime is not tied to their in-memory representations. They do not need to be explicitly created or destroyed because the Actor runtime automatically creates an Actor instance the first time it receives a request for a new Actor ID. It’s important to note that client code cannot pass parameters to the Actor’s constructor because constructors are called implicitly by the Actor runtime.
If an Actor instance is not used for a period of time (default is 1 hour), the Actor runtime garbage-collects the in-memory instance. It will also maintain knowledge of the Actor’s existence so that if it needs to be reactivated at a later time.
When the Actor runtime garbage-collects an in-memory Actor instance, it calls the Actor’s OnDeactivateAsync method. This will unregister all Actor Timers and can be overridden to execute clean-up code.
Actor Garbage collection is configurable and is responsible for cleaning up deactivated Actor instances (idle for a configurable period of time). Keep in mind that it does not remove state stored in the Actor’s State Manager. It scans for deactivated Actor instances on a regular interval (default is 1 minute) and proceeds with the cleanup.
An Actor instance is considered as “being used” when it has received a call from an Actor or a client and when the IRemindable.ReceiveReminderAsync method is invoked (applicable only if the Actor implements Actor Reminders). Actor Timer callbacks are not counted as “being used”.
The Actor’s state outlives the Actor instance’s lifetime when it is stored in the State Manager.
Actors may be constructed in a partially-initialized state. If the Actor instance needs initialization parameters from the client, there is no single entry point and design must take this into consideration. One approach is to leverage the Actor’s OnActivateAsync method to set initial state to known defaults.
The Actor’s State Manager resembles a Dictionary (Key Value) type structure where the key is a string and the values are generic that can be of any type including custom types. Values stored in the State Manager must be Data Contract serializable because they may be transmitted over the network to other nodes during replication and may be written to disk depending on the Actor’s state persistence setting.
Actor State Persistence Settings
• Persisted state – State is persisted to disk and is replicated to 3 or more replicas. This is the most durable state storage option, where state can persist through complete cluster outage.
• Volatile state – State is replicated to 3 or more replicas and only kept in memory. This provides resilience against node failure, Actor failure, and during upgrades and resource balancing. However, state is not persisted to disk, so if all replicas are lost at once, the state is lost as well.
• No persisted state – State is not replicated nor is it written to disk. For Actors that simply don’t need to maintain state reliably.
Each level of persistence is a different State Provider in combination with a Replication Configuration. Whether or not state is written to disk depends on the State Provider (the component in a Reliable Service that stores state) and replication depends on how many replica instances a service has. Just as with Reliable Services, both the State Provider and replica count can easily be set through configurations. The Actor framework provides an attribute, that, when used on an actor will automatically select a default state provider and auto-generate settings for replica count to achieve one of the three persistence settings.
When working with the State Manager, it’s important to note that its methods are all asynchronous as they may require disk I/O when actors have persisted state. Upon first access, state objects are cached in memory. Repeat access operations access objects directly from memory and return synchronously without incurring disk I/O or asynchronous context switching overhead.
State can be retrieved using a standard Get or TryGet operations. Then modifying the state in local memory alone does not cause it to be replicated by the State Manager. The state must be re-inserted using the State Manager’s Set operation for it to be handled according to the Actor’s State Persistence Setting. Once the Actor method returns, the State Manager handles added and modified values. If the save event fails, the modified state is discarded and the original state reloaded.
By default, the replication security is turned off. But it can be enabled through configuration. Once enabled, services are not able to see each other’s replication traffic, ensuring that the data that is made highly available is also secure.
Traditional tiered applications can be challenging to debug due various layers and dependencies. Service Fabric encourages services to be lightweight by allowing thousands of Actors to be provisioned within a single process, rather than requiring or dedicating entire OS instances to a single instance of a particular service. Now thinking about possible failures, exceptions, and auditing it can seem like an overwhelming task to figure out what’s going on. Fortunately, the team who created Service Fabric thought about this specific scenario and built-in tools and features to make this easy. By providing us with Performance Counters and a multitude of Events exposed through ETW (Event Tracing for Windows).
Event Tracing for Windows (ETW) is an efficient kernel-level tracing facility that lets you log kernel or application-defined events to a log file. You can consume the events in real time or from a log file and use them to debug an application or to determine where performance issues are occurring in the application.
The telemetry produced by these mechanisms flows out to Azure Storage through Azure Diagnostics. By having this information centralized it is possible to trace events that are distributed across multiple nodes. Furthermore, the information is quite rich and ranges from how long a specific Actor method took to execute, exceptions and to the number of Actor calls waiting for an Actor Lock (great for spotting bottlenecks). Visual Studio also taps into this same telemetry during local and remote debug sessions. This tooling allows us to filter (slice and dice) the stream of events so that we can concentrate on what matters most to us. Greatly simplifying analysis, tracing and debugging, the tools allow us to leverage breakpoints and to step through the code. Developers who’ve worked on Winforms, WPF, ASP.Net, UWP or other .NET projects will feel right at home when they hit F5. Visual Studio builds and packages the Actor Service. Then it leverages PowerShell scripts, that are built into the project’s template, to test and deploy to the local Service Fabric Cluster. These came PowerShell scripts can be used to built out a Continuous Integration (CI) and Continuous Delivery (CD) pipeline.
Interactions with Service Fabric are done through PowerShell, .NET or wrapper UIs. This empowers us to automate our activities and integrates well with DevOps practices.
Speaking about DevOps, CI and CD brings us to our next topic. Versioning of Actors on Service Fabric is done through the Application Package, which is composed of 4 independent version strings. An actor Service has a version for the code, for the configuration and for its data. Actor Services are packaged in Application Types that also have a version string. At first, this may seem overengineered but if we take a step back and think about what this means, we quickly notice that it provides us with the ability to control how our Actors are upgraded. By changing the version string on configuration and upgrading our deployments, Service Fabric applies the configuration part of the Service Type and leaves everything else as is. In turn, lowering the risk and impacts of deployments.
Once deployed, Service Fabric uses Health Probes to monitor and validate that the deployed Actors are behaving as expected. If Health Probes report that something has gone wrong, Service Fabric can rollback to a previous version, providing us with zero downtime application upgrades. Learn more about this by reading the introduction to Service Fabric health monitoring.
Sample – Actors on Service Fabric
The above diagram depicts a Live Q&A app I built for my DevTeach presentation (code) about Service Fabric Actors. Since Actors are not directly accessible from outside a Service Fabric Cluster, we need to create a Service layer. This can be a Web API, a WCF service that exposes a publicly accessible endpoint or even custom code accepts TCP connections.
This sample is composed of a .NET WebAPI that translates REST calls to calls against the domain Actors. Blue actors are persisted to disk, green actors are volatile (in-memory replication) and red actors are singletons. Since Actors are single threaded, I added Transcript Views that act as individual cached instances of the Live Q & A transcript. This allows each participant to get a quick response from the service without being penalized by competing for a lock on the read operation. The transcript actor uses an Actor Reminder to read the list of question Actors from the session Actor on a regular interval. Then it proceeds to rendering the transcript, stores it in the State Manager and pushes it out to the registered transcript views. Through the Web API, users can also create sessions, list sessions, join sessions and ask questions.
Since rendering a Transcript can be expensive, I decided to have in-memory replicas. This allows for failovers to be quick and efficient. Transcripts are continuously being rebuilt so they do not need to take the extra performance hit of writing to state to disk. I/O is expensive and we have to keep that in mind. Transcript views are in-memory copies of the transcript and only serve as a means to scale. They can be rebuilt at any time and do not require persistence.
To help make things easier, I decided to reuse Actor IDs. The transcript Actor ID is the same as the session Actor ID. This means that I had one less ID to track. By making a call to GET a session transcript view, the user triggers the creation of the transcript view, that triggers the creation of a transcript and registers for updates. The transcript then pulls the existing session on a regular basis. In future iterations of this sample, I am thinking of using Actor Events to build out a Pub-Sub (best effort messaging) mechanism.
publish–subscribe is a messaging pattern where senders of messages, called publishers, do not program the messages to be sent directly to specific receivers, called subscribers, but instead characterize published messages into classes without knowledge of which subscribers, if any, there may be. Similarly, subscribers express interest in one or more classes and only receive messages that are of interest, without knowledge of which publishers, if any, there are… read more
Service Fabric has been around for a while at Microsoft, and has recently made its way to all developers. It’s not a small platform and I’ve only scratched the surface. This post has two goals. The first is to get you interested in learning more and the second is to get you to try it out for yourself.
I’ve been playing with Service Fabric and Actors for a while now and I see so much potential, so many options and so much agility. I can’t think of any scenario today, where I would not consider Service Fabric as my goto Microservice platform.
Service Fabric runs on Linux and on Windows, in any Data Center and as a Platform as a Service on Microsoft Azure. Think about the applications that you’re working on today. Would they benefit from highly available stateful Microservice?