The Kubernetes sweet spot is running stateless microservices that can scale horizontally. By keeping state out of your application, Kubernetes can seamlessly add, remove, or restart pods to keep your service healthy and scalable. Developing a stateless application is, without question, the easiest way to ensure that your app can scale with Kubernetes. However, some workloads do not run effectively in a stateless way, and for those, Kubernetes offers a few tools for developing stateful applications: leader election, StatefulSets, and session affinity.

Leader election is typically used when you want only one instance of your service (one pod) to act on data at one time. For example, if you want to ensure that only one instance of your service updates customer information at a time, you must lock the data being accessed. Typically, this locking is facilitated by a transactional database. But if you are using an eventually consistent datastore, this locking is usually not guaranteed at the database level, meaning it is up to the application developer to implement safe locking. Since Kubernetes is backed by etcd, and etcd implements a consistent consensus protocol (Raft), we can leverage Kubernetes primitives to perform leader election.

StatefulSets are used for building stateful distributed systems by providing a few key features: a unique index for each pod, ordered pod creation, and stable network identities. Together, these three features guarantee the ordering and uniqueness of the pods that make up your service. These guarantees are preserved across pod restarts and any rescheduling within the cluster, allowing you to run applications that require shared knowledge of cluster membership.

Session affinity, sometimes referred to as sticky sessions, associates all requests coming from an end user with a single pod. This means that, to the greatest extent possible, all traffic from a given client will be directed to the same pod.

Together, leader election, StatefulSets, and session affinity can be used to build robust stateful applications that should meet most needs. The rest of this article shows how each of these concepts is implemented in Kubernetes, and when you would want to use them.

Leader Election

With leader election, you begin with a set of candidates that wish to become the leader, and each of these candidates races to see who will be the first to be declared the leader. Once a candidate has been elected leader, it continually sends a heartbeat signal to renew its position as the leader. If that heartbeat fails, the other candidates again race to become the new leader. Implementing a leader election algorithm usually requires either deploying software such as ZooKeeper or etcd and using it to determine consensus, or implementing a consensus algorithm on your own. Neither of these is ideal: ZooKeeper and etcd are complicated pieces of software that can be difficult to operate, and implementing a consensus algorithm on your own is a road fraught with peril. Thankfully, Kubernetes already runs an etcd cluster that consistently stores Kubernetes cluster state, and we can leverage that cluster to perform leader election simply by using the Kubernetes API server.

Kubernetes already uses the Endpoints resource to represent the replicated set of pods that comprise a service, and we can re-use that same object to retrieve all the pods that make up your distributed system. Given this list of pods, we leverage two other properties of the Kubernetes API: ResourceVersions and Annotations. Annotations are arbitrary key/value pairs that can be used by Kubernetes clients, and ResourceVersions mark the unique version of every Kubernetes resource in the cluster. Given these two primitives, we can perform leader election in a fairly straightforward manner: query the Endpoints resource to get the list of all pods running your service, and set Annotations on those resources. Each change to an Annotation also updates the ResourceVersion metadata. Because the Kubernetes API server is backed by etcd, a strongly consistent datastore, you can use Annotations and the ResourceVersion metadata to implement a simple compare-and-swap algorithm.
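
To make this concrete, here is a rough sketch of what an annotated Endpoints object might look like mid-election. The annotation key and JSON record shown here follow the convention used by Kubernetes' own control-plane components; the exact key, field names, and values depend on the client library you use, and the identities and timestamps below are purely illustrative.

kind: Endpoints
apiVersion: v1
metadata:
  name: my-service
  # resourceVersion changes on every write; candidates use it to detect whether
  # another candidate updated the leader record first (the compare-and-swap).
  resourceVersion: "4130"
  annotations:
    # The current leader record. The leader periodically updates renewTime as its
    # heartbeat; if it stops renewing, another candidate overwrites the record
    # with its own holderIdentity and becomes the new leader.
    control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"my-service-0","leaseDurationSeconds":15,"acquireTime":"2018-06-01T00:00:00Z","renewTime":"2018-06-01T00:05:00Z","leaderTransitions":3}'
subsets: []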

Google has used this approach to implement leader election as a Kubernetes service, and you can run that service as a sidecar to your application to perform leader election backed by etcd. For more on running a leader election algorithm in Kubernetes, refer to this blog post.
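
As a sketch of what this looks like in practice, the Deployment below runs the leader-elector sidecar next to a hypothetical application container. The image tag, election name, and application image are assumptions and may need adjusting for your cluster; on RBAC-enabled clusters, the pod's service account also needs permission to get and update Endpoints.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app                 # your application container (placeholder image)
        image: my-app:latest
      - name: elector                # leader election sidecar
        image: gcr.io/google_containers/leader-elector:0.5
        args:
        - --election=my-app          # name of the election (and of the Endpoints object backing it)
        - --http=0.0.0.0:4040        # serve the current leader's identity over HTTP
        ports:
        - containerPort: 4040

Your application container can then find out whether it is the leader by querying the sidecar (for example, http://localhost:4040) and comparing the returned identity with its own pod name.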

I would be remiss at this point not to mention another technique for implementing leader election: leveraging a managed, strongly consistent datastore. In particular, DynamoDB is a strongly consistent datastore that can serve as a simple lock system. By having your system acquire locks using DynamoDB, you can build a leader election scheme that ensures only one client acts on a certain piece of data at a time. For more on using DynamoDB for leader election, refer to the DynamoDB Lock Client project on GitHub.

StatefulSets

Some services require a notion of cluster membership to run successfully. In Kubernetes, this means that each pod in a service knows about every other pod in the service, and that those pods are able to communicate with each other consistently. This is facilitated by StatefulSets, which guarantee that pods have unique and addressable identities, and persistent volumes, which guarantee that any data written to disk will be available after a pod restart.

In particular, each pod in a StatefulSet has a unique identity that is composed of an index, a stable network identity, and stable storage. This identity sticks to the pod, regardless of which node it is (re)scheduled on. For example, if you deploy a service called postgres with three replicas, the pods in your cluster will be named postgres-0, postgres-1, and postgres-2. What's more, Kubernetes guarantees that your pods are started in order, so postgres-0 will be started and available before postgres-1, and so on. You can use this feature to deploy a service in a master-slave topology by treating pod 0 as the master and the remaining pods as slaves. For example, PostgreSQL's high-availability deployment involves running a single master (responsible for handling writes) with one or more replicas (read-only slaves). In this configuration, the master would be associated with pod 0, and all non-zero pods would serve as replicas of that master.
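
To make this concrete, here is a minimal sketch of the postgres example above: a headless Service (which gives each pod a stable DNS name) plus a StatefulSet with three replicas. The image, port, and storage size are placeholders, and a real high-availability PostgreSQL deployment would also need replication configuration that designates postgres-0 as the master, so treat this only as a skeleton.

apiVersion: v1
kind: Service
metadata:
  name: postgres
spec:
  clusterIP: None          # headless service: no load balancing, just stable per-pod DNS names
  selector:
    app: postgres
  ports:
  - port: 5432
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres    # the headless service above
  replicas: 3              # creates postgres-0, postgres-1, postgres-2, in that order
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - name: postgres
        image: postgres:10 # placeholder image
        ports:
        - containerPort: 5432
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:    # each pod gets its own persistent volume, re-attached wherever the pod is rescheduled
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi

With this in place, each pod is reachable within the namespace at a stable name such as postgres-0.postgres, which is exactly the kind of known, addressable identity that clustered systems rely on.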

You can also leverage StatefulSets to deploy services that require a notion of cluster membership but have no single designated master. For example, a ZooKeeper cluster requires each node to know the network-addressable name of every other node. With a StatefulSet, you can deploy a set number of nodes with known, network-addressable names. By configuring your ZooKeeper cluster with the names and addresses of each node in the cluster, it will be able to run its own internal leader election algorithm that ensures strongly consistent and replicated writes.

For more on StatefulSets, consult the Kubernetes documentation.

Session Affinity

Session affinity directs traffic from a given client to the same pod. It is typically used as an optimization: by ensuring that requests from the same user land on the same pod, you can take advantage of session caching. It is worth noting that session affinity is a best-effort endeavour, and there are scenarios where it will fail due to pod restarts or network errors.

Typically, session affinity is handled by load balancers that direct traffic to a set of VMs (or nodes). In Kubernetes, however, we deploy services as pods, not VMs, and we need session affinity to direct traffic at pods. This is handled by Kubernetes' internal load balancing, implemented with either iptables or IPVS. When deploying your service, you can configure session affinity in the Service's YAML. For example, the following Service uses ClientIP session affinity, which routes all requests originating from the same IP address to the same pod:

kind: Service
apiVersion: v1
metadata:
  name: my-service
spec:
  selector:
    app: my-app
  ports:
  - name: http
    protocol: TCP
    port: 80
    targetPort: 80
  sessionAffinity: ClientIP
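
By default, the ClientIP association is kept for 10800 seconds (three hours). On newer Kubernetes versions you can tune this window by adding a sessionAffinityConfig block to the same Service spec; the one-hour value below is just an example:

spec:
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 3600   # drop the client-to-pod association after one hour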

This configuration works well for routing traffic to the same pod once it has reached the Kubernetes cluster. To get session affinity for traffic that originates outside of the Kubernetes cluster, you also need to configure the Ingress into the cluster for session affinity. With the NGINX Ingress controller, session affinity is enabled when defining the route to your service. For example, the following Ingress uses cookie-based affinity: the cookie is named route, and its value is hashed using sha1.

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: nginx-test
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "route"
    nginx.ingress.kubernetes.io/session-cookie-hash: "sha1"

spec:
  rules:
  - host: stickyingress.example.com
    http:
      paths:
      - backend:
          serviceName: http-svc
          servicePort: 80
        path: /

For more on session affinity, see Virtual IPs and service proxies and the nginx sticky session docs. We will also provide more documentation on enabling session affinity for our Kubernetes deployment as we finalize it.

Summary

Stateful distributed computing is both a broad and deep topic with inherent complexity; it is impossible to prescribe an exact best practice for running such complicated applications. However, the techniques shown in this article can be used as building blocks for deploying and running stateful applications using some of the built-in functionality of Kubernetes. In particular, you can leverage the etcd cluster used by the Kubernetes API server to perform leader election, use StatefulSets to define a cluster membership topology for your service, and use session affinity to consistently route traffic to the same pod. All of these can help you develop a highly available stateful application, but none of them are silver bullets. It is still up to each application developer to design their application to handle failures, restarts, and data consistency according to its needs.