DynamoDB provides built-in support for cross-region data replication using a feature AWS calls global tables. This article shows how to build and run an application in Kubernetes that uses global tables to replicate data between regions. In the event of a regional disaster, a secondary Kubernetes cluster in a second region has a local replica of all the DynamoDB data and can continue operating.

How global tables work

A DynamoDB global table is a set of multiple replica tables. Each replica table exists in a different AWS region, but all replicas have the same name and primary key. Whenever data is written to any replica table, the data is automatically replicated to all other replica tables that have been added to the global table.

Because any replica in a global table can accept both reads and writes, there are some potential data consistency issues to be aware of. Notably, DynamoDB does not support strongly consistent reads across regions. Therefore, if you write to one region and read from another region, the read response might include stale data that doesn’t reflect the results of recently completed writes in the other region. Furthermore, if applications update the same item in different regions at about the same time, conflicts can happen during replication because no replica knows which update was causally “first”. To resolve conflicts, DynamoDB uses a last-writer-wins policy: the most recent write accepted by any replica table is propagated to all replica tables, potentially overwriting an earlier concurrent write that another replica had already accepted.
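To make the last-writer-wins behaviour concrete, the sketch below shows two writers updating the same item through clients pointed at different regions. It is a hypothetical example: the table name, key, and attribute are placeholders, and it assumes a global table with replicas in both regions already exists.

import boto3

# Two handles to the same global table, one per regional replica.
east = boto3.resource('dynamodb', region_name='us-east-1').Table('profiles')
west = boto3.resource('dynamodb', region_name='us-west-2').Table('profiles')

# Both regions accept a write for the same item at roughly the same time.
east.put_item(Item={'user_id': 'alice', 'email': 'alice@example.com'})
west.put_item(Item={'user_id': 'alice', 'email': 'alice@example.org'})

# After replication settles, both replicas return the same item: whichever
# write DynamoDB judged to be last wins, and the other write is overwritten.
print(east.get_item(Key={'user_id': 'alice'})['Item'])
print(west.get_item(Key={'user_id': 'alice'})['Item'])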

For the purposes of this article, we are using DynamoDB in a disaster recovery scenario where users are directed to a single region to perform reads and writes and, in the case of a regional failure, are redirected to a separate region. In this case, the application can work without any modification. If you require an active-active deployment where any table can be modified by any writer, then you need to design your application to accommodate the eventually consistent reads and last-writer-wins constraints of global tables.

Multi-region Kubernetes with EKS

I’m assuming in this article that you have two Kubernetes clusters deployed in two separate AWS regions. You can follow the guidance in my previous article on setting up a multi-region EKS cluster to get started. The end result is an architecture like the one in the diagram below: two EKS clusters in isolated VPCs in us-east-1 and us-west-2, with AWS Global Accelerator configured to direct traffic between them.

A multi-region EKS deployment supporting disaster recovery

With this setup in place, let’s see how to deploy a Kubernetes service that uses DynamoDB global tables to support disaster recovery.

Setting up IAM roles for Service Accounts

The first thing we need to do is configure a service account for our Kubernetes deployment using IAM Roles for Service Accounts, which grants containerized workloads running on Kubernetes permission to access AWS services and resources using traditional AWS IAM roles.

Creating the IAM Policy

The following JSON document specifies an IAM policy allowing all DynamoDB actions on a specific table called users, as well as on the indexes for that table. For example purposes, this policy applies to matching resources in any AWS account and region. You can of course make it more restrictive by limiting access to the specific account and regions where you intend to deploy your global table, as shown after the policy.

{
  "Statement": [
    {
      "Action": "dynamodb:*",
      "Effect": "Allow",
      "Resource": [
        "arn:aws:dynamodb:*:*:table/users",
        "arn:aws:dynamodb:*:*:table/users/index/*"
      ],
      "Sid": "AllAPIActionsOnUsers"
    }
  ],
  "Version": "2012-10-17"
}
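For example, a more restrictive Resource list that pins the policy to a single account and the two regions used in this article might look like the following, with <your-account-id> as a placeholder:

"Resource": [
  "arn:aws:dynamodb:us-east-1:<your-account-id>:table/users",
  "arn:aws:dynamodb:us-east-1:<your-account-id>:table/users/index/*",
  "arn:aws:dynamodb:us-west-2:<your-account-id>:table/users",
  "arn:aws:dynamodb:us-west-2:<your-account-id>:table/users/index/*"
]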

You can create the policy from this JSON document (saved here as iam-policy.json) via the AWS CLI.

aws iam create-policy --policy-name dynamo-iam-policy --policy-document file://iam-policy.json
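The command returns the new policy's ARN, which you will need when creating the service account below. If you prefer, you can capture it directly into a shell variable by adding a --query filter to the same command:

export POLICY_ARN=$(aws iam create-policy \
    --policy-name dynamo-iam-policy \
    --policy-document file://iam-policy.json \
    --query 'Policy.Arn' --output text)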

Since IAM is a global AWS resource, you only need to create this policy once, and it will be accessible in any AWS region. This is convenient for us because we can use the same policy across both the us-east-1 and us-west-2 regions.

Creating the Service Account

With our policy in place, you can create a Kubernetes service account using eksctl.

eksctl create iamserviceaccount \
    --name dynamo-sa \
    --cluster sookocheff-us-east-1 \
    --attach-policy-arn "<the-dynamodb-policy-arn>" \
    --approve

After executing this command you should see a CloudFormation stack has been created. In my case, the stack is given a name that represents the EKS cluster, Kubernetes namespace, and service account name that will be created.

eksctl-sookocheff-us-east-1-addon-iamserviceaccount-default-dynamo-sa

The resource created by the stack is an IAM Role with our DynamoDB Policy attached.

In addition to the IAM Role, this command creates a service account in our Kubernetes cluster. A Kubernetes service account gives pods an identity that they can assume for the lifetime of the pod. When a pod runs under the service account, a signed JWT token for that identity is projected into the pod, where it can be used to make authenticated requests to AWS.

You can see the service account that was created using kubectl. The annotations on the service account list the IAM Role used and the ARN of that Role.

❯ kubectl describe sa dynamo-sa
Name:                dynamo-sa
Namespace:           default
Labels:              app.kubernetes.io/managed-by=eksctl
Annotations:         eks.amazonaws.com/role-arn: arn:aws:iam::<your-account>:role/eksctl-sookocheff-us-east-1-addon-iamservicea-Role1-<your-role>
Image pull secrets:  <none>
Mountable secrets:   <none>
Tokens:              <none>
Events:              <none>
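Because we will deploy the application to both regions, repeat the same command against the cluster in the second region so that a matching service account and IAM role exist there too. Assuming the second cluster follows the same naming convention, that looks like:

eksctl create iamserviceaccount \
    --name dynamo-sa \
    --cluster sookocheff-us-west-2 \
    --region us-west-2 \
    --attach-policy-arn "<the-dynamodb-policy-arn>" \
    --approve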

Attaching the Service Account to our Pod

To allow a Pod to use the service account and its associated IAM role, AWS installs a mutating admission webhook in every EKS cluster. This webhook listens for pod creation API calls and injects a JWT and the matching role configuration into our pods, based on the annotations of the service account the pod uses.

Specifying the service account to use for a Pod deployment is as simple as stating it in the Pod spec using the serviceAccountName field. For example, the following simple deployment includes serviceAccountName: my-service-account. Upon deploying this service, the IAM Roles for Service Accounts webhook installed in our EKS cluster will react by mutating the Pod spec and injecting the correct credentials into the Pod at runtime. When we (finally) deploy our DynamoDB service to Kubernetes, we will need to set serviceAccountName to dynamo-sa to match the service account we created.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      serviceAccountName: my-service-account
      containers:
      - name: my-app
        image: public.ecr.aws/nginx/nginx:X.XX
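Once a Pod that uses an annotated service account is running, you can check that the webhook did its work. Assuming the deployment above, listing the container's environment should show IRSA-related variables such as AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE, which point at the IAM role and the projected token file the AWS SDKs use to authenticate:

> kubectl exec deploy/my-app -- env | grep AWS_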

Configuring DynamoDB Global Tables

I tried for quite some time to use the boto3 Python SDK to create a DynamoDB global table, but that feature did not seem to be supported at the time of writing: I could not figure out how to create a table with the elastic read/write throughput that global tables require. Instead, I created an initial table using boto3, and then modified it manually using the AWS console to first set read/write throughput to elastic and then add replicas to create the global table.

Using boto3 and Python, you can create the initial DynamoDB table with the following function. If you are following along, the key points are the schema: username and last_name are both attributes of String type, with username as the partition key and last_name as the sort key.

def create_table(self, table_name):
    """
    Creates an Amazon DynamoDB table that can be used to store user data.

    :param table_name: The name of the table to create.
    :return: The newly created table.
    """
    try:
        self.table = self.dyn_resource.create_table(
            TableName=table_name,
            KeySchema=[
                {
                    'AttributeName': 'username',
                    'KeyType': 'HASH'
                },
                {
                    'AttributeName': 'last_name',
                    'KeyType': 'RANGE'
                }
            ],
            AttributeDefinitions=[
                {
                    'AttributeName': 'username',
                    'AttributeType': 'S'
                },
                {
                    'AttributeName': 'last_name',
                    'AttributeType': 'S'
                },
            ],
            ProvisionedThroughput={
                'ReadCapacityUnits': 5,
                'WriteCapacityUnits': 5
            }
        )

        self.table.wait_until_exists()

    except botocore.exceptions.ClientError as error:
        if error.response['Error']['Code'] == 'ResourceInUseException':
            # The table already exists, so use the existing table.
            self.table = self.dyn_resource.Table(table_name)
        else:
            raise error

    return self.table.table_status

Once created, you can navigate to this table in the AWS console and create a replica. This will require first updating the throughput to be elastic, and then specifying a region for the replica. AWS will do the work of setting up the replica and copying the data between tables in each region.
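If you prefer to script this step rather than click through the console, the same two changes can in principle be made with the AWS CLI: the first command switches the table to on-demand capacity, and the second adds a replica in us-west-2. This is a sketch I did not use for this setup, so verify the flags against your CLI version:

aws dynamodb update-table \
    --table-name users \
    --billing-mode PAY_PER_REQUEST \
    --region us-east-1

aws dynamodb update-table \
    --table-name users \
    --replica-updates '[{"Create": {"RegionName": "us-west-2"}}]' \
    --region us-east-1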

Now we can finish out the rest of our simple application that adds and lists users. We start with a simple class that wraps our table with the ability to add users and list existing ones.

import botocore


class Users:
    """Encapsulates an Amazon DynamoDB table of use data."""

    def __init__(self, dyn_resource):
        """
        :param dyn_resource: A Boto3 DynamoDB resource.
        """
        self.dyn_resource = dyn_resource
        # The table variable is set during the scenario in the call to
        # 'exists' if the table exists. Otherwise, it is set by 'create_table'.
        self.table = None

    def create_table(self, table_name):
        """
        Creates an Amazon DynamoDB table that can be used to store user data.

        :param table_name: The name of the table to create.
        :return: The newly created table.
        """
        try:
            self.table = self.dyn_resource.create_table(
                TableName=table_name,
                KeySchema=[
                    {
                        'AttributeName': 'username',
                        'KeyType': 'HASH'
                    },
                    {
                        'AttributeName': 'last_name',
                        'KeyType': 'RANGE'
                    }
                ],
                AttributeDefinitions=[
                    {
                        'AttributeName': 'username',
                        'AttributeType': 'S'
                    },
                    {
                        'AttributeName': 'last_name',
                        'AttributeType': 'S'
                    },
                ],
                ProvisionedThroughput={
                    'ReadCapacityUnits': 5,
                    'WriteCapacityUnits': 5
                }
            )

            self.table.wait_until_exists()

        except botocore.exceptions.ClientError as error:
            if error.response['Error']['Code'] == 'ResourceInUseException':
                self.table = self.dyn_resource.Table(table_name)
            else:
                raise error

        return self.table.table_status

    def add_user(self, profile):
        """
        Adds a user to the table.

        :param profile: A user profile.
        """
        # Faker names can include prefixes or suffixes, so take the final
        # token as the last name rather than assuming exactly two parts.
        last_name = profile['name'].split()[-1]

        self.table.put_item(
            Item={
                'username': profile['username'],
                'last_name': last_name,
            }
        )

        return profile

    def list_users(self):
        """
        Scans for users in the table.
        """
        users = []
        scan_kwargs = {}

        done = False
        start_key = None
        while not done:
            if start_key:
                scan_kwargs["ExclusiveStartKey"] = start_key
            response = self.table.scan(**scan_kwargs)
            users.extend(response.get("Items", []))
            start_key = response.get("LastEvaluatedKey", None)
            done = start_key is None

        return users

Next, we can run a simple Flask server with API endpoints for adding and listing users to test our functionality. In this example, I’m reading the regional DynamoDB endpoint and AWS region as environment variables that we will need to set on our deployment.

import os

import boto3
from faker import Faker
from flask import Flask, jsonify

dynamo_endpoint = os.environ.get("DYNAMO_ENDPOINT", "http://localhost:8000")
dynamo_region = os.environ.get("AWS_REGION", "us-east-1")

dynamo_client = boto3.client('dynamodb',
                             endpoint_url=dynamo_endpoint,
                             region_name=dynamo_region)
dynamodb = boto3.resource('dynamodb',
                          endpoint_url=dynamo_endpoint,
                          region_name=dynamo_region)
users = Users(dynamodb)
fake = Faker()

users.create_table('users')

app = Flask(__name__)

@app.route('/dynamodr')
def index():
    return jsonify(users.list_users())


@app.route('/dynamodr/add-user')
def add_user():
    return jsonify(users.add_user(fake.profile()))


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
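The DYNAMO_ENDPOINT default of http://localhost:8000 matches the default port of DynamoDB Local, so you can try the app on your workstation before deploying it. For example, using the amazon/dynamodb-local Docker image (boto3 still needs some credentials configured, but dummy values are fine for DynamoDB Local):

> docker run -d -p 8000:8000 amazon/dynamodb-local
> python app.py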

We can containerize this simple app with an equally simple Dockerfile.

FROM --platform=linux/amd64 python:3.11-slim-bullseye

WORKDIR /app

COPY requirements.txt /app
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements.txt

COPY . /app

ENTRYPOINT ["python"]
CMD ["app.py"]

The only requirements listed in requirements.txt are:

boto3==1.29.0
Faker==20.0.3
Flask==3.0.0
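With the Dockerfile and requirements in place, build the image and push it to the container registry you plan to reference in the deployment spec below. The registry and tag here are placeholders:

> docker build -t <your-container-registry>/<your-container-tag>-amd64 .
> docker push <your-container-registry>/<your-container-tag>-amd64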

Deploying the App

For both our us-east-1 and us-west-2 regions, we can deploy our app using the following spec. You will need to update the container image location to point at your own container registry and make sure the test app has been pushed to that registry. This spec creates a Deployment from our container, a Service allowing access to our Deployment over port 80, and an Ingress allowing web traffic to reach our Service. Note the serviceAccountName: dynamo-sa line that grants the pod access to the IAM role and policy we created earlier.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: dynamodr
  labels:
    app.kubernetes.io/name: dynamodr
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dynamodr
  replicas: 1
  template:
    metadata:
      labels:
        app.kubernetes.io/name: dynamodr
    spec:
      terminationGracePeriodSeconds: 0
      serviceAccountName: dynamo-sa
      containers:
        - name: dynamodr
          image: <your-container-registry>/<your-container-tag>-amd64
          imagePullPolicy: Always
          ports:
            - name: app-port
              containerPort: 5000
          env:
            - name: AWS_REGION
              value: "${AWS_REGION}"
            - name: DYNAMO_ENDPOINT
              value: "https://dynamodb.${AWS_REGION}.amazonaws.com"
---
apiVersion: v1
kind: Service
metadata:
  name: dynamodr
  labels:
    app.kubernetes.io/name: dynamodr
spec:
  type: ClusterIP
  selector:
    app.kubernetes.io/name: dynamodr
  ports:
    - name: svc-port
      port: 80
      targetPort: app-port
      protocol: TCP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: default-ingress
spec:
  ingressClassName: nginx
  rules:
    - http:
        paths:
          - path: /dynamodr
            pathType: Prefix
            backend:
              service:
                name: dynamodr
                port:
                  name: svc-port

Deploy this to each region using the following commands:

> AWS_REGION=us-east-1 envsubst < service.yaml | kubectl apply -f -
> AWS_REGION=us-west-2 envsubst < service.yaml | kubectl apply -f -

Your service is now accessible from the URL of the load balancer for each region. If you capture that URL in two different terminal sessions, one for each AWS region, you can start adding users to your table and watch those additions being replicated across regions.

First, get the URL of the load balancer for your EKS cluster:

> export NLB_URL=$(kubectl get -n kube-system service/ingress-nginx-controller \
    -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')

Then, begin adding users.

> curl ${NLB_URL}/dynamodr/add-user

Now, in each terminal you can list users and see that users are added to the tables in both regions: regardless of where you add data, the other table is updated to reflect your changes. We now have a fully replicated DynamoDB table backing Kubernetes clusters in two different AWS regions.

> curl ${NLB_URL}/dynamodr
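If you want to confirm the replication from outside the application, you can also compare item counts directly against each replica table with the AWS CLI:

> aws dynamodb scan --table-name users --select COUNT --region us-east-1
> aws dynamodb scan --table-name users --select COUNT --region us-west-2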

Failing over Traffic

If you followed the previous article on setting up a multi-region EKS cluster, you will have Global Accelerator pointing to the load balancer in each AWS region. You can use Global Accelerator to fail over traffic by changing the percentage of traffic routed to each region.

To simulate a disaster recovery scenario, you can route 100% of traffic to us-east-1, and then switch to routing 100% of traffic to us-west-2. Any data that was persisted in your main site will already be replicated and available in the table in your failover site.
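Routing is controlled by the traffic dial on each Global Accelerator endpoint group. As a sketch of what shifting all traffic to us-west-2 could look like from the CLI (the endpoint group ARNs are placeholders, and you should verify the flags against your CLI version):

> aws globalaccelerator update-endpoint-group \
    --endpoint-group-arn <us-east-1-endpoint-group-arn> \
    --traffic-dial-percentage 0
> aws globalaccelerator update-endpoint-group \
    --endpoint-group-arn <us-west-2-endpoint-group-arn> \
    --traffic-dial-percentage 100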