Learning a Language with Amazon Polly and a Serverless Chalice App

For the past year I’ve been making a concerted effort to learn French using the methods from the book Fluent Forever, which is an excellent resource for learning how to learn a language. For those not familiar with the method, it boils down to this:

Learn Pronunciation: knowing how to correctly pronounce words in your target language makes everything else easier.
Learn Frequently Used Words: not all words are created equal, learn the most frequently used words first.
Learn Grammar: put together grammatical sentences using the words you already know.

If you turn your head to the side and squint at that list, it somewhat resembles the steps you would take to learn a language as an infant — first understand the sounds of the language, then learn words (“mommy”, “daddy”), and finally put together correct sentences. In addition, as an infant you have a constant source of high quality input helping you learn words and grammar. You can imagine the following “conversation” between an adult and a hungry child:

“Tommy, do you want an apple?”
“Apple?”
(points at Apple)
“Apple? Want Apple?”
(points at Apple again)
“Apple now?”

It won’t take long for that hungry infant to connect the word “Apple” to the object in front of them. This type of reinforcement works great for an infant, but as adults we are forced to simulate it using frequent review. This is where a spaced-repetition program can help. If you aren’t familiar with spaced repetition, the simplified version is that you create flash cards and review them at the points in time that help you remember them best. Spaced-repetition software helps by computing the optimal time to review each flash card so that it is driven into long-term memory. The most flexible and feature-rich spaced-repetition program I’ve found is Anki, which I’ve been using with a lot of success to learn frequently used words.

Now, a particular thorn in my progress learning French is verb conjugations. I’m trying to resolve this thorn using Anki by finding grammatically correct sentences and creating flash cards from them. These cards ask you to find the root form of a verb, and the correct conjugated form that fits grammatically in the example sentence (for a full explanation of the method, see the Fluent Forever blog). To reinforce correct grammar and pronunciation, each sentence should ideally be accompanied by a recording of a native speaker speaking the sentence. Unfortunately, it’s not always easy to find a native speaker willing to record sentences for you — this is where Amazon Polly comes in.

Amazon Polly is a service that turns text into speech in a wide variety of languages and voices. With Polly, you can easily create quality recordings that approximate a native speaker for learning a language. To help automate the creation of these recordings, I created a simple serverless web application that takes text as input, turns that text into speech using Polly, and stores the result in S3. The rest of this post describes this application. Full source code is available on GitHub.

Architecture

The API for this simple service exposes two endpoints: one for creating a recording and one for retrieving recordings. These endpoints are exposed through API Gateway and are backed by Lambda functions. The Lambda functions handle converting text to speech and storing that speech in S3. A DynamoDB table lists all the recordings and their locations in S3.

The following diagram shows the application architecture.

When the user wants to create a new recording:

An HTTP call is made to the create endpoint exposed by API Gateway.
API Gateway invokes a Lambda function responsible for converting the text into speech and storing the result. The function performs the following actions:
- Use Polly to convert text into an audio file.
- Store the result in S3.
- Store a record of the input text and the resulting mp3 file location in DynamoDB.

When the user wants to get an existing recording:

An HTTP call is made to the get endpoint exposed by API Gateway.
API Gateway invokes a Lambda function responsible for retrieving the record data from DynamoDB.
The user uses the S3 URL returned by DynamoDB to download the mp3 file.

Now let’s walk through how to create the application using the Chalice serverless framework from AWS labs.

Creating The Backing Resources

Our serverless application relies on two AWS resources: an S3 bucket to store recorded speech, and a DynamoDB table to index the S3 URL of each recording alongside its text. Since I don’t want to keep these recordings forever, I set an expiration time of two days on all S3 objects in the bucket, and also configure a time-to-live of two days for DynamoDB entries. The following CloudFormation template creates the required resources:

---
AWSTemplateFormatVersion: '2010-09-09'
Parameters:
  S3BucketName:
    Type: String
    Description: "S3 bucket name"
    MinLength: 4
    MaxLength: 253
  DynamoDBTableName:
    Type: String
    Description: "Dynamo table name"
    MinLength: 4
    MaxLength: 253
Resources:
  TranslationsBucket:
    Properties:
      AccessControl: Private
      BucketName: !Ref S3BucketName
      LifecycleConfiguration:
        Rules:
        - ExpirationInDays: 2
          Id: TranslationsBucketRule
          Status: Enabled
    Type: AWS::S3::Bucket
  TranslationsTable:
    Properties:
      AttributeDefinitions:
      - AttributeName: id
        AttributeType: S
      KeySchema:
      - AttributeName: id
        KeyType: HASH
      ProvisionedThroughput:
        ReadCapacityUnits: 5
        WriteCapacityUnits: 5
      TableName: !Ref DynamoDBTableName
      TimeToLiveSpecification:
        AttributeName: expires
        Enabled: true
    Type: AWS::DynamoDB::Table
Outputs:
  TranslationsBucket:
    Description: S3 bucket storing translations.
    Value:
      Ref: TranslationsBucket
  TranslationsTable:
    Description: DynamoDB table indexing translations.
    Value:
      Ref: TranslationsTable

In the Resources section, we create TranslationsBucket of type AWS::S3::Bucket. The bucket includes a LifecycleConfiguration rule that expires every object two days after it is placed in the bucket. The TranslationsTable is a DynamoDB table with a simple id as the primary hash key. The TimeToLiveSpecification names the attribute we will use to expire records (expires) and enables TTL for the table. Note that DynamoDB does not require you to define a full schema ahead of time; you only need to specify the key to start using the table.

You can deploy this CloudFormation template to create the required resources for our serverless application. Be sure to specify the desired name for your S3 bucket and for your Dynamo table.

aws cloudformation create-stack \
    --stack-name polly-recorder \
    --template-body file://cloudformation.yaml \
    --parameters ParameterKey=S3BucketName,ParameterValue=<s3-bucket-name> \
                 ParameterKey=DynamoDBTableName,ParameterValue=<dynamo-table>
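
If you would rather script the deployment from Python, roughly the same call can be sketched with boto3. This is an equivalent of the CLI command above and not part of the original project: the helper names build_stack_parameters and create_polly_stack are made up for illustration, and the template path is assumed to match the CLI example.

```python
def build_stack_parameters(bucket_name, table_name):
    """Build the Parameters list in the shape create_stack expects."""
    return [
        {"ParameterKey": "S3BucketName", "ParameterValue": bucket_name},
        {"ParameterKey": "DynamoDBTableName", "ParameterValue": table_name},
    ]


def create_polly_stack(bucket_name, table_name,
                       template_path="cloudformation.yaml"):
    """Create the stack (hypothetical helper; needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the parameter helper works without boto3

    with open(template_path) as template:
        body = template.read()

    cloudformation = boto3.client("cloudformation")
    return cloudformation.create_stack(
        StackName="polly-recorder",
        TemplateBody=body,
        Parameters=build_stack_parameters(bucket_name, table_name),
    )
```

Keeping the parameter construction in its own function means it can be checked without making any AWS calls.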

The Chalice Application

With our resources ready to use, we can create the Chalice application that implements our service. The following sequence of commands creates a new Chalice application:

$ pip install --pre chalice
$ chalice new-project polly-recorder && cd polly-recorder
$ cat app.py

You can then deploy and test the simple hello world example:

$ chalice deploy
...
Your application is available at: https://endpoint/dev

$ curl https://endpoint/dev
{"hello": "world"}

Create Endpoint

The create endpoint is responsible for synthesizing text into speech, storing the result in S3, and indexing the S3 URL in Dynamo for future retrieval.

Using Chalice, we define a route called recordings that accepts POST requests. We also enable CORS support and require an API key for a minimal layer of security. You can add the endpoint to app.py:

@app.route('/recordings',
           methods=['POST'],
           cors=True,
           api_key_required=True)
def create_recording():
    pass

Now we need to fill this out to implement the desired functionality.

VOICES = ['Celine', 'Mathieu', 'Chantal']

@app.route('/recordings',
           methods=['POST'],
           cors=True,
           api_key_required=True)
def create_recording():
    """
    Create a new recording.
    """
    body = app.current_request.json_body
    record_id = str(uuid.uuid4())
    text = body.get("text")
    voice = random.choice(VOICES)

    synthesize_speech(record_id, text, voice)
    url = upload_to_s3(record_id, S3_BUCKET)
    item = index_in_dynamodb(record_id, text, voice, url, DYNAMO_DB_TABLE)
    return [item]

This function starts by accessing the JSON body of the current request, available from the Chalice request metadata. From here, it extracts the text from the request and converts that text to speech using a randomly chosen French voice.
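
One gap in the handler as written is that it assumes every request carries a JSON body with a text field. A minimal validation sketch follows, assuming Python 3. Both extract_text and MAX_TEXT_LENGTH are hypothetical and not part of the original application, and the 1,000-character cap is an arbitrary choice. Inside a Chalice view you would probably translate the ValueError into Chalice's BadRequestError so the caller receives a 400 response.

```python
MAX_TEXT_LENGTH = 1000  # arbitrary cap for this sketch, not a Polly limit


def extract_text(body):
    """Return the cleaned text from a parsed JSON body, or raise ValueError."""
    if not isinstance(body, dict):
        raise ValueError("request body must be a JSON object")

    text = body.get("text")
    if not isinstance(text, str) or not text.strip():
        raise ValueError("'text' must be a non-empty string")

    text = text.strip()
    if len(text) > MAX_TEXT_LENGTH:
        raise ValueError("'text' is longer than %d characters" % MAX_TEXT_LENGTH)
    return text
```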

We can now implement each of the functions required to create a recording.

Synthesizing Speech

Synthesizing speech requires an API call to Polly with the text to synthesize and the voice to speak with. We save the result to the Lambda function’s temporary file system.

def synthesize_speech(record_id, text, voice):
    """
    Synthesizes the text, writing the result to Lambda's temp filesystem.
    """
    response = polly.synthesize_speech(
        OutputFormat='mp3',
        Text=text,
        VoiceId=voice
    )

    output = os.path.join("/tmp/", record_id)

    if "AudioStream" in response:
        with closing(response["AudioStream"]) as stream:
            # The audio stream is binary, so write it in binary mode
            with open(output, "wb") as file:
                file.write(stream.read())
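
The file-writing step can be tried locally without calling Polly. In the sketch below, io.BytesIO stands in for the AudioStream that Polly returns, and write_audio_stream is a hypothetical helper name. The file is opened in binary mode because the stream yields bytes rather than text.

```python
import io
import os
import tempfile
from contextlib import closing


def write_audio_stream(stream, output_path):
    """Copy a binary stream to output_path; return the number of bytes written."""
    with closing(stream) as source:
        with open(output_path, "wb") as target:
            return target.write(source.read())


# Stand-in for response["AudioStream"]; real Polly output would be MP3 bytes.
payload = b"ID3 not real audio"
path = os.path.join(tempfile.gettempdir(), "example-recording")
written = write_audio_stream(io.BytesIO(payload), path)
```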

Uploading to S3

We can now upload the result from the temporary file system to S3. After uploading, we set the file to be publicly readable so we can retrieve it later through a web interface.

def upload_to_s3(record_id, s3_bucket):
    """
    Upload the tmp file to S3.

    Returns the S3 URL of the uploaded result.
    """
    s3.upload_file('/tmp/' + record_id,
                   s3_bucket,
                   record_id + ".mp3")

    s3.put_object_acl(ACL='public-read',
                      Bucket=s3_bucket,
                      Key=record_id + ".mp3")

    location = s3.get_bucket_location(Bucket=s3_bucket)
    region = location['LocationConstraint']

    if region is None:
        url_beginning = "https://s3.amazonaws.com/"
    else:
        url_beginning = "https://s3-" + str(region) + ".amazonaws.com/"

    url = url_beginning + s3_bucket + "/" + record_id + ".mp3"

    return url
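
The URL construction can also be pulled into a small pure function, which makes it easy to test without touching S3. The sketch below is an alternative to the code above rather than the original implementation: it builds a virtual-hosted-style URL (bucket name in the hostname), treats a missing LocationConstraint as us-east-1, and percent-encodes the key. The name build_s3_url is hypothetical, and the sketch assumes the bucket name contains no dots.

```python
from urllib.parse import quote


def build_s3_url(bucket, key, region=None):
    """Return a virtual-hosted-style HTTPS URL for an object in S3."""
    # get_bucket_location reports None for buckets in us-east-1.
    region = region or "us-east-1"
    return "https://{bucket}.s3.{region}.amazonaws.com/{key}".format(
        bucket=bucket, region=region, key=quote(key)
    )
```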

Indexing with Dynamo

Lastly, we can index the request and the S3 URL in DynamoDB for later retrieval. We set the expires attribute to two days in the future so that DynamoDB’s time-to-live feature will expire old recordings.

def index_in_dynamodb(record_id, text, voice, url, table_name):
    """
    Index the record in DynamoDB.

    Returns the Item.
    """
    table = dynamodb.Table(table_name)

    # Set the expiration for two days from now
    posix_day = 86400
    expire_time = int(time.time()) + 2 * posix_day

    item = {
        'id': record_id,
        'text': text,
        'voice': voice,
        'url': url,
        'created': datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
        'expires': expire_time,
    }

    table.put_item(Item=item)
    return item

Putting this all together, we get the following Chalice file for creating a French recording of input text.

import os
import uuid
import random
import time
import datetime
from contextlib import closing

import boto3
from boto3.dynamodb.conditions import Key

from chalice import Chalice

app = Chalice(app_name='recorder')
app.debug = True

# Replace these placeholders with the names passed to CloudFormation
DYNAMO_DB_TABLE = '<your-table-name>'
S3_BUCKET = '<your-bucket-name>'

VOICES = ['Celine', 'Mathieu', 'Chantal']


dynamodb = boto3.resource('dynamodb')
polly = boto3.client('polly')
s3 = boto3.client('s3')


def synthesize_speech(record_id, text, voice):
    """
    Synthesizes the text, writing the result to Lambda's temp filesystem.
    """
    response = polly.synthesize_speech(
        OutputFormat='mp3',
        Text=text,
        VoiceId=voice
    )

    output = os.path.join("/tmp/", record_id)

    if "AudioStream" in response:
        with closing(response["AudioStream"]) as stream:
            # The audio stream is binary, so write it in binary mode
            with open(output, "wb") as file:
                file.write(stream.read())


def upload_to_s3(record_id, s3_bucket):
    """
    Upload the tmp file to S3.

    Returns the S3 URL of the uploaded result.
    """
    s3.upload_file('/tmp/' + record_id,
                   s3_bucket,
                   record_id + ".mp3")

    s3.put_object_acl(ACL='public-read',
                      Bucket=s3_bucket,
                      Key=record_id + ".mp3")

    location = s3.get_bucket_location(Bucket=s3_bucket)
    region = location['LocationConstraint']

    if region is None:
        url_beginning = "https://s3.amazonaws.com/"
    else:
        url_beginning = "https://s3-" + str(region) + ".amazonaws.com/"

    url = url_beginning + s3_bucket + "/" + record_id + ".mp3"

    return url


def delete_from_s3(record_id, s3_bucket):
    """
    Delete a file from S3.
    """
    # s3 is a boto3 client, so delete the object by bucket and key
    s3.delete_object(Bucket=s3_bucket, Key=record_id + ".mp3")


def index_in_dynamodb(record_id, text, voice, url, table_name):
    """
    Index the record in DynamoDB.

    Returns the Item.
    """
    table = dynamodb.Table(table_name)
    posix_day = 86400
    expire_time = int(time.time()) + 2 * posix_day

    item = {
        'id': record_id,
        'text': text,
        'voice': voice,
        'url': url,
        'created': datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
        'expires': expire_time,
    }

    table.put_item(Item=item)
    return item


@app.route('/recordings',
           methods=['POST'],
           cors=True)
def create_recording():
    """
    Create a new recording.
    """
    body = app.current_request.json_body
    record_id = str(uuid.uuid4())
    text = body.get("text")
    voice = random.choice(VOICES)

    synthesize_speech(record_id, text, voice)

    url = upload_to_s3(record_id, S3_BUCKET)
    item = index_in_dynamodb(record_id, text, voice, url, DYNAMO_DB_TABLE)
    return [item]

We can go ahead and deploy our application:

$ chalice deploy

Update IAM Policies

The Lambda function deployed by Chalice will need to have access to the S3 bucket, the DynamoDB table, and to Amazon Polly. Set the policy on the Lambda execution role created by Chalice to include this access.
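
As a starting point, the policy might look something like the sketch below. The action list is inferred from the calls made in app.py (Polly synthesis, S3 upload and ACL change, bucket location lookup, and DynamoDB put, query and scan), so treat it as an assumption to verify rather than a tested minimum. The region and account ID are placeholders, and build_policy is a hypothetical helper. These statements would sit alongside whatever logging permissions the execution role already has.

```python
import json


def build_policy(bucket, table, region, account_id):
    """Return an IAM policy document covering the calls app.py makes."""
    bucket_arn = "arn:aws:s3:::" + bucket
    table_arn = "arn:aws:dynamodb:{}:{}:table/{}".format(region, account_id, table)
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow",
             "Action": ["polly:SynthesizeSpeech"],
             "Resource": "*"},
            {"Effect": "Allow",
             "Action": ["s3:GetBucketLocation"],
             "Resource": bucket_arn},
            {"Effect": "Allow",
             "Action": ["s3:PutObject", "s3:PutObjectAcl"],
             "Resource": bucket_arn + "/*"},
            {"Effect": "Allow",
             "Action": ["dynamodb:PutItem", "dynamodb:Query", "dynamodb:Scan"],
             "Resource": table_arn},
        ],
    }


# Placeholder names and account ID, for illustration only.
policy = build_policy("my-bucket", "my-table", "us-east-1", "123456789012")
print(json.dumps(policy, indent=2))
```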

Testing Record Creation

You can test the deployed recording function with httpie by calling your endpoint. Substitute your API Gateway URL for <endpoint>:

$ http https://<endpoint>/dev/recordings 'text=Bonjour'
HTTP/1.1 200 OK
Access-Control-Allow-Headers: Authorization,Content-Type,X-Amz-Date,X-Amz-Security-Token,X-Api-Key
Access-Control-Allow-Origin: *
Connection: keep-alive
Content-Length: 247
Content-Type: application/json
Date: Fri, 28 Jul 2017 18:05:53 GMT
Via: 1.1 d9adada028fe3a04aed64f9ed9d80dd2.cloudfront.net (CloudFront)
X-Amz-Cf-Id: dN7O52phUUew64CLNMKNTkBFZNqmzuMVz6y0eCQV0dDGXDxh8wrvKw==
X-Amzn-Trace-Id: sampled=0;root=1-597b7d01-1fd4e99f39f639de71e5d034
X-Cache: Miss from cloudfront
x-amzn-RequestId: 651bfc48-73bf-11e7-95c2-a1560fa94d06

[
    {
        "created": "2017-07-28 18:05:53",
        "expires": 1501437953,
        "id": "3b0b048c-d0dc-449b-b385-7793a641e44c",
        "text": "Bonjour",
        "url": "https://s3.amazonaws.com/<s3-bucket>/3b0b048c-d0dc-449b-b385-7793a641e44c.mp3",
        "voice": "Celine"
    }
]
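
The same request can be built from Python using only the standard library. In this sketch the base URL and API key are placeholders, build_create_request is a hypothetical helper, and the x-api-key header is only needed if the route is deployed with api_key_required=True.

```python
import json
import urllib.request


def build_create_request(base_url, text, api_key=None):
    """Build (but do not send) a POST request for the /recordings endpoint."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["x-api-key"] = api_key
    return urllib.request.Request(
        base_url.rstrip("/") + "/recordings",
        data=json.dumps({"text": text}).encode("utf-8"),
        headers=headers,
        method="POST",
    )


# Placeholder URL; substitute your own API Gateway endpoint.
request = build_create_request("https://example.invalid/dev", "Bonjour")

# To actually send it (requires a deployed endpoint):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```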

Get Endpoint

The get endpoint is fairly straightforward. We use Chalice’s URL parameter functionality to specify a URL parameter called record_id. We use that identifier to fetch the corresponding entry from DynamoDB and return it to the user. For convenience, a record_id of * returns all entries from the table.

@app.route('/recordings/{record_id}',
           cors=True)
def get_recording(record_id):
    """
    Get existing recordings.
    """
    # Reuse the module-level DynamoDB resource
    table = dynamodb.Table(DYNAMO_DB_TABLE)

    if record_id == "*":
        # List all recordings
        items = table.scan()
    else:
        # Look up a single recording by its primary key
        items = table.query(
            KeyConditionExpression=Key('id').eq(record_id)
        )

    return items["Items"]
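
One thing to keep in mind with the scan: DynamoDB removes items whose TTL has passed in the background rather than at the exact expiry time, so a scan can still return recordings that have already expired. A small filter, sketched below with a hypothetical drop_expired helper, keeps those out of the response. It assumes each item carries the numeric expires attribute written by index_in_dynamodb.

```python
import time


def drop_expired(items, now=None):
    """Return only the items whose 'expires' timestamp is still in the future."""
    now = time.time() if now is None else now
    # Items without an 'expires' attribute are kept, since TTL ignores them too.
    return [
        item for item in items
        if "expires" not in item or item["expires"] > now
    ]
```

In the view, this would wrap the list before returning it, for example drop_expired(items["Items"]).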

Testing Recording Retrieval

We can use our get endpoint to retrieve our recordings.

$ chalice deploy
...

$ http https://<endpoint>/dev/recordings/*
HTTP/1.1 200 OK
Access-Control-Allow-Headers: Authorization,Content-Type,X-Amz-Date,X-Amz-Security-Token,X-Api-Key
Access-Control-Allow-Origin: *
Connection: keep-alive
Content-Length: 18144
Content-Type: application/json
Date: Fri, 28 Jul 2017 18:05:00 GMT
Via: 1.1 86335fa0218c5bd3b89dc26ce10431df.cloudfront.net (CloudFront)
X-Amz-Cf-Id: xdXPN4jzxrkwtfAC00ZuFpnNHZ3U3ssTnCzubflcT_rkwTzowY8Fyg==
X-Amzn-Trace-Id: sampled=0;root=1-597b7ccc-13b475d72462df7b4a83012d
X-Cache: Miss from cloudfront
x-amzn-RequestId: 458b63aa-73bf-11e7-81ae-e3cf445a26c6

[
    {
        "created": "2017-07-27 21:27:55",
        "expires": 1508966875.0,
        "id": "1fb0e41b-8b0a-4c78-967d-fa72f56348c1",
        "text": "test",
        "url": "https://s3.amazonaws.com/<s3-bucket>/1fb0e41b-8b0a-4c78-967d-fa72f56348c1.mp3",
        "voice": "Mathieu"
    },
    {
        "created": "2017-07-27 21:43:57",
        "expires": 1508967837.0,
        "id": "7daded04-ea70-4244-899b-f862abd6318b",
        "text": "test",
        "url": "https://s3.amazonaws.com/<s3-bucket>/7daded04-ea70-4244-899b-f862abd6318b.mp3",
        "voice": "Chantal"
    }
]

User Interface

To help make generating example sentences a little easier, I created a simple user interface that accepts text and calls the API endpoints to record that text as an mp3 file in S3.

You can find the full source code for the interface on GitHub.

Summary

When starting this application I was skeptical that Polly would provide a natural expression of example sentences. Thankfully, I was pleasantly surprised by the quality of the sentences. With the serverless application, I am now able to quickly create recordings of any French word or phrase to aid in language learning. Combined with Anki for spaced repetition, it has become a valuable resource for learning and recalling verb conjugations.
