Writing an Apache Beam Batch Sink

This article describes how you can use the Dataflow/Beam SDK to write files to an S3 bucket by implementing a Sink. A Sink has three phases: initialization, writing, and finalization. The initialization phase is a sequential process where you can create necessary preconditions such as output directories. The write phase lets workers write bundles of records to the Sink. The finalization phase allows for cleanup like merging files or committing writes. [Read More]

Getting to Know Cloud Dataflow

Cloud Dataflow is Google’s managed service for batch and stream data processing. Dataflow provides a programming model and execution framework that allows you to run the same code in batch or streaming mode, with guarantees on correctness and primitives for correcting timing issues. Why should you care about Dataflow? A few reasons. First, Dataflow is the only stream processing framework that has strong consistency guarantees for time series data. Second, Dataflow integrates well with the Google Cloud Platform and provides seamless methods for reading from and writing to the Datastore, PubSub, BigQuery and Cloud Storage. [Read More]

Counting N-Grams with Cloud Dataflow

Counting n-grams is a common pre-processing step for computing sentence and word probabilities over a corpus. Thankfully, this task is embarrassingly parallel and is a natural fit for distributed processing frameworks like Cloud Dataflow. This article provides an implementation of n-gram counting using Cloud Dataflow that is able to efficiently compute n-grams in parallel over massive datasets. The Algorithm Cloud Dataflow uses a programming abstraction called PCollections which are collections of data that can be operated on in parallel (Parallel Collections). [Read More]

Create a Google Cloud Dataflow Project with Gradle

I’ve been experimenting with the Google Cloud Dataflow Java SDK for running managed data processing pipelines. One of the first tasks is getting a build environment up and running. For this I chose Gradle. We start by declaring this a java application and listing the configuration variables that declare the source compatibility level (which for now must be 1.7) and the main class to be executed by the run task to be defined later. [Read More]