Writing an Apache Beam Batch Sink

This article describes how you can use the Dataflow/Beam SDK to write files to an S3 bucket by implementing a Sink. A Sink has three phases: initialization, writing, and finalization. The initialization phase is a sequential process where you can create necessary preconditions such as output directories. The write phase lets workers write bundles of records to the Sink. The finalization phase allows for cleanup like merging files or committing writes. ...
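To make the three phases concrete, here is a skeletal sketch of a custom text sink. The class names (S3TextSink and friends) and the bucket layout are hypothetical, and the Sink/WriteOperation/Writer signatures follow the Dataflow 1.x Sink API, so they may differ slightly between SDK versions:

```java
// Sketch only: a skeletal Sink showing the three phases described above.
// Class names and the S3 details are placeholders; signatures follow the
// Dataflow 1.x SDK and may vary between SDK versions.
import com.google.cloud.dataflow.sdk.coders.Coder;
import com.google.cloud.dataflow.sdk.coders.StringUtf8Coder;
import com.google.cloud.dataflow.sdk.io.Sink;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;

public class S3TextSink extends Sink<String> {

  @Override
  public void validate(PipelineOptions options) {
    // Check that the destination bucket is reachable before the pipeline runs.
  }

  @Override
  public WriteOperation<String, String> createWriteOperation(PipelineOptions options) {
    return new S3WriteOperation(this);
  }

  static class S3WriteOperation extends Sink.WriteOperation<String, String> {
    private final S3TextSink sink;

    S3WriteOperation(S3TextSink sink) { this.sink = sink; }

    @Override
    public void initialize(PipelineOptions options) {
      // Initialization phase: runs once, e.g. create a temporary output prefix.
    }

    @Override
    public Writer<String, String> createWriter(PipelineOptions options) {
      return new S3Writer(this);  // one Writer per bundle of records
    }

    @Override
    public void finalize(Iterable<String> writerResults, PipelineOptions options) {
      // Finalization phase: merge or commit the per-bundle results.
    }

    @Override
    public Sink<String> getSink() { return sink; }

    @Override
    public Coder<String> getWriterResultCoder() { return StringUtf8Coder.of(); }
  }

  static class S3Writer extends Sink.Writer<String, String> {
    private final S3WriteOperation op;
    private String bundleKey;

    S3Writer(S3WriteOperation op) { this.op = op; }

    @Override
    public void open(String uId) {
      bundleKey = "tmp/" + uId;  // Write phase: open a per-bundle object.
    }

    @Override
    public void write(String value) {
      // Append the record to the per-bundle object (S3 upload elided).
    }

    @Override
    public String close() {
      return bundleKey;  // the result returned here is handed to finalize()
    }

    @Override
    public WriteOperation<String, String> getWriteOperation() { return op; }
  }
}
```

A custom sink along these lines would then be attached to a pipeline with the SDK's Write transform, e.g. Write.to(new S3TextSink()).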

February 11, 2016 · 10 min · Kevin Sookocheff

Getting to Know Cloud Dataflow

Cloud Dataflow is Google’s managed service for batch and stream data processing. Dataflow provides a programming model and execution framework that lets you run the same code in batch or streaming mode, with guarantees on correctness and primitives for correcting timing issues. Why should you care about Dataflow? A few reasons. First, Dataflow is the only stream processing framework with strong consistency guarantees for time series data. Second, Dataflow integrates well with the Google Cloud Platform and provides seamless methods for reading from and writing to Datastore, PubSub, BigQuery, and Cloud Storage. Third, the Dataflow SDK is open source and has received contributions for interfacing with Hadoop, Firebase, and Salesforce, so AWS integration is absolutely possible. Lastly, Dataflow is completely managed, whereas competing offerings such as Spark and Flink typically run on top of a Hadoop installation used for intermediate storage. ...

January 4, 2016 · 8 min · Kevin Sookocheff

Counting N-Grams with Cloud Dataflow

Counting n-grams is a common pre-processing step for computing sentence and word probabilities over a corpus. Thankfully, this task is embarrassingly parallel and a natural fit for distributed processing frameworks like Cloud Dataflow. This article provides an implementation of n-gram counting using Cloud Dataflow that can efficiently compute n-grams in parallel over massive datasets. Cloud Dataflow uses a programming abstraction called PCollections (parallel collections), which are collections of data that can be operated on in parallel. When programming for Cloud Dataflow, you treat each operation as a transformation of a parallel collection that returns another parallel collection for further processing. This style of development is similar to the traditional Unix philosophy of piping the output of one command into another for further processing. ...
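As a flavour of this transformation-chaining style, the following sketch counts bigrams by piping a PCollection of input lines through a chain of transforms with the Dataflow 1.x Java SDK. The bucket paths and class name are placeholders, not the article's actual implementation:

```java
// Sketch only: counting bigrams by chaining PCollection transforms.
// Paths and the class name are placeholders.
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Count;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.KV;

public class BigramCount {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(TextIO.Read.from("gs://my-bucket/corpus/*.txt"))     // PCollection<String> of lines
     .apply(ParDo.of(new DoFn<String, String>() {                // emit one element per bigram
       @Override
       public void processElement(ProcessContext c) {
         String[] words = c.element().toLowerCase().split("\\s+");
         for (int i = 0; i + 1 < words.length; i++) {
           c.output(words[i] + " " + words[i + 1]);
         }
       }
     }))
     .apply(Count.<String>perElement())                          // PCollection<KV<String, Long>>
     .apply(ParDo.of(new DoFn<KV<String, Long>, String>() {      // format results for text output
       @Override
       public void processElement(ProcessContext c) {
         c.output(c.element().getKey() + "\t" + c.element().getValue());
       }
     }))
     .apply(TextIO.Write.to("gs://my-bucket/output/bigram-counts"));

    p.run();
  }
}
```

Each apply call takes a parallel collection and yields another, exactly like piping one command into the next.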

August 5, 2015 · 7 min · Kevin Sookocheff

Create a Google Cloud Dataflow Project with Gradle

I’ve been experimenting with the Google Cloud Dataflow Java SDK for running managed data processing pipelines. One of the first tasks is getting a build environment up and running. For this I chose Gradle. We start by declaring this a Java application and setting the configuration variables for the source compatibility level (which for now must be 1.7) and the main class to be executed by the run task (defined later). ...
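A minimal build.gradle along these lines might look like the following sketch; the main class name is a placeholder, and the SDK version should be pinned to whichever release you are targeting:

```groovy
// Sketch only: a minimal build.gradle for a Dataflow pipeline project.
// The main class and SDK version below are placeholders.
apply plugin: 'java'
apply plugin: 'application'

// The Dataflow Java SDK currently requires Java 7 source compatibility.
sourceCompatibility = 1.7
targetCompatibility = 1.7

// Class executed by the `run` task.
mainClassName = 'com.example.dataflow.StarterPipeline'

repositories {
    mavenCentral()
}

dependencies {
    // Substitute the SDK release you are targeting for the version wildcard.
    compile 'com.google.cloud.dataflow:google-cloud-dataflow-java-sdk-all:+'
}
```

With this in place, `gradle run` compiles the project and executes the pipeline class named by mainClassName.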

February 11, 2015 · 2 min · Kevin Sookocheff