I’ve been experimenting with the Google Cloud Dataflow Java SDK for running managed data processing pipelines. One of the first tasks is getting a build environment up and running. For this I chose Gradle.

We start by declaring this a Java application and setting two configuration variables: the source compatibility level (which for now must be 1.7) and the main class that the run task, configured later, will execute.

apply plugin: 'java'
apply plugin: 'application'

sourceCompatibility = '1.7'

mainClassName = 'com.sookocheff.dataflow.Main'

We then declare the mavenCentral repository where the dependencies are located and the basic dependencies for a Cloud Dataflow application.

repositories {
    mavenCentral()
}

dependencies {
    compile 'com.google.guava:guava:18.0'
    compile 'com.google.cloud.dataflow:google-cloud-dataflow-java-sdk-all:0.3.150109'

    testCompile 'junit:junit:4.11'
}

Last, we configure the run task that will launch the Cloud Dataflow application. The Cloud Dataflow runtime expects the folder resources/main to exist in your build. If you are not actually shipping any resources with your application, you will need to tell Gradle to create that directory. We also pass any parameters to our main class through a project property named args, set on the command line with the -P flag. These two steps are encapsulated below.

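task resources {
    // Create the directory the Cloud Dataflow runtime expects.
    def resourcesDir = new File('build/resources/main')
    resourcesDir.mkdirs()
}

run {
    // Forward -Pargs="..." to the main class, split on whitespace.
    if (project.hasProperty('args')) {
        args project.args.split('\\s')
    }
}

run.mustRunAfter 'resources'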

You should now be able to launch your Cloud Dataflow application using the gradle run task, passing your project identifiers as parameters. For example,

gradle run -Pargs="--project=<your-project> --runner=BlockingDataflowPipelineRunner --stagingLocation=gs://<staging-location>"
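To see what the main class actually receives, it can help to reproduce the splitting step on its own. The build script calls split('\\s') on the args property, which in Groovy is the ordinary java.lang.String.split, so the plain Java sketch below should tokenize the string the same way. The class name ArgsSplitDemo and the my-project and my-bucket values are hypothetical placeholders, not part of the original build.

```java
import java.util.Arrays;

public class ArgsSplitDemo {
    // Mirrors the build script's project.args.split('\\s'):
    // the string is cut at every single whitespace character.
    public static String[] splitArgs(String raw) {
        return raw.split("\\s");
    }

    public static void main(String[] unused) {
        // Hypothetical values standing in for <your-project> and <staging-location>.
        String raw = "--project=my-project "
                + "--runner=BlockingDataflowPipelineRunner "
                + "--stagingLocation=gs://my-bucket/staging";

        // Each flag becomes one element of the array passed to the main class.
        for (String arg : splitArgs(raw)) {
            System.out.println(arg);
        }

        // Two consecutive spaces produce an empty argument between the flags.
        System.out.println(Arrays.toString(splitArgs("--a=1  --b=2")));
    }
}
```

Two consequences follow from splitting on single whitespace characters: a doubled space in the -Pargs string yields an empty argument, and a value that itself contains a space is broken into two arguments. Separating the flags with exactly one space avoids both.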