Kevin Sookocheff

Generating Java with JCodeModel

Have you come across the misfortune of needing to auto-generate Java source code? Luckily, anything you’ve wanted to do with Java has already been done — and auto-generating Java is no different. I recently used JCodeModel to translate from JSON to Java — it worked great but it lacks any tutorial-style documentation. This article means to fill that gap. If you feel the need to go more in-depth, consult the the Javadoc. ...

Writing an Apache Beam Batch Sink

This article describes how you can use the Dataflow/Beam SDK to write files to an S3 bucket by implementing a Sink. A Sink has three phases: initialization, writing, and finalization. The initialization phase is a sequential process where you can create necessary preconditions such as output directories. The write phase lets workers write bundles of records to the Sink. The finalization phase allows for cleanup like merging files or committing writes. ...

Deploying a Druid Cluster with Ansible

During my continued education on Ansible I’ve writing some roles for deploying a Druid cluster to AWS similar to the article on deploying Zookeeper with Ansible. The methods are fairly simple so rather than going through a detailed explanation I will just leave a link to the full source Github. Any and all contributions are welcome!

Paper Review: What Goes Around Comes Around

Title and Author of Paper What Goes Around Comes Around. Joseph M. Hellerstein and Michael Stonebraker. Summary What Goes Around Comes Around summarizes several methods for modelling data within a database system. Each data model is described and the benefits and drawbacks listed as lessons learned from research into that model. The authors clearly present their opinions on each model and help readers unfamiliar with past modelling attempts understand the history of this area of research. ...

Deploying Zookeeper with Exhibitor to AWS using Ansible

This article provides a detailed guide of deploying Zookeeper to AWS using Exhibitor for cluster management. Exhibitor is a great help for managing your cluster but getting things up and running is not well documented. Hopefully this article corrects that deficiency. ...

Getting to Know Cloud Dataflow

Cloud Dataflow is Google’s managed service for batch and stream data processing. Dataflow provides a programming model and execution framework that allows you to run the same code in batch or streaming mode, with guarantees on correctness and primitives for correcting timing issues. Why should you care about Dataflow? A few reasons. First, Dataflow is the only stream processing framework that has strong consistency guarantees for time series data. Second, Dataflow integrates well with the Google Cloud Platform and provides seamless methods for reading from and writing to the Datastore, PubSub, BigQuery and Cloud Storage. Third, the Dataflow SDK is open source and has received contributions for interfacing with Hadoop, Firebase, and Salesforce — AWS integration is absolutely possible. Lastly, Dataflow is completely managed, whereas competing offerings such as Spark and Flink typically run on top of a Hadoop installation used for intermediate storage. ...

Docker Step By Step: Containerizing Zookeeper

Follow along with this article as we take a guided tour of containerizing Zookeeper using Docker. This guide will show how to install Zookeeper to the container, how to configure the Zookeeper application, and how to share data volumes between the host and container. ...

Why Java? Tales from a Python Convert

Whenever I tell people I’ve been working with Java I get the same reaction: “Yuck! Java? Why Java?” And, admittedly, I had the same reaction — at first. But over time, I’ve come to appreciate Java for its type safety, performance, and rock-solid tooling. I’ve also come to notice that this isn’t the Java I was used to — it’s been steadily improving over the last ten years. ...

Including a local package as a Maven dependency

Lately I’ve been tasked with developing a Java library for internal use. For testing, its proved useful to package the library for local use. This article describes how to add a Jar file to a local Maven repository for use in your own testing and development. Create your local Maven repository Your local Maven repository lives within the project you are developing for. Creating your local repository is as simple as making a new directory. ...

Configuring an Upstream Remote

This is something I often do but rarely remember the steps for. This post is intended to serve as a reminder for me and anyone else having the same question: how to add an upstream remote git repository. Start by forking the repository you are contributing to and cloning that repository to your local file system. In this example, we will use the Elasticsearch repository and assume you have cloned it locally. ...