A Maven runner for vim-test

If you haven’t tried vim-test yet, please do. It provides a consistent interface for running unit tests from within vim for a variety of languages and test runners. It also has support for different methods of dispatching your tests from vim to a terminal to retrieve the results. vim-test works out of the box for a number of languages and is easily extensible to support additional languages. In fact, I recently contributed a Maven test runner for the Java language. If you are developing a Maven project, you can now run your tests from within vim using vim-test. ...

October 16, 2015 · 1 min · Kevin Sookocheff

Deploying Kafka to Google Compute Engine

This article provides a startup script for deploying Kafka to a Google Compute Engine instance. This isn’t meant to be a production-ready system: it uses the ZooKeeper instance embedded with Kafka and keeps most of the default settings. Instead, treat this as a quick and easy way to do Kafka development using a live server. This article uses Compute Engine startup scripts to install and run Kafka on instance startup. Startup scripts allow you to run arbitrary Bash commands whenever an instance is created or restarted. Since this script is run on every restart, we lead with a check that makes sure we have not already run the startup script and, if we have, we simply exit. ...

October 12, 2015 · 4 min · Kevin Sookocheff

Kafka Quick Start Guide

If you’ve read the previous article describing Kafka in a Nutshell, you may be itching to write an application using Kafka as a data backend. This article will get you part of the way there by describing how to deploy Kafka locally using Docker and test it using kafkacat. Running Kafka Locally First, if you haven’t already, download and install Docker. Once you have Docker installed, create a default virtual machine that will host your local Docker containers. ...

September 30, 2015 · 3 min · Kevin Sookocheff

Kafka in a Nutshell

Kafka is a messaging system. That’s it. So why all the hype? In reality, messaging is a hugely important piece of infrastructure for moving data between systems. To see why, let’s look at a data pipeline without a messaging system. This system starts with Hadoop for storage and data processing. Hadoop isn’t very useful without data, so the first stage in using Hadoop is getting data in. Bringing Data in to Hadoop So far, not a big deal. Unfortunately, in the real world data exists on many systems in parallel, all of which need to interact with Hadoop and with each other. The situation quickly becomes more complex, ending with a system where multiple data systems are talking to one another over many channels. Each of these channels requires its own custom protocols and communication methods, and moving data between these systems becomes a full-time job for a team of developers. ...

September 25, 2015 · 11 min · Kevin Sookocheff

Beginning Docker

I’m writing this article as a means of tracking commonly used Docker commands in a place where I won’t forget them. If you find it useful or have additional suggestions, let me know in the comments. ...

September 14, 2015 · 3 min · Kevin Sookocheff

A Review of the Coursera Data Science Specialization

I recently completed the 10th and final course in the Data Science Specialization offered by Coursera in conjunction with Johns Hopkins University. My background is as a computer scientist and programmer looking to learn more about statistical analysis and machine learning. I have always had an interest in data analysis and machine learning but never actually studied them. I used the Data Science Specialization as a starting point to learn more about the field and become familiar with typical problems and solutions that data scientists encounter. This article describes my experience with the specialization and answers the question of whether or not it is worth the time. ...

September 10, 2015 · 10 min · Kevin Sookocheff

Counting N-Grams with Cloud Dataflow

Counting n-grams is a common pre-processing step for computing sentence and word probabilities over a corpus. Thankfully, this task is embarrassingly parallel and is a natural fit for distributed processing frameworks like Cloud Dataflow. This article provides an implementation of n-gram counting using Cloud Dataflow that is able to efficiently compute n-grams in parallel over massive datasets. The Algorithm Cloud Dataflow uses a programming abstraction called PCollections, which are collections of data that can be operated on in parallel (Parallel Collections). When programming for Cloud Dataflow, you treat each operation as a transformation of a parallel collection that returns another parallel collection for further processing. This style of development is similar to the traditional Unix philosophy of piping the output of one command to another for further processing. ...
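As a rough illustration of the "chain of transformations" idea, here is a minimal sketch in plain Python. It does not use the Cloud Dataflow SDK and runs on a single machine; ordinary generators stand in for parallel collections. The function names (`tokenize`, `extract_ngrams`, `count_ngrams`) are hypothetical and chosen only for this example.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def tokenize(lines: Iterable[str]) -> Iterator[List[str]]:
    """Transformation 1: turn each line into a list of lowercase tokens."""
    for line in lines:
        yield line.lower().split()


def extract_ngrams(token_lists: Iterable[List[str]], n: int) -> Iterator[Tuple[str, ...]]:
    """Transformation 2: emit every n-gram (as a tuple) found in each token list."""
    for tokens in token_lists:
        # Lines shorter than n produce an empty range and emit nothing.
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def count_ngrams(lines: Iterable[str], n: int) -> Counter:
    """Transformation 3: count occurrences, chaining the previous two steps."""
    return Counter(extract_ngrams(tokenize(lines), n))


if __name__ == "__main__":
    corpus = ["the cat sat", "the cat ran"]
    for ngram, count in count_ngrams(corpus, 2).most_common():
        print(" ".join(ngram), count)
```

Each function consumes one collection and returns another, much like piping the output of one Unix command into the next. In a distributed framework each stage could run in parallel across many workers, which this sketch makes no attempt to do.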

August 5, 2015 · 7 min · Kevin Sookocheff

N-gram Modeling With Markov Chains

A common method of reducing the complexity of n-gram modeling is using the Markov Property. The Markov Property states that the probability of future states depends only on the present state, not on the sequence of events that preceded it. This concept can be elegantly implemented using a Markov Chain storing the probabilities of transitioning to a next state. ...
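To make the idea concrete, here is a minimal sketch of a first-order Markov chain built from a token sequence, assuming each state is a single word and transition probabilities are estimated by simple relative frequency. The function name `build_chain` is hypothetical and used only for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def build_chain(tokens: List[str]) -> Dict[str, Dict[str, float]]:
    """Map each state to the probabilities of the states that follow it."""
    # Count how often each word is followed by each other word.
    counts = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        counts[current][nxt] += 1

    # Normalize the counts for each state so they sum to 1.
    chain = {}
    for state, followers in counts.items():
        total = sum(followers.values())
        chain[state] = {nxt: c / total for nxt, c in followers.items()}
    return chain


if __name__ == "__main__":
    chain = build_chain("a b a c a b".split())
    print(chain["b"])  # {'a': 1.0}
```

Because the next state depends only on the current one, the chain needs just one table row per state rather than one per full history, which is where the reduction in complexity comes from.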

July 31, 2015 · 5 min · Kevin Sookocheff

Modeling Natural Language with N-Gram Models

One of the most widely used methods for modeling natural language is n-gram modeling. This article explains what an n-gram model is, how it is computed, and what the probabilities of an n-gram model tell us. ...

July 25, 2015 · 4 min · Kevin Sookocheff

Structuring an Application using Model View Controller

Early pioneers in object-oriented programming paved the way toward using Model View Controller (MVC) for graphical user interfaces as early as the 1970s, and web applications have continued using the pattern to separate business logic from display. This article attempts to clarify the use of Model View Controller within web applications, giving consideration to the fact that most developers will be building their application using an existing web framework. ...

July 9, 2015 · 6 min · Kevin Sookocheff