Counting N-Grams with Cloud Dataflow

Counting n-grams is a common pre-processing step for computing sentence and word probabilities over a corpus. Thankfully, this task is embarrassingly parallel and is a natural fit for distributed processing frameworks like Cloud Dataflow. This article provides an implementation of n-gram counting using Cloud Dataflow that is able to efficiently compute n-grams in parallel over massive datasets. The Algorithm Cloud Dataflow uses a programming abstraction called PCollections which are collections of data that can be operated on in parallel (Parallel Collections). [Read More]

N-gram Modeling With Markov Chains

A common method of reducing the complexity of n-gram modeling is using the Markov Property. The Markov Property states that the probability of future states depends only on the present state, not on the sequence of events that preceded it. This concept can be elegantly implemented using a Markov Chain storing the probabilities of transitioning to a next state.

[Read More]

Using the Google Prediction API to Predict the Sentiment of a Tweet

The Google Prediction API offers the power of Google’s machine learning algorithms over a RESTful API interface. The machine learning algorithms themselves are a complete black box. As a user you upload the training data and, once it has been analyzed, start classifying new observations based on the analysis of the training data. I recently spent some time investigating how to use the API to determine the sentiment of a tweet. This article collects my thoughts on the experience and a few recommendations for future work.

[Read More]