Paper Review: MapReduce: Simplified Data Processing on Large Clusters

Title and Author of Paper

MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat.

Summary

MapReduce is designed to solve the problem of processing large sets of data on a fleet of commodity hardware. In such an environment it is assumed that you may have hundreds or thousands of machines and that, at any point in time, these machines may experience failures. The MapReduce framework hides the details of parallelizing your workflow, fault tolerance, distributing data to workers, and load balancing behind the two abstractions map and reduce.

[Read More]
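To make the two abstractions concrete, here is a minimal word-count sketch in plain Python. The function names and the single-process driver loop are purely illustrative; the real framework distributes this work across machines and handles the shuffling, scheduling, and fault tolerance itself.

```python
# Minimal word-count sketch of the map/reduce abstractions.
# The driver below is a toy stand-in for the framework, which in
# practice partitions input, schedules workers, and retries failures.
from collections import defaultdict

def map_fn(doc_name, text):
    # Emit one intermediate (key, value) pair per word.
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    # Combine all values emitted for a single intermediate key.
    yield word, sum(counts)

def run_job(documents):
    intermediate = defaultdict(list)
    for name, text in documents.items():
        for key, value in map_fn(name, text):
            intermediate[key].append(value)
    results = {}
    for key, values in sorted(intermediate.items()):
        for out_key, out_value in reduce_fn(key, values):
            results[out_key] = out_value
    return results

print(run_job({"doc1": "the quick brown fox", "doc2": "the lazy dog"}))
```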

App Engine MapReduce API - Part 7: Writing a Custom Output Writer

View all articles in the MapReduce API Series.

The MapReduce library ships with a number of default output writers, but you can also write your own by implementing the output writer interface. This article examines how to write a custom output writer that pushes data from the App Engine datastore to an Elasticsearch cluster. The same pattern can be followed to push the output of your MapReduce job to almost any destination.
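As a rough outline of the approach, the sketch below subclasses the library's OutputWriter and posts each mapper result to Elasticsearch's bulk endpoint. The ES_ENDPOINT value is a placeholder, and the exact OutputWriter method signatures vary between versions of the mapreduce library, so treat this as the shape of a solution rather than a drop-in implementation.

```python
# Sketch of a custom output writer that indexes mapper output into
# Elasticsearch. ES_ENDPOINT and the _bulk payload format are
# illustrative assumptions; exact OutputWriter signatures vary
# between versions of the mapreduce library.
import json
import urllib2

from mapreduce import output_writers

ES_ENDPOINT = "http://example-es-cluster:9200/myindex/_bulk"  # placeholder

class ElasticsearchOutputWriter(output_writers.OutputWriter):

    @classmethod
    def validate(cls, mapper_spec):
        pass  # validate any output_writer params from the job config here

    @classmethod
    def create(cls, *args, **kwargs):
        # Signature differs across library versions, hence *args.
        return cls()

    @classmethod
    def from_json(cls, state):
        return cls()

    def to_json(self):
        return {}  # no per-shard state to checkpoint in this sketch

    def write(self, data):
        # data is whatever the map function yields; here we assume it
        # is already a serialized JSON document string.
        action = json.dumps({"index": {}})
        body = "%s\n%s\n" % (action, data)
        urllib2.urlopen(urllib2.Request(ES_ENDPOINT, body)).read()

    def finalize(self, ctx, shard_state):
        pass  # nothing to flush; a real writer would batch writes
```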

[Read More]

App Engine MapReduce API - Part 6: Writing a Custom Input Reader

View all articles in the MapReduce API Series.

One of the great things about the MapReduce library is the ability to write a custom InputReader to process data from any data source. In this post we will explore how to write an InputReader that leases tasks from an App Engine pull queue by implementing the InputReader interface.
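As a preview, a minimal sketch of the idea might look like the following. The queue name, lease duration, and batch size are illustrative, and the InputReader method signatures differ slightly across library versions.

```python
# Sketch of an InputReader that leases tasks from a pull queue.
# The queue name and batch parameters are placeholders.
from google.appengine.api import taskqueue
from mapreduce import input_readers

class PullQueueInputReader(input_readers.InputReader):

    def __init__(self, queue_name):
        self._queue_name = queue_name

    def __iter__(self):
        queue = taskqueue.Queue(self._queue_name)
        while True:
            tasks = queue.lease_tasks(lease_seconds=60, max_tasks=100)
            if not tasks:
                return
            for task in tasks:
                yield task  # the map function receives one leased task
            queue.delete_tasks(tasks)

    @classmethod
    def split_input(cls, mapper_spec):
        # A pull queue has no natural byte ranges to split on, so a
        # single shard is the simplest possible scheme.
        return [cls(mapper_spec.params["queue_name"])]

    @classmethod
    def validate(cls, mapper_spec):
        if "queue_name" not in mapper_spec.params:
            raise input_readers.BadReaderParamsError("queue_name required")

    @classmethod
    def from_json(cls, state):
        return cls(state["queue_name"])

    def to_json(self):
        return {"queue_name": self._queue_name}
```

Leasing and only then deleting tasks keeps the queue as the source of truth: a task that is never deleted becomes visible again once its lease expires, so work lost to a failed shard is retried automatically.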

[Read More]

Bypassing ndb hooks with the RawDatastoreInputReader

When doing a MapReduce operation, there are times when you want to edit a set of entities without triggering the pre- or post-put hooks associated with those entities. On such occasions, using the raw datastore entity allows you to process the data without unwanted side effects. This article will show how to use the RawDatastoreInputReader to process datastore entities.
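As a taste of what this looks like, here is a minimal map function sketch. With the RawDatastoreInputReader the function receives a low-level datastore Entity, which behaves like a dictionary, so no model-level hooks ever run; the "status" property is a hypothetical example.

```python
# Sketch of a map function for use with the RawDatastoreInputReader.
# 'entity' is a low-level datastore.Entity (dict-like), not an ndb
# model instance, so pre/post put hooks are never invoked.
from mapreduce import operation as op

def touch_entity(entity):
    entity["status"] = "migrated"   # mutate the raw property directly
    yield op.db.Put(entity)         # batched put, still bypassing hooks
```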

[Read More]

App Engine MapReduce API - Part 3: Programmatic MapReduce using Pipelines

View all articles in the MapReduce API Series.

In the last article we examined how to run one-off tasks that operate on a large dataset using a mapreduce.yaml configuration file. This article will take us a step further and look at how to run a MapReduce job programmatically using the App Engine Pipeline API.
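For a flavor of what the programmatic route looks like, the sketch below starts a mapper-only job with the Pipeline API; the handler path, entity kind, and shard count are placeholder values for your own application.

```python
# Sketch of starting a mapper job programmatically instead of via
# mapreduce.yaml. The dotted paths are hypothetical module paths.
from mapreduce import mapper_pipeline

job = mapper_pipeline.MapperPipeline(
    "touch_entities",                                  # job name shown in the UI
    "main.touch_entity",                               # map function, by dotted path
    "mapreduce.input_readers.DatastoreInputReader",    # built-in input reader
    params={"entity_kind": "main.MyModel"},            # kind to iterate over
    shards=8)
job.start()  # enqueues the job; progress is visible in the pipeline UI
```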

[Read More]

App Engine MapReduce API - Part 2: Running a MapReduce Job Using mapreduce.yaml

View all articles in the MapReduce API Series.

Last time we looked at an overview of how MapReduce works. In this article we’ll be getting our hands dirty writing some code to handle the Map Stage. If you’ll recall, the Map Stage is composed of two separate components: an InputReader and a map function. We’ll look at each of these in turn.
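As a preview of the map half, here is a small sketch of a map function that a mapreduce.yaml entry could reference by its dotted path (e.g. main.process); the model property and counter name are hypothetical.

```python
# Sketch of the map half of the Map Stage: one call per input record
# handed over by the InputReader.
from mapreduce import operation as op

def process(entity):
    # Keep the function idempotent: records may be retried after a
    # shard failure, so running it twice must be safe.
    if not entity.processed:
        entity.processed = True
        yield op.db.Put(entity)
    yield op.counters.Increment("entities-seen")
```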

[Read More]