Title and Author of Paper
MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat.
Summary
MapReduce is designed to solve the problem of processing large sets of data on a fleet of commodity hardware. In such an environment it is assumed that you may have hundreds or thousands of machines and that, at any point in time, these machines may experience failures.
The MapReduce framework hides the details of parallelization, fault tolerance, data distribution, and load balancing behind two abstractions: map and reduce. The user of MapReduce is responsible for writing these map and reduce functions, while the MapReduce library is responsible for executing that program in a distributed environment.
Programming Model
The computation to be performed is expressed through two functions: map and reduce. The map function takes an input key/value pair and produces a set of intermediate key/value pairs, while the reduce function accepts an intermediate key and the set of values for that key and merges those values together to form a possibly smaller set of values. In effect, map and reduce mimic the primitives of the same names in Lisp and other functional languages, where map applies a function to each element of a list and reduce folds the elements of a list together into a single result. The difference with the MapReduce framework is that it can handle data sets too large to fit on a single machine.
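To make the model concrete, here is a minimal word-count sketch in Python, loosely following the example in the paper. The signatures are illustrative only; they do not match the library's actual C++ interface.

```python
# User-supplied functions for word count. The framework, not the user,
# is responsible for running these over a distributed data set.

def map_fn(key, value):
    # key: document name, value: document contents
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # key: a word, values: every count emitted for that word
    yield (key, sum(values))
```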
Execution
Given a MapReduce program with user-defined map and reduce functions, how is that program executed using the MapReduce framework?
To begin, the framework partitions the input data set into M pieces, or splits, that can be processed in parallel by different machines; the intermediate key space is likewise divided into R partitions, one per reduce task. After splitting the input data, multiple copies of the MapReduce program are started on a cluster of worker machines along with a single master machine. The master is responsible for assigning work to the workers: to start the computation, it hands one of the M map tasks to each idle worker.
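A highly simplified, single-process sketch of the bookkeeping the master might do is below. The Master and MapTask names are hypothetical; the real implementation is a distributed C++ system.

```python
# Hypothetical sketch: the master tracks M map tasks and hands them
# out to idle workers as they ask for work.

from collections import deque
from dataclasses import dataclass

@dataclass
class MapTask:
    task_id: int
    input_split: str   # e.g. a file name or byte range of an input file

class Master:
    def __init__(self, splits):
        self.idle = deque(MapTask(i, s) for i, s in enumerate(splits))
        self.in_progress = {}   # worker_id -> MapTask

    def assign(self, worker_id):
        # Hand the next idle map task to a worker requesting work.
        if not self.idle:
            return None
        task = self.idle.popleft()
        self.in_progress[worker_id] = task
        return task
```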
When a worker receives a map task, it reads the contents of the corresponding input split, parses the key-value pairs out of it, and produces intermediate key-value pairs by invoking the user-defined map function. The intermediate results are buffered in memory and periodically flushed to local disk, partitioned into R regions by hashing the intermediate key. The locations of these on-disk partitions are passed back to the master, which forwards them to the workers running reduce tasks.
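The map-side behavior described above might look roughly like the following sketch, which partitions intermediate pairs into R buckets by hashing the key and spills each bucket to a local file. All names and the on-disk format are illustrative assumptions.

```python
# Hypothetical sketch of a map task: apply the user's map function,
# bucket the intermediate pairs by hash(key) mod R, and flush each
# bucket to local disk. The file paths are reported back to the master.

import os

def run_map_task(task_id, input_key, input_value, map_fn, R, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    buckets = [[] for _ in range(R)]          # in-memory buffers
    for k, v in map_fn(input_key, input_value):
        buckets[hash(k) % R].append((k, v))   # choose reduce partition

    locations = []
    for r, pairs in enumerate(buckets):
        path = os.path.join(out_dir, f"map-{task_id}-part-{r}")
        with open(path, "w") as f:            # flush buffer to local disk
            for k, v in pairs:
                f.write(f"{k}\t{v}\n")
        locations.append(path)
    return locations                          # sent to the master
```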
When a worker receives a reduce task, it reads the intermediate data for its partition from the map workers' local disks using remote procedure calls. The reduce worker then sorts the data by intermediate key, so that all occurrences of the same key are grouped together, and iterates over the sorted data, invoking the user's reduce function once for each unique key and its associated set of values. The output of the reduce function is appended to a final output file for that partition of the data.
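A corresponding sketch of the reduce side, again with illustrative names, reading from local paths instead of making RPC calls and assuming the integer-valued format written by the map sketch above:

```python
# Hypothetical sketch of a reduce task: pull this partition's
# intermediate files, sort by key, group equal keys, and apply the
# user's reduce function, appending results to the final output file.

from itertools import groupby
from operator import itemgetter

def run_reduce_task(partition_paths, reduce_fn, output_path):
    pairs = []
    for path in partition_paths:
        with open(path) as f:
            for line in f:
                k, v = line.rstrip("\n").split("\t")
                pairs.append((k, int(v)))     # assumes integer values

    pairs.sort(key=itemgetter(0))             # equal keys become adjacent
    with open(output_path, "w") as out:
        for key, group in groupby(pairs, key=itemgetter(0)):
            values = (v for _, v in group)
            for out_key, out_value in reduce_fn(key, values):
                out.write(f"{out_key}\t{out_value}\n")
```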
The MapReduce framework handles failures by re-executing work rather than trying to recover it. The master periodically pings each worker; if a worker stops responding, any tasks it had in progress are rescheduled on other machines. Because a map worker's output is stored on its local disk, completed map tasks on a failed machine are re-executed as well, while completed reduce tasks are not, since their output has already been written to the global file system.
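Continuing the hypothetical Master sketch from earlier, the re-scheduling step on worker failure might look like this:

```python
# Hypothetical sketch: when a worker stops responding, its in-progress
# task and its completed map tasks go back on the idle queue, because
# the map output lived on that worker's now-unreachable local disk.

def handle_worker_failure(master, worker_id, completed_map_tasks):
    # Work in progress on the dead worker must be redone.
    task = master.in_progress.pop(worker_id, None)
    if task is not None:
        master.idle.append(task)
    # Completed map output is lost with the disk, so those tasks are
    # rescheduled too. Completed reduce output lives in the global
    # file system and does not need to be re-executed.
    for task in completed_map_tasks:
        master.idle.append(task)
```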
Dependencies
The MapReduce framework depends on two systems for proper operation. The first is a distributed file system for storing input and output data from MapReduce programs. The second is a scheduling system for managing a cluster of machines.