Paper Review: DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language

Title and Author of Paper

DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language, Yu et al.

Summary

The paper describes DryadLINQ, a system that distributes the evaluation of .NET LINQ expressions across an underlying Dryad cluster.

The motivation for this work is to simplify the expression of data-parallel algorithms by providing the higher-level LINQ primitives. This allows programmers to implement an algorithm as if it were computed on a single machine, and lets the system handle the complexities of scheduling, distribution, and fault tolerance.

The system works by letting the programmer write an ordinary C# program using LINQ expressions. DryadLINQ exploits the fact that the .NET framework decouples a LINQ expression from the implementation that evaluates it, which allows DryadLINQ to execute the expression on a distributed cluster instead of locally. After execution completes, the results are returned to the program as an iterator over the dataset.

Architecturally, a .NET user application runs and builds a DryadLINQ expression using LINQ's deferred evaluation mechanism. DryadLINQ then compiles the expression into a Dryad execution plan and invokes a job manager that runs the plan on an available cluster of machines. When the job completes, its output is written to output tables and the distributed job terminates. Control then returns to the user application, which receives the results of the computation as a collection of .NET objects.
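The deferred evaluation step described above can be illustrated with a small sketch. This is not DryadLINQ's API: `DeferredQuery`, `where`, `select`, and `plan` are hypothetical names, and Python stands in for C#. The point is only that calling an operator records it rather than running it, so a full description of the query exists before any data is touched and could be handed to a different execution engine.

```python
class DeferredQuery:
    """Hypothetical stand-in for a LINQ expression: records operators, runs nothing."""

    def __init__(self, source, ops=()):
        self.source = source
        self.ops = tuple(ops)

    def where(self, predicate):
        # Return a new query with one more recorded operator; no rows are read here.
        return DeferredQuery(self.source, self.ops + (("where", predicate),))

    def select(self, fn):
        return DeferredQuery(self.source, self.ops + (("select", fn),))

    def plan(self):
        # The recorded operator names: the piece a compiler could inspect and rewrite.
        return [name for name, _ in self.ops]

    def __iter__(self):
        # Evaluation happens only when the caller iterates over the results.
        rows = iter(self.source)
        for name, fn in self.ops:
            rows = filter(fn, rows) if name == "where" else map(fn, rows)
        return rows


query = DeferredQuery(range(10)).where(lambda x: x % 2 == 0).select(lambda x: x * x)
print(query.plan())   # the plan is available before anything executes
print(list(query))    # iterating triggers the actual computation
```

Here the local `__iter__` plays the role that the Dryad job plays in the real system: it is just one possible way to evaluate the recorded plan.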

Compiling a LINQ expression into a Dryad execution plan gives the system an opportunity to perform optimizations such as removing redundant operations or combining operations. These optimizations can substantially speed up the processing of LINQ expressions.
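As a rough illustration of what "combining operations" can mean, the sketch below merges adjacent projection steps in a plan into a single step, so the data is traversed by one operator instead of two. The plan representation (a list of `(name, function)` pairs) and the helper names `fuse_selects` and `run` are assumptions made for this example; the rewrite rules that DryadLINQ actually applies are described in the paper and are not reproduced here.

```python
def compose(f, g):
    """Return a function that applies f, then g."""
    return lambda x: g(f(x))


def fuse_selects(ops):
    """Merge consecutive 'select' steps into one (hypothetical optimizer pass)."""
    fused = []
    for name, fn in ops:
        if name == "select" and fused and fused[-1][0] == "select":
            previous = fused.pop()[1]
            fused.append(("select", compose(previous, fn)))
        else:
            fused.append((name, fn))
    return fused


def run(ops, rows):
    """Evaluate a plan locally so the two plans can be compared."""
    for name, fn in ops:
        rows = filter(fn, rows) if name == "where" else map(fn, rows)
    return list(rows)


plan = [
    ("select", lambda x: x + 1),
    ("select", lambda x: x * 2),
    ("where", lambda x: x > 4),
]
optimized = fuse_selects(plan)
print(len(plan), len(optimized))
print(run(plan, range(5)) == run(optimized, range(5)))
```

The optimized plan has fewer steps but produces the same rows, which is the property any such rewrite has to preserve.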

The Dryad execution plan is ultimately expressed as a directed acyclic graph (DAG), where the vertices are the processes of the computation and the edges are the data channels between them. The Dryad system is responsible for starting the process at each vertex and channeling data between vertices.
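A toy version of this execution model is sketched below, under heavy simplification: plain Python functions stand in for vertex processes, in-memory lists stand in for data channels, and everything runs in one process. The function `run_dag` and the vertex names are hypothetical. It uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+) to make sure a vertex runs only after every vertex that feeds it.

```python
from graphlib import TopologicalSorter


def run_dag(vertices, edges, inputs):
    """Run each vertex after its upstream vertices and pass outputs along the edges.

    vertices: name -> function taking one argument per incoming edge
    edges:    list of (source, destination) pairs, i.e. the data channels
    inputs:   name -> input data for vertices that have no incoming edge
    """
    upstream = {name: [] for name in vertices}
    for src, dst in edges:
        upstream[dst].append(src)

    results = {}
    # TopologicalSorter takes a node -> predecessors mapping and yields
    # each node only after all of its predecessors.
    for name in TopologicalSorter(upstream).static_order():
        if upstream[name]:
            args = [results[src] for src in upstream[name]]
        else:
            args = [inputs[name]]
        results[name] = vertices[name](*args)
    return results


# Two vertices each process one partition; a third merges their outputs.
vertices = {
    "map_a": lambda part: [x * x for x in part],
    "map_b": lambda part: [x * x for x in part],
    "merge": lambda a, b: sum(a) + sum(b),
}
edges = [("map_a", "merge"), ("map_b", "merge")]
inputs = {"map_a": [1, 2], "map_b": [3, 4]}

results = run_dag(vertices, edges, inputs)
print(results["merge"])
```

Because the graph is acyclic, a valid execution order always exists, and vertices with no path between them (here `map_a` and `map_b`) are free to run in parallel on different machines.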

One of the most interesting aspects of DryadLINQ is its interface. In a traditional data warehouse, SQL statements are executed over the database and stored procedures handle more complex processing. In DryadLINQ, by contrast, SQL-like LINQ expressions are embedded within a general-purpose programming language. This allows DryadLINQ programs to combine simple data processing tasks with more advanced programming tasks, such as machine learning, within the same programming model.
