Paper Review: The Gamma Database Machine Project

Title and Author of Paper

The Gamma Database Machine Project. David J. DeWitt et al.

Summary

This paper presents the research undertaken at the University of Wisconsin-Madison to develop a scalable database architecture. The paper presents novel methods for scaling a database cluster using a shared-nothing architecture, and for using hash-based join algorithms to parallelize the workload across the cluster.

What are the motivations for this work?

The motivation behind Gamma was to support horizontally scalable database using commodity parts.

What is the proposed solution?

The Gamma project uses a shared-nothing architecture to build a horizontally scalable database for the same reasons we see NoSQL vendors tackling the same problem today. Communication between nodes in the system is handled exclusively by passing messages between one another over the network.

Gamma is divided into several processes, the Catalog Manager handles database schema and metadata information, the Query Manager is associated with each active Gamma user and is responsible for providing ad-hoc query support. Coordination of multi-node queries is handled through a Scheduler Process that handles the execution of individual Operator Processes on each node.

Gamma maintains a split table mapping tuples to the node that the tuple resides on. Queries are then shipped to the node where the tuple resides for processing.

Selection operators are parallelized over all nodes by initiating a selection on the set of nodes where the required tuples are held. If the selection query is based on the attribute responsible for building the split table, then a subset of nodes will be engaged to process the query. Otherwise, the selection operator is run on every node in the cluster.

Join operators are executed using the Hybrid hash-join algorithm, where relations are first partitioned into N buckets and the first bucket is used to build an in-memory hash table. The remaining buckets are stored in temporary files. The join operation is run on the in-memory hash tables and, once complete, the remaining buckets are loaded into memory and joined on. In this way, a large join is broken up in to a series of smaller joins.

Aggregate operators are executed by each node in the cluster, and the results combined into a final answer at a single node.

What are the contributions?

Gamma uses two key ideas for enabling a scalable architecture. First, relations are partitioned across nodes in the cluster, allowing for parallel data scans without any specialized hardware. Second, Gamma popularized hash-based parallel algorithms for join and aggregate operators.

What are future directions for this research?

The shared-nothing architecture continues to be the dominant form of scaling databases.

What questions are you left with?

A more thorough explanation of how each of the relational operators is executed within the cluster. The paper provides a short summary of the techniques but I would like to read a more detailed summary.

What is your take-away message from this paper?

What’s old is new again. Many of the ideas for building a horizontally scalable database presented in this paper are being implemented again in NoSQL databases today.

See also

comments powered by Disqus