Title and Author of Paper

An Array-Based Algorithm for Simultaneous Multidimensional Aggregates. Y. Zhao et al.

Summary

One of the core functions of an OLAP system is computing aggregations and group-by operations. This functionality has been characterized by the “Cube” operator, which computes group-by aggregations over every possible subset of a specified set of dimensions. As an example of the Cube operator, consider a model with the dimensions product, store, and date, and the measured value sales. Computing the Cube for this data set requires computing sales for all subsets of the dimensions: sales by product, store, and date; sales by product and store; sales by product; and so on, down to the grand total. As a user, I want the system to prepare these results for me in response to ad-hoc queries or as part of an ETL job that prepares the data for analysis. Because the data volumes involved can be very large, the challenge in implementing the Cube operator is computing these aggregations as efficiently as possible.
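
As a rough illustration of what the Cube operator produces (not the paper's algorithm), the following Python sketch computes every group-by over a tiny, invented fact table; the column names and rows are made up for the example.

    from itertools import combinations
    from collections import defaultdict

    # Toy fact table: (product, store, date, sales)
    rows = [
        ("shoes", "WestTown", "3-Jul-96", 34.00),
        ("shoes", "EastTown", "3-Jul-96", 12.00),
        ("boots", "WestTown", "4-Jul-96", 55.00),
    ]
    dims = ("product", "store", "date")

    cube = {}
    # One group-by per subset of the dimensions, including the empty subset (grand total).
    for k in range(len(dims) + 1):
        for subset in combinations(range(len(dims)), k):
            totals = defaultdict(float)
            for row in rows:
                key = tuple(row[i] for i in subset)
                totals[key] += row[3]          # row[3] is the sales measure
            cube[tuple(dims[i] for i in subset)] = dict(totals)

    print(cube[("product",)])   # {('shoes',): 46.0, ('boots',): 55.0}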

This paper presents an algorithm for computing the Cube operator over data sets that are stored as sparse arrays. In traditional relational databases, each “cell” is represented as a tuple, with most entries in the tuple denoting the attributes (dimension values) of the cell and one entry denoting the measure. For example, the tuple (shoes, WestTown, 3-July-96, $34.00) represents a cell for shoes at the WestTown store on 3-July-96; the value of the cell (sales) is $34.00. When stored within a sparse array, this cell is stored as just the value $34.00, and its position within the array encodes the remaining facts about the cell (product, store, date). The remainder of the paper is devoted to an algorithm for computing the Cube for data stored in such a sparse array.
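
To make the contrast concrete, here is a minimal sketch (with invented dimension sizes) of how a cell's coordinates can be folded into a single array position, so that only the measure itself needs to be stored explicitly.

    # Hypothetical dimension sizes: 10 products, 20 stores, 366 dates.
    dim_sizes = (10, 20, 366)

    def position(coords, sizes=dim_sizes):
        """Row-major linearization: the position alone encodes (product, store, date)."""
        pos = 0
        for c, size in zip(coords, sizes):
            pos = pos * size + c
        return pos

    def coords_from_position(pos, sizes=dim_sizes):
        """Recover the dimension indices from an array position."""
        coords = []
        for size in reversed(sizes):
            coords.append(pos % size)
            pos //= size
        return tuple(reversed(coords))

    # The cell (product=3, store=7, date=184) is stored only as the value 34.00
    # at this position; the coordinates are implied by where the value sits.
    p = position((3, 7, 184))
    assert coords_from_position(p) == (3, 7, 184)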

Storing Large Arrays

One of the key problems in implementing an array-based cubing algorithm is efficiently managing the loading and storage of large arrays: the arrays can be too large to fit in main memory, the arrays may contain many empty values (i.e., they are sparse), and the data may need to be loaded from alternate (relational) data sources.

Chunking

To handle loading arrays that are too large for memory, the authors implemented an existing chunking algorithm to divide n-dimensional arrays into smaller n-dimensional chunks, storing each chunk on disk.
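
A minimal sketch of the idea, assuming a fixed chunk size per dimension (the sizes here are invented): each cell maps to one chunk, and each chunk is small enough to be read and written as a unit.

    # Hypothetical array of 40 products x 60 stores x 360 dates,
    # divided into 10 x 10 x 10 chunks (so a 4 x 6 x 36 grid of chunks).
    array_shape = (40, 60, 360)
    chunk_shape = (10, 10, 10)

    def chunk_coords(cell):
        """Which chunk a cell falls in, and its coordinates inside that chunk."""
        chunk = tuple(c // s for c, s in zip(cell, chunk_shape))
        within = tuple(c % s for c, s in zip(cell, chunk_shape))
        return chunk, within

    print(chunk_coords((13, 27, 255)))   # ((1, 2, 25), (3, 7, 5))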

Compression

Sparse arrays may contain significant runs of empty values. Compression can be applied to these arrays to reduce the amount of memory needed for storage and processing. The authors implement a chunk-offset compression scheme that stores, for each valid (non-empty) array entry, a pair consisting of the cell's value and an offset describing the cell's position within its chunk. Empty cells are simply not stored.
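
A sketch of chunk-offset compression under the same invented chunk sizes as above: within a chunk, only non-empty cells are kept, each as an (offset, value) pair, where the offset is the cell's row-major position inside the chunk.

    def compress_chunk(chunk_cells, chunk_shape=(10, 10, 10)):
        """chunk_cells maps (i, j, k) within the chunk to a value; empty cells are absent."""
        compressed = []
        for (i, j, k), value in chunk_cells.items():
            offset = (i * chunk_shape[1] + j) * chunk_shape[2] + k
            compressed.append((offset, value))
        return sorted(compressed)           # store the pairs in offset order

    # Only two of the 1000 cells in this chunk are non-empty.
    print(compress_chunk({(3, 7, 5): 34.00, (0, 0, 9): 12.00}))
    # [(9, 12.0), (375, 34.0)]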

Ingesting Data from Relational Tables

Loading data from table-based storage is done via a partition-based algorithm that converts each row of a table into a cell at the appropriate location in chunked storage. After all data has been loaded into n-dimensional chunks, a second pass over the data compresses it.
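
Loosely, the conversion from rows to chunked storage might look like the sketch below (a simplification, with hypothetical value-to-index mappings): each row's dimension values are mapped to array indices, the target chunk is identified, and the row's measure is appended to that chunk's partition.

    from collections import defaultdict

    # Hypothetical mappings from dimension values to array indices.
    product_index = {"shoes": 3, "boots": 4}
    store_index = {"WestTown": 7, "EastTown": 8}
    date_index = {"3-Jul-96": 184, "4-Jul-96": 185}
    chunk_shape = (10, 10, 10)

    partitions = defaultdict(list)   # chunk id -> list of (within-chunk coords, value)

    def load_row(product, store, date, sales):
        cell = (product_index[product], store_index[store], date_index[date])
        chunk = tuple(c // s for c, s in zip(cell, chunk_shape))
        within = tuple(c % s for c, s in zip(cell, chunk_shape))
        partitions[chunk].append((within, sales))

    load_row("shoes", "WestTown", "3-Jul-96", 34.00)
    # A second pass would then compress each partition into (offset, value) pairs.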

The Basic Cubing Algorithm

The first version of the array cubing algorithm computes the cube of a previously chunked array in multiple passes, with the goal of using minimal memory. To understand how the algorithm works, first consider aggregating a three-dimensional array along one of its dimensions. That is, consider the array ABC, where you wish to compute the AB group-by by aggregating away C. This can be visualized as projecting the values onto a two-dimensional AB plane, then sweeping this plane through the C dimension, accumulating values at each step.

The algorithm works in the same manner for a chunked array. Instead of sweeping a plane through the entire data set, the plane is swept over one chunk at a time, and the partial result for each chunk is stored to disk until all chunks have been swept. Proceeding in this fashion, each chunk is read into memory only once, and a partial aggregation for each chunk ends up on disk. These partial results are then combined to produce the final aggregation for the entire data set. Generalizing this to higher dimensions involves projecting onto higher-dimensional sub-arrays rather than a plane.
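
The following sketch (simplified, with the disk writes of partial results omitted and a toy in-memory chunk representation) aggregates a chunked ABC array along C to produce the AB group-by; each chunk contributes its values to the AB plane, so every chunk is read only once.

    from collections import defaultdict

    # Each chunk is a dict mapping global (a, b, c) coordinates to a value;
    # in the real system the chunks would be read from disk one at a time.
    chunks = [
        {(0, 0, 0): 1.0, (0, 1, 1): 2.0},
        {(0, 0, 5): 3.0, (1, 1, 6): 4.0},
    ]

    ab_plane = defaultdict(float)      # the AB group-by being accumulated

    for chunk in chunks:               # one pass: each chunk is visited exactly once
        for (a, b, c), value in chunk.items():
            ab_plane[(a, b)] += value  # project onto the AB plane, summing out C

    print(dict(ab_plane))              # {(0, 0): 4.0, (0, 1): 2.0, (1, 1): 4.0}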

The preceding example showed how to compute a single aggregation, but computing the full cube requires computing every group-by: AB, BC, AC, A, B, and C. To do this efficiently, the authors leverage existing work on deriving the cube from a hierarchical lattice of group-bys. If you view ABC as the root of the lattice, ABC has children AB, BC, and AC; AC has children A and C; and so forth. This lattice view allows a computation to proceed from intermediate results already produced for a parent. That is, to compute the aggregation over dimension A, we can reuse the already-computed aggregation over dimensions AB.

The authors define a “minimum size spanning tree” over this lattice, in which each group-by is computed from its smallest parent, and use that tree to schedule the cube computation. This allows the algorithm to reuse previous results to compute new ones while keeping the cost of each aggregation low.
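
A sketch of the idea (the dimension sizes here are invented): each group-by in the lattice is assigned the smallest parent that contains its dimensions, and those assignments form the spanning tree used to schedule the computations.

    from itertools import combinations

    dims = ("A", "B", "C")
    # Hypothetical number of distinct values per dimension.
    sizes = {"A": 100, "B": 40, "C": 365}

    def groupby_size(gb):
        s = 1
        for d in gb:
            s *= sizes[d]
        return s

    # All group-bys in the lattice, from ABC down to the empty (grand total) group-by.
    lattice = [c for k in range(len(dims), -1, -1) for c in combinations(dims, k)]

    parent = {}
    for gb in lattice:
        if len(gb) == len(dims):
            continue                      # the root (ABC) has no parent
        candidates = [p for p in lattice if len(p) == len(gb) + 1 and set(gb) <= set(p)]
        parent[gb] = min(candidates, key=groupby_size)

    print(parent[("A",)])   # ('A', 'B'): AB (size 4000) is smaller than AC (size 36500)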

The Multi-Way Array Algorithm

The authors also present a multi-way array algorithm that overlaps the computation of the different aggregations, avoiding the need for multiple scans over the data. This algorithm holds intermediate results in memory, so the order of computation must be chosen carefully to keep memory requirements low. The concept of “dimension order” is used for this purpose.

A dimension order is a row-major ordering of the chunks of the array. The multi-way algorithm reads chunks in dimension order, which lets it apply the spanning-tree idea from the basic algorithm: previous results are used to compute new results while they are still held in memory. This idea is generalized into a “minimum memory spanning tree,” which chooses, for a given dimension order, the parent-child computations that minimize total memory use.
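
To make “dimension order” concrete, the sketch below lists the chunks of a three-dimensional array in row-major order over their chunk coordinates (the chunk counts are invented); the multi-way algorithm reads chunks in this kind of order so that partial results can be completed and evicted from memory as early as possible.

    from itertools import product

    # Hypothetical 2 x 3 x 4 grid of chunks for dimensions A, B, C.
    chunk_counts = (2, 3, 4)

    # Dimension order ABC: the C chunk index varies fastest, then B, then A.
    dimension_order = list(product(*(range(n) for n in chunk_counts)))

    print(dimension_order[:6])
    # [(0, 0, 0), (0, 0, 1), (0, 0, 2), (0, 0, 3), (0, 1, 0), (0, 1, 1)]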

Performance

The authors provide performance results showing that the array-based cubing algorithm is efficient across a range of data set sizes, and that it is substantially more efficient than computing the cube directly on a traditional relational system. In fact, the authors show that for some workloads it is faster to load relational data into a sparse array structure and compute the cube there than it is to compute the cube on the original relational data.