Title and Author of Paper
C-Store: A column-oriented DBMS. Stonebraker et al.
In traditional databases, all attributes of a record (or tuple) are stored together as a contiguous block. When writing to disk, a single write pushes all fields of the record to disk. For the purposes of this paper, we call this type of DBMS a write-optimized system and this type of system works well for transactional processing. However, for querying data we can do better with a system that is read-optimized. C-Store is such a read-optimized system.
In C-Store, fast reads are accomplished by storing data organized by column instead of row, where the values for each single column (or attribute) of a table are stored as a contiguous block. Organizing data by columns offers a number of advantages:
- Processing a query only involves reading into memory the relevant attributes for resolving the query.
- Since all data within a column is of uniform type, compression techniques can be used to code data in a more compact form.
- Data can be packed densely together rather than aligned by byte or word boundaries.
Together, these changes result in significant performance gains for read-only queries.
C-Store is organized into two major systems: a writable store (WS) for handling update transactions and a read-optimized store (RS) for handling ad-hoc queries. A tuple mover sits between these two systems and copies data from the writable store to the readable store as necessary.
The writable store acts much like a traditional row-oriented database and handles insert and update requests. The read-optimized store is a much larger component organized by column instead of row. To provide data consistency guarantees, read-only queries are done on historical data using snapshot isolation.
In the read-optimized store, data is split into groups of column sorted on the same attribute, called projections. The same column may exist in multiple projections, possibly in a different sort order. By offering different sorted orders of the data, reads can be optimized to process the most performant sort order. Every table is thus reduced to a covering set of projections that include every column of the table. To join projections together when running a query, storage keys are used to record the logical row for each attribute being stored along with a join index mapping the segment id and storage key of two tuples that can be joined together to form an original table.
Data in the read-optimized store is compressed using existing compression techniques such as run-length encoding depending on the type of data being stored.
The writable store is organized as a traditional row-oriented RDBMS. In addition to storing each record, the writable store includes the same storage key as the read-optimized store.
To avoid locking and two-phase commit for all reads, snapshot isolation over the read-optimized store is used for read-only queries. With snapshot isolation, read-only queries access the database at some time in the recent past, where we can guarantee there are no uncommitted transactions.
To provide snapshot isolation, we cannot perform updates in place. Instead, an update is turned into an insert and a delete. Hence, a record is visible if it was inserted before the effective time of the query and deleted after the effective time.
To determine if a record is visible without requiring a large amount of space, coarse grained “epochs” spanning several seconds are used as the unit of time in C-Store. Each segment in the writable store maintains an insertion vector which records the epoch in which a record was inserted. The tuple mover is responsible for ensuring that the correct updates and insertions are moved to the read-optimized store for the proper epochs.
The job of the tuple mover is to — ahem — move tuples. Specifically it moves blocks of tuples from the writable store to the read-optimized store and updates any join indexes in the process. The tuple mover creates a new segment in the read-optimized store for every segment requiring an update. After the new segment is created, it replaces the old segment and the old segment can be deleted.
Odds and Ends
In addition to the core system described here, C-Store also implements its own query optimizer that can operate over compressed data, distributed transactions using without a redo log or two phase commit, and a system designed to safely handle up to K failures. Although these contributions are important, the ideas around column-based storage for efficient reads are the key takeaways from this paper.