Title and Author of Paper
Hekaton: SQL Server’s Memory-Optimized OLTP Engine, Diaconu et al.
Database design has traditionally revolved around efficient access to disk. However, falling memory prices now make it feasible to keep most (or all) of a database in main memory. A main-memory design requires significant adjustments to maximize concurrency, handle transactions, and recover after failure. This paper describes such a design in the context of Hekaton, an extension to Microsoft's SQL Server. With Hekaton, when the user declares a table "memory-optimized", SQL Server stores that table entirely in memory and manages it with Hekaton's in-memory database engine.
At a high level, Hekaton is divided into three components: the storage engine, the runtime, and the compiler. Hekaton leverages SQL Server for storing metadata, handling security, and query optimization. In this paper review, I will focus on how Hekaton handles in-memory transactions.
Hekaton tables use multiversioning with snapshot isolation and support two lock-free index types: hash tables and lock-free B-trees (Bw-trees). Since the indexes are lock-free, threads do not block when updating them, which means multiple threads may access the same data concurrently. To ensure consistency, a thread checks at commit time that the data it accessed during a transaction has not been modified in the meantime. If any data has been modified, the transaction is aborted; otherwise, it can commit. These checks require a notion of visibility: a record version is visible to a transaction if the transaction is executing at a point in time for which the version is valid. Each version tracks the begin and end times during which it is valid, and an update sets the end time of the old version and creates a new version with its own begin and end times. Thus, each record may have multiple versions in the database that are valid for different points in time.
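The begin/end timestamp scheme can be sketched as follows. This is a hypothetical illustration of the versioning idea, not Hekaton's actual data structures; the `Version` class, `is_visible`, and `update` names are my own.

```python
from dataclasses import dataclass

INFINITY = float("inf")

@dataclass
class Version:
    payload: str
    begin: float   # logical time at which this version became valid
    end: float     # logical time at which it was superseded (INFINITY if current)

def is_visible(version: Version, read_time: float) -> bool:
    """A version is visible to a transaction reading at read_time if
    read_time falls inside the version's [begin, end) validity window."""
    return version.begin <= read_time < version.end

def update(versions: list[Version], new_payload: str, now: float) -> None:
    """An update ends the current version's validity window and appends a
    new version that is valid from now onward."""
    current = versions[-1]
    current.end = now
    versions.append(Version(new_payload, begin=now, end=INFINITY))
```

A transaction reading at time 3 sees the old version, while one reading at time 5 or later sees the new one; both versions coexist until garbage collection reclaims the old version.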
A transaction is validated by checking that the versions of data read during the transaction are still visible and valid at the end of the transaction, implying that the data has not been changed between the time the transaction read it and the time the transaction commits. If a transaction reads a version written by another transaction that has not yet committed, it takes a commit dependency on that transaction.
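The validation step amounts to re-checking each read version's validity window at commit time. A minimal sketch, assuming read versions are tracked as `(begin, end)` pairs (the helper names are illustrative):

```python
INFINITY = float("inf")

def still_visible(begin: float, end: float, commit_time: float) -> bool:
    # A version that was read remains valid if the commit time still falls
    # inside its [begin, end) window, i.e. no later update ended it.
    return begin <= commit_time < end

def validate_read_set(read_set: list[tuple[float, float]], commit_time: float) -> bool:
    # The transaction may commit only if every version it read is still
    # valid at commit time; otherwise it must abort.
    return all(still_visible(b, e, commit_time) for b, e in read_set)
```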
Commit dependencies introduce two problems: 1) a transaction cannot commit until every transaction upon which it is dependent has committed and 2) commit dependencies imply working with uncommitted data and such data should not be exposed to users.
To solve the first problem, when a transaction commits it notifies dependent transactions of whether or not it succeeded. On failure, dependent transactions are aborted and must be retried. To solve the second problem, results are withheld from the user while a transaction has unresolved commit dependencies; once all commit dependencies are resolved, the results are delivered.
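This dependency bookkeeping can be sketched as a counter of unresolved dependencies plus a list of dependents to notify. The `Txn` class below is an illustrative model of the two rules above (defer results, cascade aborts), not Hekaton's internals:

```python
class Txn:
    def __init__(self, name: str):
        self.name = name
        self.dep_count = 0          # unresolved commit dependencies
        self.dependents: list["Txn"] = []
        self.state = "active"       # active -> committed | aborted
        self.result_released = False

    def depend_on(self, other: "Txn") -> None:
        # Take a commit dependency on another, not-yet-committed transaction.
        self.dep_count += 1
        other.dependents.append(self)

    def commit(self) -> None:
        self.state = "committed"
        for t in self.dependents:   # notify dependents of success
            t._resolve(success=True)
        self._maybe_release()

    def abort(self) -> None:
        self.state = "aborted"
        for t in self.dependents:   # failure cascades to dependents
            t._resolve(success=False)

    def _resolve(self, success: bool) -> None:
        if not success:
            self.abort()            # dependent must abort and be retried
            return
        self.dep_count -= 1
        self._maybe_release()

    def _maybe_release(self) -> None:
        # Results reach the user only once the transaction has committed
        # AND has no unresolved commit dependencies.
        if self.state == "committed" and self.dep_count == 0:
            self.result_released = True
```

A transaction that commits while a dependency is still pending simply holds its results until the dependency's own commit notification arrives.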
Transaction durability is handled using a transaction log stored in the regular SQL Server installation. Checkpoints provide a compressed representation of the log and speed up recovery by avoiding a scan of the entire log.
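The reason checkpoints shorten recovery can be shown with a toy redo-log model (purely illustrative, not SQL Server's actual log format): recovery starts from the latest checkpointed state and replays only the log records that come after it.

```python
def recover(checkpoint_state: dict, checkpoint_lsn: int,
            log: list[tuple[int, str, str]]) -> dict:
    """Rebuild state from a checkpoint plus the log tail.
    log entries are (lsn, key, value); records at or before
    checkpoint_lsn are already reflected in checkpoint_state."""
    state = dict(checkpoint_state)    # start from the checkpoint, not empty
    for lsn, key, value in log:
        if lsn > checkpoint_lsn:      # replay only the tail of the log
            state[key] = value
    return state
```

Without the checkpoint, recovery would have to replay every record from the beginning of the log.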
Because Hekaton uses snapshot isolation, multiple versions of a record may exist at one time. This necessitates a garbage collection process that removes old versions no longer visible to any active transaction. Garbage is collected in two ways: by a long-running background garbage collection process, and cooperatively, as it is encountered during query scans. During a query, multiple versions of a record may be scanned to find the one visible and valid at the query's point in time; if a version found along the way can be reclaimed, it is reclaimed as part of query processing. This cooperative approach spreads the load of garbage collection across many threads.
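The cooperative half of this scheme can be sketched as a scan that both finds the visible version and unlinks versions too old for any active transaction to see. The function and parameter names are hypothetical:

```python
INFINITY = float("inf")

def scan_and_collect(versions, read_time, oldest_active):
    """Return the payload visible at read_time; while scanning, drop
    versions whose validity ended before the oldest active transaction
    began, since no transaction can ever see them again.
    versions is a list of (begin, end, payload) tuples."""
    visible = None
    survivors = []
    for begin, end, payload in versions:
        if end < oldest_active:
            continue                  # garbage: reclaim during the scan
        survivors.append((begin, end, payload))
        if begin <= read_time < end:
            visible = payload
    versions[:] = survivors           # in-place, so the caller sees reclamation
    return visible
```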