VoltDB

VoltDB is an in-memory database born out of the H-Store research project spearheaded by Michael Stonebraker. Drawing on insights from an in-depth study of database performance, the project started from the premise that a fully transactional database achieves the best possible performance by completely removing disk access, the primary limiting factor on database performance. With all data in memory, further optimizations become possible: write-ahead logging, buffer management, and locks and latches can all be removed. This effort resulted in the research database H-Store, which was commercialized as the in-memory database VoltDB. This article takes a deeper dive into VoltDB to understand how it works and where you may benefit from this approach. ...
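
To make the idea concrete, here is a minimal sketch in Python (emphatically not VoltDB's actual API) of the execution model this enables: when a single thread owns an in-memory partition outright, transactions run serially to completion and need no locks, latches, or buffer pool.

```python
import queue
import threading

# Toy single-partition executor: one thread owns all of the (in-memory)
# data, so transactions run serially to completion and need no locks,
# latches, write-ahead log, or buffer pool.
class Partition:
    def __init__(self):
        self.table = {}                # all data lives in memory
        self.inbox = queue.Queue()     # transactions are serialized here
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            txn, done = self.inbox.get()
            done.put(txn(self.table))  # exclusive access by construction

    def execute(self, txn):
        done = queue.Queue(maxsize=1)
        self.inbox.put((txn, done))
        return done.get()              # block until the transaction finishes

p = Partition()
p.execute(lambda table: table.update(balance=100))  # write transaction
print(p.execute(lambda table: table["balance"]))    # read transaction -> 100
```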

June 3, 2020 · 6 min · Kevin Sookocheff

Paper Review: The CQL continuous query language: semantic foundations and query execution

Title and Author of Paper The CQL continuous query language: semantic foundations and query execution. Arasu et al. Summary CQL is a query language derived from SQL and developed for running continuous queries over streams of data. The goal of the system is to provide precise language semantics for such continuous stream workloads. The paper starts by defining precise abstract semantics for continuous queries that cover two data types, streams and relations, and three classes of operators: those that produce a relation from a stream, those that produce a relation from other relations, and those that produce a stream from a relation. These semantics are defined independently of the underlying implementation. The second portion of the paper defines how CQL instantiates these abstract semantics using existing SQL specifications and some new CQL additions. ...
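
As a rough illustration of those three operator classes, the Python sketch below models a row-based sliding window (stream-to-relation), a selection (relation-to-relation), and an ISTREAM-style insert stream (relation-to-stream). The construct names follow CQL; the implementation is my own toy, not the paper's.

```python
from collections import deque

# Stream-to-relation: a sliding window over the last n stream elements
# defines the relation's contents at each instant.
def row_window(stream, n):
    window = deque(maxlen=n)
    for element in stream:
        window.append(element)
        yield set(window)          # the relation at this instant

# Relation-to-relation: ordinary relational operators, e.g. selection.
def select(relations, predicate):
    for relation in relations:
        yield {row for row in relation if predicate(row)}

# Relation-to-stream: ISTREAM emits tuples as they are inserted.
def istream(relations):
    previous = set()
    for relation in relations:
        for row in relation - previous:
            yield row
        previous = relation

events = [1, 5, 2, 8, 3]
rels = row_window(events, n=3)            # S -> R
filtered = select(rels, lambda x: x > 2)  # R -> R
print(list(istream(filtered)))            # R -> S: [5, 8, 3]
```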

March 17, 2017 · 5 min · Kevin Sookocheff

Paper Review: BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data

Title and Author of Paper BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. Agarwal et al. Summary BlinkDB is a massively parallel database that provides approximate results for queries over large data sets. BlinkDB's distinguishing feature is letting users trade response time for query accuracy: partial results are returned with error bars describing their accuracy at the current point in time. ...
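
A minimal sketch of the estimate-with-error-bars idea, assuming a uniform random sample and a normal-approximation confidence interval; BlinkDB itself maintains stratified samples and picks the sample size that meets a user's time or error bound, which this sketch does not attempt.

```python
import math
import random

# Approximate AVG over a uniform random sample, with a ~95% confidence
# interval. A larger sample (more time) yields tighter error bars.
def approx_avg(data, sample_size, seed=0):
    rng = random.Random(seed)
    sample = rng.sample(data, sample_size)
    mean = sum(sample) / sample_size
    var = sum((x - mean) ** 2 for x in sample) / (sample_size - 1)
    return mean, 1.96 * math.sqrt(var / sample_size)   # estimate, +/- error

population = [random.gauss(100, 15) for _ in range(1_000_000)]
for n in (100, 10_000):
    estimate, error = approx_avg(population, n)
    print(f"sample={n}: {estimate:.2f} +/- {error:.2f}")
```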

March 4, 2017 · 3 min · Kevin Sookocheff

Paper Review: Informix under CONTROL: Online Query Processing

Title and Author of Paper Informix under CONTROL: Online Query Processing. J. M. Hellerstein et al. Summary The CONTROL project attempts to improve the interaction between users and computers during data analysis. Traditional data analysis systems are a black box: a user enters a query and waits for some amount of time before receiving a result. The CONTROL project aims to make this process interactive by continuously providing approximate results that are improved over time. Implementing such a system requires rethinking some fundamental tenets of database systems. First, in an interactive system a query may never run to completion; instead, it may be halted once results are “good enough”. Second, interactive systems must be able to provide approximate results quickly while maximizing the rate at which an accurate answer is found. This paper explores the changes in database technology needed to support interactive use cases. ...
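
The following sketch illustrates online aggregation in this spirit: a query that yields a running estimate with a shrinking confidence interval, so the user can halt it once the answer is good enough. The statistics and stopping rule are illustrative assumptions, not the paper's implementation.

```python
import math
import random

# Online aggregation: scan tuples in random order and yield a running
# estimate of the average plus a shrinking confidence interval, so the
# user can stop the query once the answer is "good enough".
def online_avg(table, every=1000):
    random.shuffle(table)            # random order keeps estimates unbiased
    total, total_sq = 0.0, 0.0
    for i, x in enumerate(table, start=1):
        total += x
        total_sq += x * x
        if i % every == 0:
            mean = total / i
            var = (total_sq - i * mean * mean) / (i - 1)
            yield mean, 1.96 * math.sqrt(var / i)

table = [random.expovariate(1 / 50) for _ in range(100_000)]
for estimate, half_width in online_avg(table, every=20_000):
    print(f"{estimate:.2f} +/- {half_width:.2f}")
    if half_width < 0.5:             # halt once the result is good enough
        break
```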

March 3, 2017 · 4 min · Kevin Sookocheff

Paper Review: An Array-Based Algorithm for Simultaneous Multidimensional Aggregates

Title and Author of Paper An Array-Based Algorithm for Simultaneous Multidimensional Aggregates. Y. Zhao et al. Summary One of the core functions of an OLAP system is computing aggregations and group-by operations. This functionality has been characterized by the “Cube” operator, which computes group-by aggregations over all possible subsets of a specified set of dimensions. As an example of the Cube operator, consider a model with the dimensions product, store, and date, and the measured value sales. Computing the Cube for this data set requires computing sales for all subsets of the dimensions: sales by product, store, and date; sales by product and store; sales by product; and so on. As a user, I want the system to prepare these results for me in response to ad-hoc queries or as part of an ETL job that prepares the data for analysis. Because there is a lot of data involved, the challenge of implementing the Cube operator lies in computing these aggregations as efficiently as possible. ...
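
The naive computation below shows what the operator produces for exactly that example. The paper's contribution is computing all of these group-bys simultaneously over a compressed array representation, which this sketch does not attempt.

```python
from itertools import combinations
from collections import defaultdict

# The Cube operator: group-by SUM(sales) over every subset of the
# dimensions. Rows are (product, store, date, sales).
rows = [
    ("shoes", "s1", "2017-01-01", 10),
    ("shoes", "s2", "2017-01-01", 5),
    ("hats",  "s1", "2017-01-02", 7),
]
dims = ("product", "store", "date")

cube = {}
for k in range(len(dims) + 1):
    for subset in combinations(range(len(dims)), k):
        groups = defaultdict(int)
        for row in rows:
            key = tuple(row[i] for i in subset)
            groups[key] += row[3]                      # SUM(sales)
        cube[tuple(dims[i] for i in subset)] = dict(groups)

print(cube[("product",)])   # {('shoes',): 15, ('hats',): 7}
print(cube[()])             # {(): 22} -- the grand total
```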

February 20, 2017 · 5 min · Kevin Sookocheff

Paper Review: Implementing Data Cubes Efficiently

Business intelligence and analytics use cases involve complex queries on potentially very large databases, so minimizing query response times through query optimization is critical. One approach is to precompute relevant values ahead of time and use those precomputed results to answer queries. Unfortunately, it is not always feasible to precompute every value required to answer arbitrary queries. This paper describes a framework and presents algorithms that pick a good subset of queries to precompute in order to minimize response time. ...
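
As a hedged sketch of the flavor of such algorithms, the snippet below greedily materializes whichever view most reduces the total cost of answering a fixed query set. The toy view sizes and coverage sets are my assumptions, not the paper's lattice formalism.

```python
# Greedy view selection: repeatedly materialize the view whose
# precomputation most reduces total query cost, up to a view budget.
views = {                      # view -> (row count, queries it can answer)
    "psd": (1_000_000, {"psd", "ps", "p", "all"}),
    "ps":  (200_000,   {"ps", "p", "all"}),
    "p":   (1_000,     {"p", "all"}),
    "all": (1,         {"all"}),
}
queries = {"psd", "ps", "p", "all"}

def total_cost(materialized):
    # Each query is answered from the cheapest materialized view covering it.
    return sum(min(rows for rows, answers in materialized.values()
                   if q in answers)
               for q in queries)

materialized = {"psd": views["psd"]}         # the base cuboid is always kept
for _ in range(2):                           # budget: two extra views
    best = max(set(views) - set(materialized),
               key=lambda v: total_cost(materialized)
                             - total_cost({**materialized, v: views[v]}))
    materialized[best] = views[best]
print(sorted(materialized))                  # ['p', 'ps', 'psd']
```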

January 14, 2017 · 3 min · Kevin Sookocheff

Paper Review: Robust Query Processing through Progressive Optimization

Title and Author of Paper Robust Query Processing through Progressive Optimization. Markl et al. Summary Traditional query optimizers choose an execution plan for a query using estimates of current database statistics. However, these estimates may be inaccurate, leading to overly expensive query plans being chosen and executed. This paper presents progressive query optimization (POP), which allows query execution to detect and recover from estimation errors during processing. At each execution step, POP compares the observed cardinality of the rows processed so far against the estimated cardinality that was used to choose the original execution plan. If those cardinalities differ enough, POP re-optimizes the query using updated cardinality estimates. Any materialized views already computed can be reused during the re-execution step. ...
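
A small sketch of the checkpoint idea, with illustrative names: a pass-through operator counts the rows that actually flow by and signals re-optimization when the count leaves the cardinality range the optimizer assumed when it chose the plan.

```python
class ReoptimizeSignal(Exception):
    pass

# A validity-range checkpoint: pass rows through while counting them, and
# signal the executor to re-optimize if the observed cardinality leaves
# the range the current plan was chosen for.
def checkpoint(rows, low, high):
    count = 0
    for row in rows:
        count += 1
        if count > high:
            raise ReoptimizeSignal(f"{count} rows > upper bound {high}")
        yield row
    if count < low:
        raise ReoptimizeSignal(f"{count} rows < lower bound {low}")

scan = (r for r in range(500) if r % 2 == 0)   # actual cardinality: 250 rows
try:
    result = [row * 10 for row in checkpoint(scan, low=10, high=100)]
except ReoptimizeSignal as err:
    print("re-optimizing:", err)   # plan assumed <= 100 rows; switch plans here
```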

November 18, 2016 · 3 min · Kevin Sookocheff

Paper Review: Dynamo: Amazon’s Highly Available Key-value Store

Title and Author of Paper Dynamo: Amazon’s Highly Available Key-value Store. DeCandia et al. Summary Dynamo, as the title of the paper suggests, is Amazon’s highly available key-value storage system. Dynamo only supports primary-key access to data, which is useful for services such as shopping carts and session management. For these services, Dynamo provides a highly available system that never rejects a write; this requirement pushes the complexity of conflict resolution onto data readers. ...
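
A toy sketch of that division of labor follows. The shopping-cart merge mirrors the paper's example, but the naive sibling list here stands in for Dynamo's vector clocks.

```python
# Toy always-writable store: writes are never rejected, so concurrent
# updates accumulate as sibling versions, and the reader must reconcile
# them. Real Dynamo tracks causality with vector clocks; this sketch
# simply keeps every unreconciled sibling.
class AlwaysWritableStore:
    def __init__(self):
        self.siblings = {}                      # key -> list of versions

    def put(self, key, value):
        self.siblings.setdefault(key, []).append(value)   # never rejected

    def get(self, key, reconcile):
        merged = reconcile(self.siblings.get(key, []))
        self.siblings[key] = [merged]           # write back the merged version
        return merged

store = AlwaysWritableStore()
store.put("cart:42", {"apples"})                # two concurrent writers
store.put("cart:42", {"bananas"})
cart = store.get("cart:42", reconcile=lambda vs: set().union(*vs))
print(cart)                                     # {'apples', 'bananas'}
```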

October 7, 2016 · 2 min · Kevin Sookocheff

Paper Review: CAP Twelve Years Later: How the “Rules” Have Changed

Title and Author of Paper CAP Twelve Years Later: How the “Rules” Have Changed. Eric Brewer. Summary This article explores the CAP theorem and how it relates to database system design. The author argues that, since partitions are bound to happen, the system designer can compensate by introducing methods for safely recovering from them. This strategy allows the database to remain available during a partition and to restore consistency once the partition is resolved. ...
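
A sketch of what such partition-mode recovery could look like: each side logs its writes while partitioned, and a merge step restores consistency once the partition heals. Last-writer-wins with a deterministic tie-break is an illustrative merge rule, not the article's prescription.

```python
# "Partition mode": replicas stay available by logging writes during a
# partition, then a recovery step merges the logs so both sides converge.
class Replica:
    def __init__(self):
        self.data, self.log, self.clock = {}, [], 0

    def write(self, key, value):
        self.clock += 1
        self.data[key] = (self.clock, value)
        self.log.append((key, self.clock, value))   # partition-mode write log

def heal(a, b):
    for key, ts, value in a.log + b.log:
        for r in (a, b):                 # apply the newer version everywhere,
            current = r.data.get(key)    # breaking timestamp ties by value
            if current is None or current < (ts, value):
                r.data[key] = (ts, value)
    a.log.clear(); b.log.clear()

left, right = Replica(), Replica()       # a partition separates the replicas
left.write("x", 1)                       # both sides keep accepting writes
right.write("x", 2); right.write("y", 3)
heal(left, right)
print(left.data == right.data)           # True: replicas converge after healing
```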

October 5, 2016 · 3 min · Kevin Sookocheff

Paper Review: Generalized Isolation Level Definitions

Title and Author of Paper Generalized Isolation Level Definitions. Adya et al. Summary The ANSI SQL standard defines isolation levels that allow database users to trade off between performance and consistency when running transactions. Unfortunately, the wording of the SQL standard is geared towards locking as the sole supported concurrency control method. This paper presents alternative definitions of the isolation levels specified in the ANSI SQL standard that are general enough to allow any concurrency control method (multi-version, optimistic, etc.) to be used. ...
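
To make the flavor of the generalized definitions concrete, here is a miniature of the paper's serialization-graph approach: isolation levels become conditions on dependency edges between transactions rather than on locks, so they apply to any concurrency control scheme. The edge list is a hand-constructed example, not derived from a real history.

```python
# Isolation as a condition on a direct serialization graph (DSG): nodes
# are transactions, edges are read/write dependencies between them, and
# PL-3 (serializability) forbids any cycle -- independent of whether the
# system uses locking, multi-version, or optimistic concurrency control.
def has_cycle(edges):
    graph = {}
    for src, dst, _kind in edges:
        graph.setdefault(src, []).append(dst)

    def visit(node, path, done):
        if node in path:
            return True                  # back edge found: a cycle exists
        if node in done:
            return False
        path.add(node)
        cycle = any(visit(n, path, done) for n in graph.get(node, []))
        path.discard(node)
        done.add(node)
        return cycle

    return any(visit(n, set(), set()) for n in list(graph))

# Write skew: T1 reads x then writes y; T2 reads y then writes x. Each has
# an anti-dependency (rw edge) pointing at the other, forming a cycle, so
# the history is not serializable.
write_skew = [("T1", "T2", "rw"), ("T2", "T1", "rw")]
print(has_cycle(write_skew))   # True -> violates PL-3
```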

June 23, 2016 · 3 min · Kevin Sookocheff