Title and Author of Paper
BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. Agarwal et al.
BlinkDB is a massively parallel database that provides approximate results for queries over large data sets. BlinkDB’s distinguishing feature is providing the opportunity for users to trade response time for query accuracy — partial results are returned with annotated error bars describing their accuracy at the current point in time.
Any sample-based query processor needs to decide what types of samples to create. BlinkDB makes the assumption that the frequency of columns used in grouping and filtering remains static over time, while the exact values being queried on are unpredictable. In the rest of the paper, the term query column set or QCS describes the set of columns used in a grouping or filtering query. For example, if 5% of prior database queries grouped on the column City, then we assume that 5% of future queries also group on the same column.
Given this assumption, BlinkDB is implemented using two main components: an offline sampling module and a run-time sample selection module.
BlinkDB treats the creation of samples as an optimization problem for each QCS. A key point about the samples is that they are stratified, meaning that rare subgroups of the data are sufficiently represented so that queries for rare data still return relevant results. The optimization takes as input the sparsity of the data, characteristics of the workload being run, and the storage cost of the samples taken.
The sampling algorithm defines a “sparsity” function that accounts for the rarity of items in the data. Workload is defined as the frequency of queries on this data set being run. Storage cost is calculated as the number of rows being stored.
Given a set of stratified data samples the BlinkDB query engine is responsible for selecting the samples to run to answer the query. Given a query, the goal is to select one or more samples that meet the time or error constraints the user provides, and then answer the query using these samples. Sample choice primarily depends on the set of columns that are being queried for. BlinkDB uses this information to select samples from the cluster by finding samples that have a high selectivity, defined as the ratio of the number of rows selected from a sample to the number of rows in the entire sample.
Once the initial set of samples is decided, the runtime engine selects a sample and picks a sub-sample of it that meets the predefined query response time for this request. The sub-sample chosen also affects the accuracy of the query result. To determine this accuracy, an error latency profile or ELP is computed based on the selectivity of the query and the response time constraints.
Lastly, once the sub-sample is chosen, BlinkDB adjusts for any statistical bias attributed to using stratified samples.