Title and Author of Paper
Informix under CONTROL: Online Query Processing. J. M. Hellerstein et al.
Summary
The CONTROL project attempts to improve the interaction between users and computers during data analysis.
Traditional data analysis systems are a black box where a user enters a query, and waits for some amount of time before receiving a result. The CONTROL project aims to make this process interactive by continuously providing approximate results that are improved over time. Implementing such a system requires rethinking some fundamental tenants of database systems. First, with interactive systems queries may never complete, but instead they may be halted when results are “good enough”. Second, interactive systems must be able to provide approximate results quickly while maximizing the rate at which an accurate answer is found. This paper explores the changes in database technology needed to support interactive use cases.
Algorithms for online query processing
To support online (or interactive) query processing requires some specific algorithms and some additional query operators. This includes algorithms for randomizing access to sample data, delivering data to the user using preferential algorithms, and doing real-time joins and groupings.
Randomized data access
One way to deliver partial updates to interactive queries is to randomly sample the data that supports a query to provide an estimate of the result. By doing statistical random sampling, any returned results can be accompanied by a confidence interval stating the accuracy of the result. To provide random data delivery requires data access methods that produce larger and larger samples of tables. The authors of this paper have chosen to implement random sampling by storing data in random order by using a user-defined function to generate pseudo-random numbers used for record storage. After storing the data in order, regular table scans return random samples of data.
This approach leads to two problems. First, that subsequent scans of the data always return data in the same order, which may lead to misinterpretation of the results. This can be mitigated by starting scans at arbitrary points in the table. Second, storing data in randomized order affects the design of other structures like indexes that optimize results based on storing data by shared attributes. This can be mitigated by generating secondary indexes for data — one random and the other ordered by a separate attribute.
Preferential data delivery
One of the user interface expectations is that data is processed and presented over a sequence of time. This requires data to be reordered during delivery to the user so that data of interest is presented first, allowing a user to cancel or continue a query as desired. The authors implement this functionality by allowing a user preference to do online reordering according to an “interestingness” heuristic. Users can dynamically change their definition of interesting during the course of a query. The reorder operator buffers items into main memory and requests for this data are delivered in the new order. Reordering is “best-effort”.
Ripple join algorithm
Many meaningful queries require combining data from multiple tables using join algorithms. Many existing join algorithms block until all data for the join is available. For online query processing, the authors developed a new “ripple join” algorithm for interactive queries. With this algorithm, each new tuple is joined with previously seen tuples, building out an incremental join result as new tuples become available.
Hash-based grouping
Aggregations queries using GROUP BY
statements partition tuples into
data groups by sorting or hashing the data. Implementing GROUP BY using
a hash-based algorithm is used for online aggregations because it provides
an easy to use non-blocking algorithm that acts on a tuple-by-tuple
basis.
User interface
The user interface to an interactive query processing system requires some changes from a traditional system. First, the cursor of results is returned almost immediately in an online system. Second, users can provide input to the system while a query is running. These UI issues are solved by offering an API for output and input. The output API returns a cursor along with a confidence interval of the current results. The input API allows users to cancel or pause a query.
End to end interactive query processing
This collection of algorithms and techniques can be used to build an end-to-end interactive query processing system. The authors dedicate the rest of the paper to exploring this interaction between the different technique and how they can be composed to create a fully working system. These details are rather intricate and difficult to summarize. Refer to the original paper for more details.