Paper Review: WebTables: Exploring the Power of Tables on the Web

Title and Author of Paper WebTables: Exploring the Power of Tables on the Web. M.J. Cafarella et al. Summary WebTables is a project to extract and process HTML tables from Google’s serach index. It attempts to answer two questions: what are some effective techniques for searching structured data at search engine scale, and what can be derived from analyzing a large corpus of HTML tables? Web documents often contain structured and relational data embedded in HTML tables....

March 29, 2017 · 4 min · Kevin Sookocheff

Paper Review: Combining Systems and Databases: A Search Engine Retrospective

Title and Author of Paper Combining Systems and Databases: A Search Engine Retrospective. Eric A. Brewer. Summary Search engines manage data and respond to queries, which provides some similarities to databases. However, search engines are really an application-specific system built to handle large datasets. This system can leverage databases, or not, depending on the system goals. This paper describes a search engine design that leverages the ideas and vocabulary of the database community....

March 27, 2017 · 6 min · Kevin Sookocheff

Paper Review: The Anatomy of a Large-Scale Hypertextual Web Search Engine

Title and Author of Paper The Anatomy of a Large-Scale Hypertextual Web Search Engine. Sergey Brin and Lawrence Page. Summary This paper describes the underpinnings of the Google search engine. The paper presents the initial Google prototype and describes the challenges in scaling search engine technology to handle large datasets. At the time of writing, the main goal of Google is to improve the quality of web searches by taking advantage of the existing link data embedded in web pages to calculate the quality of a page....

March 23, 2017 · 5 min · Kevin Sookocheff

Paper Review: Consistency Analysis in Bloom: a CALM and Collected Approach

Title and Author of Paper Consistency Analysis in Bloom: a CALM and Collected Approach. Alvaro et al. Summary Distributed programming is difficult for even experienced developers to get correct. Understanding the tradeoff between consistency, availability, and latency, while guaranteeing data correctness, provides a wealth of problems for the application developer. This paper presents a language and method for programmatically verifying distributed consistency. CALM - Consistency and Logical Monotonicity There is a connection between distributed consistency algorithms and logical monotonicity, that is, our programs must be correct even in the face of the delay and re-ordering of messages and data across different nodes in a system....

March 22, 2017 · 4 min · Kevin Sookocheff

Paper Review: The CQL continuous query language: semantic foundations and query execution

Title and Author of Paper The CQL continuous query language: semantic foundations and query execution. Arasu et al. Summary CQL is a derivation of the SQL query language developed for running continuous queries over streams of data. The goals of the system are to provide a precise set of language semantics for running such continuous stream workloads. The paper starts by defining precise abstract semantics for continuous queries that cover two data types — streams and relations — and three classes of operators: ones that produce a relation from a stream, one that produces a relation from other relations, and one that produces a stream from a relation....

March 17, 2017 · 5 min · Kevin Sookocheff

Paper Review: BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data

Title and Author of Paper BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. Agarwal et al. Summary BlinkDB is a massively parallel database that provides approximate results for queries over large data sets. BlinkDB’s distinguishing feature is providing the opportunity for users to trade response time for query accuracy — partial results are returned with annotated error bars describing their accuracy at the current point in time....

March 4, 2017 · 3 min · Kevin Sookocheff

Paper Review: Informix under CONTROL: Online Query Processing

Title and Author of Paper Informix under CONTROL: Online Query Processing. J. M. Hellerstein et al. Summary The CONTROL project attempts to improve the interaction between users and computers during data analysis. Traditional data analysis systems are a black box where a user enters a query, and waits for some amount of time before receiving a result. The CONTROL project aims to make this process interactive by continuously providing approximate results that are improved over time....

March 3, 2017 · 4 min · Kevin Sookocheff

Paper Review: An Array-Based Algorithm for Simultaneous Multidimensional Aggregates

Title and Author of Paper An Array-Based Algorithm for Simultaneous Multidimensional Aggregates. Y. Zhao et al. Summary One of the core functions of an OLAP system is computing aggregations and group-by operations. This functionality has been characterized by the “Cube” operator, which computes group-by aggregations over all possible subsets of a specified dimension. As an example of the Cube operator, consider a model with the dimensions product, store, date, and the measured value sales....

February 20, 2017 · 5 min · Kevin Sookocheff

Paper Review: Implementing Data Cubes Efficiently

Business intelligence and analytics use cases involve complex queries on potentially very large databases. To minimize query response times, query optimization is critical. One approach to optimizing query response times is to precompute relevant values ahead of time, and to use those precomputed results to answer queries. Unfortunately, it is not always feasible to precompute every potential value that is required to answer arbitrary queries. This paper describes a framework and presents algorithms that pick a good subset of queries to precompute to optimize response time....

January 14, 2017 · 3 min · Kevin Sookocheff

Paper Review: Robust Query Processing through Progressive Optimization

Title and Author of Paper Robust Query Processing through Progressive Optimization. Markl et al. Summary Traditional query optimizers choose an execution plan for a query by using estimates of current database statistics. However, these estimates may be inaccurate, leading to overly expensive query plans being chosen and executed. This paper presents progressive query optimization, allowing query execution to detect and recover from estimation errors during processing. During each execution step, progressive query optimization (POP) detects differences between the cardinality of the currently processed tuple and compares that to the estimated cardinality that was used to define the original execution plan....

November 18, 2016 · 3 min · Kevin Sookocheff