Title and Author of Paper

Access Path Selection in a Relational Database Management System. P. G. Selinger et al.

Summary

This paper describes methods of the SQL query optimizer for determining the cost of satisfying a query. It also describes methods for choosing among several competing methods.

What are the motivations for this work?

SQL is a high-level language where requests for data are stated non-procedurally. The user is not expected to need any knowledge of how the data is stored in the database or how it is retrieved. Thus, it is up to the DBMS to choose an appropriate access path for data retrieval on the users behalf. By designing the database in this fashion, we preserve data independence, where a users view of the data is independent of the databases view of the data.

What is the proposed solution?

The optimizer performs access path selection by first determining the best evaluation order of each query block (represented as a SELECT, FROM and WHERE clause). For each query block, each of the relations in the FROM list are evaluated, and if more than one relation is in a block, permutations of the join order and of join methods are also evaluated. The access path that minimizes the total cost of is chosen.

The cost for accessing a single relation is the cost of doing a sequential scan either of an index for the relation or of an un-indexed page in the database. The cost for single relations is evaluated as the number of page fetches plus the number of tuples being returned. The number of page fetches is an estimate of the I/O cost of accessing the data, while the number of tuples returned represents the CPU cost of evaluating the WHERE criteria on each tuple.

To facilitate easily calculating the cost of single relation access, the database keeps a catalog of statistic on the tables in the database. For each relation, it tracks the cardinality and number of pages of both the relation and any indexes defined on the relation.

Given the metadata about each relation available in the system catalog, several functions are given for estimating the number of tuples to be returned given the WHERE clause of the query. For example, to satisfy the predicate of column IN [list of values], the cost estimate is equal to the number of items in the list multiplied by the expected fraction of tuples that will satisfy column = value for each value.

For joins, two join methods are evaluated (nested loops and merging scans). Each join method has a cost associated with it that includes the cost of accessing each column and relation using the single relation access methods plus the cost of joining the data together.

When the optimizer is choosing a method for access data, it builds a tree of different cost alternatives for each access path, then prunes the tree based on estimated cost. In the end, one out of all alternative access paths is chosen and used to return data to the user.

What are the contributions?

Many of today’s database systems adopt the basic methods of this paper, even if the details of cost functions may vary.

What are future directions for this research?

This optimizer does not actually return a mathematically “optimal” solution. Rather, it provides a set of heuristics that estimate a cost that is “mostly correct”. This fact provides ample opportunity for additional research.