Paper Review: Combining Systems and Databases: A Search Engine Retrospective

Title and Author of Paper

Combining Systems and Databases: A Search Engine Retrospective. Eric A. Brewer.

Summary

Search engines manage data and respond to queries, which provides some similarities to databases. However, search engines are really an application-specific system built to handle large datasets. This system can leverage databases, or not, depending on the system goals. This paper describes a search engine design that leverages the ideas and vocabulary of the database community.

Design Principles

The problem with using existing DBMS systems to build a search engine is that they are not traditionally designed to handle full-text search. Search engines also do not need strict ACID semantics, and usually prefer high availability over consistency. Furthermore, search engines are able to clearly separate updating data from querying data, and each portion can be optimized as needed.

An observation made by the author is that, although search engines largely involve complex data management, they generally do not use databases to do so. However, many of the design principles that guide database implementations can be leveraged for search engine development. These design principles are:

Top-Down Design

Traditional databases are designed “top-down” by focusing first on the desired semantics (ACID), and developing systems that implement these semantics. Search engines can be designed the same way by clearly specifying the desired semantics before implementing the system.

Data Independence

Data in DBMS systems are represented as sets and relations. This representation is independent from data store and processing. Successful search engines can follow the same pattern.

Declarative Query Language

The SQL language specifies “what” to return, not “how” to return it. Removing “how” from a query allows for powerful query optimizations that are independent of the query itself.

Search Engine Top-Down Design

If designing a search engine from scratch, the focus should be on supporting many concurrent read-only queries, with a focus on availability more than consistency. This focus leads to an architecture where the database is largely static, and updated using asynchronous workers that rebuild and update the database periodically.

Crawling, Index, Serve

Providing the basic operation of search engine involves three steps: crawling, indexing, and serving.

First, web crawlers fetch documents from the web and output them as lists or maps of “visited” sites. Second, an indexer parses and interprets documents, outputting a chunk of data that serves as a portion of the static database. The indexer scores and ranks documents, minimizing the work of the query processor. Finally, the server executes queries against the chunks output by the indexer.

Queries

Search engine queries differ from SQL queries, but can still undergo similar query plan optimization steps. A search engine query defines words and some properties that matching document should or should not contain. Queries combine words and properties such as query language, or boolean conditions. In the simplest case, a query is a list of terms that matching documents must contain. In more complex cases, boolean expressions and properties affect query interpretation.

Search engine queries can be represented as logical query plans that can be implemented by a separate optimizer. The author contends that the separation of query parsing and planning for execution is a key result that can be leveraged from the database community. In a search engine, a query plan is tuned to a very specific use case and need not support the general query semantics of traditional SQL queries. In the search engine case, plans are represented as joins on matching document ids, with some filtering where needed.

To implement a query plan, we can define some key physical operators for accessing data. These are:

  • OR: multiway outer join with scoring.
  • ORp: multiway outer join without scoring.
  • ANDp: multiway inner join without scoring.
  • FILTER: multiway inner join with scoring for expressions (not properties).

Most queries map into a sequence of AND operations on all of the inputs, with further expressions used to compute the final score.

To optimize the query, you must map the logical query representation, exploit any cached results, and minimize the number of joins required to compute the result set (all steps familiar to any database implementer).

One wrinkle with search engines is that the size of the datasets require query plans to be executed on clusters of machines. The author’s have designed their system to send the entire query to each node, and have the node execute the query on their local data set (which may differ from the data on other nodes). Result sets are then joined using a distributed hash join and returned to the user.

Updates

The search engine index must be updated to return fresh query results to users. In a search engine, each node is independent and can be updated without concern for other nodes. The authors also introduce a constraint that whole tables are updated a time, rather than individual rows within a table.

To perform a full update, the first step is to get new content through crawling. The result of the crawl is used by an indexer to update a chunk of data that the indexer is responsible for. The chunk corresponds to a range of documents. After a chunk has been updated by the indexer, an atomic update to the chunk can be performed on the master data set. Caches must be invalidated where necessary.

In some cases, documents must be updated or deleted in real-time (not as part of the indexing atomic update). The approach taken by the author is to add a new row to the chunk that means “item deleted”, and add that filter to each query. The same mechanism can be used to handle updates. Periodically, the whole table can be updated with this data using the same process as updating a chunk through indexing.

Fault Tolerance

The primary goal of fault tolerance in search engines is high availability. This problem can be sub-divided by first determining what portions of the system must be highly available. Query servers rely on snapshots of data, and the process of creating snapshots using crawling and indexing does not need to be highly available. Therefore, the snapshots themselves are the primary target for high availability. This amounts to having the nodes running queries coordinate a master for each query, and electing a new master when a query fails. Failures in follower nodes are detected by the master and a new follower is selected to execute the query. Disk failures are handled by using replication.

To recover from a complete disaster, clients can try alternate masters in alternate data centers to resolve a query. Crawling and indexing can be restarted at will since they are idempotent.

Discussion

Search engines can leverage much of the top-down design work of database systems to produce a cleaner separation between semantics and implementation. This allows for the semantics and implementation to evolve independently, and provides clear partitioning of system responsibilities. This same strategy can be used when designing any data-intensive application.

Like this post? Subscribe via RSS or email to never miss an update.