Title and Author of Paper
The Anatomy of a Large-Scale Hypertextual Web Search Engine. Sergey Brin and Lawrence Page.
This paper describes the underpinnings of the Google search engine. The paper presents the initial Google prototype and describes the challenges in scaling search engine technology to handle large datasets. At the time of writing, the main goal of Google is to improve the quality of web searches by taking advantage of the existing link data embedded in web pages to calculate the quality of a page.
The key observation of Google is that the link graph embedded in the web is a valuable resource for ranking the quality of web pages. Google uses the number of links entering a given web page as an approximation of the page’s quality. This rank is normalized by the number of links on a page. The final calculation is the page’s PageRank. Intuitively, PageRank is a proxy for user behaviour on the web, wherein if a page is highly cited (linked to) it has a higher PageRank.
A key difficulty in realizing the PageRank algorithm on the web is scaling its computation and results to web-sized data. First, distributed web crawlers are run to download pages from the web. Each crawler is sent a list of URLs to be fetched. Each fetched page is then sent to a storage server to be compressed and stored. Each page is assigned a document id corresponding to the URL of the page. After storage, an indexer reads the data from storage, uncompresses the documents, and parses them. Those parsed documents are then divided into word occurrences and all links are parsed out of each page. This indexing operation updates a link database storing all parsed link data; individual word data is used to generate an inverted index mapping words to documents those words come from.
Major Data Structures
To build a large-scale search engine requires thinking about how to store documents with as little cost as possible. Google was designed to avoid disk seek costs whenever possible.
In Google, BigFiles are virtual files that span multiple real file systems. BigFiles is a package that handles allocating files across file systems automatically, and handles managing file descriptors for accessing the files.
The repository contains the full HTML of every web page crawled, after compressing those pages with zlib. Within the repository, each document is stored along with the document id, the length of the document, and the URL it was retrieved from. The repository acts as the source of truth for data, and all other data structures can be rebuilt from the repository when necessary.
The document index stores a pointer to the document in the repository, a checksum of the data, and various statistics. The document is stored as an ISAM index ordered by document id.
The lexicon tracks the different words that make up the corpus of documents. The lexicon is stored as a list of words concatenated together, and a hash table of pointers to words for fast lookup.
A hit list corresponds to the list of occurrences of a particular word in the lexicon in a document. The hit list encodes the font, position in the document, and capitalization of the word. The authors use a hand optimized encoding scheme to minimize the space required to store the list.
The forward index stores a mapping between document id, word ids, and the hit list corresponding to these words.
The inverted index maps between word ids and document ids. This list index provides the representation of the occurrences of a word in all documents.
A large portion of search engine development is crawling the web and downloading pages to be added to the index. Crawling also presents unique challenges since it must deal millions of different web servers and pages with which it has no control over. In the Google prototype, a single URL server forwards lists of URLs to distributed web crawlers that download pages for indexing. Given the vastness of the web, there are hundreds if not thousands of obscure problems running web crawlers at scale that must be developed in to the crawlers themselves.
Given the data crawled and indexed, we can start running search queries on it. The Google search algorithm runs the following set of steps:
- Parse the query.
- Convert words into word ids.
- Seek to the start of the doclist for every word.
- Scan through the doclist until there is a document matching all the search terms.
- Compute the rank of that document for the query.
- If we are at the end of a doclist, seek to the start of the next doclist and repeat at step 4.
- If we are not at the end of any doclist, go to step 4.
- Sort the documents that match by rank, and return the top k.
Google was born from humble beginnings as a research project in building a scalable search engine. In doing so, it implements a complex production system able to operate in the “real world” of the web. A central message is that web services must account for enormous variety and quantity of data to operate on the web at large.