Title and Author of Paper

WebTables: Exploring the Power of Tables on the Web. M.J. Cafarella et al.

Summary

WebTables is a project to extract and process HTML tables from Google’s search index. It attempts to answer two questions: what are some effective techniques for searching structured data at search engine scale, and what can be derived from analyzing a large corpus of HTML tables?

Web documents often contain structured and relational data embedded in HTML tables. The WebTables project extracted 14.1 billion English-language HTML tables and filtered those down to 154 million tables that contain structured data. From this data, we have the potential to determine semantic information embedded in the web, create visualizations, and integrate web documents into new applications.

Viewing HTML tables as small relational databases raises the problem of effectively exposing this data to search engine users. This requires ranking millions of small relational tables, each with its own schema and set of rows, and matching the ranked tables to user input.

Modelling Tables

Extracting Tables

The authors wrote a system for automatically extracting tables from web documents. It uses a combination of hand-written and statistically trained classifiers to filter “good” results from “bad” ones. The actual extraction algorithm is the subject of a different paper and is not described here. For the remainder of the paper, you can assume a corpus R of “relations” that are extracted from HTML documents.

Attribute Correlation Statistics

Ranking HTML tables against search criteria requires scoring the data with an algorithm tailored to table-based data. The authors compute summary statistics for this task, called the attribute correlation statistics database, or ACSDb. The ACSDb is a set of pairs (S, c), where S is a schema (a set of attribute names) and c counts how often that schema occurs in the corpus. These statistics allow us to calculate the probability of seeing particular attributes within a table’s schema. For example, p(address) is simply the sum of all counts c for pairs whose schema contains address, divided by the total sum of all counts. The statistics also allow us to detect relationships between attribute names. For example, p(address|name) is the probability that a relation has an attribute address, given that it has an attribute name.
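The probabilities described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the toy corpus and the function names p and p_given are assumptions made for the example, and the ACSDb is assumed to be stored as (schema, count) pairs.

```python
from collections import Counter

# Hypothetical toy corpus: one set of attribute names per extracted table.
schemas = [
    frozenset({"name", "address", "phone"}),
    frozenset({"name", "address"}),
    frozenset({"name", "address"}),
    frozenset({"make", "model", "year"}),
]

# Assumed ACSDb layout: identical schemas collapse into one (schema, count) pair.
acsdb = Counter(schemas)
total = sum(acsdb.values())


def p(*attrs):
    """Probability that a schema contains all of the given attributes."""
    wanted = set(attrs)
    hits = sum(c for schema, c in acsdb.items() if wanted <= schema)
    return hits / total


def p_given(a, b):
    """Conditional probability p(a|b): a appears, given that b appears."""
    pb = p(b)
    return p(a, b) / pb if pb else 0.0


print(p("address"))                # 0.75
print(p_given("address", "name"))  # 1.0
```

In this toy corpus every schema that contains name also contains address, so p(address|name) is 1.0, while p(phone|name) is only one third.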

The WebTables search engine uses the data model embedded in the ACSDb statistics to rank relations by relevance to a query, using the keyword-based input familiar from search engines. The authors use this data to derive several ranking algorithms and combine them using a linear regression estimator trained on relevance scores from human judges. Some algorithms operate solely on word data and return results similar to a text-based search; others take advantage of the table’s schema. The best-performing algorithm is schemaRank. This ranking scheme uses features of the table such as number of rows, number of columns, hits on header values, and hits on table body values, combined with a schema coherency score derived from the ACSDb.

Intuitively, the schema coherency score ranks tables based on how closely related the table’s attributes are to one another.
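One common way to quantify how closely related two attributes are is pointwise mutual information (PMI). The sketch below assumes the coherency score is the average PMI over all attribute pairs in a schema; this summary does not spell out the formula, so treat that choice, the toy data, and the function names as assumptions.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical toy ACSDb stored as (schema, count) pairs.
acsdb = Counter({
    frozenset({"name", "address", "phone"}): 1,
    frozenset({"name", "address"}): 2,
    frozenset({"make", "model", "year"}): 1,
})
total = sum(acsdb.values())


def p(*attrs):
    """Probability that a schema contains all of the given attributes."""
    wanted = set(attrs)
    return sum(c for s, c in acsdb.items() if wanted <= s) / total


def pmi(a, b):
    """Pointwise mutual information of two attribute names."""
    joint = p(a, b)
    if joint == 0:
        # The attributes never co-occur: treat them as maximally unrelated.
        return float("-inf")
    return math.log(joint / (p(a) * p(b)))


def coherency(schema):
    """Average PMI over all unordered attribute pairs (assumed definition)."""
    pairs = list(combinations(sorted(schema), 2))
    if not pairs:
        return 0.0
    return sum(pmi(a, b) for a, b in pairs) / len(pairs)


print(coherency({"name", "address"}))  # positive: the pair co-occurs often
print(coherency({"name", "make"}))     # -inf: the pair never co-occurs
```

Under this definition a table whose attributes frequently appear together scores higher than one mixing unrelated attributes, which matches the intuition above.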

Applications of the ACSDb

Having the data embedded in the ACSDb allows for a few novel applications. First, the ACSDb allows for schema auto-completion, which can assist database users who are designing a database system: the auto-complete system can guess appropriate attribute names based on previous data points.
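A simple way to picture schema auto-completion is a greedy loop: starting from the attributes the user has already entered, repeatedly suggest the attribute most likely to co-occur with them, and stop when that probability drops below a threshold. The following is a simplified sketch under that assumption, not the authors' exact algorithm; the toy data, the suggest function, and the 0.5 threshold are all hypothetical.

```python
from collections import Counter

# Hypothetical toy ACSDb stored as (schema, count) pairs.
acsdb = Counter({
    frozenset({"name", "address", "phone"}): 1,
    frozenset({"name", "address"}): 2,
    frozenset({"make", "model", "year"}): 1,
})


def count_containing(attrs):
    """Total count of schemas that contain every attribute in attrs."""
    return sum(c for schema, c in acsdb.items() if attrs <= schema)


def suggest(seed, threshold=0.5):
    """Greedily extend the seed attributes with likely companions."""
    chosen = set(seed)
    suggestions = []
    candidates = set().union(*acsdb) - chosen
    while candidates:
        base = count_containing(chosen)
        if base == 0:
            break  # no schema in the ACSDb matches the current attributes

        def likelihood(attr):
            # p(attr | chosen), estimated from the schema counts
            return count_containing(chosen | {attr}) / base

        # Sort candidates so that ties are broken deterministically.
        best = max(sorted(candidates), key=likelihood)
        if likelihood(best) < threshold:
            break
        suggestions.append(best)
        chosen.add(best)
        candidates.remove(best)
    return suggestions


print(suggest({"name"}))  # ['address']
print(suggest({"make"}))  # ['model', 'year']
```

The threshold trades recall for precision: lowering it to 0.3 would also suggest phone after name and address in this toy data.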

A second application of the ACSDb is suggesting attribute synonyms. This can help users find column headers that are synonyms of one another, and it can also help the search engine return relevant results.

Lastly, the database can be used to artificially “join” relations in a join graph. That is, related relations can be returned based on the schema information present in the search results.

The authors present experimental results for each of these applications, demonstrating the usefulness of the ACSDb as a statistical model of relational table data.

Conclusions

By extracting a search engine scale sample of HTML tables from web documents, the authors were able to derive the ACSDb, a statistical summary of table schemas. The ACSDb can be used for some novel applications in search engine technology. This paper shows the value of statistical analysis on large datasets and provides some exploratory work toward realizing the semantic information embedded in the web.