Title and Author of Paper
WebTables: Exploring the Power of Tables on the Web. M.J. Cafarella et al.
Summary
WebTables is a project to extract and process HTML tables from Google’s search index. It attempts to answer two questions: what are some effective techniques for searching structured data at search engine scale, and what can be derived from analyzing a large corpus of HTML tables?
Web documents often contain structured and relational data embedded in HTML tables. The WebTables project extracted 14.1 billion English language HTML tables and further filtered those down to 154 million tables that contain structured data. From this data, we have the potential to determine semantic information embedded in the web, create visualizations, and integrate web documents into new applications.
Viewing HTML tables as small relational databases raises the problem of effectively exposing this data to search engine users. This requires ranking millions of small relational tables, each with its own schema and set of tuples, and matching the ranked tables to user input.
Modelling Tables
Extracting Tables
The authors wrote a system for automatically extracting tables from web documents. It used a combination of hand-written and statistically trained classifiers to filter “good” results from “bad” ones. The actual extraction algorithm is the subject of a different paper and is not described here. For the remainder of the paper, you can assume a corpus R of “relations” extracted from HTML documents.
Attribute Correlation Statistics
Ranking HTML tables against search criteria requires a scoring algorithm tailored to table-based data. For this task the authors compute a set of summary statistics called the attribute correlation statistics database, or ACSDb. These statistics, stored as pairs of a schema and a count c, allow for the calculation of the probability of seeing attributes within a given table’s schema. For example, p(address) is simply the sum of all counts c for pairs whose schema contains address, divided by the total sum of all counts. The statistics also allow us to detect relationships between attribute names. For example, p(address|name) returns the probability of having an attribute address in a relation, given that the attribute name exists.
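The two probabilities described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the toy ACSDB contents and the function names probability and conditional are assumptions made up for the example.

```python
# Hypothetical toy ACSDb: each entry pairs a schema (a set of attribute
# names) with a count c. The values are invented for illustration.
ACSDB = [
    (frozenset({"name", "address", "phone"}), 3),
    (frozenset({"name", "address"}), 2),
    (frozenset({"name", "price"}), 4),
    (frozenset({"make", "model", "price"}), 1),
]

def probability(attrs, acsdb=ACSDB):
    """p(attrs): summed counts of schemas containing every attribute in
    attrs, divided by the total sum of all counts."""
    attrs = frozenset(attrs)
    total = sum(count for _, count in acsdb)
    hits = sum(count for schema, count in acsdb if attrs <= schema)
    return hits / total

def conditional(target, given, acsdb=ACSDB):
    """p(target | given) = p(target, given) / p(given)."""
    denom = probability({given}, acsdb)
    if denom == 0:
        return 0.0  # the conditioning attribute never appears
    return probability({target, given}, acsdb) / denom

print(probability({"address"}))        # p(address)
print(conditional("address", "name"))  # p(address|name)
```

With this toy data, address appears in schemas whose counts sum to 5 out of a total of 10, and name appears in schemas whose counts sum to 9, so p(address) is 5/10 and p(address|name) is 5/9.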
The WebTables search engine uses the data model embedded in the ACSDb statistics to rank relations by relevance to a query, using the keyword-based input familiar from search engines. The authors use this data to derive several ranking algorithms and combine them using a linear regression estimator trained on results scored by human judges. Some algorithms operate solely on word data and return results similar to a text-based search; others take advantage of the table’s schema. The best-performing algorithm is schemaRank. This ranking scheme uses features of the table such as number of rows, number of columns, hits on header values, and hits on table body values, combined with a schema coherency score derived from the ACSDb.
Intuitively, the schema coherency score ranks tables based on how closely related the table’s attributes are to one another.
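One way to make this intuition concrete is to average the pointwise mutual information (PMI) over every pair of attributes in a schema. The summary above does not give the formula, so treat the sketch below as an assumption about how such a score could be computed; the toy ACSDB, the floor smoothing constant, and the function names are all hypothetical.

```python
import math
from itertools import combinations

# Hypothetical toy ACSDb of (schema, count) pairs, invented for illustration.
ACSDB = [
    (frozenset({"name", "address", "phone"}), 3),
    (frozenset({"name", "address"}), 2),
    (frozenset({"name", "price"}), 4),
    (frozenset({"make", "model", "price"}), 1),
]

def probability(attrs, acsdb=ACSDB):
    """Share of the total count whose schema contains every attribute in attrs."""
    attrs = frozenset(attrs)
    total = sum(count for _, count in acsdb)
    hits = sum(count for schema, count in acsdb if attrs <= schema)
    return hits / total

def pmi(a, b, acsdb=ACSDB, floor=1e-6):
    """log(p(a, b) / (p(a) * p(b))). The floor is an assumed smoothing value
    that keeps the logarithm finite when attributes never co-occur."""
    p_a = probability({a}, acsdb)
    p_b = probability({b}, acsdb)
    if p_a == 0 or p_b == 0:
        return math.log(floor)  # unseen attribute: treat as strongly unrelated
    p_ab = max(probability({a, b}, acsdb), floor)
    return math.log(p_ab / (p_a * p_b))

def coherency(schema, acsdb=ACSDB):
    """Average PMI over all attribute pairs; 0.0 if there are no pairs."""
    pairs = list(combinations(sorted(schema), 2))
    if not pairs:
        return 0.0
    return sum(pmi(a, b, acsdb) for a, b in pairs) / len(pairs)

print(coherency({"name", "address", "phone"}))  # attributes that co-occur
print(coherency({"name", "make", "phone"}))     # attributes that rarely co-occur
```

On this toy data the first schema scores above zero and the second scores well below it, because make never appears alongside name or phone.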
Applications of the ACSDb
Having the data embedded in the ACSDb allows for a few novel applications. First, the ACSDb allows for schema auto-completion, which can assist database users who are designing a database system: the auto-complete system can guess appropriate attribute names based on previous data points.
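Schema auto-completion can be sketched as a greedy loop: starting from the attributes the user has typed, repeatedly add the attribute most likely to appear given the schema so far, and stop when no candidate is likely enough. The greedy strategy, the threshold value, the toy ACSDB, and the function names below are assumptions for illustration, not details taken from the summary.

```python
# Hypothetical toy ACSDb of (schema, count) pairs, invented for illustration.
ACSDB = [
    (frozenset({"name", "address", "phone"}), 3),
    (frozenset({"name", "address"}), 2),
    (frozenset({"name", "price"}), 4),
    (frozenset({"make", "model", "price"}), 1),
]

def probability(attrs, acsdb=ACSDB):
    """Share of the total count whose schema contains every attribute in attrs."""
    attrs = frozenset(attrs)
    total = sum(count for _, count in acsdb)
    hits = sum(count for schema, count in acsdb if attrs <= schema)
    return hits / total

def autocomplete(seed, acsdb=ACSDB, threshold=0.3):
    """Greedily extend seed with the attribute a that maximizes
    p(a | schema so far), stopping once the best value is below threshold."""
    schema = list(seed)  # a list keeps the order in which attributes were added
    vocabulary = set().union(*(s for s, _ in acsdb))
    while True:
        base = probability(schema, acsdb)
        if base == 0:
            break  # the current schema was never observed; nothing to condition on
        best, best_p = None, 0.0
        for attr in sorted(vocabulary - set(schema)):  # sorted for stable ties
            p = probability(set(schema) | {attr}, acsdb) / base
            if p > best_p:
                best, best_p = attr, p
        if best is None or best_p < threshold:
            break
        schema.append(best)
    return schema

print(autocomplete(["make"]))   # ['make', 'model', 'price']
print(autocomplete(["phone"]))  # ['phone', 'address', 'name']
```

A lower threshold produces longer suggested schemas at the cost of less certain attributes.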
A second application of the ACSDb is suggesting attribute synonyms. This can help users find column headers that are synonyms of one another, and it can also be used to help return valid search results.
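A simple heuristic for finding synonym candidates is sketched below. It assumes that two synonymous attributes rarely appear in the same schema (a table has no need for both) but tend to appear alongside the same other attributes. This heuristic, the Jaccard overlap used for scoring, the toy ACSDB, and all function names are assumptions for illustration and are not taken from the summary; the counts are ignored here for simplicity.

```python
from itertools import combinations

# Hypothetical toy ACSDb of (schema, count) pairs, invented for illustration.
ACSDB = [
    (frozenset({"name", "address", "phone"}), 3),
    (frozenset({"name", "address", "telephone"}), 2),
    (frozenset({"make", "model", "price"}), 4),
    (frozenset({"make", "model", "cost"}), 1),
]

def contexts(attr, acsdb=ACSDB):
    """All attributes that appear in at least one schema together with attr."""
    seen = set()
    for schema, _ in acsdb:
        if attr in schema:
            seen |= schema - {attr}
    return seen

def co_occur(a, b, acsdb=ACSDB):
    """True if some schema contains both a and b."""
    return any(a in schema and b in schema for schema, _ in acsdb)

def synonym_candidates(acsdb=ACSDB):
    """Pairs that never share a schema, scored by the Jaccard overlap of
    their contexts; pairs with no shared context are dropped."""
    vocabulary = sorted(set().union(*(schema for schema, _ in acsdb)))
    scored = []
    for a, b in combinations(vocabulary, 2):
        if co_occur(a, b, acsdb):
            continue
        ctx_a, ctx_b = contexts(a, acsdb), contexts(b, acsdb)
        union = ctx_a | ctx_b
        score = len(ctx_a & ctx_b) / len(union) if union else 0.0
        if score > 0:
            scored.append((a, b, score))
    return sorted(scored, key=lambda item: -item[2])

for a, b, score in synonym_candidates():
    print(a, b, score)
```

On this toy data the only candidates are cost/price and phone/telephone, since each pair shares its full context and never appears together.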
Lastly, the database can be used to artificially “join” relations in a join graph. That is, related relations can be returned based on the schema information present in the search results.
The authors present experimental results for each of these applications, demonstrating the usefulness of the ACSDb as a statistical measure for relational (table) data.
Conclusions
By extracting a search-engine-scale sample of HTML tables from web documents, the authors were able to derive the statistical ACSDb. The ACSDb can be used for some novel applications in search engine technology. This paper shows the value of using statistical analysis on large datasets, and provides some exploratory work in realizing the semantic information embedded in the web.