IEEE Intelligent Systems
Trends & Controversies
Information integration
By Marti A. Hearst
University of California, Berkeley
hearst@sims.berkeley.edu

Despite the Web’s current disorganized and anarchic state, many AI researchers believe that it will become the world’s largest knowledge base. In this installment of Trends and Controversies, we examine a line of research whose final goal is to make disparate data sources work together to better serve users’ information needs. This work is known as information integration. In the following essays, our authors talk about its application to datasets made available over the Web.

Alon Levy leads off by discussing the relationship between information-integration and traditional database systems. He then enumerates important issues in the field and demonstrates how the Information Manifold project has addressed some of these, including a language for describing the contents of diverse sources and optimizing queries across sources.

Craig Knoblock and Steve Minton describe the Ariadne system. Two of its distinguishing features are its use of wrapper algorithms to extract structured information from semistructured data sources and its use of planning algorithms to determine how to integrate information efficiently and effectively across sources. This system also features a mechanism that determines when to prefetch data depending on how often the target sources are updated and how fast the databases are.

William Cohen describes an interesting variation on the theme, focusing on “informal” information integration. The idea is that, as in related fields that deal with uncertain and incomplete information, an information-integration system should be allowed to take chances and make mistakes. His Whirl system uses information-retrieval algorithms to find approximate matches between different databases, and as a consequence knits together data from quite diverse sources.

A controversy emerges in the midst of this trend, centering around the issue of whether information extraction from HTML-based Web pages is a long-standing problem. Proponents of XML (Extensible Markup Language, see www.w3.org/TR/REC-xml.html) argue that in the future information of any importance will be exchanged between programs using a well-defined protocol, rather than being displayed solely for purposes of reading using ad hoc formats in HTML. In his essay, Levy argues that the problem of extracting information from HTML markup will, as a consequence of such protocols, become less important. He notes, however, that the problem of integrating data that differs semantically will still remain. Knoblock and Minton counter that the need for HTML wrappers will remain strong, arguing that there will always be exceptions and legacy pages.

Cohen takes a different stance, suggesting that many information providers want to help inform people, but might not see a direct benefit from the investment required to form a highly structured data source. He suggests that cheap, approximate information integration, such as enabled by his system, can render these simpler sites more powerful, providing a larger benefit than any individual site developer alone could attain, and getting around the chicken-and-egg problem of who pays to make useful information available free.

On a different note, Haym Hirsh of Rutgers has signed on to help edit Trends & Controversies. To continue providing sharp, cogent debates on topics that span a wide range of intelligent systems research and applications development, he and I will be alternating installments. For his first outing next issue, Haym has lined up Barbara Hayes-Roth, Janet Murray, and Andrew Stern, who will address interactive fiction.

Marti Hearst

The Information Manifold approach to data integration
Alon Y. Levy, University of Washington

A data-integration system provides a uniform interface to a multitude of data sources. Consider a data-integration system providing information about movies from data sources on the World Wide Web. There are numerous sources on the Web concerning movies, such as the Internet Movie Database (which provides comprehensive listings of movies, their casts, directors, genres, and so forth), MovieLink (listing playing times of movies in US cities), and several sites that provide textual reviews for selected movies. Suppose we want to find which Woody Allen movies are playing tonight in Seattle and see their respective reviews. None of these data sources in isolation can answer this query. However, by combining data from multiple sources, we can answer queries like this one, and even more complex ones.

To answer our query, we would first query the Internet Movie Database to obtain the list of movies directed by Woody Allen, and then feed the result into the MovieLink database to check which ones are playing in Seattle. Finally, we would find reviews for the relevant movies using any of the movie review sites.

Most importantly, a data-integration system lets users focus on specifying what they want, rather than thinking about how to obtain the answers. As a result, it frees them from the tedious tasks of finding the relevant data sources, interacting with each source in isolation using a particular interface, and combining data from multiple sources.

Traditional database systems

To understand the challenges involved in building data-integration systems, I will briefly compare the problems that arise in this context with those encountered in traditional database systems. In this discussion, I focus mainly on comparisons with relational database systems, but the differences also hold for systems based on other models, such as object-oriented and object-relational ones. Figure 1 illustrates the different stages in processing a query in a data-integration system.

Data modeling. Traditional database systems and data-integration systems differ mainly in the process they use to organize data into an application. In a traditional system, the application designer examines the application’s requirements, designs a database schema (such as a set of relation names and the attributes of each relation), and then implements the application, part of which involves actually populating the database (inserting tuples into the tables).

In contrast, a data-integration application begins from a set of pre-existing data sources. These sources might be database systems, but more often are unconventional data sources, such as structured files, legacy
systems, or Web sites. Here the application builder must design a mediated schema on which users will pose queries. The mediated schema is a set of virtual relations, in that they are not actually stored anywhere. The mediated schema is designed manually for a particular data-integration application. For example, in the movie domain, the mediated schema might contain the relations MovieInfo(id, title, genre, country, year, director) describing the different properties of a movie, the relation MovieActor(id, name) representing a movie’s cast, and MovieReview(id, review) representing reviews of movies.

Along with the mediated schema, the application designer needs to supply descriptions of the data sources. The descriptions specify the relationship between the relations in the mediated schema and those in the local schemas at the sources. (Even though not all the sources are databases, we model them as having schemas at the conceptual level.) An information-source description specifies

• the source’s contents (for example, contains movies),
• the attributes found in the source (genre, cast),
• constraints on the source’s contents (contains only American movies),
• the source’s completeness and reliability, and finally,
• its query-processing capabilities (can perform selections or can answer arbitrary SQL queries).

Because the data sources are preexisting, data in the sources might be overlapping and even contradictory. Furthermore, we might face the following problems:

• Semantic mismatches between sources. Because each data source has been designed by a different organization for different purposes, the data is modeled in different ways. For example, one source might store a relational database that stores all of a particular movie’s attributes in one table, while another source might spread the attributes across several relations. Furthermore, the names of the attributes and tables will differ from one source to another, as will the choice of what should be a table and what should be an attribute.
• Different naming conventions. Sources use different names to refer to the same object. For example, the same person might be called “John Smith” in one source and “J.M. Smith” in another.

Alon Y. Levy is a faculty member at the University of Washington’s Computer Science and Engineering Department. His research interests are Web-site management, data integration, query optimization, management of semistructured data, description logics and their relationship to database query languages, abstractions and approximations of computational theories, and relevance reasoning. He received his PhD in computer science from Stanford University and his undergraduate degree at Hebrew University. Contact him at the Dept. of Computer Science and Engineering, Sieg Hall, Room 310, Univ. of Washington, Seattle, WA, 98195; alon@cs.washington.edu; http://www.cs.washington.edu/homes/alon/.

Craig Knoblock is a project leader and senior research scientist at the Information Sciences Institute, a research assistant professor in the Computer Science Department, and a key investigator in the Integrated Media Systems Center at the University of Southern California. His research interests include information gathering and integration, automated planning, machine learning, knowledge discovery, and knowledge representation. He received his BS from Syracuse University and his MS and PhD from Carnegie Mellon University, all in computer science. Contact him at USC/ISI, 4676 Admiralty Way, Marina del Rey, CA 90292; knoblock@isi.edu; http://www.isi.edu/~knoblock.

Steve Minton is a senior computer scientist at the Information Sciences Institute and a research associate professor in the Computer Science Department at the University of Southern California. His research interests are in machine learning, planning, scheduling, constraint-based reasoning, and program synthesis. He received his BA in psychology from Yale University and his PhD in computer science from Carnegie Mellon University. He founded the Journal of Artificial Intelligence Research and served as its first executive editor. He was recently elected to be a fellow of the AAAI. Contact him at USC/ISI, 4676 Admiralty Way, Marina del Rey, CA 90292; minton@isi.edu; http://www.isi.edu/sims/minton/homepage.html.

William Cohen is a principal research staff member in the department of Machine Learning and Information Retrieval Research at AT&T Labs-Research. In addition to information integration, his research interests include machine learning, text categorization, learning from large datasets, computational learning theory, and inductive logic programming. He received a bachelor’s degree from Duke and a PhD from Rutgers, both in computer science. Contact him at AT&T Labs-Research, 180 Park Avenue, Florham Park NJ 07932-0971; wcohen@research.att.com; http://www.research.att.com/~wcohen/.

Query optimization and execution. A traditional relational-database system accepts a declarative query in SQL. The system first parses the query before passing it to the query optimizer. The optimizer produces an efficient query-execution plan for the query, which is an imperative program that specifies exactly how to evaluate the query. In particular, the plan specifies the order for performing the query’s operations (join, selection, and projection), the method for implementing each operation (such as sort-merge join or hash join), and the scheduling of the different operators (where parallelism is possible). Typically, the optimizer selects a query-execution plan by searching a space of possible plans and comparing their estimated costs. To evaluate a query-execution plan’s cost, the optimizer relies on extensive statistics about the underlying data, such as the sizes of relations, sizes of domains, and selectivity of predicates. Finally, the query-execution plan passes to the query-execution engine, which evaluates the query.

The traditional database and the data-integration contexts differ primarily in that the optimizer has little information about the data, because the data resides in remote autonomous sources rather than locally. Furthermore, because the data sources are not necessarily database systems, the sources appear to have different processing capabilities. For example, one data source might be a Web interface to a legacy information system, while another might be a program that scans data stored in a structured file (such as bibliography entries). Hence, the query optimizer must consider the possibility of exploiting a data source’s query-processing capabilities. Query optimizers in distributed database systems also consider where parts of the query are executed, but in that context
SEPTEMBER/OCTOBER 1998 13
the different processors have identical capabilities. Finally, because data must be transferred over a network, the query optimizer and the execution engine must be able to adapt to data-transfer delays.

Query reformulation. A data-integration system user poses queries in terms of the mediated schema, rather than directly in the schema where the data resides. As a consequence, a data-integration system must first reformulate a user query into a query that refers directly to the schemas in the sources. Such a reformulation step does not exist in traditional database systems. To perform the reformulation step, the data-integration system uses the source descriptions.

Wrappers. Unlike a traditional query-execution engine that communicates with the storage manager to fetch the data, a data-integration system’s query-execution plan must obtain data from remote sources. To do so, the execution engine communicates with a set of wrappers. A wrapper is a program that is specific to each data source and that translates the source’s data to a form that the system’s query processor can further process. For example, the wrapper might extract a set of tuples from an HTML file and perform translations in the data’s format.

Figure 1. Prototypical architecture of a data-integration system. (Diagram labels, top to bottom: global data model; query in mediated schema; query reformulation; query in the union of exported source schemas; query optimization; distributed query-execution plan; query execution engine; query the exported source schema; wrappers; local data model; query in the source schema.)

Semistructured data. The term semistructured data has been used with various meanings to refer to characteristics of data present in a data-integration system. To understand the importance of semistructured data, we distinguish between a lack of structure at the physical level versus one at the logical level. With lack of structure at the physical level, structured data (for example, tuples) are embedded in a file containing additional markup information, such as an HTML file. Extracting the actual values from the HTML file can be a very complex task, and is one that the source’s wrapper performs.

Most work on semistructured data concerns lack of structure at the logical level. In this context, semistructured data refers to cases in which the data does not necessarily fit into a rigidly predefined schema, as required in traditional database systems. This might arise because the data is very irregular and hence can be described only by a relatively large schema. In other cases, the schema might be rapidly evolving, or not even declared at all; it might be implicit in the data. The database community has developed several methods to model and query semistructured data and is currently considering the issues of query optimization and storage for such data [1,2]. Building data-integration systems based on a semistructured data model has two main advantages:

• In many cases, the data in the sources is indeed semistructured at the logical level.
• The models developed for semistructured data can cleanly integrate data coming from multiple data models, such as relational, object-oriented, and Web-data models.

Classes of data integration applications. The two main classes of data-integration applications are integration of data sources on the Web and within a single company or enterprise. In the latter case, the sources are not as autonomous as they are on the Web, but the requirements from a data-integration system might be more stringent.

The Information Manifold Project

In this project, we wanted to develop a system that would flexibly integrate many data sources with closely related content and that would answer queries efficiently by accessing only the sources relevant to the query. The remainder of this essay will describe its main contributions [3–6].

The AI and DB approach. We based our approach in designing the Information Manifold on the observation that the data-integration problem lies at the intersection of database systems and artificial intelligence. Hence, we searched for solutions that combine and extend techniques from both fields. For example, we developed a representation language and a language for describing data sources that were simple from the knowledge-representation perspective, but that had the necessary added flexibility compared with previous techniques developed in the database community.

Source description language. Most importantly, the Information Manifold provides a flexible mechanism for describing data sources. This mechanism lets users describe complex constraints on a data source’s contents, thereby letting them distinguish between sources with closely related data. Also, this mechanism makes it easy to add or delete data sources from the system without changing the descriptions of other sources. Informally, the contents of a data source are described by a query over the mediated schema. For example, we might describe a data source as containing American movies that are all comedies and were produced after 1965 (source 1 in Figure 2). As another example, we can describe sources whose schema differs significantly from the one in the mediated schema. For instance, we can describe a source in which a movie’s year, genre, actor, and review attributes already appear in one table (source 2 in Figure 2). This source is modeled as containing the result of a join over a set of relations in the mediated schema. For some queries, extracting data from this source might be cheaper than from others if the join computed in the source is indeed required for the query.

Figure 2. Data-source descriptions.

Source 1:
    select title, year, director
    from MOVIEINFO
    where genre = COMEDY
          year ≥ 1965
          country = USA

Source 2:
    select title, genre, review
    from MOVIEINFO M, MOVIEREVIEW R
    where m.id = r.id

The Information Manifold employed an expressive language, Carin, for formulating queries and for representing background knowledge about the relations in the mediated schema. Carin [7] combined the expressive power of the datalog database-query language (needed to model relational sources) and Description Logics, which are knowledge-representation languages designed especially to model complex hierarchies that frequently arise in data-integration applications.

Query-answering algorithms. We developed algorithms for answering queries using the information sources. Recall that user queries are posed in terms of the mediated schema. Hence, the main challenge in designing the query-answering algorithms is to reformulate the query such that it refers to the relations in the data sources. Our algorithms were the first to guarantee that only the relevant set of data sources is accessed when answering a query, even in the presence of sources described by complex constraints.

It is interesting to note the difference between our approach to query answering and that employed in the SIMS and Ariadne projects described in Craig Knoblock’s and Steve Minton’s companion essay. In their approach, even though they used a knowledge-representation system for specifying the source descriptions, they used a general-purpose planner to reformulate a user query into a query on the data sources. In contrast, our approach uses the reasoning mechanisms associated with the underlying knowledge-representation system to perform the reformulation. Aside from the natural advantages obtained by treating the representation and the query reformulation within the same framework, our approach can provide better formal guarantees on the results and can benefit immediately from extensions to the underlying knowledge-representation system.

Handling completeness information. In general, sources on the Web are not necessarily complete for the domain they are covering. For example, a computer science bibliography source is unlikely to contain all the references in the field. However, in some cases, we can assert local completeness statements about sources [6]. For example, the DB&LP Database (http://www.informatik.uni-trier.de/ley/db/) contains the complete set of papers published in some of the major database conferences. Knowledge of a Web source’s completeness can help a data-integration system in several ways. Most importantly, because a negative answer from a complete source is meaningful, the data-integration system can prune access to other sources. In the Information Manifold, we developed a method for representing local-source completeness and an algorithm for exploiting such information in query answering [5].

Using probabilistic information. The Information Manifold pioneered the use of probabilistic reasoning for data integration (representing another example of the combined AI and DB approach to the data-integration problem) [6]. When numerous data sources are relevant to a given query (such as bibliographic databases available for a topic search), a data-integration system needs to order the access to the data sources. Such an ordering is dependent on the overlap between the sources and the query and on the coverage of the sources. We developed a probabilistic formalism for specifying and deducing such information and algorithms for ordering the access to data sources given such information [6].

Exploiting source capabilities. Sources often have different query-processing capabilities. For example, one source might be a full-fledged relational database, while another might be a Web site with a very specific form interface that supports only a limited set of queries and that requires certain inputs be provided to it. The Information Manifold developed several novel algorithms for adapting to differing source capabilities. When possible, to reduce the amount of processing done locally, the system would fully exploit the query-processing capabilities of its data sources [4,9]. In addition, we developed a mechanism for describing source capabilities that is a natural extension of our method for describing source contents [9].

What we did as a community

In the past few years, each of the groups working on data integration has made significant individual progress (see this magazine’s Web page for a list of projects; http://computer.org/intelligent). However, we have also made progress as a community. In particular, we have developed a common set of terms and dimensions along which we can now compare our work more rigorously [12]. For example, we can now compare the relative expressive power of our source-description languages. We have a set of properties along which we can compare our query-answering algorithms (such as, do they guarantee accessing only relevant sources or a minimal number of sources?). We can also compare features of our systems (do they assume sources are complete, can they handle local completeness, and can they compare directly between sources?).

We need to take this progress into account as we address the challenges that lie ahead. Our common terminology will enable us (and should force us) to compare systems more rigorously, either theoretically or experimentally. To proceed, we must also develop a set of data-integration benchmarks, along which we can experimentally compare our data-integration systems.

The immediate future

The data-integration problem is by no means solved. We have made significant progress in the problems relating to modeling data sources and developing methods for combining data from them via a single, integrated view. Many problems remain in that area, the most significant being the problem of name matching across sources. This problem is finally starting to be addressed in a principled manner in the Whirl system [13].

In the near future, I believe that the bulk of the work in the field should shift into other, less-attended problems, some of which I describe here.

Information presentation. Users are rarely interested in data that is simply and concisely presented. More commonly, the results of users’ queries are best seen as entry points into entire webs of data. This observation raises the question of how to build systems that enable us to flexibly design a web of information. In fact, this is exactly the problem we face in designing a richly structured Web site. The key to designing such systems is a declarative representation of a web of information’s structure. Based on such a representation, we can easily specify how to restructure the information integrated from multiple sources into a structure that users can navigate. Recently, we have developed the Strudel system [14], which is the first to apply these principles in creating Web sites.
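As a rough illustration of the relevance guarantee discussed above under "Query-answering algorithms", the following Python sketch shows how constraints in a source description can rule a source out before it is ever contacted. It is a deliberately simplified stand-in, not the Information Manifold's algorithm or the Carin language: the class names, the `equals` and `min_year` fields, and the attribute-overlap test are all assumptions made for this example. The two source descriptions mirror Figure 2.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class SourceDescription:
    """Hypothetical, simplified source description (not the Carin language)."""
    name: str
    attributes: Set[str]                                   # attributes the source exports
    equals: Dict[str, str] = field(default_factory=dict)   # "attribute = constant" constraints
    min_year: Optional[int] = None                         # constraint: year >= min_year


@dataclass
class Query:
    """A selection query over the mediated schema, in the same simplified form."""
    wanted: Set[str]                                       # attributes the user asks for
    equals: Dict[str, str] = field(default_factory=dict)   # "attribute = constant" conditions
    max_year: Optional[int] = None                         # condition: year <= max_year


def is_relevant(source: SourceDescription, query: Query) -> bool:
    """Return False only when the source provably cannot contribute answers."""
    # The source must export at least one requested attribute.
    if not source.attributes & query.wanted:
        return False
    # An equality condition that contradicts a source constraint rules the source out.
    for attr, value in query.equals.items():
        if attr in source.equals and source.equals[attr] != value:
            return False
    # Disjoint year ranges rule the source out as well.
    if (source.min_year is not None and query.max_year is not None
            and query.max_year < source.min_year):
        return False
    return True


def relevant_sources(sources: List[SourceDescription], query: Query) -> List[str]:
    """Names of the sources that survive pruning, in their original order."""
    return [s.name for s in sources if is_relevant(s, query)]


# The two sources of Figure 2, restated in this simplified form.
SOURCES = [
    SourceDescription("source1", {"title", "year", "director"},
                      equals={"genre": "COMEDY", "country": "USA"}, min_year=1965),
    SourceDescription("source2", {"title", "genre", "review"}),
]

recent_comedies = Query({"title"}, equals={"genre": "COMEDY"})
dramas = Query({"title"}, equals={"genre": "DRAMA"})
early_comedies = Query({"title"}, equals={"genre": "COMEDY"}, max_year=1959)

print(relevant_sources(SOURCES, recent_comedies))  # ['source1', 'source2']
print(relevant_sources(SOURCES, dramas))           # ['source2']
print(relevant_sources(SOURCES, early_comedies))   # ['source2']
```

In this sketch, source 1 is skipped for drama queries and for queries restricted to years before 1965, because its description says it holds only American comedies from 1965 onward. A real system would reason over much richer constraints, but the pruning idea is the same.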
.
Optimization and execution. Modern problem will become significantly less im- tem will answer the query completely; in
database systems succeed largely based on portant, given the emergence of standards other cases, the system will guide the user
the careful design of their query-optimiza- such XML and languages that will facilitate to the desirable answers.
tion and query-execution engines. Recall querying XML documents.15 Web sites that
that the query optimizer is the module in serve significant amounts of data are usually
charge of transforming a declarative query developed using some tool for serving data- References
(given, for example, in SQL) into a query- base contents. Using such tools will make it 1. P. Buneman, “Semistructured Data,” Proc.
ACM Sigact-Sigmod-Sigart Symp. Principles
execution plan, thereby making decisions easier to serve the data in XML form, rather
of Database Systems (PODS), ACM Press,
on the order of joins and the specific meth- than directly in HTML. Hence, Web sites New York, 1997, pp. 117–121.
ods for implementing each operation in the will be able to export data in XML with no 2. S. Abiteboul, “Querying Semi-Structured
plan. The query-execution engine actually added burden to the information providers. Data,” Proc. Int’l Conf. on Database Theory
evaluates the plan. Given that data integra- Of course, some Web sites that do not want (ICDT), 1997.
tion is a more general form of the problem their data to be used for integration purposes 3. A.Y. Levy, A. Rajaraman, and J.J. Ordille,
“Query Answering Algorithms for Information
addressed in database systems, they will might still only serve HTML pages, but try- Agents,” Proc. 11th Nat’l Conf. AI, AAAI
succeed only if we carefully consider the ing to integrate data from such sources is Press, Menlo Park, Calif., 1996, pp. 40–47.
design of these components in data-integra- probably a futile effort at best. 4. A.Y. Levy, A. Rajaraman, and J.J. Ordille,
tion systems. However, while the availability of data “Querying Heterogeneous Information
Two factors complicate the problems of in XML format will reduce the emphasis Sources Using Source Descriptions,” Proc.
22nd Int’l Conf. Very Large Databases (VLDB-
query optimization and execution in the on wrappers converting human-readable 96), Morgan Kaufmann, San Francisco, 1996,
context of data integration: lack of exten- data to machine-readable data, the chal- pp. 251–262.
sive statistics on the data we are accessing lenges of semantic integration I’ve men- 5. A.Y. Levy, “Obtaining Complete Answers
(unlike with relational databases) and un- tioned and the need to manage data that is from Incomplete Databases,” Proc. 22nd Int’l
predictable arrival rates of data from the structured at the logical level remains. Fur- Conf. Very Large Databases (VLDB-96), Mor-
gan Kaufmann, 1996.
sources at runtime. Here, too, a combina- thermore, the machine-learning algorithms
6. D. Florescu, D. Koller, and A.Y. Levy, “Using
tion of techniques from AI and database developed for extracting data from HTML Probabilistic Information in Data Integration,”
systems is likely to provide interesting so- pages might prove useful for the problem Proc. 23nd Int’l Conf. Very Large Databases,
lutions. In particular, in this context, the of obtaining semantic mappings. Morgan Kaufmann, 1997, pp. 216–225.
need for interleaving query optimization 7. A.Y. Levy and M.-C. Rousset, “CARIN: A
Farther down the road Representation Language Integrating Rules
and query execution is much more signifi-
and Description Logics,” Proc. Int’l Descrip-
cant. The idea of interleaving of planning Once we can build stand-alone, robust, tion Logics, 1995.
and execution has been considered in the AI planning literature in recent years [15]. In contrast, current database systems perform complete query optimization before beginning the execution. The issues of query optimization and execution are the focus of the Tukwila project underway at the University of Washington.

Obtaining source descriptions. Current systems are very good at using descriptions of the source for answering queries. However, source descriptions must still be given manually. Specifically, the problem is to obtain the semantic mapping between the content of the source and the relations in the mediated schema. If data-integration systems are really going to scale up to large numbers of sources, we must develop automatic methods for obtaining source descriptions, possibly by employing techniques from machine learning.

A nonproblem. Many data-integration efforts have focused on the problem of extracting data from HTML pages, that is, extracting tuples from documents in which data is semistructured at the physical level. This

data-integration systems, we will face the challenge of embedding such systems in more general environments. I illustrate this challenge with two examples. The first concerns extending the interaction with a data-integration system beyond simple query answering. In particular, we should be able to use the data to automate some of the tasks we routinely perform with the data. For example, the system should be able to use the data for everyday information-management tasks, such as managing our personal information (our schedules, for example) and document workflow in organizations, or for alerting different users to important events.

The second example concerns the expectations we have from the data-integration system. It is unlikely that we will be able to answer all user queries fully automatically, because there will always remain sources for which we will have only models or sources whose structure (such as natural-language text) does not enable us to reliably extract the data. Hence, we must develop an environment in which the system cooperates with the user to obtain the answer to the query. Where possible, the sys-

8. O. Etzioni and D. Weld, “A Softbot-Based Interface to the Internet,” Comm. ACM, Vol. 37, No. 7, 1994, pp. 72–76.
9. A.Y. Levy, A. Rajaraman, and J.D. Ullman, “Answering Queries Using Limited External Processors,” Proc. ACM Sigact-Sigmod-Sigart Symp. Principles of Database Systems (PODS), ACM Press, 1996.
10. H. Garcia-Molina et al., “The TSIMMIS Approach to Mediation: Data Models and Languages (Extended Abstract),” Next Generation Information Technologies and Systems (NGITS-95), 1995.
11. O. Etzioni and D. Weld, “A Softbot-Based Interface to the Internet,” Comm. ACM, Vol. 37, No. 7, 1994, pp. 72–76.
12. D. Florescu, A. Levy, and A. Mendelzon, “Database Techniques for the World-Wide Web: A Survey,” Proc. ACM Sigmod-98, ACM Press, 1998.
13. W.W. Cohen, “Integration of Heterogeneous Databases without Common Domains Using Queries Based on Textual Similarity,” Proc. ACM Sigmod-98, ACM Press, 1998, pp. 201–212.
14. M. Fernandez et al., “Catching the Boat with Strudel: Experiences with a Web-site Management System,” Proc. ACM Sigmod-98, ACM Press, 1998.
15. J. Ambros-Ingerson and S. Steel, “Integrating Planning, Execution, and Monitoring,” Proc. 13th Nat’l Conf. AI, AAAI Press, Menlo Park, Calif., 1998, pp. 83–88.
16 IEEE INTELLIGENT SYSTEMS
The Ariadne approach to Web-based information integration
Craig A. Knoblock and Steven Minton, University of Southern California

The rise of hyperlinked networks has made a wealth of data readily available. However, the Web’s browsing paradigm does not strongly support retrieving and integrating data from multiple sites. Today, the only way to integrate the huge amount of available data is to build specialized applications, which are time-consuming, costly to build, and difficult to maintain.

Mediator technology offers a solution to this dilemma. Information mediators [1–4], such as the SIMS system [5], provide an intermediate layer between information sources and users. Queries to a mediator are in a uniform language, independent of such factors as the distribution of information over sources, the source query languages, and the location of sources. The mediator determines which data sources to use, how to obtain the desired information, how and where to temporarily store and manipulate data, and how to efficiently retrieve information from the sources.

One of the most important ideas underlying information mediation in many systems, including SIMS, is that for each application there is a unifying domain model that provides a single ontology for the application. The domain model ties together the individual source models, which each describe the contents of a single information source. Given a query in terms of the domain model, the system dynamically selects an appropriate set of sources and then generates a plan to efficiently produce the requested data.

Information mediators were originally developed for integrating information in databases. Applying the mediator framework to the Web environment solves the difficult problem of gaining access to real-world data sources. The Web provides the underlying communication layer that makes it easy to set up a mediator system, because it is typically much easier to get access to Web data sources than to the underlying database systems. In addition, the Web environment means that users who want to build their own mediator application need no expertise in installing, maintaining, and accessing databases.

We have developed a Web-based version of the SIMS mediator architecture, called Ariadne [6]. In Greek mythology, Ariadne was the daughter of Minos and Pasiphae who gave Theseus the thread with which to find his way out of the Minotaur’s labyrinth. The Ariadne project’s goal is to make it simple for users to create their own specialized Web-based mediators. We are developing the technology for rapidly constructing mediators to extract, query, and integrate data from Web sources. The system includes tools for constructing wrappers that make it possible to query Web sources as if they were databases and the mediator technology required to dynamically and efficiently answer queries using these sources.

A simple example illustrates how Ariadne can be used to provide access to Web-based sources (also see the “Ariadne” sidebar). Numerous sites provide reviews on restaurants, such as Zagats, Fodors, and CuisineNet, but none are comprehensive, and checking each site can be time consuming. In addition, information from other Web sources can be useful in selecting a restaurant. For example, the LA County Health Department publishes the health rating of all restaurants in the county, and many sources provide maps showing the location of restaurants. Using Ariadne, we can integrate these sources relatively easily to create an application where people could search these sources to create a map showing the restaurants that meet their requirements.

With such an application, a user could pose requests that would generate a map listing all the seafood restaurants in Santa Monica that have an “A” health rating and whose typical meal costs less than $30. The resulting map would let the user click on the individual restaurants to see the restaurant critic reviews. (In practice, we do not support natural language, so queries are either expressed in a structured query language or are entered through a Web-based graphical user interface.) The integration process that Ariadne facilitates can be complex. For example, to actually place a restaurant on a map requires the restaurant’s latitude and longitude, which is not usually listed in a review site, but can be determined by running an online geocoder, such as Etak, which takes a street address and returns the coordinates.

Figure 3 outlines our general framework. We assume that a user building an application has identified a set of semistructured Web sources he or she wants to integrate. These might be both publicly available sources as well as a user’s personal sources. For each source, the developer uses Ariadne to generate a wrapper for extracting information from that source. The source is then linked into a global, unified domain model. Once the mediator is constructed, users can query the mediator as if the sources were all in a single database. Ariadne will efficiently retrieve the requested information, hiding the planning and retrieval process details from the user.

Research challenges in Web-based integration

Web sources differ from databases in many significant ways, so we could not simply apply the existing SIMS system to integrate Web-based sources. Here we’ll describe the problems that arise in the Web environment and how we addressed these problems in Ariadne.

Converting semistructured data into structured data. Web sources are not databases, but to integrate sources we must be able to query the sources as if they were. This is done using a wrapper, which is a piece of software that interprets a request (expressed in SQL or some other structured language) against a Web source and returns a structured reply (such as a set of tuples). Wrappers let the mediator both locate the Web pages that contain the desired information and extract the specific data off a page. The huge number of evolving Web sources makes manual construction of wrappers expensive, so we need the tools for rapidly building and maintaining wrappers.

For this, we have developed the Stalker inductive-learning system [7], which learns a set of extraction rules for pulling information off a page. The user trains the system by marking up example pages to show the system what information it should extract
SEPTEMBER/OCTOBER 1998 17
from each page. Stalker can learn rules from a relatively small number of examples by exploiting the fact that there are typically “landmarks” on a page that help users visually locate information.

[Figure 3. Architecture for information integration on the Web. Panels: “Constructing a mediator” and “Using a mediator.” Diagram labels: application developer, Web pages, application user, queries, answers, source modeling and wrapper construction, models and wrappers, query planning.]

Consider our restaurant mediator example. To extract data from the Zagats restaurant review site, a user would need to build two wrappers. The first lets the system extract the information from an index page, which lists all of the restaurants and contains the URLs to the restaurant review pages. The second wrapper extracts the detailed data about the restaurant, including the address, phone number, review, rating, and price. With these wrappers, the mediator can answer queries to Zagats, such as “find the price and review of Spago” or “give me the list of all restaurants that are reviewed in Zagats.”

In his companion essay on the Information Manifold, Alon Levy claims that the problem of wrapping semistructured sources will soon be irrelevant because XML will eliminate the need for wrapper construction tools. We believe that he is being overly optimistic about the degree to which XML will solve the wrapping problem. XML clearly is coming; it will significantly simplify the problem and might even eliminate the need for building wrappers for many Web sources. However, the problem of querying semistructured data will not disappear, for several reasons:

• There will always be applications where the providers of the data do not want to actively share their data with anyone who can access their Web page.
• Just as there are legacy Cobol programs, there will be legacy Web applications for many years to come.
• Within individual domains, XML will greatly simplify the access to sources; however, across domains people are unlikely to agree on the granularity at which information should be modeled. For example, for many applications, the mailing address is the right level of granularity to model address, but if you want to geocode an address, it needs to be divided into street address, city, state, and zip code.

Planning to integrate data in the Web environment. Another problem that arises in the Web environment is that generating efficient plans for processing data is difficult. For one, the number of sources to be integrated could be much larger than in the database environment. Also, Web sources do not provide the same processing capabilities found in a typical database system, such as the ability to perform joins. Finally, unlike relational databases, there might be restrictions on how a source can be accessed, such as a geocoder that takes the street address and returns the geographic coordinates, but cannot take the geographic coordinates and return the street address.

Ariadne breaks down query processing into a preprocessing phase and a query-planning phase. In the first phase, the system determines the possible ways of combining the available sources to answer a query. Because sources might be overlapping (an attribute may be available from several sources) or replicated, the system must determine an appropriate combination of sources that can answer the query. The Ariadne source-selection algorithm [8] preprocesses the domain model so that the system can efficiently and dynamically select sources based on the classes and attributes mentioned in the query.

In the second phase, Ariadne generates a plan using a method called Planning-by-Rewriting [9,10]. This approach takes an initial, suboptimal plan and attempts to improve it by applying rewriting rules. With query planning, producing an initial, suboptimal plan is straightforward; the difficult part is finding an efficient plan. The rewriting process iteratively improves the initial query plan using a local search process that can change both the sources used to answer a query and the order of the operations on the data.

In our restaurant selection example, to answer queries that cover all restaurants, the system would need to integrate data from multiple sources (wrappers) for each restaurant review site and filter the resulting restaurant data based on the search parameters. The mediator would then geocode the addresses to place the data on a map. The plans for performing these operations might involve many steps, with many possible orderings and opportunities to exploit parallelism to minimize the overall time to obtain the data. Our planning method provides a tractable approach to producing large, high-quality information-integration plans.

Providing fast access to slow Web sources. In exploiting and integrating Web-based information sources, accessing and extracting data from distributed Web sources is also much slower than retrieving information from local databases. Because the amount of data might be huge and the remote sources are frequently being updated, simply warehousing all of the data is not usually a practical option. Instead, we are working on an approach to selectively materialize (store locally) critical pieces of data that let the mediator efficiently perform the integration task. The materialized data might be portions of the data from an individual source or the result of integrating data from multiple sources.

To decide what information to store locally, we take several factors into account. First, we consider the queries that have been run against a mediator application. This lets the system focus on the portions of the data that will have the greatest impact on the most queries. Next, we consider both the frequency of updates to the sources and the application’s requirements for getting the most recent information. For example, in the restaurant application, even though reviews might change daily, providing information that is current within a week is probably satisfactory. But, in a
finance application, providing the latest stock price would likely be critical. Finally, we consider the sources’ organization and structure. For example, the system can only get the latitude and longitude from the geocoder by providing the street address. If the application lets a user request the restaurants located within a region of a map, it could be very expensive to figure out which restaurants are in that region because the system would need to geocode each restaurant to determine whether it falls within the region. Materializing the restaurant addresses and their corresponding geocodes avoids a costly lookup.

Once the system decides to materialize a set of information, the materialized data becomes another information source for the mediator. This meshes well with our mediator framework because the planner dynamically selects the sources and the plans that can most efficiently produce the requested data. In the restaurant example, if the system decides to materialize address and geocode, it can use the locally stored data to determine which restaurants could possibly fall within a region for a map-based query.

Ariadne

This Restaurant Location application of Ariadne shown in the first image integrates data from a variety of sources, including restaurant review sites, health ratings, geocoders, and maps.

In response to a query for all highly rated restaurants in Santa Monica with an ‘A’ health rating, the mediator finds the restaurants that satisfy the query by extracting the data directly from the relevant Web sites. The mediator also produces a map of the restaurants (second image) by converting the street addresses into latitude and longitude coordinates using an online geocoder.

Each point on the map in the second image is clickable. Selecting the point for Chinois on Main returns the detailed restaurant review directly from the appropriate restaurant review site (third
image).
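The selective-materialization idea from the main text (storing restaurant addresses together with their geocodes so that map-region queries need not call the geocoder again) can be sketched roughly as follows. This is only an illustration under stated assumptions, not Ariadne's implementation: the class `MaterializedGeocodes`, the function names, and the coordinates are all hypothetical, and a dictionary stands in for the remote geocoding service.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Coord = Tuple[float, float]  # (latitude, longitude)


class MaterializedGeocodes:
    """Hypothetical local store of address -> coordinate pairs."""

    def __init__(self, remote_geocode: Callable[[str], Coord]) -> None:
        # The remote source accepts an address and returns coordinates,
        # never the other way round.
        self._remote_geocode = remote_geocode
        self._store: Dict[str, Coord] = {}
        self.remote_calls = 0

    def lookup(self, address: str) -> Coord:
        # Call the slow remote source only for addresses not stored locally.
        if address not in self._store:
            self.remote_calls += 1
            self._store[address] = self._remote_geocode(address)
        return self._store[address]

    def in_region(self, addresses: Iterable[str], south: float, west: float,
                  north: float, east: float) -> List[str]:
        # Region queries are answered from the local store once it is filled.
        hits = []
        for address in addresses:
            lat, lon = self.lookup(address)
            if south <= lat <= north and west <= lon <= east:
                hits.append(address)
        return hits


# Invented addresses and coordinates standing in for a real geocoder.
FAKE_GEOCODER = {
    "1 Example St, Santa Monica": (34.01, -118.49),
    "2 Sample Ave, Pasadena": (34.15, -118.14),
}

geocodes = MaterializedGeocodes(lambda address: FAKE_GEOCODER[address])
addresses = list(FAKE_GEOCODER)
first = geocodes.in_region(addresses, 33.9, -118.6, 34.1, -118.4)
second = geocodes.in_region(addresses, 34.1, -118.2, 34.2, -118.1)
```

In this sketch the second region query is answered entirely from the local store, so the number of remote calls stays at one per address no matter how many map regions are requested. How stale the stored pairs may become is a separate policy question that the sketch does not address.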
Resolving naming inconsistencies across sources. Within a single site, entities (such as people, places, countries, or companies) are usually named consistently. However, across sites, the same entities might be referred to with different names. For example, one restaurant review site might refer to a restaurant as Art’s Deli and another site might call it Art’s Delicatessen. Or, one site might use California Pizza Kitchen and another site could use the abbreviation CPK. To make sense of data that spans multiple sites, our system must be able to recognize and resolve these differences.

In our approach, we select a primary source for an entity’s name and then provide a mapping from that source to each of the other sources that use a different naming scheme. The Ariadne architecture lets us represent the mapping itself as simply another wrapped information source. Specifically, we can create a mapping table, which specifies for each entry in one data source what the equivalent entity is called in another data source. Alternatively, if the mapping is computable, Ariadne can represent the mapping by a mapping function, which is a program that converts one form into another form.

We are developing a semi-automated method for building mapping tables and functions by analyzing the underlying data in advance. The basic idea is to use information-retrieval techniques, such as those described in William Cohen’s companion essay, to provide an initial mapping [11], and then use additional data in the sources to resolve any remaining ambiguities via statistical learning methods [12]. For example, restaurants are best matched up by considering name, street address, and phone number, but not by using a field such as city because a restaurant in Hollywood could be listed as either being in Hollywood or Los Angeles and different sites list them differently.

The future of Web-based integration

As more and more data becomes available, users will become increasingly less satisfied using existing search engines that return massive quantities of mostly irrelevant information. Instead, the Web will move toward more specialized content-based applications that do more than simply return documents. Information-integration systems such as Ariadne will help users rapidly construct and extend their own Web-based applications out of the huge quantity of data available online.

While information integration has made tremendous progress over the last few years [13], many hard problems still must be solved. In particular, two mostly overlooked problems deserve more attention:

• Coming up with the models or source descriptions of the information sources, a time-consuming and difficult problem that is largely performed by hand today.
• Automatically locating and integrating new sources of data, which would be enabled by solutions to the first problem. (This problem has been addressed in limited domains, such as Internet shopping [14], but the problem is still largely unexplored.)

For more information on the Ariadne
project and example applications that were built using Ariadne, see the Ariadne homepage at http://www.isi.edu/ariadne.

References
1. G. Wiederhold, “Mediators in the Architecture of Future Information Systems,” Computer, Vol. 25, No. 3, Mar. 1992, pp. 38–49.
2. H. Garcia-Molina et al., “The Tsimmis Approach to Mediation: Data Models and Languages,” J. Intelligent Information Systems, 1997.
3. A.Y. Levy, A. Rajaraman, and J.J. Ordille, “Querying Heterogeneous Information Sources Using Source Descriptions,” Proc. 22nd Very Large Databases Conf., Morgan Kaufmann, San Francisco, 1996, pp. 251–262.
4. M.R. Genesereth, A.M. Keller, and O.M. Duschka, “Infomaster: An Information Integration System,” Proc. ACM Sigmod Int’l Conf. Management of Data, ACM Press, New York, 1997, pp. 539–542.
5. Y. Arens, C.A. Knoblock, and W.-M. Shen, “Query Reformulation for Dynamic Information Integration,” J. Intelligent Information Systems, Special Issue on Intelligent Information Integration, Vol. 6, Nos. 1 and 3, 1996, pp. 99–130.
6. C.A. Knoblock et al., “Modeling Web Sources for Information Integration,” Proc. 11th Nat’l Conf. Artificial Intelligence, AAAI Press, Menlo Park, Calif., 1998, pp. 211–218.
7. I. Muslea, S. Minton, and C.A. Knoblock, “Stalker: Learning Extraction Rules for Semistructured Web-Based Information Sources,” Proc. 1998 Workshop AI and Information Integration, AAAI Press, 1998, pp. 74–81.
8. J.L. Ambite et al., Compiling Source Descriptions for Efficient and Flexible Information Integration, tech. report, Information Sciences Institute, Univ. of Southern California, Marina del Rey, Calif., 1998.
9. J.L. Ambite and C.A. Knoblock, “Planning by Rewriting: Efficiently Generating High-Quality Plans,” Proc. 14th Nat’l Conf. Artificial Intelligence, AAAI Press, 1997, pp. 706–713.
10. J.L. Ambite and C.A. Knoblock, “Flexible and Scalable Query Planning in Distributed and Heterogeneous Environments,” Proc. Fourth Int’l Conf. Artificial Intelligence Planning Systems, AAAI Press, 1998, pp. 3–10.
11. W.W. Cohen, “Integration of Heterogeneous Databases without Common Domains Using Queries Based on Textual Similarity,” Proc. ACM Sigmod-98, ACM Press, 1998, pp. 201–212.
12. T. Huang and S. Russell, “Object Identification in a Bayesian Context,” Proc. 15th Int’l J. Conf. AI, Morgan Kaufmann, 1997, pp. 1276–1283.
13. Proc. 1998 Workshop on AI and Information Integration, AAAI Press, 1998.
14. R.B. Doorenbos, O. Etzioni, and D.S. Weld, “A Scalable Comparison-Shopping Agent for the World-Wide Web,” Proc. First Int’l Conf. Autonomous Agents, AAAI Press, 1997, pp. 39–48.

The Whirl approach to information integration
William W. Cohen, AT&T Labs-Research

Search engines such as AltaVista and portal sites such as Yahoo! help us find useful online information sources. What we need now are systems to help use this information effectively. Ideally, we would like programs that answer a user’s questions based on information obtained from many different online sources. We call such a program an information-integration system, because to answer questions it must integrate the information from the various sources into a single, coherent whole.

For example, consider consumer information about computer games. Many Web sites contain information of this sort. As this essay will show, in addition to the obvious benefit of reducing the number of sites a user must visit, integrating this information has several important and nonobvious advantages.

One advantage is that often, more questions can be answered using the integrated information than using any single source. Consider two sources containing slightly different information: one source categorizes games into children’s games and adult games, and another categorizes games into arcade games, puzzle games, and adventure games. In this case, the sources must be integrated to find, say, a list of children’s adventure games. Conversely, integration can help exploit overlap among sources; for instance, one might be interested in finding games that three or more sources have rated highly, or in reading several independent reviews of a particular game.

A second and more important advantage of integration is that making it possible to combine information sources also makes it possible to decompose information so as to represent it in a clean, modular way. For example, suppose we wished to create a Web site providing some new sort of information about computer games, say, information about which games work well on older, slower machines. The simplest way of representing this information is extensionally, as a list of games having this property. By itself, however, such a list is not very valuable to end users, who are probably interested in games that not only work on their PC, but also satisfy other properties, such as being inexpensive or well-designed. To make the list more useful, we are tempted to add additional structure; for instance, we might organize the games in the list into categories and provide, for each game, links to online resources, such as pricing information and reviews.

From the standpoint of computer science, augmenting the list of games in this way is clearly a bad idea, because it leads to a structure that lacks modularity. The original structure was a static, easily maintained list of computer games. In the augmented hypertext, this information is intermixed with orthogonal information about game categories, possibly ephemeral information concerning the organization of external Web sites, and possibly incorrect assumptions about the readers’ goals. The resulting structure is hard to maintain and hard to modify in certain natural ways, such as by changing the set of categories used to organize the list of games.

To summarize, the simple, modular encoding of this information will be difficult for users to exploit, and the easy-to-use encoding will be difficult to create, modify, and maintain. By contrast, it is trivial to encode this information in a relational database in a manner that is both modular and useful: we simply create a relation listing all old-PC-friendly games, and standard query languages let users find, say, reviews of inexpensive old-PC-friendly arcade games. (This example assumes that information about game prices and reviews is also available in the database.) Relational databases thus provide a more modular encoding of the information.

Unfortunately, conventional databases assume information is stored locally, in a consistent format, not externally, in diverse formats, as is the case with information on the Web. Hence they do not solve the problem of organizing information on the Web. To use modular, maintainable representations for information, while still exploiting the power of the Web (its distributed nature, large size, and broad scope), we need practical ways of integrating information from diverse sources.

Why integrating information is hard

Unfortunately, integrating information from multiple sources is very hard. One difficulty is programming a computer to understand the various information sources well enough to answer questions about them. Surprisingly, this is often difficult even when information is presented in simple, easy-to-parse regular structures such as lists and tables.

Figure 4. Two typical information sources: (a) Web site 1 and (b) Web site 2.

(a) Web site 1
GAME TITLE | PUBLISHER
Aladdin Activity Center | Disney Interactive
Arthur’s Computer Adventure | Living Books/Broderbund
Escape from Dimension Q | Headbone Interactive
How the Leopard Got His Spots | Microsoft Kids

(b) Web site 2
GAME PUBLISHER | HOME PAGE
Disney | http://www.disneyinteractive.com
Headbone | http://www.headbone.com
Humongous | http://www.humongous.com
Broderbund | http://www.broderbund.com
Microsoft | http://www.microsoft.com

As an example, Figure 4 shows a tabular representation of the information in two hypothetical Web sites. Consider the knowledge an integration system would need to answer the following question using these information sources:

    Who publishes “Escape from Dimension Q” and where is their home page?

(In this essay, we assume that questions are given to the information-integration system in a formal language; for readability, however, we’ll paraphrase questions in English
whenever possible.)
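To make the question concrete, the two tables of Figure 4 can be written down as lists of tuples and joined on publisher name. The sketch below is only an illustration, not part of any system described in this essay; the names `SITE1_GAMES`, `SITE2_PUBLISHERS`, `publisher_home_page`, and `same_company` are hypothetical. It shows that a naive exact-match join finds no answer, because the two sites spell the publisher differently, and that the join succeeds once the correspondence between the two spellings is supplied by hand.

```python
from typing import Dict, List, Tuple

# Rows transcribed from Figure 4; the variable names are illustrative.
SITE1_GAMES: List[Tuple[str, str]] = [
    ("Aladdin Activity Center", "Disney Interactive"),
    ("Arthur's Computer Adventure", "Living Books/Broderbund"),
    ("Escape from Dimension Q", "Headbone Interactive"),
    ("How the Leopard Got His Spots", "Microsoft Kids"),
]
SITE2_PUBLISHERS: List[Tuple[str, str]] = [
    ("Disney", "http://www.disneyinteractive.com"),
    ("Headbone", "http://www.headbone.com"),
    ("Humongous", "http://www.humongous.com"),
    ("Broderbund", "http://www.broderbund.com"),
    ("Microsoft", "http://www.microsoft.com"),
]


def publisher_home_page(title: str,
                        same_company: Dict[str, str]) -> List[Tuple[str, str]]:
    """Join the two tables on publisher name.

    same_company maps a publisher name as written on site 1 to the name
    used on site 2; names missing from it must match exactly.
    """
    answers = []
    for game_title, site1_name in SITE1_GAMES:
        if game_title != title:
            continue
        site2_name = same_company.get(site1_name, site1_name)
        for name, url in SITE2_PUBLISHERS:
            if name == site2_name:
                answers.append((site1_name, url))
    return answers


# Exact string matching: "Headbone Interactive" never equals "Headbone".
naive = publisher_home_page("Escape from Dimension Q", {})

# The same join once the name correspondence is provided manually.
assisted = publisher_home_page(
    "Escape from Dimension Q", {"Headbone Interactive": "Headbone"}
)
```

Here `naive` is empty, while `assisted` contains the pair of publisher and home page. The hand-written dictionary is exactly the kind of knowledge that someone has to provide and keep up to date when sources use different names for the same thing.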
To answer this question, the system must have knowledge of several kinds:

• It must know where to find these tables on the Web, and how they are formatted (access knowledge).
• It must know that each tuple 〈x,y〉 in the table Website-1 should be interpreted as the statement “the company y publishes the game x,” and that each tuple 〈t,u〉 in the table Website-2 should be interpreted as the statement “the home page for the company t is found at the URL u” (semantic knowledge).
• Finally, it must know that the string “Headbone” in Website-2 refers to the same company as the string “Headbone Interactive” in Website-1 (object-identity knowledge).

Even given all this knowledge, many interesting technical problems remain; however, the technical difficulties involved in using these types of knowledge pale beside the practical difficulties of acquiring the knowledge. Currently, all this knowledge must be manually provided to the integration system and updated whenever the original information sources change. Performing information integration is thus extremely knowledge-intensive and hence expensive in terms of human time.

Of course, many of these problems can be “assumed away:” integrating information sources is not nearly so difficult if they use common object identifiers, adopt a common data format, and use a known ontology. Unfortunately, few existing information sources satisfy these assumptions. The vast majority of existing online sources are designed to communicate only with human readers, not with other programs. We believe that this will continue to hold true, simply because presenting information to a human audience is less demanding for the information provider: information intended for a human audience need not conform to some externally set formal standard; it only has to be comprehensible to a reader.

The Whirl approach to information integration

We have written a system for information integration called Whirl. The approach to integration embodied in Whirl is based on two premises:

• It is unreasonable to assume that all the knowledge needed for information integration will be present, and in any case impractical to encode this information explicitly. Consequently, inferences made by an integration system are inherently incomplete and uncertain. As in machine learning, speech recognition, and information retrieval, the integration system will have to take some chances and make some mistakes. An integration system thus must have ways of reasoning with uncertain information, and communicating to the user its confidence in an answer.
• Information integration should exploit the existing human-oriented interface to information sources as much as possible. It should, whenever possible, understand information using general techniques, analogous to the ones people use, rather than relying on externally provided, problem-specific knowledge. For instance, people have no difficulty recognizing the structures in Figure 4 as two-column tables; thus a good information-integration system should also be able to recognize such structures. (Although general table-recognition methods exist [2], to our knowledge, no existing Web-based integration system uses them.) Similarly, most people would judge it likely that “Headbone” and “Headbone Interactive” denote the same company (or closely related ones), but would consider it unlikely that “Disney Interactive” and “Microsoft” do; an integration system should be able to make a similar judgement, even without knowledge of the domain.

Of the many mechanisms required by such an integration system, we have chosen to concentrate (initially) on general methods for integrating information without object-identity knowledge. In most integration tasks, far more object-identity knowledge is needed than any other type of knowledge; while semantic knowledge and access knowledge might be needed for each source, a system potentially needs object-identity knowledge about each pair of constants in the integrated database.

Our approach for dealing with uncertain object identities relies on the observation that information sources tend to use textually similar names for the same real-world objects. This is particularly true when sources are presenting information to people of similar background in similar contexts. To exploit this, the Whirl query language allows users to formulate SQL-like queries about the similarity of names. Consider again the tables of Figure 4. Assuming that the table Website-1 is encoded as a relation with schema game(name, pubName), and Website-2 is encoded as a relation with schema publisher(name, homepage), the question
SEPTEMBER/OCTOBER 1998 21
    Who publishes “Escape from Dimension Q” and where is their home page?

might be encoded as the Whirl query

    SELECT publisher.name, publisher.homepage
    FROM game, publisher
    WHERE (game.pubName ~ publisher.name AND game.name ~ “Escape from Dimension Q”)

Here ~ is a similarity operator, and thus the query asks Whirl to find a tuple 〈u,v〉 from publisher such that for some tuple 〈x,y〉 from game, y is textually similar to u, and x is textually similar to the string “Escape from Dimension Q.” Such a pair is a plausible answer to the query, although not necessarily a correct one.
This query language is central to our approach, so we will describe it in some detail. The query language has a “soft” semantics; the answer to such a query is not the set of all tuples that satisfy the query, but a list of tuples, each of which is considered a plausible answer to the query and each of which is associated with a numeric score indicating its perceived plausibility.
The universe of possible answers is determined by the FROM part of the query; in the example above, possible answers come from the Cartesian product of the game and publisher relations. Whirl scores answers according to how well they satisfy the conditions in the WHERE part of the query: each similarity condition gets a score between zero and one, each Boolean condition receives a score of either zero or one, and Whirl combines these primitive scores as if they were independent probabilities to obtain a score for the entire WHERE clause. For the query above, the score of a tuple 〈u,v,x,y〉 is the product of the similarity of y to u and the similarity of x to “Escape from Dimension Q.”
In typical use, Whirl returns only the K highest-scoring answers, where K is a parameter set by the user. From the user’s perspective, interacting with the system is thus much like interacting with a search engine: the user requests the first K answers, examines them, and then requests more if necessary.
Semantically, then, the Whirl query language is quite simple, but as in many enterprises, “the devil is in the details.” To make this idea work well, we needed to adopt ideas from several research communities. Our system computes the similarity of two names using cosine distance in the vector space model, a metric widely used in statistical-information retrieval [1]. Roughly speaking, two names are similar according to this metric if they share terms, where a term is a word stem, and names are considered more similar if they share more terms, or if the shared terms are rare. As an example, “Disney Interactive” and “Disney” would be more similar than “Disney Interactive” and “Headbone Interactive,” because “Interactive” is a more common term than “Disney.” These similarity metrics are not well understood formally, but are well supported experimentally.
Whirl also builds on ideas from artificial intelligence. To find the best K answers to a query, we use a variant of A* search [3,4], coupled with inverted-index techniques developed in the information-retrieval community [5]. In combination, these techniques allow Whirl to find the best K answers to a query fairly quickly, even when the universe of possible answers is extremely large.

Table 1. Output of a Whirl query pairing paragraphs of free text and names of computer games. The score is the similarity of the last two columns, normalized to a range of 0–100, and the checkmark indicates if the pairing is correct.
SCORE | DEMO.NAME | GAME.NAME | CORRECT
80.26 | Ubi Software has a demo of Amazing Learning Games with Rayman. | Amazing Learning Games with Rayman | √
78.25 | Interplay has a demo of Mario Teaches Typing. (PC) | Mario Teaches Typing | √
75.91 | Warner Active has a small interactive demo for Where’s Waldo at the Circus and Where’s Waldo? Exploring Geography. (Mac and Win) | Where’s Waldo? Exploring Geography | √
74.94 | MacPlay has demos of Marios Game Gallery and Mario Teaches Typing. (Mac) | Mario Teaches Typing | √
71.56 | Interplay has a demo of Mario Teaches Typing. (PC) | Mario Teaches Typing 2 | √
68.54 | MacPlay has demos of Marios Game Gallery and Mario Teaches Typing. (Mac) | Mario Teaches Typing 2 | √
68.45 | Psygnosis has an interactive demo for Lemmings Paintball. (Win95) | Lemmings Paintball | √
65.70 | ICONOS has a demo of What’s The Secret? Volume 1. (Mac and Win) | What’s the Secret? | √
64.33 | Fox Interactive has a fully working demo version of the Simpsons Cartoon Studio. (Win and Mac) | Simpsons Cartoon Studio | √
62.90 | Gryphon Software has demos of Gryphon Bricks, Colorforms Computer Fun Set - Power Rangers and Sailor Moon, and a FREE Gryphon Bricks Screen Saver. (Mac and Win) | Gryphon Bricks | √
60.30 | Vividus Software has a free 30 day demo of Web Workshop (Web-authoring package for kids!). (Win 95 and Mac) | Web Workshop | √
59.96 | Conexus has two shockwave demos - Bubbleoids (from Super Radio Addition with Mike and Spike) and Hopper (from Phonics Adventure with Sing Along Sam). | Super Radio Addition with Mike & Spike | √

What Whirl has accomplished
Using a search-engine-like interface (in which possible answers come in a ranked list) lets us evaluate Whirl in the same way that information retrieval researchers evaluate search engines. In particular, given information about which of Whirl’s proposed answers are correct, we can evaluate Whirl using metrics such as recall and precision. We evaluated Whirl on a number of benchmark problems from several different domains, using the measure of noninterpolated average precision. (Roughly speaking, this averages the best level of precision obtained at each distinct recall level. The highest possible value for this measure is 100%.) We discovered that the off-the-shelf similarity metric we adopted is surprisingly accurate. On 14 of 18 benchmark problems, average precision is 90% or higher; on seven of the 18 problems, average precision is 99% or higher.
Intriguingly, good performance can often come even when the names from one or both sources are embedded in extraneous text. Table 1 presents the first few answers for the query

    SELECT demo.name, game.name
    FROM demo, game
    WHERE demo.name ~ game.name

for a problem in which the names in demo are embedded in arbitrary paragraph-long passages of free text. As the table’s last column shows, most top-ranked pairings are
22 IEEE INTELLIGENT SYSTEMS
correct, and the complete ranking of answers
Whirl proposed has a respectable average
precision of 67%. Whirl’s robustness to ex-
traneous noise words means that we can
afford to use approximate methods of
extracting data from information sources.
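
To make the evaluation measure concrete, here is a minimal sketch of noninterpolated average precision under one common formulation: the mean of the precision values observed at each rank where a correct answer appears. The function name average_precision and the list of correct/incorrect flags are assumptions made for illustration; this is not Whirl's evaluation code.

```python
def average_precision(ranked_correct, total_correct=None):
    """Noninterpolated average precision of a ranked answer list.

    ranked_correct: list of booleans, highest-scoring answer first;
                    True means the proposed pairing is correct.
    total_correct:  number of correct answers that exist overall.
                    Defaults to the number of correct answers in the list
                    (an assumption: the list is taken to be complete).
    """
    if total_correct is None:
        total_correct = sum(ranked_correct)
    if total_correct == 0:
        return 0.0

    hits = 0
    precision_sum = 0.0
    for rank, is_correct in enumerate(ranked_correct, start=1):
        if is_correct:
            hits += 1
            # Precision at the rank where recall increases.
            precision_sum += hits / rank
    return precision_sum / total_correct
```

A ranking in which every correct answer precedes every incorrect one scores 1.0 (that is, 100%); each incorrect answer that outranks a correct one pulls the value down.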
We have also built several nontrivial
integrated-information systems using
Whirl. The domain of the first is children’s
computer games. This application inte-
grates information from 16 Web sites.
Using the HTML form interface shown in
Figure 5, users can construct questions like
the following:
Help me find reviews of games that are in the
category “art,” are recommended by two or
more sites, and are designed for children six
years old.
Figure 5. The interface to an information-integration system based on Whirl.

The application knows how to find reviews,
demos, and vendors of games, and also
understands several properties of games,
such as which games are popular and who
publishes which games.
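
The way such an application can work out who publishes which game, using the similarity join over game and publisher described earlier, can be sketched roughly as follows. This is an illustration under stated assumptions, not Whirl's implementation: lowercase word tokens stand in for word stems, the TF-IDF weighting is one common variant, the answers are ranked by exhaustively scoring the Cartesian product rather than by the A*-based search, and the rows in GAME and PUBLISHER (including the example.com URLs) are hypothetical.

```python
import math
import re
from collections import Counter
from heapq import nlargest


def terms(text):
    # Assumption: lowercase alphanumeric tokens stand in for word stems.
    return re.findall(r"[a-z0-9]+", text.lower())


def make_idf(corpus):
    """Return a function giving a rarity weight for a term."""
    n = len(corpus)
    df = Counter(t for doc in corpus for t in set(terms(doc)))
    # Terms never seen in the corpus are treated as if they occurred once.
    return lambda t: math.log(1 + n / df.get(t, 1))


def vector(text, idf):
    """Unit-length TF-IDF vector for a name, as a term -> weight dict."""
    tf = Counter(terms(text))
    vec = {t: (1 + math.log(f)) * idf(t) for t, f in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def similarity(a, b, idf):
    """Cosine similarity of two names; always between zero and one."""
    va, vb = vector(a, idf), vector(b, idf)
    return sum(w * vb.get(t, 0.0) for t, w in va.items())


def soft_join(game, publisher, title, k=3):
    """Top-k answers to: game.pubName ~ publisher.name AND game.name ~ title."""
    corpus = [v for row in game for v in row] + [p[0] for p in publisher]
    idf = make_idf(corpus)
    answers = []
    for g_name, g_pub in game:
        for p_name, p_home in publisher:
            # Combine the two similarity conditions as independent probabilities.
            score = similarity(g_pub, p_name, idf) * similarity(g_name, title, idf)
            answers.append((score, p_name, p_home))
    return nlargest(k, answers)


# Hypothetical rows, for illustration only.
GAME = [
    ("Escape from Dimension Q", "Headbone Interactive"),
    ("Mario Teaches Typing", "Interplay Productions"),
]
PUBLISHER = [
    ("Headbone", "http://example.com/headbone"),
    ("Interplay", "http://example.com/interplay"),
    ("Disney Interactive", "http://example.com/disney"),
]
```

Calling soft_join(GAME, PUBLISHER, "Escape from Dimension Q") returns a ranked list of (score, publisher name, home page) tuples. The exhaustive loop is only practical for small relations, which is why the real system relies on search and inverted indices instead.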
We have built a similar system that integrates information about North American birds. Collectively, the integrated databases contain about 100,000 tuples, about 10,000 of which point to external Web pages. Both systems are available on the Web (at http://whirl.research.att.com/cdroms/ and http://whirl.research.att.com/birds/). The response time for complex queries is typically less than 10 seconds. (These time measurements are on a lightly loaded Sun Ultra 2170 with 167-MHz processors and 512 Mbytes of main memory. The current server is not multithreaded, so response times vary greatly with load.)
In building these applications, we deliberately sidestepped many of the problems that have historically been research issues in information integration. Whirl data is not semistructured [6], but instead is stored in simple relations. The problem of query planning [7] is simplified by collecting all data with spiders offline. We map individual information sources into a global schema using manually constructed views, rather than using more powerful methods [8,9]. Access knowledge is represented in hand-coded extraction programs [10], rather than learned by example as proposed by Nicholas Kushmerick and others [11–13]. We made these decisions to highlight the advantages of our approach, relative to earlier integration methods: by adopting an uncertain approach to integration, significant applications can be built, even without the aid of other advanced techniques.
The two implemented integration applications illustrate several other important points. First, in the games application, Whirl extracts information about the age range for which games are appropriate from a commercial site; this information can then be used to access a collection of reviews taken from several consumer-oriented sites. The age-range information has, in some sense, been made portable; it has been disassociated from the site that provided it and used for a goal different from its intended purpose (of improving access to a large online catalog). Attaining this sort of modularity and portability of information was one of Whirl’s primary goals.
Second, integration need not require complex query interfaces to be useful. As well as providing a query-based interface, the bird application allows data about birds to be browsed, either geographically or by scientific order and family. This browsing interface extends the capability of the original sources, which are seldom organized along both of these dimensions. The bird application also includes a quick-search feature, in which the user types in the name of a bird and gets a list of URLs in response. As an example, in response to a quick search for “great blue heron,” Whirl’s answer includes a picture indexed only as ardea herodias, the scientific name of the great blue heron. This sort of intelligent behavior is enabled by translating the user’s quick search into a structured query that exploits an information source giving the scientific nomenclature for birds.

The future of integration
Whirl’s current implementation could be extended in many ways. Challenging technical issues include scaling up to larger data sets (the current implementation is memory-based), finding more flexible and more automatic extraction methods, learning to improve scores based on feedback of various kinds, and collecting data efficiently at query time. We will conclude, however, with some more general remarks.
One possible goal for computer science research is the construction of an information system with size and scope comparable to the Web, but with abilities comparable to a knowledge base. In particular, we would like a system that can reason about and understand information that, like information found on the Web, is constructed and maintained in a decentralized fashion. This is a very hard problem and perhaps a very distant goal; however, Whirl represents an important step toward that goal.
Previous systems that access information from multiple sources fall into two main classes. Search engines provide weak and relatively unstructured access to a large number of sites. Previous Web-based information integration systems provide better access to a small number of highly structured sites. Whirl’s emphasis on inexact, uncertain integration provides an intermediate step between these two extremes.
The intermediate step of cheap, approximate information integration is a critical one. It is unreasonable to expect an information provider whose primary audience is people to spend much time and energy in making his or her information programmatically available unless there is a clear and immediate benefit. Unfortunately, while integration does provide a benefit, this benefit does not materialize until a number of sources are integrated. In economic terms, the value of having a well-structured, easily-integrated information source is largely external, leading to a classic chicken-and-egg problem. The availability of cheap, approximate integration methods could help to overcome this problem.
Let us close with an analogy. Information integration can be viewed as the problem of getting information sources to talk to each other. Our approach can be viewed as getting information sources to talk to each other in an informal way. We hope that this kind of informal communication will retain much of the utility of formal communication, but be far easier to attain, just as informal essays like this one can, without tedious technical detail, communicate the essence of a new technical result.

References
1. G. Salton, ed., Automatic Text Processing, Addison Wesley, Reading, Mass., 1989.
2. D. Rus and D. Subramanian, “Customizing Capture and Access,” ACM Trans. Information Sys., Vol. 15, No. 1, 1997, pp. 67–101.
3. N. Nilsson, Principles of Artificial Intelligence, Morgan Kaufmann, San Francisco, 1987.
4. J. Pearl, Heuristics: Intelligent Search Strategies for Computer Problem Solving, Addison-Wesley, 1984.
5. W.W. Cohen, “Integration of Heterogeneous Databases without Common Domains Using Queries Based on Textual Similarity,” Proc. ACM Sigmod-98, ACM Press, New York, 1998, pp. 201–212.
6. D. Suciu, ed., Proc. Workshop on Management of Semistructured Data; http://www.research.att.com/~suciu/workshop-papers.html.
7. J.L. Ambite and C.A. Knoblock, “Planning by Rewriting: Efficiently Generating High-quality Plans,” Proc. 14th Nat’l Conf. AI, AAAI Press, Menlo Park, Calif., 1997, pp. 706–713.
8. A.Y. Levy, A. Rajaraman, and J.J. Ordille, “Querying Heterogeneous Information Sources Using Source Descriptions,” Proc. 22nd Int’l Conf. Very Large Databases (VLDB-96), Morgan Kaufmann, 1996, pp. 251–262.
9. O.M. Duschka and M.R. Genesereth, “Answering Recursive Queries Using Views,” Proc. 16th ACM Sigact-Sigmod-Sigart Symp. Principles of Database Systems (PODS-97), ACM Press, 1997, pp. 109–116.
10. W.W. Cohen, “A Web-Based Information System that Reasons with Structured Collections of Text,” Proc. Second Int’l Conf. Autonomous Agents, ACM Press, New York, 1998, pp. 400–407.
11. N. Kushmerick, D.S. Weld, and R. Doorenbos, “Wrapper Induction for Information Extraction,” Proc. 15th Int’l J. Conf. AI, AAAI Press, 1997, pp. 729–735.
12. C.A. Knoblock et al., “Modeling Web Sources for Information Integration,” Proc. 15th Nat’l Conf. AI (AAAI-98), AAAI Press, 1998, pp. 211–218.
13. C.-N. Hsu, “Initial Results on Wrapping Semistructured Web Pages with Finite-State Transducers and Contextual Rules,” Papers from the 1998 Workshop on AI and Information Integration, AAAI Press, 1998, pp. 66–73.