Information integration





TRENDS & CONTROVERSIES

Information integration
By Marti A. Hearst
University of California, Berkeley
hearst@sims.berkeley.edu

Despite the Web’s current disorganized and anarchic state, many AI researchers believe that it will become the world’s largest knowledge base. In this installment of Trends and Controversies, we examine a line of research whose final goal is to make disparate data sources work together to better serve users’ information needs. This work is known as information integration. In the following essays, our authors talk about its application to datasets made available over the Web.

Alon Levy leads off by discussing the relationship between information-integration and traditional database systems. He then enumerates important issues in the field and demonstrates how the Information Manifold project has addressed some of these, including a language for describing the contents of diverse sources and optimizing queries across sources.

Craig Knoblock and Steve Minton describe the Ariadne system. Two of its distinguishing features are its use of wrapper algorithms to extract structured information from semistructured data sources and its use of planning algorithms to determine how to integrate information efficiently and effectively across sources. This system also features a mechanism that determines when to prefetch data depending on how often the target sources are updated and how fast the databases are.

William Cohen describes an interesting variation on the theme, focusing on “informal” information integration. The idea is that, as in related fields that deal with uncertain and incomplete information, an information-integration system should be allowed to take chances and make mistakes. His Whirl system uses information-retrieval algorithms to find approximate matches between different databases, and as a consequence knits together data from quite diverse sources.

A controversy emerges in the midst of this trend, centering around the issue of whether information extraction from HTML-based Web pages is a long-standing problem. Proponents of XML (Extensible Markup Language, see www.w3.org/TR/REC-xml.html) argue that in the future information of any importance will be exchanged between programs using a well-defined protocol, rather than being displayed solely for purposes of reading using ad hoc formats in HTML. In his essay, Levy argues that the problem of extracting information from HTML markup will, as a consequence of such protocols, become less important. He notes, however, that the problem of integrating data that differs semantically will still remain. Knoblock and Minton counter that the need for HTML wrappers will remain strong, arguing that there will always be exceptions and legacy pages.

Cohen takes a different stance, suggesting that many information providers want to help inform people, but might not see a direct benefit from the investment required to form a highly structured data source. He suggests that cheap, approximate information integration, such as enabled by his system, can render these simpler sites more powerful, providing a larger benefit than any individual site developer alone could attain, and getting around the chicken-and-egg problem of who pays to make useful information available free.

On a different note, Haym Hirsh of Rutgers has signed on to help edit Trends & Controversies. To continue providing sharp, cogent debates on topics that span a wide range of intelligent systems research and applications development, he and I will be alternating installments. For his first outing next issue, Haym has lined up Barbara Hayes-Roth, Janet Murray, and Andrew Stern, who will address interactive fiction.

Marti Hearst

The Information Manifold approach to data integration
Alon Y. Levy, University of Washington

A data-integration system provides a uniform interface to a multitude of data sources. Consider a data-integration system providing information about movies from data sources on the World Wide Web. There are numerous sources on the Web concerning movies, such as the Internet Movie Database (which provides comprehensive listings of movies, their casts, directors, genres, and so forth), MovieLink (listing playing times of movies in US cities), and several sites that provide textual reviews for selected movies. Suppose we want to find which Woody Allen movies are playing tonight in Seattle and see their respective reviews. None of these data sources in isolation can answer this query. However, by combining data from multiple sources, we can answer queries like this one, and even more complex ones.

To answer our query, we would first query the Internet Movie Database to obtain the list of movies directed by Woody Allen, and then feed the result into the MovieLink database to check which ones are playing in Seattle. Finally, we would find reviews for the relevant movies using any of the movie review sites.

Most importantly, a data-integration system lets users focus on specifying what they want, rather than thinking about how to obtain the answers. As a result, it frees them from the tedious tasks of finding the relevant data sources, interacting with each source in isolation using a particular interface, and combining data from multiple sources.

Traditional database systems

To understand the challenges involved in building data-integration systems, I will briefly compare the problems that arise in this context with those encountered in traditional database systems. In this discussion, I focus mainly on comparisons with relational database systems, but the differences also hold for systems based on other models, such as object-oriented and object-relational ones. Figure 1 illustrates the different stages in processing a query in a data-integration system.

Data modeling. Traditional database systems and data-integration systems differ mainly in the process they use to organize data into an application. In a traditional system, the application designer examines the application’s requirements, designs a database schema (such as a set of relation names and the attributes of each relation), and then implements the application, part of which involves actually populating the database (inserting tuples into the tables).

In contrast, a data-integration application begins from a set of pre-existing data sources. These sources might be database systems, but more often are unconventional data sources, such as structured files, legacy

    12                                                                                                                             IEEE INTELLIGENT SYSTEMS



systems, or Web sites. Here the application builder must design a mediated schema on which users will pose queries. The mediated schema is a set of virtual relations, in that they are not actually stored anywhere. The mediated schema is designed manually for a particular data-integration application. For example, in the movie domain, the mediated schema might contain the relations MovieInfo(id, title, genre, country, year, director) describing the different properties of a movie, the relation MovieActor(id, name) representing a movie’s cast, and MovieReview(id, review) representing reviews of movies.

Along with the mediated schema, the application designer needs to supply descriptions of the data sources. The descriptions specify the relationship between the relations in the mediated schema and those in the local schemas at the sources. (Even though not all the sources are databases, we model them as having schemas at the conceptual level.) An information-source description specifies

• the source’s contents (for example, contains movies),
• the attributes found in the source (genre, cast),
• constraints on the source’s contents (contains only American movies),
• the source’s completeness and reliability, and finally,
• its query-processing capabilities (can perform selections or can answer arbitrary SQL queries).

Because the data sources are preexisting, data in the sources might be overlapping and even contradictory. Furthermore, we might face the following problems:

• Semantic mismatches between sources. Because each data source has been designed by a different organization for different purposes, the data is modeled in different ways. For example, one source might be a relational database that stores all of a particular movie’s attributes in one table, while another source might spread the attributes across several relations. Furthermore, the names of the attributes and tables will differ from one source to another, as will the choice of what should be a table and what should be an attribute.
• Different naming conventions. Sources use different names to refer to the same object. For example, the same person might be called “John Smith” in one source and “J.M. Smith” in another.

Query optimization and execution. A traditional relational-database system accepts a declarative query in SQL. The system first parses the query before passing it to the query optimizer. The optimizer produces an efficient query-execution plan for the query, which is an imperative program that specifies exactly how to evaluate the query. In particular, the plan specifies the order for performing the query’s operations (join, selection, and projection), the method for implementing each operation (such as sort-merge join or hash join), and the scheduling of the different operators (where parallelism is possible). Typically, the optimizer selects a query-execution plan by searching a space of possible plans and comparing their estimated costs. To evaluate a query-execution plan’s cost, the optimizer relies on extensive statistics about the underlying data, such as the sizes of relations, sizes of domains, and selectivity of predicates. Finally, the query-execution plan passes to the query-execution engine, which evaluates the query.

The traditional database and the data-integration contexts differ primarily in that the optimizer has little information about the data, because the data resides in remote autonomous sources rather than locally. Furthermore, because the data sources are not necessarily database systems, the sources appear to have different processing capabilities. For example, one data source might be a Web interface to a legacy information system, while another might be a program that scans data stored in a structured file (such as bibliography entries). Hence, the query optimizer must consider the possibility of exploiting a data source’s query-processing capabilities. Query optimizers in distributed database systems also consider where parts of the query are executed, but in that context

Alon Y. Levy is a faculty member at the University of Washington’s Computer Science and Engineering Department. His research interests are Web-site management, data integration, query optimization, management of semistructured data, description logics and their relationship to database query languages, abstractions and approximations of computational theories, and relevance reasoning. He received his PhD in computer science from Stanford University and his undergraduate degree at Hebrew University. Contact him at the Dept. of Computer Science and Engineering, Sieg Hall, Room 310, Univ. of Washington, Seattle, WA, 98195; alon@cs.washington.edu; http://www.cs.washington.edu/homes/alon/.

Craig Knoblock is a project leader and senior research scientist at the Information Sciences Institute, a research assistant professor in the Computer Science Department, and a key investigator in the Integrated Media Systems Center at the University of Southern California. His research interests include information gathering and integration, automated planning, machine learning, knowledge discovery, and knowledge representation. He received his BS from Syracuse University and his MS and PhD from Carnegie Mellon University, all in computer science. Contact him at USC/ISI, 4676 Admiralty Way, Marina del Rey, CA 90292; knoblock@isi.edu; http://www.isi.edu/~knoblock.

Steve Minton is a senior computer scientist at the Information Sciences Institute and a research associate professor in the Computer Science Department at the University of Southern California. His research interests are in machine learning, planning, scheduling, constraint-based reasoning, and program synthesis. He received his BA in psychology from Yale University and his PhD in computer science from Carnegie Mellon University. He founded the Journal of Artificial Intelligence Research and served as its first executive editor. He was recently elected to be a fellow of the AAAI. Contact him at USC/ISI, 4676 Admiralty Way, Marina del Rey, CA 90292; minton@isi.edu; http://www.isi.edu/sims/minton/homepage.html.

William Cohen is a principal research staff member in the department of Machine Learning and Information Retrieval Research at AT&T Labs-Research. In addition to information integration, his research interests include machine learning, text categorization, learning from large datasets, computational learning theory, and inductive logic programming. He received a bachelor’s degree from Duke and a PhD from Rutgers, both in computer science. Contact him at AT&T Labs-Research, 180 Park Avenue, Florham Park NJ 07932-0971; wcohen@research.att.com; http://www.research.att.com/~wcohen/.
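The three-step plan that Levy's essay gives for the Woody Allen query (ask the movie database for the director's films, filter them by MovieLink showtimes, then fetch reviews) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three in-memory lists stand in for the remote sources, the rows and film titles are invented, and the function names are hypothetical rather than part of any system described in the essay.

```python
# Hypothetical in-memory stand-ins for the three kinds of sources.
# All rows are invented for illustration; they are not real listings.
MOVIE_DB = [  # plays the role of the Internet Movie Database
    {"id": 1, "title": "Film A", "director": "Woody Allen"},
    {"id": 2, "title": "Film B", "director": "Woody Allen"},
    {"id": 3, "title": "Film C", "director": "Someone Else"},
]
SHOWTIMES = [  # plays the role of MovieLink
    {"title": "Film A", "city": "Seattle"},
    {"title": "Film C", "city": "Seattle"},
    {"title": "Film B", "city": "Portland"},
]
REVIEWS = [  # plays the role of a movie review site
    {"title": "Film A", "review": "Sharp and funny."},
]


def movies_by_director(director):
    """Step 1: titles of all movies by the given director."""
    return [m["title"] for m in MOVIE_DB if m["director"] == director]


def playing_in(titles, city):
    """Step 2: keep only the titles that are showing in the given city."""
    wanted = set(titles)
    return [s["title"] for s in SHOWTIMES
            if s["city"] == city and s["title"] in wanted]


def reviews_for(titles):
    """Step 3: map each title to its review, when one exists."""
    wanted = set(titles)
    return {r["title"]: r["review"] for r in REVIEWS if r["title"] in wanted}


def answer(director, city):
    """Run the three steps in order and pair each playing title with a review."""
    playing = playing_in(movies_by_director(director), city)
    found = reviews_for(playing)
    return [(title, found.get(title)) for title in playing]


result = answer("Woody Allen", "Seattle")
print(result)
```

The point of a data-integration system is that the user would state only the query; choosing this order of source accesses is the system's job, not the user's.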

SEPTEMBER/OCTOBER 1998                                                                                                                                 13



the different processors have identical capabilities. Finally, because data must be transferred over a network, the query optimizer and the execution engine must be able to adapt to data-transfer delays.

Query reformulation. A data-integration system user poses queries in terms of the mediated schema, rather than directly in the schema where the data resides. As a consequence, a data-integration system must first reformulate a user query into a query that refers directly to the schemas in the sources. Such a reformulation step does not exist in traditional database systems. To perform the reformulation step, the data-integration system uses the source descriptions.

Wrappers. Unlike a traditional query-execution engine that communicates with the storage manager to fetch the data, a data-integration system’s query-execution plan must obtain data from remote sources. To do so, the execution engine communicates with a set of wrappers. A wrapper is a program that is specific to every data source and that translates the source’s data to a form that the system’s query processor can further process. For example, the wrapper might extract a set of tuples from an HTML file and perform translations in the data’s format.

[Figure 1 diagram. Global data model: query in mediated schema; query reformulation; query in the union of exported source schemas; query optimization; distributed query-execution plan; query execution engine; query the exported source schema; wrappers. Local data model: query in the source schema.]
Figure 1. Prototypical architecture of a data-integration system.

Semistructured data. The term semistructured data has been used with various meanings to refer to characteristics of data present in a data-integration system. To understand the importance of semistructured data, we distinguish between a lack of structure at the physical level versus one at the logical level. With lack of structure at the physical level, structured data (for example, tuples) are embedded in a file containing additional markup information such as HTML files. Extracting the actual values from the HTML file can be a very complex task, and is one that the source’s wrapper performs.

Most work on semistructured data concerns lack of structure at the logical level. In this context, semistructured data refers to cases in which the data does not necessarily fit into a rigidly predefined schema, as required in traditional database systems. This might arise because the data is very irregular and hence can be described only by a relatively large schema. In other cases, the schema might be rapidly evolving, or not even declared at all; it might be implicit in the data. The database community has developed several methods to model and query semistructured data and is currently considering the issues of query optimization and storage for such data.1,2 Building data-integration systems based on a semistructured data model has two main advantages:

• In many cases, the data in the sources is indeed semistructured at the logical level.
• The models developed for semistructured data can cleanly integrate data coming from multiple data models, such as relational, object-oriented, and Web-data models.

Classes of data integration applications. The two main classes of data-integration applications are integration of data sources on the Web and within a single company or enterprise. In the latter case, the sources are not as autonomous as they are on the Web, but the requirements from a data-integration system might be more stringent.

The Information Manifold Project

In this project, we wanted to develop a system that would flexibly integrate many data sources with closely related content and that would answer queries efficiently by accessing only the sources relevant to the query. The remainder of this essay will describe its main contributions.3–6

The AI and DB approach. We based our approach in designing the Information Manifold on the observation that the data-integration problem lies at the intersection of database systems and artificial intelligence. Hence, we searched for solutions that combine and extend techniques from both fields. For example, we developed a representation language and a language for describing data sources that was simple from the knowledge-representation perspective, but that had the necessary added flexibility relative to previous techniques developed in the database community.

Source description language. The Information Manifold is most importantly a flexible mechanism for describing data sources. This mechanism lets users describe complex constraints on a data source’s contents, thereby letting them distinguish between sources with closely related data. Also, this mechanism makes it easy to add or delete data sources from the system without changing the descriptions of other sources. Informally, the contents of a data source are described by a query over the mediated schema. For example, we might describe a data source as containing American movies that are all comedies and were produced after 1965 (source 1 in Figure 2). As another example, we can describe sources whose schema differs significantly from the mediated schema. For instance, we can describe a source in which a movie’s year, genre, actor, and review attributes already appear in one table (source 2 in Figure 2). This source is modeled as containing the result of a join over a set of relations in the mediated schema. For some queries, extracting data from this source might be cheaper than from others if the join computed in the source is indeed required for the query.

The Information Manifold employed an expressive language, Carin, for formulating queries and for representing background knowledge about the relations in the mediated schema. Carin7 combined the expressive power of the datalog database-query language (needed to model relational sources) and Description Logics, which are knowledge-representation languages de-




signed especially to model                                                                           compare the relative expres-
complex hierarchies that                                                                             sive power of our source-
frequently arise in data-                                                                            description languages. We
integration applications.                                                                            have a set of properties along

                                      Source 1                            Source 2
                                      select title, year, director        select title, genre, review
                                      from MOVIEINFO                      from MOVIEINFO M, MOVIEREVIEW R
                                      where genre = COMEDY                where M.id = R.id
                                        and year ≥ 1965
                                        and country = USA

                                 Figure 2. Data-source descriptions.
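To make the idea of describing a source by constraints on the mediated schema more concrete, here is a small Python sketch of how such descriptions could let a system skip sources that cannot contribute to a query. It is a simplified consistency check written for illustration, not the Information Manifold's actual algorithm or the Carin language. The helper names (`eq`, `between`, `satisfiable`, `relevant_sources`) are hypothetical, and source 2 is assumed to carry no content constraints.

```python
import math


def eq(value):
    """Constraint: the attribute equals exactly this value."""
    return ("eq", value)


def between(lo=-math.inf, hi=math.inf):
    """Constraint: the attribute lies in the closed interval [lo, hi]."""
    return ("range", lo, hi)


def satisfiable(a, b):
    """Return True if at least one value satisfies both constraints."""
    if a[0] == "eq" and b[0] == "eq":
        return a[1] == b[1]
    if a[0] == "range" and b[0] == "range":
        return max(a[1], b[1]) <= min(a[2], b[2])
    # One equality and one range: the fixed value must fall inside the range.
    point, interval = (a, b) if a[0] == "eq" else (b, a)
    return interval[1] <= point[1] <= interval[2]


def relevant_sources(query, sources):
    """Keep the sources whose description does not contradict the query."""
    kept = []
    for name, description in sources.items():
        shared = set(description) & set(query)
        if all(satisfiable(description[attr], query[attr]) for attr in shared):
            kept.append(name)
    return kept


# Source 1 follows Figure 2: American comedies from 1965 onward.
# Source 2 is assumed (for this sketch only) to have no content constraints.
SOURCES = {
    "source1": {"genre": eq("comedy"), "country": eq("USA"),
                "year": between(lo=1965)},
    "source2": {},
}

old_comedies = {"genre": eq("comedy"), "year": between(hi=1950)}
recent_comedies = {"genre": eq("comedy"), "year": between(lo=1990)}

print(relevant_sources(old_comedies, SOURCES))
print(relevant_sources(recent_comedies, SOURCES))
```

A query for comedies made no later than 1950 contradicts source 1's year constraint, so only source 2 needs to be contacted; a query for comedies from 1990 onward keeps both.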
                                                                                                                    which we can compare our
Query-answering algorithms. We devel-                Manifold, we developed a method for rep-         query-answering algorithms (such as, do
oped algorithms for answering queries                resenting local-source completeness and an       they guarantee accessing only relevant
using the information sources. Recall that           algorithm for exploiting such information        sources or a minimal number of sources?).
user queries are posed in terms of the medi-         in query answering.5                             We can also compare features of our
ated schema. Hence, the main challenge in                                                             systems (do they assume sources are com-
designing the query-answering algorithms             Using probabilistic information. The In-         plete, can they handle local completeness,
is to reformulate the query such that it             formation Manifold pioneered the use of          and can they compare directly between
refers to the relations in the data sources.         probabilistic reasoning for data integration     sources?).
Our algorithms were the first to guarantee           (representing another example of the com-           We need to take this progress into account
that only the relevant set of data sources are       bined AI and DB approach to the data-            as we address the challenges that lie ahead.
accessed when answering a query, even in             integration problem).6 When numerous data        Our common terminology will enable us
the presence of sources described by com-            sources are relevant to a given query (such      (and should force us) to compare systems
plex constraints.                                    as bibliographic databases available for a       more rigorously, either theoretically or ex-
   It is interesting to note the difference          topic search), a data-integration system         perimentally. To proceed, we must also de-
between our approach to query answering              needs to order the access to the data sources.   velop a set of data-integration benchmarks,
and that employed in the SIMS and Ariadne            Such an ordering is dependent on the over-       along which we can experimentally com-
projects described in Craig Knoblock’s and           lap between the sources and the query and        pare our data-integration systems.
Steve Minton’s companion essay. In their             on the coverage of the sources. We devel-
approach, even though they used a knowl-             oped a probabilistic formalism for specify-      The immediate future
edge-representation system for specifying            ing and deducing such information and al-           The data-integration problem is by no
the source descriptions, they used a general-        gorithms for ordering the access to data         means solved. We have made significant
purpose planner to reformulate a user query          sources given such information.6                 progress in the problems relating to model-
into a query on the data sources. In contrast,                                                        ing data sources and developing methods for
our approach uses the reasoning mechan-              Exploiting source capabilities. Sources          combining data from them via a single, inte-
isms associated with the underlying knowl-           often have different query-processing ca-        grated view. Many problems remain in that
edge-representation system to perform the            pabilities. For example, one source might        area, the most significant being the problem
reformulation. Aside from the natural ad-            be a full-fledged relational database, while     of name matching across sources. This prob-
vantages obtained by treating the represen-          another might be a Web site with a very          lem is finally starting to be addressed in a
tation and the query reformulation within            specific form interface that supports only a     principled manner in the Whirl system.13
the same framework, our approach can pro-            limited set of queries and that requires            In the near future, I believe that the bulk
vide better formal guarantees on the results         certain inputs be provided to it. The Infor-     of the work in the field should shift into
and can benefit immediately from exten-              mation Manifold developed several novel          other, less attended problems, some of
sions to the underlying knowledge-repre-             algorithms for adapting to differing source      which I describe here.
sentation system.                                    capabilities. When possible, to reduce the
                                                     amount of processing done locally, the           Information presentation. Users are
Handling completeness information. In                system would fully exploit the query-pro-        rarely interested in data that is simply and
general, sources on the Web are not neces-           cessing capabilities of its data sources.4,9     concisely presented. More commonly, the
sarily complete for the domain they are              In addition, we developed a mechanism for        results of users’ queries are best seen as entry
covering. For example, a computer science            describing source capabilities that is a nat-    points into entire webs of data. This obser-
bibliography source is unlikely to contain           ural extension of our method for describ-        vation begs the question of how to build
all the references in the field. However, in         ing source contents. 9                           systems that enable us to flexibly design a
some cases, we can assert local complete-                                                             web of information. In fact, this is exactly
ness statements about sources.6 For exam-            What we did as a community                       the problem we face in designing a richly
ple, the DB&LP Database (http://www.                    In the past few years, each of the groups     structured Web site. The key to designing
informatik.uni-trier.de/ley/db/) contains the        working on data integration has made sig-        such systems is a declarative representation
complete set of papers published in some             nificant individual progress (see this maga-     of a web of information’s structure. Based
of the major database conferences. Knowl-            zine’s Web page for a list of projects;          on such a representation, we can easily
edge of a Web source’s completeness can              http://computer.org/intelligent). However,       specify how to restructure the information
help a data-integration system in several            we have also made progress as a commu-           integrated from multiple sources into a
ways. Most importantly, because a negative           nity. In particular, we have developed a         structure that users can navigate. Recently,
answer from a complete source is meaning-            common set of terms and dimensions along         we have developed the Strudel system,14
ful, the data-integration system can prune           which we can now compare our work more           which is the first to apply these principles
access to other sources. In the Information          rigorously.12 For example, we can now            in creating Web sites.
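
As a rough illustration of the pruning that completeness information makes possible (see "Handling completeness information" above), the following Python sketch stops consulting further sources once a source declared complete for the relevant venue returns a negative answer. The Source class, the lookup function, the sample titles, and the choice of VLDB as a venue covered completely are hypothetical and are not taken from the Information Manifold.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    """A hypothetical bibliography source (illustration only)."""
    name: str
    titles: set                                      # paper titles the source holds
    complete_for: set = field(default_factory=set)   # venues it is declared complete for


def lookup(title, venue, sources):
    """Return (found, names_of_sources_accessed) for a paper published at `venue`."""
    # Consult sources that are complete for this venue first (False sorts before True).
    ordered = sorted(sources, key=lambda s: venue not in s.complete_for)
    accessed = []
    for source in ordered:
        accessed.append(source.name)
        if title in source.titles:
            return True, accessed
        if venue in source.complete_for:
            # A negative answer from a complete source is meaningful,
            # so the remaining sources need not be accessed.
            return False, accessed
    return False, accessed


# Assumption for illustration: DB&LP is declared complete for VLDB papers.
dblp = Source("DB&LP", {"Paper A"}, complete_for={"VLDB"})
other = Source("other-bibliography", {"Paper B"})

print(lookup("Paper C", "VLDB", [other, dblp]))   # only DB&LP is accessed
print(lookup("Paper C", "AAAI", [other, dblp]))   # both sources are accessed
```

Sorting the complete sources to the front is a design choice of this sketch: it lets a single negative answer end the search as early as possible.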

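
The capability restrictions discussed under "Exploiting source capabilities" can be illustrated with a second small sketch. It assumes that each source is described only by the attributes it requires as inputs and the attributes it returns, and it reuses the movie example from earlier in this essay. The function name, the attribute names, and the greedy strategy are illustrative assumptions, not the Information Manifold's actual algorithms.

```python
def executable_order(sources, bound):
    """Order sources so that each one's required inputs are already available.

    sources: dict mapping a source name to (required_inputs, outputs)
    bound:   attributes whose values the user query supplies
    Returns a list of source names, or None if some source can never be called.
    """
    bound = set(bound)
    remaining = dict(sources)
    order = []
    while remaining:
        # Sources whose required inputs are all bound can be accessed now.
        ready = sorted(name for name, (required, _) in remaining.items()
                       if set(required) <= bound)
        if not ready:
            return None
        name = ready[0]
        order.append(name)
        # The outputs of the accessed source become available as inputs.
        bound |= set(remaining.pop(name)[1])
    return order


# Hypothetical capability descriptions: each source needs one input attribute.
movie_sources = {
    "MovieLink": ({"title"}, {"theater"}),
    "Internet Movie Database": ({"director"}, {"title"}),
}

print(executable_order(movie_sources, {"director"}))
print(executable_order(movie_sources, set()))
```

With the director supplied, the movie database must be queried first because MovieLink cannot be called until a title is known; with no inputs supplied, neither source can be accessed and the function returns None.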
SEPTEMBER/OCTOBER 1998                                                                                                                           15




Optimization and execution. Modern database systems succeed largely because of the careful design of their query-optimization and query-execution engines. Recall that the query optimizer is the module in charge of transforming a declarative query (given, for example, in SQL) into a query-execution plan, thereby making decisions on the order of joins and the specific methods for implementing each operation in the plan. The query-execution engine actually evaluates the plan. Given that data integration is a more general form of the problem addressed in database systems, data-integration systems will succeed only if we carefully consider the design of these components.

Two factors complicate the problems of query optimization and execution in the context of data integration: lack of extensive statistics on the data we are accessing (unlike with relational databases) and unpredictable arrival rates of data from the sources at runtime. Here, too, a combination of techniques from AI and database systems is likely to provide interesting solutions. In particular, in this context, the need for interleaving query optimization and query execution is much more significant. The idea of interleaving planning and execution has been considered in the AI planning literature in recent years.15 In contrast, current database systems perform complete query optimization before beginning the execution. The issues of query optimization and execution are the focus of the Tukwila project underway at the University of Washington.

Obtaining source descriptions. Current systems are very good at using source descriptions to answer queries. However, source descriptions must still be given manually. Specifically, the problem is to obtain the semantic mapping between the content of the source and the relations in the mediated schema. If data-integration systems are really going to scale up to large numbers of sources, we must develop automatic methods for obtaining source descriptions, possibly by employing techniques from machine learning.

A nonproblem. Many data-integration efforts have focused on the problem of extracting data from HTML pages, that is, extracting tuples from documents in which data is semistructured at the physical level. This problem will become significantly less important, given the emergence of standards such as XML and languages that will facilitate querying XML documents.15 Web sites that serve significant amounts of data are usually developed using some tool for serving database contents. Using such tools will make it easier to serve the data in XML form, rather than directly in HTML. Hence, Web sites will be able to export data in XML with no added burden to the information providers. Of course, some Web sites that do not want their data to be used for integration purposes might still only serve HTML pages, but trying to integrate data from such sources is probably a futile effort at best.

However, while the availability of data in XML format will reduce the emphasis on wrappers converting human-readable data to machine-readable data, the challenges of semantic integration I’ve mentioned and the need to manage data that is structured at the logical level remain. Furthermore, the machine-learning algorithms developed for extracting data from HTML pages might prove useful for the problem of obtaining semantic mappings.

Farther down the road

Once we can build stand-alone, robust, data-integration systems, we will face the challenge of embedding such systems in more general environments. I illustrate this challenge with two examples. The first concerns extending the interaction with a data-integration system beyond simple query answering. In particular, we should be able to use the data to automate some of the tasks we routinely perform with the data. For example, the system should be able to use the data for everyday information-management tasks, such as managing our personal information (our schedules, for example) and document workflow in organizations, or for alerting different users to important events.

The second example concerns the expectations we have of the data-integration system. It is unlikely that we will be able to answer all user queries fully automatically, because there will always remain sources for which we will have only models or sources whose structure (such as natural-language text) does not enable us to reliably extract the data. Hence, we must develop an environment in which the system cooperates with the user to obtain the answer to the query. Where possible, the system will answer the query completely; in other cases, the system will guide the user to the desirable answers.

References

1. P. Buneman, “Semistructured Data,” Proc. ACM Sigact-Sigmod-Sigart Symp. Principles of Database Systems (PODS), ACM Press, New York, 1997, pp. 117–121.
2. S. Abiteboul, “Querying Semi-Structured Data,” Proc. Int’l Conf. on Database Theory (ICDT), 1997.
3. A.Y. Levy, A. Rajaraman, and J.J. Ordille, “Query Answering Algorithms for Information Agents,” Proc. 11th Nat’l Conf. AI, AAAI Press, Menlo Park, Calif., 1996, pp. 40–47.
4. A.Y. Levy, A. Rajaraman, and J.J. Ordille, “Querying Heterogeneous Information Sources Using Source Descriptions,” Proc. 22nd Int’l Conf. Very Large Databases (VLDB-96), Morgan Kaufmann, San Francisco, 1996, pp. 251–262.
5. A.Y. Levy, “Obtaining Complete Answers from Incomplete Databases,” Proc. 22nd Int’l Conf. Very Large Databases (VLDB-96), Morgan Kaufmann, 1996.
6. D. Florescu, D. Koller, and A.Y. Levy, “Using Probabilistic Information in Data Integration,” Proc. 23rd Int’l Conf. Very Large Databases, Morgan Kaufmann, 1997, pp. 216–225.
7. A.Y. Levy and M.-C. Rousset, “CARIN: A Representation Language Integrating Rules and Description Logics,” Proc. Int’l Description Logics, 1995.
8. O. Etzioni and D. Weld, “A Softbot-Based Interface to the Internet,” Comm. ACM, Vol. 37, No. 7, 1994, pp. 72–76.
9. A.Y. Levy, A. Rajaraman, and J.D. Ullman, “Answering Queries Using Limited External Processors,” Proc. ACM Sigact-Sigmod-Sigart Symp. Principles of Database Systems (PODS), ACM Press, 1996.
10. H. Garcia-Molina et al., “The TSIMMIS Approach to Mediation: Data Models and Languages (Extended Abstract),” Next Generation Information Technologies and Systems (NGITS-95), 1995.
11. O. Etzioni and D. Weld, “A Softbot-Based Interface to the Internet,” Comm. ACM, Vol. 37, No. 7, 1994, pp. 72–76.
12. D. Florescu, A. Levy, and A. Mendelzon, “Database Techniques for the World-Wide Web: A Survey,” Proc. ACM Sigmod-98, ACM Press, 1998.
13. W.W. Cohen, “Integration of Heterogeneous Databases without Common Domains Using Queries Based on Textual Similarity,” Proc. ACM Sigmod-98, ACM Press, 1998, pp. 201–212.
14. M. Fernandez et al., “Catching the Boat with Strudel: Experiences with a Web-site Management System,” Proc. ACM Sigmod-98, ACM Press, 1998.
15. J. Ambros-Ingerson and S. Steel, “Integrating Planning, Execution, and Monitoring,” Proc. 13th Nat’l Conf. AI, AAAI Press, Menlo Park, Calif., 1998, pp. 83–88.


    16                                                                                                                         IEEE INTELLIGENT SYSTEMS




The Ariadne approach to Web-based information integration
Craig A. Knoblock and Steven Minton, University of Southern California

The rise of hyperlinked networks has made a wealth of data readily available. However, the Web’s browsing paradigm does not strongly support retrieving and integrating data from multiple sites. Today, the only way to integrate the huge amount of available data is to build specialized applications, which are time-consuming, costly to build, and difficult to maintain. Mediator technology offers a solution to this dilemma. Information mediators,1–4 such as the SIMS system,5 provide an intermediate layer between information sources and users. Queries to a mediator are in a uniform language, independent of such factors as the distribution of information over sources, the source query languages, and the location of sources. The mediator determines which data sources to use, how to obtain the desired information, how and where to temporarily store and manipulate data, and how to efficiently retrieve information from the sources.

One of the most important ideas underlying information mediation in many systems, including SIMS, is that for each application there is a unifying domain model that provides a single ontology for the application. The domain model ties together the individual source models, which each describe the contents of a single information source. Given a query in terms of the domain model, the system dynamically selects an appropriate set of sources and then generates a plan to efficiently produce the requested data.

Information mediators were originally developed for integrating information in databases. Applying the mediator framework to the Web environment solves the difficult problem of gaining access to real-world data sources. The Web provides the underlying communication layer that makes it easy to set up a mediator system, because it is typically much easier to get access to Web data sources than to the underlying database systems. In addition, the Web environment means that users who want to build their own mediator application need no expertise in installing, maintaining, and accessing databases.

We have developed a Web-based version of the SIMS mediator architecture, called Ariadne.6 In Greek mythology, Ariadne was the daughter of Minos and Pasiphae, who gave Theseus the thread with which to find his way out of the Minotaur’s labyrinth. The Ariadne project’s goal is to make it simple for users to create their own specialized Web-based mediators. We are developing the technology for rapidly constructing mediators to extract, query, and integrate data from Web sources. The system includes tools for constructing wrappers that make it possible to query Web sources as if they were databases and the mediator technology required to dynamically and efficiently answer queries using these sources.

A simple example illustrates how Ariadne can be used to provide access to Web-based sources (also see the “Ariadne” sidebar). Numerous sites, such as Zagats, Fodors, and CuisineNet, provide restaurant reviews, but none are comprehensive, and checking each site can be time-consuming. In addition, information from other Web sources can be useful in selecting a restaurant. For example, the LA County Health Department publishes the health rating of all restaurants in the county, and many sources provide maps showing the location of restaurants. Using Ariadne, we can integrate these sources relatively easily to create an application where people could search these sources to create a map showing the restaurants that meet their requirements.

With such an application, a user could pose requests that would generate a map listing all the seafood restaurants in Santa Monica that have an “A” health rating and whose typical meal costs less than $30. The resulting map would let the user click on the individual restaurants to see the restaurant critic reviews. (In practice, we do not support natural language, so queries are either expressed in a structured query language or are entered through a Web-based graphical user interface.) The integration process that Ariadne facilitates can be complex. For example, placing a restaurant on a map requires the restaurant’s latitude and longitude, which are not usually listed on a review site but can be determined by running an online geocoder, such as Etak, which takes a street address and returns the coordinates.

Figure 3 outlines our general framework. We assume that a user building an application has identified a set of semistructured Web sources he or she wants to integrate. These might be both publicly available sources and a user’s personal sources. For each source, the developer uses Ariadne to generate a wrapper for extracting information from that source. The source is then linked into a global, unified domain model. Once the mediator is constructed, users can query the mediator as if the sources were all in a single database. Ariadne will efficiently retrieve the requested information, hiding the planning and retrieval process details from the user.

Research challenges in Web-based integration

Web sources differ from databases in many significant ways, so we could not simply apply the existing SIMS system to integrate Web-based sources. Here we’ll describe the problems that arise in the Web environment and how we addressed these problems in Ariadne.

Converting semistructured data into structured data. Web sources are not databases, but to integrate sources we must be able to query the sources as if they were. This is done using a wrapper, which is a piece of software that interprets a request (expressed in SQL or some other structured language) against a Web source and returns a structured reply (such as a set of tuples). Wrappers let the mediator both locate the Web pages that contain the desired information and extract the specific data off a page. The huge number of evolving Web sources makes manual construction of wrappers expensive, so we need tools for rapidly building and maintaining wrappers.

For this, we have developed the Stalker inductive-learning system,7 which learns a set of extraction rules for pulling information off a page. The user trains the system by marking up example pages to show the system what information it should extract




from each page. Stalker can learn rules from a relatively small number of examples by exploiting the fact that there are typically “landmarks” on a page that help users visually locate information.

[Figure 3. Architecture for information integration on the Web. Panel “Constructing a mediator”: the application developer performs source modeling and wrapper construction over Web pages, producing models and wrappers. Panel “Using a mediator”: the application user sends queries and receives answers through query planning.]

Consider our restaurant mediator example. To extract data from the Zagats restaurant review site, a user would need to build two wrappers. The first lets the system extract the information from an index page, which lists all of the restaurants and contains the URLs to the restaurant review pages. The second wrapper extracts the detailed data about the restaurant, including the address, phone number, review, rating, and price. With these wrappers, the mediator can answer queries to Zagats, such as “find the price and review of Spago” or “give me the list of all restaurants that are reviewed in Zagats.”

In his companion essay on the Information Manifold, Alon Levy claims that the problem of wrapping semistructured sources will soon be irrelevant because XML will eliminate the need for wrapper construction tools. We believe that he is being overly optimistic about the degree to which XML will solve the wrapping problem. XML clearly is coming; it will significantly simplify the problem and might even eliminate the need for building wrappers for many Web sources. However, the problem of querying semistructured data will not disappear, for several reasons:

• There will always be applications where the providers of the data do not want to actively share their data with anyone who can access their Web page.
• Just as there are legacy Cobol programs, there will be legacy Web applications for many years to come.
• Within individual domains, XML will greatly simplify the access to sources; however, across domains people are unlikely to agree on the granularity at which information should be modeled. For example, for many applications, the mailing address is the right level of granularity to model address, but if you want to geocode an address, it needs to be divided into street address, city, state, and zip code.

Planning to integrate data in the Web environment. Another problem that arises in the Web environment is that generating efficient plans for processing data is difficult. For one, the number of sources to be integrated could be much larger than in the database environment. Also, Web sources do not provide the same processing capabilities found in a typical database system, such as the ability to perform joins. Finally, unlike relational databases, there might be restrictions on how a source can be accessed, such as a geocoder that takes the street address and returns the geographic coordinates, but cannot take the geographic coordinates and return the street address.

Ariadne breaks down query processing into a preprocessing phase and a query-planning phase. In the first phase, the system determines the possible ways of combining the available sources to answer a query. Because sources might be overlapping (an attribute may be available from several sources) or replicated, the system must determine an appropriate combination of sources that can answer the query. The Ariadne source-selection algorithm8 preprocesses the domain model so that the system can efficiently and dynamically select sources based on the classes and attributes mentioned in the query.

In the second phase, Ariadne generates a plan using a method called Planning-by-Rewriting.9,10 This approach takes an initial, suboptimal plan and attempts to improve it by applying rewriting rules. With query planning, producing an initial, suboptimal plan is straightforward; the difficult part is finding an efficient plan. The rewriting process iteratively improves the initial query plan using a local search process that can change both the sources used to answer a query and the order of the operations on the data.

In our restaurant selection example, to answer queries that cover all restaurants, the system would need to integrate data from multiple sources (wrappers) for each restaurant review site and filter the resulting restaurant data based on the search parameters. The mediator would then geocode the addresses to place the data on a map. The plans for performing these operations might involve many steps, with many possible orderings and opportunities to exploit parallelism to minimize the overall time to obtain the data. Our planning approach provides a tractable way to produce large, high-quality information-integration plans.

Providing fast access to slow Web sources. In exploiting and integrating Web-based information sources, accessing and extracting data from distributed Web sources is also much slower than retrieving information from local databases. Because the amount of data might be huge and the remote sources are frequently being updated, simply warehousing all of the data is not usually a practical option. Instead, we are working on an approach to selectively materialize (store locally) critical pieces of data that let the mediator efficiently perform the integration task. The materialized data might be portions of the data from an individual source or the result of integrating data from multiple sources.

To decide what information to store locally, we take several factors into account. First, we consider the queries that have been run against a mediator application. This lets the system focus on the portions of the data that will have the greatest impact on the most queries. Next, we consider both the frequency of updates to the sources and the application’s requirements for getting the most recent information. For example, in the restaurant application, even though reviews might change daily, providing information that is current within a week is probably satisfactory. But, in a

    18                                                                                                                                  IEEE INTELLIGENT SYSTEMS




finance application, providing the latest stock price would likely be critical. Finally, we
consider the sources’ organization and structure. For example, the system can only get the
latitude and longitude from the geocoder by providing the street address. If the application
lets a user request the restaurants located within a region of a map, it could be very expensive
to figure out which restaurants are in that region because the system would need to geocode each
restaurant to determine whether it falls within the region. Materializing the restaurant
addresses and their corresponding geocodes avoids a costly lookup.

   Once the system decides to materialize a set of information, the materialized data becomes
another information source for the mediator. This meshes well with our mediator framework
because the planner dynamically selects the sources and the plans that can most efficiently
produce the requested data. In the restaurant example, if the system decides to materialize
address and geocode, it can use the locally stored data to determine which restaurants could
possibly fall within a region for a map-based query.

Ariadne
   This Restaurant Location application of Ariadne, shown in the first image, integrates data
from a variety of sources, including restaurant review sites, health ratings, geocoders, and
maps.
   In response to a query for all highly rated restaurants in Santa Monica with an ‘A’ health
rating, the mediator finds the restaurants that satisfy the query by extracting the data
directly from the relevant Web sites.
   The mediator also produces a map of the restaurants (second image) by converting the street
addresses into latitude and longitude coordinates using an online geocoder.
   Each point on the map in the second image is clickable. Selecting the point for Chinois on
Main returns the detailed restaurant review directly from the appropriate restaurant review
site (third image).

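The materialization trade-off described above can be sketched in a few lines of Python. This is
only an illustration under stated assumptions, not Ariadne's implementation: the function names
(slow_geocode, materialize, in_region), the street addresses, and the coordinates are all
hypothetical, and the remote geocoder is simulated by a dictionary lookup that counts its calls.
The point is that once addresses and geocodes are stored locally, a map-region query no longer
needs one remote lookup per restaurant.

```python
# Hypothetical stand-in for a remote geocoder that only accepts a street address.
# Addresses and coordinates are invented for illustration.
GEOCODES = {
    "100 Main St": (34.00, -118.48),
    "200 Ocean Ave": (34.01, -118.49),
    "300 Sunset Blvd": (34.10, -118.33),
}
remote_calls = 0


def slow_geocode(address):
    """Simulate one expensive remote lookup per address."""
    global remote_calls
    remote_calls += 1
    return GEOCODES[address]


def materialize(addresses):
    """Geocode every address once and keep the results locally."""
    return {addr: slow_geocode(addr) for addr in addresses}


def in_region(table, south, north, west, east):
    """Answer a map-region query from the materialized table only."""
    return sorted(
        addr
        for addr, (lat, lon) in table.items()
        if south <= lat <= north and west <= lon <= east
    )


table = materialize(GEOCODES)  # one-time cost: one remote call per address
hits = in_region(table, 33.9, 34.05, -118.6, -118.4)
print(hits, remote_calls)
```

Repeated region queries against `table` leave `remote_calls` unchanged, which is the saving the
text attributes to materialization. How often the table must be refreshed depends on the update
frequency of the sources, as discussed above.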
Resolving naming inconsistencies across sources. Within a single site, entities (such as
people, places, countries, or companies) are usually named consistently. However, across sites,
the same entities might be referred to with different names. For example, one restaurant review
site might refer to a restaurant as Art’s Deli and another site might call it Art’s
Delicatessen. Or, one site might use California Pizza Kitchen and another site could use the
abbreviation CPK. To make sense of data that spans multiple sites, our system must be able to
recognize and resolve these differences.

   In our approach, we select a primary source for an entity’s name and then provide a mapping
from that source to each of the other sources that use a different naming scheme. The Ariadne
architecture lets us represent the mapping itself as simply another wrapped information source.
Specifically, we can create a mapping table, which specifies for each entry in one data source
what the equivalent entity is called in another data source. Alternatively, if the mapping is
computable, Ariadne can represent the mapping by a mapping function, which is a program that
converts one form into another form.

   We are developing a semi-automated method for building mapping tables and functions by
analyzing the underlying data in advance. The basic idea is to use information-retrieval
techniques, such as those described in William Cohen’s companion essay, to provide an initial
mapping,11 and then use additional data in the sources to resolve any remaining ambiguities via
statistical learning methods.12 For example, restaurants are best matched up by considering
name, street address, and phone number, but not by using a field such as city because a
restaurant in Hollywood could be listed as either being in Hollywood or Los Angeles and
different sites list them differently.

The future of Web-based integration
   As more and more data becomes available, users will become increasingly less satisfied using
existing search engines that return massive quantities of mostly irrelevant information.
Instead, the Web will move toward more specialized content-based applications that do more than
simply return documents. Information-integration systems such as Ariadne will help users rapidly
construct and extend their own Web-based applications out of the huge quantity of data available
online.

   While information integration has made tremendous progress over the last few years,13 many
hard problems still must be solved. In particular, two mostly overlooked problems deserve more
attention:

• Coming up with the models or source descriptions of the information sources, a time-consuming
  and difficult problem that is largely performed by hand today.
• Automatically locating and integrating new sources of data, which would be enabled by
  solutions to the first problem. (This problem has been addressed in limited domains, such as
  Internet shopping,14 but the problem is still largely unexplored.)

   For more information on the Ariadne

SEPTEMBER/OCTOBER 1998                                                                                                                    19




project and example applications that were built using Ariadne, see the Ariadne homepage at
http://www.isi.edu/ariadne.

References
 1. G. Wiederhold, “Mediators in the Architecture of Future Information Systems,” Computer,
    Vol. 25, No. 3, Mar. 1992, pp. 38–49.
 2. H. Garcia-Molina et al., “The Tsimmis Approach to Mediation: Data Models and Languages,”
    J. Intelligent Information Systems, 1997.
 3. A.Y. Levy, A. Rajaraman, and J.J. Ordille, “Querying Heterogeneous Information Sources
    Using Source Descriptions,” Proc. 22nd Very Large Databases Conf., Morgan Kaufmann,
    San Francisco, 1996, pp. 251–262.
 4. M.R. Genesereth, A.M. Keller, and O.M. Duschka, “Infomaster: An Information Integration
    System,” Proc. ACM Sigmod Int’l Conf. Management of Data, ACM Press, New York, 1997,
    pp. 539–542.
 5. Y. Arens, C.A. Knoblock, and W.-M. Shen, “Query Reformulation for Dynamic Information
    Integration,” J. Intelligent Information Systems, Special Issue on Intelligent Information
    Integration, Vol. 6, Nos. 1 and 3, 1996, pp. 99–130.
 6. C.A. Knoblock et al., “Modeling Web Sources for Information Integration,” Proc. 11th Nat’l
    Conf. Artificial Intelligence, AAAI Press, Menlo Park, Calif., 1998, pp. 211–218.
 7. I. Muslea, S. Minton, and C.A. Knoblock, “Stalker: Learning Extraction Rules for
    Semistructured Web-Based Information Sources,” Proc. 1998 Workshop AI and Information
    Integration, AAAI Press, 1998, pp. 74–81.
 8. J.L. Ambite et al., Compiling Source Descriptions for Efficient and Flexible Information
    Integration, tech. report, Information Sciences Institute, Univ. of Southern California,
    Marina del Rey, Calif., 1998.
 9. J.L. Ambite and C.A. Knoblock, “Planning by Rewriting: Efficiently Generating High-Quality
    Plans,” Proc. 14th Nat’l Conf. Artificial Intelligence, AAAI Press, 1997, pp. 706–713.
10. J.L. Ambite and C.A. Knoblock, “Flexible and Scalable Query Planning in Distributed and
    Heterogeneous Environments,” Proc. Fourth Int’l Conf. Artificial Intelligence Planning
    Systems, AAAI Press, 1998, pp. 3–10.
11. W.W. Cohen, “Integration of Heterogeneous Databases without Common Domains Using Queries
    Based on Textual Similarity,” Proc. ACM Sigmod-98, ACM Press, 1998, pp. 201–212.
12. T. Huang and S. Russell, “Object Identification in a Bayesian Context,” Proc. 15th Int’l J.
    Conf. AI, Morgan Kaufmann, 1997, pp. 1276–1283.
13. Proc. 1998 Workshop on AI and Information Integration, AAAI Press, 1998.
14. R.B. Doorenbos, O. Etzioni, and D.S. Weld, “A Scalable Comparison-Shopping Agent for the
    World-Wide Web,” Proc. First Int’l Conf. Autonomous Agents, AAAI Press, 1997, pp. 39–48.


The Whirl approach to information integration
William W. Cohen, AT&T Labs-Research

   Search engines such as AltaVista and portal sites such as Yahoo! help us find useful online
information sources. What we need now are systems to help use this information effectively.
Ideally, we would like programs that answer a user’s questions based on information obtained
from many different online sources. We call such a program an information-integration system,
because to answer questions it must integrate the information from the various sources into a
single, coherent whole.

   For example, consider consumer information about computer games. Many Web sites contain
information of this sort. As this essay will show, in addition to the obvious benefit of
reducing the number of sites a user must visit, integrating this information has several
important and nonobvious advantages.

   One advantage is that often, more questions can be answered using the integrated information
than using any single source. Consider two sources containing slightly different information:
one source categorizes games into children’s games and adult games, and another categorizes
games into arcade games, puzzle games, and adventure games. In this case, the sources must be
integrated to find, say, a list of children’s adventure games. Conversely, integration can help
exploit overlap among sources; for instance, one might be interested in finding games that three
or more sources have rated highly, or in reading several independent reviews of a particular
game.

   A second and more important advantage of integration is that making it possible to combine
information sources also makes it possible to decompose information so as to represent it in a
clean, modular way. For example, suppose we wished to create a Web site providing some new sort
of information about computer games (say, information about which games work well on older,
slower machines). The simplest way of representing this information is extensionally, as a list
of games having this property. By itself, however, such a list is not very valuable to end
users, who are probably interested in games that not only work on their PC, but also satisfy
other properties, such as being inexpensive or well-designed. To make the list more useful, we
are tempted to add additional structure; for instance, we might organize the games in the list
into categories and provide, for each game, links to online resources, such as pricing
information and reviews.

   From the standpoint of computer science, augmenting the list of games in this way is clearly
a bad idea, because it leads to a structure that lacks modularity. The original structure was a
static, easily maintained list of computer games. In the augmented hypertext, this information
is intermixed with orthogonal information about game categories, possibly ephemeral information
concerning the organization of external Web sites, and possibly incorrect assumptions about the
readers’ goals. The resulting structure is hard to maintain and hard to modify in certain
natural ways, such as by changing the set of categories used to organize the list of games.

   To summarize, the simple, modular encoding of this information will be difficult for users to
exploit, and the easy-to-use encoding will be difficult to create, modify, and maintain. By
contrast, it is trivial to encode this information in a relational database in a manner that is
both modular and useful: we simply create a relation listing all old-PC-friendly games, and
standard query languages let users find, say, reviews of inexpensive old-PC-friendly arcade
games. (This example assumes that information about game prices and reviews is also available in
the database.) Relational databases thus provide a more modular encoding of the information.

   Unfortunately, conventional databases assume information is stored locally, in a consistent
format, rather than externally, in diverse formats, as is the case with information on the Web.
Hence they do not solve the problem of organizing information on the Web. To use modular,
maintainable representations for information, while still exploiting the power of the Web (its
distributed nature, large size, and broad scope), we need practical ways of integrating
information from diverse sources.

Why integrating information is hard
   Unfortunately, integrating information from multiple sources is very hard. One difficulty is
programming a computer to understand the various information sources well enough to answer
questions about them. Surprisingly, this is often difficult even when information is presented
in simple, easy-to-parse regular structures such as lists and tables.

   As an example, Figure 4 shows a tabular representation of the information in two hypothetical
Web sites. Consider the knowledge an integration system would need to answer the following
question using these information sources:

  Who publishes “Escape from Dimension Q” and where is their home page?

(In this essay, we assume that questions are given to the information-integration system in a
formal language; for readability, however, we’ll paraphrase questions in English whenever
possible.)

    GAME TITLE                          PUBLISHER

    Aladdin Activity Center             Disney Interactive
    Arthur’s Computer Adventure         Living Books/Broderbund
    Escape from Dimension Q             Headbone Interactive
    How the Leopard Got His Spots       Microsoft Kids

 (a)

    GAME PUBLISHER                      HOME PAGE

    Disney                              http://www.disneyinteractive.com
    Headbone                            http://www.headbone.com
    Humongous                           http://www.humongous.com
    Broderbund                          http://www.broderbund.com
    Microsoft                           http://www.microsoft.com

 (b)

Figure 4. Two typical information sources: (a) Web site 1 and (b) Web site 2.
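
To see why this question is harder than it looks, the following sketch loads the two tables of
Figure 4 into an in-memory SQLite database and runs an ordinary exact-match join. It is an
illustration only: the relation names game(name, pubName) and publisher(name, homepage) are
chosen here for convenience, and SQLite stands in for any conventional relational database.

```python
import sqlite3

# Build the two relations of Figure 4 in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE game (name TEXT, pubName TEXT);
    CREATE TABLE publisher (name TEXT, homepage TEXT);
""")
conn.executemany("INSERT INTO game VALUES (?, ?)", [
    ("Aladdin Activity Center", "Disney Interactive"),
    ("Arthur's Computer Adventure", "Living Books/Broderbund"),
    ("Escape from Dimension Q", "Headbone Interactive"),
    ("How the Leopard Got His Spots", "Microsoft Kids"),
])
conn.executemany("INSERT INTO publisher VALUES (?, ?)", [
    ("Disney", "http://www.disneyinteractive.com"),
    ("Headbone", "http://www.headbone.com"),
    ("Humongous", "http://www.humongous.com"),
    ("Broderbund", "http://www.broderbund.com"),
    ("Microsoft", "http://www.microsoft.com"),
])

# Conventional join: publisher names must be identical strings.
exact = conn.execute("""
    SELECT publisher.name, publisher.homepage
    FROM game JOIN publisher ON game.pubName = publisher.name
    WHERE game.name = 'Escape from Dimension Q'
""").fetchall()

# The same join without the title filter, to count how many games match at all.
any_match = conn.execute("""
    SELECT game.name FROM game JOIN publisher ON game.pubName = publisher.name
""").fetchall()

print(exact, any_match)
```

Both result lists are empty: no publisher string in the first table is identical to a publisher
string in the second, so string equality alone cannot connect “Headbone Interactive” to
“Headbone”, even though a human reader makes the connection immediately.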
   To answer this question, the system must have knowledge of several kinds:

• It must know where to find these tables on the Web, and how they are formatted (access
  knowledge).
• It must know that each tuple 〈x,y〉 in the table Website-1 should be interpreted as the
  statement “the company y publishes the game x,” and that each tuple 〈t,u〉 in the table
  Website-2 should be interpreted as the statement “the home page for the company t is found at
  the URL u” (semantic knowledge).
• Finally, it must know that the string “Headbone” in Website-2 refers to the same company as
  the string “Headbone Interactive” in Website-1 (object-identity knowledge).

   Even given all this knowledge, many interesting technical problems remain; however, the
technical difficulties involved in using these types of knowledge pale beside the practical
difficulties of acquiring the knowledge. Currently, all this knowledge must be manually provided
to the integration system and updated whenever the original information sources change.
Performing information integration is thus extremely knowledge-intensive and hence expensive in
terms of human time.

   Of course, many of these problems can be “assumed away”: integrating information sources is
not nearly so difficult if they use common object identifiers, adopt a common data format, and
use a known ontology. Unfortunately, few existing information sources satisfy these assumptions.
The vast majority of existing online sources are designed to communicate only with human
readers, not with other programs. We believe that this will continue to hold true, simply
because presenting information to a human audience is less demanding for the information
provider: information intended for a human audience need not conform to some externally set
formal standard; it only has to be comprehensible to a reader.

The Whirl approach to information integration
   We have written a system for information integration called Whirl. The approach to
integration embodied in Whirl is based on two premises:

• It is unreasonable to assume that all the knowledge needed for information integration will be
  present, and in any case it is impractical to encode this information explicitly.
  Consequently, inferences made by an integration system are inherently incomplete and
  uncertain. As in machine learning, speech recognition, and information retrieval, the
  integration system will have to take some chances and make some mistakes. An integration
  system thus must have ways of reasoning with uncertain information, and communicating to the
  user its confidence in an answer.
• Information integration should exploit the existing human-oriented interface to information
  sources as much as possible. It should, whenever possible, understand information using
  general techniques, analogous to the ones people use, rather than relying on externally
  provided, problem-specific knowledge. For instance, people have no difficulty recognizing the
  structures in Figure 4 as two-column tables; thus a good information-integration system should
  also be able to recognize such structures. (Although general table-recognition methods
  exist,2 to our knowledge, no existing Web-based integration system uses them.) Similarly, most
  people would judge it likely that “Headbone” and “Headbone Interactive” denote the same
  company (or closely related ones), but would consider it unlikely that “Disney Interactive”
  and “Microsoft” do; an integration system should be able to make a similar judgement, even
  without knowledge of the domain.

   Of the many mechanisms required by such an integration system, we have chosen to concentrate
(initially) on general methods for integrating information without object-identity knowledge. In
most integration tasks, far more object-identity knowledge is needed than any other type of
knowledge; while semantic knowledge and access knowledge might be needed for each source, a
system potentially needs object-identity knowledge about each pair of constants in the
integrated database.

   Our approach for dealing with uncertain object identities relies on the observation that
information sources tend to use textually similar names for the same real-world objects. This is
particularly true when sources are presenting information to people of similar background in
similar contexts. To exploit this, the Whirl query language allows users to formulate SQL-like
queries about the similarity of names. Consider again the tables of Figure 4. Assuming that the
table Website-1 is encoded as a relation with schema game(name, pubName), and Website-2 is
encoded as a relation with schema publisher(name, homepage), the question





Table 1. Output of a Whirl query pairing paragraphs of free text and names of computer games.
The score is the similarity of the last two columns, normalized to a range of 0–100, and the
checkmark indicates if the pairing is correct.

SCORE    DEMO.NAME                                                 GAME.NAME

80.26    Ubi Software has a demo of Amazing Learning               Amazing Learning              √
         Games with Rayman.                                        Games with Rayman
78.25    Interplay has a demo of Mario Teaches Typing. (PC)        Mario Teaches Typing          √
75.91    Warner Active has a small interactive demo for            Where’s Waldo? Exploring      √
         Where’s Waldo at the Circus and Where’s Waldo?            Geography
         Exploring Geography. (Mac and Win)
74.94    MacPlay has demos of Marios Game Gallery                  Mario Teaches Typing          √
         and Mario Teaches Typing. (Mac)
71.56    Interplay has a demo of Mario Teaches Typing. (PC)        Mario Teaches Typing 2        √
68.54    MacPlay has demos of Marios Game Gallery                  Mario Teaches Typing 2        √
         and Mario Teaches Typing. (Mac)
68.45    Psygnosis has an interactive demo for                     Lemmings Paintball            √
         Lemmings Paintball. (Win95)
65.70    ICONOS has a demo of What’s The Secret?                   What’s the Secret?            √
         Volume 1. (Mac and Win)
64.33    Fox Interactive has a fully working demo version          Simpsons Cartoon Studio       √
         of the Simpsons Cartoon Studio. (Win and Mac)
62.90    Gryphon Software has demos of Gryphon                     Gryphon Bricks                √
         Bricks, Colorforms Computer Fun Set—Power
         Rangers and Sailor Moon, and a FREE Gryphon
         Bricks Screen Saver. (Mac and Win)
60.30    Vividus Software has a free 30 day demo of                Web Workshop                  √
         Web Workshop (Web-authoring package for kids!).
         (Win 95 and Mac)
59.96    Conexus has two shockwave demos—Bubbleoids                Super Radio Addition          √
         (from Super Radio Addition with Mike and Spike)           with Mike & Spike
         and Hopper (from Phonics Adventure with Sing
         Along Sam).

ly speaking, two names are similar according to this metric if they share terms, where a term is
a word stem, and names are considered more similar if they share more terms, or if the shared
terms are rare. As an example, “Disney Interactive” and “Disney” would be more similar than
“Disney Interactive” and “Headbone Interactive,” because “Interactive” is a more common term
than “Disney.” These similarity metrics are not well understood formally, but are well supported
experimentally.

   Whirl also builds on ideas from artificial intelligence. To find the best K answers to a
query, we use a variant of A* search,3,4 coupled with inverted-index techniques developed in the
information-retrieval community.5 In combination, these techniques allow Whirl to find the best
K answers to a query fairly quickly, even when the universe of possible answers is extremely
large.

What Whirl has accomplished
   Using a search-engine-like interface (in which possible answers come in a ranked list) lets
us evaluate Whirl in the same way that
         Who publishes “Escape from Dimension Q”               from the Cartesian product of the game and          information retrieval researchers evaluate
         and where is their home page?                         publisher relations. Whirl scores an-               search engines. In particular, given informa-
    might be encoded as the Whirl query                        swers according to how well they satisfy            tion about which of Whirl’s proposed an-
                                                               the conditions in the WHERE part of the             swers are correct, we can evaluate Whirl
    SELECT      publisher.name,                                query: each similarity condition gets a             using metrics such as recall and precision. We
                publisher.homepage                             score between zero and one, each Boolean            evaluated Whirl on a number of benchmark
    FROM        game, publisher                                condition receives a score of either zero or        problems from several different domains,
    WHERE       (game.pubName ~ publisher.                     one, and Whirl combines these primitive             using the measure of noninterpolated aver-
                name AND game.name ~                           scores as if they were independent proba-           age precision. (Roughly speaking, this aver-
                “Escape from Dimension Q”)                     bilities to obtain a score for the entire           ages the best level of precision obtained at
                                                               WHERE clause. For the query above, the              each distinct recall level. The highest possible
    Here ~ is a similarity operator, and thus the              score of a tuple 〈u,v,x,y〉 is the product of        value for this measure is 100%.) We discov-
    query asks Whirl to find a tuple 〈u,v〉 from                the similarity of y to u and the similarity of      ered that the off-the-shelf similarity metric
    publisher such that for some tuple 〈x,y〉                   x to “Escape from Dimension Q.”                     we adopted is surprisingly accurate. On 14 of
    from game, y is textually similar to u, and x                 In typical use, Whirl returns only the K         18 benchmark problems, average precision is
    is textually similar to the string “Escape                 highest-scoring answers, where K is a para-         90% or higher; on seven of the 18 problems,
    from Dimension Q.” Such a pair is a plausi-                meter set by the user. From the user’s per-         average precision is 99% or higher.
    ble answer to the query, although not nec-                 spective, interacting with the system is thus          Intriguingly, good performance can
    essarily a correct one.                                    much like interacting with a search engine:         often come even when the names from one
       This query language is central to our                   the user requests the first K answers, ex-          or both sources are embedded in extrane-
    approach, so we will describe it in some                   amines them, and then requests more if              ous text. Table 1 presents the first few an-
    detail. The query language has a “soft”                    necessary.                                          swers for the query
    semantics; the answer to such a query is not                  Semantically, then, the Whirl query lan-
    the set of all tuples that satisfy the query,              guage is quite simple—but as in many en-            SELECT          demo.name,game.name
    but a list of tuples, each of which is consid-             terprises, “the devil is in the details.” To        FROM            demo,game
    ered a plausible answer to the query and                   make this idea work well, we needed to              WHERE           demo.name ~ game.name
    each of which is associated with a numeric                 adopt ideas from several research commu-
    score indicating its perceived plausibility.               nities. Our system computes the similarity          for a problem in which the names in demo
       The universe of possible answers is de-                 of two names using cosine distance in the           are embedded in arbitrary paragraph-long
    termined by the FROM part of the query; in                 vector space model, a metric widely used            passages of free text. As the table’s last col-
    the example above, possible answers come                   in statistical-information retrieval.1 Rough-       umn shows, most top-ranked pairings are

    22                                                                                                                                     IEEE INTELLIGENT SYSTEMS
                                                                                                                                                               .




correct, and the complete ranking of answers
Whirl proposed has a respectable average
precision of 67%. Whirl’s robustness to ex-
traneous noise words means that we can
afford to use approximate methods of ex-
tracting of data from information sources.
   We have also built several nontrivial
integrated-information systems using
Whirl. The domain of the first is children’s
computer games. This application inte-
grates information from 16 Web sites.
Using the HTML form interface shown in
Figure 5, users can construct questions like
the following:

  Help me find reviews of games that are in the
  category “art,” are recommended by two or
  more sites, and are designed for children six
  years old.

The application knows how to find reviews,
demos, and vendors of games, and also
understands several properties of games,           Figure 5. The interface to an information-integration system based on Whirl.
such as which games are popular and who
publishes which games.
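
To make the scoring behind a "who publishes which game" question concrete, the following is a minimal Python sketch of the rule described earlier: each similarity condition gets a score between zero and one, the scores are multiplied as if they were independent probabilities, and only the K best answers are returned. The sketch also computes noninterpolated average precision for a ranked list. It is an illustration under stated assumptions, not Whirl's implementation: the toy rows, the function names, and the particular TF-IDF weighting are all hypothetical, and the brute-force loop over the Cartesian product is used only for clarity.

```python
import math
import re
from collections import Counter

# Hypothetical toy relations; the rows are invented for illustration.
GAME = [  # (name, pubName)
    ("Escape from Dimension Q", "Fictional Games Inc."),
    ("Bird Puzzle Party", "Heron Software"),
]
PUBLISHER = [  # (name, homepage)
    ("Fictional Games", "http://example.com/fictional"),
    ("Heron Software Ltd", "http://example.com/heron"),
]


def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_idf(texts):
    """Smoothed inverse document frequency (an assumed weighting scheme)."""
    n = len(texts)
    df = Counter(term for t in texts for term in set(tokenize(t)))
    return {term: math.log(1 + n / count) for term, count in df.items()}


def unit_vector(text, idf):
    """Unit-length TF-IDF vector; terms missing from idf are ignored."""
    tf = Counter(tokenize(text))
    w = {t: (1 + math.log(c)) * idf[t] for t, c in tf.items() if t in idf}
    norm = math.sqrt(sum(x * x for x in w.values()))
    return {t: x / norm for t, x in w.items()} if norm else {}


def similarity(a, b, idf):
    """Cosine similarity of two strings, clamped to at most one."""
    va, vb = unit_vector(a, idf), unit_vector(b, idf)
    return min(1.0, sum(x * vb.get(t, 0.0) for t, x in va.items()))


def top_k_publishers(game, publisher, title, k):
    """Score every <u,v,x,y> combination and keep the k best answers."""
    texts = [title] + [f for row in game for f in row]
    texts += [name for name, _ in publisher]
    idf = build_idf(texts)
    scored = []
    for pub_name, homepage in publisher:      # tuple <u, v>
        for game_name, game_pub in game:      # tuple <x, y>
            # Product of the two similarity conditions in the WHERE clause.
            score = (similarity(game_pub, pub_name, idf)
                     * similarity(game_name, title, idf))
            scored.append((score, pub_name, homepage))
    scored.sort(key=lambda row: row[0], reverse=True)
    return scored[:k]


def average_precision(judgments, total_correct):
    """Noninterpolated average precision of ranked True/False judgments."""
    hits, precision_sum = 0, 0.0
    for rank, correct in enumerate(judgments, start=1):
        if correct:
            hits += 1
            precision_sum += hits / rank  # precision at this recall level
    return precision_sum / total_correct if total_correct else 0.0


if __name__ == "__main__":
    for row in top_k_publishers(GAME, PUBLISHER, "Escape from Dimension Q", 2):
        print(row)
```

Because the condition scores are multiplied, an answer that fails either similarity condition completely receives a score of zero and sinks to the bottom of the ranking, while partial matches on both conditions still surface near the top.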
We have built a similar system that integrates information about North American birds. Collectively, the integrated databases contain about 100,000 tuples, about 10,000 of which point to external Web pages. Both systems are available on the Web (at http://whirl.research.att.com/cdroms/ and http://whirl.research.att.com/birds/). The response time for complex queries is typically less than 10 seconds. (These time measurements are on a lightly loaded Sun Ultra 2170 with 167-MHz processors and 512 Mbytes of main memory. The current server is not multithreaded, so response times vary greatly with load.)

In building these applications, we deliberately sidestepped many of the problems that have historically been research issues in information integration. Whirl data is not semistructured,[6] but instead is stored in simple relations. The problem of query planning[7] is simplified by collecting all data with spiders offline. We map individual information sources into a global schema using manually constructed views, rather than using more powerful methods.[8,9] Access knowledge is represented in hand-coded extraction programs,[10] rather than learned by example as proposed by Nicholas Kushmerick and others.[11–13] We made these decisions to highlight the advantages of our approach, relative to earlier integration methods: by adopting an uncertain approach to integration, significant applications can be built, even without the aid of other advanced techniques.

The two implemented integration applications illustrate several other important points. First, in the games application, Whirl extracts information about the age range for which games are appropriate from a commercial site; this information can then be used to access a collection of reviews taken from several consumer-oriented sites. The age-range information has, in some sense, been made portable; it has been disassociated from the site that provided it and used for a goal different from its intended purpose (of improving access to a large online catalog). Attaining this sort of modularity and portability of information was one of Whirl's primary goals.

Second, integration need not require complex query interfaces to be useful. As well as providing a query-based interface, the bird application allows data about birds to be browsed, either geographically or by scientific order and family. This browsing interface extends the capability of the original sources, which are seldom organized along both of these dimensions. The bird application also includes a quick-search feature, in which the user types in the name of a bird and gets a list of URLs in response. As an example, in response to a quick search for "great blue heron," Whirl's answer includes a picture indexed only as ardea herodias, the scientific name of the great blue heron. This sort of intelligent behavior is enabled by translating the user's quick search into a structured query that exploits an information source giving the scientific nomenclature for birds.

The future of integration

Whirl's current implementation could be extended in many ways. Challenging technical issues include scaling up to larger data sets (the current implementation is memory-based), finding more flexible and more automatic extraction methods, learning to improve scores based on feedback of various kinds, and collecting data efficiently at query time. We will conclude, however, with some more general remarks.

One possible goal for computer science research is the construction of an information system with size and scope comparable to the Web, but with abilities comparable to a knowledge base. In particular, we would like a system that can reason about and understand information that, like information found on the Web, is constructed and maintained in a decentralized fashion. This is a very hard problem and perhaps a very distant goal; however, Whirl represents an important step toward that goal.

Previous systems that access information from multiple sources fall into two main classes. Search engines provide weak and relatively unstructured access to a large number of sites. Previous Web-based information integration systems provide better access to a small number of highly structured sites. Whirl's emphasis on inexact, uncertain integration provides an intermediate step between these two extremes.

The intermediate step of cheap, approximate information integration is a critical one. It is unreasonable to expect an information provider whose primary audience is people to spend much time and energy in making his or her information programmatically available unless there is a clear and immediate benefit. Unfortunately, while integration does provide a benefit, this benefit does not materialize until a number of sources are integrated. In economic terms, the value of having a well-structured, easily integrated information source is largely external, leading to a classic chicken-and-egg problem. The availability of cheap, approximate integration methods could help to overcome this problem.

Let us close with an analogy. Information integration can be viewed as the problem of getting information sources to talk to each other. Our approach can be viewed as getting information sources to talk to each other in an informal way. We hope that this kind of informal communication will retain much of the utility of formal communication, but be far easier to attain, just as informal essays like this one can, without tedious technical detail, communicate the essence of a new technical result.

References

1.  G. Salton, ed., Automatic Text Processing, Addison Wesley, Reading, Mass., 1989.
2.  D. Rus and D. Subramanian, "Customizing Capture and Access," ACM Trans. Information Sys., Vol. 15, No. 1, 1997, pp. 67–101.
3.  N. Nilsson, Principles of Artificial Intelligence, Morgan Kaufmann, San Francisco, 1987.
4.  J. Pearl, Heuristics: Intelligent Search Strategies for Computer Problem Solving, Addison-Wesley, 1984.
5.  W.W. Cohen, "Integration of Heterogeneous Databases without Common Domains Using Queries Based on Textual Similarity," Proc. ACM Sigmod-98, ACM Press, New York, 1998, pp. 201–212.
6.  D. Suciu, ed., Proc. Workshop on Management of Semistructured Data; http://www.research.att.com/~suciu/workshop-papers.html.
7.  J.L. Ambite and C.A. Knoblock, "Planning by Rewriting: Efficiently Generating High-quality Plans," Proc. 14th Nat'l Conf. AI, AAAI Press, Menlo Park, Calif., 1997, pp. 706–713.
8.  A.Y. Levy, A. Rajaraman, and J.J. Ordille, "Querying Heterogeneous Information Sources Using Source Descriptions," Proc. 22nd Int'l Conf. Very Large Databases (VLDB-96), Morgan Kaufmann, 1996, pp. 251–262.
9.  O.M. Duschka and M.R. Genesereth, "Answering Recursive Queries Using Views," Proc. 16th ACM Sigact-Sigmod-Sigart Symp. Principles of Database Systems (PODS-97), ACM Press, 1997, pp. 109–116.
10. W.W. Cohen, "A Web-Based Information System that Reasons with Structured Collections of Text," Proc. Second Int'l Conf. Autonomous Agents, ACM Press, New York, 1998, pp. 400–407.
11. N. Kushmerick, D.S. Weld, and R. Doorenbos, "Wrapper Induction for Information Extraction," Proc. 15th Int'l J. Conf. AI, AAAI Press, 1997, pp. 729–735.
12. C.A. Knoblock et al., "Modeling Web Sources for Information Integration," Proc. 15th Nat'l Conf. AI (AAAI-98), AAAI Press, 1998, pp. 211–218.
13. C.-N. Hsu, "Initial Results on Wrapping Semistructured Web Pages with Finite-State Transducers and Contextual Rules," Papers from the 1998 Workshop on AI and Information Integration, AAAI Press, 1998, pp. 66–73.



						