       Information Integration across Heterogeneous Domains: Current
            Scenario, Challenges and the InfoMosaic Approach

                   Aditya Telang and Sharma Chakravarthy

                         Technical Report CSE-2007
         IT Laboratory & Department of Computer Science & Engineering
            The University of Texas at Arlington, Arlington, TX 76019.
                              {telang, sharma}
         Today, information retrieval and integration has assumed a totally different, complex con-
     notation than what it used to be. The advent of the Internet, the proliferation of information
     sources, the presence of structured, semi-structured, and unstructured data - all of these have added
     new dimensions to the problem of information retrieval and integration as known earlier. From
     the time of distributed databases leading to heterogeneous, federated, and multi-databases, re-
     trieval and integration of heterogeneous information has been an important problem for which
     a complete solution has eluded researchers for a number of decades. Techniques such as global
     schemas, schema integration, dealing with multiple schemas, domain specific wrappers, and
     global transactions have produced significant steps but never reached the stage of maturity for
     large scale deployment and usage. Currently, the problem is even more complicated as repos-
     itories exist in various formats (HTML, XML, Web Databases with query-interfaces, ...) and
     schemas, and both the content and the structure are changing autonomously. In this survey
     paper, we describe the general problem of information retrieval and integration and discuss
     the challenges that need to be addressed to deal with the general problem of information re-
     trieval and integration as it pertains to data sources we encounter today. As the number of
     repositories/sources increases in an uncontrolled manner, there is no other option but to
     find a solution for integrating information from different autonomous sources as needed for a
     search/query whose (partial) answers have to be retrieved and integrated from multiple sources.
     We survey the current work, identify challenges that need to be addressed, and present InfoMo-
     saic, a framework we are currently designing to handle the challenge of multi-domain integration
     of data on the web.

1    Introduction
The volume of information accessible via the web is staggeringly large and growing rapidly. The cov-
erage of information available from web sources is difficult to match by any other means. Although
the quality of information provided by these sources tends to vary from one to another (sometimes
drastically), it is easy to search these repositories and obtain information of interest. Hence, over
the last decade or so, search engines (e.g. Google, Yahoo, etc.) [1] have become extremely popular
and have enabled users to quickly and effortlessly retrieve information from individual sources.
Conceptually, search engines only perform the equivalent of a simple lookup operation on one or
more keywords, followed by a sophisticated ranking of the results [2]. There exist several advanced
search mechanisms (or meta-search engines) that post-process the output of normal search engines
to organize and classify the sources in a meaningful manner (e.g., Vivisimo [3]). Additionally,
question-answering frameworks (e.g., START [4]) that are language-based retrieval mechanisms,
address the issues of providing detailed information in response to queries on varied concepts (e.g.,
geography, science, etc.) expressed in English (or other) language. Furthermore, several commercial

web-portals have been designed to
operate on individual domains (e.g., airlines, book purchase, hotels, etc.) or a set of pre-determined
sources for searching multiple repositories belonging to the same domain1 of interest and provide
results based on some user criteria, such as cost, schedule, proximity, etc. In summary, efficient
and effective search and meta-search engines are available that do specific tasks very well.
    These retrieval mechanisms are widely used mainly because of their simplicity and absence of
a learning curve as compared to other forms of information retrieval. This simplicity, on the other
hand, makes it difficult to specify queries that require extraction of data from multiple repositories
across diverse domains or results that are relevant to a context. Currently, there are no retrieval
mechanisms that support framing queries that retrieve results from multiple documents and
sources and combining them in meaningful ways to produce the desired result. For example,
consider the query: “Retrieve all castles within 2 hours by train from London”. Although all the
information for answering different parts of this query is available on the Web, it is currently not
possible to frame it as a single query and get a comprehensive set of relevant answers. The above
example underlines the “Tower of Babel ” problem for integrating information to answer a query
that requires data from multiple independent sources to be combined intelligently. The islands of
information that we are experiencing now are not very different from the islands of automation seen
earlier. This gap needs to be bridged to move from current search and meta-search approaches to
true information integration.
    It is important to understand that information integration is not a new problem. It existed
in the form of querying distributed, heterogeneous, multiple and federated databases. What has
really changed in the last decade is the complexity of the problem (types/models of data sources,
number of data sources, infeasibility of schema integration) and the kind of solution that is being
sought. In this survey paper, we analyze the problem of information integration as it pertains to
extracting and combining heterogeneous data from autonomous Web sources in response to user
queries spanning across several domains.
    The rest of the survey is organized as follows – In Section 2, we elaborate on the challenges
encountered in heterogeneous information integration. Section 3 explains the dimensions along
which existing information integration systems are measured and differentiated. Section 4 elab-
orates some of the salient approaches embraced by the research community for addressing these
challenges. In Section 5, we analyze the existing frameworks, mechanisms and techniques that have
been proposed to solve some of the intricate problems that are encountered in data integration.
Section 6 elucidates our approach in the context of InfoMosaic, a framework proposed for web-based
multi-domain information integration. Section 7 concludes the survey.

2       Challenges in Heterogeneous Information Integration
The challenge posed by information integration is further exemplified by the following query exam-
ple posed by Lesk [5]: “Find kitchen furniture and a place to buy it which is within walking distance
of a metro stop in the Washington DC area”. Again all the individual pieces of information for
answering this request are available on the web. However, it takes considerable effort to correlate
this diverse data (including spatial data) to arrive at the answers for the above query. Below, we
list additional illustrative queries that indicate the gravity of the problem:

Query 1: Retrieve all castles within 2 hours by train from London.

Query 2: Retrieve flights from Dallas for the VLDB 2007 Conference that have fares under $1000

Query 3: List 3-bedroom houses in Austin within 2 miles of a school and within 5 miles of a highway,
    and priced under $250,000

Query 4: Retrieve movie theaters near Parks Mall in Arlington, Texas showing the Simpsons movie

Query 5: Retrieve details of attorneys in Colorado specializing in immigration and speaking Spanish,
    along with their experience in years

   1
     We realize that the notion of a domain is subjective. In the context of this survey, a domain indicates a collection
of sources providing information of similar interest, such as travel, books, literature, entertainment, etc.

    Although the gist of information integration has not changed and this topic has been investigated
for over two decades, the problem at hand is quite different and far more complex than the one
attempted earlier. Although techniques developed earlier – global schemas, schema integration,
dealing with multiple schemas, domain specific wrappers, and global transactions – have produced
significant steps, they never reached the stage of maturity for deployment and usage. On the other
hand, current meta-search engines (that integrate information from multiple sources of a single
domain) have had more success. The multi-domain problem is more complicated as repositories exist
in various formats (HTML, XML, Web Databases with query-interfaces, etc.), with and without
schemas, and both the content and the structure are changing autonomously. As the number
of repositories/sources increases steadily, there is no other option but to find a solution for
integrating information from different autonomous sources as needed for a search/query whose
(partial) answers have to be retrieved and integrated from multiple domains and sources. As the
enumeration of challenges below indicates, existing techniques from multiple domains need to be
eclectically combined, and new solutions developed, in order to address this problem.

2.1   Capturing Imprecise Intent
One of the primary challenges is to provide a mechanism for the user to express his/her intent
in an intuitive, easy-to-describe form. As elaborated by Query-1, user queries are complex, and
are difficult to express using existing query specification formats. In an ideal scenario, the user
should be able to express the query in a natural language. This is one of the primary reasons
for the popularity of search engines, since there is no query language to learn. However, unlike a
search engine operation (that involves a simple lookup of a word, phrase or expression in existing
document repositories), information integration is a complex process of retrieving and combining
data from different sources for different sub-queries embedded in the given user query. An alternate
option is the use of DBMS-style query languages (e.g., SQL) that allow users to specify a query in
a pre-defined format. However, in sharp contrast to database models (which assume the user knows
what to access and from where to access it), the anonymity of the sources and the complexity of the
query involved in a data integration scenario make it difficult to express the intent using the hard
semantics of these data models.
    Existing frameworks (e.g., Ariadne [6], TSIMMIS [7], and Whirl [8]) extend the database query-
ing models using combinations of templates or menu-based forms to incorporate queries that are
restricted to a single domain (or a set of domains). Other frameworks (such as Havasu [9]) employ
an interface similar to search engines that takes relevant keywords (associated with a concept) from
the user and retrieves information for this particular concept from a range of sources. However,
as the domains for querying established by these systems are fixed (although the sources within
the domain might change), the problem of designing a querying mechanism is simplified to a great
extent. When a more involved query needs to be posed, users may not know how to unambigu-
ously express their needs and may formulate queries that lead to unsatisfactory results. Moreover,
providing a rigid specification format may restrict the user from providing complete information
about his/her intent.
    Additionally, most of these frameworks fail to capture queries that involve a combination of
spatial, temporal, and spatio-temporal conditions. A few systems (e.g., Hermes [10], TerraWorld
[11], etc.) allow a limited set of spatial operations (such as close to, travel time) through their
push-button listing-based interfaces or form-based interfaces. Currently, centralized web-based mapping
interfaces (e.g. Google Maps and Virtual Earth) allow searching and overlaying spatial layers (e.g.,
all hotels and metro stations in current window or a given geo-region) to examine the relationships
among them visually. However, these user interfaces are not expressive enough and restrict users
from specifying their intent in a flexible manner.
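To make the discussion concrete, the imprecise intent behind Query-1 could first be captured in a small structured form, before any precise query is generated. The sketch below is purely illustrative; the `Intent` and `Constraint` classes and their field names are hypothetical and not drawn from any of the surveyed systems:

```python
from dataclasses import dataclass, field

@dataclass
class Constraint:
    """One condition in the user's intent (spatial, temporal, or attribute)."""
    kind: str       # e.g. "spatial", "temporal", "attribute"
    operator: str   # e.g. "within", "under", "near"
    value: str

@dataclass
class Intent:
    """A structured, still-imprecise representation of a user query."""
    target: str                                     # concept the user wants
    anchor: str                                     # reference entity
    constraints: list = field(default_factory=list)

# Query-1: "Retrieve all castles within 2 hours by train from London"
q1 = Intent(target="castle", anchor="London")
q1.constraints.append(Constraint("temporal", "within", "2 hours"))
q1.constraints.append(Constraint("attribute", "mode", "train"))
```

Such a representation keeps the user's vocabulary intact while giving later stages (query mapping, planning) something machine-processable to work with.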

2.2   Mapping Imprecise Intent into Precise Queries
The next challenge is to transform the user intent into an appropriate query format that can
be represented using a variant of relational algebra (or similar established mechanisms). Since the
queries in the context of information integration are complex and involve a myriad set of conditions,
it is obvious that applying the existing formalisms of relational algebra may not be sufficient.
     Over the past decade, several querying languages that extend the basics of relational algebra
and allow access to structured data (SQL, OOQL [12], Whirl [8], etc.), semi-structured data (SemQL
[13], CARIN [14], StruQL [15], etc.) and vague (or unstructured) data (VAGUE [16]) have been
designed. These languages have, with limited success, incorporated imprecise user queries posed on
a single-domain (or fixed set of multiple domains). Additionally, several frameworks have deployed
customized models that translate the user query to a query format supported by the internal global
schema (that provides an interface to the underlying sources). Briefly, Havasu’s QPIAD [17] maps
imprecise user queries to a more generic query using a combination of data-mining techniques.
Similarly, Ariadne [6] interprets the user-specified conditions as a sequence of LOOM statements
that are combined to generate a single query. MetaQuerier’s form assistant [18] consists of built-in
type handlers that aid the query translation process with moderate human effort.
     However, existing mechanisms will prove to be insufficient to represent complex intent spanning
several domains. Hence, it becomes necessary to use domain-related taxonomies/ontologies and
source-related semantics to disambiguate as well as generate multiple potential queries from the
user intent. A feedback and learning mechanism may be appropriate to learn user intent from the
combinations of concepts provided based on user feedback. If multiple queries are generated (which
is very much possible on account of the ambiguity of natural language and the volume of concepts
involved in the domains of integration), an ordering mechanism may be useful to obtain valuable
feedback from the user. Once the query is finalized, a canonical representation can be used to
decompose the query into its components for further elaboration.
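One minimal way to realize such a mapping, assuming a hypothetical concept-to-domain taxonomy, is to translate each recognized concept into a precise sub-query and return unrecognized concepts for user feedback rather than silently dropping them. The domain and relation names below are invented for illustration:

```python
# Hypothetical taxonomy mapping user-level concepts onto domains and
# relations known to the integration system (names are illustrative).
TAXONOMY = {
    "castle": {"domain": "tourism",   "relation": "attractions"},
    "train":  {"domain": "transport", "relation": "train_schedules"},
}

def to_subqueries(concepts):
    """Translate each recognized concept into a precise sub-query; concepts
    missing from the taxonomy are collected for user feedback."""
    subqueries, unresolved = [], []
    for concept in concepts:
        entry = TAXONOMY.get(concept)
        if entry is None:
            unresolved.append(concept)
        else:
            subqueries.append(f"SELECT * FROM {entry['domain']}.{entry['relation']}")
    return subqueries, unresolved

subqs, pending = to_subqueries(["castle", "train", "metro"])
```

The `pending` list is exactly where the feedback-and-learning loop discussed above would intervene, asking the user to disambiguate concepts the taxonomy does not cover.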

2.3   Discovery of Domain and Source Semantics
As elucidated by Query-1, user queries inherently consist of multiple sub-queries posed on distinct
domains (or concepts). Gathering appropriate knowledge about the domains and the corresponding
sources within these domains is vital to the success of heterogeneous integration of information. In
order to relate various parts of a user query to appropriate domains (or concepts), the meaning of
information that is interchanged across the system has to be understood.
    Over the past decade, several customized techniques have been adopted by different frameworks
that focus on capturing meta-data about concepts and sources to facilitate easy mapping of
queries over the global schema and/or the underlying sources. Havasu’s attribute-valued hierarchies
[9] maintain a classification of the attributes of the data sources over which the user queries are
formed. Ariadne uses an independent domain model [6] for each application, that integrates the
information from the underlying sources and provides a single terminology for querying. This
model is represented using the LOOM knowledge representation system [19]. TSIMMIS adopts
an Object Exchange Model (OEM) [7], a self-describing (tagged) object model, in which objects
are identified by labels, types, values, and an optional identifier. Information Manifold’s CARIN
[14] proposes a method for representing local-source completeness and an algorithm for exploiting
source information in query processing. This is an important feature for integration systems, since,
in most scenarios, data sources may be incomplete for the domain they are covering. Furthermore,
it suggests the use of probabilistic reasoning for the ordering of data sources that appear relevant
to answer a given query. InfoMaster’s knowledge base [20] is responsible for the storage of all the
rules and constraints required to describe heterogeneous data sources and their relationships with
each other. In Tukwila, the metadata obtained from several sources is stored in a single data source
catalog [21], which holds different types of information about the data sources, such as semantic
descriptions of the contents of the data sources, overlap information about pairs of data sources,
and key statistics about the data, such as the cost of accessing each source, the sizes of the relations
in the sources, and selectivity information. Additionally, the use of ontologies for modeling implicit
and hidden knowledge has been considered as a possible technique to overcome the problem of
semantic heterogeneity by a number of frameworks such as KRAFT [22], SIMS [23], OntoBroker
[24], etc.
    The proliferation of data on the Internet has ensured that within each domain, there exists
a vast number of sources providing adequate yet similar information. For instance, portals such as
Expedia, Travelocity, Orbitz, etc. provide information for the domain of air-travel. Similarly, sources
such as Google Scholar, DBLP, CiteSeer, etc. generate adequate and similar results for the domain
of publications and literature. Thus, the next logical challenge is to automate the current manual
process of identifying appropriate sources associated with individual domains. Semantic discovery
of sources, which involves a combination of web crawling, interface extraction, source clustering,
semantic matching, and source classification, has been extensively researched by the Semantic Web
community [25]. Currently, a significant and increasing amount of information obtained from the
web is hidden behind the query interfaces of searchable databases. The potential of integrating
data from such hidden data sources [26] is enormous. The MetaQuerier project [27] addresses
the challenges for integrating these deep-web sources such as – discovering and integrating sources
automatically, finding an appropriate mechanism for mapping independent user-queries to source-
specific sub-queries, and developing mass collaboration techniques for the management, description
and rating of such sources.
    An ideal archetype would be to design a global taxonomy (that models all the heterogeneous
domains across which user queries might be posed), and a domain taxonomy (that models all
the sources belonging to the domain and orders them based on distinct criteria specified by the
integration system). The construction of such a multi-level ontology requires extensive effort in the
areas of domain knowledge aggregation, deep-web exploration, and statistics collection. However,
the earlier work on databases (use of equivalences and statistics in centralized databases, use of
source schemas for obtaining a global schema) and recent work on information integration (as
elaborated earlier) provide adequate reasons to believe that this can be extended to multi-domain
queries and computations that include spatial and temporal constraints, which is being addressed
in our InfoMosaic framework.
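A fragment of such a per-domain source catalog, in the spirit of Tukwila's data source catalog, might keep access-cost and coverage statistics for each source and order the sources within a domain accordingly. The portals below are named in the text, but the cost and coverage figures are invented for illustration:

```python
# A toy source catalog: real portal names, hypothetical statistics.
CATALOG = {
    "air-travel": [
        {"source": "Expedia",     "cost": 1.2, "coverage": 0.8},
        {"source": "Travelocity", "cost": 0.9, "coverage": 0.7},
        {"source": "Orbitz",      "cost": 1.5, "coverage": 0.9},
    ],
}

def rank_sources(domain, catalog=CATALOG):
    """Order a domain's sources by coverage gained per unit of access cost."""
    return sorted(catalog.get(domain, []),
                  key=lambda s: s["coverage"] / s["cost"],
                  reverse=True)

ordered = [s["source"] for s in rank_sources("air-travel")]
```

The ratio used here is only one possible ordering criterion; an actual system would weigh whatever statistics (freshness, reliability, overlap) its catalog records.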

2.4   Query Planning and Optimization
Plan generation and optimization in an information integration environment differs from tradi-
tional database query processing in several aspects – i) volume of sources to be integrated is much
larger than in a normal database environment, ii) heterogeneity between the data (legacy database
systems, web-sites, web-services, hidden web-data, etc.) makes it difficult to maintain the same
processing capability as found in a typical database system (e.g., the ability to perform joins), iii)
the query planner and optimizer in information integration has little information about the data
since it resides in remote autonomous sources, and iv) unlike relational databases, there can be
several restrictions on how an autonomous source can be accessed.
    Current frameworks have devised several novel approaches for generating effective plans in the
context of data integration. Havasu’s StatMiner (in association with the Multi-R Optimizer ) [9]
provides a guarantee on the cost and coverage of the results generated on a query by approximating
appropriate source statistics. Ariadne’s Theseus [6] pre-compiles part of the integration model and
uses a local search method for generating query plans across a large number of sources. Information
Manifold’s query-answering approach [14] translates user queries, posed on the mediated schema of
data sources, into a format that maps to the actual relations within the data sources. This approach
differs from the one adopted by Ariadne, and ensures that only the relevant set of data sources
are accessed when answering a particular user query. In Tukwila, if the query planner concludes
that it does not have enough meta-data with which to reliably compare candidate query execution
plans, it chooses to send only a partial plan to the execution engine, and takes further action only
after the partial plan has been completed.
    However, since for these frameworks, the domains involved in the user query are pre-determined,
generalizing and applying these techniques to autonomous heterogeneous sources is not possible.
This is particularly true for techniques that generate their plans based on the type of modeling
applied for the underlying data sources. Furthermore, current optimization strategies [9] focus on a
restricted set of metrics (such as cost, coverage and overlap of sources) for optimization. Additional
metrics such as – volume of data retrieved from each source, number of calls made to and amount
of data sent to each source, quantity of data processed, and the number of integration queries
executed – are currently not considered. It is important to understand that in this problem space,
exact values of some of these measures may not be available and the information available about
the ability of the sources and their characteristics may determine how these measures can be used.
Thus, effective plan generation and evaluation is significantly more complex than in a traditional
system and needs to be investigated thoroughly.
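A toy illustration of why source ordering matters, under an assumed dependent-join cost model with hypothetical per-source statistics (neither the model nor the numbers come from the surveyed systems):

```python
from itertools import permutations

# Hypothetical per-source statistics available to the planner.
STATS = {
    "castles": {"access_cost": 2.0, "est_rows": 500},
    "trains":  {"access_cost": 1.0, "est_rows": 200},
}

def plan_cost(order, stats=STATS):
    """Cost of a left-deep plan in which each later source is probed once per
    row retrieved so far (a deliberately crude dependent-join model)."""
    cost, rows = 0.0, 1
    for src in order:
        cost += rows * stats[src]["access_cost"]
        rows *= stats[src]["est_rows"]
    return cost

def best_plan(sources):
    """Exhaustively enumerate source orderings and keep the cheapest."""
    return min(permutations(sources), key=plan_cost)

plan = best_plan(["castles", "trains"])
```

Even in this two-source toy, probing the smaller source first is markedly cheaper; with dozens of autonomous sources and only estimated statistics, the exhaustive enumeration used here becomes infeasible and the estimates themselves unreliable, which is exactly the difficulty described above.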

2.5   Data Extraction
Typically, in schema-based systems (e.g., RDBMS), the description of data (or meta-data) is
available, the query-language syntax is known, and the type and format of results are well-defined;
hence, results can be retrieved programmatically (e.g., via an ODBC/JDBC connection to a database). However,
in the case of web repositories, although a page can be retrieved based on a URL (or filling forms in
the case of hidden web), or through a standard or non-standard web-service, the output structure of
data is neither pre-determined nor remains the same over extended periods of time. The extracted
information needs to be parsed as HTML or XML data types (using the meta-data of the page)
and interpreted.
    Currently, wrappers [6] are typically employed by most frameworks for the extraction of het-
erogeneous data. However, as the number of data sources on the web and the diversity in their
representation format continues to grow at a rapid rate, manual construction of wrappers proves
to be an expensive task. There is a pressing need for automation tools that can design,
develop, and maintain wrappers effectively. Even though a number of integration systems have
focused on automated wrapper generation (Ariadne’s Stalker [28], MetaQuerier [27], TSIMMIS [29],
InfoMaster [30], and Tukwila [21]), since the domains (and the corresponding sources) embedded
within these systems are known and predefined, the task of generating automated wrappers using
mining and learning techniques is simplified to a large extent. There also exist several independent
tools based on solid formal foundations that focus on low-level data extraction from autonomous
sources, such as Lixto [31], Stalker [28], etc. In the case of spatial data integration (e.g., the eMerges
system [32]), ontologies and semantic web-services are defined for integrating spatial objects, in
addition to wrappers and mediators. Heracles [33] (part of TerraWorld and derived from the con-
cepts in Ariadne) combines online and geo-spatial data in a single integrated framework for assisting
travel arrangement and integrating world events in a common interface. A Storage Resource Broker
was proposed in the LTER spatial data workbench [34] to organize data and services for handling
distributed datasets.
    Information Manifold [14] claimed that the problem of wrapping semi-structured sources would
become irrelevant, as XML would eliminate the need for wrapper construction tools. This is an
optimistic yet unrealistic assumption, since some problems in querying semi-structured data will
not disappear, for several reasons: 1) some data applications may not want to actively share their
data with anyone who can access their web-page, 2) legacy web applications will continue to exist
for many years to come, and 3) within individual domains, XML will greatly simplify the access to
sources; however, across diverse domains, it is highly unlikely that an agreement on the granularity
for modeling the information will be established.
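As a small illustration of why manually built wrappers are brittle, the following sketch ties an extraction rule to one hypothetical source's HTML template; the page layout and field names are invented, and any change to that template silently breaks the wrapper:

```python
import re

# An extraction rule tied to one hypothetical source's HTML template.
CASTLE_ROW = re.compile(
    r"<tr>\s*<td>(?P<name>[^<]+)</td>\s*<td>(?P<town>[^<]+)</td>\s*</tr>")

def extract_castles(html):
    """Turn the source-specific HTML fragment into uniform records."""
    return [m.groupdict() for m in CASTLE_ROW.finditer(html)]

page = ("<table><tr><td>Windsor Castle</td><td>Windsor</td></tr>"
        "<tr><td>Leeds Castle</td><td>Maidstone</td></tr></table>")
records = extract_castles(page)
```

Automated wrapper-induction tools such as Stalker essentially learn rules of this shape from labeled pages, rather than having a developer write and maintain them by hand for every source.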

2.6   Data Integration
The most important challenge in the entire integration process involves fusion of the data extracted
from multiple repositories. Since most of the existing frameworks are designed for a single domain
or a set of predetermined domains, the integration task is generalized such that the data gener-
ated by different sources only needs to be “appended” and represented in a homogeneous format.
Frameworks, such as Havasu, support the “one-query on multiple-sources in single-domain” format
in which, the data fetched from multiple sources is checked for overlap, appended, and displayed
in a homogeneous format to the user. Others, such as Ariadne, support the “multiple sub-queries
on multiple-sources in separate-domains” format which is an extension to the above format, such
that the task of checking data overlap is done at the sub-query level. The non-overlapping results
from each sub-query are then appended and displayed.
    However, the problem of integration becomes more acute when the sub-queries, although be-
longing to distinct domains, are dependent on each other for generating a final result-set. For
instance, in Query-1, although it is possible to extract data independently for “castles near Lon-
don”, and “train-schedules to destinations within 2 hours from London”, the final result-set that
requires generating “castles that are near London and yet reachable in 2 hours by train” cannot be
obtained by simply appending the results of the two sub-queries. For this (and similar complex)
query, it becomes necessary to perform additional processing on the extracted data based on the
sub-query dependencies, before it can be integrated and displayed.
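The required processing for Query-1 can be sketched as a join on a shared station attribute followed by the temporal condition. The sub-query results below are hypothetical; the travel times and station associations are assumptions for illustration only:

```python
# Hypothetical results of the two independent sub-queries of Query-1.
castles = [
    {"castle": "Windsor Castle",   "station": "Windsor & Eton Central"},
    {"castle": "Dover Castle",     "station": "Dover Priory"},
    {"castle": "Edinburgh Castle", "station": "Edinburgh Waverley"},
]
trains = {  # assumed minutes by train from London to each station
    "Windsor & Eton Central": 55,
    "Dover Priory": 65,
    "Edinburgh Waverley": 270,
}

def integrate(castles, trains, max_minutes=120):
    """Join the sub-query results on station and apply the temporal condition;
    merely appending the two result sets would not answer the query."""
    return [c["castle"] for c in castles
            if trains.get(c["station"], float("inf")) <= max_minutes]

answer = integrate(castles, trains)
```

The join condition (which station serves which castle) is itself information that must be discovered or inferred, which is what distinguishes this dependent, cross-domain case from the append-style integration the earlier frameworks support.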

2.7   Other Challenges
In addition to the above challenges, there exist a number of issues that will prove to be significant
as integration frameworks move from prototype designs to large-scale commercial systems.

Ranking Integrated Results: Users should be able to access available information; however,
    this information should be presented in a structured and easy-to-digest format. Returning
    hundreds and thousands of information snippets will not help the user to make sense of the
    information. An interesting option would be to apply a rank on the final integrated results
    and provide only a percentage (top-k) of the total answers generated [35].
      However, unlike the domains of information retrieval [36] or even databases [37], the compu-
      tation of ranking in information integration is more complex due to – autonomous nature of
     sources, lack of information about the quality of information from a source, lack of informa-
     tion about the amount of information (equivalent of cardinality) for a query on the source,
     and lack of support for retrieving results in some order or based on some metrics. To the best
     of our knowledge, ranking has not been addressed explicitly in any of the major projects on
     information integration.
Decentralized Data Sharing: Current data integration systems employ a centralized mediation
    approach [38] for answering user queries that access multiple sources. A centralized schema
    accepts user queries and reformulates them over the schema of different sources. However, the
    design, construction and maintenance of such a mediated schema is often hard to agree upon.
    For instance, data sources providing castle information and train schedules are independent,
    belong to separate domains and are governed by separate companies. To expect these data
    sources to be under the control of a single mediator is an unrealistic assumption.
Naming Inconsistencies: Entities (such as places, countries, companies, ...) are always consistent
    within a single data source. However, across heterogeneous sources, the same entity might be
    referred to by different names and in different contexts. To make sense of the data that spans
    multiple sources, an integration system must be able to recognize and resolve these differences.
    For instance, in a query requiring access to sources providing air-travel
   information, one source may list Departure City and Arrival City as the two input locations
   for querying. However, another source might use From and To as its querying input locations.
    Even though these inputs indicate the same concept in the domain of travel, resolving this
    inconsistency in an integration environment is a difficult task.
Security and Privacy: Existing information integration systems extracting data from autonomous
    sources assume that the information in each source can be retrieved and shared without any
    security restrictions [39]. However, there is an increasing need for sharing information across
    autonomous entities in a manner that no data apart from the answer to the query is revealed.
    There exist several intricate challenges in specifying and implementing processes for ensuring
    security and privacy measures before data from diverse sources can be integrated.
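The attribute-naming inconsistency described above can be sketched with a hypothetical synonym table that maps source-specific attribute names onto a canonical vocabulary (the table entries and canonical names are assumptions, not drawn from any surveyed system):

```python
# Hypothetical synonym table for the air-travel domain.
SYNONYMS = {
    "departure city": "origin",      "from": "origin",
    "arrival city":   "destination", "to":   "destination",
}

def normalize(record):
    """Rewrite one source's attribute names into the canonical vocabulary,
    passing unrecognized attributes through in lower case."""
    return {SYNONYMS.get(k.lower(), k.lower()): v for k, v in record.items()}

a = normalize({"Departure City": "Dallas", "Arrival City": "Vienna"})
b = normalize({"From": "Dallas", "To": "Vienna"})
```

After normalization the two sources' records become directly comparable; the hard part in practice is building and maintaining such mappings automatically across many autonomous sources, where simple dictionaries give way to schema-matching and learning techniques.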

3    Dimensions for Integration
Existing information integration systems (elaborated in Section 5) tend to be designed along several
dimensions such as:
    Goal of Integration: Indicates the overall goal of the integration framework. Some systems
are portal-oriented (e.g., Whirl [8], Ariadne [40] [41], InfoMaster [42]) in that they aim to support
an integrated browsing experience for the user. Others are more ambitious in that they take user
queries and return results of running those queries on the diverse yet appropriate sources (e.g.,
Havasu [9] [43]).
    Data Representation: Refers to the design assumptions made by the integration system with
regard to the syntactic nature of the data being exported by the sources. Some systems assume the
existence of structured data models (e.g., SIMS [44], Havasu [9]). However, since most integration
frameworks perform a web-based fusion of data, they assume the co-existence of semi-structured
and unstructured data (e.g. Ariadne [40], TSIMMIS [45]).
    Source Structure: Illustrates the assumptions made on the inter-relationship between sources.
Most systems assume that sources they are integrating are complementary (horizontal integration)
in that they export different parts of the schema (e.g. Ariadne [41]). Others consider the possibility
that sources may be overlapping (vertical integration) (e.g. Havasu [46], Tukwila [47]) in which
case aggregation of information is required, as opposed to pure integration of information.
    Domain and Source Dynamics: Refers to the extent to which the user has control over
and/or is required to specify the particular sources that are needed to be used in answering the
query. Some systems (e.g. Ariadne [33], BioKleisli [48]) require the user to select the appropriate
sources to be used. Others (e.g. TSIMMIS [7], Havasu [43]) circumvent this problem by hard-
wiring specific parts of the integrated schema to specific sources.
    User Expertise: Indicates the type of users that the system is directed towards. The systems
that primarily support browsing need to assume very rudimentary expertise on the part of users.
In contrast, systems that support user-queries need to assume some level of expertise on the user’s
part in formulating queries. Some systems might require user queries to be formulated in specific
languages, while others might provide significant interactive support for users in formulating their queries.

4    Approaches for Integration
Over the past two decades, various approaches have been suggested and adopted in the pursuit of
achieving an ideal information integration system. In this section, we provide a brief description of
the prominent approaches that form the basis for the design of existing integration frameworks:

    Mediator: It is one of the most notable approaches adopted by many integration frameworks
(Ariadne [6], TSIMMIS [7], Havasu [9], ...). A mediator (in the information integration context) is a
system responsible for reformulating user queries (formed on a single mediated schema) into queries
on the local schema of the underlying data sources. The sources contain the actual data, while the
global schema provides a reconciled, integrated, and virtual view of the underlying sources. Mod-
eling the relation between the sources and the global schema is therefore a crucial aspect. The two
distinct approaches for establishing the mapping between each source schema and the centralized
global schema are: i) Global-as-view (GAV), that requires the global schema to be represented in
terms of the underlying data sources, and ii) Local-as-view (LAV), that requires the global schema
to be defined independently from the sources, and the relationships between them are established
by defining every source as a view over the global schema.
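The GAV/LAV distinction can be illustrated with a toy example. Under GAV, each global relation is defined directly as a query over the sources, so a query on the global schema is answered by simply unfolding that definition; the relation names, attributes, and tuples below are hypothetical:

```python
# Toy illustration of Global-as-View (GAV): the global relation
# Flight(origin, destination) is defined as a union of queries over two
# source relations, so global queries are answered by unfolding.
# (Under LAV the mapping runs the other way: each source is described as
# a view over the global schema, and answering a query requires
# view-based rewriting.) All names and tuples are hypothetical.

# Source relations (local schemas)
source1_flights = [("DFW", "JFK"), ("DFW", "LAX")]   # (Departure, Arrival)
source2_routes  = [("ORD", "DFW")]                   # (From, To)

# GAV definition: Flight := projection of source1 UNION projection of source2
def global_flight():
    unfolded = [{"origin": d, "destination": a} for d, a in source1_flights]
    unfolded += [{"origin": f, "destination": t} for f, t in source2_routes]
    return unfolded

# A query on the global schema: all flights departing from DFW
answers = [t for t in global_flight() if t["origin"] == "DFW"]
print(answers)
```

The trade-off visible even here: GAV makes query answering easy (unfolding) but adding a new source means rewriting the global definitions, whereas LAV makes sources easy to add at the cost of harder query answering.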

    Warehousing: This approach [15] derives its basis from traditional data warehousing tech-
niques. Data from heterogeneous distributed information sources is gathered, mapped to a common
structure and stored in a centralized location. Warehousing emphasizes data translation, as opposed
to query translation in mediator-based integration [38]. In fact, warehousing requires that all the
data loaded from the sources be converted through data mapping to a standard unique format
before it is stored locally. In order to ensure that the information in the warehouse reflects the
current contents of the individual sources, it is necessary to periodically update the warehouse.
In the case of large information repositories, this is not feasible unless the individual information
sources support mechanisms for detecting and retrieving changes in their contents. This is an
inordinate expectation in the case of autonomous information sources spread across a number of
heterogeneous domains.
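The data-translation emphasis of warehousing can be sketched as follows; the source formats, field names, and records are invented for illustration, and a real warehouse would refresh incrementally rather than rebuild from scratch:

```python
# Sketch of the warehousing approach: records from heterogeneous sources
# are translated into one standard format and stored centrally; a refresh
# step must be re-run periodically to keep the warehouse consistent with
# the current contents of the sources. Formats and fields are invented.

def translate(record, source_format):
    """Map a source-specific record to the warehouse's standard schema."""
    if source_format == "csv_like":
        name, price = record.split(",")
        return {"name": name, "price": float(price)}
    if source_format == "dict_like":
        return {"name": record["item"], "price": float(record["cost"])}
    raise ValueError("unknown source format")

warehouse = []

def refresh(sources):
    """Rebuild the warehouse from the current contents of every source."""
    warehouse.clear()
    for fmt, records in sources:
        warehouse.extend(translate(r, fmt) for r in records)

refresh([("csv_like", ["widget,9.99"]),
         ("dict_like", [{"item": "gadget", "cost": "19.50"}])])
print(warehouse)
```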

    Ontological: In the last decade, semantics (which are an important component for data in-
tegration) gained popularity leading to the inception of the celebrated ontology-based integration
approach [49]. The Semantic Web research community [50], [51], [52], [53] has focused extensively
on the problem of semantic integration [54] and the use of ontologies for blending heterogeneous
schemas across multiple domains. Their pioneering efforts have provided a new dimension for re-
searchers to investigate the challenges in information integration. A number of frameworks designed
using ontology-based integration approaches [49] have evolved in the past few years.

    Federated: It [55] is developed on the premise that information needed to answer a query
is gathered directly from the data sources in response to the posted query. Hence, the results
are up-to-date with respect to the contents of the data sources at the time the query is posted.
More importantly, the database federation approach [56] lends itself to be more readily adapted to
applications that require users to be able to impose their own ontologies on data from distributed
autonomous information sources. The federated approach is preferred in scenarios when the data
sources are autonomous (e.g., Indus [55]), and support for multiple ontologies is needed. However,
this approach fails in situations where the querying frequency is much higher than the frequency
of changes to the underlying sources.

    Navigational: Also known as the link-based approach [57], it is based on the fact that an increasing
number of sources on the web require users to manually browse through several web-pages and data
sources in order to obtain the desired information. In fact, the major premise and motive justifying
this approach is that some sources provide the users with pages that would be difficult to access
without point-and-click navigation (e.g., hidden-web [26]).

5    Current Integration Techniques and Frameworks
Currently, a number of techniques and frameworks have tried to address several challenges in the
problem of heterogeneous data integration in a delimited context. In this section, we elaborate
some of the prominent frameworks.

    Havasu: A multi-objective query processing framework comprising multiple functional modules,
Havasu [9] addresses the challenges of imprecise-query specification [58], query optimization
[59], and source-statistics collection [60] in single-domain web integration. The AIMQ module
provides a query-independent solution for efficiently handling imprecise user queries using a combination
of source schema collection, attribute dependency mining [61], and source similarity mining
[46]. Unlike traditional data integration systems that aim towards minimizing the cost of query
processing, Havasu’s StatMiner [62] provides a guarantee on the cost as well as the coverage of
the results generated on a given user query, by approximating the coverage and overlap statistics
of the data sources based on attribute-valued hierarchies. The Multi-R optimizer [9] module uses
these source statistics for generating a multi-objective (cost and coverage) query optimization plan.
The Havasu framework has been applied for the design of BibFinder [43], a publicly available
computer science literature retriever, that integrates several autonomous and partially overlapping
bibliography sources.

    MetaQuerier: It is made up of two distinct components that address the challenges in ex-
ploration and integration of deep-web sources [26]. In contrast to the traditional approaches (e.g.,
Wise-Integrator [63]) that aim at integrating web data sources based on the assumption
that query-interfaces can be extracted perfectly, MetaQuerier tries to perform source integration
by extracting query interfaces from raw HTML pages with subsequent schema matching. Hence, in
essence, MetaQuerier strives to achieve data mining for information integration [64] i.e., it mines
the observable information to discover the underlying semantics from web data sources. The Meta-
Explorer [27] is responsible for dynamic source discovery [65] and on-the-fly integration [66] for the
discovery, modeling, and structuring of web databases, to build a searchable source repository. On
the other hand, the MetaIntegrator [67] focuses on the issues of on-line source integration such as
source selection, query mediation, and schema integration. In contrast to traditional integration
systems, MetaIntegrator is dynamic i.e., new sources may be added as and when they are discovered.

    Ariadne: It extended the information integration approach adopted by the SIMS mediator
architecture [68] to include semi-structured and unstructured data sources (e.g., web data) instead
of simple databases, by using specially designed wrappers [28]. These wrappers, built around
individual web sources, allowed querying for data in a database-like manner (for example, using
SQL). These wrappers were generated semi-automatically using a machine learning approach [69].
Ariadne also constructed an independent domain model [70] for each application, which integrated
the information from the sources and provided a single terminology for querying. This model was
represented using the LOOM knowledge representation system [19]. The SIMS query planner did
not scale effectively when the number of sources increased beyond a certain threshold. Ariadne
solved this problem by embracing an approach [71] that is capable of efficiently constructing large
query plans, by pre-compiling part of the integration model and using a local search method for
query planning.
    Since its inception, Ariadne has been divided into a number of individual projects such as –
Apollo [72], Prometheus [73], and Mercury [74] for addressing separate issues and challenges in
heterogeneous information integration. The early applications of Ariadne included a Country Infor-
mation Agent [40], that integrated data related to countries from a variety of information sources.
Ariadne was also used to build TheaterLoc [41], a system that integrated data from restaurants
and movie theaters, and placed the information on a map for providing efficient access to naive
users. Following this division, three individual and diverse application projects have been developed
under the Ariadne project – Heracles [33] (an interactive data-driven constraint-based integration
system), TerraWorld [11] (a geospatial data integration system), and Poseidon [75] (which focuses
on the composition, optimization, and execution of query plans for bioinformatics web-services).

    Information Manifold: It focused on efficient query processing [14] by accessing sources that
are capable of providing an appropriate answer to the queries posed by users. In order to facili-
tate efficient user query formulation, as well as to represent background knowledge of the mediated
schema relations (designed over several autonomous sources), Information Manifold proposed an ex-
pressive language, CARIN [76] modeled using a combination of Datalog (database query language)
[77] and Description Logics (knowledge representation language) [78]. Information Manifold also
proposed algorithms [79] for translating user queries, posed on the mediated schema of data sources,
into a format that maps to the actual relations within the data sources. The query-answering ap-
proach adopted by Manifold ensured that only the relevant set of data sources are accessed when
answering a particular user query. This approach differed from the one adopted by Ariadne, that
uses a general purpose planner for query translation. In addition, Information Manifold proposed a
method [80] for representing local-source completeness and an algorithm [81] for exploiting source
information in query processing. This is an important feature for integration systems, since, in most
scenarios, data sources may be incomplete for the domain they are covering. Furthermore, Informa-
tion Manifold suggested the use of probabilistic reasoning [82] for the ordering of data sources that
appear relevant to answer a given query. Such an ordering is dependent on the overlap between
the sources and the query, and on the coverage of the sources.

    TSIMMIS: It addressed the challenges in integration of heterogeneous data extracted from
structured as well as unstructured sources [7]. TSIMMIS adopted a schema-less approach (i.e., no
single global database or schema contained all information needed for integration) for retrieving in-
formation from dynamic sources (i.e., when source contents changed frequently). Each information
source was covered by a wrapper that logically converted the underlying data objects to a common
information model. This logical translation was done by converting queries into requests based on
the information in the model that the source could execute, and converting the data returned by
the source into the common model. TSIMMIS adopted an Object Exchange Model (OEM) [83], a
self-describing (tagged) object model, in which objects were identified by labels, types, values, and
an optional identifier. These objects were requested with the aid of OEM-QL [84], a query language
specifically designed for OEM. The mediators [83] were software modules that refined information
from one or more sources [85], and accepted OEM-QL queries as inputs to generate OEM objects.
This approach allowed the mediators to access new sources transparently in order to process and
refine relevant information efficiently. End users accessed information either by writing applications
that requested OEM objects, or by using generic browsing tools supported by TSIMMIS, such as
MOBIE (MOsaic Based Information Explorer) [7]. The query was expressed as an interactive world
wide web page or in a menu-selected format. The answer was received as a hypertext document.
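The self-describing nature of OEM objects can be sketched with plain data structures. The structure below follows the description in the text (label, type, value, optional identifier, with "set"-typed values nesting further objects); the field names and the example restaurant object are our own simplification:

```python
# Sketch of TSIMMIS-style self-describing (tagged) OEM objects: each
# object carries its own label, type, and value (plus an optional
# identifier), and values of type "set" nest further objects, so no
# global schema is required. The example data is invented.

def oem(label, type_, value, oid=None):
    return {"label": label, "type": type_, "value": value, "oid": oid}

restaurant = oem("restaurant", "set", [
    oem("name", "string", "Casa Blanca"),
    oem("city", "string", "Arlington"),
], oid="r1")

def find(obj, label):
    """Return the sub-objects of a 'set' object matching a label."""
    if obj["type"] != "set":
        return []
    return [sub for sub in obj["value"] if sub["label"] == label]

print([o["value"] for o in find(restaurant, "city")])
```

Because every object is tagged with its own label and type, a mediator can process objects from a source whose structure it has never seen, which is exactly what the schema-less TSIMMIS approach relies on.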
    Garlic [86], a sister project of TSIMMIS, was developed at IBM to enable large-scale integra-
tion of multimedia information. It was applied in the late 1990s for multimedia data integration in
the fields of Medicine, Home Applications, and Business Agencies.

    InfoMaster: It [42] was a framework that provided integrated access to structured information
sources. InfoMaster’s architecture consisted of a query facilitator, knowledge base, and translation
rules. The facilitator analyzed the sources containing relevant information needed to answer user
queries, generated query plans to access these sources, and mapped these query plans to the sources
using a self-generated wrapper [87]. The knowledge base was responsible for the storage of all the
rules and constraints [20] required to describe heterogeneous data sources and their relationships
with each other. The translation rules described how each distinct source could relate to the refer-
ence schema. InfoMaster was developed and deployed at Stanford University for searching housing
rentals in the San Francisco Bay Area and for scheduling rooms at Stanford’s Housing Department
[88]. It was also the basis for Stanford Information Network (SIN) project [89] that integrated
numerous structured information sources on the Stanford campus. InfoMaster was commercialized
by Epistemics in 1997.

    Tukwila: An adaptive integration framework, Tukwila [47], tackles the challenges encountered
during the integration of autonomous sources on the web, namely source statistics, data arrival,
and source overlap and redundancy. The major contributions of the Tukwila framework are –
interleaved query planning and execution, adaptive operators, and a data source catalog. Query
processing in Tukwila does not require the creation of a complete query execution plan before the
query evaluation step [21]. If the query optimizer concludes that it does not have enough meta-
data with which to reliably compare candidate query execution plans, it chooses to send only a
partial plan to the execution engine, and takes further action only after the partial plan has been
completed. Tukwila incorporates operators that are well suited for adaptive execution and for
minimizing the time required to obtain the first answer to a query. In addition, the Tukwila execution
engine includes a collector operator [21] that efficiently integrates data from a large set of possibly
overlapping or redundant sources. The metadata obtained from several sources is stored in a single
data source catalog. The metadata holds different types of information about the data sources, such
as – semantic description of the contents of the data sources, overlap information about pairs of
data sources, and key statistics about the data, such as the cost of accessing each source, the sizes
of the relations in the sources, and selectivity information.
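The behavior of a collector-style operator over overlapping sources can be sketched as a streaming union with duplicate suppression; the sources and keying scheme here are illustrative, not Tukwila's actual implementation:

```python
# Sketch of a Tukwila-style collector operator: union tuples arriving
# from several possibly overlapping or redundant sources, suppressing
# duplicates as they stream in. Sources and keys are illustrative.

def collector(*sources, key=lambda t: t):
    """Yield tuples from all sources, discarding duplicates by key."""
    seen = set()
    for source in sources:
        for tup in source:
            k = key(tup)
            if k not in seen:
                seen.add(k)
                yield tup

src_a = [("smith", 2005), ("jones", 2007)]
src_b = [("jones", 2007), ("brown", 2006)]   # overlaps with src_a

print(list(collector(src_a, src_b)))
```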

   Whirl: The primary contribution of Whirl has been the Whirl Query Language [8]. This
language draws on concepts from Database Systems (its syntax is based on soft semantics [90] and
closely resembles SQL), Information Retrieval (it uses an inverted-indexing technique for generating
top-k answers), and Artificial Intelligence (it employs A* search [91]), and has been used for
posing single-domain queries across structured and semi-structured static sources. The Whirl
framework was applied to the design of two applications: i) Children’s Computer
Games, which integrated information from 16 similar web-sites, and ii) North American Birds, which
contained a set of integrated databases with a large number of tuples, the majority of which pointed
to external web-pages.

    Ontology-based Integration Systems: Semantics play an important role during integration
of heterogeneous data. In order to achieve semantic interoperability, the meaning of information
that is interchanged across the system has to be understood. The use of ontologies, for the extrac-
tion of implicit and hidden knowledge, has been considered as a possible technique to overcome
the problem of semantic heterogeneity by a number of frameworks such as KRAFT [22], SIMS
[44], OntoBroker [24], and InfoSleuth [92]. Other ontology-based systems such as PICSEL [93],
OBSERVER [94], and BUSTER [25] did not propose any methods for creation of ontologies.
The literature associated with these systems makes it clear that there is an obvious lack of a mature
methodology for ontology-based development.

    GeoSpatial Integration Systems: Vast amounts of data available on the web contain
spatial information, either explicitly or implicitly. Implicit spatial information includes the location
of an event; for example, many news readers today use the GeoRSS format, an enhancement of the
RSS XML feed format that includes location tags. Explicit spatial objects, such as road networks
and disease distributions over different parts of the earth, are commonly provided by
different organizations in different data formats through different interfaces. To this end, several
systems have been proposed to address the issues of geospatial integration.
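As an example of implicit spatial information, a GeoRSS-style location tag can be read with only the standard library. The feed snippet below is fabricated, and real feeds may use other GeoRSS encodings (GML, lines, polygons) beyond the simple point element shown here:

```python
# Minimal sketch of extracting location tags from a GeoRSS-style feed
# using the standard library. The feed content is fabricated; only the
# simple <georss:point> "lat lon" encoding is handled.
import xml.etree.ElementTree as ET

FEED = """<rss xmlns:georss="http://www.georss.org/georss" version="2.0">
  <channel><item>
    <title>Storm report</title>
    <georss:point>32.73 -97.11</georss:point>
  </item></channel>
</rss>"""

NS = {"georss": "http://www.georss.org/georss"}

def extract_points(feed_xml):
    """Return (title, lat, lon) for every item carrying a georss:point."""
    root = ET.fromstring(feed_xml)
    points = []
    for item in root.iter("item"):
        pt = item.find("georss:point", NS)
        if pt is not None:
            lat, lon = map(float, pt.text.split())
            points.append((item.findtext("title"), lat, lon))
    return points

print(extract_points(FEED))
```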
    Heracles [33] has tried to combine online and geospatial data in a single integrated framework
for assisting travel arrangement and integrating world events into a common space. Spatially
mediated approaches (systems derived using concepts from Ariadne) [95] combine spatial
information using wrappers and mediators. A Storage Resource Broker was used in the LTER spatial
data workbench [34] to organize data and services so as to manage distributed datasets as a single
collection. Query evaluation plan generation is addressed within a spatial mediation context
in [96]. As generic models, XML/GML and web services are also widely used in other spatial
integration systems [97], [98], [99], [100], [101]. The eMerges system [32] further defines ontologies
for spatial objects and uses semantic web services for integration.
    Another line of work on spatial data integration [102] focused on studying the conflation of
raster images with vector data. The process involves the discovery of control point pairs, which are
sets of conjugate point pairs between two datasets. An extension integrating satellite images with
vector maps was investigated in [103]. A framework for combining spatial information, based on a
mediator that takes metadata on the information needs of the user, was proposed in [95]; this
infrastructure uses the view-agent approach as the method of communication and was validated in
mobile data settings. A grid-based architecture that utilizes a service-oriented view to unify various
geographic resources was briefly introduced in [97].

    Biological Integration Systems: In recent years, integration of heterogeneous biological and
genomic data sources [104] has gained immense popularity. A number of prominent systems such as
Indus [55], SRS [105], K2/BioKleisli [48], TAMBIS [106], DiscoveryLink [107], BioHavasu
[104], Entrez [108], and BioSift Radia [104] have been designed for resolving some of the intricate
challenges in integrating assorted biological data.

                                 Figure 1: InfoMosaic Architecture

6     The InfoMosaic Approach
6.1   InfoMosaic Architecture and Dataflow
The architecture of our InfoMosaic framework is shown in Figure 1. The user query is accepted in
an intuitive manner using domain names and keywords of interest, and elaborated/refined using the
domain knowledge in the form of taxonomies and dictionaries (synonyms etc.). Requests are refined
and presented to the user for feedback (Query Refinement module). User feedback is accumulated
and used for elaborating/disambiguating future queries. Once the query is finalized, it is represented
in a canonical form (e.g., query graphs) and transformed into a query plan using a two-phase process:
i) generation of logical plans using domain characteristics, and ii) generation of physical plans using
source semantics. The plan is further optimized by applying several metrics (Query planner and
Optimizer module). The Query Execution and Data Extraction module generates the actual source
queries that are used by the extractor to retrieve the requisite data. It also determines whether a
previously retrieved answer can be reused by checking the data repositories (XML and PostGIS)
for cached data. We have an XML Repository that stores extracted results from each source in
a system-determined format. A separate PostGIS Repository is maintained for storing spatial
data extracted from sources. The Data Integrator formulates XQueries (with external functions
for handling the spatial component) on these repositories to compute the final answers and format
them for the user. Ranking is applied at different stages (sub-query execution phase, extraction
phase, or integration phase) depending on the user-ranking metrics, the selected sources and the
corresponding query plan. The Knowledge-base (broadly consisting of domain knowledge and source
semantics) blends all the pieces together in terms of the information used by various modules. The
adaptive capability of the system is based on the ability of the InfoMosaic components to update
these knowledge-bases at runtime.
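The dataflow above can be sketched as a pipeline. Every function body below is a drastically simplified stand-in for the corresponding InfoMosaic module (query refinement, logical planning, execution/extraction, integration), and all domain, source, and synonym data is invented:

```python
# Highly simplified stand-in for the InfoMosaic dataflow: refine the
# keyword query using domain knowledge, plan one sub-query per matching
# domain, execute against sources, then integrate the results. Every
# function is a placeholder, not the actual implementation.

def refine_query(raw, synonyms):
    """Query Refinement: expand/normalize keywords via domain knowledge."""
    return [synonyms.get(k, k) for k in raw.split()]

def plan(keywords, domains):
    """Logical phase: one sub-query per domain covering a keyword."""
    return [(d, k) for k in keywords for d in domains if k in domains[d]]

def execute(subqueries, sources):
    """Execution/extraction: fetch each sub-query from its source."""
    return [sources[d].get(k) for d, k in subqueries]

def integrate(results):
    """Data Integrator: combine non-empty sub-query results."""
    return [r for r in results if r is not None]

synonyms = {"flights": "airfare"}
domains = {"travel": {"airfare", "hotel"}}
sources = {"travel": {"airfare": "DFW->JFK $210", "hotel": None}}

q = refine_query("cheap flights", synonyms)
answers = integrate(execute(plan(q, domains), sources))
print(answers)
```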

6.2   Challenges Addressed by InfoMosaic
Capturing and Mapping Imprecise Intent into Precise Query: We address the query spec-
    ification challenge in a multi-domain environment by combining and enhancing techniques
    from natural language processing, database query specification and information retrieval to
    incorporate the following characteristics: i) specification of soft semantics instead of hard
    queries, ii) ability to accept minimal specification and refine it to meet user intent and in
    the process collect feedback for future usage, iii) support queries that include spatial, tem-
    poral, spatio-temporal, and cost-based conditions in addition to regular query conditions,
    iv) accepting optional ranking metrics based on user-specified criteria, and v) support query
    approximation and query relaxation for retrieving approximate answers instead of exact answers.
      Moreover, instead of designing a new language that supports all the query conditions, we
      are currently extending the capabilities of SQL to incorporate soft-semantics and conditions
      based on domains rather than sources. In particular, we are trying to enhance the semantics of
      SQL-based spatial query languages for easy specification of spatial relations including metric,
      topological, and directional relationships pertaining to heterogeneous datasets from the web.

Plan Generation and Optimization: We view this challenge as an intelligent query optimiza-
     tion problem involving two stages: logical and physical. In the logical phase, we identify the
     individual domain sub-queries and how they come together as a larger query by using appropriate
     domain knowledge. In the physical phase, various source semantics and characteristics are
     used to generate effective plans for each individual sub-query. In addition, we are also
     investigating query optimization techniques for handling spatial, temporal and spatio-temporal queries.

Data Extraction: We use Lixto [31], a powerful data extraction engine, for programmatically
    extracting portions of an HTML page (based on the need) and converting the result into a specific
    format. It is based on monadic query languages over trees (based on monadic second order
    logic), and automatically generates Elog [31] (a variant of Datalog) programs for data ex-
    traction. For handling extraction of spatial data (that is larger in size and hence difficult
    to extract in a short time), we are planning to use a combination of – i) building a local
     spatial data repository by dynamically downloading related spatial files (using data
     clearinghouses such as Map Bureau, etc.) for data that is relatively static, and ii) querying spatial
    web-services for fetching data that tends to change on a more frequent basis.

Data Integration: Extraction of spatial and non-spatial data is done into separate PostGIS
     and XML repositories, respectively. We then generate and execute queries (XQuery
    for XML and spatial queries for Post-GIS whose results are converted to GML for further
    processing) to integrate this extracted and processed data. The generation of these queries
    is based on the DTD (generated from the logical query plan) of the stored sub-query results
    and the attributes that need to be joined/combined from different sources. The join can be
    an arbitrary join (not necessarily equality) on multiple attributes. Our approach involves
     generating XQueries for each sub-query and combining them into a larger query using FLWOR
    expressions. It might be possible that the results of some sub-queries are already integrated
        during the execution and extraction phase. This information, based on the physical query
        plan, is taken into consideration for generating the required XQuery.
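Since the generated XQueries themselves are not reproduced here, the following standard-library sketch mimics the kind of cross-repository join described above: combining sub-query results stored as XML, where the join condition is not a simple equality. The XML fragments, attribute names, and the budget condition are invented for illustration:

```python
# Sketch of the cross-repository join step: combine sub-query results
# stored as XML by joining on attributes, where the join need not be an
# equality (here: matching city AND combined cost within a budget). In
# InfoMosaic this is expressed via XQuery FLWOR expressions; the data
# below is invented.
import xml.etree.ElementTree as ET

HOTELS = ET.fromstring(
    "<hotels><hotel city='Paris' price='120'/>"
    "<hotel city='Paris' price='300'/></hotels>")
FLIGHTS = ET.fromstring(
    "<flights><flight dest='Paris' fare='450'/></flights>")

def join(flights, hotels, budget):
    """Arbitrary join: same city, and combined cost within budget."""
    out = []
    for f in flights.iter("flight"):
        for h in hotels.iter("hotel"):
            total = int(f.get("fare")) + int(h.get("price"))
            if f.get("dest") == h.get("city") and total <= budget:
                out.append({"city": h.get("city"), "total": total})
    return out

print(join(FLIGHTS, HOTELS, budget=600))
```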

Result Ranking: We are currently addressing this challenge [35] by investigating the application
    of ranking at different stages in the integration process (i.e., at sub-query execution phase,
    before the integration phase, after the integration phase, etc.).

Representation of Domain Knowledge and Source Semantics: We address this challenge
    by adopting a global taxonomy (that models all the heterogeneous domains across which user
    queries might be posed), and a domain taxonomy (that models all the sources belonging to
    the domain and orders them based on distinct criteria specified by the integration system).
    The construction of such a multi-level ontology is done by combining and enhancing the
    extensive work carried out in the areas of – knowledge representation, domain knowledge ag-
    gregation, deep-web exploration, and statistics collection. The earlier work on databases (use
    of equivalences and statistics in centralized databases, use of source schemas for obtaining
    a global schema) and recent work on information integration (as elaborated earlier) provide
     adequate reasons to believe that this can be extended to multi-domain queries and computations
     that include spatial and temporal constraints, an approach being adopted in our InfoMosaic framework.

6.3      Novelty of Our Approach
As seen in Figure 2, although several challenges in the information integration problem have been
addressed in a delimited context by a number of projects, a large number of challenges still need
to be tackled in the context of heterogeneous data integration on the Web. To the best of our
knowledge, we are the first ones to address multi-domain information extraction and integration of
results in conjunction with spatial and temporal data which is intended to push the state-of-the-art
in functionality. We believe that it is important to establish the feasibility of the functionality
before addressing performance and scalability issues. Some of the novel aspects of our approach are:

      – We are formulating the problem of multi-domain information integration as an intelligent
        query processing and optimization problem with some fundamental differences from con-
        ventional ones. InfoMosaic considers many additional statistics, semantics, domain & source
        knowledge, equivalences and inferencing for plan generation and optimization. We will extend
        conventional optimization techniques to do this by building upon techniques from databases,
        deductive databases, taxonomies, semantic information, and inferencing where appropriate.
        The thrust is to develop new techniques as well as to identify and use the existing knowledge.

      – We plan on choosing a few communities (e.g., tourists, real-estate agents, museum visitors,
        etc.) each needing information from several domains and will address the problem in a real-
        world context. The crux of the problem here is to identify clearly the information needed (from
        sources, ontologies, statistics, QoS, etc.) along with the rules and inferencing techniques to
        develop algorithms and techniques for their usage. We will address the problem using actual
        domains and web sources rather than making assumptions on data sources or using artificial
       sources (as in Havasu [9]) or using a small number of pre-determined sources (as in InfoMaster
        [42] or Ariadne [6]); however, our techniques will build upon and extend current approaches.

    – Incorporating spatial and temporal data is unique to our proposal as, to the best of our
      knowledge, this has not been addressed in the literature on information integration. (Retrieval
      of images and location data has been attempted, and mashing up locations on a map in Web 2.0
      applications is very popular, as evidenced by various projects; but combining maps and
      information from multiple domains is still the province of custom systems.)
                     Figure 2: Integration Frameworks and Challenges Addressed

    – Extensibility of the system and the ability to incrementally add functionality will be a key
      aspect of our approach. That is, if we identify the information and techniques for represen-
      tative communities, it should be possible to add other communities and domains without
      major modifications to the framework and modules. This is similar to the approach taken for
      DBMS extensibility (by adding blades, cartridges, and extenders).

    – We believe that in order for this system to be acceptable, user input should be intuitive (if not
      in natural language). We intend to develop a feedback-centric user-input mechanism that can
      compete with the simplicity of a keyword-based search request.

    – Adaptability and learning from feedback and actions taken by the system will be central
      to the whole project. The entire knowledge base of various types of information will be
      updated to improve the system (in terms of accuracy, coverage, information content, etc.) on
      a continuous basis.
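As one hedged illustration of such continuous updating (the statistic and method names below are our own assumptions, not the proposal's design), per-source accuracy estimates could be revised from user feedback with an exponential moving average and then used to rank sources:

```python
# Illustrative sketch: maintain per-source accuracy statistics from user
# feedback so that source selection improves on a continuous basis.
class SourceStats:
    def __init__(self, alpha: float = 0.2) -> None:
        self.alpha = alpha                    # weight of the newest observation
        self.accuracy: dict[str, float] = {}  # source name -> estimated accuracy

    def record_feedback(self, source: str, relevant: bool) -> None:
        # Treat each feedback event as a 0/1 observation and blend it into
        # the running estimate; 0.5 acts as an uninformed prior.
        obs = 1.0 if relevant else 0.0
        prev = self.accuracy.get(source, 0.5)
        self.accuracy[source] = (1 - self.alpha) * prev + self.alpha * obs

    def rank_sources(self) -> list[str]:
        # Prefer sources with the best estimated accuracy.
        return sorted(self.accuracy, key=self.accuracy.get, reverse=True)


stats = SourceStats()
stats.record_feedback("siteA", True)   # estimate: 0.5*0.8 + 1.0*0.2 = 0.6
stats.record_feedback("siteB", False)  # estimate: 0.5*0.8 + 0.0*0.2 = 0.4
print(stats.rank_sources())            # siteA ranked above siteB
```

The same feedback loop could drive other quality dimensions (coverage, information content, QoS) by keeping one such estimate per dimension.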

7    Conclusion
As this survey elicits, the research community has witnessed significant progress on many aspects
of data integration over the past two decades. As elaborated in Sections 3, 4, and 5, flexible
architectures for data integration, powerful methods for mediating between disparate data sources,
tools for rapid wrapping of data sources, and methods for optimizing queries across multiple data
sources are a few of the important advances achieved toward solving the problem of information
integration. However, extensive work is needed on the higher levels of the system, including
managing semantic heterogeneity in a more scalable fashion, using domain knowledge in various
parts of the system, transforming these systems from query-only tools into more active data-sharing
scenarios, and easing the management of data integration systems.
    Furthermore, the emergence of web databases and related technologies has completely changed
the landscape of the information integration problem. First, web databases provide access to
many valuable structured data sources at a scale not seen before, and the standards underlying web
services greatly facilitate sharing of data among corporations. Instead of being an option, data
integration has become a necessity. Second, business practices are changing to rely on information
integration: in order to stay competitive, corporations must employ tools for business intelligence,
and those, in turn, must glean data from multiple sources. Third, recent events have underscored
the need for data sharing among government agencies, and the life sciences have reached the point
where data sharing is crucial in order to make sustained progress. Fourth, personal information
management (PIM) is starting to receive significant attention from both the research community
and the commercial world, and a key to effective PIM is the ability to integrate data from
multiple sources.
    Finally, we believe that information integration is an inherently hard problem that cannot be
solved by just a few years of research. In terms of research style, the development of benchmarks,
theoretical research on the foundations of information integration, and system and toolkit building
are all needed. Further progress in this area will be significantly accelerated by combining expertise
from the areas of Database Systems, Artificial Intelligence, and Information Retrieval.

References
 [1] S. Brin and L. Page, “The Anatomy of a Large-Scale Hypertextual Web Search Engine.”
     Computer Networks, vol. 30, no. 1-7, pp. 107–117, 1998.

 [2] M. Sahami, V. O. Mittal, S. Baluja, and H. A. Rowley, “The Happy Searcher: Challenges in
     Web Information Retrieval.” in PRICAI, 2004, pp. 3–12.

 [3] S. M. zu Eissen and B. Stein, “Analysis of Clustering Algorithms for Web-Based Search.” in
     PAKM, 2002, pp. 168–178.

 [4] B. Katz, J. J. Lin, and D. Quan, “Natural Language Annotations for the Semantic Web.” in
     CoopIS/DOA/ODBASE, 2002, pp. 1317–1331.

 [5] M. Lesk, D. R. Cutting, J. O. Pedersen, T. Noreault, and M. B. Koll, “Real Life Information
     Retrieval: Commercial Search Engines (Panel).” in SIGIR, 1997, p. 333.

 [6] C. A. Knoblock, “Planning, Executing, Sensing, and Replanning for Information Gathering.”
     in IJCAI, 1995, pp. 1686–1693.

 [7] S. S. Chawathe, H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J. D. Ull-
     man, and J. Widom, “The TSIMMIS Project: Integration of Heterogeneous Information
     Sources.” in IPSJ, 1994, pp. 7–18.

 [8] W. W. Cohen, “Integration of Heterogeneous Databases Without Common Domains Using
     Queries Based on Textual Similarity.” in SIGMOD Conference, 1998, pp. 201–212.

 [9] S. Kambhampati, U. Nambiar, Z. Nie, and S. Vaddi, “Havasu: A Multi-Objective, Adaptive
     Query Processing Framework for Web Data Integration.” Arizona State University, Tech.
     Rep., 2002.

[10] V. Subrahmanian, “A heterogeneous reasoning and mediator system.”

[11] M. Michalowski and C. A. Knoblock, “A Constraint Satisfaction Approach to Geospatial
     Reasoning.” in AAAI, 2005, pp. 423–429.

[12] L. Liu, C. Pu, and Y. Lee, “An Adaptive Approach to Query Mediation Across Heterogeneous
     Information Sources.” in CoopIS, 1996, pp. 144–156.

[13] J.-O. Lee and D.-K. Baik, “SemQL: A Semantic Query Language for Multidatabase Systems.”
     in CIKM, 1999, pp. 259–266.

[14] A. Y. Levy, “Information Manifold Approach to Data Integration.” IEEE Intelligent Systems,
     pp. 1312–1316, 1998.

[15] D. Florescu, A. Levy, and A. Mendelzon, “Database techniques for the world-wide web: A
     survey.” SIGMOD Record, vol. 27, no. 3, pp. 59–74, 1998.

[16] A. Motro, “VAGUE: A User Interface to Relational Databases that Permits Vague Queries.”
     ACM Trans. Information Systems, vol. 6, no. 3, pp. 187–214, 1988.

[17] J. Fan, H. Khatri, Y. Chen, and S. Kambhampati, “QPIAD: Query processing over Incomplete
     Autonomous Databases.” Arizona State University, Tech. Rep., 2006.
[18] Z. Zhang, B. He, and K. C.-C. Chang, “Light-weight Domain-based Form Assistant: Querying
     Web Databases On the Fly.” in VLDB, 2005, pp. 197–208.

[19] R. M. MacGregor, “Inside the LOOM Description Classifier.” SIGART Bulletin, vol. 2, no. 3,
     pp. 88–92, 1991.

[20] O. M. Duschka and M. R. Genesereth, “Infomaster: An Information Integration Tool.” in
     Intelligent Information Integration, 1997.

[21] Z. G. Ives, D. Florescu, M. Friedman, A. Levy, and D. S. Weld, “Adaptive Query Processing
     for Internet Applications.” in IEEE Computer Society Technical Committee on Data Engi-
     neering, 1999, pp. 19–26.

[22] P. M. D. Gray, A. D. Preece, N. J. Fiddian, W. A. Gray, T. J. M. Bench-Capon, M. J. R.
     Shave, N. Azarmi, M. Wiegand, M. Ashwell, M. D. Beer, Z. Cui, B. M. Diaz, S. M. Embury,
     K. ying Hui, A. C. Jones, D. M. Jones, G. J. L. Kemp, E. W. Lawson, K. Lunn, P. Marti,
     J. Shao, and P. R. S. Visser, “KRAFT: Knowledge Fusion from Distributed Databases and
     Knowledge Bases.” in DEXA Workshop, 1997, pp. 682–691.

[23] C.-N. Hsu and C. A. Knoblock, “Reformulating Query Plans for Multidatabase Systems.” in
     CIKM, 1993, pp. 423–432.

[24] S. Decker, M. Erdmann, D. Fensel, and R. Studer, “Ontobroker: Ontology Based Access to
     Distributed and Semi-Structured Information.” in DS-8, 1999, pp. 351–369.

[25] H. Stuckenschmidt and H. Wache, “Context Modelling and Transformation for Semantic
     Interoperability.” in KRDB, 2000.

[26] B. He, M. Patel, C.-C. Chang, and Z. Zhang, “Accessing the Deep Web: A Survey.” University
     of Illinois, Urbana-Champaign, Tech. Rep., 2004.

[27] K. C.-C. Chang, B. He, and Z. Zhang, “Toward Large Scale Integration: Building a Meta-
     Querier over Databases on the Web.” in CIDR, 2005, pp. 44–55.

[28] I. Muslea, S. Minton, and C. A. Knoblock, “Hierarchical Wrapper Induction for Semistruc-
     tured Information sources.” in Autonomous Agents and Multi-Agent Systems, 2001.

[29] J. Hammer, H. Garcia-Molina, S. Nestorov, R. Yerneni, M. Breunig, and V. Vassalos,
     “Template-based wrappers in the TSIMMIS system.” in SIGMOD Conference, 1997, pp.

[30] O. M. Duschka and M. R. Genesereth, “Query Planning in Infomaster.” in ACM Symposium
     on Applied Computing, 1997, pp. 109–111.

[31] R. Baumgartner, S. Flesca, and G. Gottlob, “Visual Web Information Extraction with Lixto,”
     in VLDB, 2001, pp. 119–128.

[32] L. Tanasescu, A. Gugliotta, J. Domingue, L. G. Villaras, R. Davies, M. Rowlatt, M. Richard-
     son, and S. Stincic, “Spatial Integration of Semantic Web Services: the e-Merges Approach.”
     in International Semantic Web Conference, 2006.

[33] J. L. Ambite, C. A. Knoblock, S. Minton, and M. Muslea, “Heracles II: Conditional con-
     straint networks for interleaved planning and information gathering.” IEEE Intelligent Sys-
     tems, vol. 20, no. 2, pp. 25–33, 2005.
[34] L. N. Office, the San Diego Supercomputer Center, and the Northwest Alliance for
     Computational Science and Engineering, “The Spatial Data Workbench.”

[35] A. Telang, R. Mishra, and S. Chakravarthy, “Ranking Issues for Information Integration,”
     in First International Workshop on Ranking in Databases, in conjunction with ICDE, 2007.

[36] D. L. Lee, H. Chuang, and K. E. Seamons, “Document Ranking and the Vector-Space Model.”
     IEEE Software, vol. 14, no. 2, pp. 67–75, 1997.

[37] S. Chaudhuri, G. Das, V. Hristidis, and G. Weikum, “Probabilistic Ranking of Database
     Query Results.” in VLDB, 2004, pp. 888–899.

[38] M. Lenzerini, “Data Integration: A Theoretical Perspective.” in PODS, 2002, pp. 233–246.

[39] R. Agrawal, A. V. Evfimievski, and R. Srikant, “Information Sharing Across Private
     Databases.” in SIGMOD Conference, 2003, pp. 86–97.

[40] C. A. Knoblock, S. Minton, J. L. Ambite, N. Ashish, I. Muslea, A. Philpot, and S. Tejada,
     “The Ariadne Approach to Web-Based Information Integration.” Int. J. Cooperative Inf.
     Syst., vol. 10, no. 1-2, pp. 145–169, 2001.

[41] G. Barish, C. A. Knoblock, Y.-S. Chen, S. Minton, A. Philpot, and C. Shahabi, “The The-
     aterLoc Virtual Application.” in AAAI/IAAI, 2000, pp. 980–987.

[42] M. R. Genesereth, A. M. Keller, and O. M. Duschka, “Infomaster: An Information Integration
     System.” in SIGMOD Conference, 1997, pp. 539–542.

[43] Z. Nie, S. Kambhampati, and T. Hernandez, “BibFinder/StatMiner: Effectively Mining and
     Using Coverage and Overlap Statistics in Data Integration.” in VLDB, 2003, pp. 1097–1100.

[44] Y. Arens, C. Y. Chee, C.-N. Hsu, and C. A. Knoblock, “Retrieving and Integrating Data
     from Multiple Information Sources.” Int. J. Cooperative Inf. Syst., vol. 2, no. 2, pp. 127–158,

[45] J. Hammer, H. Garcia-Molina, K. Ireland, Y. Papakonstantinou, J. D. Ullman, and J. Widom,
     “Integrating and Accessing Heterogeneous Information Sources in TSIMMIS.” in AAAI, 1995,
     pp. 61–64.

[46] T. Hernandez and S. Kambhampati, “Improving Text Collection Selection with Coverage and
     Overlap Statistics.” in WWW (Special interest tracks and posters), 2005, pp. 1128–1129.

[47] Z. G. Ives, D. Florescu, M. Friedman, A. Levy, and D. S. Weld, “An Adaptive Query Execu-
     tion System for Data Integration,” in SIGMOD Record, 1999.

[48] S. B. Davidson, J. Crabtree, B. P. Brunk, J. Schug, V. Tannen, G. C. Overton, and C. J.
     Stoeckert Jr., “K2/Kleisli and GUS: Experiments in integrated access to genomic data sources.” IBM
     Systems Journal, vol. 40, no. 2, pp. 512–531, 2001.

[49] H. Wache, T. Vögele, U. Visser, H. Stuckenschmidt, G. Schuster, H. Neumann, and S. Hübner,
     “Ontology-Based Information Integration: A Survey,” IJCAI, 2001.

[50] A. Doan and A. Y. Halevy, “Semantic-Integration Research in the Database community,” AI
     Magazine, vol. 26, no. 1, pp. 83–94, 2005.
[51] N. F. Noy, “Semantic Integration: A Survey Of Ontology-Based Approaches.” SIGMOD
     Record, vol. 33, no. 4, pp. 65–70, 2004.

[52] M. Doerr, J. Hunter, and C. Lagoze, “Towards a Core Ontology for Information Integration.”
     Journal of Digital Information, vol. 4, no. 1, 2003.

[53] J. A. Hendler and E. A. Feigenbaum, “Knowledge Is Power: The Semantic Web Vision.” in
     Web Intelligence, 2001, pp. 18–29.

[54] D. K. W. Chiu and H. fung Leung, “Towards Ubiquitous Tourist Service Coordination and
     Integration: A Multi-agent and Semantic Web approach.” in ICEC, 2005, pp. 574–581.

[55] J. Reinoso, A. Silvescu, D. Caragea, J. Pathak, and V. Honavar, “Information Extraction and
     Integration from Heterogeneous, Distributed, Autonomous Information Sources : A Federated
     Ontology-Driven Query-Centric Approach.” in IRI, 2003, pp. 183–191.

[56] J. Reinoso-Castillo, “Ontology-driven information extraction and integration from hetero-
     geneous distributed autonomous data sources: A federated query centric approach.” Ph.D.
     dissertation, Artificial Intelligence Research Laboratory, Department of Computer Science,
     Iowa State University, 2002.

[57] M. Friedman, A. Y. Levy, and T. D. Millstein, “Navigational Plans For Data Integration.”
     in AAAI/IAAI, 1999, pp. 67–73.

[58] U. Nambiar and S. Kambhampati, “Mining Approximate Functional Dependencies and Con-
     cept Similarities to Answer Imprecise Queries.” in WebDB, 2004, pp. 73–78.

[59] Z. Nie and S. Kambhampati, “Joint optimization of cost and coverage of query plans in data
     integration.” in CIKM, 2001, pp. 223–230.

[60] Z. Nie, S. Kambhampati, U. Nambiar, and S. Vaddi, “Mining source coverage statistics for
     data integration.” in WIDM, 2001, pp. 1–8.

[61] U. Nambiar and S. Kambhampati, “Answering Imprecise Queries over Autonomous Web
     Databases.” in ICDE, 2006, p. 45.

[62] Z. Nie and S. Kambhampati, “A Frequency-based Approach for Mining Coverage Statistics
     in Data Integration.” in ICDE, 2004, pp. 387–398.

[63] H. He, W. Meng, C. T. Yu, and Z. Wu, “WISE-Integrator: An Automatic Integrator of Web
     Search Interfaces for E-Commerce.” in VLDB, 2003, pp. 357–368.

[64] B. He, Z. Zhang, and K. C.-C. Chang, “Towards Building a MetaQuerier: Extracting and
     Matching Web Query Interfaces.” in ICDE, 2005, pp. 1098–1099.

[65] G. Kabra, C. Li, and K. C.-C. Chang, “Query Routing: Finding Ways in the Maze of the
     Deep Web.” in International Workshop on Challenges in Web Information Retrieval and
     Integration, 2005, pp. 64–73.

[66] B. He, Z. Zhang, and K. C.-C. Chang, “MetaQuerier: Querying Structured Web Sources
     On-the-fly.” in SIGMOD Conference, 2005, pp. 927–929.

[67] B. He and K. C.-C. Chang, “A Holistic Paradigm for Large Scale Schema Matching.” SIG-
     MOD Record, vol. 33, no. 4, pp. 20–25, 2004.
[68] Y. Arens, C. Y. Chee, C.-N. Hsu, and C. A. Knoblock, “Retrieving and Integrating Data
     from Multiple Information Sources.” Int. J. Cooperative Inf. Syst., vol. 2, no. 2, pp. 127–158,

[69] K. Lerman, S. Minton, and C. A. Knoblock, “Wrapper Maintenance: A Machine Learning
     approach.” JAIR, vol. 18, pp. 149–181, 2003.

[70] S. Tejada, C. A. Knoblock, and S. Minton, “Learning domain-independent string transforma-
     tion weights for high accuracy object identification.” in ACM SIGKDD, 2002, pp. 350–359.

[71] G. Barish and C. A. Knoblock, “An efficient and expressive language for information gathering
     on the web.” in AIPS, 2002.

[72] M. Michelson and C. A. Knoblock, “Learning Blocking Schemes for Record Linkage.” in
     AAAI, 2006.

[73] M. Michalowski, J. L. Ambite, C. A. Knoblock, S. Minton, S. Thakkar, and R. Tuchinda, “Re-
     trieving and Semantically Integrating Heterogeneous Data from the Web.” IEEE Intelligent
     Systems, vol. 19, no. 3, pp. 72–79, 2004.

[74] K. Lerman, A. Plangprasopchok, and C. A. Knoblock, “Automatically Labeling the Inputs
     and Outputs of Web Services.” in AAAI, 2006.

[75] S. Thakkar, J. L. Ambite, and C. A. Knoblock, “Composing, Optimizing, and Executing
     Plans for Bioinformatics Web Services.” VLDB Journal of Data Management, Analysis and
     Mining for Life Sciences, vol. 14, no. 3, pp. 330–353, 2005.

[76] A. Y. Levy and M.-C. Rousset, “CARIN: A Representation Language Combining Horn Rules
     and Description Logics.” in ECAI, 1996, pp. 323–327.

[77] “Datalog and Logic-Based Databases,” aabyan/415/datalog.html.

[78] “Description Logics.”

[79] A. Y. Levy, A. Rajaraman, and J. J. Ordille, “Query-Answering Algorithms for Information
     Agents.” in AAAI/IAAI, vol. 1, 1996, pp. 40–47.

[80] A. Y. Levy, “Obtaining Complete Answers from Incomplete Databases.” in VLDB, 1996, pp.

[81] A. Y. Levy, A. Rajaraman, and J. J. Ordille, “Querying Heterogeneous Information Sources
     Using Source Descriptions.” in VLDB, 1997, pp. 251–262.

[82] D. Florescu, D. Koller, and A. Y. Levy, “Using Probabilistic Information in Data Integration.”
     in VLDB, 1997, pp. 216–225.

[83] Y. Papakonstantinou, H. Garcia-Molina, and J. Widom, “Object Exchange Across Heteroge-
     neous Information Sources.” in ICDE, 1995, pp. 251–260.

[84] C. Li, R. Yerneni, V. Vassalos, H. Garcia-Molina, Y. Papakonstantinou, J. Ullman, and
     M. Valiveti, “Capability based mediation in TSIMMIS.” in SIGMOD Conference, 1998, pp.
 [85] H. Garcia-Molina, Y. Papakonstantinou, D. Quass, A. Rajaraman, Y. Sagiv, J. D. Ullman,
      V. Vassalos, and J. Widom, “The TSIMMIS Approach to Mediation: Data Models and
      Languages.” Intelligent Information Systems, vol. 8, no. 2, pp. 117–132, 1997.
 [86] W. F. Cody, L. M. Haas, W. Niblack, M. Arya, M. J. Carey, R. Fagin, M. Flickner, D. Lee,
      D. Petkovic, P. M. Schwarz, J. T. II, M. T. Roth, J. H. Williams, and E. L. Wimmers,
      “Querying Multimedia Data from Multiple Repositories by Content: The Garlic Project.” in
      VDB, 1995, pp. 17–35.
 [87] O. M. Duschka, “Query Planning and Optimization in Information Integration.” Ph.D. dis-
      sertation, Stanford University, 1997.
 [88] A. M. Keller and M. R. Genesereth, “Using Infomaster to Create a Housewares Virtual
      Catalog.” International Journal of Electronic Markets, vol. 7, no. 4, pp. 41–45, 1997.
 [89] ——, “Multivendor Catalogs: Smart Catalogs and Virtual Catalogs.” Journal of Electronic
      Commerce, vol. 9, no. 3, 1996.
 [90] W. W. Cohen, “Data Integration using Similarity Joins and a Word-based Information Rep-
      resentation Language.” in ACM Transactions on Information Systems, vol. 18, no. 3, 2000,
      pp. 288–321.
 [91] ——, “Recognizing Structure in Web Pages using Similarity Queries.” in AAAI/IAAI, 1999,
      pp. 59–66.
 [92] T. Ksiezyk, G. Martin, and Q. Jia, “InfoSleuth: Agent-Based System for Data Integration
      and Analysis.” in COMPSAC, 2001, p. 474.
 [93] F. Goasdoué, V. Lattès, and M.-C. Rousset, “The Use of CARIN Language and Algorithms
      for Information Integration: The PICSEL System.” Int. J. Cooperative Inf. Syst., vol. 9,
      no. 4, pp. 383–401, 2000.
 [94] E. Mena, V. Kashyap, A. P. Sheth, and A. Illarramendi, “OBSERVER: An Approach for
      Query Processing in Global Information Systems based on Interoperation across Pre-existing
      Ontologies.” in CoopIS, 1996, pp. 14–25.
 [95] C. Shahabi, M. R. Kolahdouzan, S. Thakkar, J. L. Ambite, and C. A. Knoblock, “Efficiently
      Querying Moving Objects with pre-defined paths in a Distributed Environment.” in ACM
      International Symposium on Advances in Geographic Information Systems, 2001.
 [96] A. Gupta, I. Zaslavsky, and R. Marciano, “Generating Query Plans within a Spatial Media-
      tor.” in International Symposium on Spatial Data Handling, 2000.
 [97] O. Boucelma, M. Essid, and Z. Lacroix, “A WFS-based mediation system for GIS interoper-
      ability.” in Advances in Geographic Information Systems, 2002.
 [98] A. Gupta, R. Marciano, I. Zaslavsky, and C. K. Baru, “Integrating GIS and Imagery Through
      XML-Based Information Mediation.” in Integrated Spatial Databases, Digital Images and GIS,
 [99] I. Zaslavsky, R. Marciano, A. Gupta, and C. Baru, “XML-based Spatial Data Mediation
      Infrastructure for Global Interoperability.” in Spatial Data Infrastructure Conference, 2000.
[100] X. Ma, Q. Pan, and M. Li, “Integration and Share of Spatial Data Based on Web Service.”
      in Parallel and Distributed Computing Applications and Technologies, 2005, pp. 328–332.
[101] C. Baru, A. Gupta, I. Zaslavsky, Y. Papakonstantinou, and P. Joftis, “I2T: An Information
      Integration testbed for Digital Government.” in Digital Government Research, 2004, pp. 1–2.

[102] C.-C. Chen, C. A. Knoblock, C. Shahabi, Y.-Y. Chiang, and S. Thakkar, “Automatically and
      Accurately Conflating Orthoimagery and Street maps.” in ACM International Symposium on
      Advances in Geographic Information Systems, 2004.

[103] C.-C. Chen, C. A. Knoblock, C. Shahabi, and S. Thakkar, “Automatically and Accurately
      Conflating Satellite Imagery and Maps.” in International Workshop on Next Generation
      Geospatial Information, 2003.

[104] U. Nambiar and S. Kambhampati, “Answering imprecise database queries: a novel approach.”
      in WIDM, 2003, pp. 126–133.

[105] N. W. Paton and C. A. Goble, “Information Management for Genome Level Bioinformatics.”
      in VLDB, 2001.

[106] R. Stevens, P. G. Baker, S. Bechhofer, G. Ng, A. Jacoby, N. W. Paton, C. A. Goble, and
      A. Brass, “TAMBIS: Transparent Access to Multiple Bioinformatics Information Sources.”
      Bioinformatics, vol. 16, no. 2, pp. 184–186, 2000.

[107] L. M. Haas, P. M. Schwarz, P. Kodali, E. Kotlar, J. E. Rice, and W. C. Swope, “DiscoveryLink:
      A system for integrated access to life sciences data sources.” IBM Systems Journal, vol. 40,
      no. 2, pp. 489–511, 2001.

[108] R. C. Geer and E. W. Sayers, “Tutorial Section: Entrez: Making Use of Its Power.” Briefings
      in Bioinformatics, vol. 4, no. 2, pp. 179–184, 2000.
