Building Efficient and Effective Metasearch Engines
State University of New York at Binghamton

University of Illinois at Chicago


DePaul University

             Frequently a user’s information needs are stored in the databases of multiple
             search engines. It is inconvenient and inefficient for an ordinary user to invoke multiple
             search engines and identify useful documents from the returned results. To support
             unified access to multiple search engines, a metasearch engine can be constructed.
             When a metasearch engine receives a query from a user, it invokes the underlying
             search engines to retrieve useful information for the user. Metasearch engines have
             other benefits as a search tool such as increasing the search coverage of the Web and
             improving the scalability of the search. In this article, we survey techniques that have
             been proposed to tackle several underlying challenges for building a good metasearch
             engine. Among the main challenges, the database selection problem is to identify search
             engines that are likely to return useful documents to a given query. The document
             selection problem is to determine what documents to retrieve from each identified search
             engine. The result merging problem is to combine the documents returned from multiple
             search engines. We will also point out some problems that need to be further researched.

              Categories and Subject Descriptors: C.2.4 [Computer-Communication Networks]:
              Distributed Systems—Distributed databases; H.3.3 [Information Storage and
              Retrieval]: Information Search and Retrieval—Search process; Selection process; H.3.4
              [Information Storage and Retrieval]: Systems and Software—Information networks
              General Terms: Design, Experimentation, Performance
              Additional Key Words and Phrases: Collection fusion, distributed collection, distributed
              information retrieval, information resource discovery, metasearch

This work was supported in part by the following National Science Foundation (NSF) grants: IIS-9902872,
IIS-9902792, and EIA-9911099.
Authors’ addresses: W. Meng, Department of Computer Science, State University of New York at Binghamton,
Binghamton, NY 13902; email: meng@cs.binghamton.edu; C. Yu, Department of Computer Science,
University of Illinois at Chicago, Chicago, IL 60607; email: yu@cs.uic.edu; K.-L. Liu, School of Computer
Science, Telecommunications and Information Systems, DePaul University, Chicago, IL 60604; email:
Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted with-
out fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright
notice, the title of the publication, and its date appear, and notice is given that copying is by permission of
ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific
permission and/or a fee.
© 2002 ACM 0360-0300/02/0300-0048 $5.00

ACM Computing Surveys, Vol. 34, No. 1, March 2002, pp. 48–89.

1. INTRODUCTION

The Web has become a vast information resource in recent years. Millions of people use the Web on a regular basis and the number is increasing rapidly. Most data on the Web is in the form of text or image. In this survey, we concentrate on the search of text data.

Finding desired data on the Web in a timely and cost-effective way is a problem of wide interest. In the last several years, many search engines have been created to help Web users find desired information. Each search engine has a text database that is defined by the set of documents that can be searched by the search engine. When there is no confusion, the term database and the phrase search engine will be used interchangeably in this survey. Usually, an index for all documents in the database is created in advance. For each term that represents a content word or a combination of several (usually adjacent) content words, this index can quickly identify the documents that contain the term. Google, AltaVista, Excite, Lycos, and HotBot are all popular search engines on the Web.

Two types of search engines exist. General-purpose search engines aim at providing the capability to search all pages on the Web. The search engines we mentioned in the previous paragraph are a few of the well-known ones. Special-purpose search engines, on the other hand, focus on documents in confined domains such as documents in an organization or in a specific subject area. For example, the Cora search engine (cora.whizbang.com) focuses on computer science research papers and Medical World Search (www.mwsearch.com) is a search engine for medical information. Most organizations and business sites have installed search engines for their pages. It is believed that hundreds of thousands of special-purpose search engines currently exist on the Web [Bergman 2000].

The amount of data on the Web is huge. It is believed that by February of 1999, there were already more than 800 million publicly indexable Web pages [Lawrence and Lee Giles 1999] and the number is well over 2 billion now (Google has indexed over 2 billion pages) and is increasing at a very high rate. Many believe that employing a single general-purpose search engine for all data on the Web is unrealistic [Hawking and Thistlewaite 1999; Sugiura and Etzioni 2000; Wu et al. 2001]. First, its processing power may not scale to the rapidly increasing and virtually unlimited amount of data. Second, gathering all the data on the Web and keeping it reasonably up-to-date are extremely difficult if not impossible objectives. Programs (e.g., Web robots) used by major search engines to gather data automatically may slow down local servers and are increasingly unpopular. Furthermore, many sites may not allow their documents to be indexed but instead may allow the documents to be accessed through their search engines only (these sites are part of the so-called deep Web [Bergman 2000]). Consequently, we have to live with the reality of having a large number of special-purpose search engines that each covers a portion of the Web.

A metasearch engine is a system that provides unified access to multiple existing search engines. A metasearch engine does not maintain its own index of documents. However, a sophisticated metasearch engine may maintain information about the contents of its underlying search engines to provide better service. In a nutshell, when a metasearch engine receives a user query, it first passes the query (with necessary reformatting) to the appropriate underlying search engines, and then collects and reorganizes the results received from them. A simple two-level architecture of a metasearch engine is depicted in Figure 1. This two-level architecture can be generalized to a hierarchy of more than two levels when the number of underlying search engines becomes large [Baumgarten 1997; Gravano and Garcia-Molina 1995; Sheldon et al. 1994; Yu et al. 1999b].

There are a number of reasons for the development of a metasearch engine and we discuss these reasons below.
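
The query flow described in this section (reformat the query for each underlying engine, forward it, then collect and reorganize the results) can be illustrated with a minimal Python sketch. Everything named here is hypothetical and not taken from any surveyed system: the `Result` type, the `metasearch` function, the toy engines, and their URLs are illustrative only. The sketch also assumes, purely for simplicity, that scores reported by different engines are directly comparable and that a duplicate URL keeps its highest score; real engines rarely satisfy the first assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Result:
    url: str
    score: float  # score reported by the engine (assumed comparable across engines)
    source: str   # name of the engine that returned the document


# A component engine is modeled as a function from a query string
# (already in that engine's format) to a list of (url, score) pairs.
Engine = Callable[[str], List[Tuple[str, float]]]


def metasearch(query: str,
               engines: Dict[str, Engine],
               reformat: Dict[str, Callable[[str], str]],
               top_k: int = 10) -> List[Result]:
    """Forward the query to every engine and merge the returned results."""
    merged: Dict[str, Result] = {}
    for name, search in engines.items():
        # Translate the user query into the engine's own format, if needed.
        local_query = reformat.get(name, lambda q: q)(query)
        for url, score in search(local_query):
            best = merged.get(url)
            # Simplification: keep the highest score seen for a duplicate URL.
            if best is None or score > best.score:
                merged[url] = Result(url, score, name)
    # Reorganize: one ranked list, highest score first (ties broken by URL).
    return sorted(merged.values(), key=lambda r: (-r.score, r.url))[:top_k]


# Toy engines with made-up data; the second one only accepts '+'-joined terms.
def news_engine(q: str) -> List[Tuple[str, float]]:
    if "metasearch" not in q:
        return []
    return [("http://news.example/a", 0.9), ("http://shared.example/x", 0.4)]


def papers_engine(q: str) -> List[Tuple[str, float]]:
    if q != "metasearch+engine":
        return []
    return [("http://papers.example/b", 0.7), ("http://shared.example/x", 0.6)]


engines = {"news": news_engine, "papers": papers_engine}
reformat = {"papers": lambda q: "+".join(q.split())}

for r in metasearch("metasearch engine", engines, reformat, top_k=3):
    print(f"{r.score:.1f}  {r.url}  (from {r.source})")
```

The sketch forwards the query to every engine and merges by raw score; choosing which engines to invoke and producing a sound merged ranking are treated here as out of scope.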

Fig. 1. A simple metasearch architecture.

(1) Increase the search coverage of the Web. A recent study [Lawrence and Lee Giles 1999] indicated that the coverage of the Web by individual major general-purpose search engines has been decreasing steadily. This is mainly due to the fact that the Web has been increasing at a much faster rate than the indexing capability of any single search engine. By combining the coverages of multiple search engines through a metasearch engine, a much higher percentage of the Web can be searched. While the largest general-purpose search engines index less than 2 billion Web pages, all special-purpose search engines combined may index up to 500 billion Web pages [Bergman 2000].

(2) Solve the scalability of searching the Web. As we mentioned earlier, the approach of employing a single general-purpose search engine for the entire Web has poor scalability. In contrast, if a metasearch engine on top of all the special-purpose search engines can be created as an alternative to search the entire Web, then the problems associated with employing a single general-purpose search engine will either disappear or be significantly alleviated. The size of a typical special-purpose search engine is much smaller than that of a major general-purpose search engine. Therefore, it is much easier for it to keep its index data more up to date (i.e., updating of index data to reflect the changes of documents can be carried out more frequently). It is also much easier to build the necessary hardware and software infrastructure for a special-purpose search engine. As a result, the metasearch engine approach for searching the entire Web is likely to be significantly more scalable than the centralized general-purpose search engine approach.

(3) Facilitate the invocation of multiple search engines. The information needed by a user is frequently stored in the databases of multiple search engines. As an example, consider the case when a user wants to find the best 10 newspaper articles about a special event. It is likely that the desired articles are scattered across the databases of a number of newspapers. The user can send his/her query to every newspaper database and examine the retrieved articles from each database to identify the 10 best articles. This is a formidable task. First, the user will have to identify the sites of the newspapers. Second, the user will need to send the query to each of these databases. Since different databases may accept queries in different formats, the user will have to format the query correctly for each database. Third, there will be no overall quality ranking among the articles returned from these databases even though the retrieved articles from each individual database may be ranked. As a result, it will be difficult for the user, without reading the contents of the articles, to determine which articles are likely to be among the most useful ones. If there are a large number of databases, each returning some articles to the user, then the user will simply be overwhelmed. If a metasearch engine on top of these local search engines is built, then the user only needs to submit one query to invoke all local search engines via the metasearch engine. A good metasearch engine can rank the documents returned from different search engines properly. Clearly, such a metasearch engine makes the user’s task much easier.

(4) Improve the retrieval effectiveness. Consider the scenario where a user
needs to find documents in a specific subject area. Suppose that there is a special-purpose search engine for this subject area and there is also a general-purpose search engine that contains all the documents indexed by the special-purpose search engine in addition to many documents unrelated to this subject area. It is usually true that if the user submits the same query to both of the two search engines, the user is likely to obtain better results from the special-purpose search engine than the general-purpose search engine. In other words, the existence of a large number of unrelated documents in the general-purpose search engine may hinder the retrieval of desired documents. In text retrieval, documents in the same collection can be grouped into clusters such that the documents in the same cluster are more related than documents across different clusters. When evaluating a query, clusters related to the query can be identified first and then the search can be carried out for these clusters. This method has been shown to improve the retrieval effectiveness of the system [Xu and Croft 1999]. For documents on the Web, the databases in different special-purpose search engines are natural clusters. As a result, if for any given query submitted to the metasearch engine, the search can be restricted to only special-purpose search engines related to the query, then it is likely that better retrieval effectiveness can be achieved using the metasearch engine than using a general-purpose search engine. While it may be possible for a general-purpose search engine to cluster its documents to improve retrieval effectiveness, the quality of these clusters may not be as good as the ones corresponding to special-purpose search engines. Furthermore, constructing and maintaining the clusters consumes more resources of the general-purpose search engine.

This article has three objectives. First, we review the main technical issues in building a good metasearch engine. Second, we survey different proposed techniques for tackling these issues. Third, we point out new challenges and research directions in the metasearch engine area.

The rest of the article is organized as follows. In Section 2, we provide a short overview of some basic concepts on information retrieval (IR). These concepts are important for the discussions in this article. In Section 3, we outline the main software components of a metasearch engine. In Section 4, we discuss how the autonomy of different local search engines, as well as the heterogeneities among them, may affect the building of a good metasearch engine. In Section 5, we survey reported techniques for the database selection problem (i.e., determining which databases to search for a given user query). In Section 6, we survey known methods for the document selection problem (i.e., determining what documents to retrieve from each selected database for a user query). In Section 7, we report different techniques for the result merging problem (i.e., combining results returned from different local databases into a single ranked list). In Section 8, we present some new challenges for building a good metasearch engine.

2. BASIC INFORMATION RETRIEVAL

Information retrieval deals with techniques for finding relevant (useful) documents for any given query from a collection of documents. Documents are typically preprocessed and represented in a form that facilitates efficient and accurate retrieval. In this section, we first overview some basic concepts in classical information retrieval and then point out several features specifically associated with Web search engines.

2.1. Classical Information Retrieval

The contents of a document may be represented by the words contained in it. Some words such as “a,” “of,” and “is” do
not contain semantic information. These words are called stop words and are usually not used for document representation. The remaining words are content words and can be used to represent the document. Variations of the same word may be mapped to the same term. For example, the words “beauty,” “beautiful,” and “beautify” can be denoted by the term “beaut.” This can be achieved by a stemming program. After removing stop words and stemming, each document can be logically represented by a vector of n terms [Salton and McGill 1983; Yu and Meng 1998], where n is the total number of distinct terms in the set of all documents in a document collection.

Suppose the document d is represented by the vector $(d_1, \ldots, d_i, \ldots, d_n)$, where $d_i$ is a number (weight) indicating the importance of the ith term in representing the contents of the document d. Most of the entries in the vector will be zero because most terms are absent from any given document. When a term is present in a document, the weight assigned to the term is usually based on two factors. The term frequency (tf) of a term in a document is the number of times the term occurs in the document. Intuitively, the higher the term frequency of a term is, the more important the term is in representing the contents of the document. As a consequence, the term frequency weight (tfw) of the term in the document is usually a monotonically increasing function of its term frequency. The second factor affecting the weight of a term is the document frequency (df), which is the number of documents having the term. Usually, the higher the document frequency of a term is, the less important the term is in differentiating documents having the term from documents not having it. Thus, the weight of a term based on its document frequency is usually monotonically decreasing and is called the inverse document frequency weight (idfw). The weight of a term in a document can be the product of its term frequency weight and its inverse document frequency weight, that is, $tfw * idfw$.

A query is simply a question written in text.1 It can be transformed into an n-dimensional vector as well. Specifically, the noncontent words are eliminated by comparing the words in the query against the stop word list. Then, words in the query are mapped into terms and, finally, terms are weighted based on term frequency and/or document frequency information.

1 We note that Boolean queries are also supported by many IR systems. In this article, we concentrate on vector space queries only unless other types of queries are explicitly identified. A study of 51,473 real user queries submitted to the Excite search engine indicated that less than 10% of these queries are Boolean queries [Jansen et al. 1998].

After the vectors of all documents and a query are formed, document vectors which are close to the query vector are retrieved. A similarity function can be used to measure the degree of closeness between two vectors. One simple function is the dot product function, $\mathrm{dot}(q, d) = \sum_{i=1}^{n} q_i * d_i$, where $q = (q_1, \ldots, q_n)$ is the vector of a query and $d = (d_1, \ldots, d_n)$ is the vector of a document. The dot product function is a weighted sum of the terms in common between the two vectors. The dot product function tends to favor long documents having many terms, because the chance of having more terms in common between a document and a given query is higher for a longer document than a shorter document. In order that all documents have a fair chance of being retrieved, the cosine function can be utilized. It is given by $\mathrm{dot}(q, d)/(|q| \cdot |d|)$, where $|q|$ and $|d|$ denote, respectively, the lengths of the query vector and the document vector. The cosine function [Salton and McGill 1983] between two vectors is really the cosine of the angle between the two vectors and it always returns a value between 0 and 1 when the weights are nonnegative. It gets the value 0 if there is no term in common between the query and the document; its value is 1 if the query and the document vectors are identical or one vector is a positive constant multiple of the other.

A common measure for retrieval effectiveness is recall and precision. For a given query submitted by a user, suppose that
the set of relevant documents with respect to the query in the document collection can be determined. The two quantities recall and precision can be defined as follows:

$$\text{recall} = \frac{\text{the number of retrieved relevant documents}}{\text{the number of relevant documents}}, \tag{1}$$

$$\text{precision} = \frac{\text{the number of retrieved relevant documents}}{\text{the number of retrieved documents}}. \tag{2}$$

To evaluate the effectiveness of a text retrieval system, a set of test queries is used. For each query, the set of relevant documents is identified in advance. For each such query, a precision value for each distinct recall value is obtained. When these sets of recall-precision values are averaged over the set of test queries, an average recall-precision curve is obtained. This curve is used as the measure of the effectiveness of the system.

An ideal information retrieval system retrieves all relevant documents and nothing else (i.e., both recall and precision equal to 1). In practice, this is not possible, as a user’s needs may be incorrectly or imprecisely specified by his/her query and the user’s concept of relevance varies over time and is difficult to capture. Thus, the retrieval of documents is implemented by employing some similarity function that approximates the degrees of relevance of documents with respect to a given query. Relevance information due to previous retrieval results may be utilized by systems with learning capabilities to improve retrieval effectiveness. In the remaining portion of this paper, we shall restrict ourselves to the use of similarity functions in achieving high retrieval effectiveness, except for certain situations where users’ feedback information is incorporated.

2.2. Web Search Engines

A Web search engine is essentially an information retrieval system for Web pages. However, Web pages have several features that are not usually associated with documents in traditional IR systems and these features have been explored by search engine developers to improve the retrieval effectiveness of search engines.

The first special feature of Web pages is that they are highly tagged documents. At present, most Web pages are in HTML format. In the foreseeable future, XML documents may be widely used. These tags often convey rich information regarding the terms used in documents. For example, a term appearing in the title of a document or emphasized with a special font can provide a hint that the term is rather important in indicating the contents of the document. Tag information has been used by a number of search engines such as Google and AltaVista to better determine the importance of a term in representing the contents of a page. For example, a term occurring in the title or the header of a page may be considered to be more important than the same term occurring in the main text. As another example, a term typed in a special font such as bold face and large fonts is likely to be more important than the same term not in any special font. Studies have indicated that the higher weights assigned to terms due to their locations or their special fonts or tags can yield higher retrieval effectiveness than schemes which do not take advantage of the location or tag information [Cutler et al. 1997].

The second special feature of Web pages is that they are extensively linked. A link from page A to page B provides a convenient path for a Web user to navigate from page A to page B. Careful analysis can reveal that such a simple link could contain several pieces of information that may be made use of to improve retrieval effectiveness. First, such a link indicates a good likelihood that the contents of the two pages are related. Second, the author of page A values the contents of page B. The linkage information has been used to compute the global importance (i.e., PageRank) of Web pages based on whether a page is pointed to by many pages and/or by important pages [Page et al. 1998]. This has been successfully used in the Google search engine to improve retrieval
effectiveness. The linkage information has also been used to compute the authority (the degree of importance) of Web pages with respect to a given topic [Kleinberg 1998]. IBM’s Clever Project aims to develop a search engine that employs the technique of computing the authorities of Web pages for a given query [Chakrabarti et al. 1999].

Another way to utilize the linkage information is as follows. When a page A has a link to page B, a set of terms known as anchor terms is usually associated with the link. The purpose of using the anchor terms is to provide information regarding the contents of page B to facilitate the navigation by human users. The anchor terms often provide related terms or synonyms to the terms used to index page B. To utilize such valuable information, several search engines like Google [Brin and Page 1998] and WWWW [McBryan 1994] have suggested also using anchor terms to represent linked pages (e.g., page B). In general, a Web page may be linked by many other Web pages and has many associated anchor terms.

In a typical session of using a metasearch engine, a user submits a query to the metasearch engine through a user-friendly interface. The metasearch engine then sends the user query to a number of underlying search engines (which will be called component search engines in this article). Different component search engines may accept queries in different formats. The user query may thus need to be translated to an appropriate format for each local system. After the retrieval results from the local search engines are received, the metasearch engine merges the results into a single ranked list and presents the merged result, possibly only the top portion of the merged result, to the user. The result could be a list of documents or more likely a list of document identifiers (e.g., URLs for Web pages on the Web) with possibly short companion descriptions.

“document identifiers” interchangeably unless it is important to distinguish them.

Now let us introduce the concept of potentially useful documents.

Definition 1. Suppose there is a similarity function that computes the similarities between documents and any given query and the similarity of a document with a given query approximates the degree of the relevance of the document to the “average user” who submits the query. For a given query, a document d is said to be potentially useful if it satisfies one of the following conditions:

(1) If m documents are desired in the final result for some positive integer m, then the similarity between d and the query is among the m highest of all similarities between all documents and the query.

(2) If every document whose similarity with the query exceeds a prespecified threshold is desired, then the similarity between d and the query is greater than the threshold.

In a metasearch engine environment, different component search engines may employ different similarity functions. For a given query and a document, their similarities computed by different local similarity functions are likely to be different and incomparable. To overcome this problem, the similarities in the above definition are computed using a similarity function defined in the metasearch engine. In other words, global similarities are used.

Note that, in principle, the two conditions in Definition 1 are mutually translatable. In other words, for a given m in Condition 1, a threshold in Condition 2 can be determined such that the number of documents whose similarities exceed the threshold is m, and vice versa. However, in practice, the translation can only be done when substantial statistical information about the text database is available. Usually, a user specifies the number of documents he or she would like to view. The system uses a threshold to determine
In this article, we use “documents” and         what documents should be retrieved and

displays only the desired number of documents to the user.

   The goal of text retrieval is to maximize the retrieval effectiveness
while minimizing the cost. For a centralized retrieval system, this can be
implemented by retrieving as many potentially useful documents as possible
while retrieving as few nonpotentially useful documents as possible. In a
metasearch engine environment, the implementation should be carried out at
two levels. First, we should select as many potentially useful databases
(these databases contain potentially useful documents) to search as possible
while minimizing the search of useless databases. Second, for each selected
database, we should retrieve as many potentially useful documents as possible
while minimizing the retrieval of useless documents.

   A reference software component architecture of a metasearch engine is
illustrated in Figure 2. The numbers on the edges indicate the sequence of
actions for a query to be processed. We now discuss the functionality of each
software component and the interactions among these components.

Fig. 2. Metasearch software component architecture.

Database selector: If the number of component search engines in a
  metasearch engine is small, it may be reasonable to send each user query
  to all of them. However, if the number is large, say in the thousands,
  then sending each query to all component search engines is no longer a
  reasonable strategy. This is because in this case, a large percentage of
  the local databases will be useless with respect to the query. Suppose a
  user is interested in only the 10 best matched documents for a query.
  Clearly, the 10 desired documents are contained in at most 10 databases.
  Consequently, if the number of databases is much larger than 10, then a
  large number of databases will be useless with respect to this query.
  Sending a query to the search engines of useless databases has several
  problems. First, dispatching the query to useless databases wastes the
  resources at the metasearch engine site. Second, transmitting the query to
  useless component search engines from the metasearch engine and
  transmitting useless documents from these search engines to the metasearch
  engine would incur unnecessary network traffic. Third, evaluating a query
  against useless component databases would waste resources at these local
  systems. Fourth, if a large number of documents were returned from useless
  databases, more effort would be needed by the metasearch engine to identify
  useful documents. Therefore, it is important to send each user query to
  only potentially useful databases. The problem of identifying potentially
  useful databases to search for a given query is known as the database
  selection problem. The software component database selector is responsible
  for identifying potentially useful databases for each user query. A good
  database selector should correctly identify as many potentially useful
  databases as possible while minimizing wrongly identifying useless
  databases as potentially useful ones. Techniques for database selection
  will be covered in Section 5.

Document selector: For each search engine selected by the database selector,
  the component document selector determines what documents to retrieve from
  the database of the search engine. The goal is to retrieve as many
  potentially useful documents from the search engine as possible while
  minimizing the retrieval of useless documents. If a large number of useless
  documents were returned from a search engine, more effort would be needed
  by the metasearch engine to identify potentially useful documents. Several
  factors may affect the selection of documents to retrieve from a component
  search engine such as the number of potentially useful documents in the
  database and the similarity function used by the component system. These
  factors help determine either the number of documents that should be
  retrieved from the component search engine or a local similarity threshold
  such that only those documents whose local similarity with the given query
  is higher than or equal to the threshold should be retrieved from the
  component search engine. Different methods for selecting documents to
  retrieve from local search engines will be described in Section 6.

Query dispatcher: The query dispatcher is responsible for establishing a
  connection with the server of each selected search engine and passing the
  query to it. HTTP (HyperText Transfer Protocol) is used for the connection
  and data transfer (sending queries and receiving results). Each search
  engine has its own requirements on the HTTP request method (e.g., the GET
  method or the POST method) and query format (e.g., the specific query box
  name). The query dispatcher must follow the requirements of each search
  engine correctly. Note that, in general, the query sent to a particular
  search engine may or may not be the same as that received by the metasearch
  engine. In other words, the original query may be translated to a new query
  before being sent to a search engine. The translation of Boolean queries
  across heterogeneous information sources is studied in Chang and
  Garcia-Molina [1999].
    For vector space queries, query translation is usually as
  straightforward as just retaining all the terms in the user query. There
  are two exceptions, however. First, the relative weights of query terms in
  the original user query may be adjusted before the query is sent to a
  component search engine. This is to adjust the relative importance of
  different query terms, which can be accomplished by repeating some query
  terms an appropriate number of times. Second, the number of documents to be
  retrieved from a component search engine may be different from that desired
  by the user. For example, suppose as part of a query, a user of the
  metasearch engine indicates that m documents should be retrieved. The
  document selector may decide that k documents should be retrieved from a
  particular component search engine. In this case, the number k, usually
  different from m, should be part of the translated query to be sent to the
  component search engine.

Result merger: After the results from selected component search engines are
  returned to the metasearch engine, the result merger combines the results
  into a single ranked list. The top m documents in the list are then
  forwarded to the user interface to be displayed, where m is the number of
  documents desired by the user. A good result merger should rank all
  returned documents in descending order of their global similarities with
  the user query. Different result merging techniques will be discussed in
  Section 7.

   In the remaining discussions, we will concentrate on the following three
main components, namely, the database selector, the document selector, and
the result merger. Except for the query translation problem, the component
query dispatcher will not be discussed further in this survey. Query
translation for Boolean queries will not be discussed in this article as we
focus on vector space queries only. More discussions on query translation for
vector space queries will be provided at appropriate places while discussing
other software components.

4. SOURCES OF CHALLENGES

In this section, we first review the environment in which a metasearch engine
is to be built and then analyze why such an environment causes tremendous
difficulties in building an effective and efficient metasearch engine.

   Component search engines that participate in a metasearch engine are often
built and maintained independently. Each search engine decides the set of
documents it wants to index and provide search service to. It also decides
how documents should be represented/indexed and when the index should be
updated. Similarities between documents and user queries are computed using a
similarity function. It is completely up to each search engine to decide what
similarity function to use. Commercial search engines often regard the
similarity functions they use and other implementational decisions as
proprietary information and do not make them available to the general public.

   As a direct consequence of the autonomy of component search engines, a
number of heterogeneities exist. In this section, we first identify major
heterogeneities that are unique to the metasearch engine environment.
Heterogeneities that are common to other autonomous systems (e.g.,
multidatabase systems) such as different OS platforms will not be described.
Then we discuss the impact of these heterogeneities as well as the autonomy
of component search engines on building an effective and efficient metasearch
engine.

4.1. Heterogeneous Environment

The following heterogeneities can be identified among autonomous component
search engines [Meng et al. 1999b].

Indexing method: Different search engines may have different ways to
  determine what terms should be used to represent a given document. For
  example, some may consider all terms in the document (i.e., full-text
  indexing) while others may use only a subset of the terms (i.e.,
  partial-text indexing). Lycos [Mauldin 1997], for example, employs
  partial-text indexing in order to save storage space and be more scalable.
  Some search engines on the Web use the anchor terms in a Web page to index
  the referenced Web page [Brin and Page 1998; Cutler et al. 1997; McBryan
  1994] while most other search engines do not. Other examples of different
  indexing techniques involve whether or not to remove stopwords and whether
  or not to perform stemming. Furthermore, different stopword lists and
  stemming algorithms may be used by different search engines.

Document term weighting scheme: Different methods exist for determining the
  weight of a term in a document. For example, one method is to use the term
  frequency weight and another is to use the product of the term frequency
  weight and the inverse document frequency weight (see Section 2). Several
  variations of these schemes exist [Salton 1989]. There are also systems
  that distinguish different occurrences of the same term [Boyan et al. 1996;
  Cutler et al. 1997; Wade et al. 1989] or different fonts of the same term
  [Brin and Page 1998]. For example, the occurrence of a term appearing in
  the title of a Web page may be considered to be more important than another
  occurrence of the same term not appearing in the title.

Query term weighting scheme: In the vector space model for text retrieval, a
  query can be considered as a special document (a very short document
  typically). It is possible for a term to appear multiple times in a query.
  Different query term weighting schemes may utilize the frequency of a term
  in a query differently for computing the weight of the term in the query.
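
As a rough illustration of how the weighting choices above can diverge, the following Python sketch computes document term weights under a plain term frequency scheme and under a tf-idf scheme, and builds query weights from repeated query terms. The function names, the toy collection, and the particular idf formula log(N/df) are assumptions made for illustration only; the schemes surveyed in Section 2 and in Salton [1989] include other variants.

```python
import math
from collections import Counter

def tf_weights(tokens):
    """Weight each term by its raw frequency in the token list."""
    return dict(Counter(tokens))

def tf_idf_weights(tokens, doc_freq, num_docs):
    """Weight each term by tf * log(N / df); one common variant, assumed here."""
    tf = Counter(tokens)
    return {t: f * math.log(num_docs / doc_freq[t]) for t, f in tf.items()}

# Hypothetical three-document collection (already tokenized).
docs = [
    ["metasearch", "engine", "query"],
    ["search", "engine", "index"],
    ["query", "query", "translation"],
]
# Document frequency: number of documents that contain each term.
doc_freq = Counter(t for d in docs for t in set(d))

tf_only = tf_weights(docs[2])
tf_idf = tf_idf_weights(docs[2], doc_freq, len(docs))

# A query treated as a very short document: repeating a term raises its weight.
query_weights = tf_weights(["engine", "query", "query"])
```

Under the term frequency scheme "query" outweighs "translation" in the third document, whereas under this tf-idf variant the rarer term "translation" receives the larger weight. Two engines indexing the same text can therefore rank it differently.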


Similarity function: Different search engines may employ different
  similarity functions to measure the similarity between a user query and a
  document. Some popular similarity functions were mentioned in Section 2 but
  other similarity functions (see, for example, Robertson et al. [1999];
  Singhal et al. [1996]) are also possible.

Document database: The text databases of different search engines may differ
  at two levels. The first level is the domain (subject area) of a database.
  For example, one database may contain medical documents (e.g.,
  www.medisearch.co.uk) and another may contain legal documents (e.g.,
  lawcrawler.lp.findlaw.com). In this case, the two databases can be said to
  have different domains. In practice, the domain of a database may not be
  easily determined since some databases may contain documents from multiple
  domains. Furthermore, a domain may be further divided into multiple
  subdomains. The second level is the set of documents. Even when two
  databases have the same domain, the sets of documents in the two databases
  can still be substantially different or even disjoint. For example, Echidna
  Medical Search (www.drsref.com.au) and Medisearch (www.medisearch.co.uk)
  are both search engines for medical information but the former is for Web
  pages from Australia and the latter for those from the United Kingdom.

Document version: Documents in a database may be modified. This is
  especially true in the World Wide Web environment where Web pages can often
  be modified at the wish of their authors. Typically, when a Web page is
  modified, those search engines that indexed the Web page will not be
  notified of the modification. Some search engines use robots to detect
  modified pages and reindex them. However, due to the high cost and/or the
  enormous amount of work involved, attempts to revisit a page can only be
  made periodically (say from one week to one month). As a result, depending
  on when a document is fetched (or refetched) and indexed (or reindexed),
  its representation in a search engine may be based on an older version or a
  newer version of the document. Since local search engines are autonomous,
  it is highly likely that different systems may have indexed different
  versions of the same document (in the case of WWW, the Web page can still
  be uniquely identified by its URL).

Result presentation: Almost all search engines present their retrieval
  result in descending order of local similarities/ranking scores. However,
  some search engines also provide the similarities of returned documents
  (e.g., FirstGov (www.firstgov.gov) and Northern Light) while some do not
  (e.g., AltaVista and Google).

   In addition to heterogeneities between component search engines, there are
also heterogeneities between the metasearch engine and the local systems. For
example, the metasearch engine uses a global similarity function to compute
the global similarities of documents. It is very likely that the global
similarity function is different from the similarity functions in some (or
even all) component search engines.

4.2. Impact of Heterogeneities

In this subsection, we show that the autonomy of and the heterogeneities
among different component search engines and between the metasearch engine
and the component search engines have a profound impact on how to evaluate
global queries in a metasearch engine.

(1) In order to estimate the usefulness of a database to a given query, the
    database selector needs to know some information about the database that
    characterizes the contents of the database. We call the characteristic
    information about a database the representative of the database. In the
    metasearch engine environment, different types of
    database representatives for different search engines may be available to
    the metasearch engine. For cooperative search engines, they may provide
    database representatives desired by the database selector. For
    uncooperative search engines that follow a certain standard, say the
    proposed STARTS standard [Gravano et al. 1997], the database
    representatives may be obtained from the information that can be provided
    by these search engines such as the document frequency and the average
    document term weight of any query term. But the representatives may not
    contain certain information desired by a particular database selector.
    For uncooperative search engines that do not follow any standard, their
    representatives may have to be extracted from past retrieval experiences
    (e.g., SavvySearch [Dreilinger and Howe 1997]) or from sampled documents
    (e.g., Callan et al. [1999]; Callan [2000]).
       There are two major challenges in developing good database selection
    algorithms. One is to identify appropriate database representatives. A
    good representative should permit fast and accurate estimation of
    database usefulness. At the same time, a good representative should have
    a small size in comparison to the size of the database and should be easy
    to obtain and maintain. As we will see in Section 5, proposed database
    selection algorithms often employ different types of representatives. The
    second challenge is to develop ways to obtain the desired
    representatives. As mentioned above, a number of solutions exist
    depending on whether a search engine follows some standard or is
    cooperative. The issue of obtaining the desired representatives will not
    be discussed further in this article.

(2) The challenges of the document selection problem and the result merging
    problem lie mainly in the fact that the same document may have different
    global and local similarities with a given query due to various
    heterogeneities. For example, for a given query q submitted by a global
    user, whether or not a document d in a component database D is
    potentially useful depends on the global similarity of d with q. It is
    highly likely that the similarity function and/or the term weighting
    scheme in D are different from the global ones. As a result, the local
    similarity of d is likely to be different from the global similarity of
    d. In fact, even when the same term weighting scheme and the same
    similarity function are used locally and globally, the global similarity
    and the local similarity of d may still be different because the
    similarity computation may make use of certain database-specific
    information (such as the document frequencies of terms). This means that
    a globally highly ranked document in D may not be a locally highly ranked
    document in D. Suppose the globally top-ranked document d is ranked ith
    locally for some i ≥ 1. In order to retrieve d from D, the local system
    may have to also retrieve all documents that have a higher local
    similarity than that of d (text retrieval systems are generally incapable
    of retrieving lower-ranked documents without first retrieving
    higher-ranked ones). It is quite possible that some of the documents that
    are ranked higher than d locally are not potentially useful based on
    their global similarities. The main challenge for document selection is
    to develop methods that can maximize the retrieval of potentially useful
    documents while minimizing the retrieval of useless documents from
    component search engines. The main challenge for result merging is to
    find ways to estimate the global similarities of documents so that
    documents returned from different component search engines can be
    properly merged.

   In the next several sections, we examine the techniques that have been
proposed to deal with the problems of database
selection, document selection, and result      Learning-based approaches: In these
merging.                                         approaches, the knowledge about
                                                 which databases are likely to return
                                                 useful documents to what types of
                                                 queries is learned from past retrieval
When a metasearch engine receives a              experiences. Such knowledge is then
query from a user, it invokes the database       used to determine the usefulness of
selector to select component search en-          databases for future queries. The re-
gines to send the query to. A good database      trieval experiences could be obtained
selection algorithm should identify poten-       through the use of training queries be-
tially useful databases accurately. Many         fore the database selection algorithm
approaches have been proposed to tackle          is put to use and/or through the real
the database selection problem. These ap-        user queries while database selection
proaches differ on the database represen-        is in active use. The obtained experi-
tatives they use to indicate the contents of     ences against a database will be saved
each database, the measures they use to          as the representative of the database.
indicate the usefulness of each database
                                                 In the following subsections, we sur-
with respect to a given query, and the tech-
                                               vey and discuss different database se-
niques they employ to estimate the useful-
                                               lection approaches based on the above
ness. We classify these approaches into the
following three categories.
Rough representative approaches:               5.1. Rough Representative Approaches
   In these approaches, the contents of a
   local database are often represented        As mentioned earlier, a rough represen-
   by a few selected key words or para-        tative of a database uses only a few key
   graphs. Such a representative is only       words or a few sentences to describe the
   capable of providing a very general         contents of the database. It is only capable
   idea on what a database is about,           of providing a very general idea on what
   and consequently database selection         the database is about.
methods using rough database representatives are not very accurate in estimating the true usefulness of databases with respect to a given query. Rough representatives are often manually generated.

Statistical representative approaches: These approaches usually represent the contents of a database using rather detailed statistical information. Typically, the representative of a database contains some statistical information for each term in the database such as the document frequency of the term and the average weight of the term among all documents that have the term. Detailed statistics allow more accurate estimation of database usefulness with respect to any user query. Scalability of such approaches is an important issue due to the amount of information that needs to be stored for each database.

   In ALIWEB [Koster 1994], an often human-generated representative in a fixed format is used to represent the contents of each local database or a site. An example of the representative used to describe a site containing files for the Perl Programming Language is as follows (www.nexor.com/site.idx):

   Template-Type:  DOCUMENT
   Title:          Perl
   URI:            /public/perl/perl.html
   Description:    Information on the Perl Programming Language. Includes a local Hypertext Perl Manual, and the latest FAQ in Hypertext.
   Keywords:       perl, perl-faq, language
   Author-Handle:  m.koster@nexor.co.uk
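As a rough illustration of how a record in this kind of fixed format might be read by a program, the following Python sketch parses "Name: value" lines into a dictionary. The function name parse_record, the shortened sample record, and the rule that an indented line continues the previous field are assumptions made for this example; they are not taken from the ALIWEB specification.

```python
# Shortened sample record; the wrapped Keywords line is an assumption
# used to exercise the continuation rule.
SAMPLE_RECORD = """\
Template-Type: DOCUMENT
Title: Perl
URI: /public/perl/perl.html
Keywords: perl, perl-faq,
    language
Author-Handle: m.koster@nexor.co.uk
"""


def parse_record(text):
    """Return {field name: value} for one record.

    Assumption: a line that starts with whitespace continues the previous
    field's value; every other non-blank line has the form 'Name: value'.
    """
    fields = {}
    last_name = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line[0].isspace() and last_name is not None:
            # Continuation line: append to the value of the previous field.
            fields[last_name] += " " + line.strip()
        else:
            # Split at the first colon only, so values may contain colons.
            name, _, value = line.partition(":")
            last_name = name.strip()
            fields[last_name] = value.strip()
    return fields


if __name__ == "__main__":
    for name, value in parse_record(SAMPLE_RECORD).items():
        print(f"{name}: {value}")
```

Keeping each field under its own key makes it straightforward to restrict a later comparison to selected fields such as Title or Keywords.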

                                                   ACM Computing Surveys, Vol. 34, No. 1, March 2002.

   The user query is matched with the representative of each component database to determine how suitable a database is for the query. The match can be against one or more fields (e.g., title, description, etc.) of the representatives based on the user's choice. Component databases are ranked based on how closely they match with the query. The user then selects component databases to search from a ranked list of component databases, one database at a time. Note that ALIWEB is not a full-blown metasearch engine as it only allows users to select one database to search at a time and it does not perform result merging.

   Similar to ALIWEB, descriptive representations of the contents of component databases are also used in WAIS [Kahle and Medlar 1991]. For a given query, the descriptions are used to rank component databases according to how similar they are to the query. The user then selects component databases to search for the desired documents. In WAIS, more than one local database can be searched at the same time.

   In Search Broker [Manber and Bigot 1997; Manber and Bigot 1998], each database is manually assigned one or two words as the subject or category keywords. Each user query consists of two parts: the subject part and the regular query part. When a query is received by the system, the subject part of the query is used to identify the component search engines covering the same subject and the regular query part is used to search documents from the identified search engines.

   In NetSerf [Chakravarthy and Haase 1995], the text description of the contents of a database is transformed into a structured representative. The transformation is performed manually and WordNet [Miller 1990] is used in the transformation process to disambiguate topical words. As an example, the description "World facts listed by country" for the World Factbook archive is transformed into the following structured representation [Chakravarthy and Haase 1995]:

   topic:     country
              synset: [nation, nationality, land, country, a_people]
              synset: [state, nation, country, land, commonwealth, res_publica, body_politic]
              synset: [country, state, land, nation]
   info-type: facts

   Each word in WordNet has one or more synsets, each containing a set of synonyms that together define a meaning. The topical word "country" has four synsets of which three are considered to be relevant, and are therefore used. The one synset (i.e., [rural area, country]) whose meaning does not match the intended meaning of "country" in the above description (i.e., "World facts listed by country") is omitted. Each user query is a sentence and is automatically converted into a structured and disambiguated representation similar to a database representation using a combination of several techniques. However, not all queries can be handled. The query representation is then matched with the representatives of local databases in order to identify potentially useful databases [Chakravarthy and Haase 1995].

   While most rough database representatives are generated with human involvement, there exist automatically generated rough database representatives. In Q-Pilot [Sugiura and Etzioni 2000], each database is represented by a vector of terms with weights. The terms can either be obtained from the interface page of the search engine or from the pages that have links to the search engine. In the former case, all content words in the interface page are considered and the weights are the term frequencies. In the latter case, only terms that appear in the same line as the link to the search engine are used and the weight of each term is the document frequency of the term (i.e., the number of back link documents that contributed the term).
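The two weighting schemes described for Q-Pilot can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the caller is assumed to have already removed non-content words and to have extracted the words on the line holding the link, and lowercasing is a simplification added here rather than something the description specifies.

```python
from collections import Counter


def interface_page_vector(content_words):
    """Term-frequency weights from the search engine's interface page.

    content_words: the content words of the interface page, in order of
    appearance (stop-word removal is assumed to be done by the caller).
    """
    return dict(Counter(word.lower() for word in content_words))


def backlink_vector(link_line_words_per_page):
    """Document-frequency weights from pages that link to the search engine.

    link_line_words_per_page: one list per back link page, holding the words
    found on the same line as the link. Each page contributes at most 1 to
    the weight of a term, however often the term occurs on that line.
    """
    doc_freq = Counter()
    for words in link_line_words_per_page:
        doc_freq.update({word.lower() for word in words})
    return dict(doc_freq)


if __name__ == "__main__":
    print(interface_page_vector(["Perl", "perl", "FAQ"]))
    print(backlink_vector([["perl", "perl", "docs"], ["Perl"]]))
```

The only difference between the two functions is what is counted: occurrences within one page in the first case, and the number of contributing pages in the second.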

   The main appeal of rough representative approaches is that the representatives can be obtained relatively easily and they require little storage space. If all component search engines are highly specialized with diversified topics and their contents can be easily summarized, then these approaches may work reasonably well. On the other hand, it is unlikely that the short description of a database can represent the database sufficiently comprehensively, especially when the database contains documents of diverse interests. As a result, missing potentially useful databases can occur easily with these approaches. To alleviate this problem, most such approaches involve users in the database selection process. For example, in ALIWEB and WAIS, users will make the final decision on which databases to select based on the preliminary selections by the metasearch engine. In Search Broker, users are required to specify the subject areas for their queries. As users often do not know the component databases well, their involvement in the database selection process can easily miss useful databases. Rough representative approaches are considered to be inadequate for large-scale metasearch engines.

5.2. Statistical Representative Approaches

A statistical representative of a database typically takes every term in every document in the database into consideration and keeps one or more pieces of statistical information for each such term. As a result, if done properly, a database selection approach employing this type of database representatives may detect the existence of individual potentially useful documents for any given query. A large number of approaches based on statistical representatives have been proposed. In this subsection, we describe five such approaches.

  5.2.1. D-WISE Approach. WISE (Web Index and Search Engine) is a centralized search engine [Yuwono and Lee 1996]. D-WISE is a proposed metasearch engine with a number of underlying search engines (i.e., distributed WISE) [Yuwono and Lee 1997]. In D-WISE, the representative of a component search engine consists of the document frequency of each term in the component database as well as the number of documents in the database. Therefore, the representative of a database with $n$ distinct terms will contain $n + 1$ quantities (the $n$ document frequencies and the cardinality of the database) in addition to the $n$ terms. Let $n_i$ denote the number of documents in the $i$th component database and $df_{ij}$ be the document frequency of term $t_j$ in the $i$th database.

   Suppose $q$ is a user query. The representatives of all databases are used to compute the ranking score of each component search engine with respect to $q$. The scores measure the relative usefulness of all databases with respect to $q$. If the score of database A is higher than that of database B, then database A will be judged to be more relevant to $q$ than database B. The ranking scores are computed as follows. First, the cue validity of each query term, say term $t_j$, for the $i$th component database, $CV_{ij}$, is computed using the following formula:

$$CV_{ij} = \frac{\dfrac{df_{ij}}{n_i}}{\dfrac{df_{ij}}{n_i} + \dfrac{\sum_{k \neq i}^{N} df_{kj}}{\sum_{k \neq i}^{N} n_k}}, \tag{3}$$

where $N$ is the total number of component databases in the metasearch engine. Intuitively, $CV_{ij}$ measures the percentage of the documents in the $i$th database that contain term $t_j$ relative to that in all other databases. If the $i$th database has a higher percentage of documents containing $t_j$ in comparison to other databases, then $CV_{ij}$ tends to have a larger value. Next, the variance of the $CV_{ij}$'s of each query term $t_j$ for all component databases, $CVV_j$, is computed as follows:

$$CVV_j = \frac{\sum_{i=1}^{N} (CV_{ij} - ACV_j)^2}{N}, \tag{4}$$

where $ACV_j$ is the average of all $CV_{ij}$'s for all component databases. The value
$CVV_j$ measures the skew of the distribution of term $t_j$ across all component databases. For two terms $t_u$ and $t_v$, if $CVV_u$ is larger than $CVV_v$, then term $t_u$ is more useful to distinguish different component databases than term $t_v$. As an extreme case, if every database had the same percentage of documents containing a term, then the term would not be very useful for database selection (the CVV of the term would be zero in this case). Finally, the ranking score of component database $i$ with respect to query $q$ is computed by

$$r_i = \sum_{j=1}^{M} CVV_j \cdot df_{ij}, \tag{5}$$

where $M$ is the number of terms in the query. It can be seen that the ranking score of database $i$ is the sum of the document frequencies of all query terms in the database weighted by each query term's CVV (recall that the value of CVV for a term reflects the distinguishing power of the term). Intuitively, the ranking scores provide clues to where useful query terms are concentrated. If a database has many useful query terms, each having a higher percentage of documents than other databases, then the ranking score of the database will be high. After the ranking scores of all databases are computed with respect to a given query, the databases with the highest scores will be selected for search for this query.

   The representative of a database in D-WISE contains one quantity, that is, the document frequency, per distinct term in the database, plus one additional quantity, that is, the cardinality, for the entire database. As a result, this approach is easily scalable. The computation is also simple. However, there are two problems with this approach. First, the ranking scores are relative scores. As a result, it will be difficult to determine the real value of a database with respect to a given query. If there are no good databases for a given query, then even the first ranked database will have very little value. On the other hand, if there are many good databases for another query, then even the 10th ranked database can be very useful. Relative ranking scores are not very useful in differentiating these situations. Second, the accuracy of this approach is questionable as this approach does not distinguish a document containing, say, one occurrence of a term from a document containing 100 occurrences of the same term.

   5.2.2. CORI Net Approach. In the Collection Retrieval Inference Network (CORI Net) approach [Callan et al. 1995], the representative of a database consists of two pieces of information for each distinct term in the database: the document frequency and the database frequency. The latter is the number of component databases containing the term. Note that if a term appears in multiple databases, only one database frequency needs to be stored in the metasearch engine to save space.

   In CORI Net, for a given query $q$, a document ranking technique known as inference network [Turtle and Croft 1991] used in the INQUERY document retrieval system [Callan et al. 1992] is extended to rank all component databases with respect to $q$. The extension is mostly conceptual and the main idea is to visualize the representative of a database as a (super) document and the set of all representatives as a collection/database of super documents. This is explained below. The representative of a database may be conceptually considered as a super document containing all distinct terms in the database. If a term appears in $k$ documents in the database, we repeat the term $k$ times in the super document. As a result, the document frequency of a term in the database becomes the term frequency of the term in the super document. The set of all super documents of the component databases in the metasearch engine forms a database of super documents. Let $D$ denote this database of all super documents. Note that the database frequency of a term becomes the document frequency of the term in $D$. Therefore, from the representatives of component databases, we can obtain the term frequency and document frequency of each term in each super
document. In principle, the $tfw \cdot idfw$ (term frequency weight times inverse document frequency weight) formula could now be used to compute the weight of each term in each super document so as to represent each super document as a vector of weights. Furthermore, a similarity function such as the cosine function may be used to compute the similarities (ranking scores) of all super documents (i.e., database representatives) with respect to query $q$ and these similarities could then be used to rank all component databases. The approach employed in CORI Net is an inference network-based probabilistic approach.

   In CORI Net, the ranking score of a database with respect to query $q$ is an estimated belief that the database contains useful documents. The belief is essentially the combined probability that the database contains useful documents due to each query term. More specifically, the belief is computed as follows. Suppose the user query contains $k$ terms $t_1, \ldots, t_k$. Let $N$ be the number of databases in the metasearch engine. Let $df_{ij}$ be the document frequency of the $j$th term in the $i$th component database $D_i$ and $dbf_j$ be the database frequency of the $j$th term. First, the belief that $D_i$ contains useful documents due to the $j$th query term is computed by

$$p(t_j \mid D_i) = c_1 + (1 - c_1) \cdot T_{ij} \cdot I_j, \tag{6}$$

where

$$T_{ij} = c_2 + (1 - c_2) \cdot \frac{df_{ij}}{df_{ij} + K}$$

is a formula for computing the term frequency weight of the $j$th term in the super document corresponding to $D_i$ and

$$I_j = \frac{\log\left(\dfrac{N + 0.5}{dbf_j}\right)}{\log(N + 1.0)}$$

is a formula for computing the inverse document frequency weight of the $j$th term based on all super documents. In the above formulas, $c_1$ and $c_2$ are constants between 0 and 1, and $K = c_3 \cdot ((1 - c_4) + c_4 \cdot dw_i / adw)$ is a function of the size of database $D_i$ with $c_3$ and $c_4$ being two constants, $dw_i$ being the number of words in $D_i$ and $adw$ being the average number of words in a database. The values of these constants ($c_1$, $c_2$, $c_3$ and $c_4$) can be determined empirically by performing experiments on actual test collections [Callan et al. 1995]. Note that the value of $p(t_j \mid D_i)$ is essentially the $tfw \cdot idfw$ weight of term $t_j$ in the super document corresponding to database $D_i$. Next, the significance of term $t_j$ in representing query $q$, denoted $p(q \mid t_j)$, can be estimated, for example, to be the query term weight of $t_j$ in $q$. Finally, the belief that database $D_i$ contains useful documents with respect to query $q$, or the ranking score of $D_i$ with respect to $q$, can be estimated to be

$$r_i = p(q \mid D_i) = \sum_{j=1}^{k} p(q \mid t_j) \cdot p(t_j \mid D_i). \tag{7}$$

   In CORI Net, the representative of a database contains slightly more than one piece of information per term (i.e., the document frequency plus the shared database frequency across all databases). Therefore, the CORI Net approach also has rather good scalability. The information for representing each component database can also be obtained and maintained easily. An advantage of the CORI Net approach is that the same method can be used to compute the ranking score of a document with a query as well as the ranking score of a database (through the database representative or super document) with a query. Recently, it was shown in Xu and Callan [1998] that if phrase information is collected and stored in each database representative and queries are expanded based on a technique called local context analysis [Xu and Croft 1996], then the CORI Net approach can select useful databases more accurately.

  5.2.3. gGlOSS Approach. The gGlOSS (generalized Glossary Of Servers' Server)
system is a research prototype [Gravano and Garcia-Molina 1995]. In gGlOSS, each component database is represented by a set of pairs $(df_i, W_i)$, where $df_i$ is the document frequency of the $i$th term and $W_i$ is the sum of the weights of the $i$th term over all documents in the component database. A threshold is associated with each query in gGlOSS to indicate that only documents whose similarities with the query are higher than the threshold are of interest. The usefulness of a component database with respect to a query in gGlOSS is defined to be the sum of the similarities of the documents in the component database with the query that are higher than the threshold associated with the query. The usefulness of a component database is used as the ranking score of the database. In gGlOSS, two estimation methods are employed based on two assumptions. One is the high-correlation assumption (for any given database, if query term $t_i$ appears in at least as many documents as query term $t_j$, then every document containing term $t_j$ also contains term $t_i$) and the other is the disjoint assumption (for a given database, for any two terms $t_i$ and $t_j$, the set of documents containing term $t_i$ is disjoint from the set of documents containing term $t_j$).

   We now discuss the two estimation methods for a component database $D$. Suppose $q = (q_1, \ldots, q_k)$ is a query and $T$ is the associated threshold, where $q_i$ is the weight of term $t_i$ in $q$.

High-correlation case: Let terms be arranged in ascending order of document frequency, i.e., $df_i \le df_j$ for any $i < j$, where $df_i$ is the document frequency of term $t_i$. This means that every document containing $t_i$ also contains $t_j$ for any $j > i$. There are $df_1$ documents having similarity $\sum_{i=1}^{k} q_i \cdot W_i / df_i$ with $q$. In general, there are $df_j - df_{j-1}$ documents having similarity $\sum_{i=j}^{k} q_i \cdot W_i / df_i$ with $q$, $1 \le j \le k$, and $df_0$ is defined to be 0. Let $p$ be an integer between 1 and $k$ that satisfies $\sum_{i=p}^{k} q_i \cdot W_i / df_i > T$ and $\sum_{i=p+1}^{k} q_i \cdot W_i / df_i \le T$. Then the estimated usefulness of this database is

$$usefulness(D, q, T) = \sum_{j=1}^{p} (df_j - df_{j-1}) \cdot \left( \sum_{i=j}^{k} q_i \cdot \frac{W_i}{df_i} \right) = \sum_{j=1}^{p} q_j \cdot W_j + df_p \cdot \sum_{j=p+1}^{k} q_j \cdot \frac{W_j}{df_j}.$$

Disjoint case: By the disjoint assumption, each document can contain at most one query term. Thus, there are $df_i$ documents that contain term $t_i$ and the similarity of these $df_i$ documents with query $q$ is $q_i \cdot W_i / df_i$. Therefore, the estimated usefulness of this database is:

$$usefulness(D, q, T) = \sum_{i = 1, \ldots, k \,\mid\, (df_i > 0) \,\wedge\, (q_i \cdot W_i / df_i > T)} df_i \cdot q_i \cdot \frac{W_i}{df_i} = \sum_{i = 1, \ldots, k \,\mid\, (df_i > 0) \,\wedge\, (q_i \cdot W_i / df_i > T)} q_i \cdot W_i.$$

   In gGlOSS, the usefulness of a database is sensitive to the similarity threshold used. As a result, gGlOSS can differentiate a database with many moderately similar documents from a database with a few highly similar documents. This is not possible in D-WISE and CORI Net. However, the two assumptions used in gGlOSS are somewhat too restrictive. As a result, the estimated database usefulness may be inaccurate. It can be shown that, when the threshold $T$ is not too large, the estimation formula based on the high-correlation assumption tends to overestimate the usefulness and the estimation formula based on the disjoint assumption tends to underestimate the usefulness. Since the two estimates by the two formulas tend to form upper and lower bounds of the true usefulness, the two methods are more useful when used together than when used separately. For a given database, the size of the database representative in gGlOSS
is twice the size of that in D-WISE. The computation for estimating the database usefulness in gGlOSS can be carried out efficiently.

  5.2.4. Estimating the Number of Potentially Useful Documents. One database usefulness measure used is "the number of potentially useful documents with respect to a given query in a database." This measure can be very useful for search services that charge a fee for each search. For example, the Chicago Tribune Newspaper Company charges a certain fee for retrieving archival newspaper articles. Suppose the fee is independent of the number of retrieved documents. In this case, from the user's perspective, a component system which contains a large number of similar documents but not necessarily the most similar documents is preferable to another component system containing just a few most similar documents. On the other hand, if a fee is charged for each retrieved document, then the component system having the few most similar documents will be preferred. This type of charging policy can be incorporated into the database selector of a metasearch engine if the number of potentially useful documents in a database with respect to a given query can be estimated.

   Let $D$ be a component database, $sim(q, d)$ be the global similarity between a query $q$ and a document $d$ in $D$, and $T$ be a similarity threshold. The number of potentially useful documents in $D$ with respect to $q$ can be defined precisely as

$$NoDoc(D, q, T) = cardinality(\{d \mid d \in D \text{ and } sim(q, d) > T\}). \tag{8}$$

   If $NoDoc(D, q, T)$ can be accurately estimated for each database with respect to a given query, then the database selector can simply select those databases with the most potentially useful documents to search for this query.

   In Meng et al. [1998], a generating-function based method is proposed to estimate $NoDoc(D, q, T)$ when the global similarity function is the dot product function (the widely used cosine function is a special case of the dot product function with each term weight divided by the document/query length). In this method, the representative of a database with $n$ distinct terms consists of $n$ pairs $\{(p_i, w_i)\}$, $i = 1, \ldots, n$, where $p_i$ is the probability that term $t_i$ appears in a document in $D$ (note that $p_i$ is simply the document frequency of term $t_i$ in the database divided by the number of documents in the database) and $w_i$ is the average of the weights of $t_i$ in the set of documents containing $t_i$. Let $(q_1, q_2, \ldots, q_k)$ be the query vector of query $q$, where $q_i$ is the weight of query term $t_i$.

   Consider the following generating function:

$$(p_1 X^{w_1 q_1} + (1 - p_1)) \cdot (p_2 X^{w_2 q_2} + (1 - p_2)) \cdots (p_k X^{w_k q_k} + (1 - p_k)). \tag{9}$$

   After the generating function (9) is expanded and the terms with the same $X^s$ are combined, we obtain

$$a_1 X^{b_1} + a_2 X^{b_2} + \cdots + a_c X^{b_c}, \quad b_1 > b_2 > \cdots > b_c. \tag{10}$$

It can be shown that, if the terms are independent and the weight of term $t_i$ whenever present in a document is $w_i$, which is given in the database representative ($1 \le i \le k$), then $a_i$ is the probability that a document in the database has similarity $b_i$ with $q$ [Meng et al. 1998]. Therefore, if database $D$ contains $N$ documents, then $N \cdot a_i$ is the expected number of documents that have similarity $b_i$ with query $q$. For a given similarity threshold $T$, let $C$ be the largest integer to satisfy $b_C > T$. Then, $NoDoc(D, q, T)$ can be estimated by the following formula:

$$NoDoc(D, q, T) = \sum_{i=1}^{C} N \cdot a_i = N \sum_{i=1}^{C} a_i. \tag{11}$$
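The estimation in formulas (9) to (11) can be illustrated with a short Python sketch. It is a minimal illustration rather than the implementation of Meng et al. [1998]: the function names, the dictionary-based representative, and the rounding of exponents (used only so that equal similarities produced by floating-point sums are merged) are assumptions made for this example.

```python
from collections import defaultdict


def expand_generating_function(representative, query):
    """Expand the product in (9) into a mapping {exponent b: coefficient a}.

    representative: {term: (p, w)} with p the probability that the term
        appears in a document and w its average weight when present.
    query: {term: q} with q the weight of the term in the query.
    """
    poly = {0.0: 1.0}  # the empty product is 1 * X^0
    for term, q in query.items():
        if term not in representative:
            continue  # p = 0, so the factor is (0 * X^... + 1) = 1
        p, w = representative[term]
        expanded = defaultdict(float)
        for b, a in poly.items():
            # Term present: similarity grows by w * q, with probability p.
            expanded[round(b + w * q, 10)] += a * p
            # Term absent: similarity unchanged, with probability 1 - p.
            expanded[b] += a * (1.0 - p)
        poly = dict(expanded)
    return poly


def estimate_nodoc(representative, query, num_docs, threshold):
    """Estimate NoDoc as in (11): N times the sum of a_i with b_i > T."""
    poly = expand_generating_function(representative, query)
    return num_docs * sum(a for b, a in poly.items() if b > threshold)


if __name__ == "__main__":
    rep = {"a": (0.2, 2.0), "b": (0.5, 1.0)}
    q = {"a": 1.0, "b": 1.0}
    # (0.2 X^2 + 0.8)(0.5 X + 0.5) = 0.1 X^3 + 0.1 X^2 + 0.4 X + 0.4
    print(expand_generating_function(rep, q))
    print(estimate_nodoc(rep, q, num_docs=100, threshold=1.5))  # about 20
```

Each query term can double the number of terms in the expansion, so the intermediate polynomial may hold up to $2^k$ entries for a query with $k$ terms.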

   The above solution has two restrictive            where p_{ij} is the probability that term t_i
assumptions. The first is the term inde-              occurs in a document and has a weight
pendence assumption and the second is                in the jth subrange, wm_{ij} is the median
the uniform term weight assumption (i.e.,            of the weights of t_i in the jth subrange,
the weights of a term in all documents                j = 1, ..., l, and l is the number of sub-
containing the term are the same—the                 ranges used. After the generating func-
average weight). These assumptions re-               tion has been obtained, the rest of the
duce the accuracy of the database use-               estimation process is identical to that de-
fulness estimation. One way to address               scribed earlier. It was shown in Meng et al.
the term independence assumption is to               [1999a] that if the maximum normalized
utilize covariances between term pairs,              weight of each term is used in the high-
term triplets, and so on and to incorpo-             est subrange, the estimation accuracy of
rate them into the generating function               the database usefulness can be drastically
(9) [Meng et al. 1998]. The problem with             improved.
this approach is that the storage overhead              The above methods [Liu et al. 2001;
for representing a component database                Meng et al. 1998; Meng et al. 1999a], while
may become too large because a very                  being able to produce accurate estimation,
large number of covariances may be as-               have a large storage overhead. Further-
sociated with each component database.               more, the computational complexity of ex-
A remedy is to use only significant co-               panding the generating function is expo-
variances (those whose absolute values               nential. As a result, they are more suitable
are significantly greater than zero). An-             for short queries.
other way to incorporate dependencies be-
tween terms is to combine certain ad-                  5.2.5. Estimating the Similarity of the Most
jacent terms into a single term [Liu                 Similar Document. Another useful measure
et al. 2001]. This is similar to recognizing         is the global similarity of the most simi-
phrases.                                             lar document in a database with respect
   In Meng et al. [1999a], a method known            to a given query. On one hand, this mea-
as the subrange-based estimation method              sure indicates the best that we can expect
is proposed to deal with the uniform term            from a database as no other documents
weight assumption. This method parti-                in the database can have higher similari-
tions the actual weights of a term ti in             ties with the query. On the other hand, for
the set of documents having the term into            a given query, this measure can be used
a number of disjoint subranges of possi-             to rank databases optimally for retrieving
bly different lengths. For each subrange,            the m most similar documents across all
the median of the weights in the sub-                databases.
range is estimated based on the assump-                 Suppose a user wants the metasearch
tion that the weight distribution of the             engine to find the m most similar docu-
term is normal (hence, the standard de-              ments to his/her query q across M com-
viation of the weights of the term needs             ponent databases D1 , D2 , . . . , D M . The fol-
to be added to the database representa-              lowing definition defines an optimal order
tive). Then, the weights of ti that fall in          of these databases for the query.
a given subrange are approximated by
the median of the weights in the sub-                  Definition 2. A set of M databases is
range. With this weight approximation,               said to be optimally ranked in the order
for a query containing term ti , the poly-           [D1 , D2 , . . . , D M ] with respect to query q if
nomial p_i * X^{w_i * q_i} + (1 - p_i) in the generat-     there exists a k such that D1 , D2 , . . . , Dk
ing function (9) is replaced by the following        contain the m most similar documents and
polynomial:                                          each Di , 1 ≤ i ≤ k, contains at least one of
                                                     the m most similar documents.
    p_{i1} * X^{wm_{i1} * q_i} + p_{i2} * X^{wm_{i2} * q_i}              Intuitively, the ordering is optimal
      + \cdots + p_{il} * X^{wm_{il} * q_i} + (1 - p_i),   (12)     because whenever the m most similar
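
A small sketch of how the per-term factor in polynomial (12) might be assembled from a database representative. It assumes that p_i equals the sum of the subrange probabilities p_{ij}, which follows if the subranges are disjoint and cover all weights of the term. The function name and sample values are hypothetical.

```python
from typing import List, Sequence, Tuple


def subrange_factor(sub_probs: Sequence[float],
                    sub_medians: Sequence[float],
                    q_weight: float) -> List[Tuple[float, float]]:
    """Return the terms (coefficient, exponent) of one factor of form (12).

    sub_probs[j]   : p_ij, probability that the term occurs with a weight
                     in the j-th subrange
    sub_medians[j] : wm_ij, median weight of the term in the j-th subrange
    q_weight       : q_i, weight of the term in the query
    """
    if len(sub_probs) != len(sub_medians):
        raise ValueError("one median is needed per subrange")
    p_total = sum(sub_probs)                       # assumed to equal p_i
    terms = [(p, wm * q_weight) for p, wm in zip(sub_probs, sub_medians)]
    terms.append((1.0 - p_total, 0.0))             # term absent: X^0
    return terms


if __name__ == "__main__":
    for coef, exp in subrange_factor([0.05, 0.10, 0.05], [0.8, 0.4, 0.1], 1.0):
        print(f"{coef:.2f} * X^{exp:.2f}")
```

Each factor now contributes l + 1 terms instead of two, which is the price paid for dropping the uniform term weight assumption.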


documents to the query are desired, it is          Then msim(q, D) can be estimated as
sufficient to examine the first k databases.         follows:
A necessary and sufficient condition for
the databases D1 , D2 , . . . , D M to be op-
timally ranked in the order [D1 , D2 , . . . ,      msim(q, D) = \max_{1 \le i \le k} \{ q_i * gidf_i * mnw_i
D M ] with respect to query q is                      + \sum_{j=1, j \ne i}^{k} q_j * gidf_j * anw_j \} / |q|.   (13)
msim(q, D1 ) > msim(q, D2 ) > · · · > msim
(q, D M ) [Yu et al. 1999b], where msim
(q, Di ) is the global similarity of the
most similar document in database Di
with the query q. Knowing an optimal
rank of the databases with respect to
query q, the database selector can se-                The intuition for having this estimate
lect the top-ranked databases to search            is that the most similar document in a
for q.                                             database is likely to have the maximum
   The challenge here is how to estimate           normalized weight of the ith query term,
msim(q, D) for query q and any database            for some i. This yields the first half of
D. One method is to utilize the Expres-            the above expression within the braces.
sion (10) for D. We can scan this expres-          For each of the other query terms, the
sion in descending order of the exponents          document takes the average normalized
until (\sum_{i=1}^{r} a_i) * N is approximately 1 for     is that the most similar document in a
some r, where N is the number of docu-             the maximum is taken over all i, since
ments in D. The exponent, br , is an esti-         the most similar document may have the
mate of msim(q, D) as the expected num-            maximum normalized weight of any one of
ber of documents in D with similarity              the k query terms. Normalization by the
greater than or equal to br is approxi-            query length, |q|, yields a value less than
mately 1. The drawback of this solution            or equal to 1. The underlying assumption
is that it requires a large database repre-        of Formula (13) is that terms in each query
sentative and the computation is of high           are independent. Dependencies between
complexity.                                        terms can be captured to a certain extent
   A more efficient method to estimate              by storing the same statistics (i.e. mnw’s
msim(q, D) is proposed in Yu et al. [1999b].       and anw’s) of phrases in the database rep-
In this method, there are two types of rep-        resentatives, i.e., treating each phrase as
resentatives. There is a global representa-        a term.
tive for all component databases. For each            In this method, each database is rep-
distinct term ti , the global inverse docu-        resented by two quantities per term plus
ment frequency weight (gidfi ) is stored in        the global representative shared by all
this representative. There is a local rep-         databases but the computation has linear
resentative for each component database            complexity.
D. For each distinct term ti in D, a pair of          The maximum normalized weight of
quantities (mnwi , anwi ) is stored, where         a term is typically two or more orders
mnwi and anwi are the maximum nor-                 of magnitude larger than the average
malized weight and the average normal-             normalized weight of the term as the
ized weight of term ti , respectively. Sup-        latter is computed over all documents,
pose d i is the weight of ti in a document         including those not containing the term.
d . Then the normalized weight of ti in            This observation implies that in Formula
d is d i /|d |, where |d | denotes the length      (13), if all query terms have the same
of d . The maximum normalized weight               tf weight (a reasonable assumption, as
and the average normalized weight of ti            in a typical query each term appears
in database D are, respectively, the max-          once), gidfi ∗ mnwi is likely to dominate
imum and the average of the normalized             \sum_{j=1, j \ne i}^{k} gidf_j * anw_j, especially when the
weights of ti in all documents in D. Sup-          number of terms, k, in a query is small
pose q = (q1 , . . . , qk ) is the query vector.   (which is typically true in the Internet


environment [Jansen et al. 1998; Kirsch              duce several learning-based database se-
1998]). In other words, the rank of                  lection methods.
database D with respect to a given
query q is largely determined by the                    5.3.1. MRDD Approach. The                 MRDD
value of max1≤i≤k {qi ∗ gidfi ∗ mnwi }. This         (Modeling Relevant Document Distribu-
leads to the following more scalable for-            tion) approach [Voorhees et al. 1995b] is
mula to estimate msim(q, D) [Wu et al.               a static learning approach. During learn-
2001]: max1≤i≤k {qi ∗ ami }/|q|, where ami =         ing, a set of training queries is utilized.
gidfi ∗ mnwi is the adjusted maximum                 Each training query is submitted to every
normalized weight of term ti in D. This              component database. From the returned
formula requires only one piece of infor-            documents from a database for a given
mation, namely ami , to be kept in the               query, all relevant documents are identi-
database representative for each distinct            fied and a vector reflecting the distribution
term in the database.                                of the relevant documents is obtained
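
The two estimates of msim(q, D) discussed in this subsection can be sketched as follows. The sketch assumes that |q| is the Euclidean length of the query vector and that terms missing from a representative contribute zero. All function and parameter names, and the sample statistics, are hypothetical.

```python
import math
from typing import Dict, Tuple


def query_length(query: Dict[str, float]) -> float:
    """|q|, taken here to be the Euclidean norm (an assumption)."""
    return math.sqrt(sum(w * w for w in query.values()))


def msim_estimate(query: Dict[str, float],
                  gidf: Dict[str, float],
                  stats: Dict[str, Tuple[float, float]]) -> float:
    """Formula (13): stats[t] = (mnw_t, anw_t) for one database."""
    q_len = query_length(query)
    if q_len == 0.0:
        return 0.0
    # Contribution of every term at its average normalized weight.
    avg_part = {t: w * gidf.get(t, 0.0) * stats.get(t, (0.0, 0.0))[1]
                for t, w in query.items()}
    avg_total = sum(avg_part.values())
    best = 0.0
    for t, w in query.items():
        mnw = stats.get(t, (0.0, 0.0))[0]
        # Term t at its maximum weight, all other terms at their average.
        best = max(best, w * gidf.get(t, 0.0) * mnw + avg_total - avg_part[t])
    return best / q_len


def msim_estimate_am(query: Dict[str, float], am: Dict[str, float]) -> float:
    """Simplified estimate: max_i {q_i * am_i} / |q|, am_i = gidf_i * mnw_i."""
    q_len = query_length(query)
    if q_len == 0.0:
        return 0.0
    return max(w * am.get(t, 0.0) for t, w in query.items()) / q_len


if __name__ == "__main__":
    q = {"a": 1.0, "b": 1.0}
    gidf = {"a": 2.0, "b": 1.0}
    stats = {"a": (0.5, 0.01), "b": (0.4, 0.02)}
    am = {t: gidf[t] * stats[t][0] for t in q}
    print(msim_estimate(q, gidf, stats), msim_estimate_am(q, am))
```

Computing the average-weight total once and swapping in one maximum weight at a time keeps the work linear in the number of query terms.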
                                                     and stored. Specifically, the vector has
                                                     the format <r1 , r2 , . . . , rs >, where ri is
5.3. Learning-Based Approaches                       a positive integer indicating that ri top-
                                                     ranked documents must be retrieved from
These approaches predict the usefulness              the database in order to obtain i relevant
of a database for new queries based on the           documents for the query. As an example,
retrieval experiences with the database              suppose for a training query q and a com-
from past queries. The retrieval experi-             ponent database D, 100 documents are
ences may be obtained in a number of                 retrieved in the order (d 1 , d 2 , . . . , d 100 ).
ways. First, training queries can be used            Among these documents, d 1 , d 4 , d 10 , d 17 ,
and the retrieval knowledge of each com-             and d 30 are identified to be relevant. Then
ponent database with respect to these                the corresponding distribution vector is
training queries can be obtained in ad-               <r1, r2, r3, r4, r5> = <1, 4, 10, 17, 30>.
vance (i.e., before the database selec-                 With the vectors for all training queries
tor is enabled). This type of approach               and all databases obtained, the database
will be called the static learning ap-               selector is ready to select databases for
proach as in such an approach, the re-               user queries. When a user query is re-
trieval knowledge, once learned, will not            ceived, it is compared against all training
be changed. The weakness of static learn-            queries and the k most similar training
ing is that it cannot adapt to the changes           queries are identified (k = 8 performed
of database contents and query pattern.              well as reported in [Voorhees et al.
Second, real user queries (in contrast to            1995b]). Next, for each database D, the av-
training queries) can be used and the                erage relevant document distribution vec-
retrieval knowledge can be accumulated               tor over the k vectors corresponding to the
gradually and be updated continuously.               k most similar training queries and D is
This type of approach will be referred to            obtained. Finally, the average distribution
as the dynamic learning approach. The                vectors are used to select the databases to
problem with dynamic learning is that                search and the documents to retrieve. The
it may take a while to obtain sufficient              selection tries to maximize the precision
knowledge useful to the database selector.           for each recall point.
Third, static learning and dynamic learn-
ing can be combined to form a combined-                 Example 1. Suppose for a given query
learning approach. In such an approach,              q, the following three average distribution
initial knowledge may be obtained from               vectors have been obtained for three com-
training queries but the knowledge is up-            ponent databases:
dated continuously based on real user
                                                     D1 : <1, 4, 6, 7, 10, 12, 17>
queries. Combined learning can overcome
the weaknesses of the other two learning             D2 : <3, 5, 7, 9, 15, 20>
approaches. In this subsection, we intro-            D3 : <2, 3, 6, 9, 11, 16>
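
One way to turn such distribution vectors into a retrieval plan is sketched below. The source states only that the selection tries to maximize precision at each recall point. The dynamic-programming formulation, which minimizes the total number of retrieved documents for a requested number of relevant ones, and all names are assumptions made for this illustration.

```python
from typing import Dict, List, Sequence, Tuple


def distribution_vector(relevant: Sequence[bool]) -> List[int]:
    """r_i is the rank at which the i-th relevant document appears."""
    return [rank for rank, is_rel in enumerate(relevant, start=1) if is_rel]


def allocate(vectors: Dict[str, List[float]], wanted: int) -> Dict[str, float]:
    """Choose how many top-ranked documents to fetch from each database.

    best[j] holds (total documents retrieved, plan) for obtaining j
    relevant documents from the databases processed so far.
    """
    best: Dict[int, Tuple[float, Dict[str, float]]] = {0: (0.0, {})}
    for name, vec in vectors.items():
        new_best = dict(best)
        for got, (cost, plan) in best.items():
            for i, cutoff in enumerate(vec, start=1):
                total = got + i
                if total > wanted:
                    break
                cand_cost = cost + cutoff
                if total not in new_best or cand_cost < new_best[total][0]:
                    new_best[total] = (cand_cost, {**plan, name: cutoff})
        best = new_best
    if wanted not in best:
        raise ValueError("the vectors cannot supply that many relevant documents")
    return best[wanted][1]


if __name__ == "__main__":
    vectors = {
        "D1": [1, 4, 6, 7, 10, 12, 17],
        "D2": [3, 5, 7, 9, 15, 20],
        "D3": [2, 3, 6, 9, 11, 16],
    }
    plan = allocate(vectors, wanted=3)
    print(plan, "precision:", 3 / sum(plan.values()))
```

For the three vectors of Example 1 and three wanted relevant documents, the plan takes the top document from D1 and the top three from D3, four documents in total.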


   Consider the case when three relevant           SavvySearch also tracks the recent per-
documents are to be retrieved. To maxi-         formance of each search engine in terms
mize the precision (i.e., to reduce the re-     of h, the average number of documents re-
trieval of irrelevant documents), one doc-      turned for the most recent five queries,
ument should be retrieved from D1 and           and r, the average response time for the
three documents should be retrieved from        most recent five queries sent to the com-
D3 (two of the three are supposed to be rel-    ponent search engine. If h is below a
evant). In other words, databases D1 and        threshold Th (the default is 1), then a
D3 should be selected. This selection yields    penalty ph = (Th − h)^2 / Th^2 for the search engine
a precision of 0.75 as three out of the four    is computed. Similarly, if the average re-
retrieved documents are relevant.               sponse time r is greater than a thresh-
                                                old Tr (the default is 15 seconds), then
   In the MRDD approach, the represen-
                                                a penalty pr = (r − Tr)^2 / (ro − Tr)^2 is computed, where
tative of a component database is the
set of distribution vectors for all training    ro = 45 (seconds) is the maximum allowed
queries. The main weakness of this ap-          response time before a timeout.
proach is that the learning has to be car-         For a new query q with terms t1 , . . . , tk ,
ried out manually for each training query.      the ranking score of database D is com-
In addition, it may be difficult to iden-        puted by
tify appropriate training queries and the
learned knowledge may become less accu-         r(q, D) = \frac{\sum_{i=1}^{k} w_{t_i} \cdot \log(N / f_i)}{\sum_{i=1}^{k} |w_i|}
rate when the contents of the component                     - (p_h + p_r),
databases change.
  5.3.2.   SavvySearch   Approach.   Savvy-
Search (www.search.com) is a metasearch         where log(N / f i ) is the inverse database
engine employing the dynamic learning           frequency weight of term ti , N is the num-
approach. In SavvySearch [Dreilinger and        ber of databases, and f i is the number of
Howe 1997], the ranking score of a compo-       databases having a positive weight value
nent search engine with respect to a query      for term ti .
is computed based on the past retrieval            The overhead of storing the represen-
experience of using the terms in the query.     tative information for each local search
More specifically, for each search engine,       engine in SavvySearch is moderate. (Es-
a weight vector (w1 , . . . , wm ) is main-     sentially there is just one piece of informa-
tained by the database selector, where          tion for each term, i.e., the weight. Only
each wi corresponds to the ith term in the      terms that have been used in previous
database of the search engine. Initially, all   queries need to be considered.) Moderate
weights are zero. When a query contain-         effort is needed to maintain the informa-
ing term ti is used to retrieve documents       tion. One weakness of SavvySearch is that
from a component database D, the weight         it will not work well for new query terms
wi is adjusted according to the retrieval       or query terms that have been used only
result. If no document is returned by the       very few times. In addition, the user feed-
search engine, the weight is reduced by         back process employed by SavvySearch
1/k, where k is the number of terms in          is not rigorous and could easily lead to
the query. On the other hand, if at least       the mis-identification of useful databases.
one returned document is read/clicked           Search engine users may have the ten-
by the user (no relevance judgment is           dency to check out top-ranked documents
needed from the user), then the weight          for their queries regardless of whether
is increased by 1/k. Intuitively, a large       or not these documents are actually use-
positive wi indicates that the database         ful. This means that term weights in
D responded well to term ti in the past         the database representative can easily be
and a large negative wi indicates that D        modified in a way not consistent with the
responded poorly to ti .                        meaning of the weights. As a result, it is
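
The bookkeeping described for SavvySearch can be sketched as below. The sketch reads the two penalties as ph = (Th − h)^2 / Th^2 and pr = (r − Tr)^2 / (ro − Tr)^2, and the ranking score as the idf-weighted sum of the query-term weights divided by the sum of their absolute values, minus the penalties. These readings, the decision to leave weights unchanged when documents are returned but none is clicked, and all class and function names are assumptions for illustration, not SavvySearch's actual code.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence


class EngineProfile:
    """Per-engine term weights kept by the database selector."""

    def __init__(self) -> None:
        self.weights: Dict[str, float] = defaultdict(float)  # start at zero

    def record(self, query_terms: Sequence[str], returned_any: bool, clicked_any: bool) -> None:
        k = len(query_terms)
        if k == 0:
            return
        if not returned_any:
            delta = -1.0 / k          # engine returned nothing
        elif clicked_any:
            delta = 1.0 / k           # user read/clicked a result
        else:
            delta = 0.0               # assumed: no change otherwise
        for t in query_terms:
            self.weights[t] += delta


def hit_penalty(h: float, t_h: float = 1.0) -> float:
    return ((t_h - h) / t_h) ** 2 if h < t_h else 0.0


def response_penalty(r: float, t_r: float = 15.0, r_o: float = 45.0) -> float:
    return ((r - t_r) / (r_o - t_r)) ** 2 if r > t_r else 0.0


def ranking_score(query_terms: Sequence[str], profile: EngineProfile,
                  all_profiles: List[EngineProfile],
                  p_h: float = 0.0, p_r: float = 0.0) -> float:
    n = len(all_profiles)
    numerator = denominator = 0.0
    for t in query_terms:
        w = profile.weights.get(t, 0.0)
        f = sum(1 for p in all_profiles if p.weights.get(t, 0.0) > 0)
        if f > 0:
            numerator += w * math.log(n / f)   # inverse database frequency
        denominator += abs(w)
    base = numerator / denominator if denominator > 0 else 0.0
    return base - (p_h + p_r)


if __name__ == "__main__":
    a, b = EngineProfile(), EngineProfile()
    q = ["jazz", "guitar"]
    a.record(q, returned_any=True, clicked_any=True)
    b.record(q, returned_any=False, clicked_any=False)
    print(ranking_score(q, a, [a, b]), ranking_score(q, b, [a, b]))
```

A term that no engine has a positive weight for contributes nothing to the score, which mirrors the weakness noted for new or rarely used query terms.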


possible that the weight of a term for a             in q belongs to the set of terms associ-
database does not sufficiently reflect how             ated with C. Now the databases will be
well the database will respond to the term.          ranked based on the sum of the confidence
                                                     factors of each database with respect to
  5.3.3. ProFusion    Approach. ProFusion            the mapped categories. Let this sum of the
(www.profusion.com) is a metasearch                  confidence factors of a database with re-
engine employing the combined learning               spect to q be called the ranking score of
approach. In ProFusion [Fan and Gauch                the database for q. In ProFusion, the three
1999; Gauch et al. 1996], 13 preset cate-            databases with the largest ranking scores
gories are utilized in the learning process.         are selected to search for a given query.
The 13 categories are “Science and Engi-               In ProFusion, documents retrieved from
neering,” “Computer Science,” “Travel,”              selected search engines are ranked based
“Medical and Biotechnology,” “Business               on the product of the local similarity of
and Finance,” “Social and Religion,”                 a document and the ranking score of the
“Society, Law and Government,” “Animals              database. Let d in database D be the
and Environment,” “History,” “Recreation             first document read/clicked by the user. If
and Entertainment,” “Art,” “Music,” and              d is not the top-ranked document, then
“Food.” A set of terms is associated with            the ranking score of D should be in-
each category to reflect the topic of the             creased while the ranking scores of those
category. For each category, a set of                databases whose documents are ranked
training queries is identified. The reason            higher than d should be reduced. This is
for using these categories and dedicated             carried out by proportionally adjusting the
training queries is to learn how well                confidence factors of D in mapped cate-
each component database will respond                 gories. For example, suppose for a query
to queries in different categories. For a            q and a database D, two categories C1
given category C and a given component               and C2 are selected and the correspond-
database D, each associated training                 ing confidence factors are 0.6 and 0.4, re-
query is submitted to D. From the top 10             spectively. To increase the ranking score of
retrieved documents, relevant documents              database D by x, the confidence factors of
are identified. Then a score reflecting                D in C1 and C2 are increased by 0.6x and
the performance of D with respect to the             0.4x, respectively. This ranking score ad-
query and the category C is computed
by c * (\sum_{i=1}^{10} N_i / 10) * (R / 10), where c is a constant;      justment policy tends to move d higher in
                                                     the rank if the same query is processed in
Ni is set to 1/i if the ith-ranked doc-              the future. The rationale behind this pol-
ument is relevant and Ni is set to 0 if              icy is that if the ranking scores were per-
the document is not relevant; R is the               fect, then the top-ranked document would
number of relevant documents in the                  be the first to be read by the user.
10 retrieved documents. It can be seen                 ProFusion combines static learning and
that this formula captures both the rank             dynamic learning, and as a result, over-
order of each relevant document and the              comes some problems associated with em-
precision of the top 10 retrieved docu-              ploying static learning or dynamic learn-
ments. Finally, the scores of all training           ing alone. ProFusion has the following
queries associated with the category C are           shortcomings. First, the static learning
averaged for database D and this average             part is still done mostly manually, i.e.,
is the confidence factor of the database              selecting training queries and identifying
with respect to the category. At the end of          relevant documents are carried out manu-
the training, there is a confidence factor            ally. Second, the higher-ranked documents
for each database with respect to each of            from the same database as the first clicked
the 13 categories.                                   document will remain as higher-ranked
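
A minimal sketch of the ProFusion bookkeeping, assuming the training score is c * (sum of N_i over the top 10, divided by 10) * (R / 10) and that a ranking-score change x is split across the mapped categories in proportion to their current confidence factors. Function names and sample values are hypothetical.

```python
from typing import Dict, Iterable, Sequence


def training_score(relevant_top10: Sequence[bool], c: float = 1.0) -> float:
    """Score of one training query from relevance flags of the top 10 results."""
    if len(relevant_top10) != 10:
        raise ValueError("exactly 10 relevance flags are expected")
    # N_i = 1/i for a relevant document at rank i, 0 otherwise.
    rank_part = sum(1.0 / i for i, rel in enumerate(relevant_top10, start=1) if rel) / 10
    precision = sum(relevant_top10) / 10          # R / 10
    return c * rank_part * precision


def confidence_factor(scores: Sequence[float]) -> float:
    """Average score over the training queries of one category."""
    return sum(scores) / len(scores)


def ranking_score(conf: Dict[str, float], categories: Iterable[str]) -> float:
    """Sum of the confidence factors over the categories the query maps to."""
    return sum(conf.get(cat, 0.0) for cat in categories)


def adjust(conf: Dict[str, float], categories: Sequence[str], x: float) -> Dict[str, float]:
    """Change the ranking score by x, proportionally per mapped category."""
    total = ranking_score(conf, categories)
    updated = dict(conf)
    if total == 0.0:
        return updated
    for cat in categories:
        updated[cat] = conf.get(cat, 0.0) + x * conf.get(cat, 0.0) / total
    return updated


if __name__ == "__main__":
    print(training_score([True, True] + [False] * 8))
    print(adjust({"C1": 0.6, "C2": 0.4}, ["C1", "C2"], x=0.1))
```

With confidence factors 0.6 and 0.4, a change of x raises them by 0.6x and 0.4x, matching the example in the text. A negative x lowers the ranking score in the same proportional way.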
  When a user query q is received by the             documents after the adjustment of con-
metasearch engine, q is first mapped to               fidence factors although they are of no
one or more categories. The query q is               interest to the user. This is a situation
mapped to a category C if at least one term          where the learning strategy does not help


retrieve better documents for a repeating       retrieve all or as many as possible poten-
query. Third, the employed dynamic learn-       tially useful documents from each compo-
ing method seems to be too simplistic. For      nent database while minimizing the re-
example, very little user feedback infor-       trieval of useless documents. We classify
mation is used and the tendency of users        the proposed approaches for the document
to select the highest-ranked document re-       selection problem into the following four
gardless of the relevance of the document       categories:
is not taken into consideration. One way
                                                User determination: The metasearch
to alleviate this problem is to use the
                                                  engine lets the global user determine
first clicked document that was read for a
                                                  how many documents to retrieve from
“significant” amount of time.
                                                  each component database.
                                                Weighted allocation: The number of
6. DOCUMENT SELECTION                             documents to retrieve from a compo-
                                                  nent database depends on the rank-
After the database selector has chosen the        ing score (or the rank) of the compo-
component databases for a given query,            nent database relative to the ranking
the next task is to determine what doc-           scores (or ranks) of other component
uments to retrieve from each selected             databases. As a result, proportionally
database. A naive approach is to let each         more documents are retrieved from
selected component search engine return           component databases that are ranked
all documents that are retrieved from the         higher or have higher ranking scores.
search engine. The problem with this ap-        Learning-based approaches: These
proach is that too many documents may             approaches determine the number of
be retrieved from the component systems           documents to retrieve from a compo-
unnecessarily. As a result, this approach         nent database based on past retrie-
will not only lead to higher communication        val experiences with the component
cost but also require more effort from the        database.
result merger to identify the best matched
                                                Guaranteed retrieval: This type of ap-
documents. This naive approach will not
                                                  proach aims at guaranteeing the re-
be further discussed in this section.
                                                  trieval of all potentially useful docu-
   As noted previously, a component search
                                                  ments with respect to any given query.
engine typically retrieves documents in
descending order of local similarities. Con-      In the following subsections, we survey
sequently, the problem of selecting what        and discuss approaches from each of the
documents to retrieve from a component          categories.
database can be translated into one of the
following two problems:
                                                6.1. User Determination
(1) Determine the number of documents to
    retrieve from the component database.       In MetaCrawler [Selberg and Etzioni
    If k documents are to be retrieved from     1995; 1997] and SavvySearch [Dreilinger
    a component database, then the k doc-       and Howe 1997], the maximum number of
    uments with the largest local similari-     documents to be returned from each com-
    ties will be retrieved.                     ponent database can be customized by the
(2) Determine a local threshold for the         user. Different numbers can be used for
    component database such that a doc-         different queries. If a user does not select
    ument from the component database           a number, then a query-independent de-
    is retrieved only if its local similarity   fault number set by the metasearch engine
    with the query exceeds the threshold.       will be used. This approach may be reason-
                                                able if the number of component databases
   Both problems have been tackled in ex-       is small and the user is reasonably fa-
isting or proposed metasearch engines.          miliar with all of them. In this case, the
For either problem, the goal is always to       user can choose an appropriate number of


documents to retrieve for each component             be retrieved from the ith ranked compo-
database and can afford to do so.                    nent database, i = 1, . . . , N (note that
   If the number of component databases              \sum_{i=1}^{N} 2(1+N-i) / (N(N+1)) = 1). In CORI Net, m could
is large, then this method has a serious             be chosen to be larger than the number of
problem. In this case, it is likely that the         desired documents specified by the global
user will not be capable of selecting an             user in order to reduce the likelihood of
appropriate number for each component                missing useful documents.
database. Consequently, the user will be               As a special case of the weighted allo-
forced to choose one number and apply                cation approach, if the ranking score of
that number to all selected component                a component database is the estimated
databases. As the numbers of useful docu-            number of potentially useful documents in
ments in different databases with respect            the database, then the ranking score of a
to a given query are likely to be different,         component database can be used as the
this method may retrieve too many use-               number of documents to retrieve from the
less documents from some component sys-              database.
tems on the one hand while retrieving too              Weighted Allocation is a reasonably
few useful documents from other compo-               flexible and easy-to-implement approach
nent systems on the other hand. If m doc-            based on good intuition (i.e., retrieve more
uments are to be retrieved from N selected           documents from more highly ranked local
databases, the number of documents to re-            databases).
trieve from each database may be set to be
m/N or slightly higher.
                                                     6.3. Learning-Based Approaches
                                                     It is possible to learn how many doc-
6.2. Weighted Allocation                             uments to retrieve from a component
                                                     database for a given query from past
For a given query, each component                    retrieval experiences for similar queries.
database has a rank (i.e., 1st, 2nd, . . .)          The following are two learning-based app-
and a ranking score as determined by                 roaches [Towell et al. 1995; Voorhees et al.
the database selection algorithm. Both the           1995a; Voorhees et al. 1995b; Voorhees
rank information and the ranking score               1996; Voorhees and Tong 1997].
information can be used to determine the                In Section 5.3, we introduced a learning-
number of documents to retrieve from dif-            based method, namely MRDD (Model-
ferent component systems. In principle,              ing Relevant Document Distribution), for
weighted allocation approaches attempt to            database selection. In fact, this method
retrieve more documents from component               combines the selection of databases and
search engines that are ranked higher (or            the determination of what documents to
have larger ranking scores).                         retrieve from databases. For a given query
   In D-WISE [Yuwono and Lee 1997], the              q, after the average distribution vectors
ranking score information is used. For a             have been obtained for all databases, the
given query q, let ri be the ranking score           decision on what documents to retrieve
of component database Di , i = 1, . . . , N ,        from these databases is made to maxi-
where N is the number of selected compo-             mize the overall precision. In Example 1,
nent databases for the query. Suppose m              when three relevant documents are de-
documents across all selected component              sired from the given three databases, this
databases are desired. Then the number               method retrieves the top one document
of documents to retrieve from database Di            from database D1 and the top three doc-
is $m \cdot r_i / \sum_{j=1}^{N} r_j$.                    uments from D3 .
   In CORI Net [Callan et al. 1995],                    The second method, QC (Query Clus-
the rank information is used. Specifi-                tering), also performs document selec-
cally, if a total number of m documents              tion based on past retrieval experiences.
are to be retrieved from N component                 Again, a set of training queries is utilized.
databases, then $m \cdot \frac{2(1+N-i)}{N(N+1)}$ documents will
                                                     In the training phase, for each component

ACM Computing Surveys, Vol. 34, No. 1, March 2002.

database, the training queries are grouped     sume a lot of resources. Third, it is too time
into a number of clusters. Two queries are     consuming for users to identify relevant
placed in the same cluster if the num-         documents for a wide variety of training
ber of common documents retrieved by           queries.
the two queries is large. Next, the cen-
troid of each query cluster is computed
by averaging the vectors of the queries in     6.4. Guaranteed Retrieval
the cluster. Furthermore, for each compo-      Since the similarity function used in a
nent database, a weight is computed for        component database may be different from
each cluster based on the average num-         that used in the metasearch engine, it
ber of relevant documents among the top        is possible for a document with low local
T retrieved documents (T = 8 performed         similarity to have a high global similar-
well as reported in [Voorhees et al. 1995b])   ity, and vice versa. In fact, even when
for each query in the query cluster. For       the global and local similarity functions
a given database, the weight of a clus-        are identical, this scenario regarding local
ter indicates how well the database re-        and global similarities may still occur
sponds to queries in the cluster. When a       due to the use of some database-specific
user query is received, for each component     statistical information in these functions.
database, the query cluster whose centroid     For example, the document frequency of
is most similar to the query is selected.      a term in a component system is prob-
Then the weights associated with all se-       ably very different from that across all
lected query clusters across all databases     systems (i.e., the global document fre-
are used to determine the number of doc-       quency). Consequently, if a component sys-
uments to retrieve from each database.         tem only returns documents with high
Suppose wi is the weight associated with       local similarities, globally potentially use-
the selected query cluster for component       ful documents that are determined based
database Di and m is the total number of       on global similarities from the compo-
documents desired. Then the number of          nent database may be missed. The guar-
documents to retrieve from database Di         anteed retrieval approach tries to ensure
is $m \cdot w_i / \sum_{j=1}^{N} w_j$, where N is the num-      that all globally potentially useful docu-
ber of component databases. It can be seen     ments would be retrieved even when the
that this method is essentially a weighted     global and local document similarities do
allocation method and the weight of a          not match. Note that none of the ap-
database for a given query is the learned      proaches in earlier subsections belongs to
weight of the selected query cluster for the   the guaranteed retrieval category because
database.                                      they do not take global similarities into
   For user queries that have very simi-       consideration.
lar training queries, the above approaches        Many applications, especially those in
may produce very good results. However,        medical and legal fields, often desire to
these approaches also have serious weak-       retrieve all documents (cases) that are
nesses that may prevent them from be-          similar to a given query (case). For these
ing used widely. First, they may not be        applications, the guaranteed retrieval ap-
suitable in environments where new com-        proaches that can minimize the retrieval
ponent search engines may be frequently        of useless documents would be appropri-
added to the metasearch engine because         ate. In this subsection, we introduce some
new training needs to be conducted when-       proposed techniques in the guaranteed re-
ever a new search engine is added. Sec-        trieval category.
ond, it may not be easy to determine what
training queries are appropriate to use.         6.4.1. Query Modification. Under certain
On the one hand, we would like to have         conditions, a global query can be modi-
some similar training queries for each po-     fied before it is submitted to a component
tential user query. On the other hand, hav-    database to yield the global similarities
ing too many training queries would con-       for returned documents. This technique


is called query modification [Meng et al.                    In order to trick the component system
1998]. It is essentially a query transla-                   D into computing the global similarity for
tion method for vector queries. Clearly, if                 d , the following procedure is used. When
a component system can be tricked into                      query q = (q1 , . . . , qr ) is received by the
returning documents in descending order                     metasearch engine, it is first modified to
of global similarities, guaranteeing the re-                $q^* = (q_1 \cdot (l'_1/l_1), \ldots, q_r \cdot (l'_r/l_r))$. Then the
trieval of globally most similar documents                  modified query $q^*$ is sent to the compo-
becomes trivial.                                            nent database D for evaluation. Accord-
   Let D be a component database. Con-                      ing to (15), after D receives $q^*$, it further
sider the case when both the local and the                  modifies $q^*$ to $(q_1 \cdot (l'_1/l_1) \cdot l_1, \ldots, q_r \cdot (l'_r/l_r) \cdot l_r)$
global similarity functions are the cosine                  $= (q_1 \cdot l'_1, \ldots, q_r \cdot l'_r) = q''$.
function [Salton and McGill 1983]. Note                     Finally, $q''$ is evaluated by D to compute
that although the same similarity function                  the global similarity of d with q.
is used globally and locally, the same doc-                     Unfortunately, query modification is not
ument may still have different global and                   a technique that can work for any combi-
local similarities due to the use of different              nations of local and global similarity func-
local and global document frequencies of                    tions. In general, we still need to deal with
terms. Let d = (w1 , . . . , wr ) be the weight             the situations when documents have dif-
vector of a document in D. Suppose each                     ferent local and global similarities. Fur-
wi is computed using only information in                    thermore, this approach requires knowl-
d (such as term frequency) while a query                    edge of the similarity function and the
may use both the term frequency and the                     term weighting formula used in a compo-
inverse document frequency information.                     nent system. The information is likely to
The idf information for each term in D                      be proprietary and may not be easily avail-
is incorporated into the similarity compu-                  able. A study of discovering such informa-
tation by modifying each query before it                    tion based on sampling queries is reported
is processed [Buckley et al. 1993]. Con-                    in Liu et al. [2000].
sider a user query q = (q1 , . . . , qr ), where
q j is the weight of term t j in the query,                    6.4.2. Computing the Tightest Local Thresh-
 j = 1, . . . , r. It is assumed that q j is ei-            old. For a given query q, suppose the
ther assigned by the user or computed us-                   metasearch engine sets a threshold T
ing the term frequency of t j in the query.                 and uses a global similarity function G
When the component system receives the                      such that any document d that satisfies
query q, it first incorporates the local idf                 G(q, d ) > T is to be retrieved (i.e., the doc-
weight of each query term by modifying                      ument is considered to be potentially use-
query q to                                                  ful). The problem is to determine a proper
                                                            threshold $T'$ for each selected component
                                                            database D such that all potentially useful
            $q' = (q_1 \cdot l_1, \ldots, q_r \cdot l_r)$       (15)
                                                            documents that exist in D can be re-
                                                            trieved using its local similarity function
and then evaluates the modified query,                      L. That is, if G(q, d ) > T , then L(q, d ) >
where $l_j$ is the local idf weight of term                 $T'$ for any document d in D. Note that in
$t_j$ in component system D, j = 1, . . . , r.              order to guarantee that all potentially use-
As a result, when the cosine function is                    ful documents will be retrieved from D,
used, the local similarity of d with q in                   some unwanted documents from D may
D can be computed to be                                     also have to be retrieved. The challenge
$sim_D(q, d) = (\sum_{j=1}^{r} q_j \cdot l_j \cdot w_j)/(|q'| \cdot |d|)$, where $|q'|$ and       is to minimize the number of documents
|d| are the lengths of $q'$ and d, respectively.            to retrieve from D while still guaranteeing
     Let $l'_j$ be the global idf weight of term            that all potentially useful documents from
$t_j$. Then, when the cosine function is                    D will be retrieved. In other words, it is de-
used, the global similarity of d with q should be           sirable to determine the tightest (largest)
$sim_G(q, d) = (\sum_{j=1}^{r} q_j \cdot l'_j \cdot w_j)/(|q''| \cdot |d|)$,       local threshold $T'$ such that if G(q, d ) > T ,
where $q'' = (q_1 \cdot l'_1, \ldots, q_r \cdot l'_r)$.       then L(q, d ) > $T'$.
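The query modification technique of Section 6.4.1 can be checked numerically with a short Python sketch. It assumes, as in the text, that the cosine function is used both locally and globally and that document weights do not depend on idf information; it additionally assumes every local idf weight is nonzero. All function and variable names are hypothetical.

```python
import math


def cosine(query_weights, doc_weights):
    """Cosine similarity between a query vector and a document vector."""
    dot = sum(q * w for q, w in zip(query_weights, doc_weights))
    q_len = math.sqrt(sum(q * q for q in query_weights))
    d_len = math.sqrt(sum(w * w for w in doc_weights))
    return dot / (q_len * d_len)


def component_similarity(query, doc, local_idf):
    """What component database D does: scale the query by local idf, then take the cosine."""
    return cosine([q * l for q, l in zip(query, local_idf)], doc)


def global_similarity(query, doc, global_idf):
    """The similarity the metasearch engine wants, based on global idf weights."""
    return cosine([q * g for q, g in zip(query, global_idf)], doc)


def modify_query(query, local_idf, global_idf):
    """Build q* by multiplying each query weight by (global idf / local idf)."""
    return [q * g / l for q, l, g in zip(query, local_idf, global_idf)]


if __name__ == "__main__":
    query = [1.0, 2.0, 1.0]
    doc = [0.5, 0.1, 0.9]
    local_idf = [1.2, 3.0, 0.7]
    global_idf = [2.5, 1.1, 1.9]
    q_star = modify_query(query, local_idf, global_idf)
    print(component_similarity(q_star, doc, local_idf))
    print(global_similarity(query, doc, global_idf))
```

The two printed values agree up to floating-point rounding: the local idf factor applied by D cancels against the divisor placed in the modified query, so D ends up evaluating the globally weighted query.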


   In Gravano and Garcia-Molina [1997],           for a global threshold T , the tightest local
it is shown that if (1) the similarities          threshold L(T ) is then $T \cdot r^{(1/p - 1)}$.
computed by G and L are between 0 and
1, and (2) G and L are related by the in-            While this method may provide the
equality $G(q, d) - \epsilon \le L(q, d)$, where $\epsilon$      tightest local threshold for certain combi-
is a constant satisfying $0 \le \epsilon < 1$, then       nations of local and global similarity func-
a local threshold $T'$ can be determined.         tions, it has two weaknesses. First, a sep-
However, the local threshold determined           arate solution needs to be found for each
using the method in Gravano and Garcia-           different pair of similarity functions and it
Molina [1997] is often not tight.                 is not clear whether a solution can always
   In Meng et al. [1998], several tech-           be found. Second, it is required that the
niques were proposed to find the tightest          local similarity function be known.
local threshold for some popular similarity
function pairs. For a given global similar-
                                                  7. RESULT MERGING
ity threshold T , let L(T ) denote the tight-
est local threshold for a given component         To provide local system transparency to
database D. Then one way to determine             the global users, the results returned from
L(T ) is as follows:                              component search engines should be com-
                                                  bined into a single result. Ideally, doc-
(1) Find the function f (t), the minimum          uments in the merged result should be
    of the local similarity function L(q, d ),    ranked in descending order of global simi-
    over all documents d in D, subject to         larities. However, such an ideal merge is
    t = G(q, d ). In this step, t is fixed and     very hard to achieve due to the various
    d varies over all possible documents          heterogeneities among the component sys-
    in D.                                         tems. Usually, documents returned from
(2) Minimize f (t) in the range t ≥ T . This      each component search engine are ranked
    minimum of f (t) is the desired L(T ).        based on these documents’ local ranking
                                                  scores or similarities. Some component
    Let {ti } be the set of terms in the query    search engines make the local similari-
q. If both L(q, d ) and G(q, d ) are differen-    ties of returned documents available to
tiable with respect to the weight wi of each      the user while other search engines do not
term ti of document d , then finding f (t) in      make them available. For example, Google
the above Step 1 can generally be achieved        and AltaVista do not provide local similar-
using the method of Lagrange in calculus          ities while Northern Light and FirstGov
[Widder 1989]. Once f (t) is found, its min-      do. Local similarities returned from dif-
imum value in the range t ≥ T can usu-            ferent component search engines, even
ally be computed easily. In particular, if        when made available, may be incompa-
 f (t) is nondecreasing, then L(T ) is simply     rable due to the heterogeneities among
 f (T ). The example below illustrates this       these search engines. Furthermore, the
method.                                           local similarities and the global similari-
                                                  ties of the same document may be quite
   Example 2. Let d = (w1 , . . . , wr ) be       different.
a document and q = (u1 , . . . , ur ) be a           The challenge here is to merge the doc-
query. Let the global similarity function         uments returned from different search
$G(q, d) = \sum_{i=1}^{r} u_i \cdot w_i$ and the local sim-      engines into a single ranked list in a rea-
ilarity function $L(q, d) = (\sum_{i=1}^{r} u_i^p w_i^p)^{1/p}$      sonable manner in the absence of local
(known as p-norm in Salton and McGill             similarities and/or in the presence of in-
[1983]), p ≥ 1.                                   comparable similarities. A further compli-
   Step 1 is to find f (t), which requires         cation to the problem is that some doc-
us to minimize $(\sum_{i=1}^{r} u_i^p w_i^p)^{1/p}$ subject to      uments may be returned from multiple
$\sum_{i=1}^{r} u_i \cdot w_i = t$. Using the Lagrange      component search engines. The question
method, f (t) is found to be $t \cdot r^{(1/p - 1)}$. As      is whether and how this should affect the
this function is an increasing function of t,     ranking of these documents.
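The result of Example 2 can be sanity-checked numerically. The Python sketch below is only an illustration: the helper names are hypothetical, query weights are assumed to be positive, and document weights are assumed to be nonnegative. It compares the closed form f(t) = t · r^(1/p − 1) with the p-norm similarity of randomly generated documents that all satisfy G(q, d) = t.

```python
import math
import random


def global_sim(u, w):
    """G(q, d): inner product of query weights u and document weights w."""
    return sum(ui * wi for ui, wi in zip(u, w))


def local_sim(u, w, p):
    """L(q, d): p-norm of the term-wise products u_i * w_i."""
    return sum((ui * wi) ** p for ui, wi in zip(u, w)) ** (1.0 / p)


def f(t, r, p):
    """Closed-form minimum of L(q, d) subject to G(q, d) = t, for r query terms."""
    return t * r ** (1.0 / p - 1.0)


def tightest_local_threshold(T, r, p):
    """f is increasing in t, so its minimum over t >= T is f(T)."""
    return f(T, r, p)


def random_document(u, t, rng):
    """Random nonnegative document whose global similarity with q equals t."""
    x = [rng.random() + 1e-9 for _ in u]
    scale = t / sum(x)
    return [xi * scale / ui for xi, ui in zip(x, u)]


if __name__ == "__main__":
    rng = random.Random(0)
    u, t, p = [0.5, 1.0, 2.0, 0.25], 0.8, 2
    lowest = min(local_sim(u, random_document(u, t, rng), p) for _ in range(1000))
    print(lowest, f(t, len(u), p))
```

With r = 4 query terms and p = 2, a global threshold of 0.8 maps to a local threshold of 0.4. The minimum is reached by the document whose products u_i · w_i are all equal to t/r, and no sampled document falls below it.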


  Existing result merging approaches can             employed ranking technique. A number of
be classified into the following two types:           functions have been proposed to combine
                                                     individual ranking scores of the same
Local similarity adjustment: This                    document, including min, max, average,
   type of approach adjusts local simi-              sum, weighted average, and other linear
   larities using additional information             combination functions [Cottrell and
   such as the quality of component                  Belew 1994; Fox and Shaw 1994; Lee
   databases. A variation is to convert              1997; Vogt and Cottrell 1999]. One of the
   local document ranks to similarities.             most effective functions for data fusion
Global similarity estimation: This                   is known as CombMNZ, which, for each
   type of approach attempts to com-                 document, sums individual scores and
   pute or estimate the true global sim-             then multiplies the sum by the number of
   ilarities of the returned documents.              nonzero scores [Lee 1997]. This function
                                                     emphasizes those documents that are
   The first type is usually easier to imple-
                                                     ranked high by multiple systems. More
ment but the merged ranking may be in-
                                                     data fusion techniques are surveyed in
accurate as the merge is not based on the
                                                     Croft [2000].
true global similarities of returned docu-
                                                        We now consider more likely scenarios
ments. The second type is more rigorous
                                                     in a metasearch engine context, namely
and has the potential to achieve the ideal
                                                     the selected databases are not identical.
merging. However, it typically needs more
                                                     We first consider the case where the se-
information from local systems. The two
                                                     lected databases are disjoint. In this case,
types of approaches are discussed in the
                                                     all returned documents will be unique. Let
following subsections.
                                                     us first assume that all returned docu-
                                                     ments have local similarities attached. It
7.1. Local Similarity Adjustment                     is possible that different search engines
Three cases can be identified depending on            normalize their local similarities in dif-
the degree of overlap among the selected             ferent ranges. For example, one search
databases for a given query.                         engine may normalize its similarities be-
                                                     tween 0 and 1 and another search engine
Case 1: These databases are pairwise dis-            between 0 and 1,000. In this case, all local
  joint or nearly disjoint. This occurs              similarities should be renormalized based
  when disjoint special-purpose search               on a common range, say [0, 1], to improve
  engines or those with minimal overlap              the comparability of these local similari-
  are selected.                                      ties [Dreilinger and Howe 1997; Selberg
Case 2: The selected databases overlap               and Etzioni 1997]. In the following, we as-
  but are not identical. An example of               sume that all local similarities have been
  this situation is when several general-            normalized based on a common range.
  purpose search engines are selected.                  When database selection is performed
Case 3: These databases are identical.               for a given query, the usefulness or quality
                                                     of each database is estimated and is rep-
  Case 3 usually does not occur in a                 resented as a score. The database scores
metasearch engine environment. Instead,              can be used to adjust the local similarities.
it occurs when multiple ranking tech-                The idea is to give preference to documents
niques are applied to the same collection            from highly ranked databases. In CORI
of documents in order to improve the                 Net [Callan et al. 1995], the adjustment
retrieval effectiveness. The result merg-            works as follows. Let s be the ranking
ing problem in this case is also known               score of component database D and $\bar{s}$ be
as data fusion [Vogt and Cottrell 1999].             the average of the scores of all databases
Data fusion has been studied extensively             searched. Then the following weight is as-
in the last decade. One special property             signed to D: $w = 1 + N \cdot (s - \bar{s})/\bar{s}$, where N is
of the data fusion problem is that every             the number of component databases
document will be ranked or scored by each            searched for the given query. Clearly, if


$s > \bar{s}$, then w will be greater than 1. Fur-      ment in the merged list will be the
thermore, the larger the difference is, the        second-highest-ranked document in
larger the weight will be. On the other            the highest-ranked database and the
hand, if $s < \bar{s}$, then w will be smaller than      process continues until the desired
1. Moreover, the larger the difference is,         number of documents are included in
the smaller the weight will be. Let x be the       the merged list. One weakness of this
local similarity of document d from D.             solution is that it does not take into
Then the adjusted similarity of d is com-          consideration the differences between
puted by w · x. The result merger lists re-        the database scores (i.e., only the
turned documents in descending order of            order information is utilized).
adjusted similarities. Based on the way               A randomized version of the above
the weight of a database is computed,              method is proposed in Voorhees et al.
it is clear that documents from higher-            [1995b]. Recall that in the MRDD
ranked databases have a better chance to           database selection method, we first
be ranked higher in the merged result.             determine how many documents to re-
   A similar method is used in ProFusion           trieve from each component database
[Gauch et al. 1996]. For a given query,            for a given query to maximize the pre-
a ranking score is calculated for each             cision of the retrieval. Suppose the de-
database (see the discussion on ProFusion          sired number of documents have been
in Section 5.3.3). The adjusted similarity         retrieved from each selected com-
of a document d from a database D is the           ponent database and N local docu-
product of the local similarity of d and the       ment lists have been obtained, where
ranking score of D.                                N is the number of selected compo-
   Now let us consider the situation where         nent databases. Let Li be the local
the local similarities of the returned doc-        document list for database Di . To
uments from some component search en-              select the next document to be placed
gines are not available. In this case, one         in the merged list, the rolling of a die is
of the following two approaches could be           simulated. The die has N faces corre-
applied to tackle the merging problem.             sponding to the N local lists. Suppose
Again, we assume that no document is re-           n is the total number of documents
turned from multiple search engines, i.e.,         yet to be selected and ni documents
all returned documents are unique.                 are still in the list Li . The die is made
                                                   biased such that the probability that
(1) Use the local document rank infor-             the face corresponding to Li will be
    mation directly to perform the merge.          up when the die is rolled is ni /n.
    Local similarities, if available, will         When the face for Li is up, the current
    be ignored in this approach. First,            top-ranked document in the list Li will
    the searched databases are arranged            be selected as the next-highest-ranked
    in descending order of usefulness or           document in the merged list. After
    quality scores obtained during the             the selection, the selected document
    database selection step. Next, a round-        is removed from Li , and both ni and n
    robin method based on the database or-         are reduced by 1. The probabilities are
    der and the local document rank order          also updated accordingly. In this way,
    is used to merge the local document            the retrieved documents are ranked
    lists. Specifically, the first document          based on the probabilistic model.
    in the merged list is the top-ranked
    document from the highest-ranked           (2) Convert local document ranks to
    database and the second document in            similarities. In D-WISE [Yuwono and
    the merged list is the top-ranked docu-        Lee 1997], the following method is
    ment from the second-highest-ranked            employed. For a given query, suppose
    database. After the top-ranked doc-            ri is the ranking score of database Di ,
    uments from all searched databases             rmin is the lowest database ranking
    have been selected, the next docu-             score (i.e., rmin = min{ri }), r is the local


    rank of a document from database Di ,            search engine, the above discussed simi-
    and g is the converted similarity of the         larity adjustment techniques can be ap-
    document. The conversion function is             plied. We now consider how to deal with
    g = 1−(r −1)· Fi , where Fi is defined to         documents that are returned by multiple
    be (rmin )/(m · ri ) and m is the number of      search engines. First, each local similarity
    documents desired across all searched            can be adjusted using the techniques dis-
    databases. Intuitively, this conversion          cussed above. Next, adjusted similarities
    function has the following properties.           for the same document can be combined
    First, all top-ranked documents from             in a certain way to produce an overall ad-
    local systems will have the same con-            justed similarity for the document. The
    verted similarity 1. This implies that           combination can be carried out by utilizing
    all top-ranked documents from local              one of the combination functions proposed
    systems are considered to be equally             for data fusion. Indeed, this has been prac-
    potentially useful. Second, Fi is used           ticed by some metasearch engines. For ex-
    to model the distance between the                ample, the max function is used in Pro-
    converted similarities of two consecu-           Fusion [Gauch et al. 1996] and the sum
    tively ranked documents in database              function is used in MetaCrawler [Selberg
    Di . In other words, the difference              and Etzioni 1997]. It should be pointed out
    between the converted similarities               that an effective combination function in
    of the j th- and the ( j + 1)th-ranked           data fusion may not necessarily be effec-
    documents from database Di is Fi . The           tive in a metasearch engine environment.
    distance is larger for databases with            In data fusion, if a document is not re-
    smaller ranking scores. As a result, if          trieved by a retrieval technique, then it
    the rank of a document d in a higher-            is because the document is not considered
    rank database is the same as the                 useful by the technique. In contrast, in a
    rank of document d' in a lower-rank              metasearch engine, there are two possible
    database but neither d nor d' is top-            reasons for a document not to be retrieved
    ranked, then the converted similarity            by a selected search engine. The first is the
    of d will be higher than that of d'. In          same as in the data fusion case, namely
    addition, this method tends to select            the document is not considered sufficiently
    more documents from databases with               useful by the search engine. The second is
    higher scores into the merged result.            that the document is not indexed by the
       As an example, consider two data-             search engine. In this case, the document
    bases D1 and D2 . Suppose r1 = 0.2 and           did not have a chance to be judged for its
    r2 = 0.5. Furthermore, suppose four              usefulness by the search engine. Clearly, a
    documents are desired. Then, we have             document that is not retrieved due to the
    rmin = 0.2, F1 = 0.25, and F2 = 0.1.             second reason will be put at a disadvan-
    Based on the above conversion func-              tage if a combination function such as sum
    tion, the top three ranked documents             and CombMNZ is used. Finding an effec-
    from D1 will have converted similari-            tive combination function in a metasearch
    ties 1, 0.75, and 0.5, respectively, and         engine environment is an area that still
    the top three ranked documents from              needs further research.
    D2 will have converted similarities 1,
    0.9, and 0.8, respectively. As a result,
    the merged list will contain three docu-         7.2. Global Similarity Estimation
    ments from D2 and one document from              Under certain conditions, it is possible to
    D1 . The documents will be ranked                compute or estimate the global similari-
    in descending order of converted                 ties of returned documents. The following
    similarities in the merged list.                 methods have been reported.
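
The two-database example above can be reproduced with a short sketch. It assumes the conversion function has the form F_j = r_min / (m * r_j), with a document at local rank `rank` receiving the converted similarity 1 - (rank - 1) * F_j; this form is an assumption chosen because it matches the numbers quoted in the example (F1 = 0.25, F2 = 0.1). The function and variable names are illustrative only.

```python
def converted_similarities(ranking_scores, num_desired, top_k):
    """Convert local ranks 1..top_k into similarities for each database.

    ranking_scores: dict mapping database name -> ranking score r_j
    num_desired:    number of documents wanted in the merged list (m)
    Assumed form:   F_j = r_min / (m * r_j); sim = 1 - (rank - 1) * F_j
    """
    r_min = min(ranking_scores.values())
    converted = {}
    for name, r_j in ranking_scores.items():
        f_j = r_min / (num_desired * r_j)
        converted[name] = [1 - (rank - 1) * f_j for rank in range(1, top_k + 1)]
    return converted


def merge(converted, num_desired):
    """Pool all documents and keep the num_desired highest converted similarities."""
    pooled = [
        (sim, name, rank)
        for name, sims in converted.items()
        for rank, sim in enumerate(sims, start=1)
    ]
    # Python's sort is stable, so ties keep their original (database) order.
    pooled.sort(key=lambda entry: entry[0], reverse=True)
    return pooled[:num_desired]


scores = {"D1": 0.2, "D2": 0.5}
converted = converted_similarities(scores, num_desired=4, top_k=3)
merged = merge(converted, num_desired=4)
for sim, name, rank in merged:
    print(f"{name} rank {rank}: converted similarity {sim:.2f}")
```

With these inputs the merged list holds three documents from D2 and one from D1, in line with the example: the database with the higher ranking score loses less similarity per rank step and so contributes more documents.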
  Now let us consider the situation where              7.2.1. Document Fetching. That a docu-
the selected databases have overlap. For             ment is returned by a search engine typi-
documents that are returned by a single              cally means that the URL of the document


is returned. Sometimes, additional information associated with the document,
such as a short summary or the first couple of sentences, is also returned.
But the document itself is typically not returned.
   The document fetching method downloads returned documents from their local
servers and computes or estimates their global similarities in the metasearch
engine. Consider the case in which the global similarity function is the
cosine function and the global document frequency of each term is known to
the metasearch engine (note that if local databases have little or no
overlap, then the global document frequency of a term can be computed or
approximated as the sum of the local document frequencies of the term). After
a document is downloaded, the term frequency of each term in the document can
be obtained. As a result, all statistics needed to compute the global
similarity of the document will be available and the global similarity can be
computed. The Inquirus metasearch engine ranks documents returned from
different search engines by analyzing the contents of the downloaded
documents, using a ranking formula that combines similarity and proximity
matches [Lawrence and Lee Giles 1998].
   A document-fetching-based method that combines document selection and
result merging is reported in Yu et al. [1999b]. Suppose that the m most
similar documents across all databases with respect to a given query are
desired for some positive integer m. In Section 5.2.5, we introduced a method
to rank databases in descending order of the similarity of the most similar
document in each database for a given query. Such a rank is an optimal rank
for retrieving the m most similar documents. This rank can also be used to
perform document selection as follows.
   First, for some small positive integer s (e.g., s can start from 2), each
of the s top-ranked databases is searched to obtain the actual global
similarity of its most similar document. This may require downloading some
documents from these databases. Let min_sim be the minimum of these s
similarities. Next, from these s databases, retrieve all documents whose
actual global similarities are greater than or equal to the tentative
threshold min_sim. The tightest local threshold for each of these s databases
could be determined and used here. If m or more documents have been
retrieved, then this process stops. Otherwise, the next top-ranked database
(i.e., the (s + 1)th-ranked database) will be considered and its most similar
document will be retrieved. The actual global similarity of this document is
then compared with min_sim and the minimum of these two similarities will be
used as a new global threshold to retrieve all documents from these s + 1
databases whose actual global similarities are greater than or equal to this
threshold. This process is repeated until m or more documents are retrieved.
Retrieved documents are ranked in descending order of their actual global
similarities. A potential problem with this approach is that the same
database may be searched multiple times. This problem can be relieved to some
extent by retrieving and caching a larger number of documents when searching
a database.
   This method has the following two properties [Yu et al. 1999b]. First, if
the databases are ranked optimally, then all the m most similar documents can
be retrieved while accessing at most one unnecessary database, for any m.
Second, for any single-term query, the optimal rank of databases can be
achieved and, as a result, the m most similar documents will be retrieved.
   Downloading documents and analyzing them on the fly can be an expensive
undertaking, especially when the number of documents to be downloaded is
large and the documents have large sizes. A number of remedies have been
proposed. First, downloading from different local systems can be carried out
in parallel. Second, some documents can be analyzed first and displayed to
the user so that further analysis can be done while the user reads the
initial results [Lawrence and Lee Giles 1998]. The initially displayed
results may not be correctly ranked and the overall rank needs to be adjusted
when more


documents are analyzed. Third, we may consider downloading only the beginning
portion of each (large) document to analyze [Craswell et al. 1999].
   On the other hand, downloading-based approaches also have some clear
advantages [Lawrence and Lee Giles 1998]. First, when trying to download
documents, obsolete URLs can be identified. As a result, documents with dead
URLs can be removed from the final result list. Second, by analyzing
downloaded documents, documents will be ranked by their current contents. In
contrast, local similarities may be computed based on old versions of these
documents. Third, query terms in downloaded documents could be highlighted
when displayed to the user.

   7.2.2. Use of Discovered Knowledge. As discussed previously, one
difficulty with result merging is that local document similarities may be
incomparable because in different component search engines the documents may
be indexed differently and the similarities may be computed using different
methods (term weighting schemes, similarity functions, etc.). If the specific
document indexing and similarity computation methods used in different
component search engines can be discovered, for example, using the techniques
proposed in Liu et al. [2000], then we can be in a better position to figure
out (1) what local similarities are reasonably comparable; (2) how to adjust
some local similarities so that they will become more comparable with others;
and (3) how to derive global similarities from local similarities. This is
illustrated by the following example [Meng et al. 1999b].

   Example 3. Suppose it is discovered that all the component search engines
selected to answer a given user query employ the same methods to index local
documents and to compute local similarities, and no collection-dependent
statistics such as the idf information are used. Then the similarities from
these local search engines can be considered as comparable. As a result,
these similarities can be used directly to merge the returned documents.
   If the only difference among these component search engines is that some
remove stopwords and some do not (or the stopword lists are different), then
a query may be adjusted to generate more comparable local similarities. For
instance, suppose a term t in query q is a stopword in component search
engine E1 but not a stopword in component search engine E2. In order to
generate more comparable similarities, we can remove t from q and submit the
modified query to E2 (it does not matter whether the original q or the
modified q is submitted to E1).
   If the idf information is also used, then we need to either adjust the
local similarities or compute the global similarities directly to overcome
the problem that the global idf and the local idfs of a term may be
different. Consider the following two cases. It is assumed that both the
local similarity function and the global similarity function are the cosine
function.

  Case 1: Query q consists of a single term t. The similarity of q with a
  document d in a component database can be computed by

  $$\mathrm{sim}(d, q) = \frac{\mathrm{qtf}_t(q) \times \mathrm{lidf}_t \times \mathrm{dtf}_t(d)}{|q| \cdot |d|},$$

  where $\mathrm{qtf}_t(q)$ and $\mathrm{dtf}_t(d)$ are the tf weights of
  term t in q and in d, respectively, and $\mathrm{lidf}_t$ is the local idf
  weight of t. If the local idf formula has been discovered and the global
  document frequency of t is known, then this local similarity can be
  adjusted to the global similarity by multiplying it by
  $\mathrm{gidf}_t / \mathrm{lidf}_t$, where $\mathrm{gidf}_t$ is the global
  idf weight of t.

  Case 2: Query q has multiple terms $t_1, \ldots, t_k$. The global
  similarity between d and q in this case is

  $$s = \sum_{i=1}^{k} \frac{\mathrm{qtf}_{t_i}(q) \times \mathrm{gidf}_{t_i} \times \mathrm{dtf}_{t_i}(d)}{|q| \cdot |d|} = \sum_{i=1}^{k} \frac{\mathrm{qtf}_{t_i}(q)}{|q|} \cdot \frac{\mathrm{dtf}_{t_i}(d)}{|d|} \cdot \mathrm{gidf}_{t_i}.$$

  Clearly, $\mathrm{qtf}_{t_i}(q)/|q|$ and $\mathrm{gidf}_{t_i}$,
  $i = 1, \ldots, k$, can all be computed by the metasearch


     engine as the formulas for computing                           (2) Integrate local systems supporting dif-
     them are known. Therefore, in order                                ferent types of queries (e.g., Boolean
     to find s, we need to find                                         queries versus vector space queries).
     $\mathrm{dtf}_{t_i}(d)/|d|$, $i = 1, \ldots, k$.                   Most of our discussions in this article
     To find $\mathrm{dtf}_{t_i}(d)/|d|$ for a given term               are based on queries in the vector space
     $t_i$ without downloading document d, we                           model [Salton and McGill 1983]. There
     can submit $t_i$ as a single-term query. Let                       exist metasearch engines that use

     $$s_i = \mathrm{sim}(d, t_i) = \frac{\mathrm{qtf}_{t_i}(t_i) \times \mathrm{lidf}_{t_i} \times \mathrm{dtf}_{t_i}(d)}{|t_i| \cdot |d|}$$

     be the local similarity returned. Then                             Boolean queries [French et al. 1995; Li
                                                                        and Danzig 1997; NCSTRL n.d.] and
                                                                        a number of works on dealing with
                                                                        Boolean queries in a metasearch en-

     $$\frac{\mathrm{dtf}_{t_i}(d)}{|d|} = \frac{s_i \times |t_i|}{\mathrm{qtf}_{t_i}(t_i) \times \mathrm{lidf}_{t_i}} \qquad (16)$$

                                                                        gine have been reported [Gravano et al.
                                                                        1994; Li and Danzig 1997; Sheldon
                                                                        et al. 1994]. Since very different meth-
     Note that the expression on the right-                             ods may be used to rank documents for
     hand side of the above formula can                                 Boolean queries (traditional Boolean
     be computed by the metasearch engine                               retrieval systems do not even rank re-
     when all the local formulas are known                              trieved documents) and vector space
     (i.e., have been discovered). In sum-                              queries, we are likely to face many
     mary, k additional single-term queries                             new problems when integrating local
     can be used to compute the global sim-                             systems that support both Boolean
     ilarities between q and all documents                              queries and vector space queries.
     retrieved by q.
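
The adjustments in Case 1 and Case 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any particular system: the local idf weights are taken as already discovered, a single-term probe query is assumed to have tf weight 1 and norm 1, and `probe` is a hypothetical callable standing in for "submit t_i as a single-term query and read back the local similarity of d". All names and the toy numbers are invented for the example.

```python
import math


def adjust_single_term(local_sim, lidf, gidf):
    """Case 1: rescale a single-term local similarity by gidf_t / lidf_t."""
    return local_sim * gidf / lidf


def normalized_dtf(s_i, probe_qtf, probe_norm, lidf):
    """Equation (16): recover dtf_t(d) / |d| from the probe similarity s_i."""
    return s_i * probe_norm / (probe_qtf * lidf)


def estimate_global_similarity(query_tf, query_norm, gidf, lidf, probe,
                               probe_qtf=1.0, probe_norm=1.0):
    """Case 2: sum over query terms of (qtf/|q|) * (dtf/|d|) * gidf.

    One single-term probe per query term is issued; the document itself
    is never downloaded.
    """
    total = 0.0
    for term, qtf in query_tf.items():
        s_i = probe(term)
        ratio = normalized_dtf(s_i, probe_qtf, probe_norm, lidf[term])
        total += (qtf / query_norm) * ratio * gidf[term]
    return total


# Toy data (assumed values, for illustration only).
doc_tf = {"web": 3.0, "search": 4.0}   # dtf_t(d)
doc_norm = 5.0                         # |d|
lidf = {"web": 2.0, "search": 1.5}     # discovered local idf weights
gidf = {"web": 1.0, "search": 3.0}     # global idf weights


def probe(term):
    """Simulated local engine: cosine similarity of d for a one-term query."""
    return 1.0 * lidf[term] * doc_tf[term] / (1.0 * doc_norm)


query_tf = {"web": 1.0, "search": 2.0}
query_norm = math.sqrt(sum(w * w for w in query_tf.values()))

estimated = estimate_global_similarity(query_tf, query_norm, gidf, lidf, probe)
direct = sum(query_tf[t] * gidf[t] * doc_tf[t] for t in query_tf) / (query_norm * doc_norm)
print(math.isclose(estimated, direct))
```

The toy check compares the estimate built from k probe similarities with the global similarity computed directly from the (normally unavailable) document statistics; the two agree because the local idf factor cancels exactly as in Equation (16).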
                                                                    (3) Discover knowledge about component
                                                                        search engines. Many local systems
8. NEW CHALLENGES                                                       are not willing to provide sufficient
As discussed in previous sections, much                                 design and statistical information
progress has been made to find efficient                                  about their systems. They consider
and accurate solutions to the problem of                                such information proprietary. How-
processing queries in a metasearch en-                                  ever, without sufficient information
gine environment. However, as an emerg-                                 about a local system, the estimation
ing area, many outstanding problems re-                                 about the usefulness of the local
main to be solved. In this section, we list                             system with respect to a given query
a few worthwhile challenges in this area.                               may not be made accurately. One
                                                                        possible solution to this dilemma is to
(1) Integrate local systems employing dif-                              develop tools that can learn about a
    ferent indexing techniques. Using dif-                              local system regarding the indexing
    ferent indexing techniques in differ-                               terms used and certain statistical
    ent local systems can have serious                                  information about these terms as
    impact on the compatibility of local                                well as the similarity function used
    similarities. Careful observation can                               through probe queries. These learning
    reveal that using different indexing                                or knowledge discovering tools can be
    techniques can in fact affect the esti-                             used to facilitate not only the addition
    mation accuracy in each of the three                                of new component search engines to
    software components (i.e., database se-                             an existing metasearch engine but
    lection, document selection, and result                             also the detection of major upgrades
    merging). New studies need to be car-                               or changes of existing component sys-
    ried out to investigate more precisely                              tems. Some preliminary work in this
    what impact it poses and how to over-                               area has started to be reported. Using a
    come or alleviate the impact. Previous                              sampling technique to generate ap-
    studies have largely been focused on                                proximate database representatives
    different local similarity functions and                            for CORI Net is reported in Callan
    local term weighting schemes.                                       et al. [1999]. In Liu et al. [2000], a


    technique is proposed to discover how term weights are assigned in
    component search engines. New techniques need to be developed to discover
    knowledge about component search engines more accurately and more
    efficiently.

(4) Develop more effective result merging methods. Up to now, most result
    merging methods that have undergone extensive experimental evaluation are
    those proposed for data fusion. These methods may be unsuitable in the
    metasearch engine environment where databases of different component
    search engines are not identical. New methods that take into
    consideration the special characteristics of the metasearch engine
    environment need to be designed and evaluated. One such special
    characteristic is that when a document is not retrieved by a search
    engine, it may be because the document is not indexed by the search
    engine.

(5) Study the appropriate cooperation between a metasearch engine and the
    local systems. There are two extreme ways to build a metasearch engine.
    One is to impose an interface on top of autonomous component search
    engines. In this case, no cooperation from these local systems can be
    expected. The other is to invite local systems to join a metasearch
    engine. In this case, the developer of the metasearch engine may set
    conditions, such as what similarity function(s) must be used and what
    information about the component databases must be provided, that must be
    satisfied for a local system to join the metasearch engine. Many
    possibilities exist between the two extremes. This means it is likely, in
    a practical environment, that different types of database representatives
    will be available to the metasearch engine. How to use different types of
    database representatives to estimate comparable database usefulnesses is
    still a largely untouched problem. An interesting issue is to come up
    with guidelines on what information from local systems is useful to
    facilitate the construction of a metasearch engine. Search engine
    developers may use such guidelines to design or upgrade their search
    engines. Multiple levels of compliance should be allowed, with different
    compliance levels guaranteeing different levels of estimation accuracy. A
    serious initial effort in this regard can be found in Gravano et al.
    [1997].

(6) Incorporate new indexing and weighting techniques to build better
    metasearch engines. Some new indexing and term weighting techniques have
    been developed for search engines for HTML documents. For example, some
    search engines (e.g., WWWW [McBryan 1994], Google [Brin and Page 1998],
    and Webor [Cutler et al. 1997]) use anchor terms in a Web page to index
    the Web page that is hyperlinked by the URL associated with the anchor.
    The rationale is that when authors of Web pages add a hyperlink to
    another Web page p, they include in the anchor tag a description of p in
    addition to its URL. These descriptions have the potential of being very
    important for the retrieval of p because they include the perception of
    these authors about the contents of p. As another example, some search
    engines also compute the weight of a term according to its position in
    the Web page and its font type. In SIBRIS [Wade et al. 1989], the weight
    of a term in a page is increased if the term appears in the title of the
    page. A similar method is also employed in AltaVista, HotBot, and Yahoo.
    Google [Brin and Page 1998] assigns higher weights to terms in larger or
    bold fonts. It is known that co-occurrences and proximities of terms have
    significant influence on the relevance of documents. An interesting
    problem is how to incorporate these new techniques into the entire
    retrieval process and into the database representatives so that better
    metasearch engines can be built.
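
The anchor-term idea in item (6) can be illustrated with a small sketch: terms in the anchor text of a hyperlink are indexed under the page the link points to, not the page that contains the link. This is a simplified illustration only and does not describe how any of the cited engines is implemented; the `links` triples, the example URLs, and the whitespace tokenizer are all assumptions made for the example.

```python
from collections import defaultdict


def build_anchor_index(links):
    """Index each target page under the terms of the anchors that point to it.

    links: iterable of (source_url, target_url, anchor_text) triples.
    Returns a dict mapping term -> set of target URLs.
    """
    index = defaultdict(set)
    for _source, target, anchor_text in links:
        for term in anchor_text.lower().split():
            index[term].add(target)
    return index


links = [
    ("http://a.example/", "http://p.example/", "metasearch engine survey"),
    ("http://b.example/", "http://p.example/", "survey of distributed retrieval"),
    ("http://a.example/", "http://q.example/", "weather report"),
]
index = build_anchor_index(links)
print(sorted(index["survey"]))
```

Here the page at `http://p.example/` becomes retrievable by "survey" even if the word never occurs in its own text, which is the effect a database representative would need to capture for such engines.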


(7) Improve the effectiveness of metasearch. Most existing techniques rank
    databases and documents based on the similarities between the query and
    the documents in each database. Similarities are computed based on the
    match of terms in the query and documents. Studies in information
    retrieval indicate that when queries have a large number of terms, the
    correlation between highly similar documents and relevant documents
    exists, provided that appropriate similarity functions and term weighting
    schemes, such as the cosine function and the tfw ∗ idfw weight formula,
    are used. However, for queries that are short, typical in the Internet
    environment [Jansen et al. 1998; Kirsch 1998], the above correlation is
    weak. The reason is that for a long query, the terms in the query provide
    context to each other to help disambiguate the meanings of different
    terms. In a short query, the particular meaning of a term often cannot be
    identified correctly. In summary, a similar document to a short query may
    not be useful to the user who submitted the query because the matching
    terms may have different meanings. Clearly, the same problem also exists
    for search engines. Methods need to be developed to address this issue.
    The following are some promising ideas. First, incorporate the importance
    of a document as determined by linkages between documents (e.g., PageRank
    [Page et al. 1998] and authority [Kleinberg 1998]) with the similarity of
    the document with a query [Yu et al. 2001]. Second, associate databases
    with concepts [Fan and Gauch 1999; Ipeirotis et al. 2001; Meng et al.
    2001]. When a query is received by the metasearch engine, it is first
    mapped to a number of appropriate concepts and then those databases
    associated with the mapped concepts are used for database selection. The
    concepts associated with a database/query are used to provide some
    contexts for terms in the database/query. As a result, the meanings of
    terms can be more accurately determined. Third, user profiles may be
    utilized to support personalized metasearch. Fourth, collaborative
    filtering (CF) has been shown to be very effective for recommending
    useful documents [Konstan et al. 1997] and is employed by the DirectHit
    search engine (www.directhit.com). The CF technique may also be useful
    for recommending databases to search for a given query.

(8) Decide where to place the software components of a metasearch engine. In
    Section 3, we identified the major software components for building a
    good metasearch engine. One issue that we have not discussed is where
    these components should be placed. An implicit assumption used in this
    article is that all components are placed at the site of the metasearch
    engine. However, valid alternatives exist. For example, instead of having
    the database selector at the global site, we could distribute it to all
    local sites. The representative of each local database can also be stored
    locally. In this scenario, each user query will be dispatched to all
    local sites for database selection. Each site then estimates the
    usefulness of its database with respect to the query to determine whether
    its local search engine should be invoked for the query. Although this
    placement of the database selector will incur a higher communication
    cost, it also has some appealing advantages. First, the estimation of
    database usefulness can now be carried out in parallel. Next, as database
    representatives are stored locally, the scalability issue becomes much
    less significant than when centralized database selection is employed.
    Other components such as the document selector may also have alternative
    placements. We need to investigate the pros and cons of different
    placements of these software components. New research issues may arise
    from these investigations.

(9) Create a standard testbed to evaluate the proposed techniques for database


    selection, document selection, and result merging. This is an urgent
    need. Although most papers that report these techniques include some
    experimental results, it is hard to draw general conclusions from these
    results due to the limitations of the documents and queries used. Some
    studies use various portions of some old TREC collections to conduct
    experiments [Callan et al. 1995; Voorhees et al. 1995b; Xu and Callan
    1998] so that the information about the relevance of documents to each
    query can be utilized. However, the old TREC collections have several
    limitations. First, the number of queries that can be used for different
    portions of TREC collections is small (from 50 to 250). Second, these
    queries tend to be much longer on the average than typical queries
    encountered in the Internet environment [Abdulla et al. 1997; Jansen
    et al. 1998]. Third, the documents do not reflect more structured and
    more extensively hyperlinked Web documents. In Gravano and Garcia-Molina
    [1995], Meng et al. [1998, 1999a], and Yu et al. [1999a], a collection of
    more than 6,000 real Internet queries is used. However, the database
    collection is small and there is no document relevance information. An
    ideal testbed should have a large collection of databases of various
    sizes, contents, and structures, and a large collection of queries of
    various lengths with the relevant documents for each query identified.
    Recently, a testbed based on partitioning some old TREC collections into
    hundreds of databases has been proposed for evaluating metasearch
    techniques [French et al. 1998, 1999]. However, this testbed is far from
    being ideal due to the problems inherited from the TREC collections used.
    Two new TREC collections consisting of Web documents (i.e., WT10g and
    VLC2; WT10g is a 10GB subset of the 100GB VLC2) have been created
    recently. The test queries are also typical Internet queries. It is
    possible that good testbeds can be derived from them.

(10) Extend metasearch techniques to different types of data sources.
    Information sources on the Web often contain multimedia data such as
    text, images, and video. Most work in metasearch deals with only text
    sources or the text aspect of multimedia sources. Database selection
    techniques have also been investigated for other media types. For
    example, selecting image databases in a metasearch context was studied in
    Chang et al. [1998]. As another example, for data sources that can be
    described by attributes, such as book title and author name, a necessary
    and sufficient condition for ranking databases optimally was given in
    Kirk et al. [1995]. The database selection method in Liu [1999] also
    considered only data sources of mostly structured data. But there is a
    lack of research on providing metasearch capabilities for mixed media or
    multimedia sources.

   The above list of challenges is by no means complete. New problems will
arise with a deeper understanding of the issues in metasearch.

9. CONCLUSIONS
With the increase in the number of search engines and digital libraries on
the World Wide Web, providing easy, efficient, and effective access to text
information from multiple sources has increasingly become necessary. In this
article, we presented an overview of existing metasearch techniques. Our
overview concentrated on the problems of database selection, document
selection, and result merging. A wide variety of techniques for each of these
problems was surveyed and analyzed. We also discussed the causes that make
these problems very challenging. The causes include various heterogeneities
among different component search engines due to the independent
implementations of these search engines, and the lack of information about
these implementations because they are mostly proprietary.
   Our survey and investigation seem to indicate that better solutions to
each of the

ACM Computing Surveys, Vol. 34, No. 1, March 2002.
three main problems, namely database selection, document selection, and
result merging, require more information/knowledge about the component
search engines such as more detailed database representatives, underlying
similarity functions, term weighting schemes, indexing methods, and so on.
There are currently no sufficiently efficient methods to find such
information without the cooperation of the underlying search engines. A
possible scenario is that we will need good solutions based on different
degrees of knowledge about each local search engine, which we will then
apply accordingly.
   Another important issue is the scalability of the solutions. Ultimately,
we need to develop solutions that can scale in two orthogonal dimensions:
data and access. Specifically, a good solution must scale to thousands of
databases, with many of them containing millions of documents, and to
millions of accesses a day. None of the proposed solutions has been
evaluated under these conditions.

             ACKNOWLEDGMENTS

We are very grateful to the anonymous reviewers and the editor, Michael
Franklin, of the article for their invaluable suggestions and constructive
comments. We also would like to thank Leslie Lander for reading the
manuscript and providing suggestions that have improved the quality of the
manuscript.

                  REFERENCES

ABDULLA, G., LIU, B., SAAD, R., AND FOX, E. 1997. Characterizing World
    Wide Web queries. Technical Report TR-97-04, Virginia Tech.
BAUMGARTEN, C. 1997. A probabilistic model for distributed information
    retrieval. In Proceedings of the ACM SIGIR Conference (Philadelphia,
    PA, July 1997), 258–266.
BERGMAN, M. 2000. The deep Web: Surfacing the hidden value. BrightPlanet,
    www.completeplanet.com/Tutorials/DeepWeb/index.asp.
BOYAN, J., FREITAG, D., AND JOACHIMS, T. 1996. A machine learning
    architecture for optimizing web search engines. In AAAI Workshop on
    Internet-Based Information Systems (Portland, OR, 1996).
BRIN, S. AND PAGE, L. 1998. The anatomy of a large-scale hypertextual Web
    search engine. In Proceedings of the Seventh World Wide Web Conference
    (Brisbane, Australia, April 1998), 107–117.
BUCKLEY, C., SALTON, G., AND ALLAN, J. 1993. Automatic retrieval with
    locality information using smart. In Proceedings of the First Text
    Retrieval Conference, NIST Special Publication 500–207 (March), 59–72.
CALLAN, J. 2000. Distributed information retrieval. In Advances in
    Information Retrieval: Recent Research from the Center for Intelligent
    Information Retrieval, W. Bruce Croft, ed. Kluwer Academic Publishers.
    127–150.
CALLAN, J., CONNELL, M., AND DU, A. 1999. Automatic discovery of language
    models for text databases. In Proceedings of the ACM SIGMOD Conference
    (Philadelphia, PA, June 1999), 479–490.
CALLAN, J., CROFT, B., AND HARDING, S. 1992. The inquery retrieval system.
    In Proceedings of the Third DEXA Conference (Valencia, Spain, 1992),
    78–83.
CALLAN, J., LU, Z., AND CROFT, W. 1995. Searching distributed collections
    with inference networks. In Proceedings of the ACM SIGIR Conference
    (Seattle, WA, July 1995), 21–28.
CHAKRABARTI, S., DOM, B., KUMAR, S., RAGHAVAN, P., RAJAGOPALAN, S.,
    TOMKINS, A., GIBSON, D., AND KLEINBERG, J. 1999. Mining the web’s link
    structure. IEEE Comput. 32, 8 (Aug.), 60–67.
CHAKRAVARTHY, A. AND HAASE, K. 1995. Netserf: Using semantic knowledge to
    find internet information archives. In Proceedings of the ACM SIGIR
    Conference (Seattle, WA, July 1995), 4–11.
CHANG, C. AND GARCIA-MOLINA, H. 1999. Mind your vocabulary: query mapping
    across heterogeneous information sources. In Proceedings of the ACM
    SIGMOD Conference (Philadelphia, PA, June 1999), 335–346.
CHANG, W., MURTHY, D., ZHANG, A., AND SYEDA-MAHMOOD, T. 1998. Global
    integration of visual databases. In Proceedings of the IEEE
    International Conference on Data Engineering (Orlando, FL, Feb. 1998),
    542–549.
COTTRELL, G. AND BELEW, R. 1994. Automatic combination of multiple ranked
    retrieval systems. In Proceedings of the ACM SIGIR Conference (Dublin,
    Ireland, July 1994), 173–181.
CRASWELL, N., HAWKING, D., AND THISTLEWAITE, P. 1999. Merging results from
    isolated search engines. In Proceedings of the Tenth Australasian
    Database Conference (Auckland, New Zealand, Jan. 1999), 189–200.
CROFT, W. 2000. Combining approaches to information retrieval. In Advances
    in Information Retrieval: Recent Research from the Center for
    Intelligent Information Retrieval, W. Bruce Croft, ed. Kluwer Academic
    Publishers. 1–36.
CUTLER, M., SHIH, Y., AND MENG, W. 1997. Using the structures of html
    documents to improve retrieval. In Proceedings of the USENIX Symposium
    on Internet Technologies and Systems (Monterey, CA, Dec. 1997),
    241–251.
DREILINGER, D. AND HOWE, A. 1997. Experiences with selecting search
    engines using metasearch.
    ACM Trans. Inform. Syst. 15, 3 (July), 195–222.
FAN, Y. AND GAUCH, S. 1999. Adaptive agents for information gathering from
    multiple, distributed information sources. In Proceedings of the 1999
    AAAI Symposium on Intelligent Agents in Cyberspace (Stanford
    University, Palo Alto, CA, March 1999), 40–46.
FOX, E. AND SHAW, J. 1994. Combination of multiple searches. In
    Proceedings of the Second Text REtrieval Conference (Gaithersburg, MD,
    Aug. 1994), 243–252.
FRENCH, J., FOX, E., MALY, K., AND SELMAN, A. 1995. Wide area technical
    report service: technical report online. Commun. ACM 38, 4 (April),
    45–46.
FRENCH, J., POWELL, A., CALLAN, J., VILES, C., EMMITT, T., PREY, K., AND
    MOU, Y. 1999. Comparing the performance of database selection
    algorithms. In Proceedings of the ACM SIGIR Conference (Berkeley, CA,
    August 1999), 238–245.
FRENCH, J., POWELL, A., AND VILES, C. 1998. Evaluating database selection
    techniques: a testbed and experiment. In Proceedings of the ACM SIGIR
    Conference (Melbourne, Australia, August 1998), 121–129.
GAUCH, S., WANG, G., AND GOMEZ, M. 1996. Profusion: intelligent fusion
    from multiple, distributed search engines. J. Univers. Comput. Sci. 2,
    9, 637–649.
GRAVANO, L., CHANG, C., GARCIA-MOLINA, H., AND PAEPCKE, A. 1997. Starts:
    Stanford proposal for Internet meta-searching. In Proceedings of the
    ACM SIGMOD Conference (Tucson, AZ, May 1997), 207–218.
GRAVANO, L. AND GARCIA-MOLINA, H. 1995. Generalizing gloss to vector-space
    databases and broker hierarchies. In Proceedings of the International
    Conference on Very Large Data Bases (Zurich, Switzerland, Sept. 1995),
    78–89.
GRAVANO, L. AND GARCIA-MOLINA, H. 1997. Merging ranks from heterogeneous
    Internet sources. In Proceedings of the International Conference on
    Very Large Data Bases (Athens, Greece, August 1997), 196–205.
GRAVANO, L., GARCIA-MOLINA, H., AND TOMASIC, A. 1994. The effectiveness of
    gloss for the text database discovery problem. In Proceedings of the
    ACM SIGMOD Conference (Minneapolis, MN, May 1994), 126–137.
HAWKING, D. AND THISTLEWAITE, P. 1999. Methods for information server
    selection. ACM Trans. Inform. Syst. 17, 1 (Jan.), 40–76.
IPEIROTIS, P., GRAVANO, L., AND SAHAMI, M. 2001. Probe, count, and
    classify: categorizing hidden-Web databases. In Proceedings of the ACM
    SIGMOD Conference (Santa Barbara, CA, 2001), 67–78.
JANSEN, B., SPINK, A., BATEMAN, J., AND SARACEVIC, T. 1998. Real life
    information retrieval: a study of user queries on the Web. ACM SIGIR
    Forum 32, 1, 5–17.
KAHLE, B. AND MEDLAR, A. 1991. An information system for corporate users:
    wide area information servers. Technical Report TMC199, Thinking
    Machines Corporation (April).
KIRK, T., LEVY, A., SAGIV, Y., AND SRIVASTAVA, D. 1995. The information
    manifold. In AAAI Spring Symposium on Information Gathering in
    Distributed Heterogeneous Environments (1995).
KIRSCH, S. 1998. Internet search: Infoseek’s experiences searching the
    internet. ACM SIGIR Forum 32, 2, 3–7.
KLEINBERG, J. 1998. Authoritative sources in a hyperlinked environment. In
    Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (San
    Francisco, CA, January 1998), 668–677.
KONSTAN, J., MILLER, B., MALTZ, D., HERLOCKER, J., GORDON, L., AND
    RIEDL, J. 1997. Grouplens: Applying collaborative filtering to usenet
    news. Commun. ACM 40, 3, 77–87.
KOSTER, M. 1994. Aliweb: Archie-like indexing in the Web. Comput. Netw.
    and ISDN Syst. 27, 2, 175–182.
LAWRENCE, S. AND LEE GILES, C. 1998. Inquirus, the neci meta search
    engine. In Proceedings of the Seventh International World Wide Web
    Conference (Brisbane, Australia, April 1998), 95–105.
LAWRENCE, S. AND LEE GILES, C. 1999. Accessibility of information on the
    web. Nature 400, 107–109.
LEE, J.-H. 1997. Analyses of multiple evidence combination. In Proceedings
    of the ACM SIGIR Conference (Philadelphia, PA, July 1997), 267–276.
LI, S. AND DANZIG, P. 1997. Boolean similarity measures for resource
    discovery. IEEE Trans. Knowl. Data Eng. 9, 6 (Nov.), 863–876.
LIU, K., MENG, W., YU, C., AND RISHE, N. 2000. Discovery of similarity
    computations of search engines. In Proceedings of the Ninth ACM
    International Conference on Information and Knowledge Management
    (Washington, DC, Nov. 2000), 290–297.
LIU, K., YU, C., MENG, W., WU, W., AND RISHE, N. 2001. A statistical
    method for estimating the usefulness of text databases. IEEE Trans.
    Knowl. Data Eng. To appear.
LIU, L. 1999. Query routing in large-scale digital library systems. In
    Proceedings of the IEEE International Conference on Data Engineering
    (Sydney, Australia, March 1999), 154–163.
MANBER, U. AND BIGOT, P. 1997. The search broker. In Proceedings of the
    USENIX Symposium on Internet Technologies and Systems (Monterey, CA,
    December 1997), 231–239.
MANBER, U. AND BIGOT, P. 1998. Connecting diverse web search facilities.
    Data Eng. Bull. 21, 2 (June), 21–27.
MAULDIN, M. 1997. Lycos: design choices in an internet search service.
    IEEE Expert 12, 1 (Feb.), 1–8.
MCBRYAN, O. 1994. Genvl and wwww: Tools for taming the Web. In Proceedings
    of the
    First World Wide Web Conference (Geneva, Switzerland, May 1994),
    79–90.
MENG, W., LIU, K., YU, C., WANG, X., CHANG, Y., AND RISHE, N. 1998.
    Determining text databases to search in the internet. In Proceedings
    of the International Conference on Very Large Data Bases (New York,
    NY, Aug. 1998), 14–25.
MENG, W., LIU, K., YU, C., WU, W., AND RISHE, N. 1999a. Estimating the
    usefulness of search engines. In Proceedings of the IEEE International
    Conference on Data Engineering (Sydney, Australia, March 1999),
    146–153.
MENG, W., WANG, W., SUN, H., AND YU, C. 2001. Concept hierarchy based text
    database categorization. Int. J. Knowl. Inform. Syst. To appear.
MENG, W., YU, C., AND LIU, K. 1999b. Detection of heterogeneities in a
    multiple text database environment. In Proceedings of the Fourth IFCIS
    Conference on Cooperative Information Systems (Edinburgh, Scotland,
    September 1999), 22–33.
MILLER, G. 1990. Wordnet: An on-line lexical database. Int. J.
    Lexicography 3, 4, 235–312.
NCSTRL. n.d. Networked computer science technical reference library. At
    Web site http://cstr.cs.cornell.edu.
PAGE, L., BRIN, S., MOTWANI, R., AND WINOGRAD, T. 1998. The pagerank
    citation ranking: bringing order to the web. Technical report,
    Stanford University, Palo Alto, CA.
ROBERTSON, S., WALKER, S., AND BEAULIEU, M. 1999. Okapi at trec-7:
    automatic ad hoc, filtering, vlc, and interactive track. In
    Proceedings of the Seventh Text Retrieval Conference (Gaithersburg,
    MD, Nov. 1999), 253–264.
SALTON, G. 1989. Automatic Text Processing: The Transformation, Analysis,
    and Retrieval of Information by Computer. Addison Wesley, Reading, MA.
SALTON, G. AND MCGILL, M. 1983. Introduction to Modern Information
    Retrieval. McGraw-Hill, New York, NY.
SELBERG, E. AND ETZIONI, O. 1995. Multi-service search and comparison
    using the metacrawler. In Proceedings of the Fourth World Wide Web
    Conference (Boston, MA, Dec. 1995), 195–208.
SELBERG, E. AND ETZIONI, O. 1997. The metacrawler architecture for
    resource aggregation on the web. IEEE Expert 12, 1, 8–14.
SHELDON, M., DUDA, A., WEISS, R., O’TOOLE, J., AND GIFFORD, D. 1994. A
    content routing system for distributed information servers. In
    Proceedings of the Fourth International Conference on Extending
    Database Technology (Cambridge, England, March 1994), 109–122.
SINGHAL, A., BUCKLEY, C., AND MITRA, M. 1996. Pivoted document length
    normalization. In Proceedings of the ACM SIGIR Conference (Zurich,
    Switzerland, Aug. 1996), 21–29.
SUGIURA, A. AND ETZIONI, O. 2000. Query routing for Web search engines:
    architecture and experiments. In Proceedings of the Ninth World Wide
    Web Conference (Amsterdam, The Netherlands, May 2000), 417–429.
TOWELL, G., VOORHEES, E., GUPTA, N., AND JOHNSON-LAIRD, B. 1995. Learning
    collection fusion strategies for information retrieval. In Proceedings
    of the 12th International Conference on Machine Learning (Tahoe City,
    CA, July 1995), 540–548.
TURTLE, H. AND CROFT, B. 1991. Evaluation of an inference network-based
    retrieval model. ACM Trans. Inform. Syst. 9, 3 (July), 8–14.
VOGT, C. AND COTTRELL, G. 1999. Fusion via a linear combination of scores.
    Inform. Retr. 1, 3, 151–173.
VOORHEES, E. 1996. Siemens trec-4 report: further experiments with
    database merging. In Proceedings of the Fourth Text Retrieval
    Conference (Gaithersburg, MD, Nov. 1996), 121–130.
VOORHEES, E., GUPTA, N., AND JOHNSON-LAIRD, B. 1995a. The collection
    fusion problem. In Proceedings of the Third Text Retrieval Conference
    (Gaithersburg, MD, Nov. 1995), 95–104.
VOORHEES, E., GUPTA, N., AND JOHNSON-LAIRD, B. 1995b. Learning collection
    fusion strategies. In Proceedings of the ACM SIGIR Conference
    (Seattle, WA, July 1995), 172–179.
VOORHEES, E. AND TONG, R. 1997. Multiple search engines in database
    merging. In Proceedings of the Second ACM International Conference on
    Digital Libraries (Philadelphia, PA, July 1997), 93–102.
WADE, S., WILLETT, P., AND BAWDEN, D. 1989. Sibris: the sandwich
    interactive browsing and ranking information system. J. Inform. Sci.
    15, 249–260.
WIDDER, D. 1989. Advanced Calculus, 2nd ed. Dover Publications, Inc., New
    York, NY.
WU, Z., MENG, W., YU, C., AND LI, Z. 2001. Towards a highly-scalable and
    effective metasearch engine. In Proceedings of the Tenth World Wide
    Web Conference (Hong Kong, May 2001), 386–395.
XU, J. AND CALLAN, J. 1998. Effective retrieval with distributed
    collections. In Proceedings of the ACM SIGIR Conference (Melbourne,
    Australia, 1998), 112–120.
XU, J. AND CROFT, B. 1996. Query expansion using local and global document
    analysis. In Proceedings of the ACM SIGIR Conference (Zurich,
    Switzerland, Aug. 1996), 4–11.
XU, J. AND CROFT, B. 1999. Cluster-based language models for distributed
    retrieval. In Proceedings of the ACM SIGIR Conference (Berkeley, CA,
    Aug. 1999), 254–261.
YU, C., LIU, K., WU, W., MENG, W., AND RISHE, N. 1999a. Finding the most
    similar documents across multiple text databases. In Proceedings of
    the IEEE Conference on Advances in Digital Libraries (Baltimore, MD,
    May 1999), 150–162.
YU, C. AND MENG, W. 1998. Principles of Database Query Processing for
    Advanced Applications. Morgan Kaufmann Publishers, San Francisco, CA.
YU, C., MENG, W., LIU, K., WU, W., AND RISHE, N. 1999b. Efficient and
    effective metasearch for a large number of text databases. In
    Proceedings of the Eighth ACM International Conference on Information
    and Knowledge Management (Kansas City, MO, Nov. 1999), 217–224.
YU, C., MENG, W., WU, W., AND LIU, K. 2001. Efficient and effective
    metasearch for text databases incorporating linkages among documents.
    In Proceedings of the ACM SIGMOD Conference (Santa Barbara, CA, May
    2001), 187–198.
YUWONO, B. AND LEE, D. 1996. Search and ranking algorithms for locating
    resources on the World Wide Web. In Proceedings of the IEEE
    International Conference on Data Engineering (New Orleans, LA, Feb.
    1996), 164–177.
YUWONO, B. AND LEE, D. 1997. Server ranking for distributed text resource
    systems on the Internet. In Proceedings of the 5th International
    Conference On Database Systems for Advanced Applications (Melbourne,
    Australia, April 1997), 391–400.

Received March 1999; revised October 2000; accepted May 2001
