Tadpole: A Meta search engine
Evaluation of Meta Search ranking strategies

Mahathi S Mahabhashyam (mmahathi@stanford.edu)
Pavan Singitham (pavan@stanford.edu)

Abstract
In this write-up, we explain the design of Tadpole, a meta-search engine
that obtains results from various search engines and aggregates them. We
discuss three meta-search ranking strategies (two positional methods and a
scaled footrule optimization method) and study the response-time versus
result-quality trade-offs involved.

1. Introduction

A meta-search engine transmits a user's search simultaneously to several
individual search engines and their databases of web pages, and gets results
from all the search engines queried. We can thus save a lot of time by
initiating the search at a single point, sparing the need to learn and use
several separate search engines. This is even more helpful when we are
looking for a broad range of results.

In our project, we have implemented a meta-search engine that queries the
Google, Altavista and MSN databases. We provide an interface for searching
these search engines, along with several advanced options for phrase search,
conjunction, disjunction and negation of keywords. To rank the results
obtained, we have used three rank aggregation strategies and evaluated the
results. Two of these are positional methods, which use a result's rank in
each of the separate search engines to obtain a new rank by simple
aggregation. The third is a scaled footrule optimization technique.

2. Motivation

There are primarily two motivating factors behind our developing a
meta-search engine.

Firstly, the World Wide Web is a huge unstructured corpus of information.
Various search engines crawl the WWW from time to time and index the web
pages. However, it is virtually impossible for any search engine to have the
entire web indexed. Most of the time, a search engine can index only a small
portion of the vast set of web pages existing on the Internet. Each search
engine crawls the web separately and creates its own database of the
content. Therefore, searching more than one search engine at a time enables
us to cover a larger portion of the World Wide Web.

Secondly, crawling the web is a long process, which can take more than a
month, whereas the content of many web pages changes more frequently. It is
therefore important to have the latest updated information, which could be
present in any of the search engines.

Meta-search engines help us achieve the aforementioned objectives. However,
we need good ranking strategies in order to aggregate the results obtained
from the various search engines. Quite often, web sites successfully spam
some of the search engines and obtain an unfair rank. By using appropriate
rank aggregation strategies, we can prevent such results from appearing in
the top results of a meta-search.

Our primary motivation was to develop a simple meta-search engine and study
the response-time and performance trade-offs involved.

3. Previous Work

There are quite a few meta-search engines available on the Internet, which
can be categorized as follows:

1. Meta-search engines for serious deep digging. Ex: Surfwax, Copernic Basic
2. Meta-search engines that aggregate the results obtained from various
   search engines. Ex: Vivisimo, Ixquick
3. Meta-search engines that present results without aggregating them. Ex:

Meta-search engines of the first kind are not available as free software,
so their benefits are not reaped by most users. Some of the other issues
and drawbacks of meta-search engines are discussed in [3]. An aggregation of
the results obtained would be more useful than just dumping the raw results.
For such an aggregation, Ravi Kumar et al. [1] have suggested several rank
aggregation methods for the web, broadly categorized as Borda's positional
methods, footrule and scaled footrule optimization methods, and Markov chain
methods. They also suggest a local Kemenization technique, which brings the
results that are ranked higher by the majority of the search engines to the
top of the meta-search ranking list. This is effective in avoiding spam.

The organization of the report is as follows: Section 5 discusses the
architecture and design of Tadpole, the meta-search engine developed by us.
Section 6 gives a study of the trade-offs involved. In Section 7, we
describe a few problems we encountered during the project. Section 8 gives
the conclusion and future work.

5. Architecture of Tadpole

When a user issues a search request, multiple threads are created in order
to fetch the results from the various search engines. Each of these threads
is given a time limit of 3 seconds to return its results, failing which the
thread times out.

Each process converts the given query to the format specific to the search
engine it is
dealing with. This request is sent to the search engine via the Java URL
object, and the results are obtained in the form of an HTML page. This HTML
results page is parsed by the process and, for each result, the URL, Title,
Description, Rank and SearchSource are stored, creating a Result object.
These results are entered into a TreeMap data structure with the URL as the
key and the Result object as the value.

[Figure 1: Parallel processes query the different search engines (SE#) and
obtain the results as an array of TreeMaps; the ranking algorithm turns
these into the aggregated results, a TreeMap sorted on rank.]

The GUI also provides advanced search options for entering Boolean queries
and phrase searches, selecting the number of results per search engine, and
selecting the search engines to be queried.

5.1 Design Decisions

During the design of Tadpole, various design decisions were taken. Some of
them are listed below.

Why TreeMap?
The TreeMap data structure combines the nice features of a tree (low search
and retrieval time) and a map (easy association). By storing the results
with the URL as the key, we can retrieve a result in O(log n) time while
removing the duplicates and merging them in the ranking algorithm. This
gives a considerable speed-up when we have hundreds of results from each
search engine. The TreeMaps obtained from each of the threads are then
inserted into an array and passed on to the ranking algorithm. The ranking
algorithm then returns a TreeMap sorted on rank.

Why these three ranking strategies?
The positional methods are computationally more efficient. They give good
precision when compared to just aggregating the results without any
ranking. The scaled footrule method is computationally more complex, but
has been shown to give much better performance. It is also useful in
reducing spam to an extent. As the basic idea of this project was to study
the trade-offs involved, we wanted a gradation in the level of
computational complexity and performance, and so we chose these three rank
aggregation methods.

5.2 Ranking Aggregation Methods

Take the Best Rank
In this algorithm, we try to place a URL at the best rank it gets in any of
the search engine rankings. That is,

$$\mathrm{MetaRank}(x) = \min(\mathrm{Rank}_1(x), \mathrm{Rank}_2(x), \ldots, \mathrm{Rank}_n(x))$$

Clashes are avoided by an ordering of the search engines based on
popularity. That is, if two results claim the same position in the
meta-rank list, the result from the more popular search engine (say,
Google) is preferred to the result from a less popular one.

Borda's Positional Method
In this algorithm, the MetaRank of a URL is obtained by computing the
Lp-norm of its ranks in the different search engines:

$$\mathrm{MetaRank}(x) = \left[ \sum_{i=1}^{n} \mathrm{Rank}_i(x)^p \right]^{1/p}$$

In our algorithm, we have considered the L1-norm, which is the sum of all
the ranks in the different search engine result lists. Clashes are again
avoided by search engine popularity. The search source for a URL, which is
displayed in the meta-search results, is set to the search engine in which
the URL is ranked best.

Scaled Footrule Optimization Method
In this algorithm, scaled footrule distances are used to rank the various
results. Let T1, T2, ..., Tk be the partial lists obtained from the various
search engines, and let their union be S. A weighted bipartite graph for
scaled footrule optimization, (C, P, W), is defined as follows:

  C = set of nodes to be ranked
  P = set of positions available
  W(c,p) = the scaled footrule distance (from the Ti's) of a ranking that
           places element c at position p, given by

$$W(c,p) = \sum_{i=1}^{k} \left| \frac{T_i(c)}{|T_i|} - \frac{p}{n} \right|$$
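
The three rules defined in this section can be illustrated with a small Java sketch. It is not Tadpole's code: the class and method names are hypothetical, result lists are plain lists of URLs ordered by rank, and the outer list is assumed to be ordered from most to least popular engine. Two further assumptions are not stated in the text: for the Borda L1 score, a URL missing from a list is charged that list's length plus one, and W(c,p) is summed only over the lists that contain c, with n taken as the number of results to be ranked.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;

final class MetaRankSketch {

    /** Union S of all result lists, in first-seen order. */
    static List<String> union(List<List<String>> lists) {
        LinkedHashSet<String> seen = new LinkedHashSet<>();
        for (List<String> list : lists) {
            seen.addAll(list);
        }
        return new ArrayList<>(seen);
    }

    /** 1-based rank of url in one list, or 0 if the list does not contain it. */
    static int positionIn(List<String> list, String url) {
        return list.indexOf(url) + 1;
    }

    /** Smallest rank of url over the lists that contain it. */
    static int bestRank(List<List<String>> lists, String url) {
        int best = Integer.MAX_VALUE;
        for (List<String> list : lists) {
            int rank = positionIn(list, url);
            if (rank > 0) {
                best = Math.min(best, rank);
            }
        }
        return best;
    }

    /** Index of the most popular engine that gives url its best rank. */
    static int bestEngine(List<List<String>> lists, String url) {
        int best = bestRank(lists, url);
        for (int i = 0; i < lists.size(); i++) {
            if (positionIn(lists.get(i), url) == best) {
                return i;
            }
        }
        return lists.size(); // only reached if url is in no list
    }

    /** L1 Borda score: sum of ranks; absent URLs cost size + 1 (assumption). */
    static int bordaScore(List<List<String>> lists, String url) {
        int sum = 0;
        for (List<String> list : lists) {
            int rank = positionIn(list, url);
            sum += rank > 0 ? rank : list.size() + 1;
        }
        return sum;
    }

    /** Take the Best Rank: order by minimum rank, ties go to the more popular engine. */
    static List<String> takeBestRank(List<List<String>> lists) {
        List<String> urls = union(lists);
        urls.sort(Comparator
                .comparingInt((String u) -> bestRank(lists, u))
                .thenComparingInt(u -> bestEngine(lists, u)));
        return urls;
    }

    /** Borda's positional method with the L1-norm, same tie-break. */
    static List<String> bordaL1(List<List<String>> lists) {
        List<String> urls = union(lists);
        urls.sort(Comparator
                .comparingInt((String u) -> bordaScore(lists, u))
                .thenComparingInt(u -> bestEngine(lists, u)));
        return urls;
    }

    /** W(c,p): scaled footrule cost of placing url at 1-based position p among n results. */
    static double footruleWeight(List<List<String>> lists, String url, int position, int n) {
        double weight = 0.0;
        for (List<String> list : lists) {
            int rank = positionIn(list, url);
            if (rank > 0) { // assumption: skip lists that do not contain url
                weight += Math.abs((double) rank / list.size() - (double) position / n);
            }
        }
        return weight;
    }
}
```

For example, with the three lists [a, b, c], [b, a, d] and [c, b, e], `takeBestRank` keeps a, b and c at the top in engine-popularity order, while `bordaL1` promotes b, the only URL returned by all three engines.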
In W(c,p), Ti(c) is the rank of c in Ti, |Ti| is the cardinality of Ti, k is
the number of lists, and n is the number of results to be ranked.
Computation of footrule aggregation for partial lists is NP-hard [1]; hence
the use of the scaled footrule distance measure. This problem can be
converted to a minimum cost perfect matching in the bipartite graph
described above. There are various algorithms for finding a minimum cost
perfect matching in bipartite graphs. We have used the Hungarian method.

The Hungarian method proceeds as follows:
    - Obtain the reduced cost matrix from the given cost matrix by
         subtracting the minimum of each row from all elements of that row,
         and then the minimum of each column from all elements of that
         column.
    - Try to cover all the zeroes with the minimum number of horizontal and
         vertical lines.
    - If the number of lines equals the size of the matrix, a solution can
         be read off from the zeroes.
    - If all of the zeroes are covered with fewer lines than the size of
         the matrix, find the minimum value that is uncovered.
    - Subtract it from all uncovered values and add it to the value(s) at
         the intersections of the lines.
    - Repeat until a solution is obtained.
A detailed description of the algorithm is provided in [2].

6. Evaluation of Ranking Strategies

6.1 Algorithmic Complexity
The first parameter for testing the three ranking strategies is the time
complexity of the algorithms. The positional methods, MinRanker (Take the
Best Rank) and Borda's positional method, take linear time; that is, they
have a complexity of O(n). Scaled footrule optimization can be solved using
the Hungarian algorithm for bipartite matching, which runs in polynomial
time (O(n^3) in standard implementations).

6.2 Rank Aggregation Time
The aggregation times of the various ranking strategies were measured with
respect to each other and with normal search engines. The evaluation was
carried out on the following set of 38 queries, which were previously used
in other studies [1,4,5]:

    affirmative action, alcoholism, amusement parks, architecture,
    bicycling, blues, cheese, citrus groves, classical guitar, computer
    vision, cruises, Death Valley, field hockey, gardening, graphic design,
    Gulf war, HIV, java, Lipari, lyme disease, mutual funds, National
    parks, parallel architecture, Penelope Fitzgerald, recycling cans, rock
    climbing, San Francisco, Shakespeare, stamp collecting, sushi, table
    tennis, telecommuting, Thailand tourism, vintage cars, volcano, zen
    buddhism, and Zener.

The results are summarized below:
[Figure: Rank aggregation time in milliseconds (axis ticks 100 to 300) for
Naïve Ranking, Borda's Ranking and Foot Rule Ranking]

Average Rank Aggregation Times
Naïve Ranking - 18.6 msec
Borda’s Ranking - 51.2 msec
FootRule Ranking - 161.5 msec
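
The foot rule timings include the assignment step that the two positional methods do not need. As a rough illustration of that step, the sketch below solves a minimum cost perfect matching on a square cost matrix, where row r would be a URL, column c a position, and `cost[r][c]` the weight W for that pair. It is a sketch under assumptions and not the implementation that was timed: it uses the potentials-based O(n^3) formulation of the Hungarian method, not the line-covering steps, and the class and method names are hypothetical.

```java
import java.util.Arrays;

final class AssignmentSketch {

    /**
     * Minimum cost perfect matching on a square cost matrix.
     * Returns rowToCol, where rowToCol[r] is the column assigned to row r.
     */
    static int[] solve(double[][] cost) {
        int n = cost.length;
        double[] u = new double[n + 1]; // row potentials (1-based)
        double[] v = new double[n + 1]; // column potentials (1-based)
        int[] p = new int[n + 1];       // p[j] = row matched to column j, 0 if none
        int[] way = new int[n + 1];     // previous column on the augmenting path

        for (int i = 1; i <= n; i++) {
            p[0] = i;                   // virtual column 0 holds the row being added
            int j0 = 0;
            double[] minv = new double[n + 1];
            Arrays.fill(minv, Double.POSITIVE_INFINITY);
            boolean[] used = new boolean[n + 1];
            do {
                used[j0] = true;
                int i0 = p[j0];
                int j1 = 0;
                double delta = Double.POSITIVE_INFINITY;
                for (int j = 1; j <= n; j++) {
                    if (!used[j]) {
                        double cur = cost[i0 - 1][j - 1] - u[i0] - v[j];
                        if (cur < minv[j]) {
                            minv[j] = cur;
                            way[j] = j0;
                        }
                        if (minv[j] < delta) {
                            delta = minv[j];
                            j1 = j;
                        }
                    }
                }
                // Shift potentials so that at least one more reduced cost becomes zero.
                for (int j = 0; j <= n; j++) {
                    if (used[j]) {
                        u[p[j]] += delta;
                        v[j] -= delta;
                    } else {
                        minv[j] -= delta;
                    }
                }
                j0 = j1;
            } while (p[j0] != 0);
            // Flip the matching along the augmenting path back to column 0.
            do {
                int j1 = way[j0];
                p[j0] = p[j1];
                j0 = j1;
            } while (j0 != 0);
        }

        int[] rowToCol = new int[n];
        for (int j = 1; j <= n; j++) {
            rowToCol[p[j] - 1] = j - 1;
        }
        return rowToCol;
    }

    /** Total cost of an assignment returned by solve. */
    static double totalCost(double[][] cost, int[] rowToCol) {
        double sum = 0.0;
        for (int r = 0; r < rowToCol.length; r++) {
            sum += cost[r][rowToCol[r]];
        }
        return sum;
    }
}
```

Sorting the URLs by their assigned column then gives the aggregated ranking. The cubic growth of this step with the number of results is one plausible reason for the gap between the foot rule and positional timings.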

We observe that the rank aggregation times for the foot rule ranking are,
on average, about three times those for Borda's positional ranking.

6.3 Overlap across search engines: Relative Search Engine Performance
Among the top 10 results obtained for each query, we found the results that
overlap across multiple search engines. An interesting observation is which
search engines rank the overlapping results better. The intuition behind
such a measure is that a search engine that ranks the overlapping results
better can be regarded as a better search engine, considering that the
overlapping results are more relevant.

[Figure: Performance of search engines for overlapped results (surviving
labels: 22%, 59%)]

6.4 Performance of the various rank aggregation methods
In evaluating the performance of the ranking strategies for all the
queries, we have chosen precision as a good measure of relative
performance, because all the ranking strategies work on the same set of
results and try to get the most relevant ones to the top. Hence, a strategy
that has a higher precision at the top can be rated better from the user's
perspective.

We have plotted the precision of the ranking strategies with respect to
both the number of search results and the recall.

In considering the recall, we have taken the total number of relevant
documents based on user evaluation of all the top 10 results retrieved by
each search engine. The recall is calculated as the number of relevant
documents retrieved divided by the total number of relevant results thus
judged.

We have taken the relevance feedback from two different judges. The Kappa
measure of this relevance feedback is 0.78. In the following graphs, we
present the results for two out of the 38 queries run. We also present the
average of the results obtained over the 38 queries.

6.4.1 Precision with respect to Number of Results returned


[Figure: Precision versus Number of Results for two sample queries, and
"Average Precision over 38 queries"; series: Borda Method, Naïve Ranking,
Foot rule]
It can be observed that, on average, the footrule distance rank aggregation
method gives better precision for the given set of results. Also, the
easily computable Borda's method does a good job when compared to the Naïve
ranking method.
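
As an illustration of the measures used in this evaluation, the following Java sketch computes precision over the top k results of a ranking and recall against a pool of judged-relevant URLs. The class and method names are hypothetical, and relevance judgments are assumed to be available as a set of URLs. When fewer than k results exist, precision is taken over the results actually returned, which is an assumption of this sketch.

```java
import java.util.List;
import java.util.Set;

final class PrecisionRecallSketch {

    /** Number of judged-relevant URLs among the first k entries of ranked. */
    static int hitsAt(List<String> ranked, Set<String> relevant, int k) {
        int cutoff = Math.min(k, ranked.size());
        int hits = 0;
        for (String url : ranked.subList(0, cutoff)) {
            if (relevant.contains(url)) {
                hits++;
            }
        }
        return hits;
    }

    /** Precision: relevant results in the top k divided by results returned (at most k). */
    static double precisionAt(List<String> ranked, Set<String> relevant, int k) {
        int cutoff = Math.min(k, ranked.size());
        return cutoff == 0 ? 0.0 : (double) hitsAt(ranked, relevant, k) / cutoff;
    }

    /** Recall: relevant results in the top k divided by all results judged relevant. */
    static double recallAt(List<String> ranked, Set<String> relevant, int k) {
        return relevant.isEmpty() ? 0.0 : (double) hitsAt(ranked, relevant, k) / relevant.size();
    }
}
```

Evaluating `precisionAt` for k = 1, 2, ... on each aggregated ranking yields the kind of curve plotted in this section, and pairing it with `recallAt` at the same k yields a precision versus recall curve.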

6.4.2 Precision vs. Recall

[Figure: Precision versus recall for the queries "Alcoholism" and
"Gardening"; series: Naïve Ranker, Borda's Ranker, Foot rule]

A similar observation can be made with respect to the precision at a given recall for each of the
ranking strategies.
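
The evaluation above rests on relevance judgments from two judges, with a reported Kappa of 0.78. The text does not say which variant was used; assuming it is Cohen's kappa over binary relevant/not-relevant labels, it could be computed as in this sketch. The class and method names are hypothetical.

```java
final class KappaSketch {

    /**
     * Cohen's kappa for two judges labelling the same items as relevant (true)
     * or not relevant (false). Returns NaN when chance agreement is 1.
     */
    static double cohenKappa(boolean[] judgeA, boolean[] judgeB) {
        if (judgeA.length != judgeB.length || judgeA.length == 0) {
            throw new IllegalArgumentException("judges must label the same non-empty item set");
        }
        int n = judgeA.length;
        int agree = 0;
        int aYes = 0;
        int bYes = 0;
        for (int i = 0; i < n; i++) {
            if (judgeA[i] == judgeB[i]) {
                agree++;
            }
            if (judgeA[i]) {
                aYes++;
            }
            if (judgeB[i]) {
                bYes++;
            }
        }
        double observed = (double) agree / n;       // fraction of items the judges agree on
        double pa = (double) aYes / n;              // judge A's rate of "relevant"
        double pb = (double) bYes / n;              // judge B's rate of "relevant"
        double expected = pa * pb + (1 - pa) * (1 - pb); // agreement expected by chance
        return (observed - expected) / (1 - expected);
    }
}
```

A value of 1 means perfect agreement and 0 means agreement no better than chance.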

7. Problems encountered
During the design of the advanced search interface, we realized that not
all of the options that normal search engines provide could be made
available, because each search engine provides a different set of advanced
options.

Some of the advanced search options implemented in the different search
engines are tabulated below. There are other advanced search options, such
as file format and language-specific search, which have not been explored
as part of this project.

Feature                      Google             MSN    Altavista   Tadpole
Conjunction                  Yes                Yes    Yes         Yes
Disjunction                  Yes                Yes    Yes         Yes
Negation                     Yes                Yes    Yes         Yes
Phrase Search                Yes                Yes    Yes         Yes
Number of results per page   No (for the API)   No     Yes         No

Another major issue we faced was finding an optimal algorithm for
implementing minimum cost bipartite matching. We chose to implement the
Hungarian method, but in retrospect we think other, more efficient
algorithms would have been better.

8. Conclusion and Future Work
In the context of our project, we have studied some of the trade-offs
involved in the design of meta-search engines. We have observed that the
computational complexity of the ranking algorithms used and the performance
of the meta-search engine are conflicting parameters. A compromise must be
reached between the two, based on the intended applications and the
environment in which the meta-search engine will be used.

Future work involves incorporating more search engines in the study,
studying the performance for the most popular queries published by the
various search engines, incorporating local Kemenization to reduce spam,
and incorporating methods for avoiding mirrored search results.

Bibliography
[1] Cynthia Dwork, Ravi Kumar, Moni Naor, D. Sivakumar. Rank Aggregation
    Methods for the Web. In Proceedings of the Tenth World Wide Web
    Conference, 2001.
[2] Hungarian Method.
    http://www.math.nus.edu.sg/~matcgh/MA3252/lecture_notes/Hungarian.pdf
    http://www.cob.sjsu.edu/anaya_j/HungMeth.htm
[3] http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/MetaSearch.html
[4] K. Bharat and M. Henzinger. Improved algorithms for topic distillation
    in a hyperlinked environment. ACM SIGIR, pages 104--111, 1998.
[5] S. Chakrabarti, B. Dom, D. Gibson, R. Kumar, P. Raghavan, S.
    Rajagopalan, and A. Tomkins. Experiments in topic distillation. Proc.
    ACM SIGIR Workshop on Hypertext Information Retrieval on the Web, 1998.
[6] H. P. Young. An axiomatization of Borda's rule. Journal of Economic
    Theory, 9:43--52, 1974.
