Tadpole A Meta search engine and Evaluation of ranking strategies

					                         Tadpole: A Meta search engine
                   Evaluation of Meta Search ranking strategies

          Mahathi S Mahabhashyam                                  Pavan Singitham

Abstract

In this write-up, we explain the design of Tadpole, a meta-search engine which obtains results from various search engines and aggregates them. We discuss three meta-search ranking strategies, two positional methods and a scaled footrule optimization method, and study the response-time/result-quality trade-offs involved.

1. Introduction

A meta-search engine transmits the user's search simultaneously to several individual search engines and their databases of web pages, and gets results from all the search engines queried. We can thus save a lot of time by initiating the search at a single point, sparing the need to learn and use several separate search engines. This is even more helpful if we are looking for a broad range of results.

In our project, we have implemented a meta-search engine which queries the Google, Altavista and MSN databases. We have provided an interface for searching these search engines, along with several advanced options for phrase search, conjunction, disjunction and negation of the keywords. In order to rank the results obtained, we have made use of three rank aggregation strategies and evaluated the results obtained. Of these, two are positional methods, which use the result's rank in each of the separate search engines to obtain a new rank by simple aggregation. The third is a scaled footrule optimization technique.

2. Motivation

There are primarily two motivating factors behind our developing a meta-search engine. Firstly, the World Wide Web is a huge unstructured corpus of information. Various search engines crawl the WWW from time to time and index the web pages. However, it is virtually impossible for any search engine to have the entire web indexed. Most of the time, a search engine can index only a small portion of the vast set of web pages existing on the Internet. Each search engine crawls the web separately and creates its own database of the content. Therefore, searching more than one search engine at a time enables us to cover a larger portion of the World Wide Web.

Secondly, crawling the web is a long process, which can take more than a month, whereas the content of many web pages keeps changing more frequently. It is therefore important to have the latest updated information, which could be present in any of the search engines.

Meta-search engines help us achieve the aforementioned objectives. However, we need good ranking strategies in order to aggregate the results obtained from the various search engines. Quite often, web sites successfully spam some of the search engines and obtain an unfair rank. By using appropriate rank aggregation strategies, we can prevent such results from appearing in the top results of a meta-search.

Our primary motivation was to develop a simple meta-search engine and study the response-time and performance trade-offs.

3. Previous Work

There are quite a few meta-search engines available on the Internet, which can be categorized as follows:

1. Meta-search engines for serious deep digging. Ex: Surfwax, Copernic Basic
2. Meta-search engines which aggregate the results obtained from various search engines. Ex: Vivisimo, Ixquick
3. Meta-search engines which present results without aggregating them. Ex:

Meta-search engines of the first kind are not available as free software, so their benefits are not reaped by most users. Some of the other issues involved and drawbacks of meta-search engines are discussed in [3].

An aggregation of the results obtained would be more useful than just dumping the raw results. For such an aggregation, Dwork et al. [1] have suggested several rank aggregation methods for the web, broadly categorized as Borda's positional methods, footrule/scaled footrule optimization methods, and Markov chain methods. They also suggest a local Kemenization technique, which brings the results that are ranked higher by the majority of the search engines to the top of the meta-search ranking list. This is effective in avoiding spam.

4. Organization

The report is organized as follows: Section 5 discusses the architecture and design of Tadpole, the meta-search engine developed by us. Section 6 gives a study of the trade-offs involved. In Section 7, we describe a few problems we encountered during the project. Section 8 gives the conclusion and future work.

5. Architecture of Tadpole

When a user issues a search request, multiple threads are created in order to fetch the results from the various search engines. Each of these threads is given a time limit of 3 seconds to return its results, failing which a timeout occurs for that thread.

[Figure 1: Parallel processes query different search engines (SE#) and obtain the results as an array of TreeMaps; the ranking algorithm turns these into the aggregated results, a TreeMap sorted on rank.]

Each process converts the given query to the format specific to the search engine it is dealing with. This request is sent to the search engine via the Java URL object, and the results are obtained in the form of an HTML page. This HTML results page is parsed by the process and, for each result, the URL, Title, Description, Rank and SearchSource are stored, creating a Result object. These results are entered into a TreeMap data structure with the URL as the key and the Result object as the value.
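
The fetch stage described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Tadpole's actual code: it uses the standard `java.util.concurrent` executor instead of hand-managed threads, and the names `Result`, `EngineClient` and `ParallelFetchSketch` are hypothetical. The HTML fetching and parsing step is hidden behind the `EngineClient` interface.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical holder for the five fields named in the text.
final class Result {
    final String url;
    final String title;
    final String description;
    final int rank;
    final String searchSource;

    Result(String url, String title, String description, int rank, String searchSource) {
        this.url = url;
        this.title = title;
        this.description = description;
        this.rank = rank;
        this.searchSource = searchSource;
    }
}

// Hypothetical abstraction over one search engine. A real implementation would
// build the engine-specific query URL, fetch the HTML page and parse it.
interface EngineClient {
    List<Result> search(String query) throws Exception;
}

final class ParallelFetchSketch {
    // Queries all engines in parallel and returns one URL-keyed TreeMap per engine,
    // in the same order as the input list. An engine that fails or exceeds the
    // time limit contributes an empty map.
    static List<TreeMap<String, Result>> fetchAll(List<EngineClient> engines,
                                                  String query,
                                                  long timeoutMillis) {
        List<TreeMap<String, Result>> maps = new ArrayList<>();
        if (engines.isEmpty()) {
            return maps;
        }
        ExecutorService pool = Executors.newFixedThreadPool(engines.size());
        try {
            List<Callable<TreeMap<String, Result>>> tasks = new ArrayList<>();
            for (EngineClient engine : engines) {
                tasks.add(() -> {
                    TreeMap<String, Result> byUrl = new TreeMap<>();
                    for (Result r : engine.search(query)) {
                        byUrl.putIfAbsent(r.url, r); // keep the first entry seen per URL
                    }
                    return byUrl;
                });
            }
            // invokeAll cancels every task that is still running when the timeout expires.
            for (Future<TreeMap<String, Result>> f
                    : pool.invokeAll(tasks, timeoutMillis, TimeUnit.MILLISECONDS)) {
                try {
                    maps.add(f.get());
                } catch (CancellationException | ExecutionException e) {
                    maps.add(new TreeMap<>()); // timed out or failed
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        } finally {
            pool.shutdownNow();
        }
        return maps;
    }
}
```

With the 3-second limit mentioned above, a call would look like `ParallelFetchSketch.fetchAll(engines, query, 3000)`.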
The GUI also provides advanced search options for entering Boolean queries and phrase searches, selecting the number of results per search engine, and selecting the search engines to be queried.

5.1 Design Decisions

During the design of Tadpole, various design decisions were taken. Some of them are listed below.

Why TreeMap?

The TreeMap data structure combines the nice features of a tree (low search and retrieval time) and a Map (easy association). By storing the results with the URL as the key, we can retrieve a result in O(log n) time while removing duplicates and merging them in the ranking algorithm. This gives a considerable speed-up when we have hundreds of results from each search engine. The TreeMaps thus obtained from each of the threads are then inserted in an array and passed on to the ranking algorithm. The ranking algorithm then returns a TreeMap sorted on rank.

Why these three ranking strategies?

The positional methods are computationally more efficient. They give good precision when compared to just aggregating the results without any ranking. The scaled footrule method is computationally more complex, but is known to give much better performance. It is also useful in reducing spam to an extent. As the basic idea of this project was to study the trade-offs involved, we wanted a gradation in the level of computational complexity and performance, and so we chose these three rank aggregation methods.

5.2 Ranking Aggregation Methods Implemented

Take the Best Rank

In this algorithm, we try to place a URL at the best rank it gets in any of the search engine rankings. That is,

$$\mathrm{MetaRank}(x) = \min\bigl(\mathrm{Rank}_1(x), \mathrm{Rank}_2(x), \ldots, \mathrm{Rank}_n(x)\bigr)$$

Clashes are avoided by an ordering of the search engines based on popularity. That is, if two results claim the same position in the meta-rank list, the result from a more popular search engine (say, Google) is preferred to the result from a less popular one.

Borda's Positional Method

In this algorithm, the MetaRank of a URL is obtained by computing the Lp-norm of its ranks in the different search engines:

$$\mathrm{MetaRank}(x) = \Bigl(\sum_{i=1}^{n} \mathrm{Rank}_i(x)^{p}\Bigr)^{1/p}$$

In our algorithm, we have considered the L1-norm, which is the sum of all the ranks in the different search engine result lists. Clashes are again avoided by search engine popularity. The search source for a URL, which is displayed in the meta-search results, is set as the search engine in which the URL is ranked the best.

Scaled Footrule Optimization Method

In this algorithm, the scaled footrule distances are used to rank the various results. Let T_1, T_2, ..., T_k be the partial lists obtained from the various search engines, and let their union be S. A weighted bipartite graph for scaled footrule optimization (C, P, W) is defined as

C = set of nodes to be ranked
P = set of positions available
W(c, p) = the scaled footrule distance (from the T_i's) of a ranking that places element c at position p, given by

$$W(c,p) = \sum_{i=1}^{k} \left| \frac{T_i(c)}{|T_i|} - \frac{p}{n} \right|$$

where n here is the number of results to be ranked (not the number of search engines) and |T_i| is the cardinality of T_i.

Computation of footrule aggregation for partial lists is NP-hard [1]; hence the use of the scaled footrule distance measure. This problem can be converted to a minimum-cost perfect matching in the bipartite graph described above. There are various algorithms for finding a minimum-cost perfect matching in bipartite graphs. We have used the Hungarian method.
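
The scoring rules of this section can be illustrated with a small sketch. It is not the report's implementation: the class name `RankAggregationSketch` is hypothetical, each engine's list is assumed to be a map from URL to 1-based rank, and the list of engines is assumed to be ordered from most to least popular. The report does not say how Borda's method treats a URL that an engine did not return; the sketch assumes such a URL is charged rank |T_i| + 1. The footrule weight sums only over the lists that contain the URL, which is likewise an assumption.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;
import java.util.function.ToIntFunction;

final class RankAggregationSketch {
    // lists.get(0) is assumed to be the most popular engine; each map is URL -> 1-based rank.
    private final List<Map<String, Integer>> lists;

    RankAggregationSketch(List<Map<String, Integer>> lists) {
        this.lists = lists;
    }

    // "Take the Best Rank": minimum rank over the engines that returned the URL.
    int minRank(String url) {
        int best = Integer.MAX_VALUE;
        for (Map<String, Integer> t : lists) {
            Integer r = t.get(url);
            if (r != null) {
                best = Math.min(best, r);
            }
        }
        return best;
    }

    // Borda with the L1 norm: the sum of the ranks.
    // ASSUMPTION: a URL missing from list T_i is charged rank |T_i| + 1.
    int bordaL1(String url) {
        int sum = 0;
        for (Map<String, Integer> t : lists) {
            sum += t.getOrDefault(url, t.size() + 1);
        }
        return sum;
    }

    // Scaled footrule weight W(c, p) for placing url at 1-based position p of n.
    // ASSUMPTION: only the lists that contain the URL contribute.
    double footruleWeight(String url, int p, int n) {
        double w = 0.0;
        for (Map<String, Integer> t : lists) {
            Integer r = t.get(url);
            if (r != null) {
                w += Math.abs((double) r / t.size() - (double) p / n);
            }
        }
        return w;
    }

    // Index of the most popular engine in which the URL attains its best rank.
    int bestSource(String url) {
        int best = minRank(url);
        for (int i = 0; i < lists.size(); i++) {
            Integer r = lists.get(i).get(url);
            if (r != null && r == best) {
                return i;
            }
        }
        return lists.size();
    }

    // Orders the union of all URLs by the given score; clashes go to the more popular engine.
    List<String> order(ToIntFunction<String> score) {
        TreeSet<String> union = new TreeSet<>();
        for (Map<String, Integer> t : lists) {
            union.addAll(t.keySet());
        }
        List<String> urls = new ArrayList<>(union);
        urls.sort(Comparator.comparingInt(score).thenComparingInt(this::bestSource));
        return urls;
    }
}
```

For example, `sketch.order(sketch::minRank)` gives the Take the Best Rank ordering and `sketch.order(sketch::bordaL1)` the Borda L1 ordering. The footrule weights are not sorted directly; they fill the cost matrix handed to the matching step.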
The Hungarian method proceeds as follows:
   - Obtain the reduced cost matrix from the given cost matrix by subtracting the minimum of each row from the elements of that row, and then the minimum of each column from the elements of that column.
   - Try to cover all the zeroes with the minimum number of horizontal and vertical lines.
   - If the number of lines equals the size of the matrix, a solution can be read off from the zeroes.
   - If all the zeroes are covered with fewer lines than the size of the matrix, find the minimum value that is uncovered.
   - Subtract it from all uncovered values and add it to the values at the intersections of the lines.
   - Repeat until a solution is obtained.
   A detailed description of the algorithm is provided in [2].

6. Evaluation of Ranking Strategies

6.1 Algorithmic Complexity

The first parameter for testing the three ranking strategies is the time complexity of the algorithms. The positional methods, MinRanker and Borda's positional method, take linear time; that is, they have a complexity of O(n). Scaled footrule optimization can be solved using the Hungarian algorithm for bipartite matching.

6.2 Rank Aggregation Time

The aggregation times of the various ranking strategies were measured with respect to each other and with normal search engines. The evaluation was carried out with respect to the following set of 38 queries, which were previously used in other studies [1,4,5]:

        affirmative action, alcoholism, amusement parks, architecture, bicycling, blues, cheese, citrus groves, classical guitar, computer vision, cruises, Death Valley, field hockey, gardening, graphic design, Gulf war, HIV, java, Lipari, lyme disease, mutual funds, National parks, parallel architecture, Penelope Fitzgerald, recycling cans, rock climbing, San Francisco, Shakespeare, stamp collecting, sushi, table tennis, telecommuting, Thailand tourism, vintage cars, volcano, zen buddhism, and Zener.

The results are summarized below:

[Figure: Rank aggregation time (in milliseconds) for Naïve Ranking, Borda's Ranking and Foot Rule Ranking]

Average Rank Aggregation Times
Naïve Ranking - 18.6 msec
Borda’s Ranking - 51.2 msec
FootRule Ranking - 161.5 msec

We observe that the rank aggregation times for the footrule ranking are, on average, about three times those for Borda's positional ranking.

6.3 Overlap across search engines – Relative Search Engine Performance

Among the top 10 results obtained for each query, we found the results that overlap across multiple search engines. An interesting observation is which search engines rank the overlapping results better. The intuition behind such a measure is that a search engine which ranks the overlapping results better can be regarded as a better search engine, considering that the overlapping results are more relevant.

[Figure: Performance of search engines for overlapped results (values shown: 22%, 59%)]

6.4 Performance of the various rank aggregation methods

In evaluating the performance of the ranking strategies for all the queries, we have chosen precision as a good measure of relative performance, because all the ranking strategies work on the same set of results and try to get the most relevant ones to the top. Hence, a strategy that has a higher precision at the top can be rated better from the user's perspective.

We have plotted the precision of the ranking strategies with respect to both the number of search results and the recall.

In considering the recall, we have taken the total number of relevant documents based on user evaluation of all the top 10 results retrieved by each search engine. The recall is calculated as the number of relevant documents retrieved divided by the total number of relevant results thus judged.

We have taken the relevance feedback from two different judges. The Kappa measure of this relevance feedback is 0.78. In the following graphs, we present the results for two out of the 38 queries run. We also present the average of the results obtained over the 38 queries.

6.4.1 Precision with respect to Number of Results returned
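
As a rough illustration of the two measures used in this section, the sketch below computes precision over the first k results of a ranked list, and recall against a set of URLs judged relevant. The class and method names are hypothetical. Dividing precision by the number of results actually examined, rather than by k when the list is shorter than k, is an assumption.

```java
import java.util.List;
import java.util.Set;

final class PrecisionRecallSketch {
    // Number of relevant URLs among the first k entries of the ranked list.
    private static int hits(List<String> ranked, Set<String> relevant, int k) {
        int limit = Math.min(k, ranked.size());
        int count = 0;
        for (int i = 0; i < limit; i++) {
            if (relevant.contains(ranked.get(i))) {
                count++;
            }
        }
        return count;
    }

    // Precision over the first k results: relevant retrieved / results examined.
    static double precisionAt(List<String> ranked, Set<String> relevant, int k) {
        int limit = Math.min(k, ranked.size());
        return limit == 0 ? 0.0 : (double) hits(ranked, relevant, k) / limit;
    }

    // Recall over the first k results: relevant retrieved / all results judged relevant.
    static double recallAt(List<String> ranked, Set<String> relevant, int k) {
        return relevant.isEmpty() ? 0.0 : (double) hits(ranked, relevant, k) / relevant.size();
    }
}
```

Evaluating both functions for k = 1, 2, ... over one ranked list yields the points of a precision-versus-number-of-results curve and a precision-versus-recall curve for that ranking strategy.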

[Figures: Precision versus Number of Results for the Borda Method, Naïve Ranking and Foot rule, in three panels; the third panel is titled "Average Precision over 38 queries"]

It can be observed that, on average, the footrule distance rank aggregation method gives better precision for the given set of results. Also, the easily computable Borda's method does a good job when compared to the Naïve ranking method.

6.4.2 Precision vs. Recall
[Figures: Precision versus Recall for the Naïve Ranker, Borda's Ranker and Foot rule, in two panels titled "Query: Alcoholism" and "Query: Gardening"]



A similar observation can be made with respect to the precision at a given recall for each of the
ranking strategies.

7. Problems encountered

During the design of the advanced search interface, we realized that not all the options that normal search engines provide could be made available, because each search engine provides a different set of advanced options. Some of the advanced search options implemented in the different search engines are tabulated below. There are other advanced search options, like file format and language-specific search, which have not been explored as part of this project.

Feature                      Google             MSN    Altavista    Tadpole
Conjunction                  Yes                Yes    Yes          Yes
Disjunction                  Yes                Yes    Yes          Yes
Negation                     Yes                Yes    Yes          Yes
Phrase Search                Yes                Yes    Yes          Yes
Number of results per page   No (for the API)   No     Yes          No

Another major issue we faced was finding an optimal algorithm for implementing minimum-cost bipartite matching. We chose to implement the Hungarian method, but in retrospect we think other, more efficient algorithms would have been better.
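
For reference, the minimum-cost perfect matching discussed in this section can be written quite compactly. The sketch below is not the implementation used in the project: it is the well-known potentials-based O(n^3) formulation of the Hungarian method, a different but equivalent formulation from the line-covering procedure listed earlier in the report. The class name `HungarianSketch` is hypothetical. The input is assumed to be a square cost matrix, with rows standing for the candidate URLs and columns for the positions.

```java
import java.util.Arrays;

final class HungarianSketch {
    // Returns colOfRow, where colOfRow[r] is the column assigned to row r in a
    // minimum-cost perfect matching of the square matrix cost.
    static int[] solve(double[][] cost) {
        int n = cost.length;
        double[] u = new double[n + 1];   // row potentials (1-based)
        double[] v = new double[n + 1];   // column potentials (1-based)
        int[] match = new int[n + 1];     // match[j] = row matched to column j, 0 if free
        int[] way = new int[n + 1];       // previous column on the augmenting path

        for (int i = 1; i <= n; i++) {
            match[0] = i;                 // column 0 is a virtual start holding the new row
            int j0 = 0;
            double[] minv = new double[n + 1];
            Arrays.fill(minv, Double.POSITIVE_INFINITY);
            boolean[] used = new boolean[n + 1];
            do {
                used[j0] = true;
                int i0 = match[j0];
                int j1 = 0;
                double delta = Double.POSITIVE_INFINITY;
                // Relax reduced costs from row i0 and pick the cheapest unused column.
                for (int j = 1; j <= n; j++) {
                    if (used[j]) {
                        continue;
                    }
                    double cur = cost[i0 - 1][j - 1] - u[i0] - v[j];
                    if (cur < minv[j]) {
                        minv[j] = cur;
                        way[j] = j0;
                    }
                    if (minv[j] < delta) {
                        delta = minv[j];
                        j1 = j;
                    }
                }
                // Shift the potentials so that the chosen edge becomes tight.
                for (int j = 0; j <= n; j++) {
                    if (used[j]) {
                        u[match[j]] += delta;
                        v[j] -= delta;
                    } else {
                        minv[j] -= delta;
                    }
                }
                j0 = j1;
            } while (match[j0] != 0);
            // Flip the matching along the augmenting path back to the start.
            do {
                int j1 = way[j0];
                match[j0] = match[j1];
                j0 = j1;
            } while (j0 != 0);
        }

        int[] colOfRow = new int[n];
        for (int j = 1; j <= n; j++) {
            colOfRow[match[j] - 1] = j - 1;
        }
        return colOfRow;
    }
}
```

For rank aggregation, entry (c, p) of the matrix would hold the scaled footrule weight of placing URL c at position p, and the returned array would give the position assigned to each URL.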

8. Conclusion and Future Work

In the context of our project, we have studied some trade-offs that are involved in the design of meta-search engines. We have observed that the computational complexity of the ranking algorithms used and the performance of the meta-search engine are conflicting parameters. A compromise must be achieved between these two, based on the perceived applications and the environment in which the meta-search engine will be used.

Future work involves incorporating more search engines in the study, studying the performance for the most popular queries published by the various search engines, incorporating local Kemenization to reduce spam, and incorporating methods for avoiding mirrored search results.

Bibliography

[1] Cynthia Dwork, Ravi Kumar, Moni Naor, D. Sivakumar. Rank Aggregation Methods for the Web. In Proceedings of the Tenth World Wide Web Conference, 2001.
[2] Hungarian Method.
52/lecture_notes/Hungarian.pdf
htm
[3]
/Guides/Internet/MetaSearch.html
[4] K. Bharat and M. Henzinger. Improved algorithms for topic distillation in a hyperlinked environment. ACM SIGIR, pages 104--111, 1998.
[5] S. Chakrabarti, B. Dom, D. Gibson, R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Experiments in topic distillation. Proc. ACM SIGIR Workshop on Hypertext Information Retrieval on the Web, 1998.
[6] H. P. Young. An axiomatization of Borda's rule. Journal of Economic Theory, 9:43--52, 1974.
