Tadpole: A Meta search engine
Evaluation of Meta Search ranking strategies
Mahathi S Mahabhashyam Pavan Singitham
Abstract

In this write-up, we explain the design of Tadpole, a meta-search engine which obtains results from various search engines and aggregates them. We discuss three meta-search ranking strategies (two positional methods and a scaled footrule optimization method) and study the response-time/result-quality trade-offs involved.

1. Introduction

A meta-search engine transmits the user's search simultaneously to several individual search engines and their databases of web pages, and gets results from all the search engines queried. We can thus save a lot of time by initiating the search at a single point, sparing the need to use and learn several separate search engines. This is even more helpful if we are looking for a broad range of results.

In our project, we have implemented a meta-search engine which queries the Google, Altavista and MSN databases. We have provided an interface for searching these search engines, along with several advanced options for phrase search, conjunction, disjunction and negation of the keywords. In order to rank the results obtained, we have made use of three rank aggregation strategies and evaluated the results obtained. Of these, two are positional methods, which use a result's rank in each of the separate search engines to obtain a new rank by simple aggregation. The third is a scaled footrule optimization technique.

2. Motivation

There are primarily two motivating factors behind our developing a meta-search engine. Firstly, the World Wide Web is a huge unstructured corpus of information. Various search engines crawl the WWW from time to time and index the web pages. However, it is virtually impossible for any search engine to have the entire web indexed. Most of the time, a search engine can index only a small portion of the vast set of web pages existing on the Internet. Each search engine crawls the web separately and creates its own database of the content. Therefore, searching more than one search engine at a time enables us to cover a larger portion of the World Wide Web.

Secondly, crawling the web is a long process, which can take more than a month, whereas the content of many web pages changes more frequently. It is therefore important to have the latest updated information, which could be present in any of the search engines.

Meta-search engines help us achieve the aforementioned objectives. However, we need good ranking strategies in order to aggregate the results obtained from the various search engines. Quite often, web sites successfully spam some of the search engines and obtain an unfair rank. By using appropriate rank aggregation strategies, we can prevent such results from appearing among the top results of a meta-search.

Our primary motivation was to develop a simple meta-search engine and study the response-time and performance trade-offs involved.

There are quite a few meta-search engines available on the Internet, which can be categorized as follows:

1. Meta-search engines for serious deep digging. Ex: Surfwax, Copernic Basic
2. Meta-search engines which aggregate the results obtained from various search engines. Ex: Vivisimo, Ixquick
3. Meta-search engines which present results without aggregating them. Ex:

Meta-search engines of the first kind are not available as free software, so their benefits are not reaped by most users. Some of the other issues involved and drawbacks of meta-search engines are provided in [3].

An aggregation of the results obtained would be more useful than just dumping the raw results. For such an aggregation, Ravi Kumar et al [1] have suggested several rank aggregation methods for the web, broadly categorized as Borda's positional methods, footrule/scaled footrule optimization methods, and Markov chain methods for rank aggregation. They also suggest a local Kemenization technique, which brings the results that are ranked higher by the majority of the search engines to the top of the meta-search ranking list. This is effective in avoiding spam.

4. Organization

The rest of the report is organized as follows: Section 5 discusses the architecture and design of Tadpole, the meta-search engine developed by us. Section 6 gives a study of the trade-offs involved. In Section 7, we describe a few problems we encountered during the project. Section 8 gives the conclusion and future work.

5. Architecture of Tadpole

When a user issues a search request, multiple threads are created in order to fetch the results from the various search engines. Each of these threads is given a time limit of 3 seconds to return its results, failing which a time-out occurs and the thread is terminated.

[Figure: Parallel processes query the different search engines (SE#) and obtain the results; the resulting array of TreeMaps is aggregated into a sorted TreeMap of results.]

Each process converts the given query to the format specific to the search engine it is dealing with. This request is sent to the search engine via the Java URL object and the results are obtained in the form of an HTML page. This HTML results page is parsed by the process and, for each result, the URL, Title, Description, Rank and SearchSource are stored, creating a Result object. These results are entered into a TreeMap data structure with the URL as the key and the Result object as the value.
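As a rough illustration of the fetch step described above, the following Python sketch queries several engines in parallel under a shared time limit and stores each engine's results in a map keyed by URL. It is not Tadpole's code, which used Java threads and TreeMap. The names fetch_stub and query_engines and the (name, delay, urls) engine tuples are hypothetical, a plain dict stands in for the TreeMap, and the network request and HTML parsing are replaced by a stub that simply sleeps.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from dataclasses import dataclass

TIME_LIMIT_SECONDS = 3.0  # the per-engine limit stated in the text


@dataclass
class Result:
    url: str
    title: str
    description: str
    rank: int
    search_source: str


def fetch_stub(engine, query, delay, urls):
    """Hypothetical stand-in for 'send the query and parse the HTML page'."""
    time.sleep(delay)  # simulates network latency
    # A dict keyed by URL plays the role of the TreeMap in the text.
    return {
        url: Result(url, f"{query} ({engine})", "", rank, engine)
        for rank, url in enumerate(urls, start=1)
    }


def query_engines(query, engines, time_limit=TIME_LIMIT_SECONDS):
    """Query all engines in parallel; skip any engine that misses the limit.

    `engines` is a list of (name, simulated_delay, urls) tuples.
    Returns a list of URL -> Result maps, one per engine that answered in time.
    """
    pool = ThreadPoolExecutor(max_workers=len(engines))
    futures = [
        pool.submit(fetch_stub, name, query, delay, urls)
        for name, delay, urls in engines
    ]
    deadline = time.monotonic() + time_limit
    maps = []
    for future in futures:
        remaining = max(0.0, deadline - time.monotonic())
        try:
            maps.append(future.result(timeout=remaining))
        except FutureTimeout:
            pass  # timed out: this engine's results are left out
    # Python threads cannot be killed, so the late worker is merely abandoned.
    pool.shutdown(wait=False)
    return maps
```

The list returned by query_engines corresponds to the array of per-engine maps that is handed to the ranking algorithm.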
The GUI also provides advanced search options for entering Boolean queries and phrase searches, selecting the number of results per search engine, and selecting the search engines to be queried.

5.1 Design Decisions

During the design of Tadpole, various design decisions were taken. Some of them are listed below.

Why TreeMap?

The TreeMap data structure combines the useful features of a tree (low search and retrieval time) and a map (easy association). By storing the results with the URL as the key, we can retrieve a result in O(log n) time while removing duplicates and merging them in the ranking algorithm. This gives a considerable speed-up when we have hundreds of results from each search engine. The TreeMaps thus obtained from each of the threads are then inserted in an array and passed on to the ranking algorithm. The ranking algorithm then returns a TreeMap sorted on rank.

Why these three ranking strategies?

The positional methods are computationally more efficient. They give good precision when compared to simply pooling the results without any ranking. The scaled footrule method is computationally more complex, but has been shown to give much better performance. It is also useful in reducing spam to an extent. As the basic idea of this project was to study the trade-offs involved, we wanted a gradation in the level of computational complexity and performance, and so we chose these three rank aggregation methods.

5.2 Ranking Aggregation Methods Implemented

Take the Best Rank

In this algorithm, we try to place a URL at the best rank it gets in any of the search engine rankings. That is,

\[ \mathrm{MetaRank}(x) = \min\bigl(\mathrm{Rank}_1(x), \mathrm{Rank}_2(x), \ldots, \mathrm{Rank}_k(x)\bigr) \]

where Rank_i(x) is the rank of x in the i-th search engine and k is the number of search engines queried. Clashes are avoided by an ordering of the search engines based on popularity. That is, if two results claim the same position in the meta-rank list, the result from a more popular search engine (say, Google) is preferred to the result from a less popular one.

Borda's Positional Method

In this algorithm, the MetaRank of a URL is obtained by computing the Lp-norm of its ranks in the different search engines:

\[ \mathrm{MetaRank}(x) = \Bigl( \sum_{i=1}^{k} \mathrm{Rank}_i(x)^{p} \Bigr)^{1/p} \]

In our algorithm, we have considered the L1-norm, which is the sum of all the ranks in the different search engine result lists. Clashes are again avoided by search engine popularity. The search source for a URL, which is displayed in the meta-search results, is set to the search engine in which the URL is ranked the best.

Scaled Footrule Optimization Method

In this algorithm, the scaled footrule distances are used to rank the various results. Let T_1, T_2, ..., T_k be the partial lists obtained from the various search engines, and let their union be S. A weighted bipartite graph (C, P, W) for scaled footrule optimization is defined as follows:

C = set of nodes to be ranked
P = set of positions available
W(c, p) = the scaled footrule distance (from the T_i's) of a ranking that places element c at position p, given by

\[ W(c, p) = \sum_{i=1}^{k} \left| \frac{T_i(c)}{|T_i|} - \frac{p}{n} \right| \]

where T_i(c) is the position of c in T_i, n is the number of results to be ranked, and |T_i| is the cardinality of T_i.

Computing the footrule aggregation for partial lists is NP-hard, hence the use of the scaled footrule distance measure. This problem can be converted to a minimum cost perfect matching problem in the bipartite graph described above. There are various algorithms for finding a minimum cost perfect matching in bipartite graphs. We have used the Hungarian method.
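The three aggregation methods of Section 5.2 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not Tadpole's implementation. The function names are hypothetical, and each input list is assumed to be given in order of engine popularity (most popular first). A URL missing from a list is charged rank len(list) + 1 in the Borda sum, and lists that do not contain a URL contribute nothing to its scaled footrule weight; the text specifies neither choice. An exhaustive search over permutations replaces the Hungarian method, so it is only usable for a handful of URLs.

```python
from itertools import permutations


def best_rank(rankings):
    """Take the Best Rank: order URLs by their minimum rank in any list.

    `rankings` is a list of URL lists, most popular engine first, so a tie on
    rank is broken in favour of the more popular engine.
    """
    best = {}
    for engine, ranking in enumerate(rankings):
        for pos, url in enumerate(ranking, start=1):
            if url not in best or pos < best[url][0]:
                best[url] = (pos, engine)
    return sorted(best, key=lambda url: best[url])


def borda_l1(rankings):
    """Positional method with p = 1: order URLs by the sum of their ranks."""
    # Assumption: ties are broken by the Take-the-Best-Rank order.
    tie_break = {url: i for i, url in enumerate(best_rank(rankings))}

    def total(url):
        # Assumption: a URL absent from a list is charged rank len(list) + 1.
        return sum(r.index(url) + 1 if url in r else len(r) + 1 for r in rankings)

    return sorted(tie_break, key=lambda url: (total(url), tie_break[url]))


def scaled_footrule(rankings):
    """Assign URLs to positions 1..n so that the sum of W(c, p) is minimal."""
    urls = sorted({url for r in rankings for url in r})
    n = len(urls)

    def weight(c, p):
        # W(c, p) = sum over lists containing c of | T_i(c)/|T_i| - p/n |
        return sum(abs((r.index(c) + 1) / len(r) - p / n) for r in rankings if c in r)

    def cost(order):
        return sum(weight(c, p) for p, c in enumerate(order, start=1))

    # Brute force over all n! orders stands in for the Hungarian method.
    return list(min(permutations(urls), key=cost))
```

For example, with the two partial lists [["a", "b"], ["c", "a"]], both positional methods place "a" first, while the scaled footrule ordering may differ because each rank is scaled by the length of its list.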
The Hungarian method proceeds as follows:
- Obtain the reduced cost matrix from the given cost matrix by subtracting the minimum of each row from every element of that row, and then the minimum of each column from every element of that column.
- Try to cover all the zeroes with the minimum number of horizontal and vertical lines.
- If the number of lines equals the size of the matrix, find the solution.
- If all of the zeroes are covered with fewer lines than the size of the matrix, find the minimum value that is uncovered.
- Subtract it from all uncovered values and add it to the value(s) at the intersections of the lines.
- Repeat until a solution is obtained.

A detailed description of the algorithm is provided in [2].

6. Evaluation of Ranking Strategies

6.1 Algorithmic Complexity

The first parameter for testing the three ranking strategies is the time complexity of the algorithms. The positional methods (MinRanker and Borda's positional method) take linear time, that is, they have a complexity of O(n). Scaled footrule optimization can be solved using the Hungarian algorithm for bipartite matching.

6.2 Rank Aggregation Time

The aggregation times of the various ranking strategies were measured with respect to each other and with normal search engines. The evaluation was carried out with respect to the following set of 38 queries, which were previously used in other studies [1,4,5]:

affirmative action, alcoholism, amusement parks, architecture, bicycling, blues, cheese, citrus groves, classical guitar, computer vision, cruises, Death Valley, field hockey, gardening, graphic design, Gulf war, HIV, java, Lipari, lyme disease, mutual funds, National parks, parallel architecture, Penelope Fitzgerald, recycling cans, rock climbing, San Francisco, Shakespeare, stamp collecting, sushi, table tennis, telecommuting, Thailand tourism, vintage cars, volcano, zen buddhism, and Zener.

The results are summarized below:

[Figure: Rank aggregation time (in milliseconds); legend: Borda's Ranking, Foot Rule Ranking]

Average Rank Aggregation Times
Naïve Ranking - 18.6 msec
Borda's Ranking - 51.2 msec
FootRule Ranking - 161.5 msec
We observe that the rank aggregation times for the foot rule ranking are, on average, about three times those for Borda's ranking.

6.3 Overlap across search engines – Relative Search Engine Performance

Among the top 10 results obtained for each query, we identified the results that overlap across multiple search engines. An interesting observation is which search engines rank the overlapping results better. The intuition behind such a measure is that a search engine which ranks the overlapping results better can be regarded as a better search engine, considering that the overlapping results are more relevant.

6.4 Performance of the various rank aggregation methods

In evaluating the performance of the ranking strategies for all the queries, we have chosen precision as the measure of relative performance, because all the ranking strategies work on the same set of results and try to get the most relevant ones to the top. Hence, a strategy that has a higher precision at the top can be rated better from the user's perspective.

[Figure: Performance of search engines]
We have plotted the precision of the ranking strategies with respect to both the number of search results and the recall.

In considering the recall, we have taken the total number of relevant documents based on user evaluation of all the top 10 results retrieved by each search engine. The recall is calculated as the number of relevant documents retrieved divided by the total number of relevant results thus judged.

We have taken the relevance feedback from two different judges. The Kappa measure of this relevance feedback is 0.78. In the following graphs, we present the results for two out of the 38 queries run. We also present the average of the results obtained over the 38 queries.

6.4.1 Precision with respect to Number of Results returned

[Figure: Precision versus number of results (legend: Naïve Ranking, Foot rule) for two individual queries, and the average precision over the 38 queries]
It can be observed that, on average, the footrule distance rank aggregation method gives better precision for the given set of results. Also, the easily computable Borda's method does a good job when compared to the Naïve ranking method.

6.4.2 Precision vs. Recall

[Figure: Precision versus recall; legend: Naïve Ranker, Foot rule]
A similar observation can be made with respect to the precision at a given recall for each of the
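The evaluation measures used in Section 6.4 can be sketched in Python as follows. This is a minimal sketch with hypothetical function names. The report says only "Kappa measure", so taking it to be Cohen's kappa over two judges' binary relevance labels is an assumption.

```python
def precision_at(ranked, relevant, k):
    """Fraction of the top-k ranked URLs that were judged relevant."""
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(1 for url in top if url in relevant) / len(top)


def recall_at(ranked, relevant, k):
    """Relevant URLs retrieved in the top k / all URLs judged relevant."""
    if not relevant:
        return 0.0
    return sum(1 for url in ranked[:k] if url in relevant) / len(relevant)


def cohen_kappa(judge_a, judge_b):
    """Cohen's kappa for two judges' 0/1 labels on the same list of results.

    Assumption: the 'Kappa measure' in the text is Cohen's kappa.
    Undefined (division by zero) when chance agreement is exactly 1.
    """
    n = len(judge_a)
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    p_a = sum(judge_a) / n  # share of results judge A marked relevant
    p_b = sum(judge_b) / n  # share of results judge B marked relevant
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # agreement by chance
    return (observed - expected) / (1 - expected)
```

Here `relevant` is the pooled set of URLs judged relevant among the top 10 results of each search engine, as described in Section 6.4.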
7. Problems encountered

During the design of the advanced search interface, we realized that not all the options that normal search engines provide could be made available, because each search engine provides a different set of advanced options. Some of the advanced search options implemented in the different search engines are tabulated below. There are other advanced search options, like file format and language-specific search, which have not been explored as part of this project.

Feature                      Google             MSN    Altavista   Tadpole
Conjunction                  Yes                Yes    Yes         Yes
Disjunction                  Yes                Yes    Yes         Yes
Negation                     Yes                Yes    Yes         Yes
Phrase Search                Yes                Yes    Yes         Yes
Number of results per page   No (for the API)   No     Yes         No

Another major issue we faced was finding an efficient algorithm for implementing minimum cost bipartite matching. We chose to implement the Hungarian method, but in retrospect we think other, more efficient algorithms would have been better.
8. Conclusion and Future Work

In the context of our project, we have studied some trade-offs that are involved in the design of meta-search engines. We have observed that the computational complexity of the ranking algorithms used and the performance of the meta-search engine are conflicting parameters. A compromise must be achieved between these two, based on the perceived applications and the environment in which the meta-search engine will be used.

Future work involves incorporating more search engines in the study, studying the performance for the most popular queries published by the various search engines, incorporating local Kemenization to eliminate spam, and incorporating methods for avoiding mirrored search results.

Bibliography

[1] Cynthia Dwork, Ravi Kumar, Moni Naor, D. Sivakumar. Rank Aggregation Methods for the Web. In Proceedings of the Tenth World Wide Web Conference, 2001.
[2] Hungarian Method. http://www.math.nus.edu.sg/~matcgh/MA3252/lecture_notes/Hungarian.pdf and http://www.cob.sjsu.edu/anaya_j/HungMeth.htm
[3] http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/MetaSearch.html
[4] K. Bharat and M. Henzinger. Improved algorithms for topic distillation in a hyperlinked environment. ACM SIGIR, pages 104--111, 1998.
[5] S. Chakrabarti, B. Dom, D. Gibson, R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Experiments in topic distillation. Proc. ACM SIGIR Workshop on Hypertext Information Retrieval on the Web, 1998.
[6] H. P. Young. An axiomatization of Borda's rule. Journal of Economic Theory, 9:43--52, 1974.