									       Impact of Query Correlation on Web Searching
                                                        Ash Mohammad Abbas
                                                 Department of Computer Engineering
                                          Zakir Husain College of Engineering and Technology
                                          Aligarh Muslim University, Aligarh - 202002, India.




   Abstract: Correlation among queries is an important factor to analyze, as it may
affect the results delivered by a search engine. In this paper, we analyze correlation
among queries and how it affects the information retrieved from the Web. We analyze
two types of queries: (i) queries with embedded semantics, and (ii) queries without
any semantics. In our analysis, we consider parameters such as search latencies and
search relevance. We focus on two major search portals that are mainly used by end
users. Further, we discuss a unified criterion for comparing the performance of the
search engines.

   Index Terms: Query correlation, search portals, Web information retrieval, unified
criteria for comparison, earned points.

                           I. INTRODUCTION

   The Internet, which was originally intended to communicate research activities among
a few universities in the United States, has now become a basic need of life for all
people who can read and write throughout the world. This has become possible only due
to the proliferation of the World Wide Web (WWW), which is now simply called the Web.
The Web has become the largest source of information in all parts of life. Users from
different domains often extract information that fits their needs. The term Web
information retrieval¹ is used for extracting information from the Web.
   Although Web information retrieval has its roots in traditional database systems
[4], [5], the retrieval of information from the Web is more complex than information
retrieval from a traditional database. This is due to subtle differences in their
respective underlying databases².
   In a traditional database, the data is often organized, limited, and static. In
contrast, the Webbase is unorganized, unlimited, and often dynamic. Every second, a
large number of updates are carried out in the Webbase. Moreover, a traditional
database is controlled by a specific operating system, and its data is located either
at a central location or at least at a few known locations, whereas the Webbase is not
controlled by any specific operating system and its data may not reside either at a
central site or at a few known locations. Further, the Webbase can be thought of as a
collection of a large number of traditional databases of various organizations. The
expectations of a user searching for information on the Web are much higher than those
of a user who is simply retrieving some information from a traditional database. This
makes the task of extracting information from the Web somewhat challenging [1].
   Since Web searching is an important activity and the results so obtained may affect
decisions and directions for individuals as well as for organizations, it is of utmost
importance to analyze the parameters or constituents involved in it. Many researchers
have analyzed different issues pertaining to Web searching, including index quality
[2], user-effort measures [3], Web page reputation [6], and user perceived quality [7].
   In this paper, we try to answer the following question: What happens when a user
fires correlated queries to a search engine one by one? Specifically, we wish to
evaluate the effect of correlation among the queries submitted to a search engine (or
a search portal).
   The rest of this paper is organized as follows. In Section II, we briefly review
methodologies used in popular search engines. In Section III, we describe query
correlation. Section IV contains results and discussion. In Section V, we describe a
criterion for comparison of search engines. Finally, Section VI presents conclusions
and future work.

           II. A REVIEW OF METHODOLOGIES USED IN SEARCH ENGINES

   First we discuss a general strategy employed for retrieving information from the
Web, and then we review some of the search portals.

A. A General Strategy for Searching
   A general strategy for searching information on the Web is shown in Fig. 1. Broadly,
a search engine consists of the following components: User Interface, Query Dispatcher,
Cache³, Server Farm, and Web Base. The way these components interact with one another
depends upon the strategy employed in a particular search engine. We describe here a
broad view. An end user fires a query using an interface, say the User Interface. The
User Interface provides a form to the user. The user fills the form with a set of
keywords to be searched. The query goes to the Query Dispatcher which, after performing
some refinements, sends it to the Cache. If the query obtained after

   ¹ The terms Web surfing, Web searching, Web information retrieval, and Web mining
are often used in the same context. However, they differ depending upon the
methodologies involved, the intensity of seeking information, and the intentions of
users who extract information from the Web.
   ² Let us use the term Webbase for the collection of data in the case of the Web, in
order to differentiate it from a traditional database.
   ³ We use the word Cache to mean Search Engine Cache, i.e., storage space where
results matching previously fired queries or words are kept for future use.
[Figure 1: block diagram with the components User Interface, Query Dispatcher, Cache,
Server Farm, and Web Base. Numbered arrows 1 to 9 trace the path from the user's query
to the response.]

Fig. 1.   A general strategy for information retrieval from the Web.




refinement⁴ is matched to a query in the Cache, the results are immediately sent by
the Query Dispatcher to the User Interface and hence to the user. Otherwise, the Query
Dispatcher sends the query to one of the servers in the Server Farm, which are busy
building a Web Base for the search engine. The server so contacted, after consulting
the Web Base, sends the results to the Cache so that the Cache may store them for
future reference, if any. The Cache sends them to the Query Dispatcher. Finally, the
response is returned to the end user through the User Interface.
   In what follows, we briefly review the strategies employed by different search
portals.

B. Review of Strategies of Search Portals
   The major search portals or search engines⁵ that end users generally use for
searching are Google™ and Yahoo™. Let us briefly review the methodologies behind the
respective search engines⁶ of these search portals.
   Google is based on the PageRank scheme described in [8]. It is somewhat similar to
the scheme proposed by Kleinberg in [9], which is based on hub and authority weights
and focuses on the citations of a given page. To understand Google's strategy, one has
to first understand the HITS (Hyperlink-Induced Topic Search) algorithm proposed by
Kleinberg. Readers are directed to [9] for HITS and to [8] for PageRank.
   On the other hand, Yahoo employs an ontology based search engine. An ontology is a
formal term used to mean a hierarchical structure of terms (or keywords) that are
related. The relationships among the keywords are governed by a set of rules. As a
result, an ontology based search engine such as Yahoo may search other related terms
that are part of the ontology of the given term. Further, an ontology based search
engine may not search words that are not part of its ontology. It can modify its
ontology with time. Going one step further, an ontology based search engine may also
shorten the set of results before presenting it to the end users, by dropping results
that are not part of the ontology of the given term.
   We now describe an important aspect pertaining to information retrieval from the
Web. The results delivered by a search engine may depend on how the queries are
formulated and on what relation a given query has with previously fired queries, if
any. We wish to study the effect of correlation among the queries submitted to a
search engine.

                           III. QUERY CORRELATION

   The search results may differ depending upon whether a search engine treats a set
of words as an ordered set or an unordered set. In what follows, we consider each one
of them.

A. Permutations
   The search results delivered by a search engine may depend upon the order of words
appearing in a given query⁷. If we take into account the order of words, the same set
of words may form different queries for different orderings. The different orderings
of the set of words of the given query are called permutations. The formal definition
of permutations of a given query is as follows.
   Definition 1: Let the query $Q = \{w_i \mid 1 \le i \le m\}$, $Q \ne \phi$, be a set
of words excluding stop words of a natural language. Let
$P = \{x_j \mid 1 \le j \le m\}$ be a set of words excluding stop words. If $P$ is such
that $w_i = x_j$ for some $j$ not necessarily equal to $i$, and
$\forall w_i \in Q$, $\exists x_j \in P$ such that $w_i = x_j$, where $j$ may not be
equal to $i$, then $P$ is called a permutation of $Q$.
   In the above definition, stop words are language dependent. For example, in the
English language, the set of stop words, $S$, is often taken as

$$S = \{\text{a}, \text{an}, \text{the}, \text{is}, \text{am}, \text{are}, \text{will}, \text{shall}, \text{of}, \text{in}, \text{for}\}.$$

   ⁴ By refinement of a query, we mean that the given query is transformed in such a
way that words and forms that are not so important are eliminated, so that they do not
affect the results.
   ⁵ A search engine is a part of a search portal. A search portal provides many other
facilities or services such as Advanced Search, News, etc.
   ⁶ The respective products are trademarks of their organizations.
   ⁷ The term 'query' means a set of words that is given to a search engine to search
for the information available on the Web.
Note that if there are m words (excluding the stop words) in the given query, the
number of permutations is m!.
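
As an illustration of Definition 1, the following minimal Python sketch removes the stop words in S from a query and enumerates the orderings of the remaining words. The function names `content_words` and `query_permutations`, the lowercasing, and the whitespace tokenization are assumptions made for this example; they are not taken from either search engine.

```python
from itertools import permutations
from math import factorial

# Stop-word set S as listed in the text (English).
STOP_WORDS = {"a", "an", "the", "is", "am", "are", "will", "shall", "of", "in", "for"}


def content_words(query: str) -> list[str]:
    """Lowercase the query, drop stop words, and keep the first occurrence of each word."""
    kept = []
    for word in query.lower().split():
        if word not in STOP_WORDS and word not in kept:
            kept.append(word)
    return kept


def query_permutations(query: str) -> list[str]:
    """Return every ordering of the query's content words as a query string."""
    return [" ".join(p) for p in permutations(content_words(query))]


words = content_words("the impact of query correlation")
orderings = query_permutations("the impact of query correlation")
print(words)           # ['impact', 'query', 'correlation']
print(len(orderings))  # 6, i.e. 3!
```

With m = 3 content words the sketch yields 3! = 6 orderings, each of which could be submitted as a separate query.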
   The permutations are concerned with a single query. Submitting different
permutations of the given query to a search engine, one may evaluate how the search
engine behaves for different orderings of the same set of words. However, one would
like to know how the given search engine behaves when an end user fires different
queries that may or may not be related. Specifically, one would be interested in the
behavior of a given search engine when the queries are related. In what follows, we
discuss what is meant by the correlation among different queries.

[Figure 2: line plot of latency (vertical axis, ticks 0 to 1) versus page number
(horizontal axis, 1 to 10) for Google and Yahoo.]
Fig. 2.   Latency versus page number for permutation P1.

[Figure 3: line plot of latency (vertical axis, ticks 0 to 1) versus page number
(horizontal axis, 1 to 10) for Google and Yahoo.]
Fig. 3.   Latency versus page number for permutation P2.

B. Correlation
   An important aspect that may affect the results of Web searching is how different
queries are related. Two queries are said to be correlated if there are common words
between them. A formal definition of correlation among queries is as follows.
   Definition 2: Let $Q_1$ and $Q_2$ be queries given to a search engine such that
$Q_1$ and $Q_2$ are sets of words of a natural language and $Q_1, Q_2 \ne \phi$.
$Q_1$ and $Q_2$ are said to be correlated if and only if there exists a set
$C = Q_1 \cap Q_2$, $C \ne \phi$.
   One may use the above definition to define k-correlation between any two queries.
Formally, it can be stated as a corollary of Definition 2.
   Corollary 1: Two queries are said to be k-correlated if and only if $|C| = k$,
where $|\cdot|$ denotes the cardinality.
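
Definition 2 and Corollary 1 can be sketched with Python sets as follows. This is only an illustration: the helper names `common_words`, `is_correlated`, and `correlation_order` are hypothetical, and queries are assumed to be already reduced to sets of words.

```python
def common_words(q1: set[str], q2: set[str]) -> set[str]:
    """Return C = Q1 ∩ Q2; both queries must be non-empty, as in Definition 2."""
    if not q1 or not q2:
        raise ValueError("queries must be non-empty sets of words")
    return q1 & q2


def is_correlated(q1: set[str], q2: set[str]) -> bool:
    """Two queries are correlated if and only if C is non-empty."""
    return len(common_words(q1, q2)) > 0


def correlation_order(q1: set[str], q2: set[str]) -> int:
    """Return k = |C|, so that the two queries are k-correlated."""
    return len(common_words(q1, q2))


q1 = {"web", "search", "latency"}
q2 = {"web", "search", "relevance"}
q3 = {"database", "index"}

print(is_correlated(q1, q2), correlation_order(q1, q2))  # True 2
print(is_correlated(q1, q3), correlation_order(q1, q3))  # False 0
```

Here q1 and q2 share the words "web" and "search" and are therefore 2-correlated, whereas q1 and q3 have no common word.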
   For two queries that are correlated, we define a parameter called Correlation
Factor⁸ as follows.

$$\text{Correlation Factor} = \frac{|Q_1 \cap Q_2|}{|Q_1 \cup Q_2|} \qquad (1)$$

This is based on the fact that $|Q_1 \cup Q_2| = |Q_1| + |Q_2| - |Q_1 \cap Q_2|$.

   Note that $0 \le \text{Correlation Factor} \le 1$. For two uncorrelated queries the
Correlation Factor is 0. Further, one can see from Definition 1 that for the
permutations of the same query, Correlation Factor is 1.
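
Equation (1) can be computed directly, as in the minimal sketch below. The function name `correlation_factor` is an assumption for this example, and the denominator is obtained from the identity $|Q_1 \cup Q_2| = |Q_1| + |Q_2| - |Q_1 \cap Q_2|$ stated above.

```python
def correlation_factor(q1: set[str], q2: set[str]) -> float:
    """Correlation Factor of Eq. (1) for two non-empty queries given as sets of words."""
    if not q1 or not q2:
        raise ValueError("queries must be non-empty sets of words")
    intersection = len(q1 & q2)
    union = len(q1) + len(q2) - intersection  # |Q1 ∪ Q2| via the identity above
    return intersection / union


a = {"web", "search", "engine"}
b = {"web", "search", "portal"}
c = {"database", "index"}

print(correlation_factor(a, b))  # 0.5 (2 common words out of 4 distinct words)
print(correlation_factor(a, c))  # 0.0 (uncorrelated queries)
print(correlation_factor(a, {"engine", "search", "web"}))  # 1.0 (a permutation of a)
```

Because a query is treated as a set here, every permutation of a query maps to the same set, which is why its Correlation Factor with the original query is 1.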

   Similarly, one may define the Correlation Factor for a
cluster of queries. Let the number of queries be O. The                                                                                                                                                                                                            0
                                                                                                                                                                                                                                                                        1   2   3    4     5       6    7   8        9   10
cardinality of the union of the given cluster of queries is given                                                                                                                                                                                                                         Page Number

by the following equation.                                                                                                                                                                                                                   Fig. 4.             Latency versus page number for permutation P3.
            O
                                                 ∑ Qi ∑                                                                                    ∑
        
    ¡               Qo       ¡
                                                          ¡         ¡
                                                                    ©           ¡   Qi          Qj       ¡
                                                                                                          ¦                                                                  ¡   Qi                          Qj             Qk   ¡
                                                                                                                                                                                                                                                                   1
                                                                                                                                                                                                                                                                                                                Google
        o 1                                     i                       i j
                                                                                                                                  i j k
                                                                                                                                                                                                                                                                                                              Yahoo

                                                                                                     O 1
                                                     # !¦  ©
                                                     © " ¦                               1   $
                                                                                                      %
                                                                                                                           ¡       Q1                                       Q2                                
                                                                                                                                                                                                                            QO    ¡   (2)                        0.8


Using (2), one may define the Correlation Factor of a cluster of queries as follows.

$$\text{Correlation Factor} = \frac{\left| \bigcap_{o=1}^{O} Q_o \right|}{\left| \bigcup_{o=1}^{O} Q_o \right|} \qquad (3)$$


A high correlation factor means that the queries in the cluster are highly correlated, and vice versa.
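
As a minimal sketch of (3), assuming each query is treated as the set of its lowercased, whitespace-separated words (the tokenization and the name `correlation_factor` are illustrative assumptions, not taken from the paper), the correlation factor of a cluster can be computed as the size of the intersection divided by the size of the union:

```python
from typing import Iterable


def correlation_factor(queries: Iterable[str]) -> float:
    """Jaccard-style correlation factor of a cluster of queries, as in (3)."""
    # Assumption: a query is the set of its whitespace-separated, lowercased words.
    word_sets = [set(q.lower().split()) for q in queries]
    if not word_sets:
        raise ValueError("the cluster must contain at least one query")
    common = set.intersection(*word_sets)   # words shared by every query
    combined = set.union(*word_sets)        # words appearing in any query
    if not combined:
        raise ValueError("the queries contain no words")
    return len(common) / len(combined)


# Two three-word queries sharing exactly one word ("disjoint").
cluster = ["node disjoint multipath", "edge disjoint multicast"]
factor = correlation_factor(cluster)  # 1 common word out of 5 distinct words
```

For this pair the factor is 1/5, since one word is common to both queries and five distinct words appear overall.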
  In what follows, we discuss results pertaining to query correlation.

  8 This correlation factor is nothing but Jaccard’s Coefficient, which is often used as a measure of similarity.

Fig. 5. Latency versus page number for permutation P4.
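                                                                                                    TABLE I
 SEARCH LATENCIES, QUERY SPACE, AND THE NUMBER OF RELEVANT RESULTS FOR DIFFERENT PERMUTATIONS OF THE QUERY: Ash Mohammad Abbas FOR GOOGLE.
 (For each permutation, the three rows give, in order, the search latency, the query space, and the number of relevant results on pages p1 to p10.)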
                                   Permutation       p1           p2            p3          p4            p5          p6                   p7           p8          p9           p10
                                   1                 0.22         0.15          0.04        0.08          0.33        0.29                 0.15         0.13        0.16         0.17
                                                     300000       300000        300000      300000        300000      300000               300000       300000      300000       300000
                                                     8            5             1           0             2           0                    0            3           0            0
                                   2                 0.51         0.15          0.22        0.19          0.13        0.12                 0.10         0.27        0.16         0.15
                                                     300000       300000        300000      300000        300000      300000               300000       300000      300000       300000
                                                     3            2             2           1             1           0                    2            0           0            0
                                   3                 0.30         0.08          0.18        0.20          0.14        0.25                 0.13         0.21        0.14         0.21
                                                     300000       300000        300000      300000        300000      300000               300000       300000      300000       300000
                                                     6            4             1           3             2           1                    1            0           0            0
                                   4                 0.60         0.07          0.35        0.11          0.13        0.15                 0.23         0.13        0.28         0.26
                                                     300000       300000        300000      300000        300000      300000               300000       300000      300000       300000
                                                     3            0             2           1             0           0                    2            0           1            1
                                   5                 0.38         0.09          0.39        0.14          0.17        0.15                 0.14         0.16        0.15         0.13
                                                     300000       300000        300000      300000        300000      300000               300000       300000      300000       300000
                                                     3            2             1           2             1           0                    0            1           1            1
                                   6                 0.36         0.15          0.10        0.12          0.18        0.17                 0.15         0.13        0.20         0.15
                                                     300000       300000        300000      300000        300000      300000               300000       300000      300000       300000
                                                     5            4             1           3             0           2                    1            2           2            0


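                                                                                                    TABLE II
 SEARCH LATENCIES, QUERY SPACE, AND THE NUMBER OF RELEVANT RESULTS FOR DIFFERENT PERMUTATIONS OF THE QUERY: Ash Mohammad Abbas FOR YAHOO.
 (For each permutation, the three rows give, in order, the search latency, the query space, and the number of relevant results on pages p1 to p10.)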
                                           Permutation      p1          p2          p3           p4        p5         p6                   p7       p8           p9         p10
                                           1                0.15        0.15        0.27         0.24      0.25       0.23                 0.34     0.21         0.27       0.30
                                                            26100       26400       26400        27000     27000      26900                26900    26900        25900      25900
                                                            10          4           1            0         0          1                    0        0            0          0
                                           2                0.18        0.13        0.20         0.15      0.19       0.10                 0.15     0.09         0.12       0.13
                                                            26900       27000       27000        26900     25800      26900                26900    26800        26800      26800
                                                            4           6           1            1         0          1                    1        1            0          0
                                           3                0.12        0.11        0.15         0.14      0.11       0.10                 0.11     0.12         0.09       0.13
                                                            26900       27100       26900        26900     26500      26800                26800    26500        26500      26700
                                                            10          3           1            2         0          0                    0        0            0          0
                                           4                0.03        0.10        0.14         0.13      0.12       0.20                 0.10     0.19         0.12       0.17
                                                            27000       26400       26400        26700     27000      26700                26400    26900        26800      26800
                                                            7           4           0            2         1          0                    0        1            0          1
                                           5                0.12        0.12        0.20         0.08      0.13       0.10                 0.12     0.09         0.13       0.20
                                                            26400       26800       26800        26800     26800      26700                26700    26800        26700      26200
                                                            8           5           1            1         0          0                    0        0            0          1
                                           6                0.16        0.10        0.16         0.12      0.13       0.11                 0.10     0.11         0.12       0.15
                                                            27100       26700       27100        26700     26600      27000                26600    26900        26500      26500
                                                            10          5           0            0         0          1                    0        0            0          0




Fig. 6. Latency versus page number for permutation P5 (curves: Google, Yahoo).
Fig. 7. Latency versus page number for permutation P6 (curves: Google, Yahoo).




                               IV. RESULTS AND DISCUSSION
  The search portals that we have evaluated are Google and Yahoo. We have chosen them because
they are the search portals that the majority of end users rely on for their day-to-day
searching. A further reason for choosing them for performance evaluation is that they represent
different classes of search engines. As mentioned earlier, Yahoo is based on ontology while
Google is based on page ranks. Selecting them therefore allows one to evaluate two distinct
classes of search engines.
  The search environment is as follows. The client from which queries were fired was a
Pentium III machine. The machine was part of a 512 Kbps local area network. The operating
system was Windows XP.
Fig. 8. Latency versus correlation for queries with embedded semantics (curves: Google:Q1, Google:Q2, Yahoo:Q1, Yahoo:Q2).

  In what follows, we discuss the behavior of search engines for different permutations of a
query.

A. Query Permutations
  To see how a search engine behaves for different permutations of a query, we consider the
following query.

                    Ash Mohammad Abbas

The different permutations of this query are

          1    Ash         Mohammad    Abbas
          2    Ash         Abbas       Mohammad
          3    Abbas       Ash         Mohammad
          4    Abbas       Mohammad    Ash
          5    Mohammad    Ash         Abbas
          6    Mohammad    Abbas       Ash
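
The six orderings listed above are simply all permutations of the three query words. A minimal sketch, assuming the query is split on whitespace, that generates them with Python's standard `itertools.permutations` and checks them against the numbering used in the text (the variable names are illustrative):

```python
from itertools import permutations

query = "Ash Mohammad Abbas"
words = query.split()

# Every ordering of the three words; 3! = 6 orderings in total.
generated = {" ".join(p) for p in permutations(words)}

# Numbering used in the text (itertools enumerates in a different order).
numbered = {
    1: "Ash Mohammad Abbas",
    2: "Ash Abbas Mohammad",
    3: "Abbas Ash Mohammad",
    4: "Abbas Mohammad Ash",
    5: "Mohammad Ash Abbas",
    6: "Mohammad Abbas Ash",
}

# The numbered list covers exactly the generated orderings.
assert generated == set(numbered.values())
```

A query of n distinct words has n! permutations, so this kind of exhaustive study is only practical for short queries.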




  We have assigned a number to each permutation to differentiate one from another. We wish to
analyze search results on the basis of search time, number of relevant results, and query
space. The query space is the cardinality of the set of all results returned by a given search
engine in response to a given query. Search time is defined as the actual time taken by the
search engine to deliver the results. Ideally, it does not depend upon the speeds of the
hardware, software, and network components of the client from which queries are fired, because
it is the time taken by the search engine server. Relevant results are those which the user
intends to search for. For example, suppose the user intends to search for information about
Ash Mohammad Abbas (see footnote 9). Then all those results that contain Ash Mohammad Abbas
are relevant for the given query.

Fig. 9. Latency versus correlation for random queries (curves: Google:Q1, Google:Q2, Yahoo:Q1, Yahoo:Q2).
  In what follows, we discuss the results obtained for different permutations of a given
query. Let the given query be Ash Mohammad Abbas. For all permutations, all those results that
contain Ash Mohammad Abbas are counted as relevant results. Since both Google and Yahoo deliver
the results page-wise, we list all parameters mentioned in the previous paragraph page-wise.
We go up to 10 pages for both search engines, as the results beyond that are rarely
significant.
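
The page-wise counting of relevant results can be sketched as follows. This is only an illustration: it assumes that each result is available as a text snippet and that "contains Ash Mohammad Abbas" means a case-insensitive match of the exact phrase. The paper does not say how results were inspected, and the sample snippets below are invented.

```python
from typing import Dict, List

PHRASE = "Ash Mohammad Abbas"
MAX_PAGES = 10  # the analysis stops after ten pages of results


def is_relevant(result_text: str, phrase: str = PHRASE) -> bool:
    """Assumed criterion: the result text contains the phrase, ignoring case."""
    return phrase.lower() in result_text.lower()


def relevant_per_page(pages: Dict[int, List[str]]) -> Dict[int, int]:
    """Count relevant results on each page, for pages 1..MAX_PAGES only."""
    return {
        page: sum(is_relevant(text) for text in results)
        for page, results in sorted(pages.items())
        if 1 <= page <= MAX_PAGES
    }


# Hypothetical snippets, invented purely for illustration.
sample_pages = {
    1: ["Ash Mohammad Abbas - home page", "Papers by ASH MOHAMMAD ABBAS", "Unrelated page"],
    2: ["Mohammad Abbas profile", "Another unrelated page"],
    11: ["Ash Mohammad Abbas (beyond page 10, ignored)"],
}
counts = relevant_per_page(sample_pages)
```

Because the same phrase is used as the criterion for every permutation, the counts for different permutations are directly comparable.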
Fig. 10. Query Space versus correlation for queries with embedded semantics (curves: Google:Q1, Google:Q2, Yahoo:Q1, Yahoo:Q2).

  Table I shows search latencies, query space, and the number of relevant results for
different permutations of the given query. The search portal is Google. Our observations are
as follows.
   • For all permutations, the query space remains the same and it does not vary along the
     pages of results.
   • The time to search the first page of results in response to the given query is the
     largest for most permutations; in Table I, permutations 1 and 5 are exceptions.
   • The first page of results contains the most relevant results.
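
The observations on latency and relevance can be checked against the values of Table I. The following is a minimal sketch, with the latencies and relevant-result counts transcribed by hand from Table I and the helper name `first_page_is_max` chosen for illustration:

```python
from typing import Dict, List, Sequence

# Latencies from Table I (Google), pages p1..p10 for each permutation.
google_latency: Dict[int, List[float]] = {
    1: [0.22, 0.15, 0.04, 0.08, 0.33, 0.29, 0.15, 0.13, 0.16, 0.17],
    2: [0.51, 0.15, 0.22, 0.19, 0.13, 0.12, 0.10, 0.27, 0.16, 0.15],
    3: [0.30, 0.08, 0.18, 0.20, 0.14, 0.25, 0.13, 0.21, 0.14, 0.21],
    4: [0.60, 0.07, 0.35, 0.11, 0.13, 0.15, 0.23, 0.13, 0.28, 0.26],
    5: [0.38, 0.09, 0.39, 0.14, 0.17, 0.15, 0.14, 0.16, 0.15, 0.13],
    6: [0.36, 0.15, 0.10, 0.12, 0.18, 0.17, 0.15, 0.13, 0.20, 0.15],
}

# Number of relevant results from Table I (Google), pages p1..p10.
google_relevant: Dict[int, List[int]] = {
    1: [8, 5, 1, 0, 2, 0, 0, 3, 0, 0],
    2: [3, 2, 2, 1, 1, 0, 2, 0, 0, 0],
    3: [6, 4, 1, 3, 2, 1, 1, 0, 0, 0],
    4: [3, 0, 2, 1, 0, 0, 2, 0, 1, 1],
    5: [3, 2, 1, 2, 1, 0, 0, 1, 1, 1],
    6: [5, 4, 1, 3, 0, 2, 1, 2, 2, 0],
}


def first_page_is_max(rows: Dict[int, Sequence[float]]) -> List[int]:
    """Permutations whose first-page value is at least as large as any other page's."""
    return [p for p, values in rows.items() if values[0] == max(values)]


latency_ok = first_page_is_max(google_latency)
relevance_ok = first_page_is_max(google_relevant)
```

On these values the first page has the largest latency for permutations 2, 3, 4, and 6, and the largest number of relevant results for all six permutations.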


   9 We have intentionally taken the query: Ash Mohammad Abbas. We wish to search for different
permutations of a query and to see the effect of those permutations on query space and on the
number of relevant results. The relevance is partly related to the intentions of an end-user.
Since we already know what the relevant results for the chosen query are, it is easier to
decide which relevant results have been returned by a search engine. The reader may take any
other query, if he/she wishes so. In that case, he/she has to decide which results are relevant
to his/her query, and this will partly depend upon what he/she intended to search for.

Fig. 11. Query Space versus correlation for random queries (curves: Google:Q1, Google:Q2, Yahoo:Q1, Yahoo:Q2).
                                                                      TABLE III
                                                      QUERIES WITH EMBEDDED SEMANTICS.
                        S. No.   Query No.       Query                                                        Correlation
                        E1       Q1              node         disjoint     multipath                          1
                                 Q2              edge         disjoint     multicast
                        E2       Q1              node         disjoint     multipath    routing               2
                                 Q2              edge         disjoint     multicast    routing
                        E3       Q1              node         disjoint     multipath    routing               3
                                 Q2              edge         disjoint     multipath    routing
                        E4       Q1              node         disjoint     multipath    routing     ad hoc    4
                                 Q2              wireless     node         disjoint     multipath   routing

                                                                      TABLE IV
                                        Q UERIES WITHOUT EMBEDDED SEMANTICS ( RANDOM QUERIES ).
              S. No.   Query No.     Query                                                                              Correlation
              R1       Q1            adhoc         node         ergonomics                                              1
                       Q2            quadratic     power        node
              R2       Q1            computer      node         constellations     parity                               2
                       Q2            hiring        parity       node               biased
              R3       Q1            wireless      node         parity             common      mitigate                 3
                       Q2            mitigate      node         shallow            rough       parity
              R4       Q1            few           node         parity             mitigate    common     correlation   4
                       Q2            shallow       mitigate     node               parity      common     stanza
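
The Correlation column of Tables III and IV appears to equal the number of words that Q1 and Q2 have in common. A minimal sketch that checks this reading, assuming queries are split on whitespace (so "ad hoc" counts as two words) and using the illustrative helper name `common_words`:

```python
from typing import Dict, Tuple


def common_words(q1: str, q2: str) -> int:
    """Number of distinct words that appear in both queries."""
    return len(set(q1.split()) & set(q2.split()))


# (Q1, Q2, Correlation column) transcribed from Tables III and IV.
pairs: Dict[str, Tuple[str, str, int]] = {
    "E1": ("node disjoint multipath", "edge disjoint multicast", 1),
    "E2": ("node disjoint multipath routing", "edge disjoint multicast routing", 2),
    "E3": ("node disjoint multipath routing", "edge disjoint multipath routing", 3),
    "E4": ("node disjoint multipath routing ad hoc",
           "wireless node disjoint multipath routing", 4),
    "R1": ("adhoc node ergonomics", "quadratic power node", 1),
    "R2": ("computer node constellations parity", "hiring parity node biased", 2),
    "R3": ("wireless node parity common mitigate",
           "mitigate node shallow rough parity", 3),
    "R4": ("few node parity mitigate common correlation",
           "shallow mitigate node parity common stanza", 4),
}

computed = {name: common_words(q1, q2) for name, (q1, q2, _) in pairs.items()}
listed = {name: k for name, (_, _, k) in pairs.items()}
```

For all eight pairs the computed count agrees with the listed correlation, for the semantic queries (E1 to E4) and the random ones (R1 to R4) alike.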

                                                                 TABLE V
                                   SEARCH TIME AND QUERY SPACE FOR QUERIES WITH EMBEDDED SEMANTICS.
                                       S. No.    Query No.               Google                Yahoo
                                                               Time       Query Space   Time    Query Space
                                       E1        Q1            0.27       43100         0.37    925
                                                 Q2            0.23       63800         0.28    1920
                                       E2        Q1            0.48       37700         0.40    794
                                                 Q2            0.32       53600         0.32    1660
                                       E3        Q1            0.48       37700         0.40    794
                                                 Q2            0.24       21100         0.34    245
                                       E4        Q1            0.31       23500         0.64    79
                                                 Q2            0.33       25600         0.44    518


                                                                TABLE VI
                                            SEARCH TIME AND QUERY SPACE FOR RANDOM QUERIES.
                                       S. No.    Query No.               Google                Yahoo
                                                               Time       Query Space   Time    Query Space
                                       R1        Q1            0.44       28500         0.57    25
                                                 Q2            0.46       476000        0.28    58200
                                       R2        Q1            0.46       34300         0.55    164
                                                 Q2            0.42       25000         0.35    90
                                       R3        Q1            0.47       25000         0.40    233
                                                 Q2            0.33       754           0.68    31
                                       R4        Q1            0.34       20000         0.58    71
                                                 Q2            1.02       374           0.64    23




  Table II shows the same set of parameters for different                          number of relevant results. For permutation 2 (i.e. Ash
permutations of the given query for search portal Yahoo. From                      Abbas Mohammad), the second page contains the largest
the table, we observe that                                                         number of relevant results.
   




      As opposed to Google, the query space does not remain                      Let us discuss reasons for the above mentioned observations.
      same, rather it varies with the pages of searched results.              Consider the question why query space in case of Google is
      The query space in this case is less than Google.                       larger than that of Yahoo. We have pointed out that Google
   




      The time to search the first page of results is not neces-               is based on the page ranks. For a given query (or a set of
      sarily the largest of the pages considered. More precisely,             words), it ranks the pages. It delivers all the ranked pages
      it is larger for the pages where there is no relevant result.           that contain the words contained in the given query. On the
      Further, the time taken by Yahoo is less than that of                   other hand, Yahoo is an ontology based search engine. As
      Google.                                                                 mentioned earlier, it will search only that part of its Webbase
   




      In most of the cases, the first page contains the largest                that constitutes the ontology of the given query. This is the
                                  TABLE VII                                                                           TABLE IX
           L ATENCY MATRIX , L,   FOR DIFFERENT PERMUTATIONS .                  R ELEVANCE MATRIX FOR DIFFERENT PERMUTATIONS FOR G OOGLE .
      P     p1    p2   p3    p4     p5   p6    p7    p8    p9    p10               P       p1        p2   p3     p4      p5       p6        p7   p8     p9    p10
      1     1     1
                       1     1      0    0     1     1     1     1                 1       8         5    1      0       2        0         0    3      0     0
                  2
      2     0     0    0     0      1    0     1     0     0     0                 2       3         2    2      1       1        0         2    0      0     0
      3     0     1    0     0      0    0     0     0     0     0                 3       6         4    1      3       2        1         1    0      0     0
      4     0     1    0     1      0    1     0     1     0     0                 4       3         0    2      1       0        0         2    0      1     1
      5     0     1    0     0      0    0     0     0     0     1                 5       3         2    1      2       1        0         0    1      1     1
      6     0     0    1      1
                                    0    0     0     0     0     1                 6       5         4    1      3       0        2         1    2      2     0
                              2                                  2



                               TABLE VIII                                                                             TABLE X
          QUERY SPACE MATRIX, S, FOR DIFFERENT PERMUTATIONS.                    RELEVANCE MATRIX FOR DIFFERENT PERMUTATIONS FOR YAHOO.
      P     p1    p2    p3    p4    p5    p6    p7    p8    p9    p10               P     p1    p2    p3    p4    p5    p6    p7    p8    p9    p10
      1     1     1     1     1     1     1     1     1     1     1                 1     10    4     1     0     0     1     0     0     0     0
      2     1     1     1     1     1     1     1     1     1     1                 2     4     6     1     1     0     1     1     1     0     0
      3     1     1     1     1     1     1     1     1     1     1                 3     10    3     1     2     0     0     0     0     0     0
      4     1     1     1     1     1     1     1     1     1     1                 4     7     4     0     2     1     0     0     1     0     1
      5     1     1     1     1     1     1     1     1     1     1                 5     8     5     1     1     0     0     0     0     0     1
      6     1     1     1     1     1     1     1     1     1     1                 6     10    5     0     0     0     1     0     0     0     0
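The per-permutation relevance totals reported in Table XI are the row sums of the relevance matrices in Tables IX and X. The sketch below reproduces that arithmetic; the variable and function names are illustrative and do not come from the paper.

```python
# Tables IX and X as printed: rows are permutations 1..6, columns are pages p1..p10.
GOOGLE_RELEVANCE = [
    [8, 5, 1, 0, 2, 0, 0, 3, 0, 0],
    [3, 2, 2, 1, 1, 0, 2, 0, 0, 0],
    [6, 4, 1, 3, 2, 1, 1, 0, 0, 0],
    [3, 0, 2, 1, 0, 0, 2, 0, 1, 1],
    [3, 2, 1, 2, 1, 0, 0, 1, 1, 1],
    [5, 4, 1, 3, 0, 2, 1, 2, 2, 0],
]
YAHOO_RELEVANCE = [
    [10, 4, 1, 0, 0, 1, 0, 0, 0, 0],
    [4, 6, 1, 1, 0, 1, 1, 1, 0, 0],
    [10, 3, 1, 2, 0, 0, 0, 0, 0, 0],
    [7, 4, 0, 2, 1, 0, 0, 1, 0, 1],
    [8, 5, 1, 1, 0, 0, 0, 0, 0, 1],
    [10, 5, 0, 0, 0, 1, 0, 0, 0, 0],
]


def row_totals(matrix):
    """Sum the relevant results over the ten pages of each permutation."""
    return [sum(row) for row in matrix]


google_totals = row_totals(GOOGLE_RELEVANCE)
yahoo_totals = row_totals(YAHOO_RELEVANCE)

print(google_totals, sum(google_totals))  # [19, 11, 18, 10, 12, 20] 90
print(yahoo_totals, sum(yahoo_totals))    # [16, 15, 16, 16, 16, 16] 95
```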




reason why the query space in the case of Google is larger than that of Yahoo.
   Let us answer the question of why the query space changes in the case of Yahoo and why
it remains constant in the case of Google. Note that ontology may change with time and
with the order of words in the given query. For every page of results, Yahoo estimates
the ontology of the given permutation of the query before delivering the results to the
end user. Therefore, the query space for different permutations of the given query is
different, and it changes with the pages of the searched results^10. However, page ranks
do not change either with pages or with the order of words. The page ranks will only
change when new links or documents that are relevant to the given query are added to the
Web. Since neither a new link nor a new document is added to the Web during the
evaluation of permutations of the query, the query space does not change in the case of
Google.
   In order to compare the performance of Google and Yahoo, the latencies versus page
numbers for different permutations of the query are shown in Figures 2 through 7. Let us
consider the question of why the search time in the case of Google is larger than that of
Yahoo. Note that Google ranks the results before delivering them to end users while Yahoo
does not. The ranking of pages takes time. This is the reason why the search time taken
by Google is larger than that of Yahoo.
   In what follows, we discuss how a search engine behaves for correlated queries.

B. Query Correlation
   We have formulated k-correlated queries as shown in Table III. Since all words
contained in a query are related^11, we call them queries with embedded semantics. On the
other hand, we have another set of k-correlated queries as shown in Table IV. The words
contained in these queries are random and are not related semantically.
   We wish to evaluate the performance of a search engine for k-correlated queries. For
that, we evaluate the search time and query space of a search engine for the first page
of results. Since both Google and Yahoo deliver 10 results per page, looking at the first
page of results means that we are evaluating the 10 topmost results of these search
engines. Note that we do not consider the number of relevant results because relevancy in
this case would be query dependent. Since there is no single query, an evaluation of
relevancy would not be so useful.
   Table V shows search time and query space for k-correlated queries with embedded
semantics (see Table III). The second query, Q2, is fired after the first query, Q1. On
the other hand, Table VI shows search time and query space for k-correlated queries whose
words may not be related (see Table IV).

   10 This observed behavior may also be due to the use of a randomized algorithm. To
understand the behavior of randomized algorithms, readers are referred to any text on
randomized algorithms such as [10].
   11 More precisely, all words in these queries are from ad hoc wireless networks, an
area in which the authors of this paper like to work.

                         TABLE XI
          RELEVANCE FOR DIFFERENT PERMUTATIONS.
      P         Google    Yahoo
      1         19        16
      2         11        15
      3         18        16
      4         10        16
      5         12        16
      6         20        16
      Total     90        95

                                   TABLE XII
               EARNED POINTS (EP) FOR DIFFERENT PERMUTATIONS.
      P                  Google                            Yahoo
                Latency   Query Space   EP        Latency   Query Space   EP
      1         14.5      19            33.5      3         0             3
      2         3         11            14        14        0             14
      3         4         18            22        13        0             13
      4         1         10            11        9         0             9
      5         3         12            15        10        0             10
      6         2.5       20            22.5      16        0             16
      Total                             118                               65
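The entries of Table XII can be reproduced from Tables VII through X with the Earned Points formula of Section V, in which each page's relevance count is weighted by the corresponding entries of the latency matrix L and the query space matrix S. The sketch below is a minimal illustration; the function names are not from the paper. It also rests on an assumption: Yahoo's matrices are taken as the complements 1 - L and 1 - S of Google's. This excerpt does not state that explicitly, but it yields the printed Yahoo columns.

```python
# Tables IX and X: relevant results per page (p1..p10) for permutations 1..6.
GOOGLE_RELEVANCE = [
    [8, 5, 1, 0, 2, 0, 0, 3, 0, 0],
    [3, 2, 2, 1, 1, 0, 2, 0, 0, 0],
    [6, 4, 1, 3, 2, 1, 1, 0, 0, 0],
    [3, 0, 2, 1, 0, 0, 2, 0, 1, 1],
    [3, 2, 1, 2, 1, 0, 0, 1, 1, 1],
    [5, 4, 1, 3, 0, 2, 1, 2, 2, 0],
]
YAHOO_RELEVANCE = [
    [10, 4, 1, 0, 0, 1, 0, 0, 0, 0],
    [4, 6, 1, 1, 0, 1, 1, 1, 0, 0],
    [10, 3, 1, 2, 0, 0, 0, 0, 0, 0],
    [7, 4, 0, 2, 1, 0, 0, 1, 0, 1],
    [8, 5, 1, 1, 0, 0, 0, 0, 0, 1],
    [10, 5, 0, 0, 0, 1, 0, 0, 0, 0],
]
# Table VII (latency matrix L) and Table VIII (query space matrix S).
LATENCY_MATRIX = [
    [1, 0.5, 1, 1, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 1, 0.5, 0, 0, 0, 0, 0, 0.5],
]
SPACE_MATRIX = [[1] * 10 for _ in range(6)]


def complement(matrix):
    """Assumed Yahoo view of a win matrix: 1 - entry (a tie stays 1/2)."""
    return [[1 - entry for entry in row] for row in matrix]


def earned_points(relevance, latency_wins, space_wins):
    """Per permutation: (latency part, query space part, EP = their sum)."""
    table = []
    for rel, lat, spc in zip(relevance, latency_wins, space_wins):
        latency_part = sum(r * l for r, l in zip(rel, lat))
        space_part = sum(r * s for r, s in zip(rel, spc))
        table.append((latency_part, space_part, latency_part + space_part))
    return table


google_ep = earned_points(GOOGLE_RELEVANCE, LATENCY_MATRIX, SPACE_MATRIX)
yahoo_ep = earned_points(
    YAHOO_RELEVANCE, complement(LATENCY_MATRIX), complement(SPACE_MATRIX)
)

print(sum(row[2] for row in google_ep))  # 118.0
print(sum(row[2] for row in yahoo_ep))   # 65.0
```

Because every entry of S is 1, Google's query space column equals its relevance totals in Table XI, while Yahoo's is 0 throughout.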
   In order to compare the performance of Yahoo and Google, the latencies versus
correlation for queries with embedded semantics are shown in Figure 8 and those for
randomized queries are shown in Figure 9. Similarly, the query space for queries with
embedded semantics is shown in Figure 10 and that for randomized queries is shown in
Figure 11.
   The query space of Yahoo is much less than that of Google for the reasons discussed in
the previous subsection. Other important observations are as follows.

    •  In case of k-correlated queries with embedded semantics, generally the time to
       search for Q2 is less than that of Q1. This is due to the fact that since the
       queries are correlated, some of the words of Q2 have already been searched while
       searching for Q1.
    •  The query space is increased when the given query has a word that is more
       frequently found in Web pages (e.g. in R1: Q2, the word quadratic that is
       frequently used in Engineering, Science, Maths, Arts, etc.). The query space is
       decreased when there is a word included in the query which is rarely used (e.g.
       mitigate included in R3,R4:Q1 Q2 and shallow included in R3,R4:Q2).
    •  The search time is larger in case of randomized queries as compared to queries
       with embedded semantics. The reason for this observation is as follows. In case of
       queries with embedded semantics, the words of a given query are related and are
       found in Web pages that are not too far from one another either from the point of
       view of page rank as in Google or from the point of view of ontology as in Yahoo.
    •  One cannot infer anything about the search time of Google and Yahoo as it depends
       upon the query. More precisely, it depends upon which strategy takes more time:
       page ranking in Google or estimation of ontology in Yahoo.

   However, from Table V and Table VI, one can infer the following. Google is better in
the sense that its query space is much larger than that of Yahoo. However, Yahoo takes
less time as compared to Google for different permutations of the same query. For
k-correlated queries with embedded semantics, Google takes less time to search for the
first query as compared to Yahoo. This also applies to randomized queries with some
exceptions. In exceptional cases, Google takes much more time as compared to Yahoo. As
mentioned previously, this depends upon the given query as well as the strategy employed
in the search engine.
   In what follows, we describe unified criteria for comparing the search engines
considered in this paper.

                              TABLE XIII
             CEP FOR DIFFERENT PERMUTATIONS FOR GOOGLE.
      P         Latency Contribution    Query Space Contribution
      1         123.834                 5700000
      2         61.262                  3300000
      3         116.534                 5400000
      4         35.918                  3000000
      5         73.458                  3600000
      6         530.376                 6000000
      Total     530.376                 27000000

                              TABLE XIV
             CEP FOR DIFFERENT PERMUTATIONS FOR YAHOO.
      P         Latency Contribution    Query Space Contribution
      1         101.385                 419900
      2         107.821                 404100
      3         131.558                 431000
      4         308.197                 428700
      5         130.833                 425000
      6         121.591                 431500
      Total     901.385                 2540200


          V. A UNIFIED CRITERIA FOR COMPARISON

   Let us denote Google by a superscript '1' and Yahoo by a superscript '0'^12. Let
L = [l_{ij}] be a matrix where l_{ij} is defined as follows.

\[
l_{ij} =
\begin{cases}
1 & \text{if } \mathrm{latency}^{1}_{ij} < \mathrm{latency}^{0}_{ij} \\
\frac{1}{2} & \text{if } \mathrm{latency}^{1}_{ij} = \mathrm{latency}^{0}_{ij} \\
0 & \text{otherwise.}
\end{cases}
\tag{4}
\]

   Similarly, let S = [s_{ij}] be a matrix where s_{ij} is defined as follows.

\[
s_{ij} =
\begin{cases}
1 & \text{if } \mathrm{space}^{1}_{ij} > \mathrm{space}^{0}_{ij} \\
\frac{1}{2} & \text{if } \mathrm{space}^{1}_{ij} = \mathrm{space}^{0}_{ij} \\
0 & \text{otherwise.}
\end{cases}
\tag{5}
\]

In the matrices defined above, a '1' means that Google is the winner at that place, and a
'1/2' represents a tie between Google and Yahoo. We now define a parameter that we call
Earned Points (EP), which is as follows.

\[
EP^{k} = \sum_{i=1}^{\mathrm{pages}} \mathrm{relevant}^{k}_{i} \left( L^{k}_{i} + S^{k}_{i} \right)
\tag{6}
\]

where the superscript $k \in \{0, 1\}$ denotes the search engine.

   12 This is simply a representation. One may consider a representation which is the
reverse of it; then also, there will not be any effect on the criteria.

   Table VII shows a latency matrix, L, for different permutations of the query as that
for Table I and Table II, and has been constructed using both of them. In the latency
matrix, there are 40 '0's, 17 '1's, and 3 '1/2's. We observe from the latency matrix that
Yahoo is the winner (as far as latencies are concerned), as there are 40 '0's out of 60
entries in total.
   On the other hand, Table VIII shows the query space matrix, S, for different
permutations of the same query and is constructed using the tables mentioned in the
preceding paragraph. One can see that as far as query space is concerned, Google is the
sole winner. In fact, the query space of Google is much larger than that of Yahoo.
   The relevance matrix for Google is shown in Table IX and that for Yahoo is shown in
Table X. The total relevance for the first ten pages is shown in Table XI for both Google
and Yahoo. It is seen from Table XI that the total relevance for Google is 90 and that
for Yahoo is 95. Average relevance per
                                   TABLE XV
   CONTRIBUTION DUE TO QUERY SPACE IN CEP FOR DIFFERENT SETS OF WEIGHTS.
      Weights                      Google          Yahoo
      wl = 1, wq = 10^-6           27.00           2.54
      wl = 1, wq = 10^-5           270.00          25.40
      wl = 1, wq = 10^-4           2700.00         254.02
      wl = 1, wq = 10^-3           27000.00        2540.20
      wl = 1, wq = 10^-2           270000.00       25402.00
      wl = 1, wq = 10^-1           2700000.00      254020.00
      wl = 1, wq = 1               27000000        2540200

                                   TABLE XVI
             CCEP FOR DIFFERENT SETS OF COMPARABLE WEIGHTS.
      Weights                      Google          Yahoo
      wl = 0.9, wq = 0.1           486.3384        811.2465
      wl = 0.8, wq = 0.2           442.3008        721.1080
      wl = 0.7, wq = 0.3           398.2632        630.9695
      wl = 0.6, wq = 0.4           354.2256        540.8310
      wl = 0.5, wq = 0.5           310.1880        450.6925
      wl = 0.4, wq = 0.6           266.1504        360.5540
      wl = 0.3, wq = 0.7           222.1128        270.4155
      wl = 0.2, wq = 0.8           178.0752        180.2770
      wl = 0.1, wq = 0.9           134.0376        90.1385

   Table XIII shows the contributions of latency and query space in CEP for Google.
Similarly, Table XIV shows the same for Yahoo. We observe that the contribution of
latency for Google is 530.376 and that for Yahoo is 901.385. However, the contribution of
query space for Google is 27000000 and that for Yahoo is 2540200. In other words, the
contribution of query space for Google is approximately 11 times that for Yahoo. Adding
these contributions results in a larger CEP for Google as compared to Yahoo. The CEP
defined using (7) has a problem that we call the dominating constituent problem (DCP):
the larger parameter suppresses the smaller parameter. Note that the definition of CEP in
(7) assumes equal weights for latency and query space. On the other hand, one may be
interested in assigning different weights to the constituents of CEP depending upon their
importance. Let us rewrite (7) to incorporate weights. Let wl and wq be the weights
assigned to latency and query space, respectively. Then (7) can be written as follows.

\[
CEP^{k} = \sum_{i=1}^{\mathrm{pages}} \mathrm{relevant}^{k}_{i} \left( \frac{1}{d^{k}_{i}}\, w_{l} + q^{k}_{i}\, w_{q} \right)
\tag{8}
\]

The weights should be chosen carefully. For example, the
                                                                                                                                                         weights wl 1, wq 10 6 will add 27 to the contribution in
                                                                                                                                                                                                                              ©
permutation and per page for Google is 1 5 and that for Yahoo                                            ¢                                               CEP due to query space for Google and 2 54 to Yahoo. On the                                                 ¢

is 1 583. Therefore, as far as average relevance is concerned,
    ¢                                                                                                                                                    other hand, a set of weights wl 1, wq 10 5 shall add 270                                                                    ©
Yahoo is the winner.                                                                                                                                     for Google and 25 4 for Yahoo. Table XV shows contribution
                                                                                                                                                                                                    ¢

   Table XII shows the number of earned points for both                                                                                                  of query space in CEP for different sets of weights. It is to note
Google as well as Yahoo for different permutations of the                                                                                                that wl is fixed to 1 for all sets, and only wq is varied. As wq is
query mentioned earlier. We observe that the number of earned                                                                                            increased beyond 10 5, the contribution of query space starts
                                                                                                                                                                                                            ©
points for Google is 118 and that for Yahoo is 65. The number                                                                                            dominating over the contribution of latency. The set of weight
of earned points of Google is far greater than Yahoo. The                                                                                                wl 1 wq 10 5 indicates that one can ignore contribution
                                                                                                                                                                     £                 ©
reason behind this is that query space of Yahoo is always less                                                                                           of query space in comparison to the contribution of latencies
than that of Google and it does not contribute to the number                                                                                             provided that one is more interested in comparing search
of earned points.                                                                                                                                        engines with respect to latency. In that case, an approximate
   A closer look on the definition of EP reveals that while                                                                                               expression for CEP can be written as follows.
defining the parameter EP in (6) together with (4) and (5),                                                                                                                                                                   pages          ¢                                1
we have assumed that a search engine either has a constituent
parameter (latency or query space) or it does not have that
                                                                                                                                                                                       CEPk                     ¤ ¤              ∑             relevantk
                                                                                                                                                                                                                                                        i        §           dik         ¦                (9)
                                                                                                                                                                                                                             i 1  

parameter at all. The contribution of some of the parameter                                                                                                 Alternatively, one may consider an approach that is combi-
is lost due the fact that the effective contribution of other                                                                                            nation of the definition of EP defined in (6) (together with (4)
parameter by which the given parameter is multiplied is zero.                                                                                            and (5) and that of CEP defined in (7). In that we may use
Note that our goal behind introduction of (6) was to rank                                                                                                the definition of matrix S which converts the contribution of
the given set of search engines. We call this type of ranking                                                                                            query space in the form of binaries13 . The modified definition
of search engines as Lossy Constituent Ranking (LCR). We,                                                                                                is as follows.
therefore, feel that there should be a method of comparison                                                                                                                                                                           ¢
                                                                                                                                                                                                            pages
between a given set of search engines that is lossless in nature.                                                                                                                                                                                                    1
For that purpose, we define another parameter that we call
                                                                                                                                                                          CCEPk                             ¤ ¤     ∑                     relevantk
                                                                                                                                                                                                                                                  i         §
                                                                                                                                                                                                                                                            £        dik
                                                                                                                                                                                                                                                                                 ¦           Sik   ¦
                                                                                                                                                                                                                                                                                                   ¨¥    (10)
                                                                                                                                                                                                                i 1 
Contributed Earned Points (CEP). The definition of CEP is as
follows.                                                                                                                                                 where Sik is in accordance with the definition of S given by
                                                                                                                                                         (5). The acronym CCEP stands for Combined Contributory
                                    pages              ¢                                                1                                                Earned Points. If one wishes to incorporate weights, then the
           CEPk              ¤
                           ¤ ¥      ∑                              relevantk
                                                                           i                       ¤§
                                                                                                   £    dik
                                                                                                              ¦                       qk
                                                                                                                                       i      ¦
                                                                                                                                              §¥   (7)   definition of CCEP becomes as follows.
                                    i 1
                                                                                                                                                                                                               ¢
                                                                                                                                                                                           pages
                                                                                                                                                                                                                                                            1
                                                                                                                                                                                           ∑
                                                   ¡                                                                                                                                 ¤
                                                                                                                                                                                   ¤ ¥ 
where, superscript k      0 1 denotes the search engine, d
                                                                  ¢ £                                                                                              CCEPk                                                   relevantk
                                                                                                                                                                                                                                    i                £ §        wl       ¦       Sik wq            ¦
                                                                                                                                                                                                                                                                                                   ¨¥    (11)
denotes the actual latency, and q denotes the actual query                                                                                                                                 i 1
                                                                                                                                                                                            
                                                                                                                                                                                                                                                            dik
space. The reason behind having an inverse of actual latency                                                                                                13 We mean that the matrix S says either there is a contribution of query
in (7) is that the better search engine would be that which                                                                                              space of a search engine provided that its query space is larger than that of
takes less time.                                                                                                                                         the other one or there is no contribution of query space at all, if otherwise.
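The definitions in (7) through (11) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names (`cep`, `ccep`, `query_space_indicator`) and the sample numbers are hypothetical, and the rule used for the binary matrix $S$ is taken from the wording of footnote 13 (1 when the engine's query space is larger than the other engine's, 0 otherwise), since (5) itself is not reproduced here.

```python
from typing import List, Sequence


def cep(relevant: Sequence[float], latency: Sequence[float],
        query_space: Sequence[float], w_l: float = 1.0, w_q: float = 1.0) -> float:
    """Weighted CEP for one engine, as in (8).

    With w_l = w_q = 1 this reduces to (7); with w_q = 0 it gives the
    latency-only approximation (9).
    """
    return sum(r * (w_l / d + w_q * q)
               for r, d, q in zip(relevant, latency, query_space))


def query_space_indicator(own: Sequence[float], other: Sequence[float]) -> List[int]:
    """Binary S per page (assumed from footnote 13): 1 if own query space is larger."""
    return [1 if a > b else 0 for a, b in zip(own, other)]


def ccep(relevant: Sequence[float], latency: Sequence[float],
         s: Sequence[int], w_l: float = 1.0, w_q: float = 1.0) -> float:
    """Weighted CCEP for one engine, as in (11); w_l = w_q = 1 gives (10)."""
    return sum(r * (w_l / d + w_q * s_i)
               for r, d, s_i in zip(relevant, latency, s))


# Hypothetical two-page example (made-up numbers, not taken from the paper's tables).
relevant_a = [2, 1]          # relevant results per page
latency_a = [0.5, 0.25]      # latency per page
query_a = [1000, 2000]       # query space of engine A per page
query_b = [500, 3000]        # query space of the other engine per page

s_a = query_space_indicator(query_a, query_b)
print("CEP  (7):", cep(relevant_a, latency_a, query_a))
print("CEP  (9):", cep(relevant_a, latency_a, query_a, w_q=0.0))
print("CCEP (10):", ccep(relevant_a, latency_a, s_a))
print("CCEP (11):", ccep(relevant_a, latency_a, s_a, w_l=0.5, w_q=0.5))
```

Because the raw query space in (7) is several orders of magnitude larger than the inverse latency, it dominates the unweighted CEP, whereas the binary $S$ in (10) keeps both terms on a comparable scale.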
In the definition of CCEP given by (11), the weights can be comparable, and the dominant constituent problem mentioned earlier can be mitigated for comparable weights. We define comparable weights as follows.

   Definition 3: A set of weights $W = \{ w_i \mid w_i > 0 \}$ is said to have comparable weights if and only if $\sum_i w_i = 1$ and the condition $\frac{1}{9} \le \frac{w_i}{w_j} \le 9$ is satisfied $\forall\, w_i, w_j \in W$.

   Table XVI shows the values of CCEP for different sets of comparable weights. We observe that the rate of decrease of CCEP for Yahoo is larger than that of Google. For example, for $w_l = 0.9$, $w_q = 0.1$, CCEP for Google is 486.3384 and that for Yahoo is 811.2465. For $w_l = 0.8$, $w_q = 0.2$, CCEP for Google is 442.3008 and that for Yahoo is 721.1080. In other words, the rate of decrease in CCEP for Google is 9.05% and that for Yahoo is 11.11%. The reason is that in the query space matrix $S$ (see Table VIII) all entries are '1'. It means that the query space of Google is always larger than that of Yahoo. Therefore, in case of Yahoo, the contribution due to query space is always 0 irrespective of the weight assigned to it. However, in case of Google the contribution due to query space is nonzero and increases with an increase in the weight assigned to the contribution due to query space. Moreover, for a set of weights $W = \{ w_l = 0.5, w_q = 0.5 \}$, the values of CCEP are 310.1880 and 450.6925 for Google and Yahoo, respectively. It means that if one wishes to assign equal weights to latency and query space, then Yahoo is the winner in terms of the parameter CCEP.

   In case of CCEP, the effect of the dominating constituent problem is less than in case of CEP. In other words, the effect of large values of query space is noticeably smaller in case of CCEP than in case of CEP. This is with reference to our remark that with the use of CCEP the dominating constituent problem is mitigated.

                                                      VI. CONCLUSIONS

   In this paper, we analyzed the impact of correlation among queries on search results for two representative search portals, namely Google and Yahoo. The major accomplishments of the paper are as follows:

   • We analyzed the search time, the query space, and the number of relevant results per page for different permutations of the same query. We observed that these parameters vary with pages of searched results and are different for different permutations of the given query.
   • We analyzed the impact of k-correlation among two subsequent queries given to a search engine. In that, we analyzed the search time and the query space. We observed that
       – The search time is less in case of queries with embedded semantics as compared to randomized queries without any semantic consideration.
       – In case of a randomized query, the query space is increased in case the given query includes a word that is frequently found on the Web, and vice versa.

   Further, we considered a unified criterion for comparison between the search engines. Our criterion is based upon the concept of earned points. An end-user may assign different weights to the different constituents of the criterion, namely latency and query space. Our observations are as follows.

   • We observed that the performance of Yahoo is better in terms of latency; however, Google performs better in terms of query space.
   • We discussed the dominant constituent problem. We discussed that this problem can be mitigated using the concept of contributory earned points if the weights assigned to the constituents are comparable. If both constituents are assigned equal weights, we found that Yahoo is the winner.

   However, the performance of a search engine may depend upon the criterion itself, and a single criterion may not be sufficient for an exact analysis of the performance. Further investigations and improvements in this direction form our future work.

                                                        REFERENCES

 [1] S. Malhotra, "Beyond Google", CyberMedia Magazine on Data Quest, vol. 23, no. 24, pp. 12, December 2005.
 [2] M.R. Henzinger, A. Heydon, M. Mitzenmacher, M. Najork, "Measuring Index Quality Using Random Walks on the Web", Proceedings of 8th International World Wide Web Conference, pp. 213-225, May 1999.
 [3] M.C. Tang, Y. Sun, "Evaluation of Web-Based Search Engines Using User-Effort Measures", Library and Information Science Research Electronic Journal, vol. 13, issue 2, 2003, http://libres.curtin.edu.au/libres13n2/tang.htm.
 [4] C.W. Cleverdon, J. Mills, E.M. Keen, An Inquiry in Testing of Information Retrieval Systems, Cranfield, U.K., 1966.
 [5] J. Gwizdka, M. Chignell, "Towards Information Retrieval Measures for Evaluation of Web Search Engines", http://www.imedia.mie.utoronto.ca/people/jacek/pubs/webIR eval1 99.pdf, 1999.
 [6] D. Rafiei, A.O. Mendelzon, "What is This Page Known For: Computing Web Page Reputations", Elsevier Journal on Computer Networks, vol. 33, pp. 823-835, 2000.
 [7] N. Bhatti, A. Bouch, A. Kuchinsky, "Integrating User-Perceived Quality into Web Server Design", Elsevier Journal on Computer Networks, vol. 33, pp. 1-16, 2000.
 [8] S. Brin, L. Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine", http://www-db.stanford.edu/pub/papers/google.pdf, 2000.
 [9] J. Kleinberg, "Authoritative Sources in a Hyperlinked Environment", Proceedings of 9th ACM/SIAM Symposium on Discrete Algorithms, 1998.
[10] R. Motwani, P. Raghavan, Randomized Algorithms, Cambridge University Press, August 1995.

								