					Impact of Query Correlation on Web Searching
Ash Mohammad Abbas Department of Computer Engineering Zakir Husain College of Engineering and Technology Aligarh Muslim University, Aligarh - 202002, India.

Abstract— Correlation among queries is an important factor to analyze, as it may affect the results delivered by a search engine. In this paper, we analyze correlation among queries and how it affects the information retrieved from the Web. We analyze two types of queries: (i) queries with embedded semantics, and (ii) queries without any semantics. In our analysis, we consider parameters such as search latencies and search relevance. We focus on two major search portals that are mainly used by end users. Further, we discuss a unified criterion for comparing the performance of the search engines. Index Terms— Query correlation, search portals, Web information retrieval, unified criterion for comparison, earned points.

I. INTRODUCTION

The Internet, which was originally intended to support communication of research activities among a few universities in the United States, has now become a basic need of life for people throughout the world who can read and write. This has become possible largely due to the proliferation of the World Wide Web (WWW), now simply called the Web. The Web has become the largest source of information in nearly all spheres of life, and users from different domains extract information that fits their needs. The term Web information retrieval1 is used for extracting information from the Web. Although Web information retrieval has its roots in traditional database systems [4], [5], retrieving information from the Web is more complex than retrieving information from a traditional database. This is due to subtle differences in their respective underlying databases2. In a traditional database, the data is often organized, limited, and static. In contrast, the Webbase is unorganized, unlimited, and often dynamic; every second a large number of updates are carried out in it. Moreover, as opposed to a traditional database, which is controlled by a specific operating system and whose data is located either at a central location or at least at a few known locations, the Webbase is not controlled by any specific operating system, and its data may not reside at a central site or at a few known locations. Further, the Webbase can be thought of as a collection of a large number of traditional databases of various organizations. The expectations of a user searching information
1 The terms Web surfing, Web searching, Web information retrieval, Web mining are often used in the same context. However, they differ depending upon the methodologies involved, intensity of seeking information, and intentions of users who extract information from the Web. 2 Let us use the term Webbase for the collection of data in case of the Web, in order to differentiate it from the traditional database.

on the Web are much higher than those of a user who is simply retrieving some information from a traditional database. This makes the task of extracting information from the Web challenging [1]. Since Web searching is an important activity and the results obtained may affect decisions and directions for individuals as well as for organizations, it is of utmost importance to analyze the parameters and constituents involved in it. Researchers have analyzed many different issues pertaining to Web searching, including index quality [2], user-effort measures [3], Web page reputation [6], and user-perceived quality [7]. In this paper, we try to answer the following question: What happens when a user fires correlated queries to a search engine one by one? Specifically, we wish to evaluate the effect of correlation among the queries submitted to a search engine (or a search portal). The rest of this paper is organized as follows. In section II, we briefly review the methodologies used in popular search engines. In section III, we describe query correlation. Section IV contains results and discussion. In section V, we describe a criterion for comparing search engines. Finally, section VI concludes the paper and outlines future work.

II. A REVIEW OF METHODOLOGIES USED IN SEARCH ENGINES
First we discuss a general strategy employed for retrieving information from the Web, and then we review some of the search portals.

A. A General Strategy for Searching

A general strategy for searching information on the Web is shown in Fig. 1. Broadly, a search engine consists of the following components: User Interface, Query Dispatcher, Cache3, Server Farm, and Web Base. The way these components interact with one another depends upon the strategy employed in a particular search engine; we describe here a broad view. An end user fires a query using an interface, say the User Interface. The User Interface provides a form to the user, and the user fills the form with a set of keywords to be searched. The query goes to the Query Dispatcher which, after performing some refinements, sends it to the Cache.
3 We use the word Cache to mean the Search Engine Cache, i.e., the storage space where results matching previously fired queries or words are kept for future use.

Fig. 1. A general strategy for information retrieval from the Web. (The figure shows the User Interface, Query Dispatcher, Cache, Server Farm, and Web Base, with numbered arrows tracing a query from the user through the dispatcher and cache to the Web Base, and the response back to the user.)
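The dispatcher-and-cache interplay just described can be sketched in a few lines of Python; the refine_query helper, the dictionary cache, and the toy server farm below are illustrative stand-ins, not the internals of any real portal.

    # Minimal sketch of the query flow of Fig. 1 (illustrative only).
    def refine_query(query: str) -> str:
        """Normalize the query: lower-case it and drop a few common stop words."""
        stop_words = {"a", "an", "the", "is", "am", "are", "will", "shall", "of", "in", "for"}
        return " ".join(w for w in query.lower().split() if w not in stop_words)

    class QueryDispatcher:
        def __init__(self, server_farm):
            self.cache = {}               # refined query -> cached result list
            self.server_farm = server_farm

        def search(self, query: str):
            refined = refine_query(query)
            if refined in self.cache:             # cache hit: answer immediately
                return self.cache[refined]
            results = self.server_farm(refined)   # cache miss: consult the server farm / Web Base
            self.cache[refined] = results         # store for future (possibly correlated) queries
            return results

    # Usage with a toy "server farm":
    toy_webbase = lambda q: [f"page about {q} #{i}" for i in range(1, 4)]
    dispatcher = QueryDispatcher(toy_webbase)
    print(dispatcher.search("the node disjoint multipath routing"))  # miss, then cached
    print(dispatcher.search("node disjoint multipath routing"))      # hit (same refined form)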

If the query obtained after refinement4 matches a query in the Cache, the results are immediately sent by the Query Dispatcher to the User Interface and hence to the user. Otherwise, the Query Dispatcher sends the query to one of the servers in the Server Farm, which are busy building a Web Base for the search engine. The server so contacted, after consulting the Web Base, sends the results to the Cache so that the Cache may store them for future reference, if any. The Cache sends them to the Query Dispatcher and, finally, the response is returned to the end user through the User Interface. In what follows, we briefly review the strategies employed by different search portals.

B. Review of Strategies of Search Portals

The major search portals or search engines5 that end users generally use for searching are Google(TM) and Yahoo(TM). Let us briefly review the methodologies behind the respective search engines6 of these search portals. Google is based on the PageRank scheme described in [8]. It is somewhat similar to the scheme proposed by Kleinberg in [9], which is based on hub and authority weights and focuses on the citations of a given page. To understand Google's strategy, one should first understand the HITS (Hyperlink-Induced Topic Search) algorithm proposed by Kleinberg; readers are directed to [9] for HITS and to [8] for PageRank. On the other hand, Yahoo employs an ontology based search engine. An ontology is a formal term used to mean a hierarchical structure of related terms (or keywords), where the relationships among the keywords are governed by a set of rules. As a result, an ontology based search engine such as Yahoo may also search other related terms that are part of the ontology of the given term. Further, an ontology based search engine may not search words that are not part of its ontology, and it may modify its ontology over time. Moreover, an ontology based search engine may prune, before presenting the results to the end user, results that are not part of the ontology of the given term.
4 By refinement of a query, we mean that the given query is transformed so that words and forms that are not important are eliminated and do not affect the results. 5 A search engine is a part of a search portal; a search portal provides many other facilities or services, such as Advanced Search, News, etc. 6 The respective products are trademarks of their organizations.
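To make the page-ranking idea of [8] concrete, here is a minimal power-iteration sketch on a toy link graph; the graph, the damping factor of 0.85, and the iteration count are illustrative assumptions rather than details of any deployed engine.

    # Power-iteration PageRank on a toy link graph (illustrative sketch).
    def pagerank(links, damping=0.85, iterations=50):
        """links: dict mapping a page to the list of pages it links to."""
        pages = list(links)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1.0 - damping) / n for p in pages}
            for p, outgoing in links.items():
                if not outgoing:                      # dangling page: spread its rank evenly
                    for q in pages:
                        new_rank[q] += damping * rank[p] / n
                else:
                    share = damping * rank[p] / len(outgoing)
                    for q in outgoing:
                        new_rank[q] += share
            rank = new_rank
        return rank

    toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
    print(pagerank(toy_graph))  # C, being the most cited page, accumulates the highest rank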

We now describe an important aspect pertaining to information retrieval from the Web. The results delivered by a search engine may depend on how the queries are formulated and on what relation a given query has with previously fired queries, if any. We wish to study the effect of correlation among the queries submitted to a search engine.

III. QUERY CORRELATION

The searched results may differ depending upon whether a search engine treats a set of words as an ordered set or as an unordered set. In what follows, we consider each of these cases.

A. Permutations

The results delivered by a search engine may depend upon the order of the words appearing in a given query7. If we take the order of words into account, the same set of words may form different queries for different orderings. The different orderings of the set of words of the given query are called permutations. The formal definition of a permutation of a given query is as follows.

Definition 1: Let the query Q = {w_i | 1 <= i <= m}, Q != {}, be a set of words excluding the stop words of a natural language. Let P = {x_j | 1 <= j <= m} be a set of words excluding stop words. If P is such that for every w_i in Q there exists an x_j in P with w_i = x_j, where j is not necessarily equal to i, then P is called a permutation of Q.

In the above definition, stop words are language dependent.

7 The term ’query’ means a set of words that is given to a search engine to search for the information available on the Web.

For example, in the English language the set of stop words, S, is often taken as

S = {a, an, the, is, am, are, will, shall, of, in, for, ...}.
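Following Definition 1, a small sketch that strips the stop words listed above and enumerates the permutations of the remaining words; the stop-word set here is just the illustrative English subset given in the text.

    # Enumerate the permutations of a query after removing stop words (sketch).
    from itertools import permutations

    STOP_WORDS = {"a", "an", "the", "is", "am", "are", "will", "shall", "of", "in", "for"}

    def query_permutations(query: str):
        words = [w for w in query.split() if w.lower() not in STOP_WORDS]
        return [" ".join(p) for p in permutations(words)]

    perms = query_permutations("Ash Mohammad Abbas")
    print(len(perms))   # m = 3 words, so m! = 6 permutations
    for p in perms:
        print(p)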

Note that if there are m words (excluding the stop words) in the given query, the number of permutations is m!. The permutations are concerned with a single query. By submitting different permutations of the given query to a search engine, one may evaluate how the search engine behaves for different orderings of the same set of words. However, one would also like to know how the given search engine behaves when an end user fires different queries that may or may not be related. Specifically, one would be interested in the behavior of a given search engine when the queries are related. In what follows, we discuss what is meant by correlation among different queries.

B. Correlation

An important aspect that may affect the results of Web searching is how different queries are related. Two queries are said to be correlated if there are common words between them. A formal definition of correlation among queries is as follows.

Definition 2: Let Q1 and Q2 be queries given to a search engine such that Q1 and Q2 are non-empty sets of words of a natural language. Q1 and Q2 are said to be correlated if and only if there exists a set C = Q1 ∩ Q2, C != {}.

One may use the above definition to define k-correlation between any two queries. Formally, it can be stated as a corollary of Definition 2.

Corollary 1: Two queries are said to be k-correlated if and only if |C| = k, where |.| denotes cardinality.

For two queries that are correlated, we define a parameter called the Correlation Factor8 as follows:

Correlation Factor = |Q1 ∩ Q2| / |Q1 ∪ Q2|.    (1)

This is based on the fact that |Q1 ∪ Q2| = |Q1| + |Q2| - |Q1 ∩ Q2|. Note that 0 <= Correlation Factor <= 1. For two uncorrelated queries the Correlation Factor is 0. Further, one can see from Definition 1 that for permutations of the same query the Correlation Factor is 1. Similarly, one may define the Correlation Factor for a cluster of queries. Let the number of queries be O. The cardinality of the union of the given cluster of queries is given by the following equation.
|Q_1 ∪ Q_2 ∪ ... ∪ Q_O| = Σ_i |Q_i| - Σ_{i<j} |Q_i ∩ Q_j| + Σ_{i<j<k} |Q_i ∩ Q_j ∩ Q_k| - ... + (-1)^(O-1) |Q_1 ∩ Q_2 ∩ ... ∩ Q_O|.    (2)

Using (2), one may define the Correlation Factor of a cluster of queries as follows:

Correlation Factor = |∩_{o=1}^{O} Q_o| / |∪_{o=1}^{O} Q_o|.    (3)

A high correlation factor means that the queries in the cluster are highly correlated, and vice versa. In what follows, we discuss results pertaining to query correlation.

8 This correlation factor is nothing but Jaccard's Coefficient, which is often used as a measure of similarity.
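The measures above translate directly into code; the sketch below computes the k of Corollary 1, the Correlation Factor of (1), and the cluster version of (3), using example queries taken from Table III. Stop-word removal is omitted for brevity.

    # k-correlation and Correlation Factor (Jaccard coefficient) for queries (sketch).
    def correlation_k(q1: str, q2: str) -> int:
        """Number of common words, i.e. k in Corollary 1."""
        return len(set(q1.split()) & set(q2.split()))

    def correlation_factor(q1: str, q2: str) -> float:
        """Equation (1): |Q1 ∩ Q2| / |Q1 ∪ Q2|."""
        s1, s2 = set(q1.split()), set(q2.split())
        return len(s1 & s2) / len(s1 | s2) if (s1 | s2) else 0.0

    def cluster_correlation_factor(queries) -> float:
        """Equation (3): |∩ Q_o| / |∪ Q_o| over a cluster of queries."""
        sets = [set(q.split()) for q in queries]
        inter = set.intersection(*sets)
        union = set.union(*sets)
        return len(inter) / len(union) if union else 0.0

    q1 = "node disjoint multipath routing"
    q2 = "edge disjoint multicast routing"
    print(correlation_k(q1, q2))          # 2 (disjoint, routing)
    print(correlation_factor(q1, q2))     # 2 / 6
    print(cluster_correlation_factor([q1, q2, "node disjoint multipath"]))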

Fig. 2. Latency versus page number for permutation P1 (curves: Google and Yahoo; latency in seconds over result pages 1-10).

Fig. 3. Latency versus page number for permutation P2 (curves: Google and Yahoo).

Fig. 4. Latency versus page number for permutation P3 (curves: Google and Yahoo).

Fig. 5. Latency versus page number for permutation P4 (curves: Google and Yahoo).

TABLE I
SEARCH LATENCIES, QUERY SPACE, AND THE NUMBER OF RELEVANT RESULTS FOR DIFFERENT PERMUTATIONS OF THE QUERY "Ash Mohammad Abbas": GOOGLE.
(The query space was 300000 for every permutation and every result page; each permutation therefore has two rows, latency in seconds and relevant results, over pages p1-p10.)

Perm      p1    p2    p3    p4    p5    p6    p7    p8    p9    p10
1  latency  0.22  0.15  0.04  0.08  0.33  0.29  0.15  0.13  0.16  0.17
   relevant 8     5     1     0     2     0     0     3     0     0
2  latency  0.51  0.15  0.22  0.19  0.13  0.12  0.10  0.27  0.16  0.15
   relevant 3     2     2     1     1     0     2     0     0     0
3  latency  0.30  0.08  0.18  0.20  0.14  0.25  0.13  0.21  0.14  0.21
   relevant 6     4     1     3     2     1     1     0     0     0
4  latency  0.60  0.07  0.35  0.11  0.13  0.15  0.23  0.13  0.28  0.26
   relevant 3     0     2     1     0     0     2     0     1     1
5  latency  0.38  0.09  0.39  0.14  0.17  0.15  0.14  0.16  0.15  0.13
   relevant 3     2     1     2     1     0     0     1     1     1
6  latency  0.36  0.15  0.10  0.12  0.18  0.17  0.15  0.13  0.20  0.15
   relevant 5     4     1     3     0     2     1     2     2     0

TABLE II
SEARCH LATENCIES, QUERY SPACE, AND THE NUMBER OF RELEVANT RESULTS FOR DIFFERENT PERMUTATIONS OF THE QUERY "Ash Mohammad Abbas": YAHOO.
(Each permutation has three rows: latency in seconds, query space, and relevant results, over pages p1-p10.)

Perm      p1     p2     p3     p4     p5     p6     p7     p8     p9     p10
1  latency  0.15   0.15   0.27   0.24   0.25   0.23   0.34   0.21   0.27   0.30
   space    26100  26400  26400  27000  27000  26900  26900  26900  25900  25900
   relevant 10     4      1      0      0      1      0      0      0      0
2  latency  0.18   0.13   0.20   0.15   0.19   0.10   0.15   0.09   0.12   0.13
   space    26900  27000  27000  26900  25800  26900  26900  26800  26800  26800
   relevant 4      6      1      1      0      1      1      1      0      0
3  latency  0.12   0.11   0.15   0.14   0.11   0.10   0.11   0.12   0.09   0.13
   space    26900  27100  26900  26900  26500  26800  26800  26500  26500  26700
   relevant 10     3      1      2      0      0      0      0      0      0
4  latency  0.03   0.10   0.14   0.13   0.12   0.20   0.10   0.19   0.12   0.17
   space    27000  26400  26400  26700  27000  26700  26400  26900  26800  26800
   relevant 7      4      0      2      1      0      0      1      0      1
5  latency  0.12   0.12   0.20   0.08   0.13   0.10   0.12   0.09   0.13   0.20
   space    26400  26800  26800  26800  26800  26700  26700  26800  26700  26200
   relevant 8      5      1      1      0      0      0      0      0      1
6  latency  0.16   0.10   0.16   0.12   0.13   0.11   0.10   0.11   0.12   0.15
   space    27100  26700  27100  26700  26600  27000  26600  26900  26500  26500
   relevant 10     5      0      0      0      1      0      0      0      0

Fig. 6. Latency versus page number for permutation P5 (curves: Google and Yahoo).

Fig. 7. Latency versus page number for permutation P6 (curves: Google and Yahoo).

IV. RESULTS AND DISCUSSION

The search portals that we have evaluated are Google and Yahoo. We have chosen them because they represent the search portals that the majority of end users use in their day-to-day searching. One more reason for choosing them for performance evaluation is that they represent different

classes of search engines. As mentioned earlier, Yahoo is based on ontology while Google is based on page ranks; therefore, by selecting them, one may evaluate two distinct classes of search engines. The search environment was as follows. The client from which the queries were fired was a Pentium III machine. The machine was part of a 512 Kbps local area network, and the operating system was Windows XP. In what follows, we discuss the behavior of the search engines for different permutations of a query.

Fig. 8. Latency versus correlation for queries with embedded semantics (curves: Google:Q1, Google:Q2, Yahoo:Q1, Yahoo:Q2; correlation ranges from 1 to 4).

A. Query Permutations

To see how a search engine behaves for different permutations of a query, we consider the following query: Ash Mohammad Abbas. The different permutations of this query are:

1  Ash Mohammad Abbas
2  Ash Abbas Mohammad
3  Abbas Ash Mohammad
4  Abbas Mohammad Ash
5  Mohammad Ash Abbas
6  Mohammad Abbas Ash

Fig. 9. Latency versus correlation for random queries (curves: Google:Q1, Google:Q2, Yahoo:Q1, Yahoo:Q2).

Fig. 10. Query space versus correlation for queries with embedded semantics (curves: Google:Q1, Google:Q2, Yahoo:Q1, Yahoo:Q2).

We have assigned a number to each permutation to differentiate them from one another. We wish to analyze the search results on the basis of search time, the number of relevant results, and query space. The query space is simply the cardinality of the set of all results returned by a given search engine in response to a given query. Note that search time is defined as the actual time taken by the search engine to deliver the searched results. Ideally, it does not depend upon the speeds of the hardware, software, and network components from which the queries are fired, because it is the time taken by the search engine server. Relevant results are those which the user intends to find. For example, the user intends to search for information about Ash Mohammad Abbas9; therefore, all those results that contain Ash Mohammad Abbas are relevant for the given query. In what follows, we discuss the results obtained for different permutations of the given query. Let the given query be Ash Mohammad Abbas. For all permutations, all those results that contain Ash Mohammad Abbas are counted as relevant results. Since both Google and Yahoo deliver the results page wise, we list all the parameters mentioned in the previous paragraph page wise. We go up to 10 pages for both search engines, as the results beyond that are rarely significant. Table I shows the search latencies, query space, and the number of relevant results for different permutations of the given query; the search portal is Google. Our observations are as follows.
- For all permutations, the query space remains the same and does not vary along the pages of results.
- The time to search the first page of results in response to the given query is the largest for all permutations.
- The first page of results contains the most relevant results.

Fig. 11. Query space versus correlation for random queries (curves: Google:Q1, Google:Q2, Yahoo:Q1, Yahoo:Q2).

9 We have intentionally taken the query Ash Mohammad Abbas. We wish to search for different permutations of a query and to observe the effect of those permutations on the query space and on the number of relevant results. Relevance is partly related to the intentions of an end user. Since we already know what the relevant results for the chosen query are, it is easier to decide which of them have been returned by a search engine. The reader may take any other query if he/she wishes; in that case, he/she has to decide which results are relevant to the query, and this will partly depend upon what he/she intended to search.

TABLE III
QUERIES WITH EMBEDDED SEMANTICS.

S. No.  Query No.  Query                                             Correlation
E1      Q1         node disjoint multipath                           1
        Q2         edge disjoint multicast
E2      Q1         node disjoint multipath routing                   2
        Q2         edge disjoint multicast routing
E3      Q1         node disjoint multipath routing                   3
        Q2         edge disjoint multipath routing
E4      Q1         node disjoint multipath routing                   4
        Q2         wireless node disjoint multipath ad hoc routing

TABLE IV
QUERIES WITHOUT EMBEDDED SEMANTICS (RANDOM QUERIES).

S. No.  Query No.  Query                                             Correlation
R1      Q1         adhoc node ergonomics                             1
        Q2         quadratic power node
R2      Q1         computer node constellations parity               2
        Q2         hiring parity node biased
R3      Q1         wireless node parity common mitigate              3
        Q2         mitigate node shallow rough parity
R4      Q1         few node parity mitigate common correlation       4
        Q2         shallow mitigate node parity common stanza

TABLE V
SEARCH TIME AND QUERY SPACE FOR QUERIES WITH EMBEDDED SEMANTICS.

S. No.  Query No.  Google Time  Google Query Space  Yahoo Time  Yahoo Query Space
E1      Q1         0.27         43100               0.37        925
        Q2         0.23         63800               0.28        1920
E2      Q1         0.48         37700               0.40        794
        Q2         0.32         53600               0.32        1660
E3      Q1         0.48         37700               0.40        794
        Q2         0.24         21100               0.34        245
E4      Q1         0.31         23500               0.64        79
        Q2         0.33         25600               0.44        518

TABLE VI
SEARCH TIME AND QUERY SPACE FOR RANDOM QUERIES.

S. No.  Query No.  Google Time  Google Query Space  Yahoo Time  Yahoo Query Space
R1      Q1         0.44         28500               0.57        25
        Q2         0.46         476000              0.28        58200
R2      Q1         0.46         34300               0.55        164
        Q2         0.42         25000               0.35        90
R3      Q1         0.47         25000               0.40        233
        Q2         0.33         754                 0.68        31
R4      Q1         0.34         20000               0.58        71
        Q2         1.02         374                 0.64        23

Table II shows the same set of parameters for different permutations of the given query for the search portal Yahoo. From the table, we observe the following.
- As opposed to Google, the query space does not remain the same; rather, it varies with the pages of the searched results.
- The query space in this case is smaller than that of Google.
- The time to search the first page of results is not necessarily the largest among the pages considered. More precisely, it is larger for the pages where there is no relevant result. Further, the time taken by Yahoo is less than that of Google.
- In most of the cases, the first page contains the largest number of relevant results. For permutation 2 (i.e., Ash Abbas Mohammad), the second page contains the largest number of relevant results.

Let us discuss the reasons for the above mentioned observations. Consider the question of why the query space in the case of Google is larger than that of Yahoo. We have pointed out that Google is based on page ranks. For a given query (or a set of words), it ranks the pages and delivers all the ranked pages that contain the words of the given query. On the other hand, Yahoo is an ontology based search engine; as mentioned earlier, it will search only that part of its Webbase that constitutes the ontology of the given query.

TABLE VII
LATENCY MATRIX, L, FOR DIFFERENT PERMUTATIONS.

P   p1   p2   p3   p4   p5   p6   p7   p8   p9   p10
1   1    0    1    1    0    0    1    1    1    1
2   0    1    0    0    1    0    1    0    0    0
3   0    1    0    0    0    0    0    0    0    0
4   0    1    0    1    0    1    0    1    0    0
5   0    0    0    0    0    0    0    0    0    1
6   0    1/2  1    1/2  0    0    0    0    0    1/2

TABLE VIII
QUERY SPACE MATRIX, S, FOR DIFFERENT PERMUTATIONS.

P   p1   p2   p3   p4   p5   p6   p7   p8   p9   p10
1   1    1    1    1    1    1    1    1    1    1
2   1    1    1    1    1    1    1    1    1    1
3   1    1    1    1    1    1    1    1    1    1
4   1    1    1    1    1    1    1    1    1    1
5   1    1    1    1    1    1    1    1    1    1
6   1    1    1    1    1    1    1    1    1    1

TABLE IX
RELEVANCE MATRIX FOR DIFFERENT PERMUTATIONS FOR GOOGLE.

P   p1   p2   p3   p4   p5   p6   p7   p8   p9   p10
1   8    5    1    0    2    0    0    3    0    0
2   3    2    2    1    1    0    2    0    0    0
3   6    4    1    3    2    1    1    0    0    0
4   3    0    2    1    0    0    2    0    1    1
5   3    2    1    2    1    0    0    1    1    1
6   5    4    1    3    0    2    1    2    2    0

TABLE X
RELEVANCE MATRIX FOR DIFFERENT PERMUTATIONS FOR YAHOO.

P   p1   p2   p3   p4   p5   p6   p7   p8   p9   p10
1   10   4    1    0    0    1    0    0    0    0
2   4    6    1    1    0    1    1    1    0    0
3   10   3    1    2    0    0    0    0    0    0
4   7    4    0    2    1    0    0    1    0    1
5   8    5    1    1    0    0    0    0    0    1
6   10   5    0    0    0    1    0    0    0    0

This is the reason why the query space in the case of Google is larger than that of Yahoo. Let us now answer the question of why the query space changes in the case of Yahoo and why it remains constant in the case of Google. Note that an ontology may change with time and with the order of the words in the given query. For every page of results, Yahoo estimates the ontology of the given permutation of the query before delivering the results to the end user. Therefore, the query space for different permutations of the given query is different, and it changes with the pages of the searched results10. However, page ranks change neither with pages nor with the order of words. The page ranks will change only when new links or documents that are relevant to the given query are added to the Web. Since neither a new link nor a new document was added to the Web during the evaluation of the permutations of the query, the query space does not change in the case of Google. In order to compare the performance of Google and Yahoo, the latencies versus page numbers for different permutations of the query are shown in Figures 2 through 7. Let us consider the question of why the search time in the case of Google is larger than that of Yahoo. Note that Google ranks the results before delivering them to end users, while Yahoo does not. The ranking of pages takes time. This is the reason why the search time taken by Google is larger than that of Yahoo. In what follows, we discuss how a search engine behaves for correlated queries.

B. Query Correlation

We have formulated k-correlated queries as shown in Table III. Since all the words contained in a query are related11, we call them queries with embedded semantics.
10 This observed behavior may also be due to the use of a randomized algorithm. To understand the behavior of randomized algorithms, readers are referred to any text on randomized algorithms, such as [10]. 11 More precisely, all words in these queries are from ad hoc wireless networks, an area in which the authors of this paper work.

On the other hand, we have another set of k-correlated queries, as shown in Table IV. The words contained in these queries are random and are not related semantically. We wish to evaluate the performance of a search engine for k-correlated queries. For that, we evaluate the search time and the query space of a search engine for the first page of results. Since both Google and Yahoo deliver 10 results per page, looking at the first page of results means that we are evaluating the 10 topmost results of these search engines. Note that we do not consider the number of relevant results, because relevance in this case would be query dependent; since there is no single query, an evaluation of relevance would not be very useful. Table V shows the search time and query space for k-correlated queries with embedded semantics (see Table III). The second query, Q2, is fired after the first query, Q1. On the other hand, Table VI shows the search time and query space for k-correlated queries whose words may not be related (see Table IV).
TABLE XI
RELEVANCE FOR DIFFERENT PERMUTATIONS.

P       Google   Yahoo
1       19       16
2       11       15
3       18       16
4       10       16
5       12       16
6       20       16
Total   90       95

TABLE XII
EARNED POINTS (EP) FOR DIFFERENT PERMUTATIONS.

        Google                            Yahoo
P       Latency   Query Space   EP        Latency   Query Space   EP
1       14.5      19            33.5      3         0             3
2       3         11            14        14        0             14
3       4         18            22        13        0             13
4       1         10            11        9         0             9
5       3         12            15        10        0             10
6       2.5       20            22.5      16        0             16
Total   28        90            118       65        0             65

In order to compare the performance of Yahoo and Google, the latency versus correlation for queries with embedded semantics is shown in Figure 8 and that for random queries in Figure 9. Similarly, the query space versus correlation for queries with embedded semantics is shown in Figure 10 and that for random queries in Figure 11. The query space of Yahoo is much smaller than that of Google, for the reasons discussed in the previous subsection. Other important observations are as follows.
- In the case of k-correlated queries with embedded semantics, the time to search for Q2 is generally less than that for Q1. This is due to the fact that, since the queries are correlated, some of the words of Q2 have already been searched while searching for Q1.
- The query space increases when the given query has a word that is frequently found in Web pages (e.g., in R1:Q2 the word quadratic, which is frequently used in Engineering, Science, Maths, Arts, etc.).
- The query space decreases when the query includes a word that is rarely used (e.g., mitigate, included in R3,R4:Q1,Q2, and shallow, included in R3,R4:Q2).
- The search time is larger for random queries than for queries with embedded semantics. The reason for this observation is as follows: in the case of queries with embedded semantics, the words of a given query are related and are found in Web pages that are not too far from one another, either from the point of view of page rank, as in Google, or from the point of view of ontology, as in Yahoo.
- One cannot infer anything general about the relative search times of Google and Yahoo, as they depend upon the query. More precisely, they depend upon which strategy takes more time: computing page ranks in Google or estimating the ontology in Yahoo.

However, from Table V and Table VI, one can infer the following.
- Google is better in the sense that its query space is much larger than that of Yahoo. However, Yahoo takes less time than Google for different permutations of the same query.
- For k-correlated queries with embedded semantics, Google takes less time to search for the first query than Yahoo. This also applies to random queries, with some exceptions; in the exceptional cases, Google takes much more time than Yahoo. As mentioned previously, this depends upon the given query as well as upon the strategy employed in the search engine.

In what follows, we describe a unified criterion for comparing the search engines considered in this paper.

V. A UNIFIED CRITERION FOR COMPARISON

Let us denote Google by a superscript '1' and Yahoo by a superscript '0'12. Let L = [l_ij] be a matrix where l_ij is defined as follows:

l_ij = 1 if latency^1_ij < latency^0_ij;  l_ij = 1/2 if latency^1_ij = latency^0_ij;  l_ij = 0 otherwise.    (4)

12 This is simply a representation. One may consider the reverse representation; even then, there will be no effect on the criterion.

Similarly, let S = [s_ij] be a matrix where s_ij is defined as follows:

s_ij = 1 if space^1_ij > space^0_ij;  s_ij = 1/2 if space^1_ij = space^0_ij;  s_ij = 0 otherwise.    (5)

In the matrices defined above, a '1' means that at that place Google is the winner, and a '1/2' represents a tie between Google and Yahoo. We now define a parameter that we call Earned Points (EP), as follows:

EP^k = Σ_{i=1}^{pages} relevant^k_i (L^k_i + S^k_i),    (6)

where the superscript k ∈ {0, 1} denotes the search engine.
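The mechanics of (4)-(6) can be traced with a short script; the per-page values below are the first three result pages of permutation 1 taken from Tables I and II (the EP values in Table XII use all ten pages), and the helper names are our own illustrative choices.

    # Earned Points for one permutation, following (4)-(6) (sketch with partial data).
    def smaller_wins(a: float, b: float) -> float:
        """Return 1 if a < b, 1/2 on a tie, 0 otherwise."""
        if a == b:
            return 0.5
        return 1.0 if a < b else 0.0

    def earned_points(my_latency, other_latency, my_space, other_space, my_relevant):
        """EP of one engine against the other over the pages of one permutation, eq. (6)."""
        ep = 0.0
        for lat, olat, sp, osp, rel in zip(my_latency, other_latency, my_space, other_space, my_relevant):
            l_score = smaller_wins(lat, olat)   # eq. (4): lower latency wins the page
            s_score = smaller_wins(osp, sp)     # eq. (5): larger query space wins the page
            ep += rel * (l_score + s_score)
        return ep

    # First three result pages of permutation 1 (Tables I and II):
    lat_g, lat_y = [0.22, 0.15, 0.04], [0.15, 0.15, 0.27]
    spc_g, spc_y = [300000] * 3, [26100, 26400, 26400]
    rel_g, rel_y = [8, 5, 1], [10, 4, 1]
    print(earned_points(lat_g, lat_y, spc_g, spc_y, rel_g))  # EP for Google over these pages
    print(earned_points(lat_y, lat_g, spc_y, spc_g, rel_y))  # EP for Yahoo over these pages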
 

TABLE XIII
CEP FOR DIFFERENT PERMUTATIONS FOR GOOGLE.

P       Latency Contribution   Query Space Contribution
1       123.834                5700000
2       61.262                 3300000
3       116.534                5400000
4       35.918                 3000000
5       73.458                 3600000
6       119.370                6000000
Total   530.376                27000000

TABLE XIV
CEP FOR DIFFERENT PERMUTATIONS FOR YAHOO.

P       Latency Contribution   Query Space Contribution
1       101.385                419900
2       107.821                404100
3       131.558                431000
4       308.197                428700
5       130.833                425000
6       121.591                431500
Total   901.385                2540200

Table VII shows the latency matrix, L, for different permutations of the query used in Table I and Table II, and it has been constructed using both of those tables. In the latency matrix there are 40 '0's, 17 '1's, and 3 '1/2's. We observe from the latency matrix that Yahoo is the winner as far as latencies are concerned, since there are 40 '0's out of 60 entries in total. On the other hand, Table VIII shows the query space matrix, S, for different permutations of the same query, constructed using the same tables. One can see that, as far as query space is concerned, Google is the sole winner; in fact, the query space of Google is much larger than that of Yahoo. The relevance matrix for Google is shown in Table IX and that for Yahoo in Table X. The total relevance for the first ten pages is shown in Table XI for both Google and Yahoo. It is seen from Table XI that the total relevance for Google is 90 and that for Yahoo is 95. The average relevance per permutation and per page for Google is 1.5 and that for Yahoo is 1.583; therefore, as far as average relevance is concerned, Yahoo is the winner.

Table XII shows the number of earned points for both Google and Yahoo for different permutations of the query mentioned earlier. We observe that the number of earned points for Google is 118 and that for Yahoo is 65; the number of earned points of Google is far greater than that of Yahoo. The reason behind this is that the query space of Yahoo is always less than that of Google, and hence it does not contribute to Yahoo's earned points. A closer look at the definition of EP reveals that, while defining the parameter EP in (6) together with (4) and (5), we assumed that a search engine either possesses a constituent parameter (latency or query space) or does not possess it at all. The contribution of a parameter is lost whenever the effective contribution of the other parameter by which it is multiplied is zero. Note that our goal behind introducing (6) was to rank the given set of search engines; we call this type of ranking Lossy Constituent Ranking (LCR). We therefore feel that there should be a method of comparison between a given set of search engines that is lossless in nature. For that purpose, we define another parameter that we call Contributed Earned Points (CEP). The definition of CEP is as follows:

CEP^k = Σ_{i=1}^{pages} relevant^k_i (1/d^k_i + q^k_i),    (7)

where the superscript k ∈ {0, 1} denotes the search engine, d denotes the actual latency, and q denotes the actual query space. The reason behind having the inverse of the actual latency in (7) is that the better search engine is the one that takes less time.

Table XIII shows the contributions of latency and query space to CEP for Google; similarly, Table XIV shows the same for Yahoo. We observe that the contribution of latency for Google is 530.376 and that for Yahoo is 901.385. However, the contribution of query space for Google is 27000000 and that for Yahoo is 2540200; in other words, the contribution of query space for Google is approximately 11 times that for Yahoo. Adding these contributions results in a larger CEP for Google than for Yahoo. The CEP defined using (7) has a problem that we call the dominating constituent problem (DCP): the larger parameter suppresses the smaller parameter. Note that the definition of CEP in (7) assumes equal weights for latency and query space. On the other hand, one may be interested in assigning different weights to the constituents of CEP depending upon their importance. Let us rewrite (7) to incorporate weights. Let w_l and w_q be the weights assigned to latency and query space, respectively. Then (7) can be written as follows:

CEP^k = Σ_{i=1}^{pages} relevant^k_i ((1/d^k_i) w_l + q^k_i w_q).    (8)

The weights should be chosen carefully. For example, the weights w_l = 1, w_q = 10^-6 add 27 to the contribution in CEP due to query space for Google and 2.54 for Yahoo. On the other hand, the set of weights w_l = 1, w_q = 10^-5 adds 270 for Google and 25.4 for Yahoo. Table XV shows the contribution of query space to CEP for different sets of weights; note that w_l is fixed to 1 for all sets and only w_q is varied. As w_q is increased beyond 10^-5, the contribution of query space starts dominating over the contribution of latency. The set of weights w_l = 1, w_q = 10^-5 indicates that one can ignore the contribution of query space in comparison to the contribution of latencies, provided that one is mainly interested in comparing search engines with respect to latency. In that case, an approximate expression for CEP can be written as follows:

CEP^k ≈ Σ_{i=1}^{pages} relevant^k_i (1/d^k_i) w_l.    (9)

TABLE XV
CONTRIBUTION DUE TO QUERY SPACE IN CEP FOR DIFFERENT SETS OF WEIGHTS.

Weights                 Google        Yahoo
w_l = 1, w_q = 10^-6    27.00         2.54
w_l = 1, w_q = 10^-5    270.00        25.40
w_l = 1, w_q = 10^-4    2700.00       254.02
w_l = 1, w_q = 10^-3    27000.00      2540.20
w_l = 1, w_q = 10^-2    270000.00     25402.00
w_l = 1, w_q = 10^-1    2700000.00    254020.00
w_l = 1, w_q = 1        27000000      2540200

Alternatively, one may consider an approach that is a combination of the definition of EP given in (6) (together with (4) and (5)) and that of CEP given in (7). In it we use the matrix S, which expresses the contribution of query space in binary form13. The modified definition is as follows:

CCEP^k = Σ_{i=1}^{pages} relevant^k_i (1/d^k_i + S^k_i),    (10)

where S^k_i is in accordance with the definition of S given by (5). The acronym CCEP stands for Combined Contributory Earned Points. If one wishes to incorporate weights, the definition of CCEP becomes

CCEP^k = Σ_{i=1}^{pages} relevant^k_i ((1/d^k_i) w_l + S^k_i w_q).    (11)

13 We mean that the matrix S either registers a contribution of the query space of a search engine, provided that its query space is larger than that of the other one, or registers no contribution of query space at all otherwise.
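A sketch of the CEP and CCEP computations of (7)-(11) follows; the page values and weights in it are illustrative toy numbers, not the measured contributions of Tables XIII-XVI.

    # CEP and CCEP with weights, following (7)-(11) (illustrative sketch).
    def cep(relevant, d, q, w_l=1.0, w_q=1.0):
        """Contributed Earned Points, eq. (8); with w_l = w_q = 1 it reduces to eq. (7)."""
        return sum(r * ((1.0 / di) * w_l + qi * w_q) for r, di, qi in zip(relevant, d, q))

    def ccep(relevant, d, s, w_l=1.0, w_q=1.0):
        """Combined Contributory Earned Points, eq. (11); default weights give eq. (10)."""
        return sum(r * ((1.0 / di) * w_l + si * w_q) for r, di, si in zip(relevant, d, s))

    relevant = [8, 5, 1]               # relevant results on three pages (toy values)
    latency  = [0.22, 0.15, 0.04]      # actual latencies d, in seconds
    space    = [300000, 300000, 300000]
    s_binary = [1, 1, 1]               # this engine had the larger query space on every page

    print(cep(relevant, latency, space))                        # dominated by the query-space term
    print(cep(relevant, latency, space, w_l=1.0, w_q=1e-6))     # query-space term shrunk by w_q
    print(ccep(relevant, latency, s_binary, w_l=0.9, w_q=0.1))  # comparable weights, eq. (11)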

In the definition of CCEP given by (11), the weights can be made comparable, and for comparable weights the dominating constituent problem mentioned earlier is mitigated. We define comparable weights as follows.

Definition 3: A set of weights W = {w_i | w_i > 0} is said to have comparable weights if and only if Σ_i w_i = 1 and the condition 1/9 <= w_i / w_j <= 9 is satisfied for all w_i, w_j ∈ W.

Table XVI shows the values of CCEP for different sets of comparable weights. We observe that the rate of decrease of CCEP for Yahoo is larger than that for Google. For example, for w_l = 0.9, w_q = 0.1, the CCEP for Google is 486.3384 and that for Yahoo is 811.2465; for w_l = 0.8, w_q = 0.2, the CCEP for Google is 442.3008 and that for Yahoo is 721.1080. In other words, the rate of decrease in CCEP for Google is 9.05% and that for Yahoo is 11.11%. The reason is that in the query space matrix, S (see Table VIII), all entries are '1', which means that the query space of Google is always larger than that of Yahoo. Therefore, in the case of Yahoo, the contribution due to query space is always 0, irrespective of the weight assigned to it, whereas in the case of Google the contribution due to query space is nonzero and increases as the weight assigned to it increases. Moreover, for the set of weights W = {w_l = 0.5, w_q = 0.5}, the values of CCEP are 310.1880 and 450.6925 for Google and Yahoo, respectively. This means that if one wishes to assign equal weights to latency and query space, then Yahoo is the winner in terms of the parameter CCEP. In the case of CCEP, the effect of the dominating constituent problem is smaller than in the case of CEP; in other words, the effect of the large values of query space is fairly small in the case of CCEP as compared to CEP. This is in line with our remark that with the use of CCEP the dominating constituent problem is mitigated.

TABLE XVI
CCEP FOR DIFFERENT SETS OF COMPARABLE WEIGHTS.

Weights                 Google      Yahoo
w_l = 0.9, w_q = 0.1    486.3384    811.2465
w_l = 0.8, w_q = 0.2    442.3008    721.1080
w_l = 0.7, w_q = 0.3    398.2632    630.9695
w_l = 0.6, w_q = 0.4    354.2256    540.8310
w_l = 0.5, w_q = 0.5    310.1880    450.6925
w_l = 0.4, w_q = 0.6    266.1504    360.5540
w_l = 0.3, w_q = 0.7    222.1128    270.4155
w_l = 0.2, w_q = 0.8    178.0752    180.2770
w_l = 0.1, w_q = 0.9    134.0376    90.1385
VI. CONCLUSIONS

In this paper, we analyzed the impact of correlation among queries on the search results of two representative search portals, namely Google and Yahoo. The major accomplishments of the paper are as follows.
- We analyzed the search time, the query space, and the number of relevant results per page for different permutations of the same query. We observed that these parameters vary with the pages of the searched results and are different for different permutations of the given query.
- We analyzed the impact of k-correlation among two subsequent queries given to a search engine. In that context we analyzed the search time and the query space. We observed that the search time is less for queries with embedded semantics than for randomized queries without any semantic consideration, and that, for random queries, the query space increases when the given query includes a word that is frequently found on the Web, and vice versa.
- Further, we considered a unified criterion for comparison between the search engines. Our criterion is based upon the concept of earned points. An end user may assign different weights to the different constituents of the criterion, namely latency and query space.

Our observations are as follows. The performance of Yahoo is better in terms of latencies; however, Google performs better in terms of query space. We discussed the dominant constituent problem and showed that it can be mitigated using the concept of contributory earned points if the weights assigned to the constituents are comparable. If both constituents are assigned equal weights, we found that Yahoo is the winner. However, the performance of a search engine may depend upon the criterion itself, and a single criterion may not be sufficient for an exact analysis of the performance. Further investigations and improvements in this direction form our future work.

REFERENCES

[1] S. Malhotra, "Beyond Google", CyberMedia Magazine on Data Quest, vol. 23, no. 24, p. 12, December 2005.
[2] M.R. Henzinger, A. Haydon, M. Mitzenmacher, M. Nozark, "Measuring Index Quality Using Random Walks on the Web", Proceedings of the 8th International World Wide Web Conference, pp. 213-225, May 1999.
[3] M.C. Tang, Y. Sun, "Evaluation of Web-Based Search Engines Using User-Effort Measures", Library and Information Science Research Electronic Journal, vol. 13, issue 2, 2003, http://libres.curtin.edu.au/libres13n2/tang.htm.
[4] C.W. Cleverdon, J. Mills, E.M. Keen, An Inquiry in Testing of Information Retrieval Systems, Cranfield, U.K., 1966.
[5] J. Gwidzka, M. Chignell, "Towards Information Retrieval Measures for Evaluation of Web Search Engines", http://www.imedia.mie.utoronto.ca/people/jacek/pubs/webIR_eval1_99.pdf, 1999.
[6] D. Rafiei, A.O. Mendelzon, "What is This Page Known For: Computing Web Page Reputations", Elsevier Journal on Computer Networks, vol. 33, pp. 823-835, 2000.
[7] N. Bhatti, A. Bouch, A. Kuchinsky, "Integrating User-Perceived Quality into Web Server Design", Elsevier Journal on Computer Networks, vol. 33, pp. 1-16, 2000.
[8] S. Brin, L. Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine", http://www-db.stanford.edu/pub/papers/google.pdf, 2000.
[9] J. Kleinberg, "Authoritative Sources in a Hyperlinked Environment", Proceedings of the 9th ACM/SIAM Symposium on Discrete Algorithms, 1998.
[10] R. Motwani, P. Raghavan, Randomized Algorithms, Cambridge University Press, August 1995.
