Correlation between Experts and Search Engines Rankings
Project Report Submitted to the Faculty of the Department of Computer Science, Old Dominion University, in Partial Fulfillment of the requirements for the Degree of
Master of Science
Project Advisor:
Dr. Michael L. Nelson
Submitted by:
Olena K. Hunsicker
Olena Hunsicker, MS Degree Project November 2008
Contents
Abstract
1. Introduction
2. Background Information
2.1 Search Engine APIs
2.2 Measuring the Correlation
3. Experiment Setup
3.1 Choosing Experts’ Lists
3.2 Mapping Real-World Terms into Representative URLs
3.3 Ranking of URLs and Assigning Weights
4. Results
4.1 Correlations
4.2 College Football Rankings
5. Related Work
6. Conclusions
Acknowledgements
References
Abstract
Previous studies have investigated the quality of web documents using computable metrics such as in-degree, Kleinberg’s authority score and PageRank. These studies have shown that link-based web metrics correlate highly with topic experts’ judgments of the quality of web pages. This project investigates how closely experts’ rankings of real-world objects correlate with the major search engines’ rankings of the URLs associated with those objects. A real-world object from an experts’ list can have several equally good representative URLs, so we have used a methodology that maps each real-world entity to a set of associated URLs. We have conducted 9 experiments with 9 experts’ lists in 3 categories: sports, popular culture and academic institutions. We have used experts’ lists of length 10, 25 and 50. We have assigned from 1 to 8 representative URLs to each term in an experts’ list and have performed 590 search engine vs. experts’ list comparisons. We have found 87 weak, 35 moderate and 11 strong correlations that were statistically significant. In addition, for a period of 11 weeks we have observed the changes in correlation between 4 college football polls and the rankings of three major search engines.
1. Introduction
There are a few reasons why people are drawn to lists of the best places to retire, the most powerful people in the world, or the best companies to work for. First, the lists are simple and easy to understand, and they gather a lot of information in one source. They also make our choices easier: for example, when you are looking for a movie to rent, you will most likely choose one that was in a list such as “The Top 100 Best Movies of All Times”. Theoretically, the “top” lists should be made by people who have a thorough understanding of the subject, but this is not always the case, especially on the Internet. We introduce the notion of “experts’ lists”: lists that were created by people with a high degree of skill in or knowledge of a certain subject. Good examples of experts’ lists are the rankings created by Forbes.com, US News and World Report and Billboard.com.
On the other hand, our choices are often influenced by the information provided by the search engines. For any product, song or university it is very important to have a high PageRank and to be at the top of the results returned by the major search engines, such as Live Search, Google and Yahoo, in order to be “visible” on the Internet. Both experts’ lists and search engines are powerful tools that shape the public’s preferences, and in this project we determine whether there is any correlation between them.
Every real-world object, such as a movie, celebrity or university, has a set of associated web resources or URLs. For instance, a tennis player can have his own website as well as a page on Wikipedia or MySpace.com. But which web page is the better representation of the term? There is no simple answer, because several web resources could be equally good. In our experiment we have assigned up to 8 representative URLs to each term in the experts’ list. Then we have ranked the URLs by using a variation of the strand sort algorithm. Finally, we have assigned weights to all ranked URLs and have calculated how the terms are ranked by the search engines. In order to quantify the correlation between experts’ and search engine rankings, we have used the Kendall tau distance measure.
The main purpose of this study was to investigate the correlation between experts’ rankings and the rankings of three major search engines. Another goal of the experiment was to examine
the change in correlation between experts’ lists and search engine rankings over time.
2. Background Information
2.1 Search Engine APIs
Nowadays, all major search engines provide public web service APIs covering facilities such as Web Search, Maps, Mail and Shopping. Yahoo offers “REST-style” web services that use HTTP GET and POST requests with the parameters embedded in the URL. The Yahoo Search web service returns the response body in XML format. Yahoo limits the number of queries to 5000 per IP address per day per API.
In the past 3 years Google has made a transition from a SOAP-based API to an AJAX API. In April 2008 Google added new “RESTful” functionality to its AJAX Search API and allowed access from non-JavaScript environments. The supported method is HTTP GET, and the HTTP response body is a JSON-encoded result set with embedded status codes. The Google SOAP API allowed only 1000 queries per day, but the new AJAX API doesn’t specify a limit on the number of queries per day. The main requirement is to provide a correct HTTP referrer header in the requests. In addition, the use of a Google API key is recommended, but not required.
MSN offers a SOAP API, and its Search web service allows 10000 queries per day per IP address. Table 1 compares the major search engines. The Search web services provided by the search engines have allowed easy data collection for this project.
Table 1. Major Search Engines Comparison
2.2 Measuring the Correlation
In order to measure the degree of correspondence between experts’ and search engine rankings, we have used the Kendall tau correlation coefficient τ. The Kendall tau measure ranges from -1 to +1. A positive correlation indicates that the ranks of the variables in both lists increase together, while a negative correlation indicates that as the rank of one variable increases, the rank of the other decreases. We analyzed only statistically significant results (two-sided p-value < 0.05) and classified the correlation between two lists as strong (0.6 < τ < 1.0), moderate (0.4 < τ < 0.6) or weak (τ < 0.4).
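As an illustration of the measure described above, the following is a minimal Python sketch and not the PHP program used in the project. It assumes rankings without ties (tau-a) and approximates the two-sided p-value with the normal distribution. The function names `kendall_tau` and `classify` are hypothetical, and the treatment of the boundary values 0.4 and 0.6 is our assumption, since the thresholds above leave them open.

```python
import math
from itertools import combinations


def kendall_tau(expert_rank, engine_rank):
    """Return (tau, two-sided p-value) for two rankings of the same items.

    expert_rank[i] and engine_rank[i] are the ranks of item i in each list.
    Assumes no ties; the p-value uses the normal approximation.
    """
    n = len(expert_rank)
    if n != len(engine_rank) or n < 2:
        raise ValueError("rankings must have equal length of at least 2")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (expert_rank[i] - expert_rank[j]) * (engine_rank[i] - engine_rank[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    pairs = n * (n - 1) / 2
    tau = (concordant - discordant) / pairs
    # Under the null hypothesis, (concordant - discordant) is approximately
    # normal with mean 0 and variance n(n-1)(2n+5)/18.
    sigma = math.sqrt(n * (n - 1) * (2 * n + 5) / 18)
    z = (concordant - discordant) / sigma
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return tau, p_value


def classify(tau, p_value, alpha=0.05):
    """Label a correlation using the thresholds from the text.

    Returns None when the result is not statistically significant.
    Boundary handling (>=) is an assumption of this sketch.
    """
    if p_value >= alpha:
        return None
    if tau >= 0.6:
        return "strong"
    if tau >= 0.4:
        return "moderate"
    return "weak"
```

For two identical rankings of 10 items the sketch yields τ = 1 with a p-value well below 0.05, which `classify` labels as strong.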
3. Experiment Setup
We have used 9 experts’ lists in the following 3 categories: sports, popular culture and academic institutions. We have analyzed lists of size 10, 25 and 50. As a part of the experiment, we have developed a PHP program that takes data from an experts’ list and queries three major search engines (Yahoo, Google and MSN) in order to assign up to 8 representative URLs to each real-world term in the list. Then we have determined how the search engines indexed those URLs. In order to rank the URLs, we have used a variation of strand sort, which merges sorted sub-lists into a final sorted array. Afterward, we have assigned weights to all URLs in the ranked list in such a way that neighboring URLs are equally spaced:
distance(URL1, URL2) = distance(URL3, URL4)
Next, we have calculated the cumulative weight of all URLs associated with each entity in the experts’ list and created the list of the terms as ranked by the search engines. Finally, the program calculates the Kendall tau measure and the two-sided p-value. In the following sections we explain how the experts’ lists have been chosen and discuss the program in more detail.
3.1 Choosing Experts’ Lists
Experts’ lists are widely recognized as a reliable source of information. We have included the following experts’ lists in the Sports category:
1. ATP – The top 50 male tennis players as of September 29, 2008. ATP is the official international tennis circuit of men’s professional tennis tournaments, and its ranking reflects the performances of the world’s best players in the current year.
2. The Sony Ericsson WTA Tour – The top 50 female tennis players as of September 29, 2008. The Sony Ericsson WTA Tour rankings are a worldwide computer ranking system for women’s professional tennis players. The ranking system reflects the players’ performance over a 52-week period, and the results are computed and published weekly.
3. Official World Golf Ranking – The top 50 golf players as of October 05, 2008. Each player is ranked according to his average points per tournament.
The Popular Culture category includes the following experts’ lists:
4. Forbes.com – The top 100 Web Celebrities special report as of 06/11/2008. The ranking system reflects the earnings and the popularity of the celebrity on the web, TV and radio.
5. Billboard.com – The Hot 100 Airplay as of 11/01/2008. The Hot 100 Airplay is a chart released weekly by Billboard in the USA. Chart rankings are based on data collected from more than 1000 radio stations that are electronically monitored 24/7.
6. People.com – Top 25 Celebrity Hot List. The Hot List shows the most discussed celebrities as measured by readers’ activity on People.com.
The Academic Institutions category includes the following experts’ lists:
7. Academic Ranking of World Universities – The top 100 Latin and North American Universities of 2007. The organization ranks universities by academic and research performance, using indicators such as alumni and staff winning Nobel Prizes and Fields Medals, highly cited researchers, articles published in Nature and Science, articles indexed in major citation indexes, etc.
8. US News and World Report – The national universities rankings of the best colleges in 2009.
9. Forbes.com – The top 50 Best Colleges 2008 as of 08/13/2008, which was created as an alternative to the US News & World Report rankings.
3.2 Mapping Real-World Terms into Representative URLs
More important terms on the web have a greater number of associated web pages, and it is difficult to find a single canonical URL for a real-world object because several URLs could be equally good. In the experiment we have used general search term queries in order to map a term to a set of representative URLs. For instance, in order to find 5 URLs associated with the golf player Tiger Woods, we supply the search term query to each search engine API. The Yahoo REST-style query
http://search.yahooapis.com/WebSearchService/V1/webSearch?appid=gNW9.PDV34HNkaCxQ_AZbLi_1U1ctUilbrRCVXYd6tX6V8d_b31ezAp7fkox7MT7F6o-&query=Tiger+Woods&results=30
returns 30 results, from which we choose the top 5 and assign them to the term “Tiger Woods”:
1. http://www.tigerwoods.com/
2. http://en.wikipedia.org/wiki/Tiger_Woods
3. http://www.pgatour.com/players/00/87/93/index.html
4. http://sports.espn.go.com/golf/players/profile?playerId=462
5. http://www.tigerwoodsfoundation.org/
Similarly, we pick the top 5 results from the Google AJAX API reply:
1. http://www.tigerwoods.com/
2. http://www.tigerwoods.com/splash/splash.html
3. http://en.wikipedia.org/wiki/Tiger_Woods
4. http://www.pgatour.com/players/00/87/93/
5. http://www.tigerwoodsfoundation.org/
and MSN SOAP API:
1. http://www.tigerwoods.com/
2. http://en.wikipedia.org/wiki/Tiger_Woods
3. http://wap.tigerwoods.com/
4. http://www.pgatour.com/players/00/87/93/
5. http://www.tigerwoodsfoundation.org/
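The mapping step can be sketched in Python as follows. This is only an illustration under stated assumptions: `yahoo_query_url` rebuilds the query format shown above with a placeholder application ID, and `representative_urls` assumes that the set of representative URLs is the union of each engine’s top results in first-seen order. The text does not state how the per-engine results were combined, so that rule and both function names are hypothetical. The Tiger Woods results listed above are used as sample data.

```python
from urllib.parse import urlencode

YAHOO_ENDPOINT = "http://search.yahooapis.com/WebSearchService/V1/webSearch"


def yahoo_query_url(term, app_id, results=30):
    """Build a Yahoo REST-style query URL in the format quoted in the text."""
    params = {"appid": app_id, "query": term, "results": results}
    return YAHOO_ENDPOINT + "?" + urlencode(params)


def representative_urls(results_by_engine, k=5):
    """Union of the top-k URLs of every engine, in first-seen order (assumed rule)."""
    merged = []
    for urls in results_by_engine.values():
        for url in urls[:k]:
            if url not in merged:
                merged.append(url)
    return merged


# Top 5 results per engine for "Tiger Woods", copied from the lists above.
TIGER_WOODS = {
    "yahoo": [
        "http://www.tigerwoods.com/",
        "http://en.wikipedia.org/wiki/Tiger_Woods",
        "http://www.pgatour.com/players/00/87/93/index.html",
        "http://sports.espn.go.com/golf/players/profile?playerId=462",
        "http://www.tigerwoodsfoundation.org/",
    ],
    "google": [
        "http://www.tigerwoods.com/",
        "http://www.tigerwoods.com/splash/splash.html",
        "http://en.wikipedia.org/wiki/Tiger_Woods",
        "http://www.pgatour.com/players/00/87/93/",
        "http://www.tigerwoodsfoundation.org/",
    ],
    "msn": [
        "http://www.tigerwoods.com/",
        "http://en.wikipedia.org/wiki/Tiger_Woods",
        "http://wap.tigerwoods.com/",
        "http://www.pgatour.com/players/00/87/93/",
        "http://www.tigerwoodsfoundation.org/",
    ],
}

tiger_woods_urls = representative_urls(TIGER_WOODS)
```

With the sample data, the assumed union rule yields 8 distinct URLs for the term “Tiger Woods”.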
It is not surprising that different search engines rank the same URL differently. Each major search engine uses its own ranking algorithm: for example, Google uses PageRank on a scale of 1-10, Yahoo uses WebRank with a ranking of 1-10, and MSN uses DomainRank. In the process of mapping URLs to the search term, we have ignored URLs that included unescaped white space or unescaped unsafe characters, as well as URLs with more than one parameter in the URL query, such as http://www.example.com?parameter1=a&parameter2=b, because such URLs are simply ignored by all search engines when queries to retrieve indexed or cached URLs are supplied to their APIs.
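The filtering rule can be illustrated with a short Python sketch. The exact set of characters treated as unsafe is not given in the text, so the `UNSAFE_CHARS` set below is an assumption, and the function name `is_usable` is hypothetical.

```python
from urllib.parse import urlsplit, parse_qsl

# Assumed subset of "unsafe" URL characters; the text does not list them.
UNSAFE_CHARS = set('"<>\\^`{|}')


def is_usable(url):
    """Return True if the URL passes the filter described in the text."""
    # Reject unescaped white space and unescaped unsafe characters.
    if any(ch.isspace() or ch in UNSAFE_CHARS for ch in url):
        return False
    # Reject URLs whose query string carries more than one parameter.
    query = urlsplit(url).query
    return len(parse_qsl(query, keep_blank_values=True)) <= 1
```

Under these assumptions, the example URL with two parameters is rejected, while the single-parameter ESPN profile URL from the Yahoo results is kept.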
3.3 Ranking of URLs and Assigning Weights
All major search engines set limits not only on the number of queries per day, but also on the query size. For instance, a Yahoo REST query must be smaller than 1 kB. Considering these limitations, we have implemented a variant of the strand sort algorithm. The strand sort algorithm works by taking sub-lists of URLs from the list of all representative URLs, sorting them and merging them into the final sorted array. We have used the site: query modifier, which is supported by Google, and the url: query modifier for querying Yahoo and MSN in order to determine how the subsets of URLs are indexed by the search engines. After we have ranked the list of representative URLs, we assign weights to all URLs in the list. The formula for calculating the weight is:
Weight = 1 – P/T,
where
T – total number of URLs in the list
P – position of the URL in the list
The formula assigns weights in such a way that neighboring URLs in the ranked list are equally spaced: distance(URLa, URLb) = distance(URLb, URLc).
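The ranking and weighting steps can be sketched in Python as follows. This is a sketch of the textbook strand sort, not the project’s variant: the `key` function stands in for the order in which a search engine returns the URLs of a sub-list, which the project obtained through site: and url: queries. We also assume that positions P are counted from 1, so that the last URL receives weight 0. All function names are hypothetical.

```python
def merge(a, b, key):
    """Merge two lists that are already sorted by key."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if key(b[j]) < key(a[i]):
            out.append(b[j])
            j += 1
        else:
            out.append(a[i])
            i += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out


def strand_sort(items, key):
    """Repeatedly pull a sorted sub-list (strand) and merge it into the result."""
    remaining = list(items)
    result = []
    while remaining:
        strand, rest = [remaining[0]], []
        for item in remaining[1:]:
            if key(item) >= key(strand[-1]):
                strand.append(item)
            else:
                rest.append(item)
        result = merge(result, strand, key)
        remaining = rest
    return result


def assign_weights(ranked_urls):
    """Weight = 1 - P/T, with P counted from 1 (assumption) and T = list length."""
    total = len(ranked_urls)
    return {url: 1 - pos / total for pos, url in enumerate(ranked_urls, start=1)}


def term_scores(weights, urls_by_term):
    """Cumulative weight of all representative URLs of each term."""
    return {term: sum(weights[u] for u in urls)
            for term, urls in urls_by_term.items()}
```

For a ranked list of four URLs the weights are 0.75, 0.5, 0.25 and 0, so neighboring URLs differ by the same amount, 1/T. Sorting the terms by their cumulative weight in descending order then gives the search engine ranking of the terms.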
4. Results
4.1 Correlations
We have completed 590 experts vs. search engine comparisons. Correlations are more frequent in the sports and popular culture categories than in the academic institutions category. The results are grouped by category and presented in Tables 2, 3 and 4.
Table 2. Statistics for Sports category
Table 3. Statistics for Pop Culture category
Table 4. Statistics for Academic Institutions category
4.2 College Football Rankings
As a part of the experiment, we have used the opportunity to investigate the change in correlation between experts’ and search engine rankings during the college football season. We have used 4 experts’ lists (USA Today, Associated Press, Harris Poll and Massey Rankings) and have analyzed the top 10 and top 25 results. Every Sunday during the college football season all of the above experts update their rankings, and every week we have performed on average 190 experts vs. search engine comparisons. Overall, we have completed more than 2000 comparisons during 12 weeks. The 2007 rankings were chosen as the baseline. Fig. 1 through Fig. 8 depict the change in correlation for the top 10 and top 25 results.
Fig. 1 Top 10 Associated Press vs SE rankings
Fig. 2 Top 10 USA Today vs SE rankings
Fig. 3 Top 10 Harris Poll vs SE rankings
Fig. 4 Top 10 Massey vs SE rankings
Fig. 5 Top 25 Associated Press vs SE rankings
Fig. 6 Top 25 USA Today vs SE rankings
Fig. 7 Top 25 Harris Poll vs SE rankings
Fig. 8 Top 25 Massey vs SE rankings
Unexpectedly, we found that during the college football season the correlation between experts’ and search engine rankings actually decreased over time. In week 8 we can also observe a sharp drop in correlation for the top 10 results. In addition, there are very few negative correlations for the top 25 results.
5. Related Work
There have been a few previous studies of the quality of search engine results. Brian Amento, Loren Terveen and Will Hill investigated the quality of web documents by using computable metrics such as in-degree, Kleinberg’s authority score and PageRank. They have shown that topic experts’ judgments of the quality of web pages correlate highly with authority algorithms. They also claim that a simple count of the pages on a website is almost as good as the sophisticated algorithms in determining the quality of web documents. Manoranjan Magudamudi studied the correlation between search engine rankings and experts’ rankings in his Master’s Project in 2007. He found highly statistically significant correlations between experts’ and search engine rankings. The reason is that he used a methodology that maps each real-world object to a single canonical representative URL.
6. Conclusions
In the experiment we have performed 590 search engine vs. experts’ list comparisons and have found 87 weak, 35 moderate and 11 strong correlations that were statistically significant with p-value < 0.05. In almost all cases of correlation, we had more than one representative URL assigned to a real-world object. Also, Yahoo rankings showed the most frequent correlation with experts’ lists.
Acknowledgements:
I would like to express my deepest gratitude to Dr. Nelson for the idea for this project, for his guidance, inspiration and for his willingness to help with all kinds of issues.
References:
1. Michael L. Nelson, M. Klein, M. Magudamudi, http://arxiv.org/PS_cache/arxiv/pdf/0809/0809.2851v2.pdf, Oct. 21, 2008
2. http://code.google.com/apis/ajaxsearch/documentation/#fonje
3. http://blogs.msdn.com/livesearch/archive/2005/09/15/467830.aspx
4. http://www.atptennis.com/3/en/rankings/entrysystem/default.asp
5. http://www.sonyericssonwtatour.com/2/rankings/singles_numeric.asp
6. http://www.officialworldgolfranking.com/rankings/default.sps?region=world
7. http://www.forbes.com/lists/2008/53/celebrities08_The-Celebrity-100_Rank.html
8. http://www.billboard.com/bbcom/charts/chart_display.jsp?g=Singles&f=The+Billboard+Hot+100
9. http://www.people.com/people/static/h/package/top25celebrityhotlist/all.html
10. http://www.arwu.org/rank/2007/ARWU2007_TopAmer.htm
11. http://colleges.usnews.rankingsandreviews.com/college/national-search
12. http://www.forbes.com/lists/2008/94/opinions_college08_Americas-BestColleges_Rank.htm
13. http://developer.yahoo.com/search/web/V1/webSearch.html