Web Document Ranking
DEI, Faculdade de Engenharia
Universidade do Porto
SSIIM, MIEIC, 2010/11
Overview of concepts and techniques
for ranking web documents
The World Wide Web
The World Wide Web is a distributed information system
unprecedented in many ways — in size, in lack of central
coordination, and in the diversity of users’ backgrounds.
The ﬁrst published vision of a large-scale distributed
hypertext system can be traced back to Vannevar Bush’s
seminal article “As We May Think” (1945).
Web pages >> web hosts.
AltaVista reported an index of 30 million web pages in 1995.
At least 11.5 billion indexable web pages in 2005 [Gulli et al.].
How can we estimate the size of the web?
Several factors have led to the mass adoption of the web as a
publishing medium — by everyone from anonymous individuals to
large organizations.
The lack of a central authority or coordination, the simplicity
of the underlying technology, and the easy access to free web
publishing tools mean that anybody can publish anything.
How can we assess the reliability of content found on the web?
Which pages can we trust?
A web directory is a hierarchical structure, organized by
topics, containing selected web sites — e.g. dmoz.org.
In the early days of the web, these directories were very
popular — human editors selected the highest quality pages
for each category.
This approach quickly became unfeasible at web-scale.
Additionally, these approaches implied a strong semantic
agreement between the directory’s editors and the users.
First generation search engines were based on classic keyword
matching techniques developed for text search. The main
challenge was dealing with the size of the web.
While classic text search techniques provided sufficient results,
the overall quality was questionable due to the nature of web
content.
Most notably, the web has no central editorial control, there is
a complete lack of publishing standards, there is a high degree
of content duplication and some content is published with
malicious intent (i.e. spam).
Estimating the size of the web is not a trivial problem — e.g.
the number of dynamic web pages is technically inﬁnite.
The deep web is estimated to be several orders of magnitude
bigger than the surface web.
The size of the surface web was considered to be 170 TB in
2003. The deep web was several orders of magnitude bigger,
with approximately 90,000 TB.
“How Much Information? 2003”
On the web, spam is an issue of major importance.
At its root, spam exists due to commercial motivations — e.g.
achieving better rankings in search engines. There is a wide
range of web spam techniques, from simple to highly sophisticated:
Keyword stuffing — repetition of high-value keywords in the content.
Cloaking (masking) — showing different content to search engines than to users.
Link spam — artificial links created using hidden links, link farms, etc.
Web search engines operate in an adversarial information
retrieval environment (research topic).
1. Scrape content from real web documents: blogs,
Wikipedia, news sites, etc.
2. Mix and generate synthetic content to avoid duplicate
detection.
3. Insert key words and phrases.
4. Replace or insert links to sites being “promoted”.
5. Publish content on the web using free publishing
platforms (e.g. wordpress, blogspot, comments, etc).
The Web Graph
The web is usually modeled as a directed graph, where each
web page is a node and each link is a directed edge.
The hyperlinks that point to a page are called in-links and
those originating in the page are called out-links. The number
of in-links to a page is called in-degree.
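This graph model can be sketched directly in Python — an adjacency map from each page to the pages it links to, with in-degree computed by counting incoming edges (the page names below are invented for illustration):

```python
# A toy web graph as an adjacency map: page -> set of pages it links to.
out_links = {
    "A": {"B", "C"},
    "B": {"C"},
    "C": {"A"},
    "D": {"C"},
}

def in_degree(graph):
    """Count the in-links of every page (pages with none get 0)."""
    counts = {page: 0 for page in graph}
    for targets in graph.values():
        for t in targets:
            counts[t] = counts.get(t, 0) + 1
    return counts

print(in_degree(out_links))  # C has 3 in-links, A and B have 1, D has 0
```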
The Bowtie Model
[Figure: the bowtie model of the web graph — IN → SCC → OUT]
A web surfer can pass from any page in IN to any page in SCC by
following hyperlinks. Likewise, from any page in SCC to any page
in OUT. SCC is a strongly connected core.
“Graph structure in the Web” (2000) http://dx.doi.org/10.1016/S1389-1286(00)00083-9
Web Document Ranking
Web documents can be ranked in a static, absolute way or
ranked in a given context.
The static ranking of documents is typically called
query-independent — i.e. documents have a weight regardless
of any query or context. E.g.: the best document in the world.
In query-dependent ranking, each document has a different
weight depending on the query or context being analyzed.
E.g.: best document for learning how to cook.
Documents are scored (i.e. ranked) using various sources of
information, usually called features or, more generically,
signals. A multitude of signals can be identiﬁed:
Query-independent signals:
▶ Length of document
▶ Age of document
▶ Number of incoming links
▶ Number of outgoing links
▶ Document’s host domain
▶ Document’s language
Query-dependent signals:
▶ Number of query terms
▶ Time of query
▶ Query terms in document
▶ Query terms in collection
▶ Query terms in document title
▶ Query’s language
Google reportedly uses more than 200 signals in their ranking.
Types of Signals
The signals available in a collection of web documents can be
divided in two groups depending on their origins.
The signals obtained directly from the document are named
document-based signals. E.g.: term frequency, doc length, etc.
Signals obtained from the Web are named web-based signals.
E.g.: number of citations, anchor text, etc.
Web search engines have access to other sources of signals:
click data, external collections, etc.
The number of occurrences of a term in a document is a
signal typically used in text retrieval. However, the web is an
adversarial information retrieval environment.
[Figure: three example documents with increasing degrees of keyword
stuffing — TF("flowers") = 3, TF("flowers") = 10, TF("flowers") = ∞]
Inverse Document Frequency
Terms that appear in fewer documents of a collection have
more discriminative power, and are thus given a higher weight.

IDF(term) = |documents in collection| / |documents containing term|
Measures the general importance of a term. Combined with
term frequency, results in the classic tf.idf measure.
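A minimal sketch of the ratio form of IDF above combined with term frequency (the toy documents are made up; real systems usually take a logarithm of the ratio):

```python
# Toy collection of three documents (content invented for illustration).
docs = [
    "flowers in the garden",
    "the garden gnome",
    "flowers flowers flowers",
]
tokenized = [d.split() for d in docs]

def idf(term):
    """Ratio form of IDF, as in the slide (real systems usually take a log)."""
    n_containing = sum(1 for doc in tokenized if term in doc)
    return len(tokenized) / n_containing

def tf_idf(term, doc_index):
    """Term frequency in one document, weighted by IDF."""
    tf = tokenized[doc_index].count(term)
    return tf * idf(term)

print(tf_idf("flowers", 2))  # 3 occurrences x (3/2) = 4.5
```

Note how "flowers" in the stuffed third document gets a high TF, but its IDF is dampened because it also appears elsewhere in the collection.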
The position of a term within an HTML file has an impact on its
meaning and importance. Terms within title or strong
tags are highlighted differently.
[Figure: two example documents — in one the term appears only in the
body, in the other it also appears in the title]
Regardless of the HTML structure, should terms in diﬀerent
positions have diﬀerent weights?
[Figure: two example documents — the term appears in the first
paragraph of one and in the last paragraph of the other]
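One possible answer is to weight occurrences by the field or position in which a term appears. A sketch with assumed boost values (the field names and boost factors below are hypothetical, not from the slides):

```python
# Hypothetical per-field boosts: a term in the title counts more than
# the same term in the body. These values are assumptions for illustration.
FIELD_BOOST = {"title": 3.0, "strong": 2.0, "body": 1.0}

def weighted_tf(term, fields):
    """fields maps a field name to its list of tokens."""
    return sum(
        FIELD_BOOST.get(name, 1.0) * tokens.count(term)
        for name, tokens in fields.items()
    )

doc = {
    "title": ["quasi", "flowers"],
    "body": ["flowers", "veritatis", "flowers"],
}
print(weighted_tf("flowers", doc))  # 3.0*1 (title) + 1.0*2 (body) = 5.0
```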
Web documents in the same host are related to each other.
A document in a high-value host like www.bbc.co.uk should
be valued higher than one in www.besttopnews.com.
The location of a document in a site’s structure is an important
signal. Documents closer to the root of a site are
typically more important.
A citation between web documents is defined by an HTML
anchor tag and its content. The text used in anchor
tags is one of the most valuable signals.
Link analysis has many aspects in common with the ﬁeld of
bibliometrics, more speciﬁcally citation analysis.
Central assumption → a link is an endorsement.
A hyperlink from page A to page B represents a vote in page
B from the creator of page A.
Simply using the in-degree of a page as a measure of its
importance would be easy to manipulate (e.g. link spam).
Originated at Stanford and used by Google.
The PageRank algorithm depends on the link structure of the
web graph and assigns a score between 0 and 1 to each page.
The PageRank weight is a query-independent score.
“The PageRank Citation Ranking: Bringing Order to the Web”
Larry Page, Sergey Brin, Rajeev Motwani and Terry Winograd (1998)
PageRank Random Surfer
1. Consider a random surfer visiting web pages and
following the out-links in a random fashion at each point.
2. Eventually, the nodes with a higher in-degree will be
visited more often.
3. The idea behind PageRank is that pages that have more
visits are more important.
PR(A) = (1 − d) + d × Σ_{p ∈ In(A)} PR(p) / OutDegree(p)

where In(A) is the set of pages linking to A, OutDegree(p) is the
number of out-links of p, and d is a damping factor.

Computation is performed iteratively until
a minimum threshold is achieved.

[Figure: example graph — from its link structure,
PR(A) = PR(B)/2 + PR(C)/1 + PR(E)/3, and similarly for the other pages]
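The iterative computation can be sketched as follows (d = 0.85 is a commonly used damping value; the tiny graph is invented, and dangling pages without out-links are ignored for simplicity):

```python
def pagerank(out_links, d=0.85, tol=1e-8, max_iter=100):
    """Iterate PR(A) = (1-d) + d * sum(PR(p)/OutDegree(p)) to a fixed point."""
    pages = list(out_links)
    pr = {p: 1.0 for p in pages}  # initial scores
    for _ in range(max_iter):
        new = {}
        for p in pages:
            # Sum contributions from every page q that links to p.
            incoming = sum(
                pr[q] / len(out_links[q]) for q in pages if p in out_links[q]
            )
            new[p] = (1 - d) + d * incoming
        converged = max(abs(new[p] - pr[p]) for p in pages) < tol
        pr = new
        if converged:
            break
    return pr

graph = {"A": {"B"}, "B": {"A", "C"}, "C": {"A"}}
scores = pagerank(graph)
print(max(scores, key=scores.get))  # "A" collects the most link weight
```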
The Hyperlinked Induced Topic Selection (HITS) was
proposed by Jon Kleinberg in 1999.
HITS is an algorithm that uses the link structure of the web
to produce two query-dependent scores — an authority score
and a hub score.
An authority is a page with many citations from hubs.
A hub is a page that cites a large number of authorities.
Three major diﬀerences from PageRank:
(1) it is computed at query time (!); (2) it produces two values
for each page; (3) it is applied to subsets of the web.
1. Select a collection of documents related to a query.
2. Iteratively calculate authority and hub values for each
document:

Authority(A) = Σ_{p → A} Hub(p)
Hub(A) = Σ_{A → p} Authority(p)
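The iteration can be sketched over a small invented subgraph; normalising after each step keeps the scores bounded (a graph with no links at all would need extra care, omitted here):

```python
def hits(out_links, iterations=50):
    """Iterate mutually-reinforcing authority and hub scores."""
    pages = list(out_links)
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of p: sum of hub scores of pages linking to p.
        auth = {p: sum(hub[q] for q in pages if p in out_links[q]) for p in pages}
        # Hub of p: sum of authority scores of pages p links to.
        hub = {p: sum(auth[q] for q in out_links[p]) for p in pages}
        # Normalise so the scores do not grow without bound.
        a_norm = sum(v * v for v in auth.values()) ** 0.5
        h_norm = sum(v * v for v in hub.values()) ** 0.5
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return auth, hub

# Two hubs pointing at two authorities; H1 cites both, so it is the better hub.
graph = {"H1": {"A1", "A2"}, "H2": {"A1"}, "A1": set(), "A2": set()}
auth, hub = hits(graph)
```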
With so many signals, how to obtain a single ranking score?
Score(P) = α × S1(P) + β × S2(P) + γ × S3(P) + …
1. Manual tuning by experts based on real data.
2. Use machine-learning methods to automatically build
ranking formulas: learning to rank / machine-learned ranking.
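A manually tuned linear combination can be sketched as below (the signal names and weight values are made up for illustration):

```python
# Hand-tuned weights for each signal (values are assumptions, not real ones).
WEIGHTS = {"tf_idf": 0.5, "pagerank": 0.3, "title_match": 0.2}

def score(signals):
    """signals maps a signal name to its value for one document."""
    return sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)

doc_signals = {"tf_idf": 0.8, "pagerank": 0.4, "title_match": 1.0}
print(score(doc_signals))  # 0.5*0.8 + 0.3*0.4 + 0.2*1.0 = 0.72
```

Learning-to-rank methods replace the hand-picked weights with parameters fitted from training data.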
There are two broad categories of services for facilitating the
discovery of information on the web.
Full-Text Search Engines
Generically known as web search engines, these services crawl
the web, index their contents and rank the documents.
Web Directories
Topic-oriented collections, maintained by human editors.
Search Engine Architecture
[Figure: search engine architecture — distributed components over multiple disks]
Includes the software that ﬁnds and fetches web pages.
Multiple and distributed crawlers operate simultaneously.
First generation search engines had a scheduled periodic crawl
of the web. In current search engines, crawlers operate
continuously — e.g. very popular and dynamic documents are
crawled multiple times a day.
There is an inﬁnite number of pages on the Web, thus the
crawler must decide which will be crawled and which won’t.
A crawler must be robust and polite.
A crawler should be distributed, scalable, eﬃcient, fresh,
quality-targeted and extensible.
Two example robots.txt files:

User-agent: *
Disallow: /ADS/
Disallow: /banners/
Disallow: /bartoon/
Disallow: /bdt/
Disallow: /bin/
Disallow: /calvin_and_hobbes/
Disallow: /cinecartaz/
Disallow: /desportohtml/
Disallow: /emprego/
Disallow: /especial/
Disallow: /img/
Disallow: /includeKimus/
Disallow: /lazer/
Disallow: /mail/
Disallow: /static/
Disallow: /xsl/

User-agent: *
Disallow: /search
Disallow: /groups
Disallow: /images
Disallow: /catalogs
Disallow: /catalogues
Disallow: /news
Allow: /news/directory
Disallow: /nwshp
Disallow: /setnewsprefs?
Disallow: /index.html?
Disallow: /?
Disallow: /addurl/image?
Disallow: /pagead/
Disallow: /relpage/
Disallow: /relcontent
Disallow: /imgres
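Rules like these can be checked with Python's standard-library parser — a small sketch using paths from the examples (the crawler name is invented):

```python
from urllib.robotparser import RobotFileParser

# Parse a small rule set and test candidate URLs against it.
rp = RobotFileParser()
rp.parse("""User-agent: *
Disallow: /search
Disallow: /imgres
""".splitlines())

print(rp.can_fetch("MyCrawler", "/search"))  # False: disallowed for all agents
print(rp.can_fetch("MyCrawler", "/about"))   # True: no rule matches
```

A polite crawler consults robots.txt before fetching any page on a host.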
Indices are data structures designed for fast reading.
The index is the biggest component of a search engine.
Web documents are parsed and separated into tokens. This is
a very challenging task due to the diversity of the web: ﬁle
formats, language ambiguity, word boundaries, etc.
a —› d1...
domingo —› d1,d17,d30
estranho —› d2
flores —› d1,d3,d5
porto —› d4,d18
Research challenges in: size optimization, parallelism,
maintenance, lookup speed, etc.
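Building the posting lists shown above can be sketched in a few lines (the documents are invented, reusing the example terms):

```python
# Toy documents (contents invented to match the example terms).
docs = {
    "d1": "domingo a flores",
    "d2": "estranho",
    "d3": "flores no porto",
}

# Inverted index: term -> sorted list of document ids containing it.
index = {}
for doc_id, text in docs.items():
    for token in set(text.lower().split()):  # set() avoids duplicate postings
        index.setdefault(token, []).append(doc_id)

for term in index:
    index[term].sort()  # sorted postings allow fast list intersection

print(index["flores"])  # ['d1', 'd3']
```

Keeping each posting list sorted is what makes multi-term queries fast: the lists can be intersected in a single linear merge.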
Ranking and Presentation
[Figure: query → ranking “magic” → top 10 docs in x milliseconds]
For a given query, documents are ordered combining hundreds
of signals. Additionally, ads are selected ($) and snippets are
produced for each document. All in a few milliseconds.
“1% of the web search market is worth over $1 billion”
Search engine’s business model is based on advertisement.
First business models were based on small per-view charges.
Ads were indiscriminately published, resulting in low
conversion rates.
The use of targeted advertising (ads are related to searches)
resulted in much higher conversion rates. Advertisers bid on
query terms and pay-per-click.
Search engines operate complex systems that try to maximize
revenue by selecting which ads to display.
The World Wide Web didn’t exist 20 years ago.
The Web is scientiﬁcally young and combines research from
many diﬀerent ﬁelds, not just technology.
There are many open problems and much more to be opened.
Some currently hot topics: learning to rank, wisdom of the
crowds, social media, real-time search, contextual search, HCIR.
▶ An Introduction to Information Retrieval (2009)
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze
▶ Web Information Retrieval (2009)
Nick Craswell and David Hawking