PageRank
The Anatomy of a Large-Scale
Hypertextual Web Search Engine
Sergey Brin, Lawrence Page
Presented By: Paolo Lim
April 10, 2007
CS 331 - Data Mining 1
AKA: The Original Google Paper
Larry Page and Sergey Brin
CS 331 - Data Mining 2
Presentation Outline
Design goals of Google search engine
Link Analysis and other features
System architecture and major structures
Crawling, indexing, and searching the web
Performance and results
Conclusions
Final exam questions
CS 331 - Data Mining 3
Linear Algebra Background
PageRank involves knowledge of:
Matrix addition/multiplication
Eigenvectors and Eigenvalues
Power iteration
Dot product
Not discussed in detail in presentation
For reference:
http://cs.wellesley.edu/~cs249B/math/Linear%20Algebra/CS298LinAlgpart1.pdf
http://www.cse.buffalo.edu/~hungngo/classes/2005/Expanders/notes/LA-intro.pdf
CS 331 - Data Mining 4
Google Design Goals
Scaling with the web’s growth
Improved search quality
Number of documents increasing rapidly, but user’s
ability to look at documents lags
Lots of “junk” results, little relevance
Academic search engine research
Development and understanding in academic realm
System that reasonable number of people can actually
use
Support novel research activities of large-scale web
data by other researchers and students
CS 331 - Data Mining 5
Link Analysis Basics
PageRank Algorithm
A Top 10 IEEE ICDM data mining algorithm
Large basis for ranking system (discussed later)
Tries to incorporate ideas from academic
community (publishing and citations)
Anchor Text Analysis
<a href=http://www.com> ANCHOR TEXT </a>
CS 331 - Data Mining 6
Intuition: Why Links, Anyway?
Links represent citations
Quantity of links to a website makes the
website more popular
Quality of links to a website also helps in
computing rank
Link structure largely unused before Larry
Page proposed it to thesis advisor
CS 331 - Data Mining 7
Naïve PageRank
Each link’s vote is proportional to the
importance of its source page
If page P with importance I has N outlinks,
then each link gets I / N votes
Simple recursive formulation:
PR(A) = PR(p1)/C(p1) + … + PR(pn)/C(pn)
PR(X) PageRank of page X
C(X) number of links going out of page X
CS 331 - Data Mining 8
Naïve PageRank Model
(from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)
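The web in 1839 (figure: link graph over Yahoo, Amazon, and M’soft)
Yahoo links to itself and to Amazon (y/2 flows along each link)
Amazon links to Yahoo and to M’soft (a/2 flows along each link)
M’soft links to Amazon (m flows along its single link)
Flow equations:
$y = y/2 + a/2$
$a = y/2 + m$
$m = a/2$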
CS 331 - Data Mining 9
Solving the flow equations
3 equations, 3 unknowns, no constants
No unique solution
All solutions equivalent modulo scale factor
Additional constraint forces uniqueness
y+a+m = 1
y = 2/5, a = 2/5, m = 1/5
Gaussian elimination method works for
small examples, but we need a better
method for large graphs
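For a graph this small, the flow equations can be solved exactly. The sketch below is one way to do it, assuming the three-page Yahoo/Amazon/M’soft example and replacing the redundant third flow equation with the constraint y + a + m = 1. The helper name `solve` is illustrative, not from the slides.

```python
from fractions import Fraction as F

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination using exact fractions."""
    n = len(A)
    # Build the augmented matrix [A | b].
    rows = [[F(v) for v in A[i]] + [F(b[i])] for i in range(n)]
    for col in range(n):
        # Find a row with a non-zero pivot in this column and swap it up.
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        # Scale the pivot row so the pivot becomes 1.
        p = rows[col][col]
        rows[col] = [v / p for v in rows[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [v - factor * w for v, w in zip(rows[r], rows[col])]
    return [rows[i][n] for i in range(n)]

# Unknowns ordered (y, a, m).
# y = y/2 + a/2   ->   y/2 - a/2     = 0
# a = y/2 + m     ->  -y/2 + a - m   = 0
# y + a + m = 1   (replaces the redundant equation m = a/2)
A = [[F(1, 2), F(-1, 2), 0],
     [F(-1, 2), 1, -1],
     [1, 1, 1]]
b = [0, 0, 1]

y, a, m = solve(A, b)
print(y, a, m)  # 2/5 2/5 1/5
```

Elimination costs on the order of N^3 operations for N pages, which is why it is only practical for small examples.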
CS 331 - Data Mining 10
Matrix formulation
Matrix M has one row and one column for each web
page
Suppose page j has n outlinks
If $j \to i$ (page j links to page i), then $M_{ij} = 1/n$
Else $M_{ij} = 0$
M is a column stochastic matrix
Columns sum to 1
Suppose r is a vector with one entry per web page
$r_i$ is the importance score of page i
Call it the rank vector
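A small sketch of how M could be built from an outlink list, assuming each page's outlinks are distinct and every page has at least one outlink. The function name `link_matrix` and the dictionary layout are illustrative choices.

```python
from fractions import Fraction

def link_matrix(links, pages):
    """Return M as a list of rows, with M[i][j] = 1/n when page j links to page i."""
    index = {page: k for k, page in enumerate(pages)}
    N = len(pages)
    M = [[Fraction(0)] * N for _ in range(N)]
    for src, targets in links.items():
        n = len(targets)  # number of outlinks of the source page
        for dst in targets:
            M[index[dst]][index[src]] = Fraction(1, n)
    return M

# The three-page example: Yahoo (y), Amazon (a), M'soft (m).
pages = ["y", "a", "m"]
links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}

M = link_matrix(links, pages)
# Column-stochastic check: every column should sum to 1.
column_sums = [sum(M[i][j] for i in range(len(pages))) for j in range(len(pages))]
print(column_sums)
```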
CS 331 - Data Mining 11
Example
(from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)
Suppose page j links to 3 pages, including i
Then $M_{ij} = 1/3$, so in the product $r = Mr$ page j contributes $r_j/3$ to $r_i$
CS 331 - Data Mining 12
Eigenvector formulation
The flow equations can be written
r = Mr
So the rank vector is an eigenvector of the
stochastic web matrix
In fact, its first or principal eigenvector, with
corresponding eigenvalue 1
CS 331 - Data Mining 13
Example
(from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)
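Graph: Yahoo (y), Amazon (a), M’soft (m)
$$M = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix} \quad \text{(rows and columns ordered } y, a, m\text{)}$$
r = Mr:
$$\begin{pmatrix} y \\ a \\ m \end{pmatrix} = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix} \begin{pmatrix} y \\ a \\ m \end{pmatrix}$$
Flow equations:
$y = y/2 + a/2$
$a = y/2 + m$
$m = a/2$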
CS 331 - Data Mining 14
Power Iteration
Simple iterative scheme (aka relaxation)
Suppose there are N web pages
Initialize: $r_0 = [1, \ldots, 1]^T$
Iterate: $r_{k+1} = M r_k$
Stop when $|r_{k+1} - r_k|_1 < \epsilon$
$|x|_1 = \sum_{1 \le i \le N} |x_i|$ is the L1 norm
Can use any other vector norm e.g., Euclidean
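The scheme above can be sketched in a few lines of plain Python. The tolerance `eps`, the iteration cap `max_iter`, and the function name are assumptions made for this illustration; the matrix is the three-page Yahoo/Amazon/M’soft example.

```python
def power_iteration(M, eps=1e-9, max_iter=1000):
    """Iterate r <- M r from r0 = [1, ..., 1] until the L1 change drops below eps."""
    N = len(M)
    r = [1.0] * N
    for _ in range(max_iter):
        r_next = [sum(M[i][j] * r[j] for j in range(N)) for i in range(N)]
        # L1 norm of the difference between successive rank vectors.
        if sum(abs(x - y) for x, y in zip(r_next, r)) < eps:
            return r_next
        r = r_next
    return r

# Rows and columns ordered y (Yahoo), a (Amazon), m (M'soft).
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]

r = power_iteration(M)
print([round(x, 4) for x in r])  # approximately [1.2, 1.2, 0.6]
```

Because the entries of r0 sum to N rather than 1, the result is N times the normalized solution (2/5, 2/5, 1/5).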
CS 331 - Data Mining 15
Power Iteration Example
(from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)
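Graph: Yahoo (y), Amazon (a), M’soft (m)
$$M = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix} \quad \text{(rows and columns ordered } y, a, m\text{)}$$
Iterates $r_k = (y, a, m)^T$:
$$r_0 = (1, 1, 1)^T,\quad r_1 = (1, 3/2, 1/2)^T,\quad r_2 = (5/4, 1, 3/4)^T,\quad r_3 = (9/8, 11/8, 1/2)^T,\quad \ldots \to (6/5, 6/5, 3/5)^T$$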
CS 331 - Data Mining 16
Random Surfer
Imagine a random web surfer
At any time t, surfer is on some page P
At time t+1, the surfer follows an outlink from P
uniformly at random
Ends up on some page Q linked from P
Process repeats indefinitely
Let p(t) be a vector whose ith component is the
probability that the surfer is at page i at time t
p(t) is a probability distribution on pages
CS 331 - Data Mining 17
The stationary distribution
Where is the surfer at time t+1?
Follows a link uniformly at random
p(t+1) = Mp(t)
Suppose the random walk reaches a state such that
p(t+1) = Mp(t) = p(t)
Then p(t) is called a stationary distribution for the
random walk
Our rank vector r satisfies r = Mr
So it is a stationary distribution for the random
surfer
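One way to see the stationary distribution is to simulate the surfer and count visits. This is a rough Monte Carlo sketch on the three-page example, with a fixed seed and step count chosen arbitrarily; the function name `simulate_surfer` is hypothetical.

```python
import random

def simulate_surfer(links, start, steps, seed=0):
    """Follow outlinks uniformly at random and return the fraction of time spent on each page."""
    rng = random.Random(seed)
    visits = {page: 0 for page in links}
    page = start
    for _ in range(steps):
        page = rng.choice(links[page])  # pick an outlink uniformly at random
        visits[page] += 1
    return {p: count / steps for p, count in visits.items()}

# Yahoo (y), Amazon (a), M'soft (m), as in the earlier example.
links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}

freq = simulate_surfer(links, start="y", steps=200_000)
print(freq)  # should be close to y = 0.4, a = 0.4, m = 0.2
```

The visit frequencies approximate the rank vector (2/5, 2/5, 1/5) found from the flow equations; the match is statistical, not exact.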
CS 331 - Data Mining 18
Spider traps
A group of pages is a spider trap if there
are no links from within the group to
outside the group
Random surfer gets trapped
Spider traps violate the conditions needed
for the random walk theorem
CS 331 - Data Mining 19
Microsoft becomes a spider trap
(from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)
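Graph: Yahoo (y), Amazon (a), M’soft (m), where M’soft now links only to itself
$$M = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{pmatrix} \quad \text{(rows and columns ordered } y, a, m\text{)}$$
Iterates $r_k = (y, a, m)^T$:
$$r_0 = (1, 1, 1)^T,\quad r_1 = (1, 1/2, 3/2)^T,\quad r_2 = (3/4, 1/2, 7/4)^T,\quad r_3 = (5/8, 3/8, 2)^T,\quad \ldots \to (0, 0, 3)^T$$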
CS 331 - Data Mining 20
Random teleports
The Google solution for spider traps
At each time step, the random surfer has
two options:
With probability β, follow a link at random
With probability 1-β, jump to some page
uniformly at random
Common values for β are in the range 0.8 to 0.9
Surfer will teleport out of spider trap within
a few time steps
CS 331 - Data Mining 21
Matrix formulation
Suppose there are N pages
Consider a page j, with set of outlinks O(j)
We have $M_{ij} = 1/|O(j)|$ when $j \to i$ and $M_{ij} = 0$
otherwise
The random teleport is equivalent to
adding a teleport link from j to every other page with
probability $(1-\beta)/N$
reducing the probability of following each outlink from
$1/|O(j)|$ to $\beta/|O(j)|$
Equivalent: tax each page a fraction $(1-\beta)$ of its score
and redistribute evenly
CS 331 - Data Mining 22
Page Rank
Construct the NxN matrix A as follows
$A_{ij} = \beta M_{ij} + (1-\beta)/N$, with β the probability of following a link
Verify that A is a stochastic matrix
The page rank vector r is the principal
eigenvector of this matrix
satisfying r = Ar
Equivalently, r is the stationary distribution
of the random walk with teleports
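A minimal sketch of the construction, assuming the spider-trap version of the three-page graph (M’soft links only to itself) and a follow-link probability of 0.8. The function names and the fixed count of 100 iterations are illustrative choices.

```python
def teleport_matrix(M, beta):
    """Mix the link matrix with uniform teleports: A[i][j] = beta*M[i][j] + (1-beta)/N."""
    N = len(M)
    return [[beta * M[i][j] + (1 - beta) / N for j in range(N)] for i in range(N)]

def iterate(A, r, steps):
    """Apply r <- A r a fixed number of times."""
    N = len(A)
    for _ in range(steps):
        r = [sum(A[i][j] * r[j] for j in range(N)) for i in range(N)]
    return r

# Rows and columns ordered y (Yahoo), a (Amazon), m (M'soft); m is a spider trap.
M_trap = [[0.5, 0.5, 0.0],
          [0.5, 0.0, 0.0],
          [0.0, 0.5, 1.0]]

A = teleport_matrix(M_trap, beta=0.8)
column_sums = [sum(A[i][j] for i in range(3)) for j in range(3)]  # each should be 1
r = iterate(A, [1.0, 1.0, 1.0], steps=100)
print([round(x, 4) for x in r])  # approximately [7/11, 5/11, 21/11]
```

With teleports the trap still collects the largest score, but Yahoo and Amazon keep non-zero rank.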
CS 331 - Data Mining 23
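Previous example with β = 0.8
(from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)
Graph: Yahoo (y), Amazon (a), M’soft (m); rows and columns ordered y, a, m
$$A = 0.8 \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{pmatrix} + 0.2 \begin{pmatrix} 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \end{pmatrix} = \begin{pmatrix} 7/15 & 7/15 & 1/15 \\ 7/15 & 1/15 & 1/15 \\ 1/15 & 7/15 & 13/15 \end{pmatrix}$$
Iterates $r_k = (y, a, m)^T$:
$$r_0 = (1, 1, 1)^T,\quad r_1 = (1.00, 0.60, 1.40)^T,\quad r_2 = (0.84, 0.60, 1.56)^T,\quad r_3 = (0.776, 0.536, 1.688)^T,\quad \ldots \to (7/11, 5/11, 21/11)^T$$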
CS 331 - Data Mining 24
Dead ends
Pages with no outlinks are “dead ends” for
the random surfer
Nowhere to go on next step
CS 331 - Data Mining 25
Microsoft becomes a dead end
(from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)
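Graph: Yahoo (y), Amazon (a), M’soft (m), where M’soft now has no outlinks; rows and columns ordered y, a, m
$$A = 0.8 \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 0 \end{pmatrix} + 0.2 \begin{pmatrix} 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \end{pmatrix} = \begin{pmatrix} 7/15 & 7/15 & 1/15 \\ 7/15 & 1/15 & 1/15 \\ 1/15 & 7/15 & 1/15 \end{pmatrix}$$
Non-stochastic! (column m sums to 1/5)
Iterates $r_k = (y, a, m)^T$:
$$r_0 = (1, 1, 1)^T,\quad r_1 = (1, 0.6, 0.6)^T,\quad r_2 = (0.787, 0.547, 0.387)^T,\quad r_3 = (0.648, 0.430, 0.333)^T,\quad \ldots \to (0, 0, 0)^T$$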
CS 331 - Data Mining 26
Dealing with dead-ends
Teleport
Follow random teleport links with probability 1.0
from dead-ends
Adjust matrix accordingly
Prune and propagate
Preprocess the graph to eliminate dead-ends
Might require multiple passes
Compute page rank on reduced graph
Approximate values for dead ends by
propagating values from reduced graph
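The pruning step of "prune and propagate" can be sketched as below. It shows why multiple passes may be needed: removing one dead end can turn a page that linked only to it into a new dead end. The four-page graph with the extra page `d` is a made-up example, and the propagation step is left out.

```python
def prune_dead_ends(links):
    """Repeatedly remove pages with no outlinks.

    Returns the reduced graph and the pages removed, in removal order.
    """
    graph = {page: set(targets) for page, targets in links.items()}
    removed = []
    while True:
        dead = [page for page, targets in graph.items() if not targets]
        if not dead:
            break
        for page in dead:
            del graph[page]
            removed.append(page)
        # Drop links that pointed at the pages just removed.
        for targets in graph.values():
            targets.difference_update(dead)
    return graph, removed

# Hypothetical graph: m links only to d, and d has no outlinks.
links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["d"], "d": []}

reduced, removed = prune_dead_ends(links)
print(reduced)   # y and a remain
print(removed)   # ['d', 'm']: d goes in the first pass, m in the second
```

Page rank would then be computed on `reduced`, and values for the removed pages approximated in reverse removal order.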
CS 331 - Data Mining 27
Anchor Text
Can be more accurate description of target
site than target site’s text itself
Can point at non-HTTP or non-text
Images
Videos
Databases
Possible for non-crawled pages to be
returned in the process
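A minimal sketch of how anchor text could be collected and attached to the link target instead of the page it appears on, using Python's standard `html.parser`. The class name `AnchorCollector` and the sample URL are assumptions for illustration, not Google's implementation.

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from one HTML document."""

    def __init__(self):
        super().__init__()
        self.pairs = []      # finished (href, text) pairs
        self._href = None    # href of the <a> element currently open, if any
        self._chunks = []    # text pieces seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._chunks = []

    def handle_data(self, data):
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the anchor text.
            text = " ".join("".join(self._chunks).split())
            self.pairs.append((self._href, text))
            self._href = None

def anchor_pairs(html):
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.pairs

# The target is an image, which has no text of its own to index.
sample = '<p>See <a href="http://example.com/cat.jpg">photo of a  cat</a>.</p>'
print(anchor_pairs(sample))  # [('http://example.com/cat.jpg', 'photo of a cat')]
```

Indexing these pairs under the target URL is what lets an image, or a page that was never crawled, show up in results.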
CS 331 - Data Mining 28
Other Features
List of occurrences of a particular word in
a particular document (Hit List)
Location information and proximity
Keeps track of visual presentation details:
Font size of words
Capitalization
Bold/Italic/Underlined/etc.
Full raw HTML of all pages is available in
repository
CS 331 - Data Mining 29
Google Architecture
(from http://www.ics.uci.edu/~scott/google.htm)
Implemented in C and C++ on Solaris and Linux
CS 331 - Data Mining 30
Google Architecture
(from http://www.ics.uci.edu/~scott/google.htm)
URL Server: Keeps track of URLs that have been crawled and that still need to be crawled.
Crawlers: Multiple crawlers run in parallel. Each crawler keeps its own DNS lookup cache and ~300 connections open at once.
Store Server: Compresses and stores web pages.
Repository: Contains the full HTML of every web page. Each document is prefixed by docID, length, and URL.
Indexer: Uncompresses and parses documents. Stores link information in the anchors file.
Anchors file: Stores each link and the text surrounding the link.
URL Resolver: Converts relative URLs into absolute URLs.
CS 331 - Data Mining 31
Google Architecture
(from http://www.ics.uci.edu/~scott/google.htm)
URL Resolver: Maps absolute URLs into docIDs stored in the Doc Index. Stores anchor text in “barrels”. Generates a database of links (pairs of docIDs).
Indexer: Parses documents and distributes hit lists into “barrels”.
Barrels: Partially sorted forward indexes, sorted by docID. Each barrel stores hit lists for a given range of wordIDs.
Lexicon: In-memory hash table that maps words to wordIDs. Contains a pointer to the doclist in the barrel that the wordID falls into.
Sorter: Creates the inverted index, whereby a document list containing docIDs and hit lists can be retrieved given a wordID.
Doc Index: DocID-keyed index where each entry includes info such as a pointer to the document in the repository, checksum, statistics, status, etc. Also contains URL info if the document has been crawled. If not, it contains just the URL.
CS 331 - Data Mining 32
Google Architecture
(from http://www.ics.uci.edu/~scott/google.htm)
Barrels: 2 kinds of barrels. Short barrels contain hit lists that include title or anchor hits. Long barrels hold all hit lists.
DumpLexicon: The list of wordIDs produced by the Sorter and the lexicon created by the Indexer are used to create a new lexicon used by the searcher. The lexicon stores ~14 million words.
Searcher: The new lexicon keyed by wordID, the inverted doc index keyed by docID, and PageRanks are used to answer queries.
CS 331 - Data Mining 33
Google Query Evaluation
1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the short barrel for every
word.
4. Scan through the doclists until there is a document that
matches all the search terms.
5. Compute the rank of that document for the query.
6. If we are in the short barrels and at the end of any
doclist, seek to the start of the doclist in the full barrel for
every word and go to step 4.
7. If we are not at the end of any doclist go to step 4.
8. Sort the documents that have matched by rank and
return the top k.
CS 331 - Data Mining 34
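Steps 4 and 8 of the query evaluation can be sketched as an intersection of sorted doclists followed by a sort on score. This is a simplified illustration: it works on plain lists of docIDs, skips the short-barrel/full-barrel switch in step 6, and takes the ranking function as a caller-supplied assumption.

```python
def intersect_sorted(doclists):
    """Return docIDs that appear in every sorted doclist, scanning each list once."""
    if not doclists:
        return []
    positions = [0] * len(doclists)
    matches = []
    while all(pos < len(dl) for pos, dl in zip(positions, doclists)):
        current = [dl[pos] for pos, dl in zip(positions, doclists)]
        target = max(current)
        if all(doc == target for doc in current):
            # Every cursor points at the same document: it matches all terms.
            matches.append(target)
            positions = [pos + 1 for pos in positions]
        else:
            # Advance only the cursors that are behind the largest docID seen.
            positions = [pos + 1 if dl[pos] < target else pos
                         for pos, dl in zip(positions, doclists)]
    return matches

def top_k(doclists, score, k):
    """Rank matching documents by score (highest first) and keep the top k."""
    return sorted(intersect_sorted(doclists), key=score, reverse=True)[:k]

# Hypothetical doclists for a three-word query, and made-up scores.
doclists = [[1, 3, 5, 8, 9], [3, 4, 8, 10], [2, 3, 8, 9]]
scores = {3: 0.2, 8: 0.7}

print(intersect_sorted(doclists))         # [3, 8]
print(top_k(doclists, scores.get, k=1))   # [8]
```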
Single Word Query Ranking
Hitlist is retrieved for single word
Each hit can be one of several types: title,
anchor, URL, large font, small font, etc.
Each hit type is assigned its own weight
Type-weights make up vector of weights
Number of hits of each type is counted to form
count-weight vector
Dot product of type-weight and count-weight
vectors is used to compute IR score
IR score is combined with PageRank to compute
final rank
CS 331 - Data Mining 35
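The IR score for a single-word query is a dot product, which the sketch below makes concrete. All numbers are invented for illustration: the type weights, the cap used to turn raw counts into count-weights, and the hit counts are assumptions, not values from the paper. The combination with PageRank is not specified on the slide, so it is left out.

```python
# Hypothetical type-weights: one weight per hit type.
TYPE_WEIGHTS = {"title": 10.0, "anchor": 8.0, "url": 6.0, "large_font": 3.0, "plain": 1.0}

def count_weight(count, cap=8):
    """Turn a raw hit count into a count-weight.

    Assumption for this sketch: grow linearly, then stop rewarding repeats at `cap`.
    """
    return float(min(count, cap))

def ir_score(hit_counts):
    """Dot product of the type-weight vector and the count-weight vector."""
    return sum(TYPE_WEIGHTS[hit_type] * count_weight(count)
               for hit_type, count in hit_counts.items())

# One title hit, two anchor hits, twelve plain-text hits for the query word.
hits = {"title": 1, "anchor": 2, "plain": 12}
print(ir_score(hits))  # 10*1 + 8*2 + 1*8 = 34.0
```

Capping the count keeps a page from winning just by repeating the word many times in body text.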
Multi-word Query Ranking
Similar to single-word ranking except now must
analyze proximity of words in a document
Hits occurring closer together are weighted higher
than those farther apart
Each proximity relation is classified into 1 of 10 bins
ranging from a “phrase match” to “not even close”
Each type and proximity pair has a type-prox weight
Counts converted into count-weights
Take dot product of count-weights and type-prox
weights to compute the IR score
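A rough sketch of the proximity step for a two-word query, assuming each hit is a (position, type) pair. The bin boundaries, the nearest-hit pairing rule, and the type-prox weight formula are all hypothetical; the slide only states that there are 10 bins and one weight per type and proximity pair.

```python
from collections import Counter

# Hypothetical upper bounds for bins 0..8; larger distances fall into bin 9.
# Bin 0 is a "phrase match" (adjacent words), bin 9 is "not even close".
BIN_EDGES = [1, 2, 3, 5, 8, 13, 21, 34, 55]
TYPE_WEIGHTS = {"title": 10.0, "plain": 1.0}  # made-up type weights

def proximity_bin(distance):
    for b, edge in enumerate(BIN_EDGES):
        if distance <= edge:
            return b
    return len(BIN_EDGES)

def proximity_counts(hits_a, hits_b):
    """Pair each hit of word A with the nearest hit of word B and count (type, bin) pairs."""
    counts = Counter()
    for pos_a, type_a in hits_a:
        pos_b, _ = min(hits_b, key=lambda h: abs(h[0] - pos_a))
        counts[(type_a, proximity_bin(abs(pos_b - pos_a)))] += 1
    return counts

def type_prox_weight(hit_type, bin_index):
    """Assumed weighting: the type weight, scaled down as the words get farther apart."""
    return TYPE_WEIGHTS[hit_type] * (10 - bin_index) / 10

def ir_score(counts):
    # Count-weights are taken to be the raw counts here, for simplicity.
    return sum(type_prox_weight(t, b) * c for (t, b), c in counts.items())

hits_a = [(4, "title"), (40, "plain")]   # word A at positions 4 and 40
hits_b = [(5, "title"), (100, "plain")]  # word B at positions 5 and 100

counts = proximity_counts(hits_a, hits_b)
print(dict(counts))  # {('title', 0): 1, ('plain', 8): 1}
print(ir_score(counts))
```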
CS 331 - Data Mining 36
Scalability
Cluster architecture combined with
Moore’s Law makes for high scalability. At
the time of writing:
~ 24 million documents indexed in one week
~518 million hyperlinks indexed
Four crawlers collected 100 documents/sec
CS 331 - Data Mining 37
Key Optimization Techniques
Each crawler maintains its own DNS lookup cache
Use flex to generate lexical analyzer with own stack for
parsing documents
Parallelization of indexing phase
In-memory lexicon
Compression of repository
Compact encoding of hit lists for space saving
Indexer is optimized so it is just faster than the crawler
so that crawling is the bottleneck
Document index is updated in bulk
Critical data structures placed on local disk
Overall architecture designed to avoid disk seeks
wherever possible
CS 331 - Data Mining 38
Storage Requirements
(from http://www.ics.uci.edu/~scott/google.htm)
At the time of publication, Google had the following
statistical breakdown for storage requirements:
CS 331 - Data Mining 39
Conclusions
Search is far from perfect
Topic/Domain-specific PageRank
Machine translation in search
Non-hypertext search
Business potential
Brin and Page worth around $15 billion each…
at 32 years old!
If you have a better idea than how Google does
search, please remember me when you’re
hiring software engineers!
CS 331 - Data Mining 40
Possible Exam Questions
Given a web/link graph, formulate a Naïve
PageRank link matrix and do a few steps of
power iteration.
Slides 14 – 16
What are spider traps and dead ends, and how
does Google deal with these?
Spider Trap: Slides 19 – 21
Dead End: Slides 25 – 27
Explain difference between single and multiple
word search query evaluation.
Slides 35 – 36
CS 331 - Data Mining 41
References
Brin, Page. The Anatomy of a Large-Scale
Hypertextual Web Search Engine.
Brin, Page, Motwani, Winograd. The PageRank
Citation Ranking: Bringing Order to the Web.
http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf
www.cs.duke.edu/~junyang/courses/cps296.1-2002-spring/lectures/02-web-search.pdf
http://www.ics.uci.edu/~scott/google.htm
CS 331 - Data Mining 42
Thank you!
CS 331 - Data Mining 43