The Anatomy of a Large-Scale Hypertextual Web Search Engine

Sergey Brin, Lawrence Page



Presented By: Paolo Lim

April 10, 2007





CS 331 - Data Mining 1

AKA: The Original Google Paper









Larry Page and Sergey Brin





CS 331 - Data Mining 2

Presentation Outline



Design goals of Google search engine

Link Analysis and other features

System architecture and major structures

Crawling, indexing, and searching the web

Performance and results

Conclusions

Final exam questions



CS 331 - Data Mining 3

Linear Algebra Background

 PageRank involves knowledge of:

Matrix addition/multiplication

Eigenvectors and Eigenvalues

Power iteration

Dot product

 Not discussed in detail in presentation

 For reference:

http://cs.wellesley.edu/~cs249B/math/Linear%20Algebra/CS298LinAlgpart1.pdf

http://www.cse.buffalo.edu/~hungngo/classes/2005/Expanders/notes/LA-intro.pdf



CS 331 - Data Mining 4

Google Design Goals

 Scaling with the web’s growth

 Improved search quality

Number of documents increasing rapidly, but users’ ability to look at documents lags

Lots of “junk” results, little relevance

 Academic search engine research

Development and understanding in academic realm

A system that a reasonable number of people can actually use

Support novel research activities on large-scale web data by other researchers and students



CS 331 - Data Mining 5

Link Analysis Basics



PageRank Algorithm

A Top 10 IEEE ICDM data mining algorithm

Large basis for ranking system (discussed later)

Tries to incorporate ideas from academic

community (publishing and citations)

Anchor Text Analysis

 ANCHOR TEXT







CS 331 - Data Mining 6

Intuition: Why Links, Anyway?



Links represent citations

Quantity of links to a website makes the

website more popular

Quality of links to a website also helps in

computing rank

Link structure largely unused before Larry

Page proposed it to thesis advisor



CS 331 - Data Mining 7

Naïve PageRank



Each link’s vote is proportional to the importance of its source page

If page P with importance I has N outlinks, then each link gets I / N votes

Simple recursive formulation:

PR(A) = PR(p1)/C(p1) + … + PR(pn)/C(pn)

p1, …, pn = the pages that link to page A

PR(X) = PageRank of page X

C(X) = number of links going out of page X



CS 331 - Data Mining 8

Naïve PageRank Model

(from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)







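The web in 1839

[Figure: three-page web graph. Yahoo (y) links to itself and to Amazon, each link carrying y/2; Amazon (a) links to Yahoo and to M’soft, each link carrying a/2; M’soft (m) links to Amazon, carrying m.]

y = y/2 + a/2

a = y/2 + m

m = a/2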

CS 331 - Data Mining 9

Solving the flow equations



3 equations, 3 unknowns, no constants

No unique solution

All solutions equivalent modulo scale factor

Additional constraint forces uniqueness

y+a+m = 1

y = 2/5, a = 2/5, m = 1/5

Gaussian elimination method works for

small examples, but we need a better

method for large graphs
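
For a graph this small the flow equations can be solved exactly. The following is a minimal sketch, assuming Python with exact fractions; it swaps one redundant flow equation for the constraint y + a + m = 1 and runs plain Gauss-Jordan elimination. The function name `solve_linear` is hypothetical and not part of the original slides.

```python
from fractions import Fraction as F

def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with exact fractions."""
    n = len(A)
    # Augmented matrix [A | b]
    aug = [[F(v) for v in row] + [F(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # Find a row with a non-zero pivot and swap it into place
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so the pivot becomes 1
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        # Eliminate this column from every other row
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n] for row in aug]

# Unknowns ordered (y, a, m). Flow equations rewritten as "... = 0":
#   y = y/2 + a/2   ->   y/2 - a/2       = 0
#   m = a/2         ->        -a/2 + m   = 0
# The remaining flow equation is redundant, so use y + a + m = 1 instead.
A = [[F(1, 2), F(-1, 2), 0],
     [0,       F(-1, 2), 1],
     [1,       1,        1]]
b = [0, 0, 1]
y, a, m = solve_linear(A, b)
print(y, a, m)
```

Elimination costs on the order of N^3 operations for N pages, which is why this only works for small examples.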

CS 331 - Data Mining 10

Matrix formulation

 Matrix M has one row and one column for each web

page

 Suppose page j has n outlinks

If j → i, then Mij = 1/n

Else Mij = 0

 M is a column stochastic matrix

Columns sum to 1

 Suppose r is a vector with one entry per web page

ri is the importance score of page i

Call it the rank vector
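
As an illustration of the definition above, here is a small sketch that builds M from a table of outlinks. The helper name `link_matrix` and the dictionary layout are assumptions made for this example; it also assumes every page has at least one outlink and that no link is listed twice.

```python
from fractions import Fraction

def link_matrix(outlinks, pages):
    """Return M as a list of rows, with M[i][j] = 1/n if page j links to page i."""
    index = {p: k for k, p in enumerate(pages)}
    N = len(pages)
    M = [[Fraction(0)] * N for _ in range(N)]
    for j, targets in outlinks.items():
        n = len(targets)              # number of outlinks of page j
        for i in targets:
            M[index[i]][index[j]] = Fraction(1, n)
    return M

# The three-page example: Yahoo (y), Amazon (a), M'soft (m)
pages = ["y", "a", "m"]
outlinks = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
M = link_matrix(outlinks, pages)

# Column-stochastic check: every column should sum to 1
column_sums = [sum(M[i][j] for i in range(len(pages))) for j in range(len(pages))]
print(column_sums)
```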

CS 331 - Data Mining 11

Example

(from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)



Suppose page j links to 3 pages, including i

[Figure: the product M r = r. Column j of M has the entry 1/3 in row i, so page j contributes rj/3 to ri.]









CS 331 - Data Mining 12

Eigenvector formulation



The flow equations can be written

r = Mr

So the rank vector is an eigenvector of the

stochastic web matrix

In fact, its first or principal eigenvector, with

corresponding eigenvalue 1









CS 331 - Data Mining 13

Example

(from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)



[Figure: the Yahoo / Amazon / M’soft web graph]

        y     a     m
  y    1/2   1/2    0
  a    1/2    0     1
  m     0    1/2    0

r = Mr

y = y/2 + a/2
a = y/2 + m
m = a/2

  | y |   | 1/2  1/2   0  | | y |
  | a | = | 1/2   0    1  | | a |
  | m |   |  0   1/2   0  | | m |
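
A quick way to see the eigenvector claim is to multiply the example matrix by the solution of the flow equations and confirm the vector comes back unchanged. The sketch below assumes exact fractions; the helper name `mat_vec` is hypothetical.

```python
from fractions import Fraction as F

# Column-stochastic matrix of the three-page example, rows and columns ordered (y, a, m)
M = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 2), F(0),    F(1)],
     [F(0),    F(1, 2), F(0)]]

# Solution of the flow equations under y + a + m = 1
r = [F(2, 5), F(2, 5), F(1, 5)]

def mat_vec(M, v):
    """Plain matrix-vector product."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

Mr = mat_vec(M, r)
print(Mr == r)   # r is an eigenvector with eigenvalue 1 when this holds
```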

CS 331 - Data Mining 14

Power Iteration



Simple iterative scheme (aka relaxation)

Suppose there are N web pages

Initialize: r_0 = [1, …, 1]^T

Iterate: r_(k+1) = M r_k

Stop when |r_(k+1) - r_k|_1 < ε

|x|_1 = Σ_(1≤i≤N) |x_i| is the L1 norm

Can use any other vector norm, e.g., Euclidean
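
The scheme above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the default tolerance `eps` and the `max_iter` safety limit are assumptions added for the example.

```python
def power_iteration(M, eps=1e-10, max_iter=1000):
    """Iterate r <- M r from r0 = [1, ..., 1] until the L1 change is below eps."""
    N = len(M)
    r = [1.0] * N
    for _ in range(max_iter):
        r_next = [sum(M[i][j] * r[j] for j in range(N)) for i in range(N)]
        # L1 norm of the difference between successive iterates
        if sum(abs(x - y) for x, y in zip(r_next, r)) < eps:
            return r_next
        r = r_next
    return r

# Three-page example, rows and columns ordered (y, a, m)
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]
r = power_iteration(M)
print([round(x, 4) for x in r])
```

Because the start vector sums to N = 3 and M is column stochastic, the result sums to 3 as well; dividing by N gives the normalized ranks.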





CS 331 - Data Mining 15

Power Iteration Example

(from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)









[Figure: the Yahoo / Amazon / M’soft web graph]

        y     a     m
  y    1/2   1/2    0
  a    1/2    0     1
  m     0    1/2    0

  y     1     1    5/4    9/8          6/5
  a  =  1    3/2    1    11/8   . . .  6/5
  m     1    1/2   3/4    1/2          3/5

CS 331 - Data Mining 16

Random Surfer

 Imagine a random web surfer

At any time t, surfer is on some page P

At time t+1, the surfer follows an outlink from P

uniformly at random

Ends up on some page Q linked from P

Process repeats indefinitely

 Let p(t) be a vector whose ith component is the

probability that the surfer is at page i at time t

p(t) is a probability distribution on pages



CS 331 - Data Mining 17

The stationary distribution

 Where is the surfer at time t+1?

Follows a link uniformly at random

p(t+1) = Mp(t)

 Suppose the random walk reaches a state such that

p(t+1) = Mp(t) = p(t)

Then p(t) is called a stationary distribution for the

random walk

 Our rank vector r satisfies r = Mr

So it is a stationary distribution for the random

surfer
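
The random-surfer picture can also be checked by simulation. The sketch below, assuming the three-page example graph, lets a surfer follow outlinks uniformly at random and records the fraction of time spent on each page; the function name `simulate_surfer`, the step count, and the fixed seed are choices made for this illustration.

```python
import random

def simulate_surfer(outlinks, start, steps, seed=0):
    """Follow outlinks uniformly at random and return the fraction of time per page."""
    rng = random.Random(seed)
    visits = {page: 0 for page in outlinks}
    page = start
    for _ in range(steps):
        page = rng.choice(outlinks[page])   # pick one outlink uniformly at random
        visits[page] += 1
    return {p: c / steps for p, c in visits.items()}

# Yahoo (y), Amazon (a), M'soft (m)
outlinks = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
freq = simulate_surfer(outlinks, start="y", steps=200_000)
print(freq)
```

For this graph the visit frequencies should settle near the stationary distribution y = 2/5, a = 2/5, m = 1/5.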

CS 331 - Data Mining 18

Spider traps



A group of pages is a spider trap if there

are no links from within the group to

outside the group

Random surfer gets trapped

Spider traps violate the conditions needed

for the random walk theorem







CS 331 - Data Mining 19

Microsoft becomes a spider trap

(from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)









[Figure: the web graph with M’soft linking only to itself]

        y     a     m
  y    1/2   1/2    0
  a    1/2    0     0
  m     0    1/2    1

  y     1     1    3/4   5/8          0
  a  =  1    1/2   1/2   3/8   . . .  0
  m     1    3/2   7/4    2           3



CS 331 - Data Mining 20

Random teleports



The Google solution for spider traps

At each time step, the random surfer has

two options:

With probability β, follow a link at random

With probability 1-β, jump to some page

uniformly at random

Common values for β are in the range 0.8 to 0.9

Surfer will teleport out of spider trap within

a few time steps

CS 331 - Data Mining 21

Matrix formulation



Suppose there are N pages

Consider a page j, with set of outlinks O(j)

We have Mij = 1/|O(j)| when j → i and Mij = 0 otherwise

The random teleport is equivalent to

adding a teleport link from j to every other page with probability (1-β)/N

reducing the probability of following each outlink from 1/|O(j)| to β/|O(j)|

Equivalent: tax each page a fraction (1-β) of its score and redistribute evenly

CS 331 - Data Mining 22

Page Rank



Construct the NxN matrix A as follows

Aij = Mij + (1-)/N

Verify that A is a stochastic matrix

The page rank vector r is the principal

eigenvector of this matrix

satisfying r = Ar

Equivalently, r is the stationary distribution

of the random walk with teleports
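
To make the construction concrete, here is a minimal sketch that builds A for the spider-trap graph (M’soft linking only to itself) and runs power iteration on it. The link-following probability is called `beta` in the code, the value 0.8 is one of the common values mentioned earlier, and the function names and fixed iteration count are assumptions for this example.

```python
def teleport_matrix(M, beta):
    """A[i][j] = beta * M[i][j] + (1 - beta) / N."""
    N = len(M)
    return [[beta * M[i][j] + (1 - beta) / N for j in range(N)] for i in range(N)]

def power_iteration(A, iterations=200):
    """Repeatedly apply r <- A r, starting from r0 = [1, ..., 1]."""
    r = [1.0] * len(A)
    for _ in range(iterations):
        r = [sum(a * x for a, x in zip(row, r)) for row in A]
    return r

# Spider-trap example, rows and columns ordered (y, a, m)
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 1.0]]
A = teleport_matrix(M, beta=0.8)

# A should still be column stochastic
column_sums = [sum(A[i][j] for i in range(3)) for j in range(3)]

r = power_iteration(A)
print([round(x, 3) for x in r])
```

With teleports, M’soft still ranks highest but no longer absorbs all of the importance.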

CS 331 - Data Mining 23

Previous example with β = 0.8

(from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)

[Figure: the web graph with M’soft linking only to itself]

        | 1/2  1/2   0  |         | 1/3  1/3  1/3 |
  0.8 × | 1/2   0    0  | + 0.2 × | 1/3  1/3  1/3 |
        |  0   1/2   1  |         | 1/3  1/3  1/3 |

        y      a      m
  y    7/15   7/15   1/15
  a    7/15   1/15   1/15
  m    1/15   7/15  13/15

  y     1    1.00   0.84   0.776           7/11
  a  =  1    0.60   0.60   0.536   . . .   5/11
  m     1    1.40   1.56   1.688          21/11



CS 331 - Data Mining 24

Dead ends



Pages with no outlinks are “dead ends” for

the random surfer

Nowhere to go on next step









CS 331 - Data Mining 25

Microsoft becomes a dead end

(from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)



[Figure: the web graph with M’soft having no outlinks]

        | 1/2  1/2   0  |         | 1/3  1/3  1/3 |
  0.8 × | 1/2   0    0  | + 0.2 × | 1/3  1/3  1/3 |
        |  0   1/2   0  |         | 1/3  1/3  1/3 |

        y      a      m
  y    7/15   7/15   1/15
  a    7/15   1/15   1/15
  m    1/15   7/15   1/15

Non-stochastic! (the m column sums to 3/15, not 1)

  y     1     1     0.787   0.648           0
  a  =  1    0.6    0.547   0.430   . . .   0
  m     1    0.6    0.387   0.333           0

CS 331 - Data Mining 26

Dealing with dead-ends

 Teleport

Follow random teleport links with probability 1.0

from dead-ends

Adjust matrix accordingly

 Prune and propagate

Preprocess the graph to eliminate dead-ends

Might require multiple passes

Compute page rank on reduced graph

Approximate values for dead ends by

propagating values from reduced graph
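
The teleport option can be sketched as a small matrix adjustment: any all-zero column (a dead end) is replaced by a uniform 1/N column before the usual teleport mix is applied. This is only an illustration of the first option, assuming a link-following probability of 0.8; the helper name `fix_dead_ends` is hypothetical, and prune-and-propagate is not shown.

```python
def fix_dead_ends(M):
    """Replace every all-zero column (a dead end) with a uniform 1/N column."""
    N = len(M)
    fixed = [row[:] for row in M]
    for j in range(N):
        if all(M[i][j] == 0 for i in range(N)):
            for i in range(N):
                fixed[i][j] = 1.0 / N
    return fixed

# Dead-end example: M'soft has no outlinks, so its column is all zeros
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 0.0]]
beta, N = 0.8, 3
M_fixed = fix_dead_ends(M)
A = [[beta * M_fixed[i][j] + (1 - beta) / N for j in range(N)] for i in range(N)]

# The adjusted matrix is column stochastic again
column_sums = [sum(A[i][j] for i in range(N)) for j in range(N)]

# Power iteration no longer leaks importance: the total stays at N
r = [1.0] * N
for _ in range(200):
    r = [sum(a * x for a, x in zip(row, r)) for row in A]
print([round(x, 3) for x in r])
```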

CS 331 - Data Mining 27

Anchor Text



Can be more accurate description of target

site than target site’s text itself

Can point at non-HTTP or non-text

Images

Videos

Databases

Possible for non-crawled pages to be

returned in the process

CS 331 - Data Mining 28

Other Features

List of occurrences of a particular word in

a particular document (Hit List)

Location information and proximity

Keeps track of visual presentation details:

Font size of words

Capitalization

Bold/Italic/Underlined/etc.

Full raw HTML of all pages is available in

repository

CS 331 - Data Mining 29

Google Architecture

(from http://www.ics.uci.edu/~scott/google.htm)





Implemented in C and C++ on Solaris and Linux









CS 331 - Data Mining 30

Google Architecture

(from http://www.ics.uci.edu/~scott/google.htm)



[Figure: Google architecture diagram, with the following callouts]

Keeps track of URLs that have been crawled and that need to be crawled.

Multiple crawlers run in parallel. Each crawler keeps its own DNS lookup cache and ~300 connections open at once.

Compresses and stores web pages.

Stores each link and text surrounding link.

Converts relative URLs into absolute URLs.

Uncompresses and parses documents. Stores link information in anchors file.

Contains full html of every web page. Each document is prefixed by docID, length, and URL.

CS 331 - Data Mining 31

Google Architecture

(from http://www.ics.uci.edu/~scott/google.htm)



[Figure: Google architecture diagram, with the following callouts]

Maps absolute URLs into docIDs stored in Doc Index. Stores anchor text in “barrels”. Generates database of links (pairs of docIds).

Parses & distributes hit lists into “barrels.”

Partially sorted forward indexes sorted by docID. Each barrel stores hitlists for a given range of wordIDs.

In-memory hash table that maps words to wordIds. Contains pointer to doclist in barrel which wordId falls into.

Creates inverted index whereby document list containing docID and hitlists can be retrieved given wordID.

DocID keyed index where each entry includes info such as pointer to doc in repository, checksum, statistics, status, etc. Also contains URL info if doc has been crawled. If not just contains URL.

CS 331 - Data Mining 32
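
The step from a forward index (keyed by docID) to an inverted index (keyed by wordID) can be sketched with ordinary dictionaries. This is a toy in-memory illustration only: the function name `invert`, the data layout, and the use of plain word positions as hit lists are all assumptions, and the real system works on on-disk barrels.

```python
from collections import defaultdict

def invert(forward_index):
    """Turn {docID: {wordID: hitlist}} into {wordID: [(docID, hitlist), ...]}."""
    inverted = defaultdict(list)
    # Visiting docIDs in sorted order keeps every doclist sorted by docID
    for doc_id in sorted(forward_index):
        for word_id, hits in forward_index[doc_id].items():
            inverted[word_id].append((doc_id, hits))
    return dict(inverted)

# Hypothetical forward index: hit lists here are just word positions
forward_index = {
    2: {10: [0, 7], 11: [3]},
    1: {10: [4]},
}
inverted = invert(forward_index)
print(inverted[10])
```

Given a wordID, the document list with docIDs and hit lists can then be looked up directly.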

Google Architecture

(from http://www.ics.uci.edu/~scott/google.htm)









2 kinds of barrels: short barrels, which contain hit lists that include title or anchor hits, and long barrels for all hit lists.

List of wordIds produced by Sorter and lexicon created by Indexer are used to create the new lexicon used by the searcher. Lexicon stores ~14 million words.

New lexicon keyed by wordID, inverted doc index keyed by docID, and PageRanks are used to answer queries.

CS 331 - Data Mining 33

Google Query Evaluation

1. Parse the query.

2. Convert words into wordIDs.

3. Seek to the start of the doclist in the short barrel for every word.

4. Scan through the doclists until there is a document that matches all the search terms.

5. Compute the rank of that document for the query.

6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.

7. If we are not at the end of any doclist go to step 4.

8. Sort the documents that have matched by rank and return the top k.

CS 331 - Data Mining 34
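
The core of the loop above, scanning several sorted doclists for documents that contain every query word and then keeping the best k, can be sketched as follows. This is a simplified illustration under stated assumptions: doclists are plain sorted lists of docIDs, there is at least one doclist, the fallback from short to full barrels is left out, and the names `matching_docs` and `top_k` are hypothetical.

```python
import heapq

def matching_docs(doclists):
    """Yield docIDs present in every sorted doclist (steps 3, 4 and 7)."""
    positions = [0] * len(doclists)
    # Stop as soon as any doclist is exhausted
    while all(pos < len(dl) for pos, dl in zip(positions, doclists)):
        current = [dl[pos] for pos, dl in zip(positions, doclists)]
        target = max(current)
        if all(doc == target for doc in current):
            yield target
            positions = [pos + 1 for pos in positions]
        else:
            # Advance every list that is behind the largest current docID
            positions = [pos + (doc < target) for pos, doc in zip(positions, current)]

def top_k(doclists, rank, k):
    """Rank each matching document and return the k best (steps 5 and 8)."""
    return heapq.nlargest(k, matching_docs(doclists), key=rank)

# Hypothetical doclists for a two-word query, and a made-up rank per docID
doclists = [[1, 3, 5, 8, 9], [3, 4, 8, 9, 12]]
scores = {3: 0.2, 8: 0.9, 9: 0.5}
result = top_k(doclists, rank=scores.get, k=2)
print(result)
```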

Single Word Query Ranking

 Hitlist is retrieved for single word

 Each hit can be one of several types: title,

anchor, URL, large font, small font, etc.

 Each hit type is assigned its own weight

 Type-weights make up vector of weights

 Number of hits of each type is counted to form

count-weight vector

 Dot product of type-weight and count-weight

vectors is used to compute IR score

 IR score is combined with PageRank to compute

final rank

CS 331 - Data Mining 35
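
The single-word scoring described above amounts to a dot product, which the sketch below illustrates. Every number in it is hypothetical: the slides give no actual type-weights, no count-weight function, and no formula for combining the IR score with PageRank, so `TYPE_WEIGHTS`, the capped `count_weight`, and the weighted sum in `final_rank` are placeholders for illustration only.

```python
# All numeric values below are hypothetical; the slides give no actual weights.
TYPE_WEIGHTS = {"title": 5.0, "anchor": 4.0, "url": 3.0, "large_font": 2.0, "plain": 1.0}

def count_weight(count, cap=8):
    """Hypothetical count-weight: grows with the count, then stops at a cap."""
    return float(min(count, cap))

def ir_score(hit_types):
    """Dot product of the type-weight vector and the count-weight vector."""
    counts = {t: 0 for t in TYPE_WEIGHTS}
    for t in hit_types:
        counts[t] += 1
    return sum(TYPE_WEIGHTS[t] * count_weight(c) for t, c in counts.items())

def final_rank(ir, pagerank, alpha=0.5):
    """Hypothetical combination; the slides do not say how the two are combined."""
    return alpha * ir + (1 - alpha) * pagerank

# One title hit, two plain-text hits, one anchor hit for the query word
hits = ["title", "plain", "plain", "anchor"]
score = ir_score(hits)
print(score, final_rank(score, pagerank=1.0))
```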

Multi-word Query Ranking

 Similar to single-word ranking except now must

analyze proximity of words in a document

 Hits occurring closer together are weighted higher

than those farther apart

 Each proximity relation is classified into 1 of 10 bins

ranging from a “phrase match” to “not even close”

 Each type and proximity pair has a type-prox weight

 Counts converted into count-weights

 Take dot product of count-weights and type-prox

weights to compute the IR score
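
A rough sketch of the multi-word case follows, assuming hits are given as word positions. The ten bin boundaries, the type-prox weights, and the count-weight function are all invented for this example, since the slides only state that there are 10 proximity bins; the names `proximity_bin` and `type_prox_score` are hypothetical.

```python
# Bin boundaries and weights are hypothetical; the slides only say there are 10 bins.
BIN_UPPER_BOUNDS = [1, 2, 4, 8, 16, 32, 64, 128, 256]  # bin 9 catches everything else

def proximity_bin(pos_a, pos_b):
    """Map the distance between two hit positions to a bin from 0 to 9."""
    distance = abs(pos_a - pos_b)
    for bin_index, bound in enumerate(BIN_UPPER_BOUNDS):
        if distance <= bound:
            return bin_index
    return 9

def type_prox_score(pairs, type_prox_weight, count_weight):
    """Dot product of count-weights and type-prox weights over (type, bin) pairs."""
    counts = {}
    for hit_type, pos_a, pos_b in pairs:
        key = (hit_type, proximity_bin(pos_a, pos_b))
        counts[key] = counts.get(key, 0) + 1
    return sum(type_prox_weight(t, b) * count_weight(c) for (t, b), c in counts.items())

# Each entry: (hit type, position of first query word, position of second query word)
pairs = [("plain", 10, 11), ("plain", 40, 41), ("title", 0, 5)]
score = type_prox_score(
    pairs,
    type_prox_weight=lambda t, b: (2.0 if t == "title" else 1.0) * (10 - b),
    count_weight=lambda c: float(min(c, 8)),
)
print(score)
```

Bin 0 plays the role of a “phrase match” and bin 9 the role of “not even close”, so closer hits receive larger weights.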



CS 331 - Data Mining 36

Scalability

Cluster architecture combined with

Moore’s Law make for high scalability. At

time of writing:

~ 24 million documents indexed in one week

~518 million hyperlinks indexed

Four crawlers collected 100 documents/sec









CS 331 - Data Mining 37

Key Optimization Techniques

 Each crawler maintains its own DNS lookup cache

 Use flex to generate lexical analyzer with own stack for

parsing documents

 Parallelization of indexing phase

 In-memory lexicon

 Compression of repository

 Compact encoding of hit lists for space saving

 Indexer is optimized so it is just faster than the crawler

so that crawling is the bottleneck

 Document index is updated in bulk

 Critical data structures placed on local disk

 Overall architecture designed to avoid disk seeks

wherever possible

CS 331 - Data Mining 38

Storage Requirements

(from http://www.ics.uci.edu/~scott/google.htm)





At the time of publication, Google had the following

statistical breakdown for storage requirements:









CS 331 - Data Mining 39

Conclusions

Search is far from perfect

Topic/Domain-specific PageRank

Machine translation in search

Non-hypertext search

Business potential

Brin and Page worth around $15 billion each…

at 32 years old!

If you have a better idea than how Google does

search, please remember me when you’re

hiring software engineers! 

CS 331 - Data Mining 40

Possible Exam Questions

 Given a web/link graph, formulate a Naïve

PageRank link matrix and do a few steps of

power iteration.

Slides 14 – 16

 What are spider traps and dead ends, and how

does Google deal with these?

Spider Trap: Slides 19 – 21

Dead End: Slides 25 – 27

 Explain difference between single and multiple

word search query evaluation.

Slides 35 – 36

CS 331 - Data Mining 41

References

 Brin, Page. The Anatomy of a Large-Scale

Hypertextual Web Search Engine.

 Brin, Page, Motwani, Winograd. The PageRank

Citation Ranking: Bringing Order to the Web.

 http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf

 www.cs.duke.edu/~junyang/courses/cps296.1-2002-spring/lectures/02-web-search.pdf

 http://www.ics.uci.edu/~scott/google.htm

CS 331 - Data Mining 42

Thank you!









CS 331 - Data Mining 43

