					  The Anatomy of a Large-Scale
Hypertextual Web Search Engine
           Sergey Brin, Lawrence Page

                  Presented By: Paolo Lim
                            April 10, 2007

           CS 331 - Data Mining              1
AKA: The Original Google Paper

  Larry Page and Sergey Brin

Presentation Outline

Design goals of Google search engine
Link Analysis and other features
System architecture and major structures
Crawling, indexing, and searching the web
Performance and results
Final exam questions

Linear Algebra Background
 PageRank involves knowledge of:
  Matrix addition/multiplication
  Eigenvectors and Eigenvalues
  Power iteration
  Dot product
 Not discussed in detail in presentation
 For reference:

Google Design Goals
 Scaling with the web’s growth
 Improved search quality
  Number of documents increasing rapidly, but users’
   ability to look at documents lags
  Lots of “junk” results, little relevance
 Academic search engine research
  Development and understanding in academic realm
  System that a reasonable number of people can actually use
  Support novel research activities on large-scale web
   data by other researchers and students

Link Analysis Basics

PageRank Algorithm
  A Top 10 IEEE ICDM data mining algorithm
  Large basis for ranking system (discussed later)
  Tries to incorporate ideas from academic
   community (publishing and citations)
Anchor Text Analysis
  <a href="URL">ANCHOR TEXT</a>

Intuition: Why Links, Anyway?

Links represent citations
Quantity of links to a website makes the
 website more popular
Quality of links to a website also helps in
 computing rank
Link structure largely unused before Larry
 Page proposed it to thesis advisor

Naïve PageRank

Each link’s vote is proportional to the
 importance of its source page
If page P with importance I has N outlinks,
 then each link gets I / N votes
Simple recursive formulation:
  PR(A) = PR(p1)/C(p1) + … + PR(pn)/C(pn)
  p1, …, pn = the pages that link to page A
  PR(X) = PageRank of page X
  C(X) = number of links going out of page X
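
A minimal Python sketch of one application of the recursive formula. It assumes a small three-page graph stored as a dictionary of outlinks; the names `outlinks` and `naive_update` and the page labels are illustrative and do not come from the paper.

```python
from fractions import Fraction

# Assumed link graph: page -> list of pages it links to.
outlinks = {
    "y": ["y", "a"],
    "a": ["y", "m"],
    "m": ["a"],
}

def naive_update(ranks, outlinks):
    """Apply PR(A) = sum of PR(p)/C(p) over every page p that links to A."""
    new_ranks = {page: Fraction(0) for page in outlinks}
    for source, targets in outlinks.items():
        share = ranks[source] / len(targets)  # each outlink carries I / N votes
        for target in targets:
            new_ranks[target] += share
    return new_ranks

ranks = {page: Fraction(1) for page in outlinks}
updated = naive_update(ranks, outlinks)
print(updated)
```

Starting every page at rank 1, a single pass gives y = 1, a = 3/2, m = 1/2. Repeating the pass is what the later slides call power iteration.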

Naïve PageRank Model

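The web in 1839

[Figure: three-page web graph. Yahoo (y) links to itself and to Amazon;
 Amazon (a) links to Yahoo and to M’soft; M’soft (m) links to Amazon.
 Edge labels: y/2, y/2, a/2, a/2, m.]

Flow equations:
  y = y/2 + a/2
  a = y/2 + m
  m = a/2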
Solving the flow equations

3 equations, 3 unknowns, no constants
  No unique solution
  All solutions equivalent modulo scale factor
Additional constraint forces uniqueness
  y+a+m = 1
  y = 2/5, a = 2/5, m = 1/5
Gaussian elimination method works for
 small examples, but we need a better
 method for large graphs
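
For a graph this small the flow equations can be solved exactly. The sketch below is one possible approach, not the method Google uses: a hand-written Gauss-Jordan routine (the helper `solve` is an assumption for illustration) working on exact fractions, with the redundant third flow equation replaced by the constraint y + a + m = 1.

```python
from fractions import Fraction

def solve(matrix, rhs):
    """Solve a small square linear system exactly by Gauss-Jordan elimination."""
    n = len(rhs)
    rows = [[Fraction(v) for v in row] + [Fraction(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        # Find a row with a non-zero entry in this column and move it into place.
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [v / rows[col][col] for v in rows[col]]
        for r in range(n):
            if r != col:
                factor = rows[r][col]
                rows[r] = [v - factor * p for v, p in zip(rows[r], rows[col])]
    return [row[-1] for row in rows]

half = Fraction(1, 2)
# Unknowns ordered (y, a, m). Rows 1 and 2 are y = y/2 + a/2 and a = y/2 + m
# moved to the left-hand side; row 3 is the constraint y + a + m = 1.
system = [
    [1 - half, -half, 0],
    [-half, 1, -1],
    [1, 1, 1],
]
y, a, m = solve(system, [0, 0, 1])
print(y, a, m)  # 2/5 2/5 1/5
```

Elimination costs on the order of N^3 operations for N pages, which is why the slides move on to an iterative method for web-scale graphs.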
Matrix formulation
 Matrix M has one row and one column for each web page
 Suppose page j has n outlinks
   If j → i, then Mij = 1/n
   Else Mij = 0
 M is a column stochastic matrix
   Columns sum to 1
 Suppose r is a vector with one entry per web page
   ri is the importance score of page i
   Call it the rank vector
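
As a rough illustration, the matrix M can be built from a link graph as follows. The function name `link_matrix` and the dictionary representation are assumptions made for this sketch, and it assumes each page lists a given target at most once.

```python
from fractions import Fraction

def link_matrix(pages, outlinks):
    """Return M as a list of rows, with M[i][j] = 1/n when page j links to page i."""
    index = {page: k for k, page in enumerate(pages)}
    size = len(pages)
    matrix = [[Fraction(0)] * size for _ in range(size)]
    for source, targets in outlinks.items():
        j = index[source]  # column j describes the outlinks of page j
        for target in targets:
            matrix[index[target]][j] = Fraction(1, len(targets))
    return matrix

pages = ["y", "a", "m"]
outlinks = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
M = link_matrix(pages, outlinks)

# Column-stochastic check: every column should sum to 1.
column_sums = [sum(row[j] for row in M) for j in range(len(pages))]
print(column_sums)
```

A page with no outlinks would leave its column all zeros, so the column-sum check is also a quick way to spot dead ends.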

Suppose page j links to 3 pages, including i

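[Figure: the product M r = r, highlighting column j of M and row i; since
 page j has 3 outlinks, Mij = 1/3 and page j contributes rj/3 to ri.]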
Eigenvector formulation

The flow equations can be written
                  r = Mr
So the rank vector is an eigenvector of the
 stochastic web matrix
  In fact, its first or principal eigenvector, with
   corresponding eigenvalue 1


[Figure: the Yahoo / Amazon / M’soft web graph]

Flow equations:
  y = y/2 + a/2
  a = y/2 + m
  m = a/2

Matrix M (column = source page, row = destination page):
        y     a     m
   y   1/2   1/2    0
   a   1/2    0     1
   m    0    1/2    0

r = Mr:
  [y]   [1/2  1/2   0 ] [y]
  [a] = [1/2   0    1 ] [a]
  [m]   [ 0   1/2   0 ] [m]
Power Iteration

Simple iterative scheme (aka relaxation)
Suppose there are N web pages
Initialize: r0 = [1, …, 1]^T
Iterate: r(k+1) = M r(k)
Stop when |r(k+1) - r(k)|1 < ε
  |x|1 = Σ(1≤i≤N) |xi| is the L1 norm
  Can use any other vector norm e.g., Euclidean
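
The scheme above can be sketched in a few lines of Python. This is a minimal illustration using plain lists; the function name `power_iteration`, the default tolerance `epsilon`, and the `max_steps` safety limit are assumptions, not values from the paper.

```python
def power_iteration(matrix, epsilon=1e-10, max_steps=1000):
    """Iterate r <- M r from the all-ones vector until the L1 change is below epsilon."""
    size = len(matrix)
    rank = [1.0] * size
    for _ in range(max_steps):
        new_rank = [sum(matrix[i][j] * rank[j] for j in range(size)) for i in range(size)]
        change = sum(abs(n - o) for n, o in zip(new_rank, rank))  # L1 norm of the update
        rank = new_rank
        if change < epsilon:
            break
    return rank

# Column-stochastic matrix for the Yahoo / Amazon / M'soft graph.
M = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 1.0],
    [0.0, 0.5, 0.0],
]
r = power_iteration(M)
print(r)  # approaches [1.2, 1.2, 0.6], i.e. 6/5, 6/5, 3/5
```

Because M is column stochastic, each step preserves the sum of the entries, so the result is the (2/5, 2/5, 1/5) solution scaled by the starting total of 3.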

Power Iteration Example

[Figure: the Yahoo / Amazon / M’soft web graph]

Matrix M:
        y     a     m
   y   1/2   1/2    0
   a   1/2    0     1
   m    0    1/2    0

Iterates r0, r1, r2, r3, …, limit:
   y      1     1     5/4    9/8          6/5
   a  =   1    3/2     1    11/8   . . .  6/5
   m      1    1/2    3/4    1/2          3/5
Random Surfer
 Imagine a random web surfer
   At any time t, surfer is on some page P
   At time t+1, the surfer follows an outlink from P
     uniformly at random
   Ends up on some page Q linked from P
   Process repeats indefinitely
 Let p(t) be a vector whose ith component is the
  probability that the surfer is at page i at time t
   p(t) is a probability distribution on pages

The stationary distribution
 Where is the surfer at time t+1?
   Follows a link uniformly at random
   p(t+1) = Mp(t)
 Suppose the random walk reaches a state such that
  p(t+1) = Mp(t) = p(t)
   Then p(t) is called a stationary distribution for the
     random walk
 Our rank vector r satisfies r = Mr
   So it is a stationary distribution for the random
     walk
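
The random-surfer interpretation can be checked empirically. The sketch below is a simple Monte Carlo simulation under assumed names (`simulate_surfer`, a fixed `seed` for repeatability); it counts how often the surfer lands on each page of the three-page example graph.

```python
import random

def simulate_surfer(outlinks, start, steps, seed=0):
    """Return the fraction of steps a random surfer spends on each page."""
    rng = random.Random(seed)
    visits = {page: 0 for page in outlinks}
    page = start
    for _ in range(steps):
        page = rng.choice(outlinks[page])  # follow an outlink uniformly at random
        visits[page] += 1
    return {p: count / steps for p, count in visits.items()}

outlinks = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
frequencies = simulate_surfer(outlinks, start="y", steps=200_000)
print(frequencies)
```

The visit frequencies should come out close to 0.4, 0.4 and 0.2, the normalized rank vector (2/5, 2/5, 1/5) for this graph.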
Spider traps

A group of pages is a spider trap if there
 are no links from within the group to
 outside the group
  Random surfer gets trapped
Spider traps violate the conditions needed
 for the random walk theorem
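
The definition translates directly into a check on a link graph. This is a sketch with an assumed function name, `is_spider_trap`; note that under this literal definition the set of all pages is trivially closed, so in practice one looks for small closed groups.

```python
def is_spider_trap(group, outlinks):
    """True if the group is non-empty and no page in it links outside the group."""
    group = set(group)
    return bool(group) and all(
        target in group for page in group for target in outlinks[page]
    )

# Assumed example: M'soft ("m") links only to itself.
outlinks = {"y": ["y", "a"], "a": ["y", "m"], "m": ["m"]}
print(is_spider_trap({"m"}, outlinks))       # True
print(is_spider_trap({"y", "a"}, outlinks))  # False: "a" links out to "m"
```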

Microsoft becomes a spider trap

[Figure: the web graph with M’soft now linking only to itself]

Matrix M:
        y     a     m
   y   1/2   1/2    0
   a   1/2    0     0
   m    0    1/2    1

Iterates r0, r1, r2, r3, …, limit:
   y      1     1     3/4    5/8          0
   a  =   1    1/2    1/2    3/8   . . .  0
   m      1    3/2    7/4     2           3
Random teleports

The Google solution for spider traps
At each time step, the random surfer has
 two options:
  With probability β, follow a link at random
  With probability 1-β, jump to some page
   uniformly at random
  Common values for β are in the range 0.8 to 0.9
Surfer will teleport out of spider trap within
 a few time steps
Matrix formulation

Suppose there are N pages
  Consider a page j, with set of outlinks O(j)
  We have Mij = 1/|O(j)| when j → i and Mij = 0 otherwise
  The random teleport is equivalent to
     adding a teleport link from j to every page with
      probability (1-β)/N
     reducing the probability of following each outlink from
      1/|O(j)| to β/|O(j)|
     Equivalent: tax each page a fraction (1-β) of its score
      and redistribute evenly
Page Rank

Construct the NxN matrix A as follows
  Aij = βMij + (1-β)/N
Verify that A is a stochastic matrix
The page rank vector r is the principal
 eigenvector of this matrix
  satisfying r = Ar
Equivalently, r is the stationary distribution
 of the random walk with teleports
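
A short sketch of the construction, using exact fractions so the entries can be compared by eye. The names `google_matrix`, `step` and `beta` (the probability of following a link rather than teleporting) are assumptions for illustration; the example matrix is the spider-trap graph in which M'soft links only to itself.

```python
from fractions import Fraction

def google_matrix(M, beta):
    """Return A with A[i][j] = beta * M[i][j] + (1 - beta) / N."""
    n = len(M)
    teleport = (1 - beta) / n
    return [[beta * M[i][j] + teleport for j in range(n)] for i in range(n)]

def step(A, r):
    """One power-iteration step r <- A r."""
    return [sum(a * x for a, x in zip(row, r)) for row in A]

half = Fraction(1, 2)
# Spider-trap example: the third column (M'soft) links only to itself.
M = [
    [half, half, 0],
    [half, 0, 0],
    [0, half, 1],
]
A = google_matrix(M, Fraction(4, 5))  # beta = 0.8

r = [Fraction(1)] * 3
for _ in range(3):
    r = step(A, r)
print([float(x) for x in r])  # [0.776, 0.536, 1.688]

fixed_point = [Fraction(7, 11), Fraction(5, 11), Fraction(21, 11)]
print(step(A, fixed_point) == fixed_point)  # True
```

Every column of A still sums to 1, and the vector (7/11, 5/11, 21/11) is left unchanged by a step, so the trap no longer absorbs all of the rank.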
Previous example with β = 0.8

[Figure: the web graph with M’soft as a spider trap]

          [1/2  1/2   0 ]         [1/3  1/3  1/3]
  A = 0.8 [1/2   0    0 ]  + 0.2  [1/3  1/3  1/3]
          [ 0   1/2   1 ]         [1/3  1/3  1/3]

         y      a      m
   y    7/15   7/15   1/15
   a    7/15   1/15   1/15
   m    1/15   7/15  13/15

Iterates r0, r1, r2, r3, …, limit:
   y      1    1.00   0.84   0.776           7/11
   a  =   1    0.60   0.60   0.536   . . .   5/11
   m      1    1.40   1.56   1.688          21/11
Dead ends

Pages with no outlinks are “dead ends” for
 the random surfer
  Nowhere to go on next step

Microsoft becomes a dead end

[Figure: the web graph with M’soft now having no outlinks]

          [1/2  1/2   0 ]         [1/3  1/3  1/3]
  A = 0.8 [1/2   0    0 ]  + 0.2  [1/3  1/3  1/3]
          [ 0   1/2   0 ]         [1/3  1/3  1/3]

         y      a      m
   y    7/15   7/15   1/15
   a    7/15   1/15   1/15
   m    1/15   7/15   1/15

Non-stochastic! The m column sums to 3/15, not 1.

Iterates r0, r1, r2, r3, …, limit:
   y      1    1      0.787   0.648           0
   a  =   1    0.6    0.547   0.429   . . .   0
   m      1    0.6    0.387   0.333           0
Dealing with dead-ends
 Teleport
   Follow random teleport links with probability 1.0
    from dead-ends
   Adjust matrix accordingly
 Prune and propagate
   Preprocess the graph to eliminate dead-ends
   Might require multiple passes
   Compute page rank on reduced graph
   Approximate values for dead ends by
    propagating values from reduced graph
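
The pruning pass can be sketched as below. The function name `prune_dead_ends` and the example graph are assumptions; the sketch covers only the preprocessing step and leaves out propagating ranks back to the removed pages.

```python
def prune_dead_ends(outlinks):
    """Repeatedly drop pages with no outlinks; return (reduced graph, removal order)."""
    graph = {page: set(targets) for page, targets in outlinks.items()}
    removed = []
    while True:
        dead = [page for page, targets in graph.items() if not targets]
        if not dead:
            return graph, removed
        for page in dead:
            del graph[page]
            removed.append(page)
        for targets in graph.values():
            targets.difference_update(dead)  # removing a page may create new dead ends

# Assumed graph: "m" is a dead end, and "x" links only to "m".
outlinks = {"y": ["y", "a"], "a": ["y", "x"], "x": ["m"], "m": []}
reduced, removed = prune_dead_ends(outlinks)
print(reduced)
print(removed)  # ['m', 'x']
```

Removing "m" turns "x" into a dead end, which is why more than one pass can be needed. Keeping the removal order makes it possible to assign approximate ranks to the pruned pages in reverse order afterwards.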
Anchor Text

Can be more accurate description of target
 site than target site’s text itself
Can point at non-HTTP or non-text documents
Possible for non-crawled pages to be
 returned in the process
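
One way to picture this is an index that files anchor words under the page being pointed at rather than the page containing the link. The sketch below is illustrative only; `index_anchor_text`, the tuple layout and the example URLs are all assumptions.

```python
from collections import defaultdict

def index_anchor_text(links):
    """Map each word of a link's anchor text to the URL the link points at."""
    index = defaultdict(set)
    for source_url, target_url, anchor_text in links:
        for word in anchor_text.lower().split():
            index[word].add(target_url)  # credit the target, not the source page
    return index

# Assumed crawl output: (source page, link target, anchor text).
links = [
    ("http://a.example/", "http://b.example/logo.png", "Example logo"),
    ("http://a.example/", "http://c.example/", "search engine paper"),
]
anchor_index = index_anchor_text(links)
print(anchor_index["logo"])   # {'http://b.example/logo.png'}
print(anchor_index["paper"])  # {'http://c.example/'}
```

Neither target needs to have been crawled, and the first one is an image with no text of its own, yet both can be returned for a matching query.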
Other Features
List of occurrences of a particular word in
 a particular document (Hit List)
Location information and proximity
Keeps track of visual presentation details:
  Font size of words
Full raw HTML of all pages is available in
 the repository
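
A hit list can be pictured as a per-document, per-word list of small records. The sketch below is a simplified illustration; the `Hit` fields, the `kind` labels and `build_hit_lists` are assumptions, and the real system packs this information into a compact binary encoding.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Hit:
    position: int   # word offset within the document
    font_size: int  # relative font size of the occurrence
    kind: str       # assumed labels such as "plain", "title", "anchor"

def build_hit_lists(doc_id, tokens):
    """Group a document's (word, font_size, kind) tokens into per-word hit lists."""
    hit_lists = defaultdict(list)
    for position, (word, font_size, kind) in enumerate(tokens):
        hit_lists[(doc_id, word.lower())].append(Hit(position, font_size, kind))
    return hit_lists

tokens = [("Google", 3, "title"), ("search", 1, "plain"), ("google", 1, "plain")]
hits = build_hit_lists(7, tokens)
print(hits[(7, "google")])
```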
Google Architecture

       Implemented in C and C++ on Solaris and Linux

  Google Architecture

[Figure: high-level architecture diagram, part 1. Callouts:]
 URL Server: keeps track of URLs that have been and need to be crawled.
 Crawlers: multiple crawlers run in parallel. Each crawler keeps its own
  DNS lookup cache and ~300 connections open at once.
 Store Server: compresses and stores web pages.
 Anchors: stores each link and the text surrounding the link.
 URL Resolver: converts relative URLs into absolute URLs.
 Indexer: uncompresses and parses documents. Stores information in the
  anchors file.
 Repository: contains the full HTML of every web page. Each document is
  prefixed by docID, length, and URL.
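
Converting relative URLs into absolute ones is a standard operation; in Python it can be sketched with `urllib.parse.urljoin` from the standard library. The wrapper `resolve_links` and the example URLs are assumptions for illustration, and the original system was written in C and C++.

```python
from urllib.parse import urljoin

def resolve_links(page_url, hrefs):
    """Convert the hrefs found on a page into absolute URLs."""
    return [urljoin(page_url, href) for href in hrefs]

absolute = resolve_links(
    "http://www.example.com/docs/index.html",
    ["paper.html", "../about.html", "http://other.example/"],
)
print(absolute)
```

Links that are already absolute pass through unchanged, so every link can be run through the same step before being mapped to a docID.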
   Google Architecture

[Figure: high-level architecture diagram, part 2. Callouts:]
 URL Resolver: maps absolute URLs into docIDs stored in the Doc Index.
  Stores anchor text in “barrels”. Generates a database of links (pairs
  of docIDs).
 Indexer: parses and distributes hit lists into “barrels.”
 Barrels: partially sorted forward indexes, sorted by docID. Each barrel
  stores hit lists for a given range of wordIDs.
 Lexicon: in-memory hash table that maps words to wordIDs. Contains a
  pointer to the doclist in the barrel that the wordID falls into.
 Sorter: creates the inverted index, whereby the document list containing
  docIDs and hit lists can be retrieved given a wordID.
 Doc Index: docID-keyed index where each entry includes info such as a
  pointer to the doc in the repository, checksum, statistics, status, etc.
  Also contains URL info if the doc has been crawled. If not, just
  contains the URL.
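
The step from a forward index (keyed by docID) to an inverted index (keyed by wordID) can be sketched as follows. This is a toy in-memory version with assumed names; the real sorter works barrel by barrel on disk.

```python
from collections import defaultdict

def invert(forward_index):
    """Turn docID -> {wordID: hit list} into wordID -> [(docID, hit list), ...]."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward_index):  # keeps each doclist sorted by docID
        for word_id, hit_list in forward_index[doc_id].items():
            inverted[word_id].append((doc_id, hit_list))
    return dict(inverted)

# Assumed forward index: word positions stand in for full hits.
forward_index = {
    2: {10: [0, 4], 11: [1]},
    1: {10: [3]},
}
inverted_index = invert(forward_index)
print(inverted_index[10])  # [(1, [3]), (2, [0, 4])]
```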
Google Architecture

[Figure: high-level architecture diagram, part 3. Callouts:]
 Barrels: 2 kinds of barrels. Short barrels contain hit lists that
  include title or anchor hits. Long barrels hold all hit lists.
 DumpLexicon: the list of wordIDs produced by the Sorter and the lexicon
  created by the Indexer are used to create a new lexicon used by the
  searcher. The lexicon stores ~14 million words.
 Searcher: the new lexicon keyed by wordID, the inverted doc index keyed
  by docID, and PageRanks are used to answer queries.
Google Query Evaluation
1.   Parse the query.
2.   Convert words into wordIDs.
3.   Seek to the start of the doclist in the short barrel for every
     word.
4.   Scan through the doclists until there is a document that
     matches all the search terms.
5.   Compute the rank of that document for the query.
6.   If we are in the short barrels and at the end of any
     doclist, seek to the start of the doclist in the full barrel for
     every word and go to step 4.
7.   If we are not at the end of any doclist go to step 4.
8.   Sort the documents that have matched by rank and
     return the top k.
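
The steps above can be sketched in simplified form. The code below is an assumption-laden illustration: it replaces the synchronized scan through the doclists with a set intersection, treats each barrel as a dictionary from wordID to a list of docIDs, and takes the ranking function as a parameter (`rank`), since the actual ranking is described on the following slides.

```python
def evaluate_query(word_ids, short_barrel, full_barrel, rank, k):
    """Return the top-k docIDs that contain every query word."""
    def matches(barrel):
        doclists = [set(barrel.get(word_id, ())) for word_id in word_ids]
        return set.intersection(*doclists) if doclists else set()

    matched = matches(short_barrel)   # steps 3-5: scan the short barrels first
    matched |= matches(full_barrel)   # step 6: then continue in the full barrels
    return sorted(matched, key=rank, reverse=True)[:k]  # step 8

short_barrel = {1: [3, 8], 2: [8]}
full_barrel = {1: [3, 5, 8, 9], 2: [5, 8, 9]}
scores = {5: 0.2, 8: 0.9, 9: 0.5}  # hypothetical per-document ranks
top = evaluate_query([1, 2], short_barrel, full_barrel, scores.get, k=2)
print(top)  # [8, 9]
```

Document 3 contains only the first word, so it never matches; documents 5, 8 and 9 match both words and are ordered by rank.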
Single Word Query Ranking
 Hitlist is retrieved for single word
 Each hit can be one of several types: title,
  anchor, URL, large font, small font, etc.
 Each hit type is assigned its own weight
 Type-weights make up vector of weights
 Number of hits of each type is counted to form
  count-weight vector
 Dot product of type-weight and count-weight
  vectors is used to compute IR score
 IR score is combined with PageRank to compute
  final rank
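
A sketch of the IR score as a dot product. The weight values in `TYPE_WEIGHTS` and the capped `count_weight` function are hypothetical; the slides give neither the actual weights nor the way counts are converted, and the combination with PageRank is not shown.

```python
# Hypothetical type-weights; the real values are not given in the slides.
TYPE_WEIGHTS = {"title": 10.0, "anchor": 8.0, "url": 6.0, "large_font": 3.0, "small_font": 1.0}

def count_weight(count, cap=4):
    """Assumed conversion: grows linearly with the hit count, then levels off at `cap`."""
    return float(min(count, cap))

def ir_score(hit_types):
    """Dot product of the type-weight vector and the count-weight vector."""
    counts = {hit_type: 0 for hit_type in TYPE_WEIGHTS}
    for hit_type in hit_types:
        counts[hit_type] += 1
    return sum(TYPE_WEIGHTS[t] * count_weight(counts[t]) for t in TYPE_WEIGHTS)

score = ir_score(["title", "small_font", "small_font", "anchor"])
print(score)  # 10*1 + 8*1 + 1*2 = 20.0
```

Capping the count keeps a page from scoring arbitrarily high just by repeating a word many times.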
Multi-word Query Ranking
 Similar to single-word ranking except now must
  analyze proximity of words in a document
 Hits occurring closer together are weighted higher
  than those farther apart
 Each proximity relation is classified into 1 of 10 bins
  ranging from a “phrase match” to “not even close”
 Each type and proximity pair has a type-prox weight
 Counts converted into count-weights
 Take dot product of count-weights and type-prox
  weights to compute the IR score
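
The proximity binning can be illustrated with a small sketch. The bin boundaries used here (one bin per word of separation, with everything farther apart collapsed into the last bin) are an assumption; the slides state only that there are 10 bins running from a phrase match to "not even close".

```python
def proximity_bin(pos_a, pos_b, bins=10):
    """Assumed binning: bin 0 for adjacent words, the last bin for distant ones."""
    distance = abs(pos_a - pos_b)
    return min(max(distance - 1, 0), bins - 1)

def closest_bin(hits_a, hits_b):
    """Bin of the closest pair of positions taken from two words' hit lists."""
    return min(proximity_bin(a, b) for a in hits_a for b in hits_b)

print(proximity_bin(4, 5))              # 0: adjacent words, a phrase match
print(proximity_bin(0, 100))            # 9: not even close
print(closest_bin([1, 40], [42, 90]))   # 1: positions 40 and 42 are closest
```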

Cluster architecture combined with
 Moore’s Law makes for high scalability. At
 time of writing:
  ~24 million documents indexed in one week
  ~518 million hyperlinks indexed
  Four crawlers collected 100 documents/sec
Key Optimization Techniques
 Each crawler maintains its own DNS lookup cache
 Use flex to generate lexical analyzer with own stack for
  parsing documents
 Parallelization of indexing phase
 In-memory lexicon
 Compression of repository
 Compact encoding of hit lists for space saving
 Indexer is optimized so it is just faster than the crawler
  so that crawling is the bottleneck
 Document index is updated in bulk
 Critical data structures placed on local disk
 Overall architecture designed to avoid disk seeks
  wherever possible
Storage Requirements

     At the time of publication, Google had the following
     statistical breakdown for storage requirements:

Search is far from perfect
  Topic/Domain-specific PageRank
  Machine translation in search
  Non-hypertext search
Business potential
  Brin and Page worth around $15 billion each…
   at 32 years old!
  If you have a better idea than how Google does
   search, please remember me when you’re
   hiring software engineers!
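Possible Exam Questions
 Given a web/link graph, formulate a Naïve
  PageRank link matrix and do a few steps of
  power iteration.
   Slides 14 – 16 (the Yahoo / Amazon / M’soft matrix example,
    Power Iteration, Power Iteration Example)
 What are spider traps and dead ends, and how
  does Google deal with these?
   Spider Trap: Slides 19 – 21 (Spider traps through Random teleports)
   Dead End: Slides 25 – 27 (Dead ends through Dealing with dead-ends)
 Explain the difference between single and multiple
  word search query evaluation.
   Slides 35 – 36 (Single Word Query Ranking, Multi-word Query Ranking)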
References
 Brin, Page. The Anatomy of a Large-Scale
  Hypertextual Web Search Engine.
 Page, Brin, Motwani, Winograd. The PageRank
  Citation Ranking: Bringing Order to the Web.
Thank you!
