The PageRank Citation Ranking Bringing Order to the Web

					  The PageRank Citation Ranking:
     Bringing Order to the Web
   Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd
                      January 29th, 1998
                       Stanford InfoLab

Adaptive methods for the computation
           of PageRank
        Sepandar Kamvar, Taher Haveliwala, Gene Golub

                                           Presented By:
                                                          Wang Hao
                                                       March 8th, 2011
    Technology Overview
    Introduction & Motivation
    Link Structure of the Web
    Simplified PageRank
    PageRank Definition
    How we can get PageRank
    Dangling Links
    PageRank Implementation
    Adaptive Methods for computation of PageRank
    Searching with PageRank
    Personalized PageRank
    Application
    Conclusion
    References
Technology Overview
 Recognized the need for a new kind of server setup

 Linked PCs to quickly find each query’s answers
  This resulted in: Faster Response Time
                    Greater Scalability
                    Lower costs

 Google uses more than 200 signals (including the PageRank
  algorithm) to determine which pages are important

 Google then performs hypertext-matching

                                     - Google Corporate Information
Life of a Google Query

                         - Google Corporate Information
Introduction & Motivation
 WWW is very large and heterogeneous
 The web pages are extremely diverse in terms of content,
  quality and structure
 Challenging for information retrieval on WWW

 Academic citations link to other well-known papers
 But they are peer reviewed and have quality control
 The web of academic documents is homogeneous in its quality, usage, citation &

 Most web pages link to other web pages as well
 The quality of a web page is subjective to the user
 The importance of a page is a quantity that is hard to capture intuitively

  How can the most relevant pages be ranked at the top?
  Take advantage of the link structure of the Web to produce a ranking
  of every web page, known as PageRank
Link Structure of the Web
                   A and B are Backlinks of C

                   •Every page has some number of
                   forward links (outedges) and
                   backlinks (inedges)

                   •We can never know all the
                   backlinks of a page, but we know
                   all of its forward links

                   •Generally, highly linked pages are
                   more “important”
PageRank Definition
• PageRank - a method for computing a ranking for every web page based on
  the graph of the web
• A page has high rank if the sum of the ranks of its backlinks is high
      • Page has many backlinks
      • Page has a few highly ranked backlinks

           A page is important if important pages refer to it

• PageRank is a link analysis algorithm that assigns a numerical weight that
  represents how important a page is on the web

• The web is democratic i.e., pages
  vote for pages

  Google interprets a link from page A
  to page B as a vote, by page A, for page B.
  It also analyses the page that cast the vote.
Simple Ranking Function:
                                        $R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$

                                        u: a web page
                                        B_u: the set of pages that link to u (backlinks)
                                        N_u = |F_u|: number of links from u
                                        c: factor used for normalization

Simplified PageRank Calculation

  In principle, the PageRanks form a probability distribution over
  web pages, so the sum of all web pages’ PageRanks will be one
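
A minimal Python sketch of the simple ranking function, $R(u) = c \sum_{v \in B_u} R(v)/N_v$, iterated on a small graph. The three-page graph, the function name `simplified_pagerank`, and the fixed iteration count are illustrative assumptions, not taken from the papers. Every page in this graph has at least one forward link, so no rank leaks out and c = 1.

```python
def simplified_pagerank(forward_links, iterations=50):
    """Iterate R(u) = sum over backlinks v of R(v) / N_v (c = 1 for this graph)."""
    pages = sorted(forward_links)
    rank = {p: 1.0 / len(pages) for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for v, targets in forward_links.items():
            share = rank[v] / len(targets)  # v splits its rank evenly over its N_v forward links
            for u in targets:
                new_rank[u] += share
        rank = new_rank
    return rank


# Hypothetical graph: A -> B, A -> C, B -> C, C -> A
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = simplified_pagerank(graph)
```

For this graph the ranks settle near A = 0.4, B = 0.2, C = 0.4, and they sum to one, as expected of a probability distribution.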
Computing PageRank given a Directed Graph

                                  The transition matrix A =

                                   We get the eigenvalue λ = 1

    Calculating the eigenvector

  On substituting             we get,

   so the vector u is of the form

 Choose v to be the unique eigenvector with the sum of all entries equal to 1

                                                 PageRank vector
How we can get PageRank
It is a Markov chain.
Set the probability distribution at time 0: $x^{(0)}$
Set one-step transition probability matrix: A

What we would like to get is the unique stationary
distribution of the Markov chain, $\lim_{k \to \infty} x^{(k)}$, by
successively iterating $x^{(k+1)} = A x^{(k)}$ until convergence
This is the principal eigenvector of the matrix A, which
is exactly the PageRank vector.
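
The Markov chain view can also be checked by simulation. The sketch below (an illustration, not part of either paper) lets a surfer follow random forward links on a hypothetical three-page graph and records how often each page is visited. The function name `simulate_surfer`, the step count, and the seed are assumptions chosen for the example.

```python
import random


def simulate_surfer(forward_links, steps=200_000, seed=0):
    """Follow random forward links and return the fraction of steps spent on each page."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    page = sorted(forward_links)[0]
    visits = {p: 0 for p in forward_links}
    for _ in range(steps):
        visits[page] += 1
        page = rng.choice(forward_links[page])  # one step of the Markov chain
    return {p: count / steps for p, count in visits.items()}


# Hypothetical graph: A -> B, A -> C, B -> C, C -> A
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
frequencies = simulate_surfer(graph)
```

The visit frequencies approach the stationary distribution of this chain, roughly 0.4 for A, 0.2 for B, and 0.4 for C.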
Problem 1: Dangling Links
 Dangling links are links that point to any page with
 no outgoing links or pages not downloaded yet.

 Problem: it is not clear where their weight should be distributed.
 Solution 1: they are removed from the system until
 all the PageRanks are calculated. Afterwards, they
 are added back in without affecting things significantly.
   Problem 1: Dangling Links (cont’d)
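     Solution 2 (presented in the second paper):
    Let v be a vector representing a uniform distribution over all nodes
    Let d be the indicator vector of dangling pages ($d_i = 1$ if page i has no outgoing links, $0$ otherwise), and set $D = d \cdot v^T$, $P' = P + D$
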
In terms of the random walk, the effect of D is to modify the transition probabilities so
that a surfer visiting a dangling page randomly jumps to another page in the next time
step, using the distribution given by v.
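
A small sketch of this fix, assuming P is stored row-wise (row i holds the probabilities of moving from page i to each other page) and that D replaces every all-zero row with v. The helper name `fix_dangling` and the 3-page matrix are hypothetical.

```python
def fix_dangling(P, v):
    """Return P' = P + D, where row i of D equals v if page i is dangling, else zeros."""
    fixed = []
    for row in P:
        if sum(row) == 0:  # dangling page: no outgoing links
            fixed.append(list(v))
        else:
            fixed.append(list(row))
    return fixed


# Hypothetical graph: page 0 -> 1, 0 -> 2, page 1 -> 2, page 2 is dangling
P = [
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
]
n = len(P)
v = [1.0 / n] * n  # uniform distribution over all nodes
P_prime = fix_dangling(P, v)
```

After the fix every row of `P_prime` sums to one, so the random walk never gets stuck on the dangling page.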
Problem 2: Rank Sink
                                Some pages form a loop
                                that accumulates rank but
                                never distributes it (a rank sink).

 Random Surfer Model
 Jump to a random page based
 on some distribution E (rank
Convergence and Random Walks : Why does it work?
  Irreducible Aperiodic Markov Chains with a Primitive transition
   probability matrix

  What are the issues all about?
    We need a transition matrix model that can guarantee convergence
      and does indeed converge to a unique stationary distribution vector.
PageRank Expression:
 Let E(u) be some vector over the Web pages that corresponds to a source of
 rank. Then, the PageRank of a set of Web pages is an assignment, R’, to the
 Web pages which satisfies
                        $R'(u) = c \sum_{v \in B_u} \frac{R'(v)}{N_v} + c E(u)$

 R’(u): PageRank of document u
 R’(v): PageRank of document v that links to u
 N_v: number of outlinks from document v
 c: normalization factor
 E(u): vector of web pages that the surfer randomly jumps to

 such that c is maximized and ||R’||1 = 1 (||R’||1 denotes the L1 norm of R’)
Computing PageRank

         R0  S             S: any vector over the web pages


         R AR
           i
                            Calculate the Ri+1 vector using Ri

          i1               Calculate the normalizing factor
           i  1
         d RR
            1 

         i i  E
         RR       Find the vector Ri+1 using d
         1 d

          RR
            
              i1   i
                            Find the norm of the difference of 2
 while   
                           Loop until convergence
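
The loop above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the matrix is stored so that `A[u][v]` is $1/N_v$ when page v links to page u, E is taken to be uniform, and the function name `pagerank`, the tolerance `epsilon`, and the 3-page graph are hypothetical.

```python
def pagerank(A, E, S, epsilon=1e-10):
    """Iterate R <- A R + d E until the L1 change is at most epsilon."""
    n = len(S)
    R = list(S)
    while True:
        # R_{i+1} <- A R_i
        R_next = [sum(A[u][v] * R[v] for v in range(n)) for u in range(n)]
        # d <- ||R_i||_1 - ||R_{i+1}||_1 (rank lost through pages without forward links)
        d = sum(abs(x) for x in R) - sum(abs(x) for x in R_next)
        # R_{i+1} <- R_{i+1} + d E (redistribute the lost rank according to E)
        R_next = [R_next[u] + d * E[u] for u in range(n)]
        # delta <- ||R_{i+1} - R_i||_1
        delta = sum(abs(R_next[u] - R[u]) for u in range(n))
        R = R_next
        if delta <= epsilon:  # the loop runs while delta > epsilon
            return R


# Hypothetical graph: page 0 -> 1, 0 -> 2, page 1 -> 2, page 2 has no forward links
A = [
    [0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]
E = [1.0 / 3] * 3
S = [1.0 / 3] * 3
R = pagerank(A, E, S)
```

On this graph the result is approximately (2/11, 3/11, 6/11). The factor d keeps the L1 norm of R at one even though page 2 leaks rank at every step.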
PageRank Implementation
  Convert each URL into a unique integer ID
  Sort the link structure by ID
  Remove the dangling links
  Make an initial assignment of ranks
  Iteratively compute PageRank until Convergence
  Add the dangling links back
  Recompute the rankings

  NOTE: After adding the dangling links back, we need to iterate as
  many times as was required to remove the dangling links
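
A rough sketch of the first three implementation steps, assuming the link structure is available as (source URL, target URL) pairs. The function names `assign_ids` and `remove_dangling` and the example.com URLs are hypothetical. Removing dangling links can expose new ones, so the removal is repeated, and the number of passes is returned because the NOTE above depends on it.

```python
def assign_ids(links):
    """Map each URL to a unique integer ID and return the links as sorted ID pairs."""
    ids = {}
    for src, dst in links:
        for url in (src, dst):
            ids.setdefault(url, len(ids))  # first time seen: next free integer
    return ids, sorted((ids[s], ids[d]) for s, d in links)


def remove_dangling(id_links):
    """Repeatedly drop links whose target has no outgoing links; return (kept, passes)."""
    kept, passes = list(id_links), 0
    while True:
        sources = {s for s, _ in kept}
        pruned = [(s, d) for s, d in kept if d in sources]
        if len(pruned) == len(kept):  # nothing left to remove
            return kept, passes
        kept, passes = pruned, passes + 1


# Hypothetical link structure: a <-> b, b -> c, c -> d (d has no forward links)
links = [
    ("http://example.com/a", "http://example.com/b"),
    ("http://example.com/b", "http://example.com/a"),
    ("http://example.com/b", "http://example.com/c"),
    ("http://example.com/c", "http://example.com/d"),
]
ids, id_links = assign_ids(links)
kept, passes = remove_dangling(id_links)
```

Here two passes are needed: removing the link to d leaves c without forward links, so the link to c is removed in the second pass.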
The mechanism

  •Web Crawler: Finds and retrieves pages on the web
  •Repository: web pages are compressed and stored here
  •Indexer: each index entry has a list of documents in which the term appears
            and the location within the text where it occurs
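
The indexer's data structure can be pictured as an inverted index that maps each term to the documents containing it and the word positions inside each document. The sketch below is a simplified illustration with whitespace tokenization. The function name `build_index` and the two sample documents are assumptions, not a description of Google's indexer.

```python
from collections import defaultdict


def build_index(documents):
    """Map term -> {doc_id: [word positions]} for a dict of doc_id -> text."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, term in enumerate(text.lower().split()):
            index[term][doc_id].append(position)
    return index


# Hypothetical documents
documents = {
    1: "PageRank ranks web pages",
    2: "web pages link to web pages",
}
index = build_index(documents)
web_entry = {doc_id: positions for doc_id, positions in index["web"].items()}
```

The entry for "web" lists document 1 at position 2 and document 2 at positions 0 and 4.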
 PR (322 Million Links): 52 iterations
 PR (161 Million Links): 45 iterations
 Scaling factor is roughly linear in log n
Adaptive Methods for the computation of PageRank

This paper presents two contributions:

 First, it shows that most pages in the web converge to
  their true PageRank quickly, while relatively few pages
  take much longer to converge.
 And it further shows that those slow-converging pages
  generally have high PageRank, and those pages that
  converge quickly generally have low PageRank.
 Experimental results support the findings:
Experimental results
  Adaptive Algorithms
 Second, the authors develop two algorithms, called Adaptive
  PageRank and Modified Adaptive PageRank, that exploit this
  observation to speed up the computation of PageRank by 18%
  and 28%, respectively.

 The main idea of all the proposed algorithms is the same:
  speed up the computation of PageRank by reducing the cost of
  each iteration (not recomputing the PageRank of pages that
  have already converged).

  Notations to be included:
    A: one-step transition probability matrix.
    x(k): probability distribution vector at time k.
    N = not yet converged; C = converged.
Adaptive PageRank
  Filter-based Adaptive PageRank
 Reordering the matrix A at each iteration is expensive.
 Reducing the cost by introducing sparse (zero) entries.
 Filter-based Modified Adaptive PageRank
 Reducing redundant computation by not recomputing
  the components of the PageRanks of those pages in N
  due to links from those pages in C.
 Split A even further.
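
A simplified sketch of the adaptive idea, not the filter-based algorithms from the paper: components that have stopped changing are moved from N to C and are no longer recomputed. The relative-change criterion `page_tol`, the function name `adaptive_pagerank`, the start vector, and the 3-page matrix (with `A[i][j]` the probability of moving from page j to page i) are all assumptions made for illustration.

```python
def adaptive_pagerank(A, x0, page_tol=1e-3, epsilon=1e-8, max_iter=1000):
    """Power iteration that skips pages already marked as converged."""
    n = len(x0)
    x = list(x0)
    converged = [False] * n  # False: page is in N, True: page is in C
    for _ in range(max_iter):
        x_next = list(x)  # pages in C keep their current value
        for i in range(n):
            if not converged[i]:  # only pages in N are recomputed
                x_next[i] = sum(A[i][j] * x[j] for j in range(n))
        delta = sum(abs(x_next[i] - x[i]) for i in range(n))
        for i in range(n):
            # Move a page from N to C once its relative change is small enough
            if not converged[i] and abs(x_next[i] - x[i]) <= page_tol * abs(x_next[i]):
                converged[i] = True
        x = x_next
        if delta <= epsilon:
            break
    return x, converged


# Hypothetical graph: page 0 -> 1, 0 -> 2, page 1 -> 2, page 2 -> 0
A = [
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]
x, converged = adaptive_pagerank(A, [0.5, 0.3, 0.2])
```

Because frozen pages are never updated again, the result only approximates the exact stationary vector (0.4, 0.2, 0.4 for this graph). The accuracy is governed by the per-page threshold.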
Performance comparison
Searching with PageRank
• Two search engines:
   – Title-based search engine
   – Full text search engine

• Title-based search engine
   – Searches only the “Titles”
   – Finds all the web pages whose titles contain all the query words
   – Sorts the results by PageRank
   – Very simple and cheap to implement
   – Title match ensures high precision, and PageRank ensures high

• Full text search engine
   – Called Google
   – Examines all the words in every stored document and also
     performs PageRank (Rank Merging)
   – More precise but more complicated
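
The title-based engine described above can be sketched in a few lines: keep the pages whose titles contain every query word, then sort them by PageRank. The function name `title_search`, the page titles, and the rank values below are made up for illustration.

```python
def title_search(query, pages):
    """Return titles containing all query words, highest PageRank first.

    pages is a list of (title, pagerank) pairs.
    """
    words = query.lower().split()
    matches = [
        (title, rank)
        for title, rank in pages
        if all(word in title.lower().split() for word in words)
    ]
    ranked = sorted(matches, key=lambda match: match[1], reverse=True)
    return [title for title, _ in ranked]


# Hypothetical pages with made-up PageRank values
pages = [
    ("Stanford University", 0.30),
    ("University of Michigan", 0.20),
    ("Stanford Daily", 0.25),
    ("Example University Fan Page", 0.01),
]
results = title_search("university", pages)
narrow = title_search("stanford university", pages)
```

The query "university" matches three titles and lists them in descending PageRank order, while "stanford university" matches only one.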
Title-based search for University
Personalized PageRank
 Important component of PageRank calculation is E
    A vector over the web pages (used as source of rank)
    Powerful parameter to adjust the page ranks
 E vector corresponds to the distribution of web pages that a
  random surfer periodically jumps to

 Having an E vector that is uniform over all the web pages
  results in some web pages with many related links receiving
  an overly high rank e.g.: copyright page or forums
    General Search over the internet

 Instead in Personalized PageRank E consists of a single web
 Estimating Web Traffic
  On analyzing the statistics, it was found that there are some sites that
  have a very high usage, but low PageRank.
  e.g.: Links to pirated software

 PageRank as Backlink Predictor
  The goal is to crawl the pages in as close to the optimal order as
  possible, i.e., in the order of their rank according to an evaluation function.
  PageRank is a better predictor than citation counting

 User Navigation: The PageRank Proxy
  The user receives some information about the link before they click on it
  This proxy can help users decide which links are more likely to be
 PageRank is a global ranking of all web pages based on their
  locations in the web graph structure

 PageRank uses information which is external to the web pages
  – backlinks

 Backlinks from important pages are more significant than
  backlinks from average pages

 The structure of the web graph is very useful for information
  retrieval tasks.
 L. Page, S. Brin, R. Motwani, T. Winograd. The PageRank Citation Ranking:
  Bringing Order to the Web, 1998
 Sepandar Kamvar, Taher Haveliwala, Gene Golub, Adaptive methods for
  the computation of PageRank, Linear Algebra and its Applications 386, 2004
  Published by Elsevier Inc., pp 51–65.
 L. Page and S. Brin. The anatomy of a large-scale hypertextual web search
  engine, 1998
 Google Corporate Information:
Thank You!

