

Overview of Web Ranking Algorithms: HITS and PageRank

April 6, 2006
Presented by: Bill Eberle
Outline
   Problem
   Web as a Graph
   HITS
   PageRank
   Comparison
Problem
   Specific queries (scarcity problem).
   Broad-topic queries (abundance problem).
   Goal: to find the smallest set of
    "authoritative" sources.
Web as a Graph
   Web pages as nodes of a graph.
   Links as directed edges.

Link Structure of the Web
   Forward links (out-edges).
   Backward links (in-edges).
   Approximation of importance/quality: a
    page may be of high quality if it is
    referred to by many other pages, and by
    pages of high quality.
HITS (Hyperlink-Induced Topic Search)
   "Authoritative Sources in a Hyperlinked
    Environment", Jon Kleinberg, Cornell
    University, 1998.
Authorities and Hubs
   Authority is a page which has relevant
    information about the topic.
   Hub is a page which has collection of
    links to pages about that topic.


Authorities and Hubs (cont.)
   Good hubs are the ones that point to good
    authorities.
   Good authorities are the ones that are
    pointed to by good hubs.
Finding Authorities and Hubs
   First, construct a focused sub-graph of the
    Web.
   Second, compute hubs and authorities from
    the sub-graph.
Construction of Sub-graph
[Diagram: a topic query is sent to a search engine, which returns the root set of pages; a crawler then follows forward and backward links from the root set to collect the base set of pages.]
Root Set and Base Set
   Use the query term to
    collect a root set of
    pages from a text-based
    search engine.
Root Set and Base Set (cont.)
   Expand the root set into a
    base set by including
    (up to a designated
    size cut-off):

       All pages linked to by
        pages in the root set
       All pages that link to a
        page in the root set
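The root-to-base expansion described above might be sketched as follows (a toy sketch: the link graph, page names, and cut-off are illustrative assumptions, not from the slides):

```python
# Hypothetical toy web graph: page -> pages it links to.
out_links = {
    "p1": ["p2", "x1"],
    "p2": ["x2"],
    "x3": ["p1"],
    "x1": [], "x2": [], "x4": ["x1"],
}

def base_set(root, out_links, cutoff=50):
    """Expand a root set into a base set: add all pages the root
    links to (forward links) and all pages linking into it
    (backward links), up to a designated size cut-off."""
    base = set(root)
    for page in root:
        base |= set(out_links.get(page, []))          # forward links
        base |= {q for q, outs in out_links.items()   # backward links
                 if page in outs}
    return set(list(base)[:cutoff])                   # size cut-off

base = base_set({"p1", "p2"}, out_links)
```

Here "x4" stays outside the base set because it neither links to nor is linked from a root page.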
        Hubs & Authorities Calculation
   Iterative algorithm on the base set: authority weights a(p) and
    hub weights h(p).
      Set authority weights a(p) = 1 and hub weights h(p) = 1
       for all p.
      Repeat the following two operations
       (and then re-normalize a and h to have unit norm):

         a(p) = Σ_{q → p} h(q)    (sum of hub weights of pages pointing to p)

         h(p) = Σ_{p → q} a(q)    (sum of authority weights of pages p points to)
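The two update rules can be sketched directly in Python (a minimal sketch; the toy graph with hub pages h1-h3 and authority pages a1-a3 is an illustrative assumption):

```python
# Hypothetical toy graph: page -> list of pages it links to.
graph = {
    "h1": ["a1", "a2"],
    "h2": ["a2", "a3"],
    "h3": ["a1", "a3"],
    "a1": [], "a2": [], "a3": [],
}

def hits(graph, iterations=50):
    """Repeat the two HITS update operations, re-normalizing
    a and h to unit (Euclidean) norm after each one."""
    auth = {p: 1.0 for p in graph}
    hub = {p: 1.0 for p in graph}
    for _ in range(iterations):
        # a(p) = sum of h(q) over all q that point to p
        auth = {p: sum(hub[q] for q in graph if p in graph[q])
                for p in graph}
        norm = sum(v * v for v in auth.values()) ** 0.5
        auth = {p: v / norm for p, v in auth.items()}
        # h(p) = sum of a(q) over all q that p points to
        hub = {p: sum(auth[q] for q in graph[p]) for p in graph}
        norm = sum(v * v for v in hub.values()) ** 0.5
        hub = {p: v / norm for p, v in hub.items()}
    return auth, hub

auth, hub = hits(graph)
```

Pages with no in-links end up with authority weight 0, and pages with no out-links end up with hub weight 0, as the update rules dictate.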
Example
[Figure: a small example graph; after the first normalized update every page shown has hub weight 0.45 and authority weight 0.45.]
Example (cont.)
[Figure: the same graph after a further iteration; hub weights grow for pages pointing to strong authorities (values 0.9, 1.35 appear) while authority weights update to 0.45 or 0.9.]
Algorithmic Outcome
   Applying iterative multiplication (power
    iteration) to any "non-degenerate"
    initial vector converges to the principal
    eigenvector.
   Hubs and authorities emerge as the outcome of
    this iteration.
   The principal eigenvector contains the highest-
    weighted hubs and authorities.
HITS Strengths
   Although HITS is only link-based (it
    completely disregards page content), results
    are quite good for many tested queries.
   When the authors tested the query "search
    engines":
       The algorithm returned Yahoo!, Excite, Magellan,
        Lycos, AltaVista.
       However, none of these pages described
        themselves as a "search engine" (at the time of
        the experiment).
HITS Weaknesses
   From a narrow topic, HITS tends to drift to a
    more general one.
   A peculiarity of hub pages: their many links can
    cause the algorithm to drift, since a hub can point
    to authorities in different topics.
   Pages from a single domain/website can
    dominate the result if they all point to one
    page, which is not necessarily a good authority.
Possible Enhancements
   Use weighted sums for the link calculation.
   Take advantage of "anchor text" - the text
    surrounding the link itself.
   Break hubs into smaller pieces; analyze each
    piece separately instead of the whole hub page
    as one.
   Disregard or minimize the influence of links within
    a single domain.
   IBM expanded HITS into the Clever project; it was
    not seen as a viable real-time search engine.
PageRank
   "The PageRank Citation Ranking:
    Bringing Order to the Web", Lawrence
    Page and Sergey Brin, Stanford
    University, 1998.
Basic Idea
   Back-links coming from important pages
    convey more importance to a page. For
    example, if a web page has a link off the
    Yahoo! home page, it may be just one link but
    it is a very important one.
   A page has a high rank if the sum of the ranks
    of its back-links is high. This covers both the
    case when a page has many back-links and
    when a page has a few highly ranked back-links.
   My page's rank is the sum of the ranks of
    all the pages pointing to me, each divided
    by that page's number of out-links:

      Rank(u) = Σ_{v ∈ B_u} Rank(v) / N_v

      B_u : set of pages with links to u
      N_v : number of links from v
Simplified PageRank Example
   Rank(u) = rank of
    page u, where c is
    a normalization
    constant (c < 1 to
    account for pages with
    no outgoing links).
Expanded Definition
   R(u): page rank of page u
   c: factor used for normalization (<1)
   Bu: set of pages pointing to u
   Nv: outbound links of v
   R(v): page rank of site v that points to u
   E(u): distribution of web pages to which a random
    surfer periodically jumps (weight set to 0.15)

      R(u) = c · Σ_{v ∈ B_u} R(v)/N_v + c·E(u)
Problem 1 - Rank Sink
   A cycle of pages reached by some incoming link,
    but with no link back out of the cycle.

   The loop will accumulate rank but never
    distribute it.
Problem 2 - Dangling Links
   In general, many Web pages do not have either back-links or forward
    links.

   Dangling links do not affect the ranking of any other page directly, so
    they are removed until all the PageRanks are calculated.
Random Surfer Model
   PageRank corresponds to the stationary probability
    distribution of a random walk on the web graph.
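The random-surfer view can be checked empirically with a short simulation (a sketch; the three-page graph and the 0.15 jump probability are illustrative assumptions):

```python
import random

random.seed(0)

# Hypothetical three-page graph: A links to B and C, B to C, C to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pages = list(links)

visits = {p: 0 for p in pages}
page = "A"
steps = 200_000
for _ in range(steps):
    # With probability 0.15 the surfer "gets bored" and jumps to a
    # uniformly random page; otherwise it follows a random out-link.
    if random.random() < 0.15 or not links[page]:
        page = random.choice(pages)
    else:
        page = random.choice(links[page])
    visits[page] += 1

# Visit frequencies approximate the PageRank distribution.
freq = {p: visits[p] / steps for p in pages}
```

Page B, with only a single back-link fed by half of A's rank, ends up visited least often.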
Solution – Escape Term
   Escape term: E(u) models a random surfer who
    periodically gets bored and jumps to a different
    page, rather than staying in a loop.
      R(u) = c · Σ_{v ∈ B_u} R(v)/N_v + c·E(u)

   E is a vector over all the web pages that
    accounts for each page's escape probability
    (a user-defined parameter).
    PageRank Computation
    R_0 ← S                           - initialize vector over web pages
    R_{i+1} ← A^T R_i                 - new ranks: sum of normalized back-link ranks
    d ← ||R_i||_1 − ||R_{i+1}||_1     - compute normalizing factor
    R_{i+1} ← R_{i+1} + d·E           - add escape term
    δ ← ||R_{i+1} − R_i||_1           - control parameter
    while δ > ε                       - stop when converged
   A is a matrix whose rows and columns correspond to
    web pages, with A_{u,v} = 1/N_u if there is an edge
    from u to v, and A_{u,v} = 0 otherwise.

   Given the matrix A and a vector R over all the Web
    pages, the dominant eigenvector is the one associated with
    the maximal eigenvalue.
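The boxed procedure above can be sketched in Python with NumPy (a minimal sketch; the three-page graph and the uniform E vector are assumptions for illustration):

```python
import numpy as np

# Hypothetical 3-page graph: page 0 links to 1 and 2, page 1 to 2, page 2 to 0.
links = {0: [1, 2], 1: [2], 2: [0]}
n = len(links)

# A[u, v] = 1/N_u if page u links to page v, else 0.
A = np.zeros((n, n))
for u, outs in links.items():
    for v in outs:
        A[u, v] = 1.0 / len(outs)

E = np.full(n, 1.0 / n)   # escape vector (uniform here; user-defined)
R = np.full(n, 1.0 / n)   # R_0 <- S: initial vector over web pages

for _ in range(1000):
    R_next = A.T @ R                                      # R_{i+1} <- A^T R_i
    d = np.linalg.norm(R, 1) - np.linalg.norm(R_next, 1)  # normalizing factor
    R_next = R_next + d * E                               # add escape term
    delta = np.linalg.norm(R_next - R, 1)                 # control parameter
    R = R_next
    if delta < 1e-12:                                     # stop when converged
        break
```

In this graph every page has out-links, so d stays near zero and the iteration settles on the pure eigenvector [0.4, 0.2, 0.4].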

Example (cont.)
  c : eigenvalue
  R : eigenvector of A
    Solve |A − cI| = 0 for c, then (A − cI)R = 0 for R,
    and normalize R.
PageRank Implementation
1. Map each URL to an id.
2. Store each hyperlink in a database.
3. Sort the link structure by parent id.
4. Remove dangling links.
5. Calculate the PageRank, giving each page an
  initial value.
6. Iterate until convergence.
7. Add the dangling links back and recompute.
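Steps 1-4 might be sketched as follows (a toy sketch; the URLs are hypothetical, and the dangling-page removal loops because removing one page can leave its predecessors dangling in turn):

```python
# Hypothetical crawl data: d.com has no out-links, so it is dangling.
pages = ["a.com", "b.com", "c.com", "d.com"]
links = [("a.com", "b.com"), ("b.com", "c.com"),
         ("c.com", "a.com"), ("a.com", "d.com")]

ids = {url: i for i, url in enumerate(pages)}         # step 1: URL -> id
edges = sorted((ids[u], ids[v]) for u, v in links)    # steps 2-3: store, sort by parent id

# Step 4: repeatedly remove pages with no out-links until stable.
alive = set(ids.values())
changed = True
while changed:
    out = {u for u, v in edges if u in alive and v in alive}
    dangling = alive - out
    changed = bool(dangling)
    alive -= dangling
```

After step 6 converges on the surviving pages, step 7 re-attaches the removed dangling pages and assigns them ranks.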

Example
Which of these three pages has the highest PageRank?

[Figure: Page A (N_A = 2) links to Page B (N_B = 1) and Page C (N_C = 1); B links to C, and C links back to A.]
Example (cont.)
   The simplified system of equations for this graph:

      Rank(A) = Rank(C)/1
      Rank(B) = Rank(A)/2
      Rank(C) = Rank(A)/2 + Rank(B)/1
    Example (cont.)
 Re-write the system of equations as a matrix-
  vector product:

      | Rank(A) |   |  0   0   1 | | Rank(A) |
      | Rank(B) | = | 1/2  0   0 | | Rank(B) |
      | Rank(C) |   | 1/2  1   0 | | Rank(C) |

The PageRank vector is simply an eigenvector
(scalar*vector = matrix*vector) of the coefficient
matrix.
Example (cont.)
   Solving gives PageRank(A) = 0.4, PageRank(B) = 0.2,
    and PageRank(C) = 0.4.
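As a numerical sanity check of this example, one can compute the dominant eigenvector of the coefficient matrix directly (a sketch using NumPy):

```python
import numpy as np

# Coefficient matrix from the example: A links to B and C,
# B links to C, C links to A.
M = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])

vals, vecs = np.linalg.eig(M)
# Take the eigenvector for the eigenvalue closest to 1
# and normalize it to sum to 1.
k = np.argmin(np.abs(vals - 1.0))
rank = np.real(vecs[:, k])
rank = rank / rank.sum()
```

The normalized eigenvector matches the slide's result: 0.4, 0.2, 0.4.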
Example (cont.)
[Figure: the same three-page graph A, B, C recomputed with damping factor d = 0.5; the table of PR(A), PR(B), PR(C) values is not recoverable.]
   The PageRank computation converges in O(log |V|) iterations.
Other Applications
   Help user decide if a site is trustworthy.
   Estimate web traffic.
   Spam detection and prevention.
   Predict citation counts.
PageRank Weaknesses
   Users are not random walkers.
   Starting-point distribution (actual usage
    data could serve as the starting vector).
   Bias towards main pages.
   Linkage spam.
   No query-specific rank.
PageRank vs. HITS

   PageRank (Google)
      computed for all web pages stored in the
       database prior to the query
      computes authorities only
      trivial and fast to compute

   HITS (CLEVER)
      performed on the set of retrieved web
       pages for each query
      computes authorities and hubs
      easy to compute, but real-time execution
       is hard
References
   "Authoritative Sources in a Hyperlinked
    Environment", Jon Kleinberg, Cornell
    University, 1998.
   "The PageRank Citation Ranking:
    Bringing Order to the Web", Lawrence
    Page and Sergey Brin, Stanford
    University, 1998.
