# Bill_PageRank_and_HITS by huanghengdong

```
Overview of Web Ranking Algorithms: HITS and PageRank

April 6, 2006
Presented by: Bill Eberle
Overview
   Problem
   Web as a Graph
   HITS
   PageRank
   Comparison
Problem
   Specific queries (scarcity problem).
   Broad-topic queries (abundance problem).
   Goal: to find the smallest set of
“authoritative” sources.
Web as a Graph
   Web pages as nodes of a graph.
   [Figure: graph with the nodes "my page" and "www.uta.edu"]
   Approximation of importance/quality: a
page may be of high quality if it is
referred to by many other pages, and by
pages of high quality.
HITS (Hyperlink-Induced Topic Search)
   “Authoritative Sources in a Hyperlinked
Environment”, Jon Kleinberg, Cornell
University. 1998.
Authorities and Hubs
   An authority is a page that has relevant
content on the topic.
   A hub is a page that has a collection of
links to authorities.
   [Figure: hub h and authorities a1, a2, a3, a4]
Authorities and Hubs (cont.)
   Good hubs are the ones that point to good
authorities.
   Good authorities are the ones that are
pointed to by good hubs.
   [Figure: hubs h1 to h5 and authorities a1 to a6]
Finding Authorities and Hubs
   First, construct a focused sub-graph of the
WWW.
   Second, compute Hubs and Authorities from
the sub-graph.
Construction of Sub-graph
   [Figure: Topic -> Search Engine -> Rootset Pages -> Crawler -> Expanded set Pages]
Root Set and Base Set
   Use the query term to collect a root set of
pages from a text-based search engine
(AltaVista).
   [Figure: root set]
Root Set and Base Set (cont.)
   Expand the root set into the base set by including
(up to a designated size cut-off):
      All pages linked to by pages in the root set
      All pages that link to a page in the root set
   [Figure: root set and base set]
Hubs & Authorities Calculation
   Iterative algorithm on Base Set: authority weights a(p), and
hub weights h(p).
   Set authority weights a(p) = 1, and hub weights h(p) = 1
for all p.
   Repeat the following two operations
(and then re-normalize a and h to have unit norm):

      a(p) = \sum_{q \to p} h(q)      (sum over all q that point to p)
      h(p) = \sum_{p \to q} a(q)      (sum over all q that p points to)

   [Figure: pages v1, v2, v3 with hub weights h(v1), h(v2), h(v3) on one side of p; pages v1, v2, v3 with authority weights a(v1), a(v2), a(v3) on the other]
Example
   [Figure: example graph; each labelled node starts with Hub 0.45, Authority 0.45]
Example (cont.)
   [Figure: example graph after an update step; node labels: 0.45, 0.9; 1.35, 0.9; Hub 0.9, Authority 0.45; 0.45, 0.9]
Algorithmic Outcome
   Applying iterative multiplication (power
iteration) converges to the principal
eigenvector for any “non-degenerate”
initial vector.
   Hubs and authorities emerge as the outcome of
this process.
   The largest entries of the principal eigenvector
identify the top hubs and authorities.
Results
   Although HITS is only link-based (it
completely disregards page content) results
are quite good in many tested queries.
   When the authors tested the query “search
engines”:
   The algorithm returned Yahoo!, Excite, Magellan,
Lycos, AltaVista
   However, none of these pages described
themselves as a “search engine” (at the time of
the experiment)
Issues
   Starting from a narrow topic, HITS tends to drift
toward a more general one.
   Hub pages carry many links, which can
cause topic drift: the links can point to
authorities on different topics.
   Pages from a single domain or website can
dominate the result if they all point to one
page, which is not necessarily a good authority.
Possible Enhancements
   Use weighted sums for link calculation.
   Take advantage of “anchor text”: the clickable
text of a link.
   Break hubs into smaller pieces. Analyze each
piece separately, instead of whole hub page
as one.
   Disregard or minimize influence of links inside
one domain.
   IBM expanded HITS into Clever; not seen as
viable real-time search engine.
PageRank
   “The PageRank Citation Ranking:
Bringing Order to the Web”, Lawrence
Page and Sergey Brin, Stanford
University. 1998.
Basic Idea
   Back-links coming from important pages
convey more importance to a page. For
example, if a web page has a link off the
Yahoo home page, it may be just one link, but
it is a very important one.
   A page has high rank if the sum of the ranks
of its back-links is high. This covers both the
case when a page has many back-links and
when a page has a few highly ranked back-links.
Definition
   My page’s rank is equal to the sum of the ranks of
all the pages pointing to me, each divided by
that page’s number of outgoing links.

      Rank(u) = \sum_{v \in B_u} \frac{Rank(v)}{N_v}

      B_u = set of pages with links to u
      N_v = number of links from v
Simplified PageRank Example
   Rank(u) = rank of page u, where c is a
normalization constant (c < 1 to cover for
pages with no outgoing links).
Expanded Definition
   R(u): page rank of page u
   c: factor used for normalization (< 1)
   B_u: set of pages pointing to u
   N_v: number of outbound links of v
   R(v): page rank of page v that points to u
   E(u): distribution over web pages that a random
surfer periodically jumps to (set to 0.15)

      R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v} + c E(u)
Problem 1 - Rank Sink
   Page cycles pointed to by some incoming link.

   The loop will accumulate rank but never
distribute it.
   In general, many Web pages have no back links or no forward
links.
   Dangling links do not affect the ranking of any other page directly, so
they are removed until all the PageRanks are calculated.
Random Surfer Model
   PageRank corresponds to the probability
distribution of a random walk on the web
graph.
Solution – Escape Term
   Escape term: E(u) can be thought of as the
random surfer getting bored periodically and jumping
to a different page, rather than staying in the loop
forever.

      R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v} + c E(u)

   E is a vector over all the web
pages that accounts for each page’s escape
probability (user-defined parameter).
PageRank Computation
R_0 \leftarrow S                                - initialize vector over web pages
Loop:
   R_{i+1} \leftarrow A^T R_i                   - new ranks: sum of normalized backlink ranks
   d \leftarrow \|R_i\|_1 - \|R_{i+1}\|_1       - compute normalizing factor
   R_{i+1} \leftarrow R_{i+1} + d E             - add escape term
   \delta \leftarrow \|R_{i+1} - R_i\|_1        - control parameter
While \delta > \epsilon                         - stop when converged
Matrices
   A is a square matrix whose rows and columns
correspond to web pages u and v.

   Given the matrix A and a vector R over all the Web
pages, the dominant eigenvector is the one associated with
the maximal eigenvalue.
Example

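A^T = [matrix not shown]
Example (cont.)
R = cAR = MR, with M = cA
c : normalization constant (R is an eigenvector of A with eigenvalue 1/c)
A = [matrix not shown]
Ax = \lambda x
(A - \lambda I) x = 0, which requires |A - \lambda I| = 0

R = [vector not shown]            Normalized = [vector not shown]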
Implementation
1. URL -> id
2. Store each hyperlink in a database.
3. Sort link structure by Parent id.
5. Calculate the PR giving each page an
initial value.
6. Iterate until convergence.
Example

Which of these three has the highest page
rank?

   [Figure: three linked pages]
   Page A: N_A = 2
   Page B: N_B = 1
   Page C: N_C = 1
Example (cont.)
      Rank(A) = 0 + 0 + \frac{Rank(C)}{1}
      Rank(B) = \frac{Rank(A)}{2} + 0 + 0
      Rank(C) = \frac{Rank(A)}{2} + \frac{Rank(B)}{1} + 0

   Page A: N_A = 2      Page B: N_B = 1      Page C: N_C = 1
Example (cont.)
   Re-write the system of equations as a matrix-vector product.

      \begin{pmatrix} Rank(A) \\ Rank(B) \\ Rank(C) \end{pmatrix}
      =
      \begin{pmatrix} 0 & 0 & 1 \\ \frac{1}{2} & 0 & 0 \\ \frac{1}{2} & 1 & 0 \end{pmatrix}
      \begin{pmatrix} Rank(A) \\ Rank(B) \\ Rank(C) \end{pmatrix}

   The PageRank vector is simply an eigenvector
(scalar*vector = matrix*vector) of the coefficient
matrix.
Example (cont.)
   Page A: N_A = 2, PageRank = 0.4
   Page B: N_B = 1, PageRank = 0.2
   Page C: N_C = 1, PageRank = 0.4
Example (cont.)
   [Table: PR(A), PR(B), PR(C) at iterations 0 to 12 for pages A, B, C with d = 0.5; values not shown]
Convergence
   The number of iterations PageRank needs to converge grows roughly as O(log(|V|)).
Other Applications
   Help user decide if a site is trustworthy.
   Estimate web traffic.
   Spam detection and prevention.
   Predict citation counts.
Issues
   Users are not random walkers.
   Starting point distribution (actual usage
data as starting vector).
   Bias towards main pages.
   No query specific rank.
PageRank vs. HITS

   PageRank                          HITS
   Computed for all web pages        Performed on the set of
   stored in the database prior      retrieved web pages for
   to the query.                     each query.
   Computes authorities only.        Computes authorities and hubs.
   Trivial and fast to compute.      Easy to compute, but real-time
                                     execution is hard.
References
   “Authoritative Sources in a Hyperlinked
Environment”, Jon Kleinberg, Cornell
University. 1998.
   “The PageRank Citation Ranking:
Bringing Order to the Web”, Lawrence
Page and Sergey Brin, Stanford
University. 1998.

```
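
The hub and authority iteration from the "Hubs & Authorities Calculation" slide can be sketched in a few lines of Python. This is a minimal sketch, assuming the base set is already available as a list of (source, target) links. The names `hits` and `_unit_norm`, the fixed iteration count, and the sample page names are illustrative assumptions and do not come from the slides.

```python
import math


def hits(edges, iterations=50):
    """Run the HITS update on a list of (source, target) links.

    Hypothetical helper: the slides only give the update rules.
    """
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}   # h(p) = 1 for all p
    auth = {n: 1.0 for n in nodes}  # a(p) = 1 for all p
    for _ in range(iterations):
        # a(p) = sum of h(q) over all q that point to p
        new_auth = {n: 0.0 for n in nodes}
        for q, p in edges:
            new_auth[p] += hub[q]
        # h(p) = sum of a(q) over all q that p points to
        # (assumption: uses the freshly updated authority weights)
        new_hub = {n: 0.0 for n in nodes}
        for p, q in edges:
            new_hub[p] += new_auth[q]
        # re-normalize both vectors to unit (Euclidean) norm
        auth = _unit_norm(new_auth)
        hub = _unit_norm(new_hub)
    return hub, auth


def _unit_norm(weights):
    """Scale a weight dictionary so its squared values sum to 1."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0.0:
        return dict(weights)
    return {n: w / norm for n, w in weights.items()}


# Made-up base set: h1 links to a1 and a2, h2 links to a1 only.
hub, auth = hits([("h1", "a1"), ("h1", "a2"), ("h2", "a1")])
best_authority = max(auth, key=auth.get)
best_hub = max(hub, key=hub.get)
```

In this toy graph `a1` receives links from both hubs and `h1` points to both authorities, so they end up as the top authority and the top hub. A fixed iteration count stands in for a real convergence test.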
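
The three-page example from the slides (Page A links to B and C, Page B links to C, Page C links to A) can be checked with a short power iteration. This is a minimal sketch of the simplified definition Rank(u) = sum of Rank(v) / N_v, without the escape term, assuming every page has at least one outgoing link. The function name `simple_pagerank`, the dictionary layout, and the iteration count are illustrative assumptions.

```python
def simple_pagerank(links, iterations=100):
    """Iterate Rank(u) = sum of Rank(v) / N_v over pages v linking to u.

    `links` maps each page to the list of pages it links to.
    Assumption: no dangling pages, so the total rank stays at 1.
    """
    pages = sorted(links)
    rank = {p: 1.0 / len(pages) for p in pages}  # uniform starting vector
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for v, targets in links.items():
            share = rank[v] / len(targets)  # Rank(v) / N_v
            for u in targets:
                new_rank[u] += share
        rank = new_rank
    return rank


# Link structure read off the example equations: N_A = 2, N_B = 1, N_C = 1.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = simple_pagerank(links)
# ranks is approximately {"A": 0.4, "B": 0.2, "C": 0.4}
```

The result agrees with the PageRank values shown on the example slide. Real web graphs contain dangling pages and rank sinks, which is where the escape term E from the expanded definition comes in.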