# Intelligent Information Retrieval and Web Search

Page Link Analysis and Anchor Text for Web Search
Lecture 9
Many slides based on lectures by Chen Li (UCI) and Raymond Mooney (UTexas)

1
PageRank
• Intuition:
– The importance of a page should be decided by the pages that point to it
– One naïve implementation: count the # of pages pointing to each page (i.e., # of in-links)
• Problem:
– We can easily fool this technique by generating many dummy pages that point to our class page

2
Initial PageRank Idea
• Just measuring in-degree (citation count) doesn’t
account for the authority of the source of a link.
• Initial page rank equation for page p:

    R(p) = c · Σ_{q: q→p} R(q) / N_q

– N_q is the total number of out-links from page q.
– A page q "gives" an equal fraction of its authority to all the pages it points to (e.g., p).
– c is a normalizing constant set so that the ranks of all pages always sum to 1.

3
Initial PageRank Idea (cont.)

• Can view it as a process of PageRank
“flowing” from pages to the pages they cite.

[Figure: PageRank weights (.1, .09, .08, .05, .03, ...) flowing along links from pages to the pages they cite]
4
Initial Algorithm

• Iterate the rank-flowing process until convergence:

    Let S be the total set of pages.
    Initialize for all p ∈ S: R(p) = 1/|S|
    Until ranks do not change (much) (convergence):
        For each p ∈ S:
            R'(p) = Σ_{q: q→p} R(q) / N_q
        c = 1 / Σ_{p∈S} R'(p)
        For each p ∈ S: R(p) = c · R'(p)    (normalize)

5
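The loop above can be sketched in Python; the four-page graph here is a made-up example, not from the slides.

```python
def pagerank(out_links, iterations=50):
    """out_links: dict mapping each page to the list of pages it points to."""
    pages = list(out_links)
    r = {p: 1.0 / len(pages) for p in pages}            # R(p) = 1/|S|
    for _ in range(iterations):
        r_new = {p: 0.0 for p in pages}
        for q, targets in out_links.items():
            for p in targets:
                r_new[p] += r[q] / len(targets)         # R'(p) += R(q)/N_q
        c = 1.0 / sum(r_new.values())                   # normalizing constant
        r = {p: c * r_new[p] for p in pages}            # R(p) = c * R'(p)
    return r

# Hypothetical 4-page graph: page d has no in-links, page c has three.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
```

Normalizing in each pass keeps the ranks summing to 1 even when rank leaks out of the graph.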
Sample Stable Fixpoint

[Figure: a small graph whose PageRank values (0.4, 0.4, 0.4, 0.2, 0.2, 0.2, 0.2) no longer change under the rank-flow update]
6
Linear Algebra Version

• Treat R as a vector over web pages.
• Let A be a 2-d matrix over pages where
– A_vu = 1/N_u if u → v, else A_vu = 0
• Then R = cAR
• R converges to the principal eigenvector of A.

7
Example: MiniWeb
• Our "MiniWeb" has only three web sites: Netscape (n), Microsoft (m), and Amazon (a).
• Their weights are represented as a vector, updated in each iteration:

        [n]        [1/2  0  1/2] [n]
        [m]      = [ 0   0  1/2] [m]
        [a] new    [1/2  1   0 ] [a] old

• For instance, in each iteration, half of the weight of Am goes to Ne, and half goes to MS.

Materials by courtesy of Jeff Ullman
8
Iterative computation

    [n]   [1]   [ 1 ]   [5/4]   [9/8 ]   [ 5/4 ]         [6/5]
    [m] = [1] → [1/2] → [3/4] → [1/2 ] → [11/16] → ... → [3/5]
    [a]   [1]   [3/2]   [ 1 ]   [11/8]   [17/16]         [6/5]

Final result:
• Netscape and Amazon have the same importance, and twice the importance of Microsoft.
• Does it capture the intuition? Yes.
9
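A quick numerical check of this iteration, with the matrix hard-coded from the slide:

```python
# Column-stochastic MiniWeb matrix: M[i][j] is the fraction of page j's
# weight that flows to page i (order: n, m, a).
M = [[0.5, 0.0, 0.5],
     [0.0, 0.0, 0.5],
     [0.5, 1.0, 0.0]]

v = [1.0, 1.0, 1.0]
for _ in range(200):
    # one rank-flow step: v <- M v
    v = [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]
n, m, a = v
```

After enough iterations the vector settles at (6/5, 3/5, 6/5), the fixpoint claimed on the slide.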
Observations
• We cannot get absolute weights:
– We can only know (and we are only interested in) the relative weights of the pages
• The matrix is stochastic (each column sums to 1). So the iterations converge, and compute the principal eigenvector of the following matrix equation:

    [n]   [1/2  0  1/2] [n]
    [m] = [ 0   0  1/2] [m]
    [a]   [1/2  1   0 ] [a]

10
Problem 1 of algorithm: dead ends!

        [n]        [1/2  0  1/2] [n]
        [m]      = [ 0   0  1/2] [m]
        [a] new    [1/2  0   0 ] [a] old

• MS does not point to anybody
• Result: weights of the Web "leak out"

    [n]   [1]   [ 1 ]   [3/4]   [5/8]   [ 1/2]         [0]
    [m] = [1] → [1/2] → [1/4] → [1/4] → [3/16] → ... → [0]
    [a]   [1]   [1/2]   [1/2]   [3/8]   [5/16]         [0]
11
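The leak is easy to see numerically; with the dead-end column zeroed out, the total weight decays to nothing:

```python
# Same MiniWeb matrix, but column m is all zeros: MS has no out-links,
# so any weight that reaches MS is lost on the next step.
M = [[0.5, 0.0, 0.5],
     [0.0, 0.0, 0.5],
     [0.5, 0.0, 0.0]]

v = [1.0, 1.0, 1.0]
for _ in range(200):
    v = [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]
leaked_total = sum(v)
```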
Problem 2 of algorithm: spider traps

        [n]        [1/2  0  1/2] [n]
        [m]      = [ 0   1  1/2] [m]
        [a] new    [1/2  0   0 ] [a] old

• MS only points to itself
• Result: all weights go to MS!

    [n]   [1]   [ 1 ]   [3/4]   [5/8]   [ 1/2 ]         [0]
    [m] = [1] → [3/2] → [7/4] → [ 2 ] → [35/16] → ... → [3]
    [a]   [1]   [1/2]   [1/2]   [3/8]   [5/16 ]         [0]

12

Solution: taxation

• Like people paying taxes, each page pays some weight into a public pool, which is then distributed to all pages.
• Example: assume a 20% tax rate in the "spider trap" example:

    [n]         [1/2  0  1/2] [n]   [0.2]
    [m] = 0.8 · [ 0   1  1/2] [m] + [0.2]
    [a]         [1/2  0   0 ] [a]   [0.2]

• The stable solution: n = 7/11, m = 21/11, a = 5/11.

13
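Iterating the taxed update confirms the fixpoint; this is a direct transcription of the slide's equation:

```python
TAX = 0.2
# Spider-trap matrix from the previous slide (order: n, m, a).
M = [[0.5, 0.0, 0.5],
     [0.0, 1.0, 0.5],
     [0.5, 0.0, 0.0]]

v = [1.0, 1.0, 1.0]
for _ in range(200):
    # each page keeps 80% of what flows in, plus a 0.2 share from the pool
    v = [(1 - TAX) * sum(M[i][j] * v[j] for j in range(3)) + TAX
         for i in range(3)]
n, m, a = v
```

Because the update is a contraction (factor 0.8), it converges to a unique fixpoint regardless of the trap.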
The War of Search Engines

• More companies are realizing the importance of
search engines
• More competitors in the market: Ask.com,
Microsoft, Yahoo!, etc.

14
HITS

• Algorithm developed by Kleinberg in 1998.
• Attempts to computationally determine
hubs and authorities on a particular topic
through analysis of a relevant subgraph of
the web.
• Based on mutually recursive facts:
– Hubs point to lots of authorities.
– Authorities are pointed to by lots of hubs.

15
Hubs and Authorities

• Motivation: find web pages related to a topic
– E.g.: "find all web sites about automobiles"
• "Authority": a page that offers info about a topic
– E.g.: DBLP is an authority on papers
– E.g.: google.com, aj.com, teoma.com, lycos.com
• "Hub": a page that doesn't provide much info itself, but tells us where to find pages about a topic
– E.g.: www.searchenginewatch.com is a hub for search engines
– http://www.ics.uci.edu/~ics214a/ points to many biblio-search engines

16
The hope

[Figure: hubs (Alice, Bob) pointing to authorities (AT&T, Sprint, MCI) - the long-distance telephone companies]
17
Base set

• Given text query (say browser), use a text
index to get all pages containing browser.
– Call this the root set of pages.
• Add in any page that either
– points to a page in the root set, or
– is pointed to by a page in the root set.
• Call this the base set.

18
Visualization

[Figure: the root set nested inside the larger base set]

19
Assembling the base set

• Root set typically has 200-1,000 nodes.
• Base set may have up to 5,000 nodes.
• How do you find the base set nodes?
– Follow out-links by parsing root-set pages; obtain in-links by querying the search server.
– (Actually, it suffices to text-index strings of the form href="URL" to get in-links to URL.)

20
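The href-indexing trick can be sketched with a regular expression; the two toy pages below are hypothetical:

```python
import re
from collections import defaultdict

# Hypothetical toy crawl: page -> its HTML source.
pages = {
    "p1.html": '<a href="p2.html">browser tips</a> <a href="p3.html">more</a>',
    "p2.html": '<a href="p3.html">a fine browser</a>',
}

# Index strings of the form href="URL" to recover the in-links of each URL.
in_links = defaultdict(set)
for page, html in pages.items():
    for url in re.findall(r'href="([^"]+)"', html):
        in_links[url].add(page)
```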
Distilling hubs and authorities

• Compute, for each page x in the base set, a hub score h(x) and an authority score a(x).
• Initialize: for all x, h(x) ← 1; a(x) ← 1
• Iteratively update all h(x), a(x)
• After the iterations:
– output pages with the highest h() scores as top hubs
– output pages with the highest a() scores as top authorities.

21
Iterative update

• Repeat the following updates, for all x:

    h(x) = Σ_{y: x→y} a(y)

    a(x) = Σ_{y: y→x} h(y)
22
Scaling

• To prevent the h() and a() values from
getting too big, can scale down after each
iteration.
• Scaling factor doesn’t really matter:
– we only care about the relative values of the
scores.

23
How many iterations?

• Claim: relative values of the scores will converge after a few iterations:
– in fact, suitably scaled, the h() and a() scores settle into a steady state
– the proof of this comes later.
• We only require the relative order of the h() and a() scores - not their absolute values.
• In practice, ~5 iterations get you close to stability.
24
Iterative Algorithm
• Use an iterative algorithm to slowly converge on a
mutually reinforcing set of hubs and authorities.
• Maintain for each page p ∈ S:
– Authority score: a_p (vector a)
– Hub score: h_p (vector h)
• Initialize all a_p = h_p = 1
• Maintain normalized scores:

    Σ_{p∈S} (a_p)² = 1        Σ_{p∈S} (h_p)² = 1

25
HITS Update Rules

• Authorities are pointed to by lots of good hubs:

    a_p = Σ_{q: q→p} h_q

• Hubs point to lots of good authorities:

    h_p = Σ_{q: p→q} a_q
26
Illustrated Update Rules

[Figure: page 4 receives links from pages 1, 2, 3 and links out to pages 5, 6, 7]

    a_4 = h_1 + h_2 + h_3

    h_4 = a_5 + a_6 + a_7
27
HITS Iterative Algorithm

Initialize for all p ∈ S: a_p = h_p = 1
For i = 1 to k:
    For all p ∈ S: a_p = Σ_{q: q→p} h_q    (update auth. scores)
    For all p ∈ S: h_p = Σ_{q: p→q} a_q    (update hub scores)
    For all p ∈ S: a_p = a_p / c, with c such that Σ_{p∈S} (a_p / c)² = 1    (normalize a)
    For all p ∈ S: h_p = h_p / c, with c such that Σ_{p∈S} (h_p / c)² = 1    (normalize h)
28
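The loop can be sketched directly in Python; the four-page hub/authority graph below is a made-up illustration, not from the slides:

```python
import math

def hits(out_links, k=50):
    """out_links: dict mapping each page to the set of pages it points to."""
    pages = list(out_links)
    a = {p: 1.0 for p in pages}
    h = {p: 1.0 for p in pages}
    for _ in range(k):
        # authority score: sum of hub scores of pages pointing to p
        a = {p: sum(h[q] for q in pages if p in out_links[q]) for p in pages}
        # hub score: sum of the (new) authority scores of pages p points to
        h = {p: sum(a[q] for q in out_links[p]) for p in pages}
        # normalize so the squared scores sum to 1
        ca = math.sqrt(sum(x * x for x in a.values()))
        ch = math.sqrt(sum(x * x for x in h.values()))
        a = {p: a[p] / ca for p in pages}
        h = {p: h[p] / ch for p in pages}
    return a, h

# Two hypothetical hubs pointing at two candidate authorities.
auth, hub = hits({"h1": {"auth1", "auth2"}, "h2": {"auth1"},
                  "auth1": set(), "auth2": set()})
```

As expected, auth1 (two in-links from hubs) outscores auth2, and h1 (two out-links to authorities) outscores h2.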
Example: MiniWeb

        [h_n]        [a_n]        [1 1 1]
    H = [h_m]    A = [a_m]    M = [0 0 1]
        [h_a]        [a_a]        [1 1 0]

    H_new = λ · M · A_old
    A_new = μ · M^T · H_old        (λ, μ: normalization constants)

Therefore:

    H_new = λμ · M · M^T · H_old
    A_new = λμ · M^T · M · A_old
29
Example: MiniWeb

1 1 1          1 0 1          3 1 2           2 2 1
                                                   
M   0 0 1   M T  1 0 1  MM T   1 1 0    M M   2 2 1
T

 1 1 0         1 1 0          2 0 2           1 1 2
                                                   

1 6 28 132    2  3 
1 2 8  36   1
H          


1 4 20 96     1  3 
                    
Ne

1 5  24  114     1  3 
MS           1 5  24  114   1  3 
A                       
Am                         1 4 18  84       2     
                 
      

30
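These limits can be checked numerically with a few lines of Python (matrices hard-coded from the slide):

```python
import math

M = [[1, 1, 1], [0, 0, 1], [1, 1, 0]]
MT = [[M[j][i] for j in range(3)] for i in range(3)]    # transpose of M

def matvec(A, v):
    return [sum(A[i][j] * v[j] for j in range(3)) for i in range(3)]

h = [1.0, 1.0, 1.0]
a = [1.0, 1.0, 1.0]
for _ in range(60):
    h = matvec(M, matvec(MT, h))    # H_new ∝ M · M^T · H_old
    a = matvec(MT, matvec(M, a))    # A_new ∝ M^T · M · A_old
    # rescale so the numbers do not overflow; only the ratios matter
    mh, ma = max(h), max(a)
    h = [x / mh for x in h]
    a = [x / ma for x in a]
```

The iterates settle at h ∝ (2+√3, 1, 1+√3) and a ∝ (1+√3, 1+√3, 2), matching the slide.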
Convergence
• The algorithm converges to a fixed point if iterated indefinitely.
• Define A to be the adjacency matrix for the subgraph defined by S:
– A_ij = 1 for i ∈ S, j ∈ S iff i→j
• The authority vector a converges to the principal eigenvector of A^T·A.
• The hub vector h converges to the principal eigenvector of A·A^T.
• In practice, 20 iterations produce fairly stable results.
31
Results
• Authorities for query: “Java”
– java.sun.com
– comp.lang.java FAQ
• Authorities for query “search engine”
–   Yahoo.com
–   Excite.com
–   Lycos.com
–   Altavista.com
• Authorities for query “Gates”
– Microsoft.com

32

• In most cases, the final authorities were not in the initial root set generated using Altavista.
• Authorities were brought in from linked pages and acquired their high authority scores during the iteration.

33
Comparison

PageRank
  Pros:
    – Hard to spam
    – Computes a quality signal for all pages
  Cons:
    – Non-trivial to compute
    – Not query specific
    – Doesn't work on small graphs
  Proven to be effective for general-purpose ranking

HITS & Variants
  Pros:
    – Easy to compute, but real-time execution is hard [Bhar98b, Stat00]
    – Query specific
    – Works on small graphs
  Cons:
    – Local graph structure can be manufactured (spam!)
    – Provides a signal only when there's direct connectivity
  Well suited for supervised directory construction

34

Tag/position heuristics

• Increase the weights of terms
– in titles
– in tags
– near the beginning of the doc, and of its chapters and sections

35
Anchor text (first used in the WWW Worm - McBryan [Mcbr94])

[Figure: several pages linking to a tiger page with anchor texts such as "Tiger image", "Here is a great picture of a tiger", and "Cool tiger webpage"]

The text in the vicinity of a hyperlink is descriptive of the page it points to.

36
Two uses of anchor text

• When indexing a page, also index the anchor text of links pointing to it.
– Retrieve a page when the query matches its anchor text.
• Use anchor text to weight links in the hubs/authorities algorithm.
• Anchor text is usually taken to be a window of 6-8 words around a link anchor.

37
Indexing anchor text

• When indexing a document D, include anchor text from links pointing to D.

[Figure: pages (e.g., at Compaq and HP) linking to www.ibm.com with anchor texts "Armonk, NY-based computer giant IBM announced today" and "Big Blue today announced record profits for the quarter"]
38
Indexing anchor text

• Can sometimes have unexpected side
effects - e.g., evil empire.
• Can index anchor text with less weight.

39

• In hub/authority link analysis, can match anchor text to the query, then weight the link:

    h(x) = Σ_{x→y} a(y)    becomes    h(x) = Σ_{x→y} w(x,y) · a(y)

    a(x) = Σ_{y→x} h(y)    becomes    a(x) = Σ_{y→x} w(x,y) · h(y)

40

• What is w(x,y)?
• It should increase with the number of query terms in the anchor text.
– E.g.: 1 + number of query terms.

[Figure: page x links to www.ibm.com (y) with anchor text "Armonk, NY-based computer giant IBM announced today"; for the query "computer", the weight of this link is 2]
41
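A minimal sketch of this weighting rule; the function name and the choice to count repeated terms are my assumptions:

```python
def link_weight(query_terms, anchor_window):
    """w(x, y) = 1 + number of query-term occurrences in the anchor window."""
    words = anchor_window.lower().split()
    return 1 + sum(words.count(t.lower()) for t in query_terms)

# The IBM anchor text from the earlier slide, with the query "computer".
w = link_weight(["computer"],
                "Armonk, NY-based computer giant IBM announced today")
```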
Weighted hub/authority computation

• Recall the basic algorithm:
– Iteratively update all h(x), a(x);
– After the iterations, output pages with
• the highest h() scores as top hubs
• the highest a() scores as top authorities.
• Now use the weights w(x,y) in the iteration.
• This raises the scores of pages reached by "heavy" links. Do the scores still converge? To what?

42
Anchor Text

• Other applications
– Weighting/filtering links in the graph
• HITS [Chak98], Hilltop [Bhar01]
– Generating page descriptions from anchor
text [Amit98, Amit00]

43
Behavior-based ranking

• For each query Q, keep track of which docs in the results are clicked on
• On subsequent requests for Q, re-order the docs in the results based on click-throughs
• First due to DirectHit (later part of Ask Jeeves)
• Relevance assessment based on
– behavior/usage
– vs. content
Query-doc popularity matrix B

[Figure: matrix B with queries q as rows and docs j as columns]

    B_qj = number of times doc j was clicked through on query q

When query q is issued again, order the docs by their B_qj values.
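A minimal dictionary-backed sketch of B and the re-ordering step; the doc names and click data are hypothetical:

```python
from collections import defaultdict

B = defaultdict(lambda: defaultdict(int))   # B[q][j]: clicks on doc j for query q

def record_click(query, doc):
    B[query][doc] += 1

def rerank(query, docs):
    # order candidate docs by their click counts for this query, descending
    return sorted(docs, key=lambda j: B[query][j], reverse=True)

record_click("ferrari mondial", "d2")
record_click("ferrari mondial", "d2")
record_click("ferrari mondial", "d1")
order = rerank("ferrari mondial", ["d1", "d2", "d3"])
```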
Issues to consider

• Weighing/combining text- and click-based
scores.
• What identifies a query?
–   Ferrari Mondial
–   Ferrari Mondial
–   Ferrari mondial
–   ferrari mondial
–   “Ferrari Mondial”
• Can use heuristics, but they slow down query parsing.
Vector space implementation

• Maintain a term-doc popularity matrix C
– as opposed to the query-doc popularity matrix
– initialized to all zeros
• Each column C_j represents a doc j
– If doc j is clicked on for query q, update C_j ← C_j + q (here q is viewed as a term vector).
• On a query q', compute its cosine proximity to C_j for all j.
• Combine this with the regular text score.
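A sketch of this update and scoring with Python Counters; the toy queries and doc id are made up:

```python
from collections import Counter
import math

C = {}   # doc id -> Counter of term weights (one column C_j per doc)

def record_click(query, doc):
    # C_j <- C_j + q, with the query q viewed as a term vector
    C.setdefault(doc, Counter()).update(query.lower().split())

def click_score(query, doc):
    # cosine proximity between the query vector and column C_j
    q = Counter(query.lower().split())
    cj = C.get(doc, Counter())
    dot = sum(q[t] * cj[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in cj.values())))
    return dot / norm if norm else 0.0

record_click("white house", "d1")
record_click("white house tour", "d1")
s_match = click_score("white house", "d1")
s_partial = click_score("green house", "d1")
```

A repeated query matches its own click history most strongly; a query sharing only one term still gets partial credit.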
Issues

• Normalization of Cj after updating
• Assumption of query compositionality
– “white house” document popularity derived
from “white” and “house”
• Updating - live or batch?
Basic Assumption

• Relevance can be directly measured by
number of click throughs
• Valid?
Validity of Basic Assumption

• Click through to docs that turn out to be
non-relevant: what does a click mean?
• Self-perpetuating ranking
• Spam
• All votes count the same
Variants

• Time spent viewing page
– Difficult session management
– Inconclusive modeling so far
• Does user back out of page?
• Does user stop searching?
• Does user transact?
