# Allen

```
PageSim: A Link-based Measure of Web Page Similarity

Research Group Presentation
Allen Z. Lin, 8 Mar 2006

Outline
• What & Why?
• Existing approaches
• PageSim: a new approach
• Demonstrations
• Conclusion and current work

What & Why?
• Ranking similarity between web pages.
• Applications on the Web
– Finding related, or similar, web pages to a page.
– Web page classification.
  Example: YAHOO!'s Web Directory (http://dir.yahoo.com/), which has a hierarchical structure.
• Key question: How to measure the similarity?

Existing approaches
• Text-based
– Using common features of two web pages.
• Link-based
– Using neighbors between two web pages.
  Common neighbor, Cocitation, SimRank
– Using paths between two web pages.
  Katz index, Hitting time

Existing approaches (cont.)
• Notations
– Sim(a,b): similarity score of web pages a and b.
– I(a): in-link neighbors of web page a.
– O(a): out-link neighbors of web page a.
• Common neighbor method
– Sim(a,b) = |O(a)∩O(b)| = |{c,d}| = 2
• Cocitation method
– Sim(a,b) = |I(a)∩I(b)| = |{c,d}| = 2

Existing approaches (cont.)
• SimRank
– Two pages are similar if they are referenced (cited, or linked to) by similar pages.
– Base cases: 1. Sim(u,u) = 1; 2. Sim(u,v) = 0 if |I(u)|·|I(v)| = 0.
– Recursive definition:
  Sim(u,v) = C / (|I(u)|·|I(v)|) · Σ_{i=1..|I(u)|} Σ_{j=1..|I(v)|} Sim(I_i(u), I_j(v))
  where I_i(u) denotes the i-th in-link neighbor of u.
– C is a constant between 0 and 1.
– The iteration starts with Sim(u,u) = 1 and Sim(u,v) = 0 if u ≠ v.

PageSim: a new approach
• Two problems
– On the Web, not all links are equally important.
  (Affects Common neighbor and Cocitation.)
– A similarity measure should be able to measure the similarity between any two web pages.
  (Affects SimRank.)
• PageSim
– Takes the above problems into account.

PageSim: a new approach (cont.)
• Cocitation
• Which page is more similar to d: c or e?
• Suppose page a is YAHOO!’s homepage, and b is a personal web page.
  Authoritative pages are more important.

PageSim: a new approach (cont.)
• SimRank
• Are a and b similar?
– SimRank says “NO”.

PageSim: a new approach (cont.)
• Page a linking to b and c means a “thinks”:
– b and c are kind of similar.
– both b and c are kind of similar to a too.
• Page a spreads similarity to its neighbors.
• Authoritative pages spread more similarity.

PageSim: a new approach (cont.)
• PageSim – PageRank score propagation
– In PageSim, the PageRank (PR) score is used to measure the authority of a web page.
  PR assigns global importance scores to all web pages.
– Each page spreads its own similarity score (PR score) to its neighbors.
– Each page also propagates other pages’ similarity scores to its neighbors.
– After the similarity score propagation has finished, each page contains an array of similarity scores.

PageSim: a new approach (cont.)
• Example: similarity propagation (page a only)
– PR(a)=100, PR(b)=55, PR(c)=102
– Each page propagates 80% of its similarity score, divided evenly among its neighbors.

PageSim: a new approach (cont.)
• Example: similarity propagation (cont.)
– PR(a)=100, PR(b)=55, PR(c)=102
– Each page contains a similarity score vector (SV).
  SV(a) = (100,  35,  82)
  SV(b) = ( 40,  55,  33)
  SV(c) = ( 72,  44, 102)
– PageSim score (PS) computation: the sum of the element-wise minima of the two vectors.
  PS(a,b) = Σ min( SV(a), SV(b) )
          = min(100,40) + min(35,55) + min(82,33)
          = 40 + 35 + 33 = 108
– Two pages are more similar if they share more common similarity scores.

PageSim: a new approach (cont.)
– PageSim score matrix
  PS_matrix = (PS(u,v))_{n×n} =
        a     b     c
  a:  217
  b:  108   128
  c:  189   117   219
– PS_matrix is symmetric (only the lower triangle is shown).
  PS(a,b) = PS(b,a)
– Any web page is most similar to itself.
  PS(u,u) = max over v of PS(u,v).

Demonstrations

– PageSim matrix                – SimRank matrix
a: 100                         1
b: 80   265                    0     1
c: 64   212   469.2            0     0     1
d: 51.2 169.6 375.4   694.1    0     0     0     1
– PR = (100, 185, 257.2, 318.6)

Demonstrations (cont.)

– PageSim matrix                   – SimRank matrix
a: 295.2                          1
b: 246.4 295.2                    0     1
c: 230.4 246.4   295.2            0     0     1
d: 246.4 230.4   246.4   295.2    0     0     0     1
– PR = (100, 100, 100, 100)

Demonstrations (cont.)
• Example 3: more complex
– PageSim matrix
1: 100.0
2: 40.0    487.6
3: 50.7    159.4      397.4
4: 10.7    238.5      130.0       275.5
5: 10.7    130.0      130.0       130.0   314.9
PR = (100, 40.0, 50.7, 10.7, 10.7)
– SimRank matrix
1:   1
2:   0   1
3:   0   0.25   1
4:   0   0      0.5    1
5:   0   0      0.5    1      1
– PageSim results
  v3 is most similar to v1.
  v4 is most similar to v2.

Conclusion and current work
• Conclusion
– Web page similarity measures
– PageSim: PageRank score propagation.
• Current work
– How to compare the performance of two similarity measures, e.g., PageSim and SimRank?
– Text-based measures.

Thank you!

```
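The Common neighbor and Cocitation scores defined in the slides are plain set-intersection counts over out-links and in-links. The sketch below illustrates both in Python. The toy graph (pages a and b both linking to c and d) and all helper names are assumptions made for illustration, since the figure from the slide is not reproduced here.

```python
from typing import Dict, Set

# Hypothetical toy graph (not the slide's figure): page -> pages it links to.
GRAPH: Dict[str, Set[str]] = {
    "a": {"c", "d"},
    "b": {"c", "d"},
    "c": set(),
    "d": set(),
}


def out_links(graph: Dict[str, Set[str]], page: str) -> Set[str]:
    """O(page): the pages that `page` links to."""
    return graph.get(page, set())


def in_links(graph: Dict[str, Set[str]], page: str) -> Set[str]:
    """I(page): the pages that link to `page`."""
    return {src for src, targets in graph.items() if page in targets}


def common_neighbor(graph: Dict[str, Set[str]], a: str, b: str) -> int:
    """Sim(a,b) = |O(a) ∩ O(b)|."""
    return len(out_links(graph, a) & out_links(graph, b))


def cocitation(graph: Dict[str, Set[str]], a: str, b: str) -> int:
    """Sim(a,b) = |I(a) ∩ I(b)|."""
    return len(in_links(graph, a) & in_links(graph, b))


if __name__ == "__main__":
    print(common_neighbor(GRAPH, "a", "b"))  # a and b share out-links c and d
    print(cocitation(GRAPH, "c", "d"))       # c and d are both cited by a and b
```

Both scores count links without weighting them, which is the first of the two problems the slides raise.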
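The SimRank slide describes an iteration that starts from Sim(u,u) = 1 and Sim(u,v) = 0 and repeatedly averages the similarity of in-link neighbors, scaled by a constant C. The Python sketch below follows the standard SimRank recursion, Sim(u,v) = C / (|I(u)|·|I(v)|) · ΣΣ Sim(I_i(u), I_j(v)). The value C = 0.8, the iteration count, and the two-link toy graph are assumptions; the slides only state that C lies between 0 and 1.

```python
from typing import Dict, List


def simrank(in_links: Dict[str, List[str]], c: float = 0.8,
            iterations: int = 10) -> Dict[str, Dict[str, float]]:
    """Iterate the SimRank recursion; `in_links` maps each page to I(page)."""
    pages = list(in_links)
    # Initial state: Sim(u,u) = 1, Sim(u,v) = 0 for u != v.
    sim = {u: {v: 1.0 if u == v else 0.0 for v in pages} for u in pages}
    for _ in range(iterations):
        new: Dict[str, Dict[str, float]] = {u: {} for u in pages}
        for u in pages:
            for v in pages:
                if u == v:
                    new[u][v] = 1.0
                elif not in_links[u] or not in_links[v]:
                    # Sim(u,v) = 0 if |I(u)|·|I(v)| = 0.
                    new[u][v] = 0.0
                else:
                    total = sum(sim[i][j]
                                for i in in_links[u] for j in in_links[v])
                    new[u][v] = c * total / (len(in_links[u]) * len(in_links[v]))
        sim = new
    return sim


if __name__ == "__main__":
    # Hypothetical graph: a links to b and to c; a itself has no in-links.
    toy_in_links = {"a": [], "b": ["a"], "c": ["a"]}
    scores = simrank(toy_in_links)
    print(scores["b"]["c"])  # b and c are cited by the same page
    print(scores["a"]["b"])  # a has no in-links, so SimRank gives 0
```

In this toy graph the score for (a, b) stays at zero even though a links directly to b, which is the kind of case the slides use to motivate PageSim.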
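The propagation step is only outlined in the slides: each page passes 80% of a score on to its neighbors, divided evenly, and every page ends up with a vector of scores received from other pages. The Python sketch below is one possible reading of that outline, not the author's implementation. It assumes that scores travel along out-links, that the 80% factor applies at every hop, and that propagation follows cycle-free paths only. The function names and the four-page chain a→b→c→d are also assumptions, since the demonstration figures are not reproduced here.

```python
from typing import Dict, List, Set

Vector = Dict[str, float]


def propagate(out_links: Dict[str, List[str]], pr: Dict[str, float],
              decay: float = 0.8) -> Dict[str, Vector]:
    """Return SV(page) for every page: the score received from each source."""
    sv: Dict[str, Vector] = {p: {q: 0.0 for q in out_links} for p in out_links}

    def walk(source: str, page: str, score: float, visited: Set[str]) -> None:
        sv[page][source] += score
        targets = out_links[page]
        if not targets:
            return
        # Assumed rule: `decay` of the score, divided evenly among out-links.
        share = decay * score / len(targets)
        for target in targets:
            if target not in visited:  # follow cycle-free paths only
                walk(source, target, share, visited | {target})

    for page in out_links:
        walk(page, page, pr[page], {page})
    return sv


def pagesim(sv_u: Vector, sv_v: Vector) -> float:
    """PS(u,v): sum of element-wise minima of the two score vectors."""
    return sum(min(sv_u[k], sv_v[k]) for k in sv_u)


if __name__ == "__main__":
    # Hypothetical chain graph, using the PR values of the first demonstration.
    chain = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
    pr = {"a": 100.0, "b": 185.0, "c": 257.2, "d": 318.6}
    vectors = propagate(chain, pr)
    print(round(pagesim(vectors["a"], vectors["b"]), 1))
    print(round(pagesim(vectors["b"], vectors["c"]), 1))
```

Under these assumptions the chain yields PS(a,b) = 80 and PS(b,c) = 212, the values listed in the first demonstration matrix. That agreement suggests the reading is plausible but does not confirm it. Enumerating all cycle-free paths grows quickly on dense graphs, so this form is suitable only for small examples.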
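The PageSim score in the three-page example is the sum of element-wise minima of two similarity score vectors. The short Python sketch below recomputes the score matrix from the vectors SV(a), SV(b) and SV(c) given in the slides; the function and variable names are assumptions. Note that the vectors as printed give PS(c,c) = 218, whereas the slide lists 219, presumably because the displayed vector entries are rounded.

```python
from typing import Dict, Tuple

# Similarity score vectors as given in the slides, ordered (a, b, c).
SV: Dict[str, Tuple[int, int, int]] = {
    "a": (100, 35, 82),
    "b": (40, 55, 33),
    "c": (72, 44, 102),
}


def pagesim(sv_u: Tuple[int, ...], sv_v: Tuple[int, ...]) -> int:
    """PS(u,v) = sum of element-wise minima of SV(u) and SV(v)."""
    return sum(min(x, y) for x, y in zip(sv_u, sv_v))


# Full (symmetric) PageSim score matrix as a nested dictionary.
PS_MATRIX = {u: {v: pagesim(SV[u], SV[v]) for v in SV} for u in SV}

if __name__ == "__main__":
    for u in SV:
        print(u, [PS_MATRIX[u][v] for v in SV])
```

Because min is symmetric and min(x, x) = x, the matrix is symmetric and each diagonal entry is the largest value in its row, matching the two properties stated in the slides.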