Random Walks in Ranking Query Results in Semistructured Databases
Vagelis Hristidis
Roadmap
• Ranking Web Pages using link structure
  – Overview
  – PageRank
  – Hubs & Authorities
• Ranking Keyword Search Results in Semistructured Databases
  – Problem Statement
  – Previous Work
  – Ongoing Work: Ranking using Random Walks
Ranking Web Pages
Rank according to
• Relevance of page to query
• Quality of page
PageRank
• Stanford project
• Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd. The PageRank Citation Ranking: Bringing Order to the Web.
• Started Google
PageRank
• Uses the link structure of the web to calculate a quality ranking (PageRank) for each web page.
• Each page has a unique PageRank, independent of the keyword query.
• PageRank does NOT express the relevance of a page to a query.
PageRank is a Usage Simulation
• “Random surfer”
  – Is given a random URL
  – Clicks randomly on links
  – After a while gets bored and gets a new random URL
• The number of visits to each page is its PageRank.
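The random surfer can be simulated directly. The sketch below is only an illustration under stated assumptions: the four-page web `toy_web`, the function name `simulate_random_surfer`, and modelling "gets bored" as a jump to a uniformly random page with probability 1 - d (d = 0.85) are not taken from the slides.

```python
import random


def simulate_random_surfer(out_links, steps, d=0.85, seed=0):
    """Estimate how often a random surfer visits each page.

    out_links maps each page to the list of pages it links to.
    With probability d the surfer follows a random out-link; otherwise
    (or when the page has no out-links) it jumps to a random page.
    """
    rng = random.Random(seed)
    pages = sorted(out_links)
    visits = {page: 0 for page in pages}
    current = rng.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        if out_links[current] and rng.random() < d:
            current = rng.choice(out_links[current])
        else:
            current = rng.choice(pages)
    # Visit frequency of each page; the frequencies sum to 1.
    return {page: visits[page] / steps for page in pages}


# Hypothetical four-page web, used only for illustration.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
frequencies = simulate_random_surfer(toy_web, steps=200_000)
```

Page D has no incoming links here, so the surfer reaches it only through random jumps and it ends up with the lowest visit frequency.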
PageRank Calculation Intuition
• PageRank of page P increases when pages with large PageRanks point to P.
PageRank Calculation
PR(A)=(1-d) + d*(PR(T1)/C(T1)+…+ PR(Tn)/C(Tn))
d: damping factor, normally set to 0.85.
T1, …, Tn: pages pointing to page A.
PR(A): PageRank of page A.
PR(Ti): PageRank of page Ti.
C(Ti): the number of links going out of page Ti.
Note: d is needed due to PageRank sinks
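A minimal sketch of the formula for a single page. The function name `pagerank_of` and the two in-neighbors in the usage line are hypothetical and serve only as an illustration.

```python
def pagerank_of(in_neighbors, d=0.85):
    """PR(A) = (1-d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)).

    in_neighbors is a list of (PR(Ti), C(Ti)) pairs, one per page Ti
    that points to page A.
    """
    return (1 - d) + d * sum(pr / out_count for pr, out_count in in_neighbors)


# Hypothetical page A with two in-neighbors:
# T1 has PageRank 1 and 2 out-links, T2 has PageRank 1 and 1 out-link.
pr_a = pagerank_of([(1.0, 2), (1.0, 1)])  # 0.15 + 0.85 * (1/2 + 1/1)
```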
Example of Calculation (1)
[Figure: link graph over four pages: Page A, Page B, Page C, and Page D.]
Example of Calculation (2)
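[Figure: every page starts with PageRank 1. Page A passes 1*0.85/2 to each of Page B and Page C; Page B and Page D each pass 1*0.85 to Page C; Page C passes 1*0.85 to Page A.]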
• Each page keeps 0.15 that it does not pass on, so we get:
Page A: 0.85 (from Page C) + 0.15 (not transferred) = 1
Page B: 0.425 (from Page A) + 0.15 (not transferred) = 0.575
Page C: 0.85 (from Page D) + 0.85 (from Page B) + 0.425 (from Page A) + 0.15 (not transferred) = 2.275
Page D: receives none, but has not transferred 0.15 = 0.15
Example of Calculation (3)
Starting from Page A = 1, Page B = 0.575, Page C = 2.275, Page D = 0.15:
Page A: 2.275*0.85 (from Page C) + 0.15 (not transferred) = 2.08375
Page B: 1*0.85/2 (from Page A) + 0.15 (not transferred) = 0.575
Page C: 0.15*0.85 (from Page D) + 0.575*0.85 (from Page B) + 1*0.85/2 (from Page A) + 0.15 (not transferred) = 1.19125
Page D: receives none, but has not transferred 0.15 = 0.15
Example of calculation (4)
• After 20 iterations, we get
Page A 1.490
Page B 0.783
Page C 1.577
Page D 0.15
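The iterations of this example can be replayed with a short sketch. The link structure in `example_links` is inferred from the arithmetic on the slides (A links to B and C; B and D link to C; C links to A), and the function name `pagerank_iterations` is hypothetical.

```python
def pagerank_iterations(out_links, iterations, d=0.85):
    """Apply PR(A) = (1-d) + d * sum(PR(Ti)/C(Ti)) to all pages at once.

    Every page starts with PageRank 1. Returns the list of rank
    dictionaries, one per iteration (index 0 holds the start values).
    """
    pages = sorted(out_links)
    ranks = {page: 1.0 for page in pages}
    history = [dict(ranks)]
    for _ in range(iterations):
        new_ranks = {page: 1 - d for page in pages}
        for source, targets in out_links.items():
            if not targets:
                continue  # a page without out-links passes nothing on
            share = d * ranks[source] / len(targets)
            for target in targets:
                new_ranks[target] += share
        ranks = new_ranks
        history.append(dict(ranks))
    return history


# Link structure inferred from the worked example on the slides.
example_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
history = pagerank_iterations(example_links, iterations=20)
```

`history[1]` and `history[2]` correspond to Example of Calculation (2) and (3); `history[20]` agrees with the values above to three decimals.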
Example - Conclusions
• Page C has the highest PageRank and page A the next highest: page C is the most important page in this graph.
• More iterations lead to convergence of the PageRanks.
Google
• Uses PageRank as one of the criteria to rank keyword query results.
• Other criteria (may) include:
  – Term frequencies
  – Term proximities
  – Term position (title, top of page, etc.)
  – Term characteristics (boldface, capitalized, etc.)
  – Link analysis information
  – Category information
  – Popularity information
Hubs & Authorities
• Jon M. Kleinberg: Authoritative Sources in a Hyperlinked Environment. JACM 46(5): 604-632 (1999)
• HITS (Hypertext-Induced Topic Search) was developed by Jon Kleinberg while visiting IBM Almaden.
• IBM expanded HITS into Clever.
• IBM does not see Clever as a real-time search engine, but as a way to create constantly refreshed lists of relevant pages for categories.
Hubs & Authorities
• Rank pages according to keyword query (in contrast to PageRank)
Hubs & Authorities
• Good hub: page that points to many good authorities.
• Good authority: page pointed to by many good hubs.
• Given a keyword query, assign a hub value and an authority value to each page.
• Pages with high authority are the results of the query.
Hubs & Authorities Calculation : Root Set and Base Set
• Use the query terms to collect a root set of pages from a text-based search engine (AltaVista).
Hubs & Authorities Calculation: Root Set and Base Set (Cont’d)
• Expand the root set into the base set by including (up to a designated size cut-off)
  – all pages linked to by pages in the root set
  – all pages that link to a page in the root set
• A typical base set contains roughly 1000-5000 pages
[Figure: the root set drawn inside the larger base set.]
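The expansion step might be sketched as follows. The function name `expand_to_base_set`, the toy link table, and the way the size cut-off is applied (stop adding pages once `max_size` is reached) are assumptions for illustration; the slides do not specify them.

```python
def expand_to_base_set(root_set, out_links, max_size):
    """Grow the root set with pages it links to and pages that link into it."""
    root = set(root_set)
    base = list(dict.fromkeys(root_set))  # keep order, drop duplicates
    seen = set(base)

    # Pages linked to by root pages, then pages that link to a root page.
    linked_to = [t for page in base for t in out_links.get(page, [])]
    linking_in = [p for p, targets in out_links.items() if root.intersection(targets)]

    for page in linked_to + linking_in:
        if len(base) >= max_size:
            break  # assumed cut-off rule: stop once the size limit is hit
        if page not in seen:
            seen.add(page)
            base.append(page)
    return base


# Hypothetical link table: page -> pages it links to.
toy_links = {
    "r1": ["x", "y"],
    "r2": ["y"],
    "u": ["r1"],
    "v": ["w"],
}
base_set = expand_to_base_set(["r1", "r2"], toy_links, max_size=10)
```

Page `v` stays out of the base set because it neither links to a root page nor is linked to by one.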
Hubs & Authorities Calculation
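• Iterative algorithm on the base set: authority weights a(p) and hub weights h(p).
  – Set authority weights a(p) = 1 and hub weights h(p) = 1 for all p.
  – Repeat the following two operations (and then re-normalize a and h to have unit norm):

$$a(p) = \sum_{q \,:\, q \text{ points to } p} h(q)$$

$$h(p) = \sum_{q \,:\, p \text{ points to } q} a(q)$$

[Figure: a(p) sums the hub weights h(v1), h(v2), h(v3) of the pages pointing to p; h(p) sums the authority weights a(v1), a(v2), a(v3) of the pages that p points to.]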
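Example: Mini Web
Three pages X, Y, Z with hub vector $H = (h_x, h_y, h_z)^T$ and authority vector $A = (a_x, a_y, a_z)^T$. Rows and columns of the adjacency matrix M are ordered X, Y, Z; entry (i, j) is 1 when page i points to page j.

$$M = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$$

Hub weights are obtained from authority weights through M, and authority weights from hub weights through $M^T$. Composing the two updates gives, per iteration:

$$H_i = M M^T H_{i-1}, \qquad A_i = M^T M A_{i-1}$$

Example

$$M^T = \begin{pmatrix} 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}, \qquad M M^T = \begin{pmatrix} 3 & 1 & 2 \\ 1 & 1 & 0 \\ 2 & 0 & 2 \end{pmatrix}, \qquad M^T M = \begin{pmatrix} 2 & 2 & 1 \\ 2 & 2 & 1 \\ 1 & 1 & 2 \end{pmatrix}$$

Iteration   H = (X, Y, Z)       A = (X, Y, Z)
0           (1, 1, 1)           (1, 1, 1)
1           (6, 2, 4)           (5, 5, 4)
2           (28, 8, 20)         (24, 24, 18)
3           (132, 36, 96)       (114, 114, 84)
…           (2+√3, 1, 1+√3)     (1+√3, 1+√3, 2)

The last row gives the limit directions (up to scaling).
X is the best hub
X and Y are the most authoritative (Z has the lowest authority weight)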
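The iterates of the Mini Web example can be checked with a few lines of plain Python. This is a sketch without normalization, so the integer values stay comparable to the example; the helper names `mat_vec`, `transpose`, and `hits_unnormalized` are hypothetical.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def hits_unnormalized(adjacency, iterations):
    """Return [(H_0, A_0), (H_1, A_1), ...] without re-normalizing."""
    n = len(adjacency)
    hubs, auths = [1] * n, [1] * n
    history = [(hubs, auths)]
    for _ in range(iterations):
        # H_i = M M^T H_{i-1} and A_i = M^T M A_{i-1}
        hubs = mat_vec(adjacency, mat_vec(transpose(adjacency), hubs))
        auths = mat_vec(transpose(adjacency), mat_vec(adjacency, auths))
        history.append((hubs, auths))
    return history


# Adjacency matrix of the Mini Web, rows and columns ordered X, Y, Z.
mini_web = [[1, 1, 1], [0, 0, 1], [1, 1, 0]]
mini_web_history = hits_unnormalized(mini_web, iterations=3)
```

In every iteration X has the largest hub weight, while X and Y share the largest authority weight.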
Hubs & Authorities Calculation
• Theorem (Kleinberg, 1998). The iterates a(p) and h(p) converge to the principal eigenvectors of $M^T M$ and $M M^T$, respectively, where M is the adjacency matrix of the (directed) Web subgraph.
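A sketch of the normalized iteration on the Mini Web matrix, illustrating the convergence numerically. The function names are hypothetical, and the closed-form directions in the final comment were derived by hand for this particular matrix rather than taken from the theorem.

```python
import math


def normalize(vector):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def hits(adjacency, iterations=50):
    """Iterate the hub/authority updates, re-normalizing after each round."""
    n = len(adjacency)
    hubs, auths = [1.0] * n, [1.0] * n
    for _ in range(iterations):
        # a(p) = sum of h(q) over all q that point to p
        auths = [sum(hubs[q] for q in range(n) if adjacency[q][p]) for p in range(n)]
        # h(p) = sum of a(q) over all q that p points to
        hubs = [sum(auths[q] for q in range(n) if adjacency[p][q]) for p in range(n)]
        auths, hubs = normalize(auths), normalize(hubs)
    return hubs, auths


# Mini Web adjacency matrix, rows and columns ordered X, Y, Z.
mini_web_matrix = [[1, 1, 1], [0, 0, 1], [1, 1, 0]]
hubs, auths = hits(mini_web_matrix)
# For this matrix the limits are proportional to
# (2+sqrt(3), 1, 1+sqrt(3)) for hubs and (1+sqrt(3), 1+sqrt(3), 2) for authorities.
```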
PageRank vs. Authorities
• PageRank (Google)
  – computed for all web pages stored in the database prior to the query
  – computes authorities only
  – trivial and fast to compute
• HITS (CLEVER)
  – performed on the set of retrieved web pages for each query
  – computes authorities and hubs
  – easy to compute, but real-time execution is hard
Keyword Search in Databases
Conf (SIGMOD 01) 550
Attendee
Author (L.Gravano) 30
Paper (PREFER) 13
Paper (Fagin PODS96) 30
Paper (Top-k ICDE2002) 13
Author (Vagelis) 5
Paper (Insignificant) 13
Paper (Insignificant2) 13
Author (Unknown Gravano) 1
• The label of a node is: Type (Value) degree
• Query: Vagelis, Gravano
• Assume that SIGMOD 01 has 500 attendees and 50 papers. Each paper has 10 references and 2 authors.
Result of Keyword Query
Result is a tree T of nodes where:
• each edge corresponds to an edge of the data graph
• every keyword is contained in a node of T
• no node of T is redundant (T is minimal)
Example
Results
R1: Vagelis – PREFER – SIGMOD 01 – L.Gravano
R2: Vagelis – PREFER – Fagin PODS96 – Top-k ICDE2002 – L.Gravano
R3: Vagelis – PREFER – Insignificant1 paper – Insignificant2 paper – Unknown Gravano
Previous Work
XKeyword, DISCOVER, DBXplorer, Goldman98: Score is inverse of path distance between nodes.
BANKS: Weighted distance
Results output: R1, R2, R3
Previous Work – Keyword Queries
• XKeyword. V. Hristidis, Y. Papakonstantinou, A. Balmin. ICDE 2003
  DISCOVER. V. Hristidis, Y. Papakonstantinou. VLDB 2002
  DBXplorer. S. Agrawal et al. ICDE 2002
  – Three-step architecture
  – Data stored in DBMS
  – Schema use
• BANKS. G. Bhalotia et al. ICDE 2002
  – Database viewed as graph
  – No schema info
  – Steiner tree problem approximations
• Proximity searching in databases. R. Goldman et al. VLDB 1998
  – Database viewed as graph
  – No schema info
  – Hub nodes
Previous Work
• Prior work: Results output: R1, R2, R3
• Intuitively, R3 shows a tighter connection than R1 (higher relevance between keywords)
• But R2 connects objects of higher “importance” than R3 (higher quality of result)
• Relevance and Quality can be conflicting factors
Random Walks (RW)
• Score of result A~B: probability that a random walk goes from A to B
• Captures Relevance, but ignores Quality of result.
• P(A→B→C) = 1/degree(A) * 1/degree(B)
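A minimal sketch of the path probability, generalizing P(A→B→C) to longer paths by multiplying 1/degree over every node except the last. The function name `walk_probability` is hypothetical; the degrees in the usage line are those of result R1 in the data graph (Vagelis 5, PREFER 13, SIGMOD 01 550, L.Gravano 30).

```python
from fractions import Fraction


def walk_probability(degrees_along_path):
    """Probability that a random walk follows one specific path.

    degrees_along_path lists the degree of each node on the path, in
    order. The last node's degree is unused: no step is taken from it.
    """
    probability = Fraction(1)
    for degree in degrees_along_path[:-1]:
        probability *= Fraction(1, degree)
    return probability


# R1: Vagelis (5) -> PREFER (13) -> SIGMOD 01 (550) -> L.Gravano (30)
p_r1_forward = walk_probability([5, 13, 550, 30])  # 1/(5*13*550)
```

The high degree of the conference node makes the walk through SIGMOD 01 very unlikely, which is why RW ranks R1 last.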
Random Walks (RW)
• RW: Results output: R3, R2, R1
• But R2 connects objects of higher “importance” than R3 (higher quality of result)
• Relevance and Quality can be conflicting factors
Random Walks + PageRank (RW+PR)
• Score of result A~B: probability that a random walk starting from any node goes through both A and B.
• Captures both Relevance and Quality of result.
• Score = PR(A)*P(A~>B) + PR(B)*P(B~>A)
• P(A~>B) can be computed using the PageRank algorithm, setting the PageRank source to {A}
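One possible reading of "setting the PageRank source to {A}" is a PageRank run in which the walk restarts only at A; the value it assigns to B could then serve as an estimate of P(A~>B). The sketch below follows that reading. The normalization (ranks summing to 1), the undirected toy graph, and the function name `personalized_pagerank` are assumptions, not details given on the slides.

```python
def personalized_pagerank(neighbors, source, d=0.85, iterations=200):
    """PageRank whose restart distribution is concentrated on `source`.

    neighbors maps each node to its adjacent nodes (undirected graph,
    every node is assumed to have at least one neighbor).
    """
    nodes = sorted(neighbors)
    ranks = {node: (1.0 if node == source else 0.0) for node in nodes}
    for _ in range(iterations):
        new_ranks = {node: 0.0 for node in nodes}
        new_ranks[source] = 1 - d  # the walk restarts only at the source
        for node in nodes:
            share = d * ranks[node] / len(neighbors[node])
            for nxt in neighbors[node]:
                new_ranks[nxt] += share
        ranks = new_ranks
    return ranks


# Hypothetical undirected path A - B - C.
toy_graph = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
ranks_from_a = personalized_pagerank(toy_graph, source="A")
```

Nodes closer to the source receive more of the restart mass than nodes farther away, so C scores below A here.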
Random Walks + PageRank (RW+PR)
• RW+PR: Results output: R2, R1, R3
• Assuming the following PageRank (PR) values:
  Vagelis: 1/1000
  L.Gravano: 1/50
  Unknown Gravano: 1/100000
Example - Details
The following table shows the scores of the results according to the 3 ranking methods:

Result   XKeyword   RW          RW+PR
R1       1/4        1/(7E+9)    1.2E-7
R2       1/5        1/(4E+9)    1.7E-7
R3       1/4        1/(2E+7)    9E-8

Ranking
XKeyword: R1, R2, R3
RW: R3, R2, R1
RW+PR: R2, R1, R3
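The RW and RW+PR columns can be approximately reproduced from the node degrees and the assumed PageRank values. Two points in this sketch are assumptions: the RW score is taken as the product of the walk probabilities in both directions (this matches the orders of magnitude in the table, but the slides do not state it), and P(A~>B) is approximated by the probability of the single result path. All function names are hypothetical.

```python
from fractions import Fraction

# Node degrees from the data graph and the assumed PageRank values.
DEGREE = {
    "Vagelis": 5, "PREFER": 13, "SIGMOD 01": 550, "L.Gravano": 30,
    "Fagin PODS96": 30, "Top-k ICDE2002": 13,
    "Insignificant1": 13, "Insignificant2": 13, "Unknown Gravano": 1,
}
PR = {
    "Vagelis": Fraction(1, 1000),
    "L.Gravano": Fraction(1, 50),
    "Unknown Gravano": Fraction(1, 100000),
}
RESULTS = {
    "R1": ["Vagelis", "PREFER", "SIGMOD 01", "L.Gravano"],
    "R2": ["Vagelis", "PREFER", "Fagin PODS96", "Top-k ICDE2002", "L.Gravano"],
    "R3": ["Vagelis", "PREFER", "Insignificant1", "Insignificant2", "Unknown Gravano"],
}


def walk(path):
    """Product of 1/degree over every node on the path except the last."""
    probability = Fraction(1)
    for node in path[:-1]:
        probability /= DEGREE[node]
    return probability


def rw_score(path):
    # ASSUMPTION: symmetric score, forward walk times backward walk.
    return walk(path) * walk(path[::-1])


def rw_pr_score(path):
    # Score = PR(A)*P(A~>B) + PR(B)*P(B~>A), with A and B the keyword nodes.
    a, b = path[0], path[-1]
    return PR[a] * walk(path) + PR[b] * walk(path[::-1])


rw_ranking = sorted(RESULTS, key=lambda r: rw_score(RESULTS[r]), reverse=True)
rw_pr_ranking = sorted(RESULTS, key=lambda r: rw_pr_score(RESULTS[r]), reverse=True)
```

Under these assumptions `rw_ranking` and `rw_pr_ranking` give the same orderings as the Ranking lines above.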
Random Walk Variations
Criterion: Relevance
Definition: Probability that a random walk starting from a result node traverses the rest of the result nodes.

Criterion: Quality
Definition: For each result node, calculate the probability of a random walk starting from any node of the graph.
Comments: Corresponds to PageRank. Also, part of the authority of a page is attributed to its quality.

Criterion: Both
Definition: Probability that a random walk starting from any node of the graph goes through the result nodes.
Comments: A combination of PageRank (computed offline) and random walks starting from a result node could be used.
Page vs Structured Results Ranking
Web (PageRank/Authorities)                 Structured Databases
All edges have same meaning (hyperlinks)   Each edge has different semantics
Straightforward notion of direction        No straightforward notion of direction
Single type (page) of nodes                Multiple types of nodes
Return single object (page)                Return tree of objects
Open issues
• Efficiently calculating RW. First thoughts: two ways
  – DISCOVER-like, with candidate networks (CNs)
  – BANKS-like, using shortest paths progressively
• Edges must have different weights for PR and RW calculation (e.g., “Paper cites Paper” is one-way for PR but two-way for RW). How to assign PR and RW weights on the schema graph?
Conclusions
• The concept of Random Walks has proven very useful in ranking Web pages.
• It can also be used in ranking results of queries in structured/semistructured databases.
• There, the problem is more complicated.