Keyword Search in Databases using PageRank
By Michael Sirivianos
April 11, 2003
Roadmap
PageRank: Ranking Web Pages using the link structure of the web
Ranking Keyword Search Results in Structured Databases
Ranking Combining Individual PageRanks
PageRank(1)
A Stanford project by Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd.
“The PageRank Citation Ranking: Bringing Order to the Web”.
Started Google
PageRank(2)
• Makes use of the link structure of the web to calculate a quality ranking (PageRank) for each web page.
• Citation counting is a metric for measuring page or paper quality.
• PageRank is a more sophisticated citation counting method, not prone to manipulation.
• Each page has a unique PageRank, independent of the keyword query.
• PageRank does NOT express the relevance of a page to a query.
PageRank (3)
Calculation intuition: the PageRank of page P increases when pages with large PageRanks point to P. The rank of a page is distributed evenly among its forward links.
A problem: when two pages form a loop by pointing to each other but to no other page, and some other page points into the loop, the loop accumulates rank in every iteration and never distributes it. This is called a rank sink.
PageRank is a Usage Simulation
“Random surfer”
• Is given a random URL
• Clicks randomly on links
• After a while gets bored and gets a new random URL
The PageRank of a page is proportional to the number of visits it receives.
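The random surfer can be simulated directly. The following is a minimal sketch, not the authors' code: the function name, the seed, the step count, and the four-page graph are assumptions made for illustration (the graph matches the links implied by the worked example in these slides).

```python
import random

def simulate_random_surfer(links, d=0.85, steps=500_000, seed=0):
    """Count visits of a surfer who follows a random link with
    probability d and otherwise jumps to a random page."""
    rng = random.Random(seed)
    pages = sorted(links)
    visits = {page: 0 for page in pages}
    current = rng.choice(pages)          # start at a random URL
    for _ in range(steps):
        visits[current] += 1
        out = links[current]
        if out and rng.random() < d:
            current = rng.choice(out)    # click a random link
        else:
            current = rng.choice(pages)  # bored: get a new random URL
    return visits

# Assumed example graph: page -> pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
visits = simulate_random_surfer(links)
total = sum(visits.values())
for page in sorted(visits):
    print(page, round(visits[page] / total, 3))
```

Page D has no incoming links, so it is reached only by random jumps and should collect the fewest visits.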
PageRank Calculation
PR(A)=(1-d) + d*( PR(T1)/C(T1)+…+ PR(Tn)/C(Tn) )
d: damping factor, normally set to 0.85.
T1, …, Tn: pages pointing to page A.
PR(A): PageRank of page A.
PR(Ti): PageRank of page Ti.
C(Ti): the number of links going out of page Ti.
Note: the damping factor d accounts for rank sinks.
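One application of the formula to every page can be sketched as follows. The function name `pagerank_step` and the three-page graph are hypothetical and serve only to illustrate the update.

```python
def pagerank_step(pr, links, d=0.85):
    """Apply PR(A) = (1-d) + d * sum(PR(Ti)/C(Ti)) once to every page."""
    new_pr = {page: 1.0 - d for page in pr}
    for source, targets in links.items():
        if not targets:
            continue  # a page without forward links passes on nothing
        share = d * pr[source] / len(targets)  # d * PR(Ti) / C(Ti)
        for target in targets:
            new_pr[target] += share
    return new_pr

# Hypothetical graph: X and Z both link to Y, and Y links to X.
links = {"X": ["Y"], "Y": ["X"], "Z": ["Y"]}
pr = {"X": 1.0, "Y": 1.0, "Z": 1.0}
pr = pagerank_step(pr, links)
print(pr)
```

Repeating the step until the values stop changing gives the PageRanks.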
Example of Calculation (1)
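[Figure: link graph of four pages. Page A links to Pages B and C; Page B links to Page C; Page C links to Page A; Page D links to Page C.]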
Example of Calculation (2)
[Figure: every page starts with PageRank 1. Page A passes 1*0.85/2 to each of Pages B and C; Pages B and D each pass 1*0.85 to Page C; Page C passes 1*0.85 to Page A.]
Example of Calculation (3)
Each page has not passed on 0.15, so we get:
Page A: 0.85 (from Page C) + 0.15 (not transferred) = 1
Page B: 0.425 (from Page A) + 0.15 (not transferred) = 0.575
Page C: 0.85 (from Page D) + 0.85 (from Page B) + 0.425 (from Page A) + 0.15 (not transferred) = 2.275
Page D: receives none, but has not transferred 0.15 = 0.15
Example of Calculation (4)
Page A: 2.275*0.85 (from Page C) + 0.15 (not transferred) = 2.08375
Page B: 1*0.85/2 (from Page A) + 0.15 (not transferred) = 0.575
Page C: 0.15*0.85 (from Page D) + 0.575*0.85 (from Page B) + 1*0.85/2 (from Page A) + 0.15 (not transferred) = 1.19125
Page D: receives none, but has not transferred 0.15 = 0.15
Example - Conclusions
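After two iterations Page A has the largest value, but the values are still changing. More iterations lead to convergence of the PageRanks. At convergence Page C has the highest PageRank and Page A the next highest: Page C has the highest importance in this page graph.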
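The worked example can be replayed with a short script. This is a sketch under assumptions: the link structure (A to B and C, B to C, C to A, D to C) is inferred from the contributions listed in the calculation, and the function name, tolerance, and iteration cap are illustrative choices.

```python
def iterate_pagerank(links, d=0.85, tol=1e-9, max_iter=1000):
    """Repeat the PageRank update, keeping every intermediate result."""
    pr = {page: 1.0 for page in links}   # every page starts with PageRank 1
    history = [pr]
    for _ in range(max_iter):
        new_pr = {page: 1.0 - d for page in links}
        for source, targets in links.items():
            for target in targets:
                new_pr[target] += d * pr[source] / len(targets)
        history.append(new_pr)
        change = max(abs(new_pr[p] - pr[p]) for p in links)
        pr = new_pr
        if change < tol:
            break                        # values have stopped changing
    return history

# Link structure inferred from the worked example.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
history = iterate_pagerank(links)
first, second, final = history[1], history[2], history[-1]
print("iteration 1:", first)
print("iteration 2:", second)
print("converged:  ", {p: round(v, 4) for p, v in final.items()})
```

The first two entries should match the values computed by hand, and the final entry shows the ranking after convergence.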
Base set
In practice, a user who gets bored tends to jump to one of his bookmarked pages instead of a random one. These bookmarked pages constitute the base set. The PR formula is modified to reflect this behavior.
PR(A)=(1-d)*E + d*( PR(T1)/C(T1)+…+ PR(Tn)/C(Tn) )
E = 1 if A is in the base set, otherwise E = 0
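A minimal sketch of the base-set variant follows. The function name, the fixed iteration count, and the choice of Page A as the only bookmarked page are assumptions for illustration.

```python
def pagerank_with_base_set(links, base_set, d=0.85, iterations=100):
    """PageRank where only base-set pages receive the (1-d) term (E = 1)."""
    pr = {page: 1.0 for page in links}
    for _ in range(iterations):
        # E = 1 for pages in the base set, E = 0 for all others.
        new_pr = {page: (1.0 - d) if page in base_set else 0.0
                  for page in links}
        for source, targets in links.items():
            for target in targets:
                new_pr[target] += d * pr[source] / len(targets)
        pr = new_pr
    return pr

# Assumed example graph; Page A is the only bookmarked page.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
pr = pagerank_with_base_set(links, base_set={"A"})
for page in sorted(pr):
    print(page, round(pr[page], 4))
```

Page D is outside the base set and has no incoming links, so its rank drops to zero.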
Keyword Query
Input: set of keywords
Output: list of nodes ranked according to their relevance to the keywords
Score of a result-node:
• Sum of keyword-specific PRs (OR semantics)
• Product of keyword-specific PRs (AND semantics)
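The two scoring rules can be sketched as below. The function name, the node ids, and the PR values are hypothetical; treating a node that is missing from a keyword's list as having PR 0 is also an assumption.

```python
from math import prod

def combine_scores(keyword_prs, node, semantics="OR"):
    """Combine the keyword-specific PRs of one node into a single score."""
    # Assumption: a node absent from a keyword's list has PR 0 for it.
    values = [prs.get(node, 0.0) for prs in keyword_prs.values()]
    if semantics == "OR":
        return sum(values)    # sum of keyword-specific PRs
    if semantics == "AND":
        return prod(values)   # product of keyword-specific PRs
    raise ValueError("semantics must be 'OR' or 'AND'")

# Hypothetical keyword-specific PR values.
keyword_prs = {
    "database": {"p1": 0.4, "p2": 0.1},
    "ranking": {"p1": 0.2, "p3": 0.5},
}
for node in ["p1", "p2", "p3"]:
    print(node,
          combine_scores(keyword_prs, node, "OR"),
          combine_scores(keyword_prs, node, "AND"))
```

Under AND semantics a node scores zero as soon as one keyword gives it no rank, while OR semantics still rewards it for the other keywords.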
Database Schema
C(cid,name)
Y(yid,year,cid)
P(pid,title,yid)
A(aid,name)
PP(pid1,pid2)
PA(pid,aid)
Tuples in C, Y, P, A are objects that represent nodes in the schema graph.
Primary-to-foreign-key relations represent edges in the graph.
All connections are two-way except P – P, which goes only from a paper to the cited paper.
Legend: C: conference; Y: conference year; P: paper; A: author; arrow: primary to foreign key
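How tuples become a graph can be sketched as follows. The function name and sample rows are hypothetical, and reading PP(pid1,pid2) as "pid1 cites pid2" is an assumption, since the slides do not state the column order.

```python
def build_graph(year_rows, paper_rows, cites_rows, wrote_rows):
    """Turn rows of Y, P, PP and PA into a set of directed edges."""
    edges = set()

    def both_ways(u, v):
        edges.add((u, v))
        edges.add((v, u))

    for yid, _year, cid in year_rows:      # Y(yid,year,cid)
        both_ways(("C", cid), ("Y", yid))
    for pid, _title, yid in paper_rows:    # P(pid,title,yid)
        both_ways(("Y", yid), ("P", pid))
    for pid, aid in wrote_rows:            # PA(pid,aid)
        both_ways(("P", pid), ("A", aid))
    for pid1, pid2 in cites_rows:          # PP(pid1,pid2)
        # Assumption: pid1 cites pid2; the edge is one-way only.
        edges.add((("P", pid1), ("P", pid2)))
    return edges

# Hypothetical sample rows.
edges = build_graph(
    year_rows=[(10, 2003, 1)],
    paper_rows=[(1, "Paper one", 10), (2, "Paper two", 10)],
    cites_rows=[(1, 2)],
    wrote_rows=[(1, 7)],
)
print(len(edges), "edges")
```

Nodes are tagged with their table name so that ids from different tables do not collide.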
Architecture
Preprocessing stage: Database and parameters (d, edge weights, epsilon, threshold) → Create PR index → PRindex
Attributes of PRindex table:
• Keyword
• CLOB of (id,PR) list
Query stage: Keywords, k and PRindex → Query Module → Results
Results are a list of:
• Node id
• Node text
• PR wrt all keywords
Modified PageRank Formula
If A has the keyword:
PR(A)=(1-d) + d*(weight(T1→A)*PR(T1)/C(T1)+…+ weight(Tn→A)*PR(Tn)/C(Tn))
If A doesn’t have the keyword:
PR(A)=d*(weight(T1→A)*PR(T1)/C(T1)+…+ weight(Tn→A)*PR(Tn)/C(Tn))
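The modified formula can be sketched with weighted edges. Everything named here is an assumption for illustration: the function name, the edge weights, the node ids, and the use of epsilon as a convergence threshold on the largest change between iterations.

```python
def keyword_pagerank(nodes, edges, base_set, d=0.85, epsilon=1e-6,
                     max_iter=500):
    """Keyword-specific PR: only nodes containing the keyword get (1-d)."""
    out_count = {n: 0 for n in nodes}       # C(Ti): outgoing edges per node
    for source, _target in edges:
        out_count[source] += 1
    pr = {n: 1.0 for n in nodes}
    for _ in range(max_iter):
        new_pr = {n: (1.0 - d) if n in base_set else 0.0 for n in nodes}
        for (source, target), weight in edges.items():
            new_pr[target] += d * weight * pr[source] / out_count[source]
        delta = max(abs(new_pr[n] - pr[n]) for n in nodes)
        pr = new_pr
        if delta < epsilon:                 # assumed stopping rule
            break
    return pr

# Hypothetical graph: paper p1 cites p2 and is connected to author a1.
nodes = ["p1", "p2", "a1"]
edges = {("p1", "p2"): 0.7, ("p1", "a1"): 0.2, ("a1", "p1"): 0.2}
pr = keyword_pagerank(nodes, edges, base_set={"p1"})
for node in nodes:
    print(node, round(pr[node], 4))
```

Only p1 contains the keyword, so all rank in this graph originates from p1 and reaches p2 and a1 through the weighted edges.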
Preprocessing stage (1)
Load whole database in memory
Create edges Hashtable ( nodeId, nodeId, Type of edge )
Create nodes Hashtable ( nodeId )
Create text Hashtable ( nodeId, text )
For each keyword:
• Find all nodes that contain the keyword and put them in the base set.
• Execute the PR algorithm with this base set.
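The in-memory tables and the base-set lookup can be sketched as below. The table contents, edge type names, and function name are hypothetical, and matching a keyword against whitespace-separated, lowercased words is an assumption about what "contains keyword" means.

```python
# Hypothetical in-memory tables mirroring the three hashtables.
edges = {("p1", "p2"): "cites",
         ("p1", "a1"): "written_by",
         ("a1", "p1"): "wrote"}
nodes = {"p1", "p2", "a1"}
text = {"p1": "Keyword search in databases",
        "p2": "Ranking web pages",
        "a1": "Jane Doe"}

def find_base_set(text, keyword):
    """Return the ids of all nodes whose text contains the keyword."""
    wanted = keyword.lower()
    # Assumption: a node contains a keyword if it matches a whole word.
    return {node_id for node_id, words in text.items()
            if wanted in words.lower().split()}

base_sets = {kw: find_base_set(text, kw) for kw in ["search", "ranking"]}
print(base_sets)
```

Each base set would then be passed to the PR computation for its keyword.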
Preprocessing stage (2)
Create a descending list of (nodeid,PR) pairs. Store the list in a CLOB in the PRindex table, indexed by keyword.
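Storing and reading back one keyword's list can be sketched with an in-memory SQLite database. This is an assumption-heavy illustration: the slides only say the list is kept in a CLOB, so the TEXT column, the JSON encoding, the column names, and both function names are hypothetical stand-ins.

```python
import json
import sqlite3

def store_pr_list(conn, keyword, prs):
    """Sort (nodeid, PR) pairs in descending PR order and store them."""
    ranked = sorted(prs.items(), key=lambda item: item[1], reverse=True)
    conn.execute(
        "INSERT OR REPLACE INTO PRindex (keyword, pr_list) VALUES (?, ?)",
        (keyword, json.dumps(ranked)),  # JSON text stands in for the CLOB
    )

def load_pr_list(conn, keyword):
    """Return the stored descending (nodeid, PR) list for a keyword."""
    row = conn.execute(
        "SELECT pr_list FROM PRindex WHERE keyword = ?", (keyword,)
    ).fetchone()
    return [tuple(pair) for pair in json.loads(row[0])] if row else []

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PRindex (keyword TEXT PRIMARY KEY, pr_list TEXT)")
store_pr_list(conn, "database", {"p1": 0.4, "p2": 0.9, "a1": 0.1})
print(load_pr_list(conn, "database"))
```

Keeping the list sorted at write time means the query stage can scan it from the top without sorting again.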
Query Stage
For each keyword in the input, retrieve its ( id, PR ) list from the database. Resolve the top-k ids with respect to the sum of PageRanks using Fagin’s algorithm (PODS 2001).
Fagin’s Algorithm
Input: keyword-specific PR lists sorted in descending order.
For each node, keep a maximum possible value: the sum of the PRs already extracted for the node from the scanned lists, plus the PR at the current scan position of every list in which the node has not been seen yet. Keep a minimum value: the sum of the PRs extracted for the node so far. The algorithm terminates when it has found k objects whose minimum values are greater than the maximum possible value of every other node.
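The bound-keeping procedure described above can be sketched as follows. This is an illustration of the idea and not the authors' implementation: the function name and sample lists are hypothetical, a node missing from a list is assumed to have PR 0 there, and the returned scores are the minimum values at the moment of termination.

```python
def top_k_by_sum(lists, k):
    """Top-k nodes by summed PR over descending-sorted (node, PR) lists."""
    seen = {}  # node -> {list index: PR extracted from that list}
    max_depth = max((len(lst) for lst in lists), default=0)
    for depth in range(max_depth):
        # Read the next entry of every list that still has one.
        for i, lst in enumerate(lists):
            if depth < len(lst):
                node, pr = lst[depth]
                seen.setdefault(node, {})[i] = pr
        # Largest PR an unread entry of each list could still have.
        frontier = [lst[depth][1] if depth < len(lst) else 0.0
                    for lst in lists]
        lower = {n: sum(parts.values()) for n, parts in seen.items()}
        ranked = sorted(lower, key=lower.get, reverse=True)
        top, rest = ranked[:k], ranked[k:]
        if len(top) < k:
            continue
        # Maximum possible value of every node outside the current top k.
        rest_upper = [lower[n] + sum(f for i, f in enumerate(frontier)
                                     if i not in seen[n]) for n in rest]
        rest_upper.append(sum(frontier))  # nodes not seen in any list yet
        if lower[top[-1]] > max(rest_upper):
            break  # no other node can overtake the current top k
    lower = {n: sum(parts.values()) for n, parts in seen.items()}
    return sorted(lower.items(), key=lambda item: item[1], reverse=True)[:k]

# Hypothetical keyword-specific lists.
lists = [
    [("p1", 0.9), ("p2", 0.5), ("p3", 0.1)],
    [("p2", 0.8), ("p1", 0.3), ("p4", 0.2)],
]
print(top_k_by_sum(lists, 1))
```

On this input the scan can stop after two entries per list, because no unread entry can lift another node above the current leader.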
Conclusions
We implemented a system for keyword search in databases using PageRank. It uses an index of keyword-specific Object Ranks.