Keyword Search in Databases using PageRank



      By Michael Sirivianos

          April 11, 2003
Roadmap
   PageRank: Ranking Web Pages using
    link structure
   Ranking Keyword Search Results in
    Structured Databases
       Ranking Combining Individual PageRanks
PageRank (1)
   Stanford project
   Lawrence Page, Sergey Brin, Rajeev
    Motwani, Terry Winograd.
   “The PageRank Citation Ranking:
    Bringing Order to the Web”.
   The project that started Google
PageRank (2)
   Uses the link structure of the web to
    calculate a quality ranking (PageRank) for
    each web page.
   Citation counting is a metric for measuring
    page/paper quality.
   PageRank is a more sophisticated citation
    counting method that is less prone to manipulation.
   Each page has a single PageRank,
    independent of the keyword query.
   PageRank does NOT express the relevance of
    a page to a query.
PageRank (3)
   Calculation intuition: the PageRank of page P
    increases when pages with large PageRanks
    point to P.
   The rank of a page is evenly distributed
    among its forward links.
   A problem: when two pages form a loop by
    pointing to each other but to no other page,
    and some other page points into the loop,
    then in every iteration the loop accumulates
    rank and never distributes it. This is called
    a rank sink.
PageRank is a Usage Simulation
   “Random surfer”
       Is given a random URL
       Clicks randomly on links
       After a while gets bored and gets a new
        random URL
   The frequency with which each page is visited
    corresponds to its PageRank.
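
The random surfer can be simulated directly. The sketch below is only an illustration: the four-page graph, the seed, the step count and the name `random_surfer` are assumptions, and the surfer gets bored with probability 1 - d at every step.

```python
import random

def random_surfer(links, d=0.85, steps=200_000, seed=0):
    """Simulate the random surfer and return the fraction of visits per page.

    links maps each page to the list of pages it links to.
    With probability d the surfer follows a random link on the current page;
    otherwise the surfer gets bored and jumps to a random page.
    """
    rng = random.Random(seed)
    pages = sorted(links)
    visits = {page: 0 for page in pages}
    current = rng.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        if links[current] and rng.random() < d:
            current = rng.choice(links[current])
        else:
            current = rng.choice(pages)
    return {page: visits[page] / steps for page in pages}

# Hypothetical graph: A -> B, A -> C, B -> C, C -> A, D -> C.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
fractions = random_surfer(links)
for page, fraction in fractions.items():
    print(page, round(fraction, 3))
```

Page D has no incoming links, so it is only reached by a random jump and collects the smallest share of visits.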
PageRank Calculation
PR(A) = (1-d) + d*( PR(T1)/C(T1) + … + PR(Tn)/C(Tn) )

d: damping factor, normally set to 0.85.
T1, …, Tn: pages pointing to page A.
PR(A): PageRank of page A.
PR(Ti): PageRank of page Ti.
C(Ti): the number of links going out of page Ti.

Note: the damping factor d accounts for rank sinks.
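
A minimal sketch of the formula as an iteration, assuming every page starts at 1 and all pages are updated together from the previous values. The function name `pagerank` and the three-page graph are hypothetical. The graph contains a rank sink, so the effect of d can be seen by comparing d = 1 (no damping) with d = 0.85.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over all pages T linking to A.

    links maps each page to the list of pages it links to.
    """
    pr = {page: 1.0 for page in links}
    for _ in range(iterations):
        new_pr = {page: 1.0 - d for page in links}
        for source, targets in links.items():
            for target in targets:
                # source passes d * PR(source) / C(source) along each forward link
                new_pr[target] += d * pr[source] / len(targets)
        pr = new_pr
    return pr

# Hypothetical rank sink: A and B link only to each other, C links into the loop.
sink = {"A": ["B"], "B": ["A"], "C": ["A"]}

no_damping = pagerank(sink, d=1.0)
damped = pagerank(sink, d=0.85, iterations=200)
print("d = 1.00:", no_damping)
print("d = 0.85:", damped)
```

Without damping the loop ends up holding all of the rank and page C drops to 0. With d = 0.85 every page keeps at least 0.15.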
Example of Calculation (1)
   Four pages with the following links:
       Page A → Page B
       Page A → Page C
       Page B → Page C
       Page C → Page A
       Page D → Page C
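Example of Calculation (2)
   Every page starts with a PageRank of 1 and passes on
    0.85 of it, split evenly among its forward links:
       Page A → Page B: 1*0.85/2
       Page A → Page C: 1*0.85/2
       Page B → Page C: 1*0.85
       Page C → Page A: 1*0.85
       Page D → Page C: 1*0.85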
Example of Calculation (3)
   Each page has not passed on 0.15, so we get:
       Page A: 0.85 (from Page C) + 0.15 (not transferred) = 1
       Page B: 0.425 (from Page A) + 0.15 (not transferred) = 0.575
       Page C: 0.85 (from Page D) + 0.85 (from Page B) + 0.425
        (from Page A) + 0.15 (not transferred) = 2.275
       Page D: receives none, but has not transferred 0.15 = 0.15
Example of Calculation (4)
       Page A: 2.275*0.85 (from Page C) + 0.15 (not transferred)
        = 2.08375
       Page B: 1*0.85/2 (from Page A) + 0.15 (not transferred)
        = 0.575
       Page C: 0.15*0.85 (from Page D) + 0.575*0.85 (from Page B)
        + 1*0.85/2 (from Page A) + 0.15 (not transferred)
        = 1.19125
       Page D: receives none, but has not transferred 0.15 = 0.15
Example - Conclusions
   After convergence, page C has the highest
    PageRank and page A the next highest:
    page C has the highest importance in this
    page graph.
   More iterations lead to convergence of
    the PageRanks.
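
The example can be replayed in a few lines. This is a sketch rather than the original implementation: the links are read off the worked calculation (A → B, A → C, B → C, C → A, D → C), and the helper name `step` and the stopping tolerance are assumptions.

```python
def step(pr, links, d=0.85):
    """One update of PR(A) = (1-d) + d * sum(PR(T)/C(T)) for all pages at once."""
    new_pr = {page: 1.0 - d for page in links}
    for source, targets in links.items():
        for target in targets:
            new_pr[target] += d * pr[source] / len(targets)
    return new_pr

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

start = {page: 1.0 for page in links}
first = step(start, links)    # values of Example of Calculation (3)
second = step(first, links)   # values of Example of Calculation (4)

# Keep iterating until no PageRank changes by more than a small tolerance.
converged = second
for _ in range(1000):
    following = step(converged, links)
    delta = max(abs(following[p] - converged[p]) for p in links)
    converged = following
    if delta < 1e-9:
        break

for label, values in [("iteration 1", first), ("iteration 2", second),
                      ("converged", converged)]:
    print(label, {p: round(v, 5) for p, v in values.items()})
```

The ranks move back and forth in the first iterations (page A is ahead after the second one) and settle at roughly A 1.49, B 0.78, C 1.58, D 0.15.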
Base set
   In practice, when the user gets bored, he tends to
    jump to one of his bookmarked pages instead of a
    random one. These bookmarked pages
    constitute the base set.
   The PR formula is modified to reflect this
    behavior:
    PR(A) = (1-d)*E + d*( PR(T1)/C(T1) + … + PR(Tn)/C(Tn) )

    E = 1 if A is in the base set, otherwise E = 0.
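
A small sketch of the base-set variant, assuming the same kind of iteration as for the plain formula. The function name `pagerank_with_base_set`, the graph and the choice of base set are hypothetical.

```python
def pagerank_with_base_set(links, base_set, d=0.85, iterations=200):
    """Iterate PR(A) = (1-d)*E + d * sum(PR(T)/C(T)), with E = 1 only for base-set pages."""
    pr = {page: 1.0 for page in links}
    for _ in range(iterations):
        # Only base-set pages receive the (1-d) term.
        new_pr = {page: (1.0 - d) if page in base_set else 0.0 for page in links}
        for source, targets in links.items():
            for target in targets:
                new_pr[target] += d * pr[source] / len(targets)
        pr = new_pr
    return pr

# Hypothetical graph: A -> B, A -> C, B -> C, C -> A, D -> C; only A is bookmarked.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank_with_base_set(links, base_set={"A"})
print({page: round(value, 4) for page, value in ranks.items()})
```

Page D is not in the base set and nothing links to it, so its rank is 0: rank now only flows to pages reachable from the base set.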
Roadmap
   PageRank: Ranking Web Pages using
    link structure
   Ranking Keyword Search Results in
    Structured Databases
       Ranking Combining Individual PageRanks
Keyword Query
Input: set of keywords
Output: List of nodes ranked according to
  their relevance to the keywords

Score of a result-node:
• Sum of keyword-specific PRs (OR semantics)
• Product of keyword-specific PRs (AND
  semantics)
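
The two scoring rules can be sketched as follows. The node ids and the keyword-specific PR values are made up for illustration, and `score` is a hypothetical helper name.

```python
from math import prod

def score(node, keyword_ranks, semantics="OR"):
    """Combine a node's keyword-specific PRs: sum for OR, product for AND."""
    values = [ranks.get(node, 0.0) for ranks in keyword_ranks.values()]
    return sum(values) if semantics == "OR" else prod(values)

# Hypothetical keyword-specific PRs for three nodes and two keywords.
keyword_ranks = {
    "database": {"p1": 0.5, "p2": 0.25, "p3": 0.0},
    "ranking": {"p1": 0.25, "p2": 0.0, "p3": 0.5},
}

for node in ["p1", "p2", "p3"]:
    print(node,
          "OR:", score(node, keyword_ranks, "OR"),
          "AND:", score(node, keyword_ranks, "AND"))
```

Under AND semantics a node whose PR is 0 for any keyword scores 0, while under OR semantics it can still rank well on the strength of a single keyword.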
Database Schema
   C(cid,name)          C: conference
   Y(yid,year,cid)      Y: conference year
   P(pid,title,yid)     P: paper
   A(aid,name)          A: author
   PP(pid1,pid2)
   PA(pid,aid)
   Arrows in the diagram: primary to foreign key

   Tuples in C, Y, P, A are objects that represent
    nodes in the schema graph.
   Primary to foreign key relations represent edges
    in the graph.
   All connections are two-way except P – P, which goes
    only from paper to cited paper.
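
One way to turn such tuples into graph edges is sketched below. The sample tuples, the edge type labels and the helper `build_edges` are hypothetical, and the sketch assumes that in PP(pid1,pid2) the first paper cites the second.

```python
def build_edges(years, papers, citations, authorship):
    """Turn tuples into directed, typed edges (source, target, edge type).

    Assumed tuple layouts, following the schema: Y(yid, year, cid),
    P(pid, title, yid), PP(pid1, pid2), PA(pid, aid).
    """
    edges = []

    def two_way(a, b, kind):
        edges.append((a, b, kind))
        edges.append((b, a, kind + " (reverse)"))

    for yid, _year, cid in years:
        two_way(yid, cid, "Y-C")
    for pid, _title, yid in papers:
        two_way(pid, yid, "P-Y")
    for pid, aid in authorship:
        two_way(pid, aid, "P-A")
    for citing, cited in citations:
        # Assumption: pid1 is the citing paper and pid2 the cited one.
        edges.append((citing, cited, "P-P"))
    return edges

# Hypothetical tuples, for illustration only.
years = [("y1", 2003, "c1")]
papers = [("p1", "Paper one", "y1"), ("p2", "Paper two", "y1")]
citations = [("p1", "p2")]
authorship = [("p1", "a1"), ("p2", "a1")]

edges = build_edges(years, papers, citations, authorship)
for edge in edges:
    print(edge)
```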
Architecture
   Preprocessing stage: Database → Create PR index → PRindex
       Inputs: d, edge weights, epsilon, threshold
   Query stage: PRindex → Query Module → Results
       Inputs: keywords, k
   Attributes of PRindex table:
       • Keyword
       • CLOB of (id,PR) list
   Results: list of
       • Nodeid
       • Node text
       • PR wrt all keywords
Modified PageRank Formula
If A has the keyword:
   PR(A) = (1-d) + d*( weight(T1→A)*PR(T1)/C(T1) + … + weight(Tn→A)*PR(Tn)/C(Tn) )

If A doesn’t have the keyword:
   PR(A) = d*( weight(T1→A)*PR(T1)/C(T1) + … + weight(Tn→A)*PR(Tn)/C(Tn) )
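
A sketch of the weighted, keyword-specific iteration, assuming that weights are assigned per edge type and that C(T) counts all outgoing edges of T. The edge types, weight values, node ids and the name `keyword_pagerank` are hypothetical.

```python
def keyword_pagerank(edges, weights, base_set, d=0.85, iterations=200):
    """Keyword-specific PR with edge weights.

    edges: list of (source, target, edge type); weights: edge type -> weight;
    base_set: the nodes that contain the keyword.
    """
    nodes = {n for s, t, _ in edges for n in (s, t)} | set(base_set)
    out_degree = {n: 0 for n in nodes}
    for s, _, _ in edges:
        out_degree[s] += 1
    pr = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Only nodes that contain the keyword get the (1-d) term.
        new_pr = {n: (1.0 - d) if n in base_set else 0.0 for n in nodes}
        for s, t, kind in edges:
            new_pr[t] += d * weights[kind] * pr[s] / out_degree[s]
        pr = new_pr
    return pr

# Hypothetical data: paper p1 contains the keyword, cites p2, and has author a1.
edges = [("p1", "p2", "cites"), ("p1", "a1", "written_by"), ("a1", "p1", "wrote")]
weights = {"cites": 0.7, "written_by": 0.2, "wrote": 0.2}
pr = keyword_pagerank(edges, weights, base_set={"p1"})
print({n: round(v, 4) for n, v in sorted(pr.items())})
```

With these weights the cited paper p2 receives more rank from p1 than the author a1 does, although neither contains the keyword.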
Preprocessing stage (1)
   Load the whole database into memory
       Create edges Hashtable ( nodeId, nodeId, type of
        edge )
       Create nodes Hashtable ( nodeId )
       Create text Hashtable ( nodeId, text )

   For each keyword
       Find all nodes that contain the keyword and put them
        in the base set.
       Execute the PR algorithm with this base set.
Preprocessing stage (2)
    Create a list of (nodeid, PR) pairs in descending
     PR order.
    Store the list as a CLOB in the PRindex table,
     indexed by keyword.
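
The preprocessing loop might look roughly like this. It is a sketch under several assumptions: a Python dict stands in for the PRindex table, a JSON string stands in for the CLOB, edge weights are left out, and the node ids, texts and function names are hypothetical.

```python
import json

def base_set_pagerank(links, base_set, d=0.85, iterations=100):
    """Unweighted PR in which only base-set nodes receive the (1-d) term."""
    pr = {node: 1.0 for node in links}
    for _ in range(iterations):
        new_pr = {node: (1.0 - d) if node in base_set else 0.0 for node in links}
        for source, targets in links.items():
            for target in targets:
                new_pr[target] += d * pr[source] / len(targets)
        pr = new_pr
    return pr

def build_pr_index(links, text, keywords):
    """Return keyword -> serialized (nodeid, PR) list in descending PR order."""
    index = {}
    for keyword in keywords:
        # Base set: all nodes whose text contains the keyword.
        base_set = {n for n, words in text.items() if keyword in words.lower().split()}
        ranks = base_set_pagerank(links, base_set)
        ordered = sorted(ranks.items(), key=lambda item: item[1], reverse=True)
        index[keyword] = json.dumps(ordered)
    return index

# Hypothetical nodes and texts.
links = {"n1": ["n2"], "n2": ["n1"], "n3": ["n1"]}
text = {"n1": "keyword search", "n2": "page rank", "n3": "search engines"}

index = build_pr_index(links, text, ["search", "rank"])
for keyword, blob in index.items():
    print(keyword, blob)
```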
Query Stage
   For each keyword in the input, retrieve the ( id,
    PR ) list from the database.
   Resolve the top-k ids with respect to the
    sum of PageRanks using Fagin’s
    algorithm (PODS 2001).
Fagin’s Algorithm
   Input: keyword-specific PR lists sorted in descending order
       For every node seen so far, keep a maximum possible
        value: the sum of the PRs extracted for the node so far
        from the scanned lists, plus the PR at the current scan
        position of each list in which the node has not yet
        appeared. Also keep a minimum value: the sum of the PRs
        extracted for the node so far.
       The algorithm terminates when it has found k objects
        whose minimum value is greater than the maximum
        possible value of all remaining nodes.
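
The bound-keeping scan can be sketched as below. This is an illustration, not the original implementation: it assumes the lists are read in lockstep using sorted access only, that a node missing from a list contributes 0, and that stopping is allowed when the k-th minimum value is at least as large as every other maximum. The name `top_k` and the sample lists are hypothetical.

```python
def top_k(lists, k):
    """Top-k nodes by summed PR, scanning descending (node, PR) lists in parallel.

    Returns (node, minimum value) pairs; on an early stop the minimum value
    of a node may be lower than its exact sum.
    """
    seen = {}                   # node -> {list index: PR read so far}
    last = [0.0] * len(lists)   # PR at the current scan position of each list
    max_depth = max(len(lst) for lst in lists)
    for depth in range(max_depth):
        for i, lst in enumerate(lists):
            if depth < len(lst):
                node, value = lst[depth]
                seen.setdefault(node, {})[i] = value
                last[i] = value
            else:
                last[i] = 0.0   # list exhausted: it cannot contribute any more
        lower = {n: sum(vals.values()) for n, vals in seen.items()}
        upper = {n: lower[n] + sum(last[i] for i in range(len(lists)) if i not in seen[n])
                 for n in seen}
        ranked = sorted(seen, key=lambda n: lower[n], reverse=True)
        if len(ranked) >= k:
            top, rest = ranked[:k], ranked[k:]
            # Nodes not seen at all are bounded by the sum of the current positions.
            threshold = max([upper[n] for n in rest] + [sum(last)])
            if lower[top[-1]] >= threshold:
                return [(n, lower[n]) for n in top]
    ranked = sorted(seen, key=lambda n: sum(seen[n].values()), reverse=True)
    return [(n, sum(seen[n].values())) for n in ranked[:k]]

# Hypothetical keyword-specific PR lists, each sorted in descending order.
list_1 = [("n1", 0.9), ("n2", 0.5), ("n3", 0.3), ("n4", 0.1)]
list_2 = [("n3", 0.8), ("n1", 0.6), ("n4", 0.2), ("n2", 0.1)]

result = top_k([list_1, list_2], k=2)
print(result)
```

For these lists the scan stops after three of the four positions, because by then no other node can overtake n1 and n3.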
Conclusions

   We implemented a system for keyword
    search in databases using PageRank.
   It uses an index of keyword-specific
    Object Ranks.

								