Mining the Web Graph


Marc Najork, Microsoft Research
Joint work with Sreenivas Gollapudi, Rina Panigrahy, 
         Michael Taylor and Hugo Zaragoza
    Central messages of this talk
• The web graph is big!
• It can be mined for a variety of purposes
• Different algorithms require different
  infrastructure (but the boundary is fluid)
• We do not have a good (i.e. predictive) theory
  of the semantics of hyperlinks
          The web graph is big!
• Web graph = graph induced by web pages
  (vertices) and hyperlinks (edges)
• The web has many pages
  – Infinitely many due to calendars, crawler traps, …
  – G/Y/M engines index over 10B pages each
• Number of links per page is increasing
  – 62 links/page in 2002; over 100 links/page today
• The web graph visible to the G/Y/M engines
  has > 1T links
        Uses of the web graph
• Ranking of search results
• Spam detection
• Community identification
 Two classes of web graph algorithms
• Some graph algorithms require only regular
  (streaming) access to vertices and edges
  – Examples: Computing in‐degree, PageRank, …
  – Implement on top of MapReduce/Hadoop/Dryad
• Other algorithms require random access
  – Examples: HITS, SALSA, …
  – Implement using high‐performance link database
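
To illustrate the first class, here is a minimal sketch of in‐degree counting and PageRank written so that every step is a sequential pass over the edge list. The function names, the damping factor of 0.85, the iteration count, and the toy edge list are assumptions made for illustration; a MapReduce/Hadoop/Dryad job would distribute the same passes across machines.

```python
from collections import defaultdict


def in_degrees(edge_stream):
    """Count in-links per vertex in a single pass over (source, target) pairs."""
    counts = defaultdict(int)
    for _src, dst in edge_stream:
        counts[dst] += 1
    return dict(counts)


def pagerank(edges, damping=0.85, iterations=50):
    """Power iteration; each iteration is one sequential pass over the edge list."""
    vertices = sorted({v for edge in edges for v in edge})
    out_degree = defaultdict(int)
    for src, _dst in edges:
        out_degree[src] += 1
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Rank held by vertices without out-links is spread uniformly.
        dangling = sum(rank[v] for v in vertices if out_degree[v] == 0)
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {v: base for v in vertices}
        for src, dst in edges:  # streaming pass, no neighbor lookups
            new_rank[dst] += damping * rank[src] / out_degree[src]
        rank = new_rank
    return rank


# Toy graph (hypothetical data).
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]
degrees = in_degrees(iter(edges))
ranks = pagerank(edges)
```

Neither function ever asks "which pages link to this particular page?", which is why this class of algorithm does not need a random‐access link database.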
    The Scalable Hyperlink Store
• Special‐purpose “database” for web graph
  – In‐memory for performance
  – Distributed for scalability
  – Compression scheme leverages web graph
    properties
  – Core system operational since 2005
  – Used within MSR (often for unintended purposes)
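
The slide does not say which web graph properties the compression scheme exploits. As one plausible illustration only, the sketch below assumes that page IDs are assigned so that most links point to pages with nearby IDs, and stores each sorted adjacency list as variable‐length encoded gaps. All names and the sample IDs are hypothetical; this is not a description of the actual SHS format.

```python
def encode_varint(value):
    """Encode a non-negative integer using 7 bits per byte, low bits first."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def compress_links(sorted_ids):
    """Store each ID as the gap from its predecessor (assumes ascending IDs)."""
    out = bytearray()
    previous = 0
    for page_id in sorted_ids:
        out += encode_varint(page_id - previous)
        previous = page_id
    return bytes(out)


def decompress_links(data):
    """Invert compress_links."""
    ids, current, shift, gap = [], 0, 0, 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            current += gap
            ids.append(current)
            shift, gap = 0, 0
    return ids


# Hypothetical adjacency list whose targets have nearby IDs.
links = [1_000_000, 1_000_003, 1_000_004, 1_000_120]
packed = compress_links(links)
```

Under that locality assumption most gaps fit in a single byte, so the four IDs above occupy 6 bytes instead of 16 bytes as fixed 32‐bit integers.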
       Query‐dependent link‐based
           ranking algorithms
• Approach: Project results of query onto web graph.
  Include the “neighborhood” of each result. Results,
  neighboring vertices, and edges between them form
  “neighborhood graph”. Compute score for each vertex
  in neighborhood graph (using some scoring function).
• Intuition: Result set is biased towards relevant pages;
  neighborhood graph exposes co‐citation of related
  pages.
• Best‐known algorithms: HITS (Kleinberg 1997), SALSA
  (Lempel & Moran 2000).  Scores of papers on variants.
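
The approach above can be sketched as follows, assuming a neighborhood of distance 1 in both directions and using in‐degree within the neighborhood graph as a deliberately simple stand‐in scoring function. The function names and the toy data are hypothetical; HITS or SALSA would replace `in_degree_scores`.

```python
def neighborhood_graph(results, edges):
    """Return (vertices, edges) of the distance-1 neighborhood of the results.

    edges must be a re-iterable collection of (source, target) pairs.
    """
    results = set(results)
    vertices = set(results)
    for u, v in edges:
        if u in results:
            vertices.add(v)  # v is linked to by a result
        if v in results:
            vertices.add(u)  # u links to a result
    kept = {(u, v) for u, v in edges if u in vertices and v in vertices}
    return vertices, kept


def in_degree_scores(vertices, edges):
    """Stand-in scoring function: in-degree within the neighborhood graph."""
    scores = dict.fromkeys(vertices, 0)
    for _u, v in edges:
        scores[v] += 1
    return scores


# Toy web graph and result set (hypothetical data).
web_edges = [("a", "r1"), ("b", "r1"), ("r1", "c"),
             ("b", "c"), ("x", "y"), ("r2", "c")]
vertices, kept = neighborhood_graph(["r1", "r2"], web_edges)
scores = in_degree_scores(vertices, kept)
```

Page `c` is not a query result, yet it scores highest because both results and a neighbor of a result link to it; this is the co‐citation effect the neighborhood graph is meant to expose.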
Free parameters in such algorithms
• How far should neighborhood extend?
  Typical choice: distance 1 in both directions.
• Exclude any neighbors?
  Typical choice: exclude neighbors on same host/domain.
• Take all other neighbors or sample them?
  Conventional: take all descendants; sample 50 ancestors.
  Better: sample both.
• How to sample?
  Conventional: Uniformly at random.
  Better: Consistently, e.g. using min‐wise hashing.
• Are edges weighted? What determines weight?
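
A minimal sketch of consistent sampling in the spirit of min‐wise hashing: every page gets a fixed pseudo‐random hash value, and the sample of a set is its k elements with the smallest values. The choice of BLAKE2 as the hash function and the names below are assumptions for illustration.

```python
import hashlib


def stable_hash(item):
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def consistent_sample(items, k):
    """Return the k distinct items with the smallest hash values."""
    return sorted(set(items), key=stable_hash)[:k]


# Hypothetical ancestor sets of two pages; the second contains the first.
small = [f"page{i}" for i in range(100)]
large = small + [f"extra{i}" for i in range(100)]
sample_small = consistent_sample(small, 10)
sample_large = consistent_sample(large, 10)
```

Unlike uniform random sampling, the outcome depends only on the set itself: repeating the sampling gives the same result, and any page that survives in the sample of the larger set and also belongs to the smaller set is guaranteed to be in the smaller set's sample. Two result pages with overlapping neighbors therefore tend to keep the same neighbors, which preserves co‐citation edges.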
              The SALSA algorithm
•   Determine “neighborhood graph”
•   Perform one‐step‐back/one‐step‐forward random walk on graph
•   Stationary probability distribution = SALSA authority score
•   More formally: Given result set R and web graph (V,E), define 
    neighborhood graph (B,N) as follows:


• Compute SALSA authority score A(u) as follows:
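
A minimal sketch of the one‐step‐back/one‐step‐forward walk, assuming the commonly used formulation: scores start uniform over all vertices with at least one in‐link, and each round moves a vertex's score backward over a uniformly chosen in‐link and then forward over a uniformly chosen out‐link. The function name, iteration count, and toy edges are hypothetical.

```python
from collections import defaultdict


def salsa_authority(edges, iterations=50):
    """Approximate the stationary distribution of the SALSA authority walk."""
    edges = set(edges)
    in_deg = defaultdict(int)
    out_deg = defaultdict(int)
    for v, u in edges:
        out_deg[v] += 1
        in_deg[u] += 1
    authorities = list(in_deg)  # vertices with at least one in-link
    score = {u: 1.0 / len(authorities) for u in authorities}
    for _ in range(iterations):
        # Step back: each authority w splits its score over its in-links.
        hub_mass = defaultdict(float)
        for v, w in edges:
            hub_mass[v] += score[w] / in_deg[w]
        # Step forward: each hub v splits its mass over its out-links.
        new_score = dict.fromkeys(authorities, 0.0)
        for v, u in edges:
            new_score[u] += hub_mass[v] / out_deg[v]
        score = new_score
    return score


# Toy neighborhood graph (hypothetical data): hubs a, b, c; authorities x, y, z.
toy_edges = [("a", "x"), ("a", "y"), ("b", "y"), ("c", "y"), ("c", "z")]
authority = salsa_authority(toy_edges)
```

On this connected toy graph the scores converge to in‐degree divided by the number of edges (0.2, 0.6, 0.2 for x, y, z).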
      Digression: Evaluating ranking
               algorithms
• Accepted approach: Compile “truth set”:
  queries and (totally or partially) ranked results
  – Variant A: Employ human “assessors” to judge
    quality of results – approach used by e.g. TREC
  – Variant B: Mine search engine result clickthroughs
• Run ranking algorithm against truth set and
  measure similarity to “ideal” ranking
  – Many different evaluation measures: precision,
    recall, average precision, reciprocal rank,
    normalized discounted cumulative gain, …
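
Two of these measures can be sketched as follows. The gain function 2^g − 1 and the log2 rank discount are one common NDCG variant and are assumptions here, as are the function names; the talk does not state which variant was used.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain of the first k judgments in ranked order."""
    return sum((2 ** g - 1) / math.log2(rank + 2)
               for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_gains, judged_gains, k=10):
    """DCG of the ranking divided by the DCG of the ideal ordering.

    ranked_gains: judgments of the returned results, in ranked order.
    judged_gains: judgments of all results in the truth set for the query.
    """
    ideal = dcg_at_k(sorted(judged_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0


def reciprocal_rank(ranked_gains):
    """1 / rank of the first relevant result, or 0 if there is none."""
    for rank, g in enumerate(ranked_gains, start=1):
        if g > 0:
            return 1.0 / rank
    return 0.0


# Hypothetical judgments on a 0..3 scale.
perfect = ndcg_at_k([3, 2, 1], [3, 2, 1])
swapped = ndcg_at_k([1, 3], [3, 1])
```

A ranking that matches the ideal ordering scores 1; placing the better result second lowers the score because its gain is discounted by rank.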
      Effect of sampling parameters
         on SALSA’s effectiveness
• Evaluating SALSA on 18B edge graph and 28K query
  truth set, varying sampling parameters m and n.
                      1          2           3          4          5
           1   0.180330   0.179877    0.179256   0.178859   0.178130
           2   0.181983   0.181579    0.181088   0.181079   0.180624
           3   0.181019   0.180908    0.180393   0.180463   0.180136
           4   0.179741   0.180132    0.180059   0.180302   0.179881
           5   0.179329   0.179958    0.180103   0.180094   0.179902
                                     NDCG@10

• Effectiveness is non‐monotonic in the sampling
  parameters m and n!
• Confirmed this phenomenon on other data sets.
• So far no insight into the cause of this anomaly.
  Making random accesses regular
• SALSA algorithm was evaluated using SHS (the Scalable
  Hyperlink Store)
   – Query results are “random”, so extracting result
     neighborhood from full web graph exhibits random access
     pattern
• Idea: Pre‐compute “summary” of neighborhood of
  each page on the web; combine neighborhood
  summaries of query results to form approximate
  neighborhood graph; compute (e.g.) SALSA on that.
• Just as effective!
• More efficient: can use (e.g.) MapReduce to compute
  neighborhood summaries, summary server to retrieve
  them at run‐time.
 Summarizing a page’s neighborhood
    (at index construction time)
• For each page p, determine set I(p) of pages
  linking to p, and set O(p) of pages linked to by p.
• Consistently sample e.g. 1000 elements from I(p)
  and insert them into a Bloom filter. Consistently
  sub‐sample e.g. 5 elements and retain their IDs.
  Ditto for O(p).
• Summary of p: 2 Bloom filters plus 2 short lists of
  IDs.  Average size: 380 bytes for 2×5 explicit
  samples.
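
The construction above can be sketched as follows. The Bloom filter size (8192 bits), the number of hash functions (4), the use of BLAKE2, and all names are assumptions chosen for readability; they are not tuned to reproduce the 380‐byte average reported on the slide.

```python
import hashlib


def stable_hash(item, seed=0):
    """Deterministic 64-bit hash of an item under a given seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


class BloomFilter:
    """Set summary with false positives but no false negatives."""

    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        return [stable_hash(item, seed) % self.num_bits
                for seed in range(1, self.num_hashes + 1)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def consistent_sample(items, k):
    """The k distinct items with the smallest hash values."""
    return sorted(set(items), key=stable_hash)[:k]


def summarize_links(neighbors, filter_size=1000, explicit_size=5):
    """Return (Bloom filter of sampled neighbors, short list of explicit IDs)."""
    sampled = consistent_sample(neighbors, filter_size)
    bloom = BloomFilter()
    for page in sampled:
        bloom.add(page)
    # sampled is ordered by hash, so its prefix is a consistent sub-sample.
    return bloom, sampled[:explicit_size]


def summarize_page(in_links, out_links):
    """Summary of a page: one (filter, IDs) pair per link direction."""
    return {"in": summarize_links(in_links), "out": summarize_links(out_links)}


# Hypothetical page with 2000 in-links and 2 out-links.
summary = summarize_page(range(2000), ["x", "y"])
in_filter, in_ids = summary["in"]
out_filter, out_ids = summary["out"]
```

Because each summary depends only on one page's own links, the summaries can be computed independently per page in a batch job.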
  Using summaries to compute SALSA
           (at query time)
• Retrieve summary of each page in result set R.
• Vertex set C of approximate neighborhood graph
  consists of R plus all explicitly stored samples in
  summaries.
• For each p in R, if Bloom filter representing 1000
  samples of I(p) contains a q in C, add edge q→p
  to approximate neighborhood graph.  Analogous
  for O(p).
• Compute SALSA (or HITS, or MAX, or …) on this
  approximate neighborhood graph.
• Effectiveness the same as original SALSA! (Why?)
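
The query‐time steps above can be sketched as follows. The layout of a summary (a dictionary with an "in" and an "out" entry, each a pair of membership structure and explicit ID list) is an assumption, and the demo uses `frozenset` objects as exact stand‐ins for the Bloom filters so that the example stays short; a real Bloom filter could additionally report false‐positive edges.

```python
def approximate_neighborhood(summaries, results):
    """Build the approximate neighborhood graph from per-page summaries.

    summaries: page -> {"in": (membership, ids), "out": (membership, ids)},
    where membership supports the `in` operator (e.g. a Bloom filter).
    """
    # Vertex set C: the results plus all explicitly stored samples.
    vertices = set(results)
    for p in results:
        vertices.update(summaries[p]["in"][1])
        vertices.update(summaries[p]["out"][1])
    # Edge set: test every vertex in C against each result's filters.
    edges = set()
    for p in results:
        in_filter = summaries[p]["in"][0]
        out_filter = summaries[p]["out"][0]
        for q in vertices:
            if q in in_filter:
                edges.add((q, p))
            if q in out_filter:
                edges.add((p, q))
    return vertices, edges


# Hypothetical summaries for two result pages.
summaries = {
    "r1": {"in": (frozenset({"a", "b"}), ["a"]),
           "out": (frozenset({"c"}), ["c"])},
    "r2": {"in": (frozenset({"b", "r1"}), ["b"]),
           "out": (frozenset({"c", "d"}), ["c", "d"])},
}
approx_vertices, approx_edges = approximate_neighborhood(summaries, ["r1", "r2"])
```

Note the edge b→r1: page b is not among r1's explicitly stored samples, but it enters C through r2's summary and is then found in r1's filter. The filters thus recover edges between vertices that the short explicit lists alone would miss.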
   Putting SALSA into perspective
• Comparing different link‐based ranking
  algorithms, using same graph and query set
  (bar chart; y‐axis: NDCG@10, from 0.00 to 0.25)

  Algorithm                    NDCG@10
  approximate SALSA            0.183
  SALSA [id,cs,2,1]            0.182
  SALSA [id,cs,8,∞]            0.173
  SALSA [id,rs,3,∞]            0.158
  HITS [id,cs,25,∞]            0.108
  inter‐domain in‐degree       0.106
  HITS [id,rs,25,∞]            0.104
  PageRank                     0.092
  Random                       0.011
  In need of a theory of hyperlinks
• Wanted: A theory of the semantics of hyperlinks
  – Something that goes beyond the statement
    “hyperlinks can be viewed as peer endorsements”
  – Theory should explain the effectiveness of ranking
    algorithms
  – Theory should be predictive
• Computer scientists are ill‐equipped to formulate
  such a theory
  – Link creation is a human activity (directly or indirectly)
  – Social sciences study human activities and motives
  – See Raghavan’s “New sciences for a new web” talks
            Thanks & Questions
• Some papers:
  – M. Najork, H. Zaragoza, M. Taylor. HITS on the web: how
    does it compare? SIGIR 2007.
  – M. Najork. Comparing the effectiveness of HITS and
    SALSA. CIKM 2007.
  – S. Gollapudi, M. Najork, R. Panigrahy. Using Bloom filters
    to speed up HITS‐like ranking algorithms. WAW 2007.
• Link: http://research.microsoft.com/~najork

						