Mining the Web Graph
Marc Najork, Microsoft Research
Joint work with Sreenivas Gollapudi, Rina Panigrahy,
Michael Taylor and Hugo Zaragoza
Central messages of this talk
• The web graph is big!
• It can be mined for a variety of purposes
• Different algorithms require different
infrastructure (but the boundary is fluid)
• We do not have a good (i.e. predictive) theory
of the semantics of hyperlinks
The web graph is big!
• Web graph = graph induced by web pages
(vertices) and hyperlinks (edges)
• The web has many pages
– Infinitely many due to calendars, crawler traps, …
– G/Y/M engines index over 10B pages each
• Number of links per page is increasing
– 62 links/page in 2002; over 100 links/page today
• The web graph visible to the G/Y/M engines
has > 1T links
Uses of the web graph
• Ranking of search results
• Spam detection
• Community identification
Two classes of web graph algorithms
• Some graph algorithms require only regular
(streaming) access to vertices and edges
– Examples: computing in‐degree, PageRank, …
– Implement on top of MapReduce/Hadoop/Dryad
• Other algorithms require random access
– Examples: HITS, SALSA, …
– Implement using a high‐performance link database
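To illustrate what "regular (streaming) access" can look like, here is a minimal single-machine sketch in which every computation is a sequential pass over an edge list. The function names, the damping factor of 0.85, and the uniform redistribution of rank from pages without out-links are assumptions made for illustration; the talk does not specify them, and a MapReduce-style system would run each pass as a distributed job.

```python
from collections import Counter, defaultdict

def in_degrees(edges):
    """Count in-links with one sequential pass over (source, target) pairs."""
    return Counter(dst for _, dst in edges)

def pagerank(edges, num_iters=50, damping=0.85):
    """Power iteration in which each iteration is one pass over the edges."""
    edges = list(edges)
    vertices = {v for edge in edges for v in edge}
    out_deg = Counter(src for src, _ in edges)
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(num_iters):
        incoming = defaultdict(float)
        for src, dst in edges:  # the streaming pass: no random access to the graph
            incoming[dst] += rank[src] / out_deg[src]
        # Assumption: rank held by pages without out-links is spread uniformly.
        dangling = sum(rank[v] for v in vertices if out_deg[v] == 0)
        rank = {v: (1 - damping) / n + damping * (incoming[v] + dangling / n)
                for v in vertices}
    return rank

edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
print(in_degrees(edges))
print(pagerank(edges))
```

Neither function ever asks for the neighbors of one particular vertex, which is why this class of algorithm fits batch frameworks.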
The Scalable Hyperlink Store
• Special‐purpose “database” for web graph
– In‐memory for performance
– Distributed for scalability
– Compression scheme leverages web graph
properties
– Core system operational since 2005
– Used within MSR (often for unintended purposes)
Query‐dependent link‐based ranking algorithms
• Approach: Project results of query onto web graph.
Include the “neighborhood” of each result. Results,
neighboring vertices, and edges between them form
“neighborhood graph”. Compute score for each vertex
in neighborhood graph (using some scoring function).
• Intuition: Result set is biased towards relevant pages;
neighborhood graph exposes co‐citation of related
pages.
• Best‐known algorithms: HITS (Kleinberg 1997), SALSA
(Lempel & Moran 2000). Scores of papers on variants.
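The neighborhood-graph construction described above can be sketched as follows. This is an illustrative version only: it extends the neighborhood to distance 1 in both directions, takes all neighbors without sampling or host-based exclusion, and the function name `neighborhood_graph` is hypothetical.

```python
def neighborhood_graph(results, edges):
    """Return (vertices, induced_edges) around a query's result set.

    results: iterable of page IDs returned for the query.
    edges:   iterable of (source, target) hyperlinks of the web graph.
    """
    edges = list(edges)
    results = set(results)
    vertices = set(results)
    for src, dst in edges:
        if dst in results:
            vertices.add(src)  # src links to a result
        if src in results:
            vertices.add(dst)  # a result links to dst
    # Keep every edge whose endpoints both lie in the neighborhood.
    induced = {(s, d) for s, d in edges if s in vertices and d in vertices}
    return vertices, induced

web = [("a", "r"), ("r", "b"), ("a", "b"), ("c", "d")]
print(neighborhood_graph(["r"], web))
```

On the full web graph, looking up the neighbors of each result is a random-access operation, in contrast to the edge scan this toy version uses.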
Free parameters in such algorithms
• How far should neighborhood extend?
Typical choice: distance 1 in both directions.
• Exclude any neighbors?
Typical choice: exclude neighbors on same host/domain.
• Take all other neighbors or sample them?
Conventional: take all descendants; sample 50 ancestors.
Better: sample both.
• How to sample?
Conventional: Uniformly at random.
Better: Consistently, e.g. using min‐wise hashing.
• Are edges weighted? What determines weight?
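One way to read "sample consistently" is sketched below: every page ID is hashed with a fixed function, and a sample keeps the k IDs with the smallest hash values. This is a bottom-k variant in the spirit of min-wise hashing; the talk does not give the exact scheme, and the SHA-256 hash and the function names are assumptions.

```python
import hashlib

def page_hash(page_id):
    """Deterministic hash of a page ID, identical on every call and machine."""
    digest = hashlib.sha256(str(page_id).encode()).digest()
    return int.from_bytes(digest[:8], "big")

def consistent_sample(neighbors, k):
    """Keep the k neighbors with the smallest hash values."""
    return sorted(set(neighbors), key=page_hash)[:k]

# Two result pages that share most of their ancestors.
ancestors_p = [f"page{i}" for i in range(0, 200)]
ancestors_q = [f"page{i}" for i in range(50, 250)]
sample_p = consistent_sample(ancestors_p, 50)
sample_q = consistent_sample(ancestors_q, 50)
print(len(set(sample_p) & set(sample_q)), "sampled ancestors in common")
```

Because the choice depends only on the hash of each ID, two pages with overlapping neighbor sets tend to retain the same neighbors, so co-citation survives sampling. Independent uniform samples would share far fewer neighbors.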
The SALSA algorithm
• Determine “neighborhood graph”
• Perform one‐step‐back/one‐step‐forward random walk on graph
• Stationary probability distribution = SALSA authority score
• More formally: Given result set R and web graph (V,E), define
neighborhood graph (B,N) as follows:
• Compute SALSA authority score A(u) as follows:
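The formulas for this slide did not survive in the transcript. As a stand-in, the sketch below implements the one-step-back/one-step-forward random walk from the bullets above by power iteration. It follows the usual textbook description of SALSA rather than the slide's exact notation; the function name, the uniform start distribution, and the fixed iteration count are assumptions.

```python
from collections import defaultdict

def salsa_authority(edges, num_iters=50):
    """Approximate SALSA authority scores on a neighborhood graph.

    edges: iterable of (v, u) pairs meaning v links to u.
    Only vertices with at least one in-link receive a score.
    """
    preds = defaultdict(set)  # u -> pages linking to u
    succs = defaultdict(set)  # v -> pages v links to
    for v, u in set(edges):
        succs[v].add(u)
        preds[u].add(v)
    authorities = list(preds)
    if not authorities:
        return {}
    score = {u: 1.0 / len(authorities) for u in authorities}
    for _ in range(num_iters):
        nxt = dict.fromkeys(authorities, 0.0)
        for w in authorities:
            back = score[w] / len(preds[w])      # step back along an in-link
            for v in preds[w]:
                forward = back / len(succs[v])   # step forward along an out-link
                for u in succs[v]:
                    nxt[u] += forward
        score = nxt
    return score

print(salsa_authority([("h1", "a"), ("h1", "b"), ("h2", "b")]))
```

On this small connected example the scores settle at 1/3 for `a` and 2/3 for `b`, in proportion to their in-degrees.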
Digression: Evaluating ranking algorithms
• Accepted approach: Compile “truth set” –
queries and (totally or partially) ranked results
– Variant A: Employ human “assessors” to judge
quality of results – approach used by e.g. TREC
– Variant B: Mine search engine result clickthroughs
• Run ranking algorithm against truth set and
measure similarity to “ideal” ranking
– Many different evaluation measures: precision,
recall, average precision, reciprocal rank,
normalized discounted cumulative gain, …
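As an example of one such measure, here is a small sketch of normalized discounted cumulative gain at cut-off k. It uses one common formulation (gain 2^label − 1, discount log2(position + 1)); the talk does not say which variant was used, and the function names are illustrative.

```python
import math

def dcg_at_k(labels, k=10):
    """Discounted cumulative gain of relevance labels listed in ranked order."""
    return sum((2 ** label - 1) / math.log2(pos + 2)
               for pos, label in enumerate(labels[:k]))

def ndcg_at_k(ranked_labels, judged_labels=None, k=10):
    """DCG of the ranking divided by the DCG of the ideal ordering.

    judged_labels: labels of all judged results for the query; defaults to
    the labels of the ranked results themselves.
    """
    if judged_labels is None:
        judged_labels = ranked_labels
    ideal = dcg_at_k(sorted(judged_labels, reverse=True), k)
    return dcg_at_k(ranked_labels, k) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([3, 2, 1]))  # ranking already in ideal order
print(ndcg_at_k([1, 3, 2]))  # best result not ranked first
```

A per-query score like this would be averaged over all queries in the truth set.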
Effect of sampling parameters on SALSA’s effectiveness
• Evaluating SALSA on 18B edge graph and 28K query
truth set, varying sampling parameters m and n.

NDCG@10 for sampling parameter values 1 to 5 (rows and columns)
        1         2         3         4         5
1   0.180330  0.179877  0.179256  0.178859  0.178130
2   0.181983  0.181579  0.181088  0.181079  0.180624
3   0.181019  0.180908  0.180393  0.180463  0.180136
4   0.179741  0.180132  0.180059  0.180302  0.179881
5   0.179329  0.179958  0.180103  0.180094  0.179902

• Effectiveness non‐monotonic with sampling
parameters m and n!
• Confirmed this phenomenon on other data sets.
• So far no insight into the cause of this anomaly.
Making random accesses regular
• SALSA algorithm was evaluated using SHS
– Query results are “random”, so extracting result
neighborhood from full web graph exhibits random access
pattern
• Idea: Pre‐compute “summary” of neighborhood of
each page on the web; combine neighborhood
summaries of query results to form approximate
neighborhood graph; compute (e.g.) SALSA on that.
• Just as effective!
• More efficient – can use (e.g.) MapReduce to compute
neighborhood summaries, summary server to retrieve
them at run‐time.
Summarizing a page’s neighborhood
(at index construction time)
• For each page p, determine set I(p) of pages
linking to p, and set O(p) of pages linked to by p.
• Consistently sample e.g. 1000 elements from I(p)
and insert them into a Bloom filter. Consistently
sub‐sample e.g. 5 elements and retain their IDs.
Ditto for O(p).
• Summary of p: 2 Bloom filters plus 2 short lists of
IDs. Average size: 380 bytes for 2×5 explicit
samples.
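The summarization step might look roughly like the sketch below. The Bloom filter size (8192 bits), the number of hash functions (4), the SHA-256-based hashing, and all names are assumptions chosen for readability. They are not the parameters behind the 380-byte figure quoted above.

```python
import hashlib

def _hash(item, seed):
    """Deterministic 64-bit hash of an item under a given seed."""
    digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def consistent_sample(items, k):
    """Keep the k items with the smallest hash values."""
    return sorted(set(items), key=lambda x: _hash(x, "sample"))[:k]

class BloomFilter:
    """Tiny Bloom filter backed by a Python integer used as a bit array."""
    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        return [_hash(item, i) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

def summarize(neighbors, filter_size=1000, explicit_size=5):
    """Return (Bloom filter over a consistent sample, explicit sub-sample)."""
    sampled = consistent_sample(neighbors, filter_size)
    bloom = BloomFilter()
    for page in sampled:
        bloom.add(page)
    return bloom, sampled[:explicit_size]

in_links = [f"page{i}" for i in range(2000)]
in_filter, in_ids = summarize(in_links)
print(in_ids, all(page in in_filter for page in in_ids))
```

The full summary of a page p would be `summarize(I(p))` together with `summarize(O(p))`. Taking the first five entries of the hash-ordered sample makes the explicit sub-sample consistent as well.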
Using summaries to compute SALSA
(at query time)
• Retrieve summary of each page in result set R.
• Vertex set C of approximate neighborhood graph
consists of R plus all explicitly stored samples in
summaries.
• For each p in R, if Bloom filter representing 1000
samples of I(p) contains a q in C, add edge q→p
to approximate neighborhood graph. Analogous
for O(p).
• Compute SALSA (or HITS, or MAX, or …) on this
approximate neighborhood graph.
• Effectiveness the same as original SALSA! (Why?)
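The query-time steps above can be sketched as follows. To keep the example short, plain Python sets stand in for the Bloom filters, so no false positives occur here. The summary layout `(in_filter, in_ids, out_filter, out_ids)`, the function name, and the skipping of self-loops are assumptions.

```python
def approximate_graph(result_set, summaries):
    """Build the approximate neighborhood graph from per-page summaries.

    summaries[p] = (in_filter, in_ids, out_filter, out_ids); the filters
    only need to support the `in` operator.
    """
    result_set = list(result_set)
    # Vertex set C: the results plus all explicitly stored sample IDs.
    vertices = set(result_set)
    for p in result_set:
        _, in_ids, _, out_ids = summaries[p]
        vertices.update(in_ids)
        vertices.update(out_ids)
    edges = set()
    for p in result_set:
        in_filter, _, out_filter, _ = summaries[p]
        for q in vertices:
            if q == p:
                continue  # assumption: ignore self-loops
            if q in in_filter:
                edges.add((q, p))  # q probably links to p
            if q in out_filter:
                edges.add((p, q))  # p probably links to q
    return vertices, edges

summaries = {
    "r1": ({"a", "x"}, ["a"], {"r2"}, []),
    "r2": ({"a", "r1"}, ["a"], set(), []),
}
print(approximate_graph(["r1", "r2"], summaries))
```

In the example, page `x` is present in the filter for `r1` but is never stored explicitly, so it does not become a vertex. Only filter hits on members of C become edges.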
Putting SALSA into perspective
[Figure: bar chart of NDCG@10 (vertical axis 0.00 to 0.25) for nine rankers: Random, PageRank, inter‐domain in‐degree, two HITS variants, and four SALSA variants (one of them approximate). Bar values: .183, .182, .173, .158, .108, .106, .104, .092, .011. Variant labels include [id,cs,8,∞], [id,rs,3,∞], [id,cs,2,1], [id,cs,25,∞], [id,rs,25,∞].]
• Comparing different link‐based ranking
algorithms, using same graph and query set
In need of a theory of hyperlinks
• Wanted: A theory of the semantics of hyperlinks
– Something that goes beyond the statement
“hyperlinks can be viewed as peer endorsements”
– Theory should explain the effectiveness of ranking
algorithms
– Theory should be predictive
• Computer scientists are ill‐equipped to formulate
such a theory
– Link creation is a human activity (directly or indirectly)
– Social sciences study human activities and motives
– See Raghavan’s “New sciences for a new web” talks
Thanks & Questions
• Some papers:
– M. Najork, H. Zaragoza, M. Taylor. HITS on the web: how
does it compare? SIGIR 2007.
– M. Najork. Comparing the effectiveness of HITS and
SALSA. CIKM 2007.
– S. Gollapudi, M. Najork, R. Panigrahy. Using Bloom filters
to speed up HITS‐like ranking algorithms. WAW 2007.
• Link: http://research.microsoft.com/~najork