Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery
Soumen Chakrabarti, IBM Almaden
Joint work with: Martin van den Berg (Xerox), Byron Dom (IBM), David Gibson (Berkeley)
Funded by Global Web Solutions, IBM Atlanta

Portals and portholes
- Popular search portals and directories
  - Useful for generic needs
  - Difficult to do serious research
- Information needs of net-savvy users are getting very sophisticated
- Relatively little business incentive
- Need handmade specialty sites: portholes
- Resource discovery must be personalized

Quote
"The emergence of portholes will be one of the major Internet trends of 1999. As people become more savvy users of the Net, they want things which are better focused on meeting their specific needs. We're going to see a whole lot more of this, and it's going to potentially erode the user base of some of the big portals."
Jim Hake (Founder, Global Information Infrastructure Awards)

Quote
"The most interesting trend is the growing sense of natural limits, a recognition that covering a single galaxy can be more practical, and useful, than trying to cover the entire universe."
Dan Gillmor (Tech Columnist, San Jose Mercury News)

Scenario
- Disk drive research group wants to track magnetic surface technologies
- Compiler research group wants to trawl the web for graduate student resumés
- ____ wants to enhance his/her collection of bookmarks about ____ with prominent and relevant links
- Virtual libraries like the Open Directory Project and the Mining Co.

Goal
- Automatically construct a focused portal (porthole) containing resources that are
  - Relevant to the user’s focus of interest
  - Of high influence and quality
  - Collectively comprehensive

Tools at hand
- Keyword search engines
  - Synonymy, polysemy
  - Abundance, lack of quality
- Hand-compiled topic directories
  - Labor intensive, subjective judgements
- Resources automatically located using keyword search and link graph distillation
  - Dependence on large crawls and indices

Estimating popularity
- Extensive research on social network theory
  - Wasserman and Faust
- Hyperlink based
  - Large in-degree indicates popularity/authority
  - Not all votes are worth the same
- Several similar ideas and refinements
  - Google (Page and Brin) and HITS (Kleinberg)
  - CLEVER (Chakrabarti et al.)
  - Topic distillation (Bharat and Henzinger)

Topic distillation overview
- Given web graph and query
- Search engine selects sub-graph
- Expansion, pruning and edge weights
- Nodes iteratively transfer authority to cited neighbors
[Figure: the Web, a search engine and a query, leading to the selected subgraph]
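
The iterative transfer of authority can be illustrated with a minimal sketch in the style of Kleinberg's HITS, run on a tiny hand-made subgraph. The edge list and node names are hypothetical, and the sketch leaves out the expansion, pruning and edge-weighting steps, so it is not the distiller used in this work.

```python
import math

def hits(edges, iterations=50):
    """Kleinberg-style hub/authority iteration over (source, target) links."""
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages citing it.
        auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        # A page's hub score is the sum of the authority scores of pages it cites.
        hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            hub[src] += auth[dst]
        # Normalize both vectors to unit length so scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth

# Hypothetical subgraph: three hub pages citing two candidate authorities.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h3", "a1")]
hub, auth = hits(edges)
best_authority = max(auth, key=auth.get)
```

Here `a1` ends up as the top authority because three hubs cite it, and `h1` is the strongest hub because it cites both authorities.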

Preliminary approach
- Use topic distillation for focused crawling
  - Each node in topic taxonomy is a query
  - Query is refined by trial-and-error
  - Topic distillation runs at each node
- E.g.: European airlines
  - +swissair +iberia +klm

Query construction
/Companies/Electronics/Power_Supply
+"power suppl*" "switch* mode" smps -multiprocessor*
"uninterrupt* power suppl*" +ups -parcel*

Query complexity
- Complex queries (966 trials)
  - Average words 7.03
  - Average operators (+ * - ") 4.34
- Typical Alta Vista queries are much simpler [Silverstein, Henzinger, Marais and Moricz]
  - Average query words 2.35
  - Average operators (+ * - ") 0.41
- Forcibly adding a hub or authority node helped in 86% of the queries
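
As a rough illustration of how such averages might be tallied, the sketch below counts words and operators in a single query. The counting convention is an assumption (each leading + or -, each * wildcard and each quoted phrase counts as one operator); the slides do not state how operators were counted.

```python
import re

def query_stats(query: str) -> tuple[int, int]:
    """Return (word_count, operator_count) under the assumed convention."""
    words = re.findall(r"[A-Za-z0-9]+", query)
    # Leading + or - on a term (start of query or after whitespace).
    prefix_ops = len(re.findall(r"(?:^|\s)[+-]", query))
    wildcards = query.count("*")
    phrases = len(re.findall(r'"[^"]*"', query))
    return len(words), prefix_ops + wildcards + phrases

simple = query_stats("+swissair +iberia +klm")
complex_ = query_stats('+"power suppl*" "switch* mode" smps -multiprocessor*')
```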

Problems with preliminary approach
- Difficulty of query construction
- Dependence on large web crawl and index
  - System = crawler + index + distiller
- Unreliability of keyword match
  - Engines differ significantly on a given query due to small overlap [Bharat and Broder]
  - Narrow, arbitrary view of relevant subgraph
  - Topic model does not improve over time
- Lack of output sensitivity

Output sensitivity
- Say the goal is to find a comprehensive collection of recreational and competitive bicycling sites and pages
- Ideally effort should scale with size of the result
- Time spent crawling and indexing sites unrelated to the topic is wasted
- Likewise, time that does not improve comprehensiveness is wasted

Proposed solution
- Resource discovery system that can be customized to crawl for any topic by giving examples
- Hypertext mining algorithms learn to recognize pages and sites about the given topic, and a measure of their goodness
- Crawler has guidance hooks controlled by these two scores

Advantages
- No need for query formulation: the system learns from examples
- No dependence on global crawls
- Specialized, deep and up-to-date web exploration
- Modest desktop hardware adequate

Administration scenario
[Screenshot: Taxonomy Editor showing Current Examples, a Drag operation, and Suggested Additional Examples]

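Relevance
[Figure: topic taxonomy rooted at All, with nodes including Recreation, Arts, Bus&Econ, Companies, Bike Shops, Cycling, Clubs and Mt. Biking; the legend distinguishes path nodes, good nodes and subsumed nodes]
$\Pr[d \text{ is good}] = \sum_{\text{good}(c)} \Pr[c \mid d]$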
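
A minimal sketch of the relevance sum, assuming the classifier returns a posterior over leaf classes and that a leaf counts toward the sum when it is a good node or is subsumed by one. The taxonomy paths and probabilities are hypothetical, loosely modeled on the figure's labels.

```python
def relevance(leaf_posterior, good_nodes):
    """Sum Pr[c | d] over leaf classes that are good or subsumed by a good node."""
    def is_good(path):
        return any(path == g or path.startswith(g + "/") for g in good_nodes)
    return sum(p for path, p in leaf_posterior.items() if is_good(path))

# Hypothetical posterior Pr[c | d] for one document d.
leaf_posterior = {
    "Recreation/Cycling/Clubs": 0.5,
    "Recreation/Cycling/Mt.Biking": 0.25,
    "Bus&Econ/Companies/Bike Shops": 0.125,
    "Arts": 0.125,
}
score = relevance(leaf_posterior, good_nodes={"Recreation/Cycling"})
```

With `Recreation/Cycling` marked good, the two cycling leaves contribute and the document's relevance is 0.75.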

Classification
- How relevant is a document w.r.t. a class?
  - Supervised learning, filtering, classification, categorization
- Many types of classifiers
  - Bayesian, nearest neighbor, rule-based
- Hypertext
  - Both text and links are class-dependent clues
  - How to model link-based features?
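
To make the Bayesian option concrete, here is a minimal multinomial naive Bayes sketch that learns from labeled example texts and returns Pr[c | d]. The training examples, whitespace tokenization and Laplace smoothing are assumptions chosen for brevity; this is a text-only model, not the hypertext classifier described in this talk.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (class_label, text) pairs."""
    class_docs = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for label, text in examples:
        terms = text.lower().split()
        class_docs[label] += 1
        term_counts[label].update(terms)
        vocab.update(terms)
    return class_docs, term_counts, vocab

def posterior(model, text):
    """Return Pr[c | d] for every class, with Laplace-smoothed term estimates."""
    class_docs, term_counts, vocab = model
    total_docs = sum(class_docs.values())
    log_scores = {}
    for label in class_docs:
        denom = sum(term_counts[label].values()) + len(vocab)
        score = math.log(class_docs[label] / total_docs)  # class prior
        for term in text.lower().split():
            if term in vocab:  # terms never seen in training are ignored
                score += math.log((term_counts[label][term] + 1) / denom)
        log_scores[label] = score
    # Normalize in log space to avoid underflow on long documents.
    top = max(log_scores.values())
    weights = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(weights.values())
    return {c: w / z for c, w in weights.items()}

# Hypothetical training examples.
model = train([
    ("cycling", "bike race trail ride"),
    ("cycling", "mountain bike club ride"),
    ("finance", "mutual fund stock market"),
    ("finance", "fund manager stock returns"),
])
p = posterior(model, "bike club ride")
```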

Exploiting link features
- c = class, t = text, N = neighbors
- Text-only model: Pr[t | c]
- Using neighbors’ text to judge my topic: Pr[t, t(N) | c]
- Better model: Pr[t, c(N) | c]
- Non-linear relaxation
[Figure: %Error (0 to 40) against %Neighborhood known (0 to 100) for the Text, Link and Text+Link models]
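
The idea of using neighbors' classes rather than their text can be sketched as a simple relaxation loop: each unlabeled page combines its text-only estimate with the current class beliefs of its neighbors and repeats until the estimates settle. The update rule, the compatibility table `compat` and all numbers are illustrative assumptions, a simplified stand-in rather than the exact model behind the slide.

```python
def relax(text_posterior, links, known, compat, rounds=10):
    """Re-estimate class beliefs from text and neighbors' classes.

    text_posterior: node -> {class: Pr[c | t]} from a text-only classifier
    links:          node -> list of neighboring nodes
    known:          node -> class, for neighbors whose class is already known
    compat:         compat[c][c2] = assumed Pr[neighbor has class c2 | my class is c]
    """
    classes = list(compat)
    belief = {}
    for node in set(text_posterior) | set(known):
        if node in known:
            belief[node] = {c: float(c == known[node]) for c in classes}
        else:
            belief[node] = dict(text_posterior[node])
    for _ in range(rounds):
        updated = {}
        for node, current in belief.items():
            if node in known:
                updated[node] = current  # known labels stay clamped
                continue
            scores = {}
            for c in classes:
                score = text_posterior[node][c]
                for nb in links.get(node, []):
                    # Expected compatibility with this neighbor's current belief.
                    score *= sum(belief[nb][c2] * compat[c][c2] for c2 in classes)
                scores[c] = score
            z = sum(scores.values())
            updated[node] = {c: s / z for c, s in scores.items()} if z > 0 else current
        belief = updated
    return belief

# Hypothetical setup: page "p" looks off-topic by text alone,
# but both of its neighbors are known cycling pages.
compat = {
    "cycling": {"cycling": 0.8, "other": 0.2},
    "other": {"cycling": 0.2, "other": 0.8},
}
text_posterior = {"p": {"cycling": 0.4, "other": 0.6}}
belief = relax(text_posterior, {"p": ["q", "r"]}, {"q": "cycling", "r": "cycling"}, compat)
```

In this toy case the link evidence overturns the text-only guess, which is the effect the error plot attributes to combining text and link features.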

Putting it together
[Figure: system architecture with the components Topic Distiller, Taxonomy Editor, Example Browser, Feedback, Scheduler, Workers, Taxonomy Database, Crawl Database, Hypertext Classifier (Learn), Topic Models and Hypertext Classifier (Apply)]
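
How the scheduler, workers and classifier might interact can be sketched as a priority-driven crawl loop. This is a minimal in-memory sketch: `fetch` and `relevance` are hypothetical caller-supplied stand-ins for the workers and the classifier, the toy web is invented, and the rule that a link inherits its parent page's relevance as priority is an assumed reading of soft focus. The popularity score is left out.

```python
import heapq

def focused_crawl(seeds, fetch, relevance, budget):
    """Visit up to `budget` pages, expanding the most promising URL first."""
    frontier = [(-1.0, url) for url in seeds]  # max-heap via negated priority
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        text, outlinks = fetch(url)
        score = relevance(text)
        visited.append((url, score))
        for link in outlinks:
            if link not in seen:
                seen.add(link)
                # Assumed soft-focus rule: priority is the citing page's relevance.
                heapq.heappush(frontier, (-score, link))
    return visited

# Hypothetical toy web: url -> (page text, outlinks).
toy_web = {
    "seed": ("cycling links", ["bike-club", "stocks"]),
    "bike-club": ("cycling club rides", ["trail-guide"]),
    "stocks": ("mutual fund news", ["broker"]),
    "trail-guide": ("cycling trail maps", []),
    "broker": ("fund broker", []),
}

def toy_relevance(text):
    return 1.0 if "cycling" in text else 0.0

order = [url for url, _ in focused_crawl(["seed"], toy_web.__getitem__, toy_relevance, budget=4)]
```

With a budget of four pages the crawl never reaches `broker`, because its only citing page was judged irrelevant and its priority stays below every link found on a relevant page.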

Monitoring the crawler
[Figure: relevance plotted against time, one point per URL, with a moving average overlaid]
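
The smoothed curve in such a monitor is a moving average over the most recent relevance scores. A minimal sketch, with a hypothetical score stream and window size:

```python
from collections import deque

def moving_average(values, window):
    """Yield the mean of the most recent `window` values after each new value."""
    recent = deque(maxlen=window)
    total = 0.0
    for v in values:
        if len(recent) == window:
            total -= recent[0]  # the oldest value is about to be evicted
        recent.append(v)
        total += v
        yield total / len(recent)

relevance_stream = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0]
smoothed = list(moving_average(relevance_stream, window=2))
```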

RDBMS benefits
- Multiple priority controls
- Dynamically changing crawling strategies
- Concurrency and crash recovery
- Effective out-of-core computations
- Ad-hoc crawl monitoring and tweaking
- Synergy of scale
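
To illustrate why a relational store makes priority control easy, the sketch below keeps the crawl frontier in a table and picks the next URL with an ORDER BY clause. SQLite stands in for the RDBMS, and the table name, columns and ordering policy are all hypothetical; changing the crawl strategy amounts to changing the query.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE crawl (
        url        TEXT PRIMARY KEY,
        relevance  REAL,               -- classifier-derived priority
        num_tries  INTEGER DEFAULT 0,  -- failed fetch attempts so far
        visited    INTEGER DEFAULT 0
    )
""")
db.executemany(
    "INSERT INTO crawl (url, relevance, num_tries) VALUES (?, ?, ?)",
    [
        ("http://a.example/", 0.9, 0),
        ("http://b.example/", 0.9, 2),
        ("http://c.example/", 0.4, 0),
    ],
)

def next_url(conn):
    """Pick the next page: fewest failed tries first, then highest relevance."""
    row = conn.execute(
        "SELECT url FROM crawl WHERE visited = 0 "
        "ORDER BY num_tries ASC, relevance DESC LIMIT 1"
    ).fetchone()
    return row[0] if row else None

first = next_url(db)
db.execute("UPDATE crawl SET visited = 1 WHERE url = ?", (first,))
second = next_url(db)
```

Ad-hoc monitoring follows the same pattern, for example `SELECT AVG(relevance) FROM crawl WHERE visited = 1`.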

Measures of success
- Harvest rate
  - What fraction of crawled pages are relevant
- Robustness across seed sets
  - Separate crawls with random disjoint samples
  - Measure overlap in URLs and servers crawled
  - Measure agreement in best-rated resources
- Evidence of non-trivial work
  - #Links from start set to the best resources
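
The first two measures can be sketched as follows. The relevance threshold of 0.5, the asymmetric overlap definition (fraction of crawl A also reached by crawl B) and the example URLs are assumptions made for illustration.

```python
from urllib.parse import urlsplit

def harvest_rate(relevances, threshold=0.5):
    """Fraction of crawled pages whose relevance meets the assumed threshold."""
    if not relevances:
        return 0.0
    return sum(r >= threshold for r in relevances) / len(relevances)

def overlap(crawl_a, crawl_b):
    """Fraction of crawl A's items that crawl B also reached."""
    a, b = set(crawl_a), set(crawl_b)
    return len(a & b) / len(a) if a else 0.0

def servers(urls):
    """Reduce a set of URLs to the set of host names serving them."""
    return {urlsplit(u).netloc for u in urls}

# Hypothetical crawls started from disjoint seed sets.
crawl_a = ["http://x.example/1", "http://x.example/2", "http://y.example/1", "http://w.example/1"]
crawl_b = ["http://x.example/1", "http://y.example/9", "http://z.example/1"]

rate = harvest_rate([0.9, 0.2, 0.7, 0.6])
url_overlap = overlap(crawl_a, crawl_b)
server_overlap = overlap(servers(crawl_a), servers(crawl_b))
```

Server overlap is typically higher than URL overlap, since two crawls can reach the same site through different pages.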

Harvest rate
[Figure: two panels plotting average relevance (0 to 1) against #URLs fetched, using moving averages over 100 and 1000 URLs. Unfocused: "Harvest Rate (Cycling, Unfocused)", 0 to 10000 URLs. Focused: "Harvest Rate (Cycling, Soft Focus)", 0 to 6000 URLs]

Crawl robustness
[Figure: two panels titled "Crawl Robustness (Cycling)" comparing Crawl A and Crawl B against #URLs crawled (0 to 3000): URL overlap, and server overlap (Overlap1, Overlap2)]

Top resources after one hour
- Recreational and competitive cycling
  - http://www.truesport.com/Bike/links.htm
  - http://reality.sgi.com/employees/billh_hampton/jrvs/links.html
  - http://www.acs.ucalgary.ca/~bentley/mark_links.html
- HIV/AIDS research and treatment
  - http://www.stopaids.org/Otherorgs.html
  - http://www.iohk.com/UserPages/mlau/aidsinfo.html

Distance to best resources
[Figure: two histograms of #servers in top 100 against minimum distance from crawl seed (#links). "Resource Distance (Cycling)": distances 1 to 12, a cooperative community. "Resource Distance (Mutual Funds)": distances 1 to 15, a competitive community]
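
The minimum distance from the crawl seeds to a resource is a shortest-path length in the link graph, which a breadth-first search computes directly. A minimal sketch on a hypothetical link graph:

```python
from collections import deque

def min_link_distance(links, seeds):
    """Breadth-first search: fewest links from any seed to each reachable page."""
    dist = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for nxt in links.get(page, []):
            if nxt not in dist:
                dist[nxt] = dist[page] + 1
                queue.append(nxt)
    return dist

# Hypothetical link graph: page -> pages it links to.
links = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": ["best"],
    "d": [],
}
dist = min_link_distance(links, ["seed"])
```

A top resource at distance 3 or more, like `best` here, is evidence that the crawler did non-trivial work to find it.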

Robustness of resource discovery
- Sample disjoint sets of starting URLs
- Two separate crawls
- Find best authorities
- Order by rank
- Find overlap in the top-rated resources
[Figure: "Resource Robustness (Cycling)": server overlap (0 to 1; Overlap1, Overlap2) against #top resources (0 to 25)]
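
The overlap curve can be sketched by comparing the top-k prefixes of the two ranked lists for increasing k. The server names and rankings are hypothetical, and dividing by k assumes both lists contain at least k entries.

```python
def top_k_overlap(ranked_a, ranked_b, k):
    """Fraction of the top-k resources shared by two ranked lists."""
    return len(set(ranked_a[:k]) & set(ranked_b[:k])) / k

# Hypothetical authority rankings from two crawls with disjoint seeds.
ranked_a = ["s1", "s2", "s3", "s4"]
ranked_b = ["s2", "s1", "s5", "s3"]
curve = [top_k_overlap(ranked_a, ranked_b, k) for k in range(1, 5)]
```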

Future work
- Harvest rate at different levels of taxonomy
  - By definition harvest rate is 1 for root node
- Sociology of citations
  - Build a gigantic citation matrix for web topics
  - Further enhance resource finding skills
- Semi-structured queries
  - Suspicious link neighborhoods, e.g., traffic radar manufacturer and auto insurance company

Related work
- WebWatcher, HotList&ColdList
  - Filtering as post-processing, not acquisition
- Fish search, WebCrawler
  - Crawler guided by query keyword matches
- Ahoy!, Cora
  - Hand-crafted to find home pages and papers
- ReferralWeb
  - Social network on the Web

Conclusion
- New architecture for example-driven topic-specific web resource discovery
- No dependence on full web crawl and index
- Modest desktop hardware adequate
- Variable radius goal-directed crawling
- High harvest rate
- High quality resources found far from keyword query response nodes

References
- soumen@cs.berkeley.edu
- www.cs.berkeley.edu/~soumen/
  - www8focus.pdf
  - sigmod98.ps
- www.almaden.ibm.com/cs/k53/ir.html


				