Topical Crawling for Business Intelligence

Gautam Pant* and Filippo Menczer**

* Department of Management Sciences
  The University of Iowa, Iowa City, IA 52246

** School of Informatics
   Indiana University, Bloomington, IN 47408
Overview
   Topical Crawling
   The Business Intelligence Problem
   Test Bed
   Crawling Algorithms
   Results
   Finding Better Seeds
Crawling as Graph Search
   Node expansion – downloading and parsing a page
   Open list – Frontier
   Closed list – History
   Expansion order – Crawl path

(Diagram: seed URLs populate the Frontier; expanded pages move to the History. A minimal version of this loop is sketched below.)
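To make the open/closed-list framing concrete, here is a minimal Python sketch of the crawl loop. The fetch_links helper, the use of urllib for fetching, and the breadth-first deque are illustrative assumptions, not the crawlers evaluated in these slides.

```python
# A minimal crawl loop illustrating the graph-search framing above: the frontier
# is the open list, the history is the closed list, and expanding a node means
# downloading and parsing a page.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects absolute out-links from an HTML page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def fetch_links(url):
    """Node expansion: download one page and return its out-links."""
    html = urlopen(url, timeout=10).read().decode("utf-8", errors="ignore")
    parser = LinkParser(url)
    parser.feed(html)
    return parser.links


def crawl(seeds, max_pages=100):
    frontier = deque(seeds)       # open list
    history = set()               # closed list
    crawl_path = []               # expansion order
    while frontier and len(crawl_path) < max_pages:
        url = frontier.popleft()  # blind (breadth-first) expansion order
        if url in history:
            continue
        history.add(url)
        try:
            links = fetch_links(url)
        except Exception:
            continue
        crawl_path.append(url)
        frontier.extend(u for u in links if u not in history)
    return crawl_path
```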
Exhaustive vs. Preferential Crawling
   Exhaustive - blind expansion order (e.g. Breadth First)
   Preferential - heuristic-based expansion order (e.g. Best First)
       Topical Crawling: the guiding heuristic is based on a topic or a set of topics (a preferential frontier is sketched below)
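A hedged sketch of the preferential variant, reusing fetch_links from the sketch above: the frontier becomes a priority queue and links are expanded in order of a heuristic score. The score_link keyword-in-URL heuristic is a toy stand-in, not the scoring used by the crawlers in these slides.

```python
# Preferential (best-first) variant of the same loop: the frontier is a
# priority queue ordered by a heuristic score, so the most promising link
# is expanded first.
import heapq


def score_link(url, topic_keywords):
    """Toy heuristic: fraction of topic keywords occurring in the URL string."""
    url_lower = url.lower()
    hits = sum(1 for kw in topic_keywords if kw.lower() in url_lower)
    return hits / max(len(topic_keywords), 1)


def best_first_crawl(seeds, topic_keywords, max_pages=100):
    frontier = [(-1.0, url) for url in seeds]  # max-heap via negated scores
    heapq.heapify(frontier)
    history, crawl_path = set(), []
    while frontier and len(crawl_path) < max_pages:
        _, url = heapq.heappop(frontier)       # best-scoring link first
        if url in history:
            continue
        history.add(url)
        try:
            links = fetch_links(url)           # node expansion, as before
        except Exception:
            continue
        crawl_path.append(url)
        for u in links:
            if u not in history:
                heapq.heappush(frontier, (-score_link(u, topic_keywords), u))
    return crawl_path
```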
Business Intelligence Problem
   Web-based information about related business entities
   Related through their area of competence, research thrust, etc.
   Topical crawlers can help in creating a small
    but focused collection of Web pages that is
    rich in information about related business
    entities
Business Intelligence Problem
   A list of business entities is available
   We create a focused document collection that
    can be further explored with ranking, indexing
    and text-mining tools
   We investigate crawling techniques for this task
Finding paths in a competitive community

(Diagram: a set of competing .com sites connected not by direct links but through intermediate .edu, .org, and .gov pages)
Test Bed
   DMOZ Categories – “Companies”,
    “Consultants”, “Manufacturers”
   159 topics
   Seeds, targets, keywords, and a description for each topic
   Each crawler crawls up to 10,000 pages for each topic
Sample Topic
Performance Metrics
   Precision@N

     precision@N(t) = (1/N) · Σ_{i=1..N} sim(d_t, p_i)

   Target Recall@N

     target recall@N(t) = |Crawled ∩ Targets| / |Targets|
     (an estimate of |Crawled ∩ Relevant| / |Relevant|)

(Venn diagram: the Targets are a known subset of the unknown Relevant set; the overlap of the Crawled set with the Targets estimates its overlap with the Relevant set)
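To make the two metrics concrete, here is a hedged Python sketch. The slides only write sim(); cosine similarity over raw term-frequency vectors is assumed here for illustration, and the helper names (tf_vector, cosine_sim, precision_at_n, target_recall_at_n) are illustrative.

```python
# Sketch of the two evaluation metrics under the assumptions stated above.
import math
from collections import Counter


def tf_vector(text):
    """Term-frequency vector of a text."""
    return Counter(text.lower().split())


def cosine_sim(a, b):
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def precision_at_n(topic_description, crawled_page_texts, n):
    """Mean similarity between the topic description and the first N crawled pages."""
    d_t = tf_vector(topic_description)
    first_n = crawled_page_texts[:n]
    return sum(cosine_sim(d_t, tf_vector(p)) for p in first_n) / max(len(first_n), 1)


def target_recall_at_n(crawled_urls, target_urls, n):
    """Fraction of target URLs found among the first N crawled pages
    (a proxy for recall over the unknown relevant set)."""
    crawled, targets = set(crawled_urls[:n]), set(target_urls)
    return len(crawled & targets) / len(targets) if targets else 0.0
```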
Crawling Infrastructure
Crawling Algorithms
   Breadth First

   Naïve Best First
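One reading of the Naïve Best First item above: every link extracted from a page inherits a single score, the similarity between the topic and the full text of the page it was found on. A minimal sketch, reusing tf_vector and cosine_sim from the metrics sketch; the function name and the cosine-similarity choice are assumptions.

```python
# Naïve Best-First link scoring (as read from the slide): all out-links of a
# page share one score, the topic-to-page similarity.
def naive_best_first_scores(page_text, out_links, topic_text):
    page_score = cosine_sim(tf_vector(topic_text), tf_vector(page_text))
    return {link: page_score for link in out_links}  # every out-link inherits the page score
```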
Crawling Algorithms – DOM Crawler

   link_score_dom = β · link_score_bfs + (1 − β) · context_score

   where β ∈ [0, 1] weights the page-wide best-first score against a score computed from the link's context in the page's tag tree (DOM)
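A minimal sketch of this combination as reconstructed above; the weight value 0.25 and the function name are illustrative assumptions, and the slides do not specify how β was set.

```python
# DOM crawler link score: a convex combination of the page-wide best-first
# score and a score computed from the text around the link in the tag tree.
def dom_link_score(link_score_bfs, context_score, beta=0.25):
    return beta * link_score_bfs + (1 - beta) * context_score


# Example: page-wide similarity 0.30, DOM-context similarity 0.60
# -> 0.25 * 0.30 + 0.75 * 0.60 = 0.525
```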
Hub-Seeking Crawler

   hub_score = n · (n − 1) / (1 + n²),  n > 0

   n – number of seed hosts

   link_score_hub = max(hub_score, link_score_dom)
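A sketch of the hub-seeking scores as reconstructed above: a page linking to n distinct seed hosts gets a hub score that grows toward 1 with n, and each of its out-links is scored by the larger of the hub score and the DOM score. Function names are illustrative.

```python
# Hub-seeking crawler scores under the reconstruction above.
def hub_score(n_seed_hosts):
    n = n_seed_hosts
    return n * (n - 1) / (1 + n * n) if n > 0 else 0.0


def hub_link_score(n_seed_hosts, link_score_dom):
    return max(hub_score(n_seed_hosts), link_score_dom)


# Example: a page pointing to 4 seed hosts -> hub_score = 12/17 ≈ 0.71, which
# overrides a lower DOM-based score for that page's out-links.
```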
Performance
Improving the Seed Set
   Top 10 hubs based on back-links from Google
   Avoiding mirrors of DMOZ
   Augmented seed set (sketched below)
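A hedged sketch of the augmentation step: candidate hub URLs and their back-link counts are assumed to have been collected beforehand (the slides mention back-links from Google), DMOZ mirrors are filtered out with a crude, purely illustrative substring test, and the top 10 survivors are merged with the original seeds.

```python
# Seed-set augmentation under the assumptions stated above.
def augment_seeds(original_seeds, backlink_counts, top_k=10):
    """backlink_counts: dict mapping candidate hub URL -> back-link count."""
    def looks_like_dmoz_mirror(url):
        return "dmoz" in url.lower()  # crude, illustrative mirror check
    candidates = [(count, url) for url, count in backlink_counts.items()
                  if not looks_like_dmoz_mirror(url)]
    top_hubs = [url for _, url in sorted(candidates, reverse=True)[:top_k]]
    return list(dict.fromkeys(list(original_seeds) + top_hubs))  # ordered, de-duplicated
```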
Performance
Related work
   Chakrabarti et al. [1998]
       Use of Hubs
   Menczer et al. [2001]
       Framework for evaluating topical crawlers
   Chakrabarti et al. [2002]
       Use of DOM
Conclusion
   Investigated the problem of creating a small collection
    through topical crawling for locating related business entities
   The Hub-Seeking crawler, which seeks hubs at crawl time and exploits the tag tree structure of Web pages, outperforms Naïve Best-First
   Positive effects of identifying hubs before and during the
    crawl process
   Future Work –
       Find optimal aggregation node
       Compare the benefits of identifying hubs in competitive vs.
        collaborative communities
Thank You
gautam-pant@uiowa.edu

Acknowledgements:
Robin McEntire (GlaxoSmithKline R&D)
Valdis A. Dzelzkalns (GlaxoSmithKline R&D)
Paul Stead (GlaxoSmithKline R&D)
NSF grant to FM

								