CRAWLING THE HIDDEN WEB
Authors: S. Raghavan & H. Garcia-Molina
Presenter: Nga Chung
OUTLINE
   Introduction
   Challenges
   Approach
   Experimental Results
   Contributions
   Pros and Cons
   Related Work
INTRODUCTION
   Hidden Web
     Content stored in databases that can only be retrieved through a
      user query, such as medical research databases, flight schedules,
      product listings, and news archives
     Social media blog posts and comments
   So why should we care?
     The scale of the web (roughly 55-60 billion pages) does not include
      the deep web or pages behind security walls [2]
     A 2001 estimate put the Hidden Web at 500 times the size of the
      publicly indexed web
     Mike Bergman: "The long-term impact of Deep Web search had more to
      do with transforming business than with satisfying the whims of
      Web surfers." [5]
CHALLENGES
   From a search engine's perspective
     Locate the hidden databases
     Identify which databases to search for a given user query
   From a crawler's perspective
     Interact with a search form
       Search interfaces can be form-based, faceted/guided navigation,
        or free-text, and are intended for human users [3]
     Know what keywords to put into the form fields
     Filter search results returned from search queries
     Define metrics to measure the crawler's performance
HIDDEN WEB EXPOSER (HIWE) ARCHITECTURE

   [Architecture diagram. Components: URL List, Crawl Manager, Parser,
    Form Analyzer, Form Processor, Form Submission, Response Analyzer
    (with feedback to the crawler), LVS Manager, the task-specific
    database holding the Label Value Set (LVS) table, external Data
    Sources, and the WWW.]
FORM ANALYSIS
   How does a crawler interact with a search form?
     The crawler builds an "Internal Form Representation"
       F = ({E1, E2, ..., En}, S, M), where
         {E1, ..., En} is the set of n form elements
         S is submission information, e.g. the submission URL
         M is meta-information, e.g. the URL of the form page, the web
          site hosting the form, and links to the form
     Label(Ei) is descriptive text describing the field, e.g. "Date"
     Domain(Ei) is the set of possible values for the field, which can
      be finite (select box) or infinite (text box)
FORM ANALYSIS
   Example (from a car search form):
     Label(E1) = Make
     Domain(E1) = {Acura, Lexus, ...}
     Label(E5) = Your ZIP
     Domain(E5) = {s | s is a text string}
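As a rough sketch, the internal form representation could be modeled in Python as follows (class and attribute names, and the example URL, are illustrative, not from the paper):

```python
from dataclasses import dataclass

@dataclass
class FormElement:
    """One form field: a label plus its domain of possible values."""
    label: str                               # e.g. "Make"
    finite_domain: list[str] | None = None   # None means an infinite domain (text box)

@dataclass
class InternalForm:
    """F = ({E1, ..., En}, S, M) from the paper."""
    elements: list[FormElement]   # {E1, ..., En}: the set of n form elements
    submit_url: str               # S: submission information
    meta: dict                    # M: form-page URL, hosting site, links to the form

# The car-search example from the slide:
form = InternalForm(
    elements=[
        FormElement(label="Make", finite_domain=["Acura", "Lexus"]),
        FormElement(label="Your ZIP"),  # infinite domain: any text string
    ],
    submit_url="http://example.com/search",  # hypothetical URL
    meta={"form_page": "http://example.com/cars", "site": "example.com"},
)
```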
TASK-SPECIFIC DATABASE
   How does a crawler know what keywords to put into the form fields?
     The crawler has a "task-specific database"
       For instance, if the task is to search archives pertaining to the
        automobile industry, the database will contain lists of all car
        makes and models.
     The database has a Label Value Set (LVS) table
       Each row contains
         L - a label, e.g. "Car Make"
         V = {v1, ..., vn} - a graded set of values, e.g. {'Toyota',
          'Honda', 'Mercedes-Benz', ...}
       A membership function MV assigns a weight to each member of
        the set V
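A minimal sketch of how an LVS row might be represented, with per-value weights standing in for the membership function MV (the labels and weights shown are illustrative):

```python
# Each LVS row maps a label L to a graded value set V; the weight stored
# with each value v plays the role of the membership function M_V(v).
lvs_table: dict[str, dict[str, float]] = {
    "Car Make": {"Toyota": 1.0, "Honda": 1.0, "Mercedes-Benz": 0.9},
    "Year":     {"2009": 1.0, "2010": 1.0},
}

def membership(label: str, value: str) -> float:
    """M_V(v): the weight of value v in the set stored under label L."""
    return lvs_table.get(label, {}).get(value, 0.0)
```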
TASK-SPECIFIC DATABASE
   The LVS table can be populated through
     Explicit initialization by human intervention
     Built-in entries for commonly used categories, e.g. dates
     Querying external data sources, e.g. the Open Directory Project
      (example category path: Regional: North America: United States)
     The crawler's own encounters with forms that have finite
      domain fields
TASK-SPECIFIC DATABASE
   Computing the weights M(v)
     Case 1: Precomputed
     Case 2: Computed by the respective data source's wrapper
     Case 3: Computed from crawling experience, following the flow below
       Extract the label of a finite-domain form element E
       If a label was extracted, look it up in the LVS table
         If found, replace the entry (L, V) with (L, V U Domain(E))
         If not found, add a new entry to the LVS table
       If no label could be extracted, find the entry that most closely
        resembles Domain(E) and add Domain(E) to its value set
MATCHING FUNCTION
   A "matching function" maps values from the database to form fields,
    e.g. E1 = Car Make is matched to v1 = Toyota and E2 = Car Model is
    matched to v2 = Prius
   Step 1: Label matching
     Normalize the form label and use a string matching algorithm to
      compute the minimum edit distance between the form label and all
      LVS labels
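A minimal sketch of Step 1, assuming plain Levenshtein edit distance over lower-cased, punctuation-stripped labels (the paper's exact normalization and matching details may differ):

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def normalize(label: str) -> str:
    """Lower-case and strip punctuation before matching."""
    return "".join(ch for ch in label.lower() if ch.isalnum() or ch == " ").strip()

def best_lvs_label(form_label: str, lvs_labels: list[str]) -> tuple[str, int]:
    """Return the LVS label with minimum edit distance to the form label."""
    norm = normalize(form_label)
    return min(((l, edit_distance(norm, normalize(l))) for l in lvs_labels),
               key=lambda pair: pair[1])

# e.g. best_lvs_label("Car Make:", ["Car Make", "Year"]) -> ("Car Make", 0)
```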
MATCHING FUNCTION
   Step 2: Value assignment
     Take all possible combinations of value assignments, rank them, and
      choose the best set to use for form submission
     There are three ranking functions
       Fuzzy conjunction:
         ρ_fuz([E1 ← v1, ..., En ← vn]) = min_i M_Vi(vi)
       Average:
         ρ_avg([E1 ← v1, ..., En ← vn]) = (1/n) Σ_{i=1..n} M_Vi(vi)
       Probabilistic:
         ρ_prob([E1 ← v1, ..., En ← vn]) = 1 − Π_i (1 − M_Vi(vi))
     Example: a form with two fields, car make and year
       Jaguar, 2009, where M_V1(Jaguar) = 0.5 and M_V2(2009) = 1
         ρ_fuz = 0.5
         ρ_avg = 1/2 (0.5 + 1) = 0.75
         ρ_prob = 1 − [(1 − 0.5) × (1 − 1)] = 1
       Toyota, 2010, where M_V1(Toyota) = 1 and M_V2(2010) = 1
         ρ_fuz = 1
         ρ_avg = 1/2 (1 + 1) = 1
         ρ_prob = 1 − [(1 − 1) × (1 − 1)] = 1
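The three ranking functions translate directly into code; this short sketch reproduces the worked example above:

```python
def rho_fuz(weights: list[float]) -> float:
    """Fuzzy conjunction: the rank is the weight of the weakest value."""
    return min(weights)

def rho_avg(weights: list[float]) -> float:
    """Average of the membership weights."""
    return sum(weights) / len(weights)

def rho_prob(weights: list[float]) -> float:
    """Probabilistic: 1 minus the product of (1 - weight)."""
    result = 1.0
    for w in weights:
        result *= (1.0 - w)
    return 1.0 - result

# The worked example from the slide:
jaguar = [0.5, 1.0]   # M(Jaguar) = 0.5, M(2009) = 1
toyota = [1.0, 1.0]   # M(Toyota) = 1,   M(2010) = 1
assert rho_fuz(jaguar) == 0.5 and rho_avg(jaguar) == 0.75 and rho_prob(jaguar) == 1.0
assert rho_fuz(toyota) == rho_avg(toyota) == rho_prob(toyota) == 1.0
```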
LAYOUT-BASED INFORMATION EXTRACTION (LITE)
   Label extraction method results:

     Method               Accuracy
     LITE                 93%
     Textual Analysis     72%
     Common Form Layout   83%
RESPONSE ANALYSIS
   How does the crawler determine whether a response page contains
    results or an error message?
     Identify the significant portion of the response page by removing
      the header, footer, etc. and finding the content in the middle of
      the page
     See if the content matches predefined error messages, e.g. "No
      results," "No matches"
     Store a hash of the significant portion and assume that if a hash
      occurs very often, it is the hash of an error page
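A rough sketch of that heuristic, assuming the caller has already isolated the significant middle portion of the page (the hash function, error phrases, and repeat threshold are illustrative choices, not from the paper):

```python
import hashlib
from collections import Counter

ERROR_PHRASES = ("no results", "no matches")   # predefined error messages
hash_counts: Counter[str] = Counter()          # how often each content hash recurs

def looks_like_error_page(significant_content: str,
                          repeat_threshold: int = 10) -> bool:
    """Treat a page as an error page if its significant portion contains a
    known error phrase, or if the hash of that portion recurs very often."""
    text = significant_content.lower()
    if any(phrase in text for phrase in ERROR_PHRASES):
        return True
    digest = hashlib.md5(significant_content.encode("utf-8")).hexdigest()
    hash_counts[digest] += 1
    return hash_counts[digest] >= repeat_threshold
```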
METRICS
   How do we measure the efficiency of the hidden web crawler?
     Define submission efficiency (SE)
       N_total = total number of forms submitted
       N_success = total number of submissions that resulted in a
        response page containing search results
       N_valid = number of semantically correct submissions (e.g.
        inputting "Orange" for a form element labeled "Vegetable" is
        semantically incorrect)

       SE_strict = N_success / N_total      SE_lenient = N_valid / N_total
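In code, the two metrics are simple ratios; the example below plugs in the ρ_fuz numbers reported later in the results slides:

```python
def se_strict(n_success: int, n_total: int) -> float:
    """SE_strict = N_success / N_total."""
    return n_success / n_total

def se_lenient(n_valid: int, n_total: int) -> float:
    """SE_lenient = N_valid / N_total."""
    return n_valid / n_total

# With the rho_fuz run from the results slide (2853 successes / 3214 total):
print(f"{se_strict(2853, 3214):.1%}")   # -> 88.8%
```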
EXPERIMENT
   Task: a market analyst interested in building an archive of
    information about the semiconductor industry over the past 10 years
   LVS table populated from online sources such as the Semiconductor
    Research Corporation and Lycos Companies Online

     Parameter                                  Value
     Number of sites visited                    50
     Number of forms encountered                218
     Number of forms chosen for submission      94
     Label matching threshold                   0.75
     Minimum form size                          3
     Value assignment ranking function          ρ_fuz
     Minimum acceptable value assignment rank   0.6
EXPERIMENTAL RESULTS – RANKING FUNCTION
   The crawler was executed 3 times, each with a different ranking
    function

     [Chart: Task 1 performance with different ranking functions]
     Ranking function   Total submissions   Successes   Efficiency
     ρ_fuz              3214                2853        88.8%
     ρ_avg              3760                3126        83.1%
     ρ_prob             4316                2810        65.1%

   ρ_fuz and ρ_avg achieve submission efficiency above 80%
   ρ_fuz does better, but fewer forms are submitted compared to ρ_avg
EXPERIMENTAL RESULTS – MINIMUM FORM SIZE
   Effect of minimum form size: the crawler performs better on larger
    forms

     [Chart: Task 1 performance with different minimum form sizes]
     Minimum form size   N_total   Successes   Efficiency
     2                   3735      2950        78.9%
     3                   3214      2853        88.77%
     4                   2800      2491        88.96%
     5                   1560      1404        90%
CONTRIBUTIONS
   Introduces HiWE, one of the first publicly available techniques for
    crawling the hidden web
   Introduces LITE, a technique to extract form data by incorporating
    the physical layout of the HTML page
     Techniques prior to this were based on pattern recognition of the
      underlying HTML
PROS
   Defines a clear performance metric with which to analyze the
    crawler's efficiency
   Points out known limitations of the technique, from which future
    work can be done
   Directs readers to a technical report that provides a more detailed
    explanation of the HiWE implementation
CONS
   Not an automatic approach; requires human intervention
   Task-specific
     Requires creation of an LVS table per task
   The technique has many limitations
     Can only retrieve search results from HTML-based forms
     Cannot support forms that are driven by JavaScript events, e.g.
      onclick, onselect
   No mention of whether forms submitted through HTTP POST were
    stored/indexed
RELATED WORK
   USC ISI work on extracting data from the web (1999-2001) [7, 8]
     Describes relevant information on a web page with a formal grammar
      and automatically adapts to web page changes
   Research at UCLA (2005) [4]
     Adaptive approach: automatically generates queries by examining the
      results of previous queries
   Google's Deep-Web Crawler (2008) [1]
     Selects only a small number of input combinations that provide good
      coverage of the content in the underlying database, and adds the
      resulting HTML pages to a search engine index
   DeepPeep [6]
     Tracks 45,000 forms across 7 domains and allows users to search for
      these forms
Q&A
REFERENCES
[1] J. Madhavan, D. Ko, Ł. Kot, V. Ganapathy, A. Rasmussen, & A. Halevy, "Google's
    Deep-Web Crawl," Proceedings of the VLDB Endowment, 2008. Available:
    http://www.cs.cornell.edu/~lucja/Publications/I03.pdf. [Accessed June 13, 2010]
[2] C. Mattmann, "Characterizing the Web." Available:
    http://sunset.usc.edu/classes/cs572_2010/Characterizing_the_Web.ppt. [Accessed
    May 19, 2010]
[3] C. Mattmann, "Query Models." Available:
    http://sunset.usc.edu/classes/cs572_2010/Query_Models.ppt. [Accessed June 10,
    2010]
[4] A. Ntoulas, P. Zerfos, & J. Cho, "Downloading Textual Hidden Web Content by
    Keyword Queries," Proceedings of the Joint Conference on Digital Libraries, June
    2005. Available: http://oak.cs.ucla.edu/~cho/papers/ntoulas-hidden.pdf.
    [Accessed June 13, 2010]
[5] A. Wright, "Exploring a 'Deep Web' That Google Can't Grasp," The New York
    Times, February 22, 2009. Available:
    http://www.nytimes.com/2009/02/23/technology/internet/23search.html?_r=1&th
    &emc=th. [Accessed June 1, 2010]
[6] DeepPeep beta. Available: http://www.deeppeep.org/index.jsp
[7] C. A. Knoblock, K. Lerman, S. Minton, & I. Muslea, "Accurately and Reliably
    Extracting Data from the Web: A Machine Learning Approach," IEEE Data
    Engineering Bulletin, 1999. Available: http://www.isi.edu/~muslea/PS/deb-2k.pdf.
    [Accessed June 28, 2010]
[8] C. A. Knoblock, S. Minton, & I. Muslea, "Hierarchical Wrapper Induction for
    Semistructured Information Sources," Journal of Autonomous Agents and Multi-
    Agent Systems, 2001. Available: http://www.isi.edu/~muslea/PS/jaamas-2k.pdf.
    [Accessed June 28, 2010]