Crawling the Hidden Web

by Michael Weinberg
mwmw@cs.huji.ac.il

Internet DB Seminar
The Hebrew University of Jerusalem
School of Computer Science and Engineering
December 2001
Agenda

• Hidden Web - what is it all about?
• Generic model for a hidden Web crawler
• HiWE (Hidden Web Exposer)
• LITE - Layout-based Information Extraction Technique
• Results from experiments conducted to test these techniques
Web Crawlers

• Automatically traverse the Web graph, building a local repository of the portion of the Web that they visit
• Traditionally, crawlers have only targeted a portion of the Web called the publicly indexable Web (PIW)
• PIW – the set of pages reachable purely by following hypertext links, ignoring search forms and pages that require authentication
The Hidden Web

• Recent studies show that a significant fraction of Web content in fact lies outside the PIW
• Large portions of the Web are 'hidden' behind search forms in searchable databases
• HTML pages are dynamically generated in response to queries submitted via the search forms
• Also referred to as the 'Deep' Web
The Hidden Web Growth

• The Hidden Web continues to grow, as organizations with large amounts of high-quality information are placing their content online, providing web-accessible search facilities over existing databases
• For example:
  – Census Bureau
  – Patents and Trademarks Office
  – News media companies
• InvisibleWeb.com lists over 10,000 such databases
Surface Web

[Figure: the surface Web]
Deep Web

[Figure: the deep Web]
Deep Web Content Distribution

[Figure: distribution of deep Web content]
Deep Web Stats

• The Deep Web is 500 times larger than the PIW!
• Contains 7,500 terabytes of information (March 2000)
• More than 200,000 Deep Web sites exist
• Sixty of the largest Deep Web sites collectively contain about 750 terabytes of information
• 95% of the Deep Web is publicly accessible (no fees)
• Google indexes about 16% of the PIW, so we search about 0.03% of the pages available today
The Problem

• The Hidden Web contains large amounts of high-quality information
• The information is buried in dynamically generated sites
• Search engines that use traditional crawlers never find this information
The Solution

• Build a hidden Web crawler that can crawl and extract content from hidden databases
• Enable indexing, analysis, and mining of hidden Web content
• The content extracted by such crawlers can be used to categorize and classify the hidden databases
Challenges

• There are significant technical challenges in designing a hidden Web crawler
• It must interact with forms that were designed primarily for human consumption
• It must provide input in the form of search queries
• How do we equip the crawler with input values for use in constructing search queries?
• To address these challenges, we adopt the task-specific, human-assisted approach
Task-Specificity

• Extract content based on the requirements of a particular application or task
• For example, consider a market analyst interested in press releases, articles, and so on, pertaining to the semiconductor industry and dated sometime in the last ten years
Human Assistance

• Human assistance is critical to ensure that the crawler issues queries that are relevant to the particular task
• For instance, in the semiconductor example, the market analyst may provide the crawler with lists of companies or products that are of interest
• The crawler will be able to gather additional potential company and product names as it processes a number of pages
Two Steps

• There are two steps in achieving our goal:
  – Resource discovery – identify sites and databases that are likely to be relevant to the task
  – Content extraction – actually visit the identified sites to submit queries and extract the hidden pages
• In this presentation we do not directly address the resource discovery problem
Hidden Web Crawlers
User Form Interaction

[Figure: the user downloads the form page (1), views the form (2), fills it out (3), and submits it (4); the site's Web query front-end runs the query against the hidden database; the user downloads (5) and views (6) the response page.]
Operational Model

• Our model of a hidden Web crawler consists of four components:
  – Internal Form Representation
  – Task-specific database
  – Matching function
  – Response Analysis
• Form Page – the page containing the search form
• Response Page – the page received in response to a form submission
Generic Operational Model

[Figure: the crawler downloads the form page; form analysis produces the Internal Form Representation; the Match function combines it with the task-specific database to generate a set of value assignments; each form submission goes to the Web query front-end of the hidden database; the downloaded response page is passed to response analysis and stored in the repository.]
Internal Form Representation

• Form F = ({E_1, E_2, ..., E_n}, S, M)
• {E_1, E_2, ..., E_n} is a set of n form elements
• S – submission information associated with the form:
  – submission URL
  – internal identifiers for each form element
• M – meta-information about the form:
  – web site hosting the form
  – set of pages pointing to this form page
  – other text on the page besides the form
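The paper does not prescribe a concrete data structure; as a rough sketch, F = ({E_1, ..., E_n}, S, M) could be modeled with Python dataclasses (all names here are hypothetical, not HiWE's actual code):

```python
from dataclasses import dataclass, field

@dataclass
class FormElement:
    """One element E_i of the form, e.g. a text box or selection list."""
    identifier: str                    # internal identifier (part of S)
    domain: list[str] | None = None    # finite domain, or None if infinite
    label: str | None = None           # descriptive label, if extracted

@dataclass
class Form:
    """F = ({E_1, ..., E_n}, S, M)."""
    elements: list[FormElement]               # {E_1, ..., E_n}
    submission_url: str                       # part of S: where to submit
    meta: dict = field(default_factory=dict)  # M: hosting site, referring pages, page text
```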
Task-specific Database

• The crawler is equipped with a task-specific database D
• D contains the necessary information to formulate queries relevant to the particular task
• In the 'market analyst' example, D could contain lists of semiconductor company and product names
• The actual format and organization of D are specific to a particular crawler implementation
• HiWE uses a set of labeled fuzzy sets
Matching Function

• Matching algorithm properties:
  – Match(({E_1, ..., E_n}, S, M), D) = {[E_1 → v_1, ..., E_n → v_n]}
  – Input: the internal form representation and the current contents of the database D
  – Output: a set of value assignments
  – E_i → v_i associates value v_i with element E_i
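Concretely, a single value assignment maps each form element to one value; using the example form that appears later in these slides, it might look like this (an illustrative sketch):

```python
# One value assignment [E_1 -> v_1, E_2 -> v_2, E_3 -> v_3] for a
# three-element form, keyed by element label for readability:
assignment = {
    "Document Type": "Press Releases",
    "Company Name": "IBM",
    "Sector": "Information Technology",
}
# Match(...) returns a *set* of such assignments, one per candidate query.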
Response Analysis

• The module that stores the response page in the repository
• Attempts to distinguish between pages containing search results and pages containing error messages
• This feedback is used to tune the matching function
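The slides do not spell out how this distinction is made; a naive keyword heuristic along the following lines conveys the idea (the phrase list is invented for illustration):

```python
ERROR_PHRASES = ("no results", "no matches found", "0 results", "error")

def looks_like_error_page(html_text: str) -> bool:
    """Crude guess at whether a response page is an error/empty-result page.

    Real response analysis would be more robust (e.g. comparing page
    structure across many responses from the same site); this is only
    a sketch of the feedback signal fed back to the matching function.
    """
    text = html_text.lower()
    return any(phrase in text for phrase in ERROR_PHRASES)
```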
Traditional Performance Metrics

• Traditional crawler performance metrics:
  – Crawling speed
  – Scalability
  – Page importance
  – Freshness
• These metrics are relevant to hidden Web crawlers, but do not capture the fundamental challenges in dealing with the Hidden Web
New Performance Metrics

• Coverage metric:
  – 'Relevant' pages extracted / 'relevant' pages present in the targeted hidden databases
  – Problem: it is difficult to estimate how much of the hidden content is relevant to the task
New Performance Metrics

• Strict submission efficiency:
  SE_strict = N_success / N_total
  – N_total: the total number of forms that the crawler submits
  – N_success: the number of submissions that result in a response page with one or more search results
  – Problem: the crawler is penalized if the database didn't contain any relevant search results
New Performance Metrics

• Lenient submission efficiency:
  SE_lenient = N_valid / N_total
  – N_valid: the number of semantically correct form submissions
  – Penalizes the crawler only if a form submission is semantically incorrect
  – Problem: difficult to evaluate, since a manual comparison is needed to decide whether a submission is semantically correct
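Both submission-efficiency metrics are simple ratios; a direct transcription:

```python
def se_strict(n_success: int, n_total: int) -> float:
    """SE_strict = N_success / N_total."""
    return n_success / n_total if n_total else 0.0

def se_lenient(n_valid: int, n_total: int) -> float:
    """SE_lenient = N_valid / N_total."""
    return n_valid / n_total if n_total else 0.0
```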
Design Issues

• What information about each form element E_i should the crawler collect?
• What meta-information is likely to be useful?
• How should the task-specific database be organized, updated and accessed?
• What Match function is likely to maximize submission efficiency?
• How should the response analysis module be used to tune the Match function?
HiWE: Hidden Web Exposer
Basic Idea

• Extract descriptive information (a label) for each element of a form
• The task-specific database is organized in terms of categories, each of which is also associated with labels
• The matching function attempts to match form labels to database categories to compute a set of candidate value assignments
HiWE Architecture

[Figure: the Crawl Manager controls the crawl, pulling URLs from the URL List and fetching pages from the WWW; the Parser extracts hypertext links; the Form Analyzer, Form Processor, and Response Analyzer handle form processing, submission, and response feedback; the LVS Manager maintains the LVS table of (Label, Value-Set) pairs and can draw on custom data sources.]
HiWE's Main Modules

• URL List:
  – contains all the URLs the crawler has discovered so far
• Crawl Manager:
  – controls the entire crawling process
• Parser:
  – extracts hypertext links from the crawled pages and adds them to the URL list
• Form Analyzer, Form Processor, Response Analyzer:
  – together implement the form processing and submission operations
HiWE's Main Modules

• LVS Manager:
  – manages additions and accesses to the LVS table
• LVS table:
  – HiWE's implementation of the task-specific database
HiWE's Form Representation

• Form F = ({E_1, E_2, ..., E_n}, S, ∅)
  – The third component of F is an empty set, since the current implementation of HiWE does not collect any meta-information about the form
• For each element E_i, HiWE collects a domain Dom(E_i) and a label label(E_i)
HiWE's Form Representation

• Domain of an element:
  – The set of values which can be associated with the corresponding form element
  – May be a finite set (e.g., the domain of a selection list)
  – May be an infinite set (e.g., the domain of a text box)
• Label of an element:
  – The descriptive information associated with the element, if any
  – Most forms include some descriptive text to help users understand the semantics of the element
Form Representation - Figure

[Figure: a sample search form with three elements]

• Element E1: Label(E1) = "Document Type", Dom(E1) = {Articles, Press Releases, Reports}
• Element E2: Label(E2) = "Company Name", Dom(E2) = {s | s is a text string}
• Element E3: Label(E3) = "Sector", Dom(E3) = {Entertainment, Automobile, Information Technology, Construction}
HiWE's Task-specific Database

• Task-specific information is organized in terms of a finite set of concepts or categories
• Each concept has one or more labels and an associated set of values
• For example, the label 'Company Name' could be associated with the set of values {'IBM', 'Microsoft', 'HP', ...}
HiWE's Task-specific Database

• The concepts are organized in a table called the Label Value Set (LVS) table
• Each entry in the LVS table is of the form (L, V):
  – L: a label
  – V = {v_1, ..., v_n}: a fuzzy set of values
  – The fuzzy set V has an associated membership function M_V that assigns weights, in the range [0, 1], to each member of the set
  – M_V(v_i) is a measure of the crawler's confidence that the assignment of v_i to an element E is semantically meaningful
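One way to model the LVS table is as a mapping from labels to fuzzy value sets, with the membership function stored as per-value weights (a sketch, not HiWE's actual structures; the sample entries are illustrative):

```python
# label L -> fuzzy set V, stored as {value: weight in [0, 1]}
lvs_table: dict[str, dict[str, float]] = {
    "Company Name": {"IBM": 1.0, "Microsoft": 1.0, "HP": 0.8},
    "State": {"California": 1.0, "New York": 0.9},
}

def membership(label: str, value: str) -> float:
    """M_V(v): the crawler's confidence that assigning v is meaningful."""
    return lvs_table.get(label, {}).get(value, 0.0)
```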
HiWE's Matching Function

• For elements with a finite domain:
  – The set of possible values is fixed and can be exhaustively enumerated
  – For example, given an element with Label(E1) = "Document Type" and Dom(E1) = {Articles, Press Releases, Reports}, the crawler can first retrieve all relevant articles, then all relevant press releases, and finally all relevant reports
HiWE's Matching Function

• For elements with an infinite domain:
  – HiWE textually matches the labels of these elements with labels in the LVS table
  – For example, if a textbox element has the label "Enter State", which best matches an LVS entry with the label "State", the values associated with that LVS entry (e.g., "California") can be used to fill the textbox
  – How do we match form labels with LVS labels?
Label Matching

• Two steps in matching form labels with LVS labels:
  – 1. Normalization: includes conversion to a common case and standard style
  – 2. Use of an approximate string matching algorithm to compute minimum edit distances
• HiWE employs the string matching algorithm of D. Lopresti and A. Tomkins, which takes word reordering into account
Label Matching

• Let LabelMatch(E_i) denote the LVS entry with the minimum distance to label(E_i)
• A threshold σ bounds the acceptable distance
• If all LVS entries are more than σ edit operations away from label(E_i), then LabelMatch(E_i) = nil
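A sketch of the two-step matching, with a plain Levenshtein distance standing in for the Lopresti-Tomkins block-edit algorithm (which additionally handles word reordering); the normalization rules are illustrative assumptions:

```python
def normalize(label: str) -> str:
    """Step 1: conversion to a common case and standard style."""
    return " ".join(label.lower().replace(":", " ").split())

def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance (a stand-in for block edit distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def label_match(element_label: str, lvs_labels: list[str], sigma: int) -> str | None:
    """Return the closest LVS label, or None (nil) if all are more than sigma away."""
    if not lvs_labels:
        return None
    target = normalize(element_label)
    best = min(lvs_labels, key=lambda l: edit_distance(target, normalize(l)))
    return best if edit_distance(target, normalize(best)) <= sigma else None
```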
Label Matching

• For each element E_i, compute (V_i, M_{V_i}):
  – If E_i has an infinite domain and (L, V) is the closest matching LVS entry, then V_i = V and M_{V_i} = M_V
  – If E_i has a finite domain, then V_i = Dom(E_i) and M_{V_i}(x) = 1 for all x in V_i
• The set of value assignments is computed as the product of all the V_i's:
  Match(F, LVS) = {[E_1 → v_1, ..., E_n → v_n] : v_i ∈ V_i, i = 1...n}
• Too many assignments? The ranking functions on the next slides prune them
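The Cartesian product of the V_i's can be enumerated directly; a sketch assuming the fuzzy-set model above, where each V_i is a {value: weight} mapping:

```python
from itertools import product

def value_assignments(value_sets: list[dict[str, float]]) -> list[list[tuple[str, float]]]:
    """Enumerate [E_1 -> v_1, ..., E_n -> v_n] for all v_i in V_i.

    Each resulting assignment is a list of (value, weight) pairs, one per
    element; the weights feed the ranking functions on the next slides.
    The product can explode combinatorially, which is why low-ranked
    assignments are pruned.
    """
    return [list(combo) for combo in product(*(vs.items() for vs in value_sets))]
```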
Ranking Value Assignments

• HiWE employs an aggregation function to compute a rank for each value assignment
• Uses a configurable parameter, the minimum acceptable value assignment rank (ρ_min)
• The intent is to improve submission efficiency by only using 'high-quality' value assignments
• We will show three possible aggregation functions
Fuzzy Conjunction

• The rank of a value assignment is the minimum of the weights of all the constituent values:
  ρ_fuz([E_1 → v_1, ..., E_n → v_n]) = min_{i=1...n} M_{V_i}(v_i)
• Very conservative in assigning ranks: it assigns a high rank only if each individual weight is high
Average

• The rank of a value assignment is the average of the weights of the constituent values:
  ρ_avg([E_1 → v_1, ..., E_n → v_n]) = (1/n) Σ_{i=1...n} M_{V_i}(v_i)
• Less conservative than fuzzy conjunction
Probabilistic

• This ranking function treats weights as probabilities
• M_{V_i}(v_i) is the likelihood that the choice of v_i is useful, and 1 − M_{V_i}(v_i) is the likelihood that it is not
• The likelihood of a value assignment being useful is:
  ρ_prob([E_1 → v_1, ..., E_n → v_n]) = 1 − Π_{i=1...n} (1 − M_{V_i}(v_i))
• Assigns a low rank only if all the individual weights are very low
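The three aggregation functions, written over an assignment's list of membership weights; this is a direct transcription of the formulas above:

```python
import math

def rho_fuz(weights: list[float]) -> float:
    """Fuzzy conjunction: the minimum of the constituent weights."""
    return min(weights)

def rho_avg(weights: list[float]) -> float:
    """Average of the constituent weights."""
    return sum(weights) / len(weights)

def rho_prob(weights: list[float]) -> float:
    """Probabilistic: 1 - prod(1 - w_i); low only if every weight is low."""
    return 1.0 - math.prod(1.0 - w for w in weights)
```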
Populating the LVS Table

• HiWE supports a variety of mechanisms for adding entries to the LVS table:
  – Explicit initialization
  – Built-in entries
  – Wrapped data sources
  – Crawling experience
Explicit Initialization

• Supply labels and associated value sets at startup time
• Useful to equip the crawler with the labels it is most likely to encounter
• In the 'semiconductor' example, we supply HiWE with a list of relevant company names and associate the list with the labels 'Company' and 'Company Name'
Built-in Entries

• HiWE has built-in entries for commonly used concepts:
  – Dates and times
  – Names of months
  – Days of the week
Wrapped Data Sources

• The LVS Manager can query data sources through a well-defined interface
• The data source must be 'wrapped' by a program that supports two kinds of queries (a sketch of such an interface follows):
  – Given a set of labels, return a value set
  – Given a set of values, return other values that belong to the same value set
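The two query types could be captured as an abstract interface; a hypothetical sketch, not HiWE's actual API:

```python
from abc import ABC, abstractmethod

class WrappedDataSource(ABC):
    """Interface a wrapped data source exposes to the LVS Manager."""

    @abstractmethod
    def values_for_labels(self, labels: set[str]) -> set[str]:
        """Given a set of labels, return a value set."""

    @abstractmethod
    def expand_values(self, values: set[str]) -> set[str]:
        """Given a set of values, return other values from the same value set."""
```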
Crawling Experience

• Finite-domain form elements are a useful source of labels and associated value sets
• HiWE adds this information to the LVS table
• Effective when a similar label is associated with a finite-domain element in one form and with an infinite-domain element in another
Computing Weights

• A new value added to the LVS table must be assigned a suitable weight
• Values from explicit initialization and built-in entries have fixed weights
• Values obtained from external data sources or through the crawler's own activity are assigned weights that vary with time
Initial Weights

• For external data sources: computed by the respective wrappers
• For values directly gathered by the crawler:
  – Finite-domain element E with domain Dom(E)
  – M_{Dom(E)}(x) = 1 iff x ∈ Dom(E)
  – Three cases arise when incorporating Dom(E) into the LVS table
Updating LVS – Case 1

• The crawler successfully extracts label(E) and computes LabelMatch(E) = (L, V):
  – Replace the (L, V) entry by the entry (L, V ∪ Dom(E))
  – M_{V ∪ Dom(E)}(x) = max(M_V(x), M_{Dom(E)}(x))
  – Intuitively, Dom(E) provides new elements to the value set and 'boosts' the weights of existing elements
Updating LVS – Case 2

• The crawler successfully extracts label(E) but LabelMatch(E) = nil:
  – A new entry (label(E), Dom(E)) is created in the LVS table
Updating LVS – Case 3

• The crawler cannot extract label(E):
  – For each entry (L, V), compute a score: (Σ_{x ∈ Dom(E)} M_V(x)) / |Dom(E)|
  – Identify the entry with the maximum score, (L_max, V_max), and the value of that maximum score, s_max
  – Replace the entry (L_max, V_max) with the new entry (L_max, V_max ∪ Dom(E))
  – Confidence of the new values: M(x) = s_max · M_{Dom(E)}(x)
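The three update cases, sketched over the label → {value: weight} model used earlier; the helper names are hypothetical, and the use of max() to combine old and new weights in Case 3 is an assumption:

```python
def update_lvs(lvs: dict[str, dict[str, float]],
               label: str | None,
               dom: set[str],
               matched: str | None) -> None:
    """Fold a finite domain Dom(E) into the LVS table (cases 1-3)."""
    m_dom = {x: 1.0 for x in dom}  # M_Dom(E)(x) = 1 for x in Dom(E)
    if label is not None and matched is not None:
        # Case 1: merge Dom(E) into the matched entry, boosting weights
        entry = lvs[matched]
        for x, w in m_dom.items():
            entry[x] = max(entry.get(x, 0.0), w)
    elif label is not None:
        # Case 2: create a new entry (label(E), Dom(E))
        lvs[label] = m_dom
    else:
        # Case 3: attach Dom(E) to the entry with the best overlap score
        def score(entry: dict[str, float]) -> float:
            return sum(entry.get(x, 0.0) for x in dom) / len(dom)
        l_max = max(lvs, key=lambda l: score(lvs[l]))
        s_max = score(lvs[l_max])
        for x, w in m_dom.items():  # new values get confidence s_max * M_Dom(E)(x)
            lvs[l_max][x] = max(lvs[l_max].get(x, 0.0), s_max * w)
```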
Configuring HiWE

• Initialization of the crawling activity includes:
  – Set of sites to crawl
  – Explicit initialization for the LVS table
  – Set of data sources
  – Label matching threshold (σ)
  – Minimum acceptable value assignment rank (ρ_min)
  – Value assignment aggregation function
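These configuration items might be bundled as follows; the names and default values are purely illustrative, and rho_avg refers to the aggregation-function sketch above:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class CrawlerConfig:
    seed_sites: list[str]                             # set of sites to crawl
    lvs_init: dict[str, dict[str, float]]             # explicit LVS initialization
    data_sources: list = field(default_factory=list)  # wrapped data sources
    sigma: int = 3                                    # label matching threshold (example value)
    rho_min: float = 0.6                              # min acceptable assignment rank (example value)
    rank: Callable[[list[float]], float] = rho_avg    # aggregation function
```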
Introducing LITE

• LITE: Layout-based Information Extraction Technique
• The physical layout of a page is also used to aid in extraction
• For example, a piece of text that is physically adjacent to a form element is very likely a description of that element
• Unfortunately, this semantic association is not always reflected in the underlying HTML of the Web page
Layout-based Information Extraction Technique
The Challenge

• Accurate extraction of the labels and domains of form elements
• Elements that are visually close on the screen may be separated arbitrarily in the actual HTML text
• Even where HTML provides a facility for expressing semantic relationships, it is not used in the majority of pages
• Accurate page layout is a complex process
• Even a crude approximate layout of portions of a page can yield very useful semantic information
Form Analysis in HiWE

• LITE-based heuristic:
  – Prune the form page and isolate the elements which directly influence the layout
  – Approximately lay out the pruned page using a custom layout engine
  – Identify the pieces of text that are physically closest to the form element (these are the candidates)
  – Rank each candidate using a variety of measures
  – Choose the highest-ranked candidate as the label
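The proximity step of this heuristic, sketched with axis-aligned bounding boxes produced by an (approximate) layout; the candidate-ranking measures are reduced here to plain center-to-center distance, which is a simplification of what the slides describe:

```python
from math import hypot

Box = tuple[float, float, float, float]  # (x, y, width, height)

def center(box: Box) -> tuple[float, float]:
    x, y, w, h = box
    return x + w / 2, y + h / 2

def closest_text_candidates(element_box: Box,
                            text_boxes: dict[str, Box],
                            k: int = 3) -> list[str]:
    """Return the k pieces of text physically closest to the form element.

    A real implementation would also weight direction (labels usually sit
    to the left of or above the element), font size, and other measures
    before choosing the highest-ranked candidate as the label.
    """
    ex, ey = center(element_box)

    def dist(text: str) -> float:
        tx, ty = center(text_boxes[text])
        return hypot(ex - tx, ey - ty)

    return sorted(text_boxes, key=dist)[:k]
```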
Pruning Before Partial Layout

[Figure: pruning a form page before partial layout]
LITE - Figure

[Figure: a DOM parser produces a DOM representation of the page; pruning yields a pruned page together with the list of elements and submission info; partial layout then extracts labels and domain values, which make up the Internal Form Representation. Key idea in LITE: physical page layout embeds significant semantic information.]
Experiments

• A number of experiments were conducted to study the performance of HiWE
• We will see how performance depends on:
  – Minimum form size
  – Crawler input to the LVS table
  – Different ranking functions
Parameter Values for Task 1

[Table: parameter values used for Task 1]

• Task 1: news articles, reports, press releases and white papers relating to the semiconductor industry, dated sometime in the last ten years
Variation of Performance with α

[Figure: variation of performance with the minimum form size α]
Effect of Crawler Input to LVS

[Figure: effect of crawler input to the LVS table]
Different Ranking Functions

[Figure: submission efficiency under the three ranking functions]

• When using ρ_fuz and ρ_avg, the crawler's submission efficiency is mostly above 80%
• ρ_prob performs poorly
• ρ_avg submits more forms than ρ_fuz (it is less conservative)
Label Extraction

[Figure: label extraction accuracy]

• The LITE-based heuristic achieved an overall accuracy of 93%
• The test set was manually analyzed
Conclusion

• Addressed the problem of extending current-day crawlers to build repositories that include pages from the 'Hidden Web'
• Presented a simple operational model of a hidden Web crawler
• Described the implementation of a prototype crawler, HiWE
• Introduced LITE, a technique for layout-based information extraction
Bibliography

• S. Raghavan and H. Garcia-Molina. Crawling the Hidden Web. Stanford University, 2001.
• BrightPlanet.com white papers.
• D. Lopresti and A. Tomkins. Block Edit Models for Approximate String Matching.