Automatic Data Extraction Crawling the Hidden Web


             Laying the groundwork for new Information Services




             NRC ‘06, Kai Simon

                        Databases and Information Systems
                                          Research Group
                             Computer Science Department
                         Albert-Ludwigs-University Freiburg
page 2
         Outline




         Introduction and Motivation

         Making Web Data Accessible

         Semantic Data Annotation

         Outlook and Conclusion
page 3
         Motivation




                     Most information available on the Web is accessible only to humans,
                     through presentation-oriented HTML pages.
                     We still lack techniques which enable machines [agents]
                 –      to extract and
                 –      to understand
                 such information resources in order to act on behalf of humans.

                                                       II. Semantic Proximity




         I. Extracted piece of
              information


                       → Relationship of Cape Canaveral and Space Shuttle?
page 4
         Part I – Web Data Extraction




         Approaches, in increasing degree of automation:

         Wrapper Programming Languages [Florid, Pillow, TSIMMIS]
         [Use logic programming languages or grammars]
          –   Extraction result quite exact
          –   Difficult to write and maintain

         Wrapper Induction [WIEN, STALKER, WL2]
         [Use machine learning techniques to generate extraction rules]
          –   Requires a manually labeled set of training pages
          –   Learns extraction rules and extracts objects from congeneric pages

         Automatic Data Extraction [RoadRunner, MDR, ViNTs, DEPTA]
         [Main assumption: information is organized in regularly structured data objects]
          –   Detects objects [data records] with similar structures in single/multiple Web page(s)
page 5
         Automatic Data Extraction System




         ViPER [ Visual Perception-based Extraction of Records ]

         Features
            Data extraction process
             –   operates on a single Web page
             –   needs at least two similar consecutive data records
             –   finds data record segmentations visually
             –   visually weights data regions

             Data item alignment
             –   global alignment techniques (bioinformatics)
             –   incorporate string similarity and tree information
page 6
         Extraction Process




                      Scan the Web page for
                      similar data records

                      use visual information to
                      •   segment the data records
                      •   compute the relevance
                          according to the location
                          inside the Web page.

                      extract similar data records
                      with the highest relevance.
page 7
         Pattern Extraction




         Pattern Extraction
          –   DOM tree and the abstract tag sequence representation
          –   Identify structural characteristics to narrow down the search space
          –   Use edit distance to compute the similarity between two sequences
         Record Alignment
         Benchmark
page 8
         HTML Representations




    DOM Tree (labeled ordered rooted tree)
    Annotated with bounding box information: height, width, position

    [Figure: DOM tree of a page with the title text "Google-Search: semantic". <HTML> has the children <HEAD> (<META>, <TITLE>) and <BODY> (<TABLE>, <TABLE>, DIV, BR, CENTER); the tables contain <TR> and <TD> nodes, the DIV contains P, BLOCKQUOTE, P.]

   Abstract Tag Sequence
   <HTML><HEAD><META><TITLE>TEXT</TITLE></HEAD><BODY><TABLE>…..</BODY></HTML>
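
The abstract tag sequence above can be produced by walking the page once and emitting one token per tag, with every text node replaced by the placeholder TEXT. The following is a minimal sketch using Python's standard `html.parser`; the class name `TagSequence`, the function name `abstract_tag_sequence` and the rule of collapsing adjacent text into one token are assumptions made for illustration, not part of ViPER.

```python
from html.parser import HTMLParser


class TagSequence(HTMLParser):
    """Collects an abstract tag sequence: tags are kept, text becomes TEXT."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Attributes are dropped on purpose; only the tag name is kept.
        self.tokens.append(f"<{tag.upper()}>")

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag.upper()}>")

    def handle_data(self, data):
        # Ignore whitespace-only text and merge adjacent text chunks.
        if data.strip() and (not self.tokens or self.tokens[-1] != "TEXT"):
            self.tokens.append("TEXT")


def abstract_tag_sequence(html):
    """Return the list of tokens for an HTML string."""
    parser = TagSequence()
    parser.feed(html)
    parser.close()
    return parser.tokens


page = (
    "<html><head><meta charset='utf-8'><title>Google-Search: semantic</title>"
    "</head><body><table><tr><td>WEB</td></tr></table></body></html>"
)
print("".join(abstract_tag_sequence(page)))
```

The resulting token list is the input a sequence comparison such as edit distance can work on.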
page 9
         Structural Characteristics




                            DIV



         P   BLOCKQUOTE     P     P




                                          A data record is formed by a
                                      subtree located under a single node.
page 10
         Structural Characteristics




                                                     CENTER




          FONT   TABLE   A   BR   FONT   BR   BR   TABLE   A   BR   FONT   BR   BR    TABLE   A   BR   FONT    BR   BR




                                                                           A data record is formed by
                                                                            multiple sibling nodes
                                                                       and their corresponding subtrees.
                                                                                     forest pattern
page 11
         Repetitive Sub-Structures



   [Figure: TABLE with two <TD> subtrees. The first contains <A> and one <TR>, the second contains <A> and two <TR>; every <A> and <TR> holds a text node.]

   Dynamic programming: computing the edit distance metric.

                 <td>   <a>    text   </a>   <tr>   text   </tr>  <tr>   text   </tr>  </td>
          0      1      2      3      4      5      6      7      8      9      10     11
   <td>   1      0      1      2      3      4      5      6      7      8      9      10
   <a>    2      1      0      1      2      3      4      5      6      7      8      9
   text   3      2      1      0      1      2      3      4      5      6      7      8
   </a>   4      3      2      1      0      1      2      3      4      5      6      7
   <tr>   5      4      3      2      1      0      1      2      3      4      5      6
   text   6      5      4      3      2      1      0      1      2      3      4      5
   </tr>  7      6      5      4      3      2      1      0      1      2      3      4
   </td>  8      7      6      5      4      3      2      1      1      2      3      3

   Consider two sub-trees to be similar if the edit distance is below a certain threshold.
   Speed up the computation using beam search techniques.
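
The matrix on this slide can be reproduced with the standard dynamic-programming recurrence for edit distance over token sequences. The sketch below assumes unit costs for insertion, deletion and substitution; the function names and the threshold value passed to `similar` are illustrative, and the beam search speed-up mentioned on the slide is not included.

```python
def edit_distance(a, b):
    """Edit distance between two token sequences with unit costs."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def similar(a, b, threshold):
    """Two sub-trees count as similar if their distance is below the threshold."""
    return edit_distance(a, b) < threshold


# The two <TD> sub-trees of the slide as tag sequences.
first = ["<td>", "<a>", "text", "</a>", "<tr>", "text", "</tr>", "</td>"]
second = ["<td>", "<a>", "text", "</a>", "<tr>", "text", "</tr>",
          "<tr>", "text", "</tr>", "</td>"]

print(edit_distance(first, second))         # bottom-right cell of the matrix: 3
print(similar(first, second, threshold=4))  # threshold 4 is a made-up value
```

Keeping only the previous row needs memory linear in the length of the shorter axis, which matters when many candidate sub-trees are compared.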
page 12
         Visual Data Record Segmentation




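          [Figure: four data records (1 to 4) rendered under a DIV node, together with the x-profile and the y-profile of the rendered page region.]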
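
One possible reading of the x-profile and y-profile named on this slide, sketched under assumptions: project the bounding boxes of the rendered nodes onto one axis, count how many boxes cover each pixel, and treat empty runs between covered runs as candidate separators between data records. The box format `(x, y, width, height)`, the function names and the zero-coverage rule are hypothetical choices for illustration and are not taken from ViPER.

```python
def profile(boxes, axis, length):
    """Count how many boxes cover each pixel along one axis (0 = x, 1 = y)."""
    counts = [0] * length
    for box in boxes:
        start = box[axis]              # x or y of the top-left corner
        end = start + box[axis + 2]    # plus width or height
        for p in range(max(start, 0), min(end, length)):
            counts[p] += 1
    return counts


def gaps(counts):
    """Return (start, end) runs of zero coverage that lie between covered runs."""
    result = []
    start = None
    seen_content = False
    for p, c in enumerate(counts):
        if c == 0:
            if seen_content and start is None:
                start = p
        else:
            if start is not None:
                result.append((start, p))
                start = None
            seen_content = True
    return result


# Three stacked records, each 100 px wide and 40 px high, 10 px apart.
boxes = [(0, 0, 100, 40), (0, 50, 100, 40), (0, 100, 100, 40)]
y_cuts = gaps(profile(boxes, axis=1, length=150))
x_cuts = gaps(profile(boxes, axis=0, length=120))
print(y_cuts, x_cuts)
```

In this toy layout the y-profile yields two horizontal cuts while the x-profile yields none, which corresponds to records stacked in a single column.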
page 13
         Visual Data Region Weighting




      Differently structured data records
       and the data region they span



      Pattern
page 14
         Record Alignment




          Pattern Extraction
          Record Alignment
           –   Use global alignment techniques [bioinformatics]
           –   Incorporate
                 • HTML-structure information
                 • string similarity functions [Jaro-Winkler]
          Benchmark
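
A pairwise global alignment in the style of Needleman-Wunsch illustrates how structure and string similarity can enter one scoring function. This is a sketch under assumptions: tags only match identical tags, text items are scored by `difflib.SequenceMatcher` as a stand-in for Jaro-Winkler (which is not in the Python standard library), and the gap penalty and score range are made-up values. It aligns two records only; the multiple alignment of all records is not covered.

```python
from difflib import SequenceMatcher

GAP = -1.0  # hypothetical gap penalty


def score(a, b):
    """Score a pair of items; an item is a tag such as '<BR>' or a text string."""
    if a.startswith("<") or b.startswith("<"):
        return 1.0 if a == b else -1.0   # structure: tags must be identical
    # Text against text: map the similarity ratio from [0, 1] to [-1, 1].
    return 2.0 * SequenceMatcher(None, a, b).ratio() - 1.0


def align(xs, ys):
    """Globally align two item sequences; None marks a gap."""
    n, m = len(xs), len(ys)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * GAP
    for j in range(1, m + 1):
        dp[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(dp[i - 1][j - 1] + score(xs[i - 1], ys[j - 1]),
                           dp[i - 1][j] + GAP,
                           dp[i][j - 1] + GAP)
    # Trace back from the bottom-right corner to recover the aligned pairs.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + score(xs[i - 1], ys[j - 1]):
            pairs.append((xs[i - 1], ys[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or dp[i][j] == dp[i - 1][j] + GAP):
            pairs.append((xs[i - 1], None))
            i -= 1
        else:
            pairs.append((None, ys[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


record_a = ["<A>", "Site", "</A>", "Amazon Book Shop", "<BR>"]
record_b = ["<A>", "Site", "</A>", "<BR>"]
for left, right in align(record_a, record_b):
    print(left, "|", right)
```

The optional text item of the first record ends up opposite a gap, which is the behaviour needed for records with optional sub-parts.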
page 15
         Post-Processing




                    Data record alignment
page 16
         Data Record Alignment




          Why do we need to align the data records?
              –   aligned data items are easy to label and store in DB
              –   export the result as XML
              –   synchronize with other results
              –   schema matching


          Why is it so complex?
              –   optional, repetitive sub-parts
              –   different sub-parts


          [Optimal multiple alignment has exponential time complexity]
page 17
         Global Alignment




                                            Sequence information
 <P><A>T</A><BR> T<A>T</A>T <TABLE><TR><TD>T<BR> T<BR>T <NOBR> <A>T</A>T<A>T</A> </NOBR> </TD></TR></TABLE> </P>

 <BLOCKQ> T<A>T</A>T <BR> <NOBR> T<A>T</A> </NOBR> <TABLE><TR><TD>T<BR> T<BR>T<A>T</A>T<A>T</A> </TD></TR></TABLE>


 <P> T<A>T</A>T <TABLE><TR><TD>T<BR> T </TD></TR></TABLE> <NOBR> T </NOBR> </P>




 Text content information and tree structure information
 [Figure: the three records above drawn as trees rooted at P, BLOCKQ and P.]
page 18
         Global Alignment




 [Figure: the three data records as DOM trees rooted at P, BLOCKQ and P, with their data items aligned column by column.]

 Aligned data items ("-" marks a gap):

 Item   Record 1                                  Record 2                                  Record 3
 1      <A>Top 10 link</A>                        -                                         -
 2      Translate: <A>Site</A> Amazon Book Shop   Translate: <A>Site</A> Amazon River       Translate: <A>Site</A> Amazonas
 3      -                                         <NOBR>Related<A>sites</A></NOBR>          -
 4      <TABLE><TR><TD> Description<BR>           <TABLE><TR><TD> Description<BR>           <TABLE><TR><TD> Description<BR>
 5      21.11.04 - 13 kb <BR>                     21.11.04 - 14 kb <BR>                     -
 6      Online shopping …                         Some facts and information ...            River rafting, ...
 7      <A>Cached</A>- <A>Similar</A>             <A>Cached</A>- <A>Similar</A>             -
 8      -                                         -                                         <NOBR>New Link</NOBR>
page 19
         Data Representation




            XML                          RuleML Fact
page 20
         Benchmarking




          Pattern Extraction
          Record Alignment
          Benchmark
page 21
         Data Extraction Results




          Compared ViPER with the results reported for ViNTS
          [Figure: bar charts of recall and precision (%), bar labels per system]
             Data Set 2:   ViNTS 97.6, 98.1      ViPER 99.1, 97.1
             Data Set 3:   ViNTS 98.7, 98.7      MDR 52.8, 87.7      ViPER 98.0, 98.6

          Compared ViNTS and ViPER on TBDW [WWW’04]
          [Figure: bar chart of recall and precision (%), bar labels per system]
             TBDW:         ViNTS 89.2, 93.5      ViPER 97.6, 98.5

          ViNTS [text document pages]
             • searches for horizontal separator tags
             • requires at least 4 data records

          MDR [table structures]
             • built to extract data records from tables
             • reports all similar data records
page 22
         Part II - Semantic Proximity




          Objective
                    Empower machines to understand and extract meanings from
                    human-crafted information resources.
                      – Text Mining
                      – Semantic Web

           –   Status quo
                    Use of lexical thesauri, such as WordNet or Roget's Thesaurus.
                    They work well for concepts, e.g., BICYCLE and CAR.

           –   Drawback
                –   These knowledge bases lack the majority of instances, e.g.,
                    LENOVO and IBM.
                –   They do not support composed-term requests, such as
                    "Software Engineering" and "XBOX Games".

               We propose a more informed approach using the Web
               and taxonomies as background knowledge.
page 23
         Taxonomy-based Profile Generation

          ODP (Open Directory Project) taxonomy, DMOZ
          –   Tree-structured taxonomy containing 590,000 categories.
          –   Contains 5.1 million classified Web sites.
          –   69,000 volunteer editors help to categorize the Web.


          Basic concept
          –   Similarity computation is based on category vectors

                                               DMOZ-Taxonomy
page 24
         Semantic Proximity




                 qa = «Space Shuttle»

 Rank     Category
 1        Science > Technology > Space > NASA
 2        Science > Technology > Space > Space Shuttle
 3        Business > Aerospace and Defense > Space
 4        …

 [Figure: profiling against the Open Directory Project taxonomy. Top has the branches Science and Business; the path Science > Technology > Space > NASA is shown.]

 Problem: sparse vector                     va = (0, …, 0, … 2.5, …, 2.5, … 5, …0)^T
page 25
         Taxonomy-based Profile Generation

          • Idea
            – Exploit taxonomic structure to make profile vector dense
            – Inference of partial relatedness for super-topics
            – Score quantity depends on
                 – the number of siblings
                 – the rank of the original search result leading to the entry

          [Figure: scores accumulating along the taxonomy.]

                   va = (0.5, …, 2, … 2.5, …, 2.5, … 5, …5)^T
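
The idea of making the profile dense can be sketched as follows: every ranked category receives a score, and a share of that score is passed on to each super-topic. The slide only states that the amount depends on the number of siblings and on the rank of the search result; the concrete formulas below (`top / rank` for the rank weight, division by the number of children of the parent at every step upward), the toy taxonomy `CHILDREN` and all names are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy taxonomy: category path -> names of its child categories.
CHILDREN = {
    "Top": ["Science", "Business"],
    "Top/Science": ["Technology"],
    "Top/Science/Technology": ["Space"],
    "Top/Science/Technology/Space": ["NASA", "Space Shuttle"],
    "Top/Business": ["Aerospace and Defense"],
    "Top/Business/Aerospace and Defense": ["Space"],
}


def rank_score(rank, top=5.0):
    """Assumed rank weighting: the first hit gets `top`, later hits get less."""
    return top / rank


def profile(ranked_paths):
    """Accumulate scores for the hit categories and all of their super-topics."""
    scores = defaultdict(float)
    for rank, path in enumerate(ranked_paths, start=1):
        score = rank_score(rank)
        node = path
        while True:
            scores[node] += score
            if "/" not in node:            # reached the root
                break
            parent = node.rsplit("/", 1)[0]
            siblings = len(CHILDREN[parent]) - 1
            score /= siblings + 1          # more siblings, smaller share
            node = parent
    return dict(scores)


hits = [
    "Top/Science/Technology/Space/NASA",
    "Top/Science/Technology/Space/Space Shuttle",
    "Top/Business/Aerospace and Defense/Space",
]
for category, value in sorted(profile(hits).items()):
    print(f"{value:6.3f}  {category}")
```

With three hits the sparse profile would hold three non-zero entries; after propagation every category on the three paths carries a score.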
page 26
         Semantic Proximity




          Request qa falls into the category with the descriptor dk.
          Request qb falls into the category with the descriptor dm.

          Considerable branch overlap ⇒ high similarity

          [Figure: taxonomy branches leading to dk and dm, marked with the scores for qa and the scores for qb.]
page 27
         Similarity Measure




          • Similarity computation
            – Generate profile vectors vi, vj for concept/instance ai, aj
            – Compute Pearson's correlation between vi, vj

          c(a_i, a_j) = \frac{\sum_{k=0}^{|D|} (v_{ik} - \bar{v}_i) \cdot (v_{jk} - \bar{v}_j)}
                             {\sqrt{\sum_{k=0}^{|D|} (v_{ik} - \bar{v}_i)^2 \cdot \sum_{k=0}^{|D|} (v_{jk} - \bar{v}_j)^2}}



          • Benefits
            – Profile vectors become dense by virtue of score inference
            – We obtain an overlap for ai, aj even when they have no category in common
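
Pearson's correlation between two profile vectors can be computed directly from its definition. The sketch below is plain Python without external libraries; the function name `pearson`, the example vectors and the convention of returning 0.0 for a constant vector are assumptions for illustration.

```python
from math import sqrt


def pearson(v_i, v_j):
    """Pearson's correlation of two equally long score vectors."""
    if len(v_i) != len(v_j) or not v_i:
        raise ValueError("vectors must be non-empty and of equal length")
    n = len(v_i)
    mean_i = sum(v_i) / n
    mean_j = sum(v_j) / n
    numerator = sum((a - mean_i) * (b - mean_j) for a, b in zip(v_i, v_j))
    denominator = sqrt(sum((a - mean_i) ** 2 for a in v_i)
                       * sum((b - mean_j) ** 2 for b in v_j))
    if denominator == 0:
        return 0.0  # assumed convention: a constant vector correlates with nothing
    return numerator / denominator


# Made-up profile vectors over the same five categories.
v_a = [0.5, 2.0, 2.5, 2.5, 5.0]
v_b = [0.4, 1.5, 2.0, 3.0, 4.5]
print(round(pearson(v_a, v_b), 3))
```

Because the value depends only on deviations from each vector's mean, two profiles with different overall score levels can still be rated as close.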
page 28
         Survey – Semantic Proximity




   • 51 participants
   • 5-point Likert scale, ranging from no proximity to synonymy
page 29
         Example
page 30
         Finding Relationships Between Resources


 resource: inkjettonerrefills.com.au




                                         HP 27 black vs. Color Inkjet

                                       proximity value :       71%
                                       relationship:
                                       hardware/peripherals/printers
    resource: hp.com
page 31
         Finding information sources




             qa = «Mars Mission»

             [Figure: profile generation for qa, followed by computing the most related network of resources (Semantic Network). Score vectors are shown for the categories hardware/peripherals/printers, science/technology/space/nasa and Top/Regional/North_America/Canada/Society_and_Culture/Politics.]
page 32
         Conclusion & Research




          • Major contributions
            – Visual-Web data extraction (ViPER) [CIKM'05]

            – Taxonomy-driven semantic proximity computation
              [submitted for publication]

            – Taxonomy-driven query planning [in progress]

          • Research co-operations
            – Describe patterns and alignment results with rules
            – Using workflows for Web site monitoring
            – Using RuleML in a restricted domain for schema matching

								