Docstoc

EECS 595 LING 541 SI 661 Natural Language Processing

Document Sample
EECS 595  LING 541  SI 661 Natural Language Processing Powered By Docstoc
					Web Information Extraction



      3rd Oct 2007
Information Extraction
        (Slides based on those by
  Ray Mooney, Craig Knoblock, Dan Weld, Perry,
  Subbarao Kambhampati, Bing Liu)
Introduction

   The Web is perhaps the single largest data source in the
    world.
   Much of the Web (content) mining is about
       Data/information extraction from semi-structured objects
        and free text, and
       Integration of the extracted data/information
   Due to the heterogeneity and lack of structure, mining and
    integration are challenging tasks.
   This talk gives an overview.
Introduction
   Web mining aims to develop new techniques to extract
    useful knowledge from the Web
   Web offers unprecedented opportunity and challenges to
    NLP
      Huge amount of information accessible
      Wide and diverse coverage
      Information of all types – structured table, text,
       multimedia data …
      Semi-structured (html)
      Linked
      Redundant
   Noisy. main content, advertisement, navigation panel,
    copyright notices, …
   Surface Web and Deep Web
        Surface web: pages that can be browsed using q web
         browser
        Deep web: databases accessible through parameterized
         query interfaces
   Services.
   Dynamic
   Virtual Society. Interactions among people,
    organizations, and systems.
Information Extraction (IE)
    Identify specific pieces of information (data) in a
     unstructured or semi-structured textual document.
    Transform unstructured information in a corpus of
     documents or web pages into a structured database.
    Applied to different types of text:
       Newspaper articles
       Web pages
       Scientific articles
       Newsgroup messages
       Classified ads
       Medical notes
Information Extraction vs. NLP?

   Information extraction is attempting to find some of the
    structure and meaning in the hopefully template driven
    web pages.
   As IE becomes more ambitious and text becomes more
    free form, then ultimately we have IE becoming equal to
    NLP.
   Web does give one particular boost to NLP
      Massive corpora..
Sample Job Posting
Subject: US-TN-SOFTWARE PROGRAMMER
Date: 17 Nov 1996 17:37:29 GMT
Organization: Reference.Com Posting Service
Message-ID: <56nigp$mrs@bilbo.reference.com>

SOFTWARE PROGRAMMER

Position available for Software Programmer experienced in generating software for PC-
Based Voice Mail systems. Experienced in C Programming. Must be familiar with
communicating with and controlling voice cards; preferable Dialogic, however, experience
with others such as Rhetorix and Natural Microsystems is okay. Prefer 5 years or more
experience with PC Based Voice Mail, but will consider as little as 2 years. Need to find a
Senior level person who can come on board and pick up code with very little training.
Present Operating System is DOS. May go to OS-2 or UNIX in future.

Please reply to:
Kim Anderson
AdNET
(901) 458-2888 fax
kimander@memphisonline.com
Extracted Job Template
   computer_science_job
   id: 56nigp$mrs@bilbo.reference.com
   title: SOFTWARE PROGRAMMER
   salary:
   company:
   recruiter:
   state: TN
   city:
   country: US
   language: C
   platform: PC \ DOS \ OS-2 \ UNIX
   application:
   area: Voice Mail
   req_years_experience: 2
   desired_years_experience: 5
   req_degree:
   desired_degree:
   post_date: 17 Nov 1996
   Amazon Book Description
….
</td></tr>
</table>
<b class="sans">The Age of Spiritual Machines : When Computers Exceed Human Intelligence</b><br>
<font face=verdana,arial,helvetica size=-1>
by <a href="/exec/obidos/search-handle-url/index=books&field-author=
         Kurzweil%2C%20Ray/002-6235079-4593641">
Ray Kurzweil</a><br>
</font>
<br>
<a href="http://images.amazon.com/images/P/0140282025.01.LZZZZZZZ.jpg">
<img src="http://images.amazon.com/images/P/0140282025.01.MZZZZZZZ.gif" width=90
   height=140 align=left border=0></a>
<font face=verdana,arial,helvetica size=-1>
<span class="small">
<span class="small">
<b>List Price:</b> <span class=listprice>$14.95</span><br>
<b>Our Price: <font color=#990000>$11.96</font></b><br>
<b>You Save:</b> <font color=#990000><b>$2.99 </b>
(20%)</font><br>
</span>
<p> <br>…
<p> <br>
Extracted Book Template

 Title: The Age of Spiritual Machines :
        When Computers Exceed Human Intelligence
 Author: Ray Kurzweil
 List-Price: $14.95
 Price: $11.96
 :
 :
Product information/ Comparison
shopping, etc.

    Need to learn to extract info from online vendors
    Can exploit uniformity of layout, and (partial)
     knowledge of domain by querying with known products
    Early e.g., Jango Shopbot (Etzioni and Weld)
       Gives convenient aggregation of online content
    Bug: originally not popular with vendors
       Make personal agents rather than web services?
    This seems to have changed (e.g., Froogle)
  Commercial information…




A book,
Not a toy


  Title




Need this
  price
Information Extraction

   Information extraction systems
      Find and understand the limited relevant parts of texts
          Clear, factual information (who did what to whom when?)
     Produce   a structured representation of the relevant
       information: relations (in the DB sense)
      Combine knowledge about language and a domain
      Automatically extract the desired information
   E.g.
      Gathering earnings, profits, board members, etc. from
       company reports
      Learn drug-gene product interactions from medical
       research literature
      “Smart Tags” (Microsoft) inside documents
Using information extraction to
populate knowledge bases




                       http://protege.stanford.edu/
Named Entity Extraction

   The task: find and classify names in text, for example:

                             [ORG] said on Thursday it
     The European Commission said on Thursday it disagreed with
     German advice. German [MISC] advice.
     disagreed with
                 [LOC] and Britain Fischler 's Fischler [PER]
     Only France and Britain backed[LOC] backed proposal .
     's proposal .
     “What we have to be extremely careful of is how other
     countrieshave to be extremely careful of is how other
     “What we are going to take Germany 's lead”, Welsh
     countries are going to take Germany 's lead”, Lloyd
     National Farmers ' Union ( NFU ) chairman John Welsh Jones
     National Farmers
     said on BBC radio'.Union [ORG] ( NFU [ORG] ) chairman John
     Lloyd Jones [PER] said on BBC [ORG] radio .


   The purpose:
      … a lot of information is really associations between named entities.
      … for question answering, answers are usually named entities.
      … the same techniques apply to other slot-filling classifications.
CoNLL (2003) Named Entity
Recognition task

Task: Predict semantic label of each word in text


  Foreign             NNP    I-NP           ORG
  Ministry            NNP    I-NP           ORG
  spokesman           NN     I-NP           O


                                                    }
  Shen                NNP    I-NP           PER         Standard
                                                        evaluation
  Guofang             NNP    I-NP           PER         is per
  told                VBD    I-VP           O           entity, not
                                                        per token
  Reuters             NNP    I-NP           ORG
  :                   :      :              :
Precision/Recall/F1 for IE

   Recall and precision are straightforward for tasks like IR
    and text categorization, where there is only one grain
    size (documents)
   The measure behaves a bit funnily for IE/NER when
    there are boundary errors (which are common):
      First Bank of Chicago announced earnings …
   This counts as both a fp and a fn
   Selecting nothing would have been better
   Some other systems (e.g., MUC scorer) give partial
    credit (according to complex rules)
Template Types
   Slots in template typically filled by a substring from the document.
   Some slots may have a fixed set of pre-specified possible fillers that
    may not occur in the text itself.
      Terrorist act: threatened, attempted, accomplished.
      Job type: clerical, service, custodial, etc.
      Company type: SEC code
   Some slots may allow multiple fillers.
      Programming language
   Some domains may allow multiple extracted templates per
    document.
      Multiple apartment listings in one ad
Task: Wrapper Induction

   Learning wrappers is wrapper induction
       Sometimes, the relations are structural.
            Web pages generated by a database.
            Tables, lists, etc.
       Can’t computers automatically learn the patterns a human
        wrapper-writer would use?
       Wrapper induction is usually regular relations which can be
        expressed by the structure of the document:
            the item in bold in the 3rd column of the table is the price
   Wrapper induction techniques can also learn:
       If there is a page about a research project X and there is a link
        near the word ‘people’ to a page that is about a person Y then Y
        is a member of the project X.
            [e.g, Tom Mitchell’s Web->KB project]
Web Extraction

   Many web pages are generated automatically from an underlying
    database.
   Therefore, the HTML structure of pages is fairly specific and regular
    (semi-structured).
   However, output is intended for human consumption, not machine
    interpretation.
   An IE system for such generated pages allows the web site to be
    viewed as a structured database.
   An extractor for a semi-structured web site is sometimes referred to
    as a wrapper.
   Process of extracting from such pages is sometimes referred to as
    screen scraping.
Web Extraction using DOM Trees

   Web extraction may be aided by first parsing web pages
    into DOM trees.
   Extraction patterns can then be specified as paths from
    the root of the DOM tree to the node containing the text
    to extract.
   May still need regex patterns to identify proper portion of
    the final CharacterData node.
Sample DOM Tree Extraction
                  HTML

                                                         Element

       HEADER                BODY
                                                       Character-Data


                       B             FONT


                Age of Spiritual              A
                                    by
                  Machines


                                             Ray
                                            Kurzweil


  Title: HTMLBODYBCharacterData
  Author: HTML BODYFONTA CharacterData
Wrappers: Simple Extraction Patterns
    Specify an item to extract for a slot using a regular expression
     pattern.
       Price pattern: “\b\$\d+(\.\d{2})?\b”
    May require preceding (pre-filler) pattern to identify proper
     context.
       Amazon list price:
             Pre-filler pattern: “<b>List Price:</b> <span class=listprice>”
             Filler pattern: “\$\d+(\.\d{2})?\b”
    May require succeeding (post-filler) pattern to identify the end of
     the filler.
       Amazon list price:
             Pre-filler pattern: “<b>List Price:</b> <span class=listprice>”
             Filler pattern: “.+”
             Post-filler pattern: “</span>”
Simple Template Extraction
   Extract slots in order, starting the search for the filler of the n+1 slot
    where the filler for the nth slot ended. Assumes slots always in a
    fixed order.
       Title
       Author
       List price
       …
   Make patterns specific enough to identify each filler always starting
    from the beginning of the document.
Pre-Specified Filler Extraction

   If a slot has a fixed set of pre-specified possible fillers,
    text categorization can be used to fill the slot.
      Job category
      Company type
   Treat each of the possible values of the slot as a
    category, and classify the entire document to determine
    the correct filler.
Wrapper induction

    Highly regular         Writing accurate patterns for
  source documents          each slot for each domain

          
                            (e.g. each web site) requires
                            laborious software
                            engineering.
   Relatively simple       Alternative is to use machine
  extraction patterns       learning:


          
                               Build a training set of
                                documents paired with
                                human-produced filled
                                extraction templates.
       Efficient               Learn extraction patterns for
  learning algorithm            each slot using an
                                appropriate machine learning
                                algorithm.
Learning for IE
   Writing accurate patterns for each slot for each domain (e.g. each web site)
    requires laborious software engineering.
   Alternative is to use machine learning:
        Build a training set of documents paired with human-produced
         filled extraction templates.
        Learn extraction patterns for each slot using an appropriate
         machine learning algorithm.
Wrapper induction:
Delimiter-based extraction

               <HTML><TITLE>Some Country Codes</TITLE>
               <B>Congo</B> <I>242</I><BR>
               <B>Egypt</B> <I>20</I><BR>
               <B>Belize</B> <I>501</I><BR>
               <B>Spain</B> <I>34</I><BR>
               </BODY></HTML>




      Use <B>, </B>, <I>, </I> for extraction
  Learning LR wrappers

          labeled pages                               wrapper
<HTML><HEAD>Some Country Codes</HEAD>
<B>Congo</B> <I>242</I><BR> Codes</HEAD>
  <HTML><HEAD>Some Country
<B>Egypt</B> <I>20</I><BR>
  <B>Congo</B> <I>242</I><BR> Codes</HEAD>
    <HTML><HEAD>Some Country
<B>Belize</B> <I>501</I><BR>
  <B>Egypt</B> <I>20</I><BR>
    <B>Congo</B> <I>242</I><BR> Codes</HEAD>
      <HTML><HEAD>Some Country
<B>Spain</B> <I>34</I><BR>
  <B>Belize</B> <I>501</I><BR>
    <B>Egypt</B> <I>20</I><BR>
      <B>Congo</B> <I>242</I><BR>
</BODY></HTML><I>501</I><BR>
  <B>Spain</B> <I>34</I><BR>
    <B>Belize</B> <I>20</I><BR>
      <B>Egypt</B>
                                               l1, r1, …, lK, rK
  </BODY></HTML><I>501</I><BR>
    <B>Spain</B> <I>34</I><BR>
      <B>Belize</B>
    </BODY></HTML>
      <B>Spain</B> <I>34</I><BR>
      </BODY></HTML>



                            Example: Find 4 strings
                             <B>, </B>, <I>, </I>
                              l1 ,   r1 ,   l2 ,   r2 
LR: Finding r1

  <HTML><TITLE>Some Country Codes</TITLE>
  <B>Congo</B> <I>242</I><BR>
  <B>Egypt</B> <I>20</I><BR>
  <B>Belize</B> <I>501</I><BR>
  <B>Spain</B> <I>34</I><BR>
  </BODY></HTML>
LR: Finding l1, l2 and r2
 <HTML><TITLE>Some Country Codes</TITLE>
 <B>Congo</B> <I>242</I><BR>
 <B>Egypt</B> <I>20</I><BR>
 <B>Belize</B> <I>501</I><BR>
 <B>Spain</B> <I>34</I><BR>
 </BODY></HTML>
A problem with LR wrappers

      Distracting text in head and tail
  <HTML><TITLE>Some Country Codes</TITLE>
   <BODY><B>Some Country Codes</B><P>
   <B>Congo</B> <I>242</I><BR>
   <B>Egypt</B> <I>20</I><BR>
   <B>Belize</B> <I>501</I><BR>
   <B>Spain</B> <I>34</I><BR>
   <HR><B>End</B></BODY></HTML>
One (of many) solutions: HLRT

   Ignore page’s head and tail

 <HTML><TITLE>Some Country Codes</TITLE>
 <BODY><B>Some Country Codes</B><P>
 <B>Congo</B> <I>242</I><BR>
                                           }   head

 <B>Egypt</B> <I>20</I><BR>
 <B>Belize</B> <I>501</I><BR>
 <B>Spain</B> <I>34</I><BR>
 <HR><B>End</B></BODY></HTML>
                                           }   body

                                           } tail


              Head-Left-Right-Tail wrappers
More sophisticated wrappers
   LR and HLRT wrappers are extremely simple
      Though applicable to many tabular patterns
   Recent wrapper induction research has explored more expressive
    wrapper classes [Muslea et al, Agents-98; Hsu et al, JIS-98;
    Kushmerick, AAAI-1999; Cohen, AAAI-1999; Minton et al, AAAI-
    2000]
      Disjunctive delimiters
      Multiple attribute orderings
      Missing attributes
      Multiple-valued attributes
      Hierarchically nested data
      Wrapper verification and maintenance
Boosted wrapper induction

   Wrapper induction is only ideal for rigidly-structured
    machine-generated HTML…
   … or is it?!
   Can we use simple patterns to extract from natural
    language documents?

                   … Name: Dr. Jeffrey D. Hermes …
                    … Who: Professor Manfred Paul …
        ... will be given by Dr. R. J. Pangborn …
              … Ms. Scott will be speaking …
           … Karen Shriver, Dept. of ...
           … Maria Klawe, University of ...
BWI: The basic idea

   Learn “wrapper-like” patterns for texts
          pattern = exact token sequence
   Learn many such “weak” patterns
   Combine with boosting to build “strong” ensemble
    pattern
       Boosting is a popular recent machine learning method where
        many weak learners are combined
   Demo: http://www.smi.ucd.ie/bwi
   Not all natural text is sufficiently regular for exact string
    matching to work well!!
Natural Language Processing-based
Information Extraction

    If extracting from automatically generated web pages, simple
     regex patterns usually work.
    If extracting from more natural, unstructured, human-written text,
     some NLP may help.
        Part-of-speech (POS) tagging
              Mark each word as a noun, verb, preposition, etc.
         Syntactic parsing
              Identify phrases: NP, VP, PP
         Semantic word categories (e.g. from WordNet)
              KILL: kill, murder, assassinate, strangle, suffocate
    Extraction patterns can use POS or phrase tags.
       Crime victim:
              Prefiller: [POS: V, Hypernym: KILL]
              Filler: [Phrase: NP]
Stalker: A wrapper induction system
(Muslea et al. Agents-99)

E1: 513 Pico, <b>Venice</b>, Phone 1-<b>800</b>-555-1515
E2: 90 Colfax, <b>Palms</b>, Phone (800) 508-1570
E3:      523 1st St., <b>LA</b>, Phone 1-<b>800</b>-578-2293
E4:      403 La Tijera, <b>Watts</b>, Phone: (310) 798-0008

We want to extract area code.
   Start rules:
    R1: SkipTo(()
    R2: SkipTo(-<b>)
   End rules:
    R3: SkipTo())
    R4: SkipTo(</b>)
Learning extraction rules

   Stalker uses sequential covering to learn extraction rules
    for each target item.
     In each iteration, it learns a perfect rule that covers as
       many positive items as possible without covering any
       negative items.
     Once a positive item is covered by a rule, the whole
       example is removed.
     The algorithm ends when all the positive items are
       covered. The result is an ordered list of all learned
       rules.
Rule induction through an example

Training examples:
    E1:    513 Pico, <b>Venice</b>, Phone 1-<b>800</b>-555-1515
    E2:    90 Colfax, <b>Palms</b>, Phone (800) 508-1570
    E3:    523 1st St., <b>LA</b>, Phone 1-<b>800</b>-578-2293
    E4:    403 La Tijera, <b>Watts</b>, Phone: (310) 798-0008

We learn start rule for area code.
 Assume the algorithm starts with E2. It creates three initial candidate
  rules with first prefix symbol and two wildcards:
         R1: SkipTo(()
         R2: SkipTo(Punctuation)
         R3: SkipTo(Anything)
   R1 is perfect. It covers two positive examples but no negative
    example.
Rule induction (cont …)
    E1:   513 Pico, <b>Venice</b>, Phone 1-<b>800</b>-555-1515
    E2:   90 Colfax, <b>Palms</b>, Phone (800) 508-1570
    E3:   523 1st St., <b>LA</b>, Phone 1-<b>800</b>-578-2293
    E4:   403 La Tijera, <b>Watts</b>, Phone: (310) 798-0008
   R1 covers E2 and E4, which are removed. E1 and E3 need
    additional rules.
   Three candidates are created:
     R4: SkiptTo(<b>)
     R5: SkipTo(HtmlTag)
     R6: SkipTo(Anything)
   None is good. Refinement is needed.
   Stalker chooses R4 to refine, i.e., to add additional symbols, to
    specialize it.
   It will find R7: SkipTo(-<b>), which is perfect.
Limitations of Supervised Learning
   Manual Labeling is labor intensive and time consuming,
    especially if one wants to extract data from a huge
    number of sites.
   Wrapper maintenance is very costly:
     If Web sites change frequently
     It is necessary to detect when a wrapper stops to work
      properly.
     Any change may make existing extraction rules
      invalid.
     Re-learning is needed, and most likely manual re-
      labeling as well.
Road map
    Structured data extraction
      Wrapper induction
      Automatic extraction
    Information integration
    Summary
The RoadRunner System
(Crescenzi et al. VLDB-01)

 Given a set of positive examples (multiple sample pages). Each
  contains one or more data records.
 From these pages, generate a wrapper as a union-free regular
  expression (i.e., no disjunction).
The approach
 To start, a sample page is taken as the wrapper.
 The wrapper is then refined by solving mismatches between the
  wrapper and each sample page, which generalizes the wrapper.
Compare with wrapper induction
   No manual labeling, but need a set of positive pages of the same
    template
     which is not necessary for a page with multiple data records
   not wrapper for data records, but pages.
     A Web page can have many pieces of irrelevant information.


Issues of automatic extraction
 Hard to handle disjunctions
 Hard to generate attribute names for the extracted data.
 extracted data from multiple sites need integration, manual or
   automatic.
The DEPTA system (Zhai & Liu WWW-
05)

Data
region1


A data
record


A data
record




Data
region2
Align and extract data items (e.g.,
region1)
image1   EN7410 17-                 $299.99                Add to   (Delivery /   Penny      Compare
         inch LCD                                          Cart     Pick-Up )     Shopping
         Monitor
         Black/Dark
         charcoal


image2   17-inch LCD                $249.99                Add to   (Delivery /   Penny      Compare
         Monitor                                           Cart     Pick-Up )     Shopping




image3   AL1714 17-                 $269.99                Add to   (Delivery /   Penny      Compare
         inch LCD                                          Cart     Pick-Up )     Shopping
         Monitor, Black




image4   SyncMaster       Was:      $299.99   Save $70     Add to   (Delivery /   Penny      Compare
         712n 17-inch     $369.99             After: $70   Cart     Pick-Up )     Shopping
         LCD Monitor,                         mail-in-
         Black                                rebate(s)
1. Mining Data Records
(Liu et al, KDD-03; Zhai and Liu, WWW-05)
   Given a single page with multiple data records (a list
    page), it extracts data records.
   The algorithm is based on
      two observations about data records in a Web page
      a string matching algorithm (tree matching ok too)
   Considered both
      contiguous
      non-contiguous data records
The Approach
Given a page, three steps:
 Building the HTML Tag Tree
   Erroneous tags, unbalanced tags, etc
   Some problems are hard to fix
 Mining Data Regions
   Spring matching or tree matching
 Identifying Data Records


Rendering (or visual) information is very useful in the whole
  process
  Building tree based on visual cues
                                  left         right        top        bottom
1 <table>                         100          300          200        400
2       <tr>                      100          300          200        300
3               <td> … </td>      100          200          200        300
4               <td> … </td>      200          300          200        300
5       </tr>
6       <tr>                      100          300          300        400
7               <td> … </td>      100          200          300        400
8               <td> … </td>      200          300          300        400
9       </tr>
10</table>                                      table
                           The tag tree
                                          tr                tr

                                   td          td      td         td
Mining Data Regions
                               1


                2                3           4



  5    6    7       8   9      10


                                       11         12
 Region 1           Region 2

                            13       14 15       16    17   18   19   20


                                                 Region 3
Identify Data Records
                                        Name 1           Name 2
                                      Description of   Description of
                                        object 1         object 2
                                        Name 3           Name 4
   A generalized node may not be     Description of   Description of
    a data record.                      object 3         object 4
   Extra mechanisms are needed
    to identify true atomic objects
    (see the papers).
                                        Name 1           Name 2
   Some highlights:
     Contiguous                      Description of   Description of
                                        object 1         object 2
     non-contiguous data records.
                                        Name 3           Name 4
                                      Description of   Description of
                                        object 3         object 4
2. Extract Data from Data Records

   Once a list of data records are identified, we can align
    and extract data items in them.
   Approaches (align multiple data records):
     Multiple string alignment
           Many ambiguities due to pervasive use of table related tags.
       Multiple tree alignment (partial tree alignment)
           Together with visual information is effective
   Most multiple alignment methods work like hierarchical
    clustering,
     Not effective, and very expensive
Tree Matching (tree edit distance)
   Intuitively, in the mapping
     each node can appear no more than once in a mapping,
     the order between sibling nodes are preserved, and
     the hierarchical relation between nodes are also preserved.


         A                                        B
              p                               p



          a       b                       e       h


              c       d               a           c   d
The Partial Tree Alignment approach

   Choose a seed tree: A seed tree, denoted by Ts, is picked with the
    maximum number of data items.
   Tree matching:
    For each unmatched tree Ti (i ≠ s),
     match Ts and Ti.
     Each pair of matched nodes are linked (aligned).
     For each unmatched node nj in Ti do
         expand Ts by inserting nj into Ts if a position for insertion can be
          uniquely determined in Ts.
    The expanded seed tree Ts is then used in subsequent matching.
Illustration of partial tree alignment
    Ts      p                                  Ti       p


a            b            e               b         c       d   e

         New part of Ts                                     Insertion is possible
                                  p


                 a        b   c       d        e


                                          Ts            p                      Ti   p
    Insertion is not
       possible                                                                     x
                                          a             b       e          a            e
             Ts = T1 p                T2     p                      T3   p
A complete
               … x b          d            n c   k g                b c d h k
 example                              b

                         Ts       p
                                      No node inserted

                         … x b d

                                           New Ts    p     c, h, and k inserted
                                             …
      T2 is matched again                           x b   c d   h    k

                   T2     p

                 b n      c k g

                                             p

                                      … x b n c d h k g
Output Data Table
         …   x    b   n    c    d   h      k   g

    T1   …   1    1             1

    T2            1   1    1               1   1

    T3            1        1    1   1      1



   DEPTA does not work with nested data
    records.
   NET (Liu & Zhai, WISE-05)extracts data from
    both flat and nested data records.
Some other systems and techniques
   IEPAD (Chang & Lui WWW-01), DeLa (Wang & Lochovsky WWW-
    03)
     These systems treat a page as a long string, and find repeated
       substring patterns.
     They often produce multiple patterns (rules). Hard to decide
       which is correct.
   EXALG(Arasu & Garcia-Molina SIGMOD-03), (Lerman et al,
    SIGMOD-04).
     Require multiple pages to find patterns.
     Which is not necessary for pages with multiple records.
   (Zhao et al, WWW-04)
     It extracts data records in one area of a page.
Limitations and issues

   Not for a page with only a single data record
   Does not generate attribute names for the extracted data (yet!)
   extracted data from multiple sites need integration.
   It is possible in each specific application domain, e.g.,
     products sold online.
           need “product name”, “image”, and “price”.
           identify only these three fields may not be too hard.
       Job postings, publications, etc …
Road map

    Structured data extraction
      Wrapper induction
      Automatic extraction
    Information integration
    Summary
Web query interface integration

   Many integration tasks,
     Integrating Web query interfaces (search forms)
     Integrating extracted data
     Integrating textual information
     Integrating ontologies (taxonomy)
     …
   We only introduce integration of query interfaces.
     Many web sites provide forms to query deep web
     Applications: meta-search and meta-query
Global Query Interface




united.com   airtravel.com   delta.com   hotwire.com
Synonym Discovery (He and Chang, KDD-04)
   Discover synonym attributes
         Author – Writer, Subject – Category
          S1:                                       S1:
          author    S2:                             author    S2:
                    writer     S3:                            writer     S3:
          title                                     title
                    title      name                           title      name
          subject                                   subject
                    category   title                          category   title
          ISBN                                      ISBN
                    format     keyword                        format     keyword
                               binding
                                             V.S.                        binding


     Holistic Model Discovery                        Pairwise Attribute
                                                      Correspondence

                                                       S1.author  S3.name
author    writer name subject     category             S1.subject  S2.category
 Schema matching as correlation
 mining
Across many sources:
   Synonym attributes are negatively correlated
     synonym attributes are semantically alternatives.
     thus, rarely co-occur in query interfaces
   Grouping attributes with positive correlation
     grouping attributes semantically complement
     thus, often co-occur in query interfaces
1. Positive correlation mining as potential groups

               Mining positive correlations

                                                          Last Name, First Name

2. Negative correlation mining as potential matchings

                                                                 Author =
              Mining negative correlations                {Last Name, First Name}




3. Matching selection as model construction
                                                        Author (any) =
                                                        {Last Name, First Name}
                                                        Subject = Category




                                                        Format = Binding
    A clustering approach to schema
    matching (Wu et al. SIGMOD-04)

   1:1 mapping by clustering
   Bridging effect
      “a2” and “c2” might not look
        similar themselves but they
        might both be similar to “b3”
   1:m mappings
      Aggregate and is-a types
   User interaction helps in:            X
      learning of matching thresholds
      resolution of uncertain mappings
 Find 1:1 Mappings via Clustering
Interfaces:                           Initial similarity matrix:




After one merge:


                                                  Similarity functions
                                                    linguistic similarity
                                                    domain similarity




 …, final clusters:
                      {{a1,b1,c1}, {b2,c2},{a2},{b3}}
Find 1:m Complex Mappings




Aggregate type – contents of fields on the many side are part of
      the content of field on the one side


Commonalities – (1) field proximity, (2) parent label similarity,
and (3) value characteristics
Complex Mappings (Cont’d)




 Is-a type – contents of fields on the many side are sum/union of
         the content of field on the one side


Commonalities – (1) field proximity, (2) parent label similarity,
and (3) value characteristics
Instance-based matching via query
probing (Wang et al. VLDB-04)
   Both query interfaces and returned results (called
    instances) are considered in matching. It assumes
     a global schema (GS) is given and
     a set of instances are also given.
   Uses each instance value (V) in GS to probe the
    underlying database to obtain the count of V appeared in
    the returned results.
     These counts are used to help matching.
Query interface and result page
                           Title

                          Author

                         Publisher

                       Publish Date

                           ISBN

                          Format

                            …


                     Data Attributes




                                       Result Page
  Search Interface
Road map
    Structured data extraction
      Wrapper Induction
      Automatic extraction
    Information integration
    Summary
Summary
   Give an overview of two topics
     Structured data extraction
     Information integration
   Some technologies are ready for industrial exploitation,
    e.g., data extraction.
   Simple integration is do-able, complex integration still
    needs further research.

				
DOCUMENT INFO
Shared By:
Categories:
Tags:
Stats:
views:5
posted:3/4/2012
language:
pages:77