					Information Extraction


        PengBo
      Nov 12, 2007
                           Recap

   Time Complexity of K-Means Algorithm?
       Computing distance between two docs is O(m) where
        m is the dimensionality of the vectors.
       Reassigning clusters: O(Kn) distance computations, or
        O(Knm).
       Computing centroids: Each doc gets added once to
        some centroid: O(nm).
       Assume these two steps are each done once per iteration, for I
        iterations: O(IKnm).
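
The operation count in this recap can be checked with a minimal K-means sketch. The function names, the toy data, and the fixed iteration count are assumptions made for illustration; the sketch only counts distance computations and is not a tuned implementation.

```python
import random


def distance_sq(a, b):
    """Squared Euclidean distance: one pass over the m dimensions, O(m)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(docs, k, iterations, seed=0):
    """Run a fixed number of iterations and count distance computations."""
    rng = random.Random(seed)
    centroids = [list(d) for d in rng.sample(docs, k)]
    m = len(docs[0])
    n_distances = 0
    assignment = [0] * len(docs)
    for _ in range(iterations):
        # Reassignment: K distance computations per doc, O(Knm) in total.
        for i, doc in enumerate(docs):
            scores = [distance_sq(doc, c) for c in centroids]
            n_distances += k
            assignment[i] = scores.index(min(scores))
        # Centroid update: each doc is added to exactly one centroid, O(nm).
        sums = [[0.0] * m for _ in range(k)]
        counts = [0] * k
        for doc, j in zip(docs, assignment):
            counts[j] += 1
            for d in range(m):
                sums[j][d] += doc[d]
        for j in range(k):
            if counts[j]:  # keep the old centroid if a cluster is empty
                centroids[j] = [s / counts[j] for s in sums[j]]
    return assignment, n_distances


# Hypothetical toy data: n = 6 docs, m = 2 dimensions.
docs = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (9.0, 0.0), (9.0, 1.0)]
assignment, n_distances = kmeans(docs, k=2, iterations=3)
print(n_distances)  # I * K * n = 3 * 2 * 6 = 36
```

Each of the 36 distance computations costs O(m), which gives the O(IKnm) total.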
                 Topics of today

   Information Extraction
   Wrapper Induction
Information Extraction
      from the
   World Wide Web


      William W. Cohen
        Carnegie Mellon University
      Andrew McCallum
   University of Massachusetts Amherst
                 KDD 2003
Example: The Problem




               Martin Baker, a person

               Genomics job

               Employer's job posting form
Example: A Solution
Extracting Job Openings from the Web

                    foodscience.com-Job2

                    JobTitle: Ice Cream Guru
                    Employer: foodscience.com
                    JobCategory: Travel/Hospitality
                    JobFunction: Food Services
                    JobLocation: Upper Midwest
                    Contact Phone: 800-488-2611
                    DateExtracted: January 8, 2001
                    Source: www.foodscience.com/jobs_midwest.htm
                    OtherCompanyJobs: foodscience.com-Job1
Job Openings:
              Category = Food Services
              Keyword = Baker
              Location = Continental U.S.
Data Mining the Extracted Job Information
IE from Research Papers
       Chinese Documents regarding
                 Weather

Chinese Academy of Sciences




                              200k+ documents
                              several millennia old

                              - Qing Dynasty Archives
                              - memos
                              - newspaper articles
                              - diaries
                               IE from SEC Filings
Data mine these reports for
   - suspicious behavior,
   - to better understand what is normal.

This filing covers the period from December 1996 to September 1997.


                        ENRON GLOBAL POWER & PIPELINES L.L.C.
                            CONSOLIDATED BALANCE SHEETS
                         (IN THOUSANDS, EXCEPT SHARE AMOUNTS)
                                               SEPTEMBER 30,    DECEMBER 31,
                                                   1997            1996
                                               -------------    ------------
ASSETS
                                                (UNAUDITED)
Current Assets
   Cash and cash equivalents                        $ 54,262      $ 24,582
   Accounts receivable                                 8,473         6,301
   Current portion of notes receivable                 1,470         1,394
   Other current assets                                  336           404
                                                    --------      --------
     Total Current Assets                             71,730        32,681
                                                    --------      --------
Investments in to Unconsolidated Subsidiaries        286,340       298,530
Notes Receivable                                      16,059        12,111
                                                    --------      --------
      Total Assets                                  $374,408      $343,843
                                                    ========      ========
LIABILITIES AND SHAREHOLDERS' EQUITY
Current Liabilities
   Accounts payable                                 $ 13,461      $ 11,277
   Accrued taxes                                       1,910         1,488
                                                    --------      --------
     Total Current Liabilities                        15,371        49,348
                                                    --------      --------
Deferred Income Taxes                                    525         4,301

The U.S. energy markets in 1997 were subject to significant fluctuation
           What is “Information Extraction”

       As a task:                Filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill
Gates railed against the economic philosophy
of open-source software with Orwellian fervor,
denouncing its communal licensing as a
"cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source
concept, by which software code is
made public to encourage improvement and
development by outside programmers. Gates
himself says Microsoft will gladly disclose its
crown jewels--the coveted code behind the
Windows operating system--to select
customers.

"We can be open source. We love the concept
of shared source," said Bill Veghte, a
Microsoft VP. "That's a super-important shift
for us in terms of code access."

Richard Stallman, founder of the Free
Software Foundation, countered saying…

Database slots to fill:    NAME    TITLE    ORGANIZATION
           What is “Information Extraction”

       As a task:                Filling slots in a database from sub-segments of text.

Running IE over the article above fills the slots:

       NAME               TITLE     ORGANIZATION
       Bill Gates         CEO       Microsoft
       Bill Veghte        VP        Microsoft
       Richard Stallman   founder   Free Soft..
           What is “Information Extraction”
    As a family of techniques:
        Information Extraction = segmentation + classification + association + clustering

Segments found in the article above, in order of appearance:

       Microsoft Corporation
       CEO
       Bill Gates
       Microsoft
       Gates
       Microsoft
       Bill Veghte
       Microsoft
       VP
       Richard Stallman
       founder
       Free Software Foundation
           What is “Information Extraction”
    As a family of techniques:
        Information Extraction = segmentation + classification + association + clustering

Segments found in the article above, with the starred mentions marked as on the slide:

     * Microsoft Corporation
       CEO
       Bill Gates
     * Microsoft
       Gates
     * Microsoft
       Bill Veghte
     * Microsoft
       VP
       Richard Stallman
       founder
       Free Software Foundation
                                    IE in Context

[Diagram; recoverable box labels:]
   Create ontology
   Document collection
   Spider
   Filter by relevance
   IE: Segment, Classify, Associate, Cluster
   Label training data
   Train extraction models
   Load DB
   Database
   Query, Search
   Data mine
              Why IE from the Web?
   Science
       Grand old dream of AI: Build large KB* and reason with it.
        IE from the Web enables the creation of this KB.
       IE from the Web is a complex problem that inspires new
        advances in machine learning.
   Profit
       Many companies interested in leveraging data currently
        “locked in unstructured text on the Web”.
       Not yet a monopolistic winner in this space.
   Fun!
       Build tools that we researchers like to use ourselves:
        Cora & CiteSeer, MRQE.com, FAQFinder,…
       See our work get used by the general public.

                                        * KB = “Knowledge Base”
                           IE History
Pre-Web
 Mostly news articles
       De Jong's FRUMP [1982]
           Hand-built system to fill Schank-style “scripts” from news wire

       Message Understanding Conference (MUC) DARPA ['87-'95], TIPSTER
        ['92-'96]
   Most early work dominated by hand-built models
       E.g. SRI's FASTUS, hand-built FSMs.
       But by the 1990s, some machine learning: Lehnert, Cardie, Grishman and
        then HMMs: Elkan [Leek '97], BBN [Bikel et al. '98]
Web
 AAAI '94 Spring Symposium on “Software Agents”
       Much discussion of ML applied to Web. Maes, Mitchell, Etzioni.
   Tom Mitchell's WebKB, '96
       Build KBs from the Web.
   Wrapper Induction
       Initially hand-built, then ML: [Soderland '96], [Kushmerick '97],…
                    What makes IE from the Web
                            Different?
                             Less grammar, but more formatting & linking

Newswire:

   Apple to Open Its First Retail Store
            in New York City

MACWORLD EXPO, NEW YORK--July 17, 2002--
Apple's first retail store in New York City will open in
Manhattan's SoHo district on Thursday, July 18 at
8:00 a.m. EDT. The SoHo store will be Apple's
largest retail store to date and is a stunning example
of Apple's commitment to offering customers the
world's best computer shopping experience.

"Fourteen months after opening our first retail store,
our 31 stores are attracting over 100,000 visitors
each week," said Steve Jobs, Apple's CEO. "We
hope our SoHo store will surprise and delight both
Mac and PC users who want to see everything the
Mac can do to enhance their digital lifestyles."

Web:

   www.apple.com/retail
   www.apple.com/retail/soho
   www.apple.com/retail/soho/theatre.html

The directory structure, link structure,
formatting & layout of the Web are its own
new grammar.
            Landscape of IE Tasks (1/4):
              Pattern Feature Domain
   Text paragraphs without formatting

Astro Teller is the CEO and co-founder of
BodyMedia. Astro holds a Ph.D. in Artificial
Intelligence from Carnegie Mellon University,
where he was inducted as a national Hertz fellow.
His M.S. in symbolic and heuristic computation
and B.S. in computer science are from Stanford
University. His work in science, literature and
business has appeared in international media from
the New York Times to CNN to NPR.

   Grammatical sentences and some formatting & links

   Non-grammatical snippets, rich formatting & links

   Tables
          Landscape of IE Tasks (2/4):
                Pattern Scope

   Web site specific    Genre specific   Wide, non-specific
   Formatting                Layout           Language
Amazon.com Book Pages       Resumes         University Names
             Landscape of IE Tasks (3/4):
                 Pattern Complexity
E.g. word patterns:

   Closed set: U.S. states
        He was born in Alabama…
        The big Wyoming sky…

   Regular set: U.S. phone numbers
        Phone: (413) 545-1323
        The CALD main office can be reached at 412-268-1299

   Complex pattern: U.S. postal addresses
        University of Arkansas
        P.O. Box 140
        Hope, AR 71802

        Headquarters:
        1128 Main Street, 4th Floor
        Cincinnati, Ohio 45210

   Ambiguous patterns, needing context and many sources of evidence: Person names
        …was among the six houses sold by Hope Feldman that year.
        Pawel Opalinski, Software Engineer at WhizBang Labs.
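
The two simplest cases on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions: the state list is truncated to four entries, the tokenizer ignores multi-word states such as "New York", and the phone regex is a simplified pattern written for these examples, not a complete grammar of U.S. phone numbers.

```python
import re

# Closed set: membership in a finite list (only a few states shown here).
US_STATES = {"Alabama", "Alaska", "Wisconsin", "Wyoming"}


def find_states(text):
    """Return every single-word token that is a member of the closed set."""
    return [tok for tok in re.findall(r"[A-Za-z]+", text) if tok in US_STATES]


# Regular set: optional parentheses around the area code, then NNN-NNNN.
PHONE = re.compile(r"\(?\d{3}\)?[\s-]\d{3}-\d{4}")


def find_phones(text):
    """Return every substring matching the simplified phone pattern."""
    return PHONE.findall(text)


print(find_states("He was born in Alabama. The big Wyoming sky."))
# ['Alabama', 'Wyoming']
print(find_phones("Phone: (413) 545-1323"))
# ['(413) 545-1323']
print(find_phones("The CALD main office can be reached at 412-268-1299"))
# ['412-268-1299']
```

Postal addresses and person names do not reduce to a list or a single regex in the same way, which is why the slide places them in the complex and ambiguous categories.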
           Landscape of IE Tasks (4/4):
              Pattern Combinations

  Jack Welch will retire as CEO of General Electric tomorrow. The top role
  at the Connecticut company will be filled by Jeffrey Immelt.



   Single entity (“named entity” extraction)
        Person:   Jack Welch
        Person:   Jeffrey Immelt
        Location: Connecticut

   Binary relationship
        Relation: Person-Title
        Person:   Jack Welch
        Title:    CEO

        Relation: Company-Location
        Company:  General Electric
        Location: Connecticut

   N-ary record
        Relation: Succession
        Company:  General Electric
        Title:    CEO
        Out:      Jack Welch
        In:       Jeffrey Immelt
                     Evaluation of Single Entity
                             Extraction
TRUTH:
 Michael Kearns and Sebastian Seung will start Monday’s tutorial, followed by Richard M. Karpe and Martin Cooke.
PRED:
 Michael Kearns and Sebastian Seung will start Monday’s tutorial, followed by Richard M. Karpe and Martin Cooke.



  Precision = (# correctly predicted segments) / (# predicted segments) = 2 / 6

  Recall    = (# correctly predicted segments) / (# true segments)      = 2 / 4

  F1        = Harmonic mean of Precision & Recall = 1 / (((1/P) + (1/R)) / 2)
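
The three measures can be computed from sets of segments, as in the sketch below. The highlighting that marked the segments on the slide is lost, so the spans here are assumptions: the four true segments are taken to be the four person names, and the six predicted segments are hypothetical, chosen only so that two of them match exactly. Segments are (start, end) token offsets, end exclusive, over the whitespace-tokenized sentence.

```python
def evaluate(true_segments, predicted_segments):
    """Exact-match precision, recall and F1 over sets of (start, end) spans."""
    true_set, pred_set = set(true_segments), set(predicted_segments)
    correct = len(true_set & pred_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(true_set) if true_set else 0.0
    # Harmonic mean: 1 / (((1/P) + (1/R)) / 2) = 2PR / (P + R).
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Assumed truth: Michael Kearns, Sebastian Seung, Richard M. Karpe, Martin Cooke.
truth = [(0, 2), (3, 5), (11, 14), (15, 17)]
# Hypothetical prediction: two exact matches plus four wrong spans.
pred = [(0, 2), (3, 5), (7, 8), (11, 12), (12, 14), (16, 17)]

p, r, f1 = evaluate(truth, pred)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.333 0.5 0.4
```

A predicted segment counts as correct only when both boundaries match, so a partial match such as "Cooke" for "Martin Cooke" lowers precision without raising recall.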
          State of the Art Performance

   Named entity recognition
       Person, Location, Organization, …
       F1 in high 80's or low- to mid-90's
   Binary relation extraction
       Contained-in (Location1, Location2)
        Member-of (Person1, Organization1)
       F1 in 60's or 70's or 80's
   Wrapper induction
       Extremely accurate performance obtainable
       Human effort (~30min) required on each site
              Landscape of IE Techniques (1/1):
                           Models

  Running example for every model: Abraham Lincoln was born in Kentucky.

  Lexicons
      Is the candidate a member of a list? (Alabama, Alaska, …, Wisconsin, Wyoming)
  Classify Pre-segmented Candidates
      Classifier: which class?
  Sliding Window
      Classifier: which class? Try alternate window sizes.
  Boundary Models
      Classifier: which class? (BEGIN, END, BEGIN, END)
  Finite State Machines
      Most likely state sequence?
  Context Free Grammars
      Parse tree over NNP NNP V V P NP, with PP, VP, NP, VP and S nodes above.
  …and beyond
 Any of these models can be used to capture words, formatting or both.
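
As one illustration of these models, the sliding-window idea can be sketched as follows. The classifier here is a hypothetical stand-in (a hand-written capitalization test), not a trained model; only the window enumeration reflects the technique named on the slide.

```python
import re


def sliding_windows(n_tokens, max_size):
    """Yield (start, end) for every window of 1..max_size tokens."""
    for size in range(1, max_size + 1):
        for start in range(n_tokens - size + 1):
            yield start, start + size


def toy_classifier(tokens, start, end):
    """Hypothetical stand-in for a trained classifier.

    Accepts a window when all of its tokens are capitalized and the
    neighbouring tokens (if any) are not.
    """
    def cap(tok):
        return tok[0].isupper()

    inside = all(cap(t) for t in tokens[start:end])
    left_ok = start == 0 or not cap(tokens[start - 1])
    right_ok = end == len(tokens) or not cap(tokens[end])
    return "CAPITALIZED" if inside and left_ok and right_ok else None


tokens = re.findall(r"\w+", "Abraham Lincoln was born in Kentucky.")
results = []
for start, end in sliding_windows(len(tokens), max_size=3):
    label = toy_classifier(tokens, start, end)
    if label:
        results.append((" ".join(tokens[start:end]), label))

print(results)
# [('Kentucky', 'CAPITALIZED'), ('Abraham Lincoln', 'CAPITALIZED')]
```

A real system would replace `toy_classifier` with a learned model over window and context features, and would still need a policy for overlapping windows that are both accepted.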
                         Landscape:


   Pattern complexity    closed set      regular       complex       ambiguous


Pattern feature domain   words        words + formatting         formatting


        Pattern scope    site-specific        genre-specific     general


 Pattern combinations    entity       binary        n-ary


               Models    lexicon      regex     window      boundary       FSM   CFG
Wrapper Induction
                       “Wrappers”

   If we think of things from the database point of
    view
       We want to be able to run database-style queries
       But we have data in some horrid textual form/content
        management system that doesn’t allow such querying
       We need to “wrap” the data in a component that
        understands database-style querying
          Hence the term “wrappers”
   Title: Schulz and Peanuts: A Biography
   Author: David Michaelis
   List Price: $34.95
                  Wrappers:
          Simple Extraction Patterns

   Specify an item to extract for a slot using a regular
    expression pattern.
       Price pattern: “\$\d+(\.\d{2})?\b”
   May require preceding (pre-filler) pattern to identify
    proper context.
       Amazon list price:
          Pre-filler pattern: “<b>List Price:</b> <span class=listprice>”

          Filler pattern: “\$\d+(\.\d{2})?\b”


   May require succeeding (post-filler) pattern to
    identify the end of the filler.
       Amazon list price:
          Pre-filler pattern: “<b>List Price:</b> <span class=listprice>”

          Filler pattern: “.+”

          Post-filler pattern: “</span>”
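
The pre-filler, filler and post-filler patterns above can be combined into single regular expressions, as in this sketch. The HTML snippet is a hypothetical fragment written for illustration (the "Our Price" element and its value are invented), and the generic filler uses the non-greedy ".+?" rather than the slide's ".+" so that it stops at the first post-filler.

```python
import re

# Literal context strings from the slide, escaped so they match verbatim.
PRE_FILLER = re.escape("<b>List Price:</b> <span class=listprice>")
POST_FILLER = re.escape("</span>")

# The filler pattern on its own matches every price on the page.
BARE_PRICE = re.compile(r"\$\d+(?:\.\d{2})?\b")
# Variant 1: pre-filler followed by a specific filler pattern.
LIST_PRICE = re.compile(PRE_FILLER + r"(\$\d+(?:\.\d{2})?)\b")
# Variant 2: pre-filler, generic filler, post-filler.
LIST_PRICE_SPAN = re.compile(PRE_FILLER + r"(.+?)" + POST_FILLER)

# Hypothetical page fragment with two prices.
html = (
    "<b>Our Price:</b> <span class=ourprice>$23.07</span> "
    "<b>List Price:</b> <span class=listprice>$34.95</span>"
)

print(BARE_PRICE.findall(html))               # ['$23.07', '$34.95']
print(LIST_PRICE.search(html).group(1))       # $34.95
print(LIST_PRICE_SPAN.search(html).group(1))  # $34.95
```

The bare filler pattern cannot tell the two prices apart; the pre-filler supplies the context that selects the list price.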
                  Wrapper tool-kits

   Wrapper toolkits
       Specialized programming environments for writing &
        debugging wrappers by hand
   Some Resources
       Wrapper Development Tools
       LAPIS
                    Wrapper Induction

   Problem description:
   Web sources present data in human-readable format
       take user query
       apply it to database
       present results in “template” HTML page
   To integrate data from multiple sources, one must first
    extract relevant information from Web pages
   Task: learn extraction rules based on labeled examples
        Hand-writing rules is tedious, error-prone, and time-consuming
       Learning wrappers is wrapper induction
           Wrapper induction

  Highly regular source documents
  Relatively simple extraction patterns
  Efficient learning algorithm

  Writing accurate patterns for each slot for each domain (e.g. each web
  site) requires laborious software engineering.

  Alternative is to use machine learning:
      Build a training set of documents paired with human-produced filled
       extraction templates.
      Learn extraction patterns for each slot using an appropriate machine
       learning algorithm.
          User gives the first K positive examples,
          and thus many implicit negative
          examples, to the Learner
           Kushmerick’s WIEN system

   Earliest wrapper-learning system (published
    IJCAI '97)
   Special things about WIEN:
       Treats document as a string of characters
       Learns to extract a relation directly, rather than
        extracting fields, then associating them together in
        some way
       Example is a completely labeled page
    Wrapper induction:
Delimiter-based extraction


            <HTML><TITLE>Some Country Codes</TITLE>
            <B>Congo</B> <I>242</I><BR>
            <B>Egypt</B> <I>20</I><BR>
            <B>Belize</B> <I>501</I><BR>
            <B>Spain</B> <I>34</I><BR>
            </BODY></HTML>




   Use <B>, </B>, <I>, </I> for extraction
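
Executing such a delimiter-based wrapper amounts to repeated left-to-right scanning. The sketch below is a simplified reading of the idea, not the WIEN code; the function name `lr_extract` and the choice to stop at the first delimiter that cannot be found are assumptions.

```python
def lr_extract(page, delims):
    """Extract rows from `page` with delimiters [(l1, r1), ..., (lK, rK)].

    For each attribute, scan to the next left delimiter, then take
    everything up to the next right delimiter. Stop when a delimiter
    can no longer be found.
    """
    rows, pos = [], 0
    while True:
        row = []
        for left, right in delims:
            start = page.find(left, pos)
            if start == -1:
                return rows
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return rows
            row.append(page[start:end])
            pos = end + len(right)  # continue after the right delimiter
        rows.append(tuple(row))


PAGE = (
    "<HTML><TITLE>Some Country Codes</TITLE>\n"
    "<B>Congo</B> <I>242</I><BR>\n"
    "<B>Egypt</B> <I>20</I><BR>\n"
    "<B>Belize</B> <I>501</I><BR>\n"
    "<B>Spain</B> <I>34</I><BR>\n"
    "</BODY></HTML>"
)

rows = lr_extract(PAGE, [("<B>", "</B>"), ("<I>", "</I>")])
print(rows)
# [('Congo', '242'), ('Egypt', '20'), ('Belize', '501'), ('Spain', '34')]
```

The wrapper works here because the delimiters never occur in the page header or inside a value; a page whose title contained "<B>" would need a more expressive wrapper.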
                   Learning LR wrappers

          labeled pages  ->  wrapper l1, r1, …, lK, rK

<HTML><HEAD>Some Country Codes</HEAD>
<B>Congo</B> <I>242</I><BR>
<B>Egypt</B> <I>20</I><BR>
<B>Belize</B> <I>501</I><BR>
<B>Spain</B> <I>34</I><BR>
</BODY></HTML>

                         Example: Find 4 strings
                             <B>, </B>, <I>, </I>
                              l1,   r1,  l2,   r2
             LR: Finding r1


<HTML><TITLE>Some Country Codes</TITLE>
<B>Congo</B> <I>242</I><BR>
<B>Egypt</B> <I>20</I><BR>
<B>Belize</B> <I>501</I><BR>
<B>Spain</B> <I>34</I><BR>
</BODY></HTML>
          LR: Finding l1, l2 and r2

<HTML><TITLE>Some Country Codes</TITLE>
<B>Congo</B> <I>242</I><BR>
<B>Egypt</B> <I>20</I><BR>
<B>Belize</B> <I>501</I><BR>
<B>Spain</B> <I>34</I><BR>
</BODY></HTML>
                       WIEN system

   Assumes items are always in fixed, known
    order
       … Name: J. Doe; Address: 1 Main; Phone: 111-1111. <p>
        Name: E. Poe; Address: 10 Pico; Phone: 777-1111. <p> …
   Introduces several types of wrappers

       LR
           Learning LR extraction rules




   Admissible rules:
       prefixes & suffixes of items of interest
   Search strategy:
       start with shortest prefix & suffix, and expand until
        correct
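
The search described above can be sketched as a brute-force learner. This is an illustration under several assumptions, not Kushmerick's algorithm as published: it uses a single labeled page, locates the labeled values by plain string search, and tries delimiter combinations from shortest to longest total length until executing the wrapper reproduces the labels. All function names are hypothetical.

```python
from itertools import product
from os.path import commonprefix


def lr_extract(page, delims):
    """Execute an LR wrapper given as [(l1, r1), ..., (lK, rK)]."""
    rows, pos = [], 0
    while True:
        row = []
        for left, right in delims:
            start = page.find(left, pos)
            if start == -1:
                return rows
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return rows
            row.append(page[start:end])
            pos = end + len(right)
        rows.append(tuple(row))


def locate(page, rows):
    """Find (start, end) for each labeled value, assuming page order."""
    spans, pos = [], 0
    for row in rows:
        for value in row:
            start = page.index(value, pos)
            pos = start + len(value)
            spans.append((start, pos))
    return spans


def learn_lr(page, rows):
    """Return the shortest delimiters that reproduce `rows`, or None."""
    k = len(rows[0])
    spans = locate(page, rows)
    before = [[] for _ in range(k)]  # text preceding each value of attribute j
    after = [[] for _ in range(k)]   # text following each value of attribute j
    prev_end = 0
    for i, (start, end) in enumerate(spans):
        next_start = spans[i + 1][0] if i + 1 < len(spans) else len(page)
        before[i % k].append(page[prev_end:start])
        after[i % k].append(page[end:next_start])
        prev_end = end
    slots = []
    for j in range(k):
        # Admissible l_j: suffixes of the common suffix of the preceding text.
        suffix = commonprefix([s[::-1] for s in before[j]])[::-1]
        # Admissible r_j: prefixes of the common prefix of the following text.
        prefix = commonprefix(after[j])
        slots.append([suffix[-n:] for n in range(1, len(suffix) + 1)])
        slots.append([prefix[:n] for n in range(1, len(prefix) + 1)])
    # Start with the shortest candidates and expand until correct.
    for combo in sorted(product(*slots), key=lambda c: sum(len(s) for s in c)):
        delims = list(zip(combo[0::2], combo[1::2]))
        if lr_extract(page, delims) == rows:
            return delims
    return None


PAGE = (
    "<HTML><TITLE>Some Country Codes</TITLE>\n"
    "<B>Congo</B> <I>242</I><BR>\n"
    "<B>Egypt</B> <I>20</I><BR>\n"
    "<B>Belize</B> <I>501</I><BR>\n"
    "<B>Spain</B> <I>34</I><BR>\n"
    "</BODY></HTML>"
)
ROWS = [("Congo", "242"), ("Egypt", "20"), ("Belize", "501"), ("Spain", "34")]

learned = learn_lr(PAGE, ROWS)
print(learned)
print(lr_extract(PAGE, learned) == ROWS)  # True
```

Validating each candidate by executing the wrapper keeps the sketch short at the cost of trying many combinations; it also shows why the whole page must be labeled, since unlabeled rows would make correct wrappers look wrong.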
                   Summary of WIEN

   Advantages:
       Fast to learn & extract
   Drawbacks:
       Cannot handle permutations and missing items
       Must label entire page
       Requires large number of examples
STALKER: Hierarchical boundary finding
                                  [Muslea, Minton & Knoblock '99]
   Hierarchical wrapper induction
       Decomposes a hard problem into several easier ones
       Extracts items independently of each other
       Each rule is a finite automaton
(BEFORE=null, AFTER=(Tutorial,Topics))




(BEFORE=null, AFTER=(Tutorials,and))
(BEFORE=null, AFTER=(<,li,>,))
(BEFORE=(:), AFTER=null)
(BEFORE=(:), AFTER=null)
(BEFORE=(:), AFTER=null)
Embedded Catalog Tree (ECT)
               Extraction Rules

   Extraction rule: sequence of landmarks
More about Extraction Rules
Example of Rule Induction
          Stalker: summary and results

   Rule format:
       “landmark automata” format for rules
           E.g.: <a>W. Cohen</a> CMU: Web IE </li>

           STALKER: BEGIN = SkipTo(<, /, a, >), SkipTo(:)


   Top-down rule learning algorithm
       Carefully chosen ordering between types of rule
        specializations
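
The BEGIN rule in the example above can be applied with a small landmark matcher. This sketch handles only literal-token landmarks and a simple tokenizer chosen for illustration; the function names are hypothetical and it does not reproduce the STALKER rule learner.

```python
import re


def tokenize(text):
    """Split into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def skip_to(tokens, pos, landmark):
    """Return the index just past the first match of `landmark` at or after pos."""
    n = len(landmark)
    for i in range(pos, len(tokens) - n + 1):
        if tokens[i:i + n] == list(landmark):
            return i + n
    return None


def apply_rule(tokens, rule):
    """Apply a sequence of SkipTo landmarks; return the final position or None."""
    pos = 0
    for landmark in rule:
        pos = skip_to(tokens, pos, landmark)
        if pos is None:
            return None
    return pos


tokens = tokenize("<a>W. Cohen</a> CMU: Web IE </li>")
# BEGIN = SkipTo(<, /, a, >), SkipTo(:)
BEGIN = [("<", "/", "a", ">"), (":",)]
start = apply_rule(tokens, BEGIN)
print(tokens[start:start + 2])  # ['Web', 'IE']
```

Each SkipTo consumes input up to and including its landmark, so the rule reads as: skip past the closing anchor tag, then past the next colon, and begin extracting there.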
Summary

       Information Extraction
           Filling slots in database
       Wrapper Induction
           Wrapper toolkits
           LR Wrapper
                      Readings

   [1] I. Muslea, S. Minton, and C. Knoblock, "A hierarchical
    approach to wrapper induction," in Proceedings
    of the Third Annual Conference on Autonomous
    Agents. Seattle, Washington, United States: ACM,
    1999.
Thank You!
    Q&A

				