
Information Extraction and Integration: an Overview



       William W. Cohen
        Carnegie Mellon University



              April 26, 2004
Example: The Problem

[Figure: Martin Baker, a person; a Genomics job; an employer's job
posting form.]
Example: A Solution
Extracting Job Openings from the Web

                    foodscience.com-Job2

                    JobTitle: Ice Cream Guru
                    Employer: foodscience.com
                    JobCategory: Travel/Hospitality
                    JobFunction: Food Services
                    JobLocation: Upper Midwest
                    Contact Phone: 800-488-2611
                    DateExtracted: January 8, 2001
                    Source: www.foodscience.com/jobs_midwest.htm
                    OtherCompanyJobs: foodscience.com-Job1
Job Openings:
Category = Food Services
Keyword = Baker
Location = Continental U.S.
IE from Research Papers
               What is "Information Extraction"

As a task: Filling slots in a database from sub-segments of text.

   October 14, 2002, 4:00 a.m. PT

   For years, Microsoft Corporation CEO Bill Gates railed against the
   economic philosophy of open-source software with Orwellian fervor,
   denouncing its communal licensing as a "cancer" that stifled
   technological innovation.

   Today, Microsoft claims to "love" the open-source concept, by which
   software code is made public to encourage improvement and
   development by outside programmers. Gates himself says Microsoft
   will gladly disclose its crown jewels--the coveted code behind the
   Windows operating system--to select customers.

   "We can be open source. We love the concept of shared source," said
   Bill Veghte, a Microsoft VP. "That's a super-important shift for us
   in terms of code access."

   Richard Stallman, founder of the Free Software Foundation, countered
   saying...

Output table (empty so far):

   NAME               TITLE     ORGANIZATION
               What is "Information Extraction"

As a task: Filling slots in a database from sub-segments of text.

   (same news excerpt as above)

                  IE

   NAME               TITLE     ORGANIZATION
   Bill Gates         CEO       Microsoft
   Bill Veghte        VP        Microsoft
   Richard Stallman   founder   Free Soft..
               What is "Information Extraction"

As a family of techniques:
   Information Extraction =
     segmentation + classification + clustering + association

   (same news excerpt as above)

Segmentation + classification (aka "named entity extraction") yields:

   Microsoft Corporation
   CEO
   Bill Gates
   Microsoft
   Gates
   Microsoft
   Bill Veghte
   Microsoft
   VP
   Richard Stallman
   founder
   Free Software Foundation
               What is "Information Extraction"

As a family of techniques:
   Information Extraction =
     segmentation + classification + association + clustering

   (same news excerpt as above)

Association and clustering group the extracted entities:

   * Microsoft Corporation
       CEO
       Bill Gates
   * Microsoft
       Gates
   * Microsoft
       Bill Veghte
   * Microsoft
       VP
       Richard Stallman
       founder
       Free Software Foundation
               Tutorial Outline
• IE History
• Landscape of problems and solutions
• Models for named entity recognition:
  – Sliding window
  – Boundary finding
  – Finite state machines
• Overview of related problems and solutions
  – Association, Clustering
  – Integration with Data Mining
                            IE History
Pre-Web
• Mostly news articles
   – De Jong's FRUMP [1982]
       • Hand-built system to fill Schank-style "scripts" from news wire
   – Message Understanding Conference (MUC) DARPA ['87-'95],
     TIPSTER ['92-'96]
• Early work dominated by hand-built models
   – E.g. SRI's FASTUS, hand-built FSMs.
   – But by the 1990s, some machine learning: Lehnert, Cardie, Grishman,
     and then HMMs: Elkan [Leek '97], BBN [Bikel et al '98]
Web
• AAAI '94 Spring Symposium on "Software Agents"
   – Much discussion of ML applied to the Web. Maes, Mitchell, Etzioni.
• Tom Mitchell's WebKB, '96
   – Build KBs from the Web.
• Wrapper Induction
   – Initially hand-built, then ML: [Soderland '96], [Kushmerick '97], …
   – Citeseer; Cora; FlipDog; contEd courses, corpInfo, …
                    IE History
Biology
• Gene/protein entity extraction
• Extraction of protein/protein interaction facts
• Automated curation/integration of databases
  – At CMU: SLIF (Murphy et al, subcellular information
    from images + text in journal articles)
Email
• EPCA, PAL, RADAR, CALO: intelligent office
  assistants that "understand" some part of email
  – At CMU: web site update requests, office-space
    requests, calendar scheduling requests, social
    network analysis of email.
                 IE is different in different domains!
     Example: on the web there is less grammar, but more formatting & linking

Newswire:

   Apple to Open Its First Retail Store in New York City

   MACWORLD EXPO, NEW YORK--July 17, 2002--Apple's first retail store
   in New York City will open in Manhattan's SoHo district on Thursday,
   July 18 at 8:00 a.m. EDT. The SoHo store will be Apple's largest
   retail store to date and is a stunning example of Apple's commitment
   to offering customers the world's best computer shopping experience.

   "Fourteen months after opening our first retail store, our 31 stores
   are attracting over 100,000 visitors each week," said Steve Jobs,
   Apple's CEO. "We hope our SoHo store will surprise and delight both
   Mac and PC users who want to see everything the Mac can do to
   enhance their digital lifestyles."

Web: [screenshots of www.apple.com/retail, www.apple.com/retail/soho,
and www.apple.com/retail/soho/theatre.html]

The directory structure, link structure, formatting & layout of the Web
is its own new grammar.
                 Landscape of IE Tasks (1/4):
                      Degree of Formatting

Text paragraphs without formatting:

   Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a
   Ph.D. in Artificial Intelligence from Carnegie Mellon University,
   where he was inducted as a national Hertz fellow. His M.S. in
   symbolic and heuristic computation and B.S. in computer science are
   from Stanford University. His work in science, literature and
   business has appeared in international media from the New York Times
   to CNN to NPR.

Grammatical sentences and some formatting & links: [example screenshot
not reproduced]

Non-grammatical snippets, rich formatting & links: [example screenshot
not reproduced]

Tables: [example screenshot not reproduced]
                 Landscape of IE Tasks (2/4):
                 Intended Breadth of Coverage

   Web site specific:    exploits formatting   (e.g. Amazon.com book pages)
   Genre specific:       exploits layout       (e.g. resumes)
   Wide, non-specific:   exploits language     (e.g. university names)
                Landscape of IE Tasks (3/4):
                                 Complexity
E.g. word patterns:
           Closed set                           Regular set
        U.S. states                      U.S. phone numbers

        He was born in Alabama…          Phone: (413) 545-1323

        The big Wyoming sky…             The CALD main office can be
                                         reached at 412-268-1299


           Complex pattern                 Ambiguous patterns,
                                           needing context and
        U.S. postal addresses              many sources of evidence
        University of Arkansas          Person names
        P.O. Box 140                    …was among the six houses
        Hope, AR 71802                  sold by Hope Feldman that year.

        Headquarters:                   Pawel Opalinski, Software
        1128 Main Street, 4th Floor     Engineer at WhizBang Labs.
        Cincinnati, Ohio 45210
               Landscape of IE Tasks (4/4):
                          Single Field/Record

   Jack Welch will retire as CEO of General Electric tomorrow. The top role
   at the Connecticut company will be filled by Jeffrey Immelt.

Single entity ("named entity" extraction):
   Person: Jack Welch
   Person: Jeffrey Immelt
   Location: Connecticut

Binary relationship:
   Relation: Person-Title
   Person:   Jack Welch
   Title:    CEO

   Relation: Company-Location
   Company:  General Electric
   Location: Connecticut

N-ary record:
   Relation: Succession
   Company:  General Electric
   Title:    CEO
   Out:      Jack Welch
   In:       Jeffrey Immelt
                       Landscape of IE Techniques (1/1):
                                   Models

Each model family, illustrated on the sentence "Abraham Lincoln was
born in Kentucky.":

   Lexicons: look up tokens in a list (Alabama, Alaska, …, Wisconsin,
     Wyoming): member?
   Classify pre-segmented candidates: a classifier asks "which class?"
     for each candidate segment.
   Sliding window: a classifier asks "which class?" for each window,
     trying alternate window sizes.
   Boundary models: classifiers mark BEGIN and END positions, which are
     then paired.
   Finite state machines: find the most likely state sequence.
   Context free grammars: parse the sentence
     (NNP NNP V V P NP, building NP, PP, VP up to S).

Any of these models can be used to capture words, formatting or both.
Sliding Windows
          Extraction by Sliding Window

                  GRAND CHALLENGES FOR MACHINE LEARNING

                         Jaime Carbonell
                     School of Computer Science
                    Carnegie Mellon University

E.g.                         3:30 pm
                          7500 Wean Hall
Looking for
seminar       Machine learning has evolved from obscurity
location      in the 1970s into a vibrant and popular
              discipline in artificial intelligence
              during the 1980s and 1990s.   As a result
              of its success and growth, machine learning
              is evolving into a collection of related
              disciplines: inductive concept acquisition,
              analytic learning in problem solving (e.g.
              analogy, explanation-based learning),
              learning theory (e.g. PAC learning),
              genetic algorithms, connectionist learning,
              hybrid systems, and so on.

                  CMU UseNet Seminar Announcement
     A "Naïve Bayes" Sliding Window Model
                                                         [Freitag 1997]

   … 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …
     [ w_{t-m} … w_{t-1} ][ w_t … w_{t+n} ][ w_{t+n+1} … w_{t+n+m} ]
            prefix              contents              suffix

Estimate Pr(LOCATION | window) using Bayes rule.

Try all "reasonable" windows (vary length, position).

Assume independence for length, prefix words, suffix words, content
words.

Estimate from data quantities like: Pr("Place" in prefix | LOCATION).

If Pr("Wean Hall Rm 5409" = LOCATION) is above some threshold, extract
it.
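
A minimal Python sketch of this scoring scheme (not from the tutorial;
the probability tables are made-up toys standing in for counts
estimated from labeled windows):

   import math

   # Toy, hypothetical probability tables; a real system estimates
   # these from counts over labeled training windows.
   PR_PREFIX  = {"Place": 0.30, ":": 0.25}   # Pr(word in prefix | LOCATION)
   PR_CONTENT = {"Wean": 0.20, "Hall": 0.20, "Rm": 0.15, "5409": 0.01}
   PR_SUFFIX  = {"Speaker": 0.10, ":": 0.25}
   PR_LENGTH  = {3: 0.3, 4: 0.4, 5: 0.2}     # Pr(content length | LOCATION)
   UNK = 1e-6                                # smoothing for unseen events

   def log_pr(table, words):
       return sum(math.log(table.get(w, UNK)) for w in words)

   def score_window(tokens, start, end, m=2):
       # Naive Bayes log-score that tokens[start:end] is a LOCATION,
       # assuming independence of length, prefix, content, and suffix.
       prefix  = tokens[max(0, start - m):start]
       content = tokens[start:end]
       suffix  = tokens[end:end + m]
       return (math.log(PR_LENGTH.get(len(content), UNK))
               + log_pr(PR_PREFIX, prefix)
               + log_pr(PR_CONTENT, content)
               + log_pr(PR_SUFFIX, suffix))

   tokens = "00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian".split()
   # Try all "reasonable" windows (here: lengths 1-5), keep the best.
   i, j, s = max(((i, j, score_window(tokens, i, j))
                  for i in range(len(tokens))
                  for j in range(i + 1, min(i + 6, len(tokens) + 1))),
                 key=lambda c: c[2])
   print(" ".join(tokens[i:j]))   # "Wean Hall Rm 5409" wins with these toys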
     "Naïve Bayes" Sliding Window Results

Domain: CMU UseNet Seminar Announcements

   GRAND CHALLENGES FOR MACHINE LEARNING

          Jaime Carbonell
      School of Computer Science
     Carnegie Mellon University

               3:30 pm
            7500 Wean Hall

   Machine learning has evolved from obscurity in the 1970s into a
   vibrant and popular discipline in artificial intelligence during the
   1980s and 1990s. As a result of its success and growth, machine
   learning is evolving into a collection of related disciplines:
   inductive concept acquisition, analytic learning in problem solving
   (e.g. analogy, explanation-based learning), learning theory (e.g.
   PAC learning), genetic algorithms, connectionist learning, hybrid
   systems, and so on.

   Field          F1
   Person Name    30%
   Location       61%
   Start Time     98%
SRV: a realistic sliding-window-classifier
                IE system
                                                     [Freitag AAAI '98]
 • What windows to consider?
     – all windows containing as many tokens as the shortest
       example, but no more tokens than the longest example
 • How to represent a classifier? It might:
      – Restrict the length of window;
      – Restrict the vocabulary or formatting used
        before/after/inside window;
      – Restrict the relative order of tokens;
      – Use inductive logic programming techniques to express all
        these…
<title>Course Information for CS213</title>
<h1>CS 213 C++ Programming</h1>
SRV: a rule-learner for sliding-window
             classification
• Primitive predicates used by SRV:
  – token(X,W), allLowerCase(W), numerical(W), …
  – nextToken(W,U), previousToken(W,V)


• HTML-specific predicates:
  – inTitleTag(W), inH1Tag(W), inEmTag(W),…
  – emphasized(W) = “inEmTag(W) or inBTag(W) or …”
  – tableNextCol(W,U) = “U is some token in the column
    after the column W is in”
  – tablePreviousCol(W,V), tableRowHeader(W,T),…
    SRV: a rule-learner for sliding-window
                 classification
• Non-primitive "conditions" used by SRV:
    – every(+X, f, c) = for all W in X: f(W)=c
    – some(+X, W, <f1,…,fk>, g, c) = exists W in X: g(fk(…(f1(W))…))=c
    – tokenLength(+X, relop, c)
    – position(+W, direction, relop, c)
        • e.g., tokenLength(X,>,4), position(W,fromEnd,<,2)


courseNumber(X) :-
       tokenLength(X,=,2),                                Non-primitive
       every(X, inTitle, false),                          conditions
       some(X, A, <previousToken>, inTitle, true),        make greedy
       some(X, B, <>, tripleton, true)                    search easier
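
A toy Python reading of the rule above (hypothetical token model: each
token is a dict carrying its text and an inTitle flag):

   def tripleton(tok):              # e.g. "213": exactly three characters
       return len(tok["text"]) == 3

   def course_number(X, tokens):
       # X is a candidate window: a list of indices into tokens.
       if len(X) != 2:                                # tokenLength(X,=,2)
           return False
       if any(tokens[i]["inTitle"] for i in X):       # every(X, inTitle, false)
           return False
       # some(X, A, <previousToken>, inTitle, true)
       if not any(i > 0 and tokens[i - 1]["inTitle"] for i in X):
           return False
       return any(tripleton(tokens[i]) for i in X)    # some(X, B, <>, tripleton, true)

   words = ["Course", "Information", "for", "CS213",
            "CS", "213", "C++", "Programming"]
   toks = [{"text": w, "inTitle": i < 4} for i, w in enumerate(words)]
   print(course_number([4, 5], toks))   # True: "CS 213" looks like a course number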
Rapier – results vs. SRV
  Rule-learning approaches to sliding-
   window classification: Summary
• SRV, Rapier, and WHISK [Soderland KDD '97]
   – Representations for classifiers allow restriction of the
     relationships between tokens, etc.
   – Representations are carefully chosen subsets of even more
     powerful representations based on logic programming (ILP
     and Prolog)
   – Use of these "heavyweight" representations is complicated,
     but seems to pay off in results
• Some questions to consider:
   – Can simpler, propositional representations for classifiers
     work? (see Roth and Yih)
   – What learning methods to consider? (NB, ILP, boosting,
     semi-supervised – see Collins & Singer)
   – When do we want to use this method vs. fancier ones?
  BWI: Learning to detect boundaries
                              [Freitag & Kushmerick, AAAI 2000]

• Another formulation: learn three probabilistic
  classifiers:
   – START(i) = Prob( position i starts a field)
   – END(j) = Prob( position j ends a field)
   – LEN(k) = Prob( an extracted field has length k)
• Then score a possible extraction (i,j) by
   START(i) * END(j) * LEN(j-i)

• LEN(k) is estimated from a histogram
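
A toy sketch of this scoring rule (the start/end stubs below are
hypothetical stand-ins for learned boundary detectors; LEN is the
length histogram):

   # Pr(position i starts a field): stub detector keyed on a ":" cue.
   def start_prob(tokens, i):
       return 0.9 if i > 0 and tokens[i - 1] == ":" else 0.05

   # Pr(position j ends a field): stub keyed on what follows the span.
   def end_prob(tokens, j):
       return 0.8 if j == len(tokens) or tokens[j][0].isupper() else 0.1

   LEN = {1: 0.1, 2: 0.3, 3: 0.4, 4: 0.2}   # length histogram from training

   def field_score(tokens, i, j):           # score extraction (i, j)
       return start_prob(tokens, i) * end_prob(tokens, j) * LEN.get(j - i, 0.0)

   tokens = "Place : Wean Hall Rm 5409 Speaker".split()
   spans = [(i, j, field_score(tokens, i, j))
            for i in range(len(tokens)) for j in range(i + 1, len(tokens) + 1)]
   print(max(spans, key=lambda s: s[2]))    # best-scoring (i, j) span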
BWI: Learning to detect boundaries

   Field          F1
   Person Name    30%
   Location       61%
   Start Time     98%
     Problems with Sliding Windows
         and Boundary Finders
• Decisions in neighboring parts of the input are made
  independently from each other.

   – Expensive for long entity names
   – Sliding Window may predict a “seminar end time” before the
     “seminar start time”.
   – It is possible for two overlapping windows to both be above
     threshold.

   – In a Boundary-Finding system, left boundaries are laid down
     independently from right boundaries, and their pairing
     happens as a separate step.
Finite State Machines
          IE with Hidden Markov Models

Given a sequence of observations:

       Yesterday Pedro Domingos spoke this example sentence.

and a trained HMM with states for person name, location name, and
background, find the most likely state sequence (Viterbi):

       argmax_s P(s, o)

Any words said to be generated by the designated "person name" state
are extracted as a person name:

       Person name: Pedro Domingos
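
A compact Viterbi sketch in Python (toy, untrained parameters; a real
IE HMM would have learned probabilities and richer state structure):

   STATES = ["background", "person", "location"]

   def viterbi(obs, start_p, trans_p, emit_p):
       # Returns the most likely state sequence, argmax_s P(s, o).
       V = [{s: (start_p[s] * emit_p[s].get(obs[0], 1e-6), [s])
             for s in STATES}]
       for o in obs[1:]:
           V.append({s: max(((p * trans_p[prev][s] * emit_p[s].get(o, 1e-6),
                              path + [s])
                             for prev, (p, path) in V[-1].items()),
                            key=lambda t: t[0])
                     for s in STATES})
       return max(V[-1].values(), key=lambda t: t[0])[1]

   start = {s: 1.0 / 3 for s in STATES}
   trans = {s: {t: 0.8 if s == t else 0.1 for t in STATES} for s in STATES}
   emit = {"background": {"Yesterday": 0.1, "spoke": 0.1, "this": 0.1},
           "person": {"Pedro": 0.3, "Domingos": 0.3},
           "location": {"Pittsburgh": 0.3}}
   print(viterbi("Yesterday Pedro Domingos spoke".split(), start, trans, emit))
   # -> ['background', 'person', 'person', 'background']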
            HMM for Segmentation




• Simplest Model: One state per entity type
                          What is a "symbol"?

Cohen => "Cohen", "cohen", "Xxxxx", "Xx", … ?
4601 => "4601", "9999", "9+", "number", … ?

One answer: a taxonomy of abstraction levels:

   All
   ├─ Numbers
   │    ├─ 3-digits  (000..999)
   │    ├─ 5-digits  (00000..99999)
   │    └─ Others    (0..99, 0000..9999, 000000..)
   ├─ Words
   │    ├─ Chars        (A..z)
   │    └─ Multi-letter (aa..)
   └─ Delimiters  (. , / - + ? #)

Datamold: choose best abstraction level using holdout set
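
A sketch of what such abstraction levels might look like as code (the
exact levels are illustrative, not Datamold's):

   import re

   def abstract(token, level):
       if level == 0:                 # the word itself: "Cohen", "4601"
           return token
       if level == 1:                 # character shape: "Xxxxx", "9999"
           return re.sub(r"\d", "9",
                  re.sub(r"[a-z]", "x", re.sub(r"[A-Z]", "X", token)))
       if level == 2:                 # collapsed shape: "Xx+", "9+"
           return re.sub(r"(.)\1+", r"\1+", abstract(token, 1))
       return "number" if token.isdigit() else "word"   # coarsest class

   for t in ["Cohen", "4601"]:
       print([abstract(t, k) for k in range(4)])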
                       HMM Example: "Nymble"
                                          [Bikel et al 1998], [BBN "IdentiFinder"]

Task: Named Entity Extraction

States: start-of-sentence, end-of-sentence, Person, Org, (five other
name classes), Other.

   Transition probabilities:   P(s_t | s_{t-1}, o_{t-1})
   Observation probabilities:  P(o_t | s_t, s_{t-1})  or  P(o_t | s_t, o_{t-1})

   Back-off to: P(s_t | s_{t-1}), then P(s_t)
   Back-off to: P(o_t | s_t), then P(o_t)

Train on ~500k words of news wire text.

Results:
   Case    Language   F1
   Mixed   English    93%
   Upper   English    91%
   Mixed   Spanish    90%

Other examples of shrinkage for HMMs in IE: [Freitag and McCallum '99]
                   What is a symbol?
Bikel et al mix symbols from two abstraction levels
                     What is a symbol?

Ideally we would like to use many, arbitrary, overlapping features of
words:

   identity of word
   ends in "-ski"
   is capitalized
   is part of a noun phrase
   is in a list of city names
   is under node X in WordNet
   is in bold font
   is indented
   is in hyperlink anchor
   …

[Figure: a chain of states S_{t-1}, S_t, S_{t+1} over observations
O_{t-1}, O_t, O_{t+1}, with features such as is "Wisniewski", part of
noun phrase, and ends in "-ski" attached to each observation.]

Lots of learning systems are not confounded by multiple,
non-independent features: decision trees, neural nets, SVMs, …
                     What is a symbol?

(same feature set and chain structure as above)

Idea: replace the generative model in the HMM with a maxent model,
where the state depends on the observations:

       Pr(s_t | x_t) = …
                     What is a symbol?

Idea: replace the generative model in the HMM with a maxent model,
where the state depends on the observations and the previous state:

       Pr(s_t | x_t, s_{t-1}) = …
                     What is a symbol?

Idea: replace the generative model in the HMM with a maxent model,
where the state depends on the observations and the previous state
history:

       Pr(s_t | x_t, s_{t-1}, s_{t-2}, …) = …
              Ratnaparkhi’s MXPOST

• Sequential learning problem:
  predict POS tags of words.
• Uses MaxEnt model
  described above.
• Rich feature set.
• To smooth, discard features
  occurring < 10 times.
     Conditional Markov Models (CMMs) aka
     MEMMs aka Maxent Taggers vs HMMs

HMM (generative; states generate observations):

       Pr(s, o) = Π_i Pr(s_i | s_{i-1}) · Pr(o_i | s_i)

CMM/MEMM (conditional; each state depends on the previous state and
the observation):

       Pr(s | o) = Π_i Pr(s_i | s_{i-1}, o_i)
                      Label Bias Problem

• Consider this MEMM, where the state sequence 0-1-2-3 spells "rib" and
  0-4-5-3 spells "rob", and enough training data to model it perfectly:

       Pr(0123|rib) = 1
       Pr(0453|rob) = 1

Pr(0123|rob) = Pr(1|0,r)/Z1 * Pr(2|1,o)/Z2 * Pr(3|2,b)/Z3
             = 0.5 * 1 * 1 = 0.5

Pr(0453|rib) = Pr(4|0,r)/Z1' * Pr(5|4,i)/Z2' * Pr(3|5,b)/Z3'
             = 0.5 * 1 * 1 = 0.5

Because each state's outgoing probabilities must sum to one, states 1
and 4 pass all their mass forward regardless of the observation: the
middle symbol is effectively ignored, so "rib" and "rob" cannot be
distinguished.
Another view of label bias [Sha & Pereira]




So what’s the alternative?
                              CMMs to CRFs

   Pr(y_1…y_n | x_1…x_n) = Π_j Pr(y_j | y_{j-1}, x_j)
                         = Π_j exp(Σ_i λ_i f_i(x_j, y_j, y_{j-1})) / Z_λ(x_j)
                         = exp(Σ_i λ_i F_i(x, y)) / Π_j Z_λ(x_j),
                           where F_i(x, y) = Σ_j f_i(x_j, y_j, y_{j-1})

New model (one global normalizer instead of per-position ones):

   Pr(y | x) = exp(Σ_i λ_i F_i(x, y)) / Z_λ(x)
      What's the new model look like?

   exp(Σ_i λ_i F_i(x, y)) / Z_λ(x)
     = exp(Σ_i λ_i Σ_j f_i(x_j, y_j, y_{j-1})) / Z_λ(x)

What's independent?

[Figure: an undirected chain y1 - y2 - y3, with each y_j connected to
its observation x_j.]
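
A toy sketch of the globally normalized model (hypothetical feature
functions; Z computed here by brute-force enumeration, which only
works at this scale; real CRFs use forward-backward):

   import itertools, math

   LABELS = ["other", "person"]

   def features(x, y_prev, y, j):     # f_i(x_j, y_j, y_{j-1}), made up
       return {"cap&person": float(x[j][0].isupper() and y == "person"),
               "person->person": float(y_prev == "person" and y == "person")}

   def score(x, ys, w):               # sum_i lambda_i F_i(x, y)
       total = 0.0
       for j in range(len(x)):
           y_prev = ys[j - 1] if j > 0 else "START"
           for name, val in features(x, y_prev, ys[j], j).items():
               total += w[name] * val
       return total

   def prob(x, ys, w):                # exp(score) / Z(x)
       Z = sum(math.exp(score(x, list(alt), w))
               for alt in itertools.product(LABELS, repeat=len(x)))
       return math.exp(score(x, list(ys), w)) / Z

   w = {"cap&person": 1.5, "person->person": 1.0}
   x = "Pedro Domingos spoke".split()
   print(prob(x, ["person", "person", "other"], w))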
       Graphical comparison among HMMs, MEMMs and CRFs

[Figure: the three chain-structured graphical models side by side:
HMM (directed; states generate observations), MEMM (directed; states
conditioned on observations), CRF (undirected).]
Sha & Pereira results




           CRF beats MEMM
           (McNemar’s test); MEMM
           probably beats voted
           perceptron
Sha & Pereira results




 in minutes, 375k examples
Combining Sliding Windows and Finite-
            State Models
  Problems with Tagging-based NER

• Decisions are made token-by-token
• Feature engineering can be awkward:
   – How long is the extracted name?
   – Does the extracted name appear in a dictionary?
   – Is the extracted name similar to (an acronym for, a
     short version of...) a name appearing earlier in the
     document?
• “Tag engineering” is also crucial:
   – Do we use one state per entity? a special state for
     beginning an entity? a special state for end?
            Semi-Markov models for IE
                                   with Sunita Sarawagi, IIT Bombay

• Train on sequences of labeled segments, not labeled words:
    S = (start, end, label)
• Build a probability model of segment sequences, not word sequences.
• Define features f of segments:
    f(S) = words x_t…x_u, length, previous words, case information, …,
    distance to known name
• (Approximately) optimize feature weights on training data:

    maximize:  Σ_{i=1..m} log Pr(S_i | x_i)
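
A sketch of segment-level features f(S) (hypothetical feature names;
the point is that a feature can inspect the whole span at once):

   KNOWN_NAMES = {"fred", "sunita"}       # toy dictionary of known names

   def segment_features(x, start, end, label):
       words = x[start:end + 1]           # the whole segment is visible
       phrase = " ".join(words).lower()
       return {"label=%s,len=%d" % (label, len(words)): 1.0,
               "label=%s,all_cap=%s" % (label,
                   all(w[0].isupper() for w in words)): 1.0,
               "label=%s,in_dict=%s" % (label, phrase in KNOWN_NAMES): 1.0}

   x = "Fred please stop by my office this afternoon".split()
   print(segment_features(x, 0, 0, "Person"))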
                       Segments vs tagging

Tagging view: one label per token, features f(x_t, y_t):

   t   1       2       3      4      5    6       7      8
   x   Fred    please  stop   by     my   office  this   afternoon
   y   Person  other   other  other  Loc  Loc     other  Time

Segment view: one label per segment, features f(x_j, y_j):

   (t,u)  t1=u1=1   t2=2,u2=4        t3=5,u3=6   t4=u4=7   t5=u5=8
   x      Fred      please stop by   my office   this      afternoon
   y      Person    other            Loc         other     Time
  A training algorithm for CSMM's (1)

Review: Collins' perceptron training algorithm.

[Figure: on each example, compare the correct tags to the Viterbi tags
under the current weights; update the weights on the difference.]
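
A sketch of the reviewed algorithm (phi and decode are assumed helpers:
a feature-vector function and a Viterbi decoder such as the one
sketched earlier):

   from collections import defaultdict

   def perceptron_train(data, phi, decode, epochs=5):
       w = defaultdict(float)
       for _ in range(epochs):
           for x, y_correct in data:        # gold tag sequence
               y_hat = decode(x, w)         # Viterbi tags under current w
               if y_hat != y_correct:       # additive update toward gold
                   for f, v in phi(x, y_correct).items():
                       w[f] += v
                   for f, v in phi(x, y_hat).items():
                       w[f] -= v
       return w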
        A training algorithm for CSMM's (2)

Variant of Collins' perceptron training algorithm: a voted perceptron
learner for TTRANS, decoded with a Viterbi-like procedure.
        A training algorithm for CSMM's (3)

Variant of Collins' perceptron training algorithm: a voted perceptron
learner for TSEGTRANS, decoded with a Viterbi-like procedure.
Results
Results: vs CRF
Broader Issues in IE
                                     Broader View

Up to now we have been focused on segmentation and classification.
The full pipeline:

   Create ontology
   Spider -> Filter by relevance
   IE: Segment, Classify, Associate, Cluster
   Load DB -> Database -> Query, Search -> Data mine
   Document collection -> Label training data -> Train extraction models
                                      Broader View

Now touch on some other issues in that pipeline (IE here = Tokenize,
Segment, Classify, Associate, Cluster):

   1. Associate
   2. Cluster
   3. Create ontology
   4. Train extraction models
   5. Data mine
(1) Association as Binary Classification

   Christos Faloutsos conferred with Ted Senator, the KDD 2003 General Chair.
   [Person]                          [Person]          [Role]

   Person-Role(Christos Faloutsos, KDD 2003 General Chair) -> NO
   Person-Role(Ted Senator, KDD 2003 General Chair)        -> YES

Do this with SVMs and tree kernels over parse trees.
                                                [Zelenko et al, 2002]
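
A sketch of the candidate-pair formulation (the classify stub is a
hypothetical distance heuristic standing in for the SVM with tree
kernels):

   def candidate_pairs(entities):
       people = [e for e, kind in entities if kind == "Person"]
       roles  = [e for e, kind in entities if kind == "Role"]
       return [(p, r) for p in people for r in roles]

   def classify(person, role, sentence):    # stub: character distance
       return abs(sentence.find(person) - sentence.find(role)) < 40

   s = "Christos Faloutsos conferred with Ted Senator, the KDD 2003 General Chair."
   ents = [("Christos Faloutsos", "Person"), ("Ted Senator", "Person"),
           ("KDD 2003 General Chair", "Role")]
   for person, role in candidate_pairs(ents):
       print(person, "/", role, "->",
             "YES" if classify(person, role, s) else "NO")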
(1) Association with Finite State Machines
                                                     [Ray & Craven, 2001]

… This enzyme, UBC6,
localizes to the endoplasmic
reticulum, with the catalytic
domain facing the cytosol. …


    DET     this
    N       enzyme
    N       ubc6
    V       localizes
    PREP    to
    ART     the
    ADJ     endoplasmic
    N       reticulum
    PREP    with
    ART     the
    ADJ     catalytic
    N       domain
    V       facing
    ART     the
    N       cytosol
                      Subcellular-localization (UBC6, endoplasmic reticulum)
 (1) Association with Graphical Models
                                                      [Roth & Yih 2002]

Capture arbitrary-distance (long-distance) dependencies among
predictions:

   One random variable per entity mention, ranging over the class of
   the entity, e.g. over {person, location, …}. In the example, entity
   #1 is classified person and entity #2 location.

   One random variable per entity pair, ranging over the class of the
   relation between entity #2 and #1, e.g. over {lives-in,
   is-boss-of, …}. In the example: lives-in.

   Local language models contribute evidence to entity classification
   and to relation classification.

   Dependencies between classes of entities and relations!

Inference with loopy belief propagation.
                                      Broader View

(Same pipeline as above: create ontology; spider; filter by relevance;
IE = tokenize, segment, classify, associate, cluster; load DB; query,
search; train extraction models from labeled data; data mine.)

Next, clustering and integration:
when do two extracted strings refer to the same object?
              (2) Information Integration
                      [Minton, Knoblock, et al 2001], [Doan, Domingos, Halevy 2001],
                                                       [Richardson & Domingos 2003]

Goal might be to merge results of two IE systems:

   System A                                System B
   Name:    Introduction to                Title:      Intro. to Comp. Sci.
            Computer Science               Num:        101
   Number:  CS 101                         Dept:       Computer Science
   Teacher: M. A. Kludge                   Teacher:    Dr. Klüdge
   Time:    9-11am                         TA:         John Smith

   Name:    Data Structures in Java        Topic:      Java Programming
   Room:    5032 Wean Hall                 Start time: 9:10 AM
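
A toy sketch of soft field matching (plain token Jaccard similarity,
deliberately simple; note how poorly it scores the abbreviated title,
which is why the choice of distance metric matters, per the next slide):

   def jaccard(a, b):
       A, B = set(a.lower().split()), set(b.lower().split())
       return len(A & B) / len(A | B) if A | B else 0.0

   rec1 = {"Name": "Introduction to Computer Science", "Number": "CS 101"}
   rec2 = {"Title": "Intro. to Comp. Sci.", "Num": "101"}

   # Compare aligned fields; an overall match score could combine these.
   print(jaccard(rec1["Name"], rec2["Title"]))   # low: abbreviations hurt
   print(jaccard(rec1["Number"], rec2["Num"]))   # 0.5: "101" overlaps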
(2) Other Information Integration Issues

  • Distance metrics for text – which work well?
     – [Cohen, Ravikumar, Fienberg, 2003]
  • Finessing integration by soft database
    operations based on similarity
     – [Cohen, 2000]
  • Integration of complex structured databases:
    (capture dependencies among multiple
    merges)
     – [Cohen, MacAllister, Kautz KDD 2000; Pasula,
       Marthi, Milch, Russell, Shpitser, NIPS 2002;
       McCallum and Wellner, KDD WS 2003]
                     IE Resources
• Data
   – RISE, http://www.isi.edu/~muslea/RISE/index.html
   – Linguistic Data Consortium (LDC)
      • Penn Treebank, Named Entities, Relations, etc.
   – http://www.biostat.wisc.edu/~craven/ie


• Papers, tutorials, lectures, code
   – http://www.cs.cmu.edu/~wcohen/10-707
                                              References
•   [Bikel et al 1997] Bikel, D.; Miller, S.; Schwartz, R.; and Weischedel, R. Nymble: a high-performance learning name-finder. In
    Proceedings of ANLP-97, pp. 194-201.
•   [Califf & Mooney 1999] Califf, M. E.; Mooney, R.: Relational Learning of Pattern-Match Rules for Information Extraction, in
    Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99).
•   [Cohen, Hurst, Jensen, 2002] Cohen, W.; Hurst, M.; Jensen, L.: A flexible learning system for wrapping tables and lists in HTML
    documents. Proceedings of The Eleventh International World Wide Web Conference (WWW-2002).
•   [Cohen, Kautz, McAllester 2000] Cohen, W.; Kautz, H.; McAllester, D.: Hardening soft information sources. Proceedings of the
    Sixth International Conference on Knowledge Discovery and Data Mining (KDD-2000).
•   [Cohen, 1998] Cohen, W.: Integration of Heterogeneous Databases Without Common Domains Using Queries Based on Textual
    Similarity, in Proceedings of ACM SIGMOD-98.
•   [Cohen, 2000a] Cohen, W.: Data Integration using Similarity Joins and a Word-based Information Representation Language,
    ACM Transactions on Information Systems, 18(3).
•   [Cohen, 2000b] Cohen, W.: Automatically Extracting Features for Concept Learning from the Web, Machine Learning:
    Proceedings of the Seventeenth International Conference (ML-2000).
•   [Collins & Singer 1999] Collins, M.; and Singer, Y. Unsupervised models for named entity classification. In Proceedings of the
    Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999.
•   [De Jong 1982] De Jong, G. An Overview of the FRUMP System. In: Lehnert, W. & Ringle, M. H. (eds), Strategies for Natural
    Language Processing. Lawrence Erlbaum, 1982, 149-176.
•   [Freitag 98] Freitag, D.: Information extraction from HTML: application of a general machine learning approach, Proceedings of
    the Fifteenth National Conference on Artificial Intelligence (AAAI-98).
•   [Freitag, 1999] Freitag, D. Machine Learning for Information Extraction in Informal Domains. Ph.D. dissertation, Carnegie Mellon
    University.
•   [Freitag 2000] Freitag, D.: Machine Learning for Information Extraction in Informal Domains, Machine Learning 39(2/3): 99-101
    (2000).
•   [Freitag & Kushmerick, 2000] Freitag, D.; Kushmerick, N.: Boosted Wrapper Induction. Proceedings of the Seventeenth National
    Conference on Artificial Intelligence (AAAI-2000).
•   [Freitag & McCallum 1999] Freitag, D. and McCallum, A. Information extraction using HMMs and shrinkage. In Proceedings of
    the AAAI-99 Workshop on Machine Learning for Information Extraction. AAAI Technical Report WS-99-11.
•   [Kushmerick, 2000] Kushmerick, N.: Wrapper Induction: efficiency and expressiveness, Artificial Intelligence, 118: 15-68.
•   [Lafferty, McCallum & Pereira 2001] Lafferty, J.; McCallum, A.; and Pereira, F., Conditional Random Fields: Probabilistic Models
    for Segmenting and Labeling Sequence Data, In Proceedings of ICML-2001.
•   [Leek 1997] Leek, T. R. Information extraction using hidden Markov models. Master's thesis, UC San Diego.
•   [McCallum, Freitag & Pereira 2000] McCallum, A.; Freitag, D.; and Pereira, F., Maximum entropy Markov models for information
    extraction and segmentation, In Proceedings of ICML-2000.
•   [Miller et al 2000] Miller, S.; Fox, H.; Ramshaw, L.; Weischedel, R. A Novel Use of Statistical Parsing to Extract Information from
    Text. Proceedings of the 1st Annual Meeting of the North American Chapter of the ACL (NAACL), pp. 226-233.
                                             References

•   [Muslea et al, 1999] Muslea, I.; Minton, S.; Knoblock, C. A.: A Hierarchical Approach to Wrapper Induction. Proceedings of
    Autonomous Agents-99.
•   [Muslea et al, 2000] Muslea, I.; Minton, S.; and Knoblock, C. Hierarchical wrapper induction for semistructured information
    sources. Journal of Autonomous Agents and Multi-Agent Systems.
•   [Nahm & Mooney, 2000] Nahm, Y.; and Mooney, R. A mutually beneficial integration of data mining and information extraction.
    In Proceedings of the Seventeenth National Conference on Artificial Intelligence, pp. 627-632, Austin, TX.
•   [Punyakanok & Roth 2001] Punyakanok, V.; and Roth, D. The use of classifiers in sequential inference. Advances in Neural
    Information Processing Systems 13.
•   [Ratnaparkhi 1996] Ratnaparkhi, A., A maximum entropy part-of-speech tagger, in Proceedings of the Empirical Methods in
    Natural Language Processing Conference, pp. 133-141.
•   [Ray & Craven 2001] Ray, S.; and Craven, M. Representing Sentence Structure in Hidden Markov Models for Information
    Extraction. Proceedings of the 17th International Joint Conference on Artificial Intelligence, Seattle, WA. Morgan Kaufmann.
•   [Soderland 1997] Soderland, S.: Learning to Extract Text-Based Information from the World Wide Web. Proceedings of the Third
    International Conference on Knowledge Discovery and Data Mining (KDD-97).
•   [Soderland 1999] Soderland, S. Learning information extraction rules for semi-structured and free text. Machine Learning,
    34(1/3): 233-277.