
									Make-up Class:
 Tomorrow (Wed) 10:30—11:45AM
     BY 210 (next to the advising office)




          Information Extraction

   (Several slides based on those by Ray Mooney
    and Cohen/McCallum, via Dan Weld's class)


                                             1
      Intended Use of Semantic Web?
• Pages should be annotated with RDF triples, with links to an
  RDF-S (or OWL) background ontology.
• E.g. see Jim Hendler's page…




                                                              2
  Database vs. Semantic Web Inference
        (and the Magellan Story)

• Can also view templated extraction as undoing the XML→HTML
  conversion. Templated extraction is by DOM patterns;
  unstructured extraction is (sort of) by grammar parse-tree
  patterns. Grammar learning is mostly from positive examples.




Rinku Patel             To be added

                                                               3
              Who will annotate the data?
•   Semantic web works if the users annotate their pages using some existing
    ontology (or their own ontology, but with mapping to other ontologies)
     –   But users typically do not conform to standards..
          • and are not patient enough for delayed gratification…
•   Three Solutions
     –   1. Intercede in the way pages are created (act as if you are helping them write
         web-pages)
           • What if we change the MS Frontpage/Claris Homepage so that they (slyly)
               add annotations?
           • E.g. The Mangrove project at U. Wash.
                 – Help user in tagging their data (allow graphical editing)
                 – Provide instant gratification by running services that use the tags.

     –   2. Collaborative tagging!
           • “Folksonomies” (look at Wikipedia article)
                  – FLICKR, Technorati, del.icio.us etc
           • CBIOC, ESP game etc.
                 – Need to incentivize users to do the annotations..

     –   3. Automated information extraction (next topic)

                                                                                           4
            Folksonomies—The good
• Bottom-up approach to taxonomies/ontologies
   – [In systems like] Furl, Flickr and Del.icio.us... people classify
     their pictures/bookmarks/web pages with tags (e.g. wedding),
     and then the most popular tags float to the top (e.g. Flickr's
     tags or Del.icio.us on the right)....
   – [F]olksonomies can work well for certain kinds of information
     because they offer a small reward for using one of the popular
     categories (such as your photo appearing on a popular page).
     People who enjoy the social aspects of the system will
     gravitate to popular categories while still having the freedom
     to keep their own lists of tags.




                                                                     5
Works best when many people tag the same info…




                  6
             Folksonomies… the bad
• On the other hand, not hard to see a few reasons why a
  folksonomy would be less than ideal in a lot of cases:
   – None of the current implementations have synonym control
     (e.g. "selfportrait" and "me" are distinct Flickr tags, as are
     "mac" and "macintosh" on Del.icio.us).
   – Also, there's a certain lack of precision involved in using
     simple one-word tags--like which Lance are we talking about?
   – And, of course, there's no hierarchy and the content types
     (bookmarks, photos) are fairly simple.
• For indexing and library people, folksonomies are about as
  appealing as Wikipedia is to encyclopedia editors.
   – But.. there's some interesting stuff happening around them.



                                                                      7
              Mass Collaboration
           (& Mice running the Earth)
• The quality of the tags generated through folksonomies is
  notoriously hard to control
   – So, design mechanisms that ensure correctness of tags..
      • ESP game makes it fun to tag
      • CBIOC and Google Co-op restrict annotation privileges to
         trusted users..
• It is hard to get people to tag things in which they don't
  have personal interest..
   – Find incentive structures..
      • ESP makes it a “game” with points
      • CBIOC and Google Co-op try to promise delayed
        gratification in terms of improved search later..



                                                                   8
              Who will annotate the data?
•   Semantic web works if the users annotate their pages using some existing
    ontology (or their own ontology, but with mapping to other ontologies)
     –   But users typically do not conform to standards..
          • and are not patient enough for delayed gratification…
•   Three Solutions
     –   1. Intercede in the way pages are created (act as if you are helping them write
         web-pages)
           • What if we change the MS Frontpage/Claris Homepage so that they (slyly)
               add annotations?
           • E.g. The Mangrove project at U. Wash.
                 – Help user in tagging their data (allow graphical editing)
                 – Provide instant gratification by running services that use the tags.

     –   2. Collaborative tagging!
           • “Folksonomies” (look at Wikipedia article)
                  – FLICKR, Technorati, del.icio.us etc
           • CBIOC, ESP game etc.
                 – Need to incentivize users to do the annotations..

     –   3. Automated information extraction     Next Topic

                                                                                           9
          Information Extraction (IE)
• Identify specific pieces of information (data) in an
  unstructured or semi-structured textual document.
• Transform unstructured information in a corpus of
  documents or web pages into a structured database.
• Applied to different types of text:
   –   Newspaper articles
   –   Web pages
   –   Scientific articles
   –   Newsgroup messages
   –   Classified ads
   –   Medical notes
   –   Wikipedia (info boxes)..

                                                        10
    Information Extraction vs. NLP?

• Information extraction attempts to find some of
  the structure and meaning in (hopefully)
  template-driven web pages.
• As IE becomes more ambitious and the text
  becomes more free-form, IE ultimately becomes
  equivalent to full NLP.
• The Web does give NLP one particular boost
  – Massive corpora..


                                                 11
                       MUC
• DARPA funded significant efforts in IE in the
  early to mid 1990’s.
• Message Understanding Conference (MUC) was
  an annual event/competition where results were
  presented.
• Focused on extracting information from news
  articles:
   – Terrorist events
   – Industrial joint ventures
   – Company management changes
• Information extraction of particular interest to the
  intelligence community (CIA, NSA).
                                                         12
                Other Applications
• Job postings:
    – Newsgroups: Rapier from austin.jobs
    – Web pages: Flipdog
• Job resumes:
    – BurningGlass
    – Mohomine
•   Seminar announcements
•   Company information from the web
•   Continuing education course info from the web
•   University information from the web
•   Apartment rental ads
•   Molecular biology information from MEDLINE      13
         Wikipedia Infoboxes..

• Wikipedia has
  both                           Infobox
  unstructured
  text and
  structured
  info boxes..




                                           14
                        Sample Job Posting
Subject: US-TN-SOFTWARE PROGRAMMER
Date: 17 Nov 1996 17:37:29 GMT
Organization: Reference.Com Posting Service
Message-ID: <56nigp$mrs@bilbo.reference.com>

SOFTWARE PROGRAMMER

Position available for Software Programmer experienced in generating software for PC-
Based Voice Mail systems. Experienced in C Programming. Must be familiar with
communicating with and controlling voice cards; preferable Dialogic, however, experience
with others such as Rhetorix and Natural Microsystems is okay. Prefer 5 years or more
experience with PC Based Voice Mail, but will consider as little as 2 years. Need to find a
Senior level person who can come on board and pick up code with very little training.
Present Operating System is DOS. May go to OS-2 or UNIX in future.

Please reply to:
Kim Anderson
AdNET
(901) 458-2888 fax
kimander@memphisonline.com
                                                                                              15
          Extracted Job Template
computer_science_job
id: 56nigp$mrs@bilbo.reference.com
title: SOFTWARE PROGRAMMER
salary:
company:
recruiter:
state: TN
city:
country: US
language: C
platform: PC \ DOS \ OS-2 \ UNIX
application:
area: Voice Mail
req_years_experience: 2
desired_years_experience: 5
req_degree:
desired_degree:
post_date: 17 Nov 1996
                                     16
                    Amazon Book Description
….
</td></tr>
</table>
<b class="sans">The Age of Spiritual Machines : When Computers Exceed Human Intelligence</b><br>
<font face=verdana,arial,helvetica size=-1>
by <a href="/exec/obidos/search-handle-url/index=books&field-author=
         Kurzweil%2C%20Ray/002-6235079-4593641">
Ray Kurzweil</a><br>
</font>
<br>
<a href="http://images.amazon.com/images/P/0140282025.01.LZZZZZZZ.jpg">
<img src="http://images.amazon.com/images/P/0140282025.01.MZZZZZZZ.gif" width=90
   height=140 align=left border=0></a>
<font face=verdana,arial,helvetica size=-1>
<span class="small">
<span class="small">
<b>List Price:</b> <span class=listprice>$14.95</span><br>
<b>Our Price: <font color=#990000>$11.96</font></b><br>
<b>You Save:</b> <font color=#990000><b>$2.99 </b>
(20%)</font><br>
</span>
<p> <br>…
<p> <br>                                                                                     17
          Extracted Book Template

Title: The Age of Spiritual Machines :
       When Computers Exceed Human Intelligence
Author: Ray Kurzweil
List-Price: $14.95
Price: $11.96
:
:




                                                  18
       Extraction from Templated Text
• Many web pages are generated automatically from an
  underlying database.
• Therefore, the HTML structure of pages is fairly
  specific and regular (semi-structured).
• However, output is intended for human consumption,
  not machine interpretation.
• An IE system for such generated pages allows the web
  site to be viewed as a structured database.
• An extractor for a semi-structured web site is
  sometimes referred to as a wrapper.
• Process of extracting from such pages is sometimes
  referred to as screen scraping.
                                                     19
Templated Extraction using DOM Trees

• Web extraction may be aided by first
  parsing web pages into DOM trees.
• Extraction patterns can then be specified as
  paths from the root of the DOM tree to the
  node containing the text to extract.
• May still need regex patterns to identify
  proper portion of the final CharacterData
  node.

                                                 20
   Sample DOM Tree Extraction
[DOM tree (simplified) for the Amazon book page:
   HTML → HEADER, BODY
   BODY → B → "Age of Spiritual Machines"
   BODY → FONT → "by", A → "Ray Kurzweil"
   (internal boxes are Element nodes; leaves are Character-Data nodes)]

Title:  HTML → BODY → B → CharacterData
Author: HTML → BODY → FONT → A → CharacterData
                                                                      21
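As a concrete (non-slide) illustration of the DOM-path extraction just shown, here is a minimal Python sketch assuming lxml is available; the simplified HTML fragment and XPath expressions are assumptions modeled on the Amazon example above.

```python
# A minimal sketch of DOM-path extraction with lxml (not the slide authors' code).
# The XPaths mirror the paths on the slide: HTML -> BODY -> B for the title,
# HTML -> BODY -> FONT -> A for the author.
from lxml import html

page = """
<html><body>
  <b class="sans">The Age of Spiritual Machines : When Computers Exceed Human Intelligence</b>
  <font face="verdana">by <a href="/author-link">Ray Kurzweil</a></font>
</body></html>
"""

tree = html.fromstring(page)

# Paths from the root of the DOM tree to the CharacterData nodes of interest.
title = tree.xpath("/html/body/b/text()")[0].strip()
author = tree.xpath("/html/body/font/a/text()")[0].strip()

print(title)   # The Age of Spiritual Machines : ...
print(author)  # Ray Kurzweil
```

A regex could still be applied to the extracted text node if only part of it is wanted, as the slide notes.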
                    Template Types
• Slots in template typically filled by a substring
  from the document.
• Some slots may have a fixed set of pre-specified
  possible fillers that may not occur in the text itself.
   – Terrorist act: threatened, attempted, accomplished.
   – Job type: clerical, service, custodial, etc.
   – Company type: SEC code
• Some slots may allow multiple fillers.
   – Programming language
• Some domains may allow multiple extracted
  templates per document.
   – Multiple apartment listings in one ad
                                                            22
          Simple Extraction Patterns
• Specify an item to extract for a slot using a regular
  expression pattern.
   – Price pattern: "\b\$\d+(\.\d{2})?\b"
• May require preceding (pre-filler) pattern to
  identify proper context.
   – Amazon list price:
      • Pre-filler pattern: "<b>List Price:</b> <span class=listprice>"
      • Filler pattern: "\$\d+(\.\d{2})?\b"
• May require succeeding (post-filler) pattern to
  identify the end of the filler.
   – Amazon list price:
      • Pre-filler pattern: "<b>List Price:</b> <span class=listprice>"
      • Filler pattern: ".+"
      • Post-filler pattern: "</span>"                                    23
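A minimal Python sketch of the pre-filler / filler / post-filler idea above, using the standard re module; the HTML snippet comes from the Amazon example, and the combined pattern is an illustrative assumption rather than a production wrapper.

```python
# A minimal sketch of pre-filler / filler / post-filler extraction with Python's re module.
import re

page = '<b>List Price:</b> <span class=listprice>$14.95</span><br>'

# The pre-filler pattern anchors the context, the group captures the filler,
# and the post-filler pattern marks where the filler ends.
list_price_re = re.compile(
    r'<b>List Price:</b> <span class=listprice>'   # pre-filler
    r'(\$\d+(?:\.\d{2})?)'                          # filler
    r'</span>'                                      # post-filler
)

m = list_price_re.search(page)
if m:
    print(m.group(1))   # $14.95
```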
           Simple Template Extraction
• Extract slots in order, starting the search for the
  filler of slot n+1 where the filler for slot n
  ended. Assumes slots always appear in a fixed order.
   –   Title
   –   Author
   –   List price
   –   …
• Make patterns specific enough to identify each
  filler always starting from the beginning of the
  document.

                                                        24
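Below is a small sketch of the ordered-slot strategy just described: each slot's search starts where the previous filler ended. The slot names and regexes are illustrative assumptions, not a real wrapper for Amazon pages.

```python
# A minimal sketch of ordered template extraction: search for slot n+1
# starting where slot n's filler ended.
import re

SLOTS = [
    ("title",      re.compile(r'<b class="sans">(.*?)</b>', re.S)),
    ("author",     re.compile(r'>\s*([^<>]+?)\s*</a>')),
    ("list_price", re.compile(r'<span class=listprice>(\$[\d.]+)</span>')),
]

def extract(page: str) -> dict:
    record, pos = {}, 0
    for slot, pattern in SLOTS:          # slots assumed to appear in this fixed order
        m = pattern.search(page, pos)
        if not m:
            break                        # missing slot: stop (or mark it empty)
        record[slot] = m.group(1)
        pos = m.end()                    # next search starts where this filler ended
    return record

# record = extract(amazon_page_html)   # -> {"title": ..., "author": ..., "list_price": ...}
```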
      Pre-Specified Filler Extraction

• If a slot has a fixed set of pre-specified
  possible fillers, text categorization can be
  used to fill the slot.
  – Job category
  – Company type
• Treat each of the possible values of the slot
  as a category, and classify the entire
  document to determine the correct filler.


                                                  25
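One way to realize the slide above is to treat the fixed fillers as class labels and classify the whole document; the sketch below assumes scikit-learn is available, and the tiny training set is invented purely for illustration.

```python
# A minimal sketch of filling a fixed-vocabulary slot (e.g. "job category")
# by classifying the entire document; assumes scikit-learn is installed.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_docs = [
    "answer phones, filing, data entry for front office",
    "C programming for PC-based voice mail systems",
    "clean and maintain office building at night",
]
train_labels = ["clerical", "technical", "custodial"]   # the pre-specified fillers

clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(train_docs, train_labels)

print(clf.predict(["senior software programmer, experienced in C"])[0])  # likely "technical"
```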
                           Learning for IE
• Writing accurate patterns for each slot for each domain (e.g. each
  web site) requires laborious software engineering.
• Alternative is to use machine learning:
   – Build a training set of documents paired with human-produced
      filled extraction templates.
   – Learn extraction patterns for each slot using an appropriate
      machine learning algorithm.




                                                                       26
Information Extraction from
     unstructured text




                              37
Information Extraction from Unstructured Text:

• Semantic web needs:
   – Tagged data
   – Background knowledge
• (blue sky approaches to) automate both
   – Knowledge Extraction
      • Extract base level knowledge ("facts") directly from
        the web
   – Automated tagging
      • Start with a background ontology and tag other web
        pages
         – Semtag/Seeker

                                                               39
Fielded IE Systems: Citeseer, Google Scholar; Libra
       How do they do it? Why do they fail?




                                       




                                                 40
               What is “Information Extraction”

       As a task:                Filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill
Gates railed against the economic philosophy
of open-source software with Orwellian fervor,
denouncing its communal licensing as a
"cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-
source concept, by which software code is                  NAME            TITLE   ORGANIZATION
made public to encourage improvement and
development by outside programmers. Gates
himself says Microsoft will gladly disclose its
crown jewels--the coveted code behind the
Windows operating system--to select
customers.

"We can be open source. We love the concept
of shared source," said Bill Veghte, a
Microsoft VP. "That's a super-important shift
for us in terms of code access."

Richard Stallman, founder of the Free
Software Foundation, countered saying…
                                                                      Slides from Cohen & McCallum
               What is “Information Extraction”

       As a task:                Filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill
Gates railed against the economic philosophy
of open-source software with Orwellian fervor,
denouncing its communal licensing as a
"cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-
                                                           NAME               TITLE  ORGANIZATION
source concept, by which software code is
                                                  IE       Bill Gates         CEO     Microsoft
made public to encourage improvement and
                                                           Bill Veghte        VP      Microsoft
development by outside programmers. Gates
                                                           Richard Stallman   founder Free Soft..
himself says Microsoft will gladly disclose its
crown jewels--the coveted code behind the
Windows operating system--to select
customers.

"We can be open source. We love the concept
of shared source," said Bill Veghte, a
Microsoft VP. "That's a super-important shift
for us in terms of code access."

Richard Stallman, founder of the Free
Software Foundation, countered saying…
                                                                        Slides from Cohen & McCallum
               What is “Information Extraction”
    As a family                    Information Extraction =
of techniques:                      segmentation + classification + clustering + association

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill
Gates railed against the economic philosophy      Microsoft Corporation
of open-source software with Orwellian fervor,
denouncing its communal licensing as a            CEO
"cancer" that stifled technological innovation.   Bill Gates
Today, Microsoft claims to "love" the open-       Microsoft
source concept, by which software code is         Gates
made public to encourage improvement and
development by outside programmers. Gates         Microsoft
himself says Microsoft will gladly disclose its   Bill Veghte
crown jewels--the coveted code behind the
Windows operating system--to select               Microsoft
customers.                                        VP
"We can be open source. We love the concept       Richard Stallman
of shared source," said Bill Veghte, a            founder
Microsoft VP. "That's a super-important shift
for us in terms of code access."
Richard Stallman, founder of the Free
Software Foundation, countered saying…
                                                                      Slides from Cohen & McCallum
               What is “Information Extraction”
    As a family                    Information Extraction =
of techniques:                      segmentation + classification + association + clustering

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill
Gates railed against the economic philosophy      Microsoft Corporation
of open-source software with Orwellian fervor,
denouncing its communal licensing as a            CEO
"cancer" that stifled technological innovation.   Bill Gates
Today, Microsoft claims to "love" the open-       Microsoft
source concept, by which software code is         Gates
made public to encourage improvement and
development by outside programmers. Gates         Microsoft
himself says Microsoft will gladly disclose its   Bill Veghte
crown jewels--the coveted code behind the
Windows operating system--to select               Microsoft
customers.                                        VP
"We can be open source. We love the concept       Richard Stallman
of shared source," said Bill Veghte, a            founder
Microsoft VP. "That's a super-important shift
for us in terms of code access.“                  Free Software Foundation
Richard Stallman, founder of the Free
Software Foundation, countered saying…
                                                                      Slides from Cohen & McCallum
               What is “Information Extraction”
    As a family                    Information Extraction =
of techniques:                      segmentation + classification + association + clustering

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill
Gates railed against the economic philosophy
of open-source software with Orwellian fervor,
                                                  * Microsoft Corporation
denouncing its communal licensing as a              CEO
"cancer" that stifled technological innovation.     Bill Gates
Today, Microsoft claims to "love" the open-       * Microsoft
source concept, by which software code is           Gates
made public to encourage improvement and
development by outside programmers. Gates         * Microsoft
himself says Microsoft will gladly disclose its     Bill Veghte
crown jewels--the coveted code behind the
Windows operating system--to select               * Microsoft
customers.                                          VP
"We can be open source. We love the concept         Richard Stallman
of shared source," said Bill Veghte, a              founder
Microsoft VP. "That's a super-important shift
for us in terms of code access.“                    Free Software Foundation
Richard Stallman, founder of the Free
Software Foundation, countered saying…
                                                                        Slides from Cohen & McCallum
                                    IE in Context

[Pipeline:]

  Create ontology

  Spider → Filter by relevance → IE (Segment, Classify, Associate, Cluster) → Load DB → Database

  Database → Query, Search → Data mine

  Document collection → Label training data → Train extraction models → IE
                                                                        Slides from Cohen & McCallum
                                 IE History
Pre-Web
• Mostly news articles
   – De Jong's FRUMP [1982]
       • Hand-built system to fill Schank-style "scripts" from news wire
   – Message Understanding Conference (MUC) DARPA ['87-'95], TIPSTER ['92-
     '96]
• Most early work dominated by hand-built models
   – E.g. SRI's FASTUS, hand-built FSMs.
   – But by 1990s, some machine learning: Lehnert, Cardie, Grishman and then
     HMMs: Elkan [Leek '97], BBN [Bikel et al '98]
Web
• AAAI '94 Spring Symposium on "Software Agents"
   – Much discussion of ML applied to Web. Maes, Mitchell, Etzioni.
• Tom Mitchell's WebKB, '96
   – Build KBs from the Web.
• Wrapper Induction
   – First by hand, then ML: [Doorenbos '96], [Soderland '96], [Kushmerick '97],…

                                                                  Slides from Cohen & McCallum
        What makes IE from the Web Different?
                             Less grammar, but more formatting & linking

                  Newswire                                                                    Web
                                                                     www.apple.com/retail

   Apple to Open Its First Retail Store
            in New York City

MACWORLD EXPO, NEW YORK--July 17, 2002--
Apple's first retail store in New York City will open in
Manhattan's SoHo district on Thursday, July 18 at
8:00 a.m. EDT. The SoHo store will be Apple's
largest retail store to date and is a stunning example
                                                           www.apple.com/retail/soho
of Apple's commitment to offering customers the
world's best computer shopping experience.
                                                                                   www.apple.com/retail/soho/theatre.html
"Fourteen months after opening our first retail store,
our 31 stores are attracting over 100,000 visitors
each week," said Steve Jobs, Apple's CEO. "We
hope our SoHo store will surprise and delight both
Mac and PC users who want to see everything the
Mac can do to enhance their digital lifestyles."




The directory structure, link structure,
formatting & layout of the Web is its own
new grammar.
                                                                                              Slides from Cohen & McCallum
                 Landscape of IE Tasks (1/4):
                   Pattern Feature Domain
        Text paragraphs                               Grammatical sentences
       without formatting                           and some formatting & links
Astro Teller is the CEO and co-founder of
BodyMedia. Astro holds a Ph.D. in Artificial
Intelligence from Carnegie Mellon University,
where he was inducted as a national Hertz fellow.
His M.S. in symbolic and heuristic computation
and B.S. in computer science are from Stanford
University. His work in science, literature and
business has appeared in international media from
the New York Times to CNN to NPR.


  Non-grammatical snippets,
    rich formatting & links                                    Tables




                                                                    Slides from Cohen & McCallum
                Landscape of IE Tasks (2/4):
                      Pattern Scope
   Web site specific     Genre specific   Wide, non-specific
   Formatting                 Layout            Language
Amazon.com Book Pages        Resumes         University Names




                                            Slides from Cohen & McCallum
                Landscape of IE Tasks (3/4):
                    Pattern Complexity
E.g. word patterns:

           Closed set                         Regular set
        U.S. states                    U.S. phone numbers

        He was born in Alabama…        Phone: (413) 545-1323

        The big Wyoming sky…           The CALD main office can be
                                       reached at 412-268-1299

            Complex pattern               Ambiguous patterns, needing
         U.S. postal addresses            context + many sources of evidence

         University of Arkansas         Person names
        P.O. Box 140
        Hope, AR 71802                …was among the six houses
                                      sold by Hope Feldman that year.
        Headquarters:
        1128 Main Street, 4th Floor   Pawel Opalinski, Software
        Cincinnati, Ohio 45210        Engineer at WhizBang Labs.
                                                     Slides from Cohen & McCallum
              Landscape of IE Tasks (4/4):
                 Pattern Combinations
  Jack Welch will retire as CEO of General Electric tomorrow. The top role
  at the Connecticut company will be filled by Jeffrey Immelt.



     Single entity          Binary relationship           N-ary record

 Person: Jack Welch         Relation: Person-Title    Relation:   Succession
                            Person: Jack Welch        Company:    General Electric
                            Title:    CEO             Title:      CEO
Person: Jeffrey Immelt
                                                      Out:        Jack Welch
                                                      In:         Jeffrey Immelt
                         Relation: Company-Location
Location: Connecticut
                         Company: General Electric
                         Location: Connecticut



“Named entity” extraction
                                                        Slides from Cohen & McCallum
         Evaluation of Single Entity Extraction
TRUTH:
 Michael Kearns and Sebastian Seung will start Monday’s tutorial, followed by Richard M. Karpe and Martin Cooke.
PRED:
 Michael Kearns and Sebastian Seung will start Monday’s tutorial, followed by Richard M. Karpe and Martin Cooke.



   Precision = # correctly predicted segments / # predicted segments = 2/6

   Recall    = # correctly predicted segments / # true segments      = 2/4

   F1        = harmonic mean of Precision & Recall = 1 / ( ((1/P) + (1/R)) / 2 )


                                                                                  Slides from Cohen & McCallum
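The segment-level precision/recall/F1 computation above can be written directly in Python; the (start, end) offsets below are made-up values chosen only so the counts match the slide (2 of 6 predictions correct, 4 true segments).

```python
# A minimal sketch of segment-level precision / recall / F1.
# A predicted segment counts as correct only if it exactly matches a true segment.
def prf1(true_segments, predicted_segments):
    true_set, pred_set = set(true_segments), set(predicted_segments)
    correct = len(true_set & pred_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(true_set) if true_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# 2 of 6 predicted segments are correct, and there are 4 true segments:
truth = [(0, 2), (4, 6), (14, 17), (19, 21)]
pred  = [(0, 2), (1, 2), (4, 6), (5, 6), (14, 15), (16, 17)]
print(prf1(truth, pred))   # (0.333..., 0.5, 0.4)
```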
      State of the Art Performance

• Named entity recognition
  – Person, Location, Organization, …
  – F1 in high 80s or low- to mid-90s
• Binary relation extraction
  – Contained-in (Location1, Location2)
    Member-of (Person1, Organization1)
  – F1 in 60s, 70s, or 80s
• Wrapper induction
  – Extremely accurate performance obtainable
  – Human effort (~30min) required on each site

                                          Slides from Cohen & McCallum
                      Landscape of IE Techniques (1/1):
                                  Models
[Figure: six families of models, each illustrated on "Abraham Lincoln was born in Kentucky."]

   Lexicons                   – is the token a member of a list (Alabama, Alaska, …, Wisconsin, Wyoming)?
   Classify Pre-segmented
     Candidates               – classifier: which class?
   Sliding Window             – classifier: which class? try alternate window sizes
   Boundary Models            – classifier: which class? BEGIN/END boundary decisions
   Finite State Machines      – most likely state sequence?
   Context Free Grammars      – parse tree (NNP NNP V V P NP … S) …and beyond

 Any of these models can be used to capture words, formatting or both.
                                                              Slides from Cohen & McCallum
Sliding Windows




                  Slides from Cohen & McCallum
Extraction by Sliding Window

      GRAND CHALLENGES FOR MACHINE LEARNING

             Jaime Carbonell
         School of Computer Science
        Carnegie Mellon University

                 3:30 pm
              7500 Wean Hall

  Machine learning has evolved from obscurity
  in the 1970s into a vibrant and popular
  discipline in artificial intelligence
  during the 1980s and 1990s.   As a result
  of its success and growth, machine learning
  is evolving into a collection of related
  disciplines: inductive concept acquisition,
  analytic learning in problem solving (e.g.
  analogy, explanation-based learning),
  learning theory (e.g. PAC learning),
  genetic algorithms, connectionist learning,
  hybrid systems, and so on.

      CMU UseNet Seminar Announcement
                                          Slides from Cohen & McCallum
    A “Naïve Bayes” Sliding Window Model
                                                        [Freitag 1997]


…   00 : pm  Place : Wean Hall Rm 5409  Speaker : Sebastian Thrun …
    w_{t-m} … w_{t-1}    w_t … w_{t+n}    w_{t+n+1} … w_{t+n+m}
          prefix             contents             suffix

Estimate Pr(LOCATION|window) using Bayes rule

Try all “reasonable” windows (vary length, position)

Assume independence for length, prefix words, suffix words, content words

Estimate from data quantities like: Pr(“Place” in prefix|LOCATION)


If P("Wean Hall Rm 5409" = LOCATION) is above some threshold, extract it.


                        Other examples of sliding window: [Baluja et al 2000]
                        (decision tree over individual words & their context)
                                                        Slides from Cohen & McCallum
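A minimal sketch of the scoring step described above, assuming the probability tables for length, prefix, content, and suffix words have already been estimated from labelled windows; this only illustrates the independence assumptions, it is not Freitag's actual implementation.

```python
# A minimal sketch of the Naive Bayes sliding-window score.
import math

def window_log_score(tokens, i, j, p_len, p_prefix, p_content, p_suffix, k=3):
    """Log P(LOCATION | window tokens[i:j]) up to a constant, under independence."""
    score = math.log(p_len.get(j - i, 1e-6))
    for w in tokens[max(0, i - k):i]:          # prefix words
        score += math.log(p_prefix.get(w, 1e-6))
    for w in tokens[i:j]:                      # content words
        score += math.log(p_content.get(w, 1e-6))
    for w in tokens[j:j + k]:                  # suffix words
        score += math.log(p_suffix.get(w, 1e-6))
    return score

# Toy example (made-up probabilities): score the window "Wean Hall Rm 5409".
tokens = "Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun".split()
p_len = {4: 0.5}
p_prefix = {"Place": 0.3, ":": 0.2}
p_content = {"Wean": 0.2, "Hall": 0.2, "Rm": 0.1, "5409": 0.05}
p_suffix = {"Speaker": 0.3, ":": 0.2}
print(window_log_score(tokens, 2, 6, p_len, p_prefix, p_content, p_suffix))

# Extract a window as a LOCATION if its score exceeds a tuned threshold.
```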
    “Naïve Bayes” Sliding Window Results
Domain: CMU UseNet Seminar Announcements
    GRAND CHALLENGES FOR MACHINE LEARNING

           Jaime Carbonell
       School of Computer Science
      Carnegie Mellon University
                                               Field            F1
               3:30 pm
            7500 Wean Hall                     Person Name:     30%
                                               Location:        61%
Machine learning has evolved from obscurity
in the 1970s into a vibrant and popular
                                               Start Time:      98%
discipline in artificial intelligence during
the 1980s and 1990s.   As a result of its
success and growth, machine learning is
evolving into a collection of related
disciplines: inductive concept acquisition,
analytic learning in problem solving (e.g.
analogy, explanation-based learning),
learning theory (e.g. PAC learning), genetic
algorithms, connectionist learning, hybrid
systems, and so on.

                                                  Slides from Cohen & McCallum
   Realistic sliding-window-classifier IE

• What windows to consider?
   – all windows containing as many tokens as the
     shortest example, but no more tokens than the
     longest example
• How to represent a classifier? It might:
   – Restrict the length of window;
   – Restrict the vocabulary or formatting used
     before/after/inside window;
   – Restrict the relative order of tokens, etc.
• Learning Method
   – SRV: Top-Down Rule Learning    [Freitag AAAI '98]
   – Rapier: Bottom-Up    [Califf & Mooney, AAAI '99]

                                            Slides from Cohen & McCallum
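The window-enumeration rule on the slide above ("as many tokens as the shortest example, no more than the longest") is easy to make concrete; a minimal sketch:

```python
# A minimal sketch: generate every candidate window whose length (in tokens)
# lies between the shortest and longest training example.
def candidate_windows(tokens, min_len, max_len):
    for i in range(len(tokens)):
        for length in range(min_len, max_len + 1):
            j = i + length
            if j <= len(tokens):
                yield i, j, tokens[i:j]

# e.g. with min_len=2, max_len=3 every 2- and 3-token window is generated
# and handed to the window classifier:
print(list(candidate_windows("Wean Hall Rm 5409 at 3:30".split(), 2, 3))[:3])
```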
Rapier: results – precision/recall




                           Slides from Cohen & McCallum
 Rule-learning approaches to sliding-
  window classification: Summary
• SRV, Rapier, and WHISK [Soderland KDD '97]
  – Representations for classifiers allow restriction of
    the relationships between tokens, etc
  – Representations are carefully chosen subsets of
    even more powerful representations based on
    logic programming (ILP and Prolog)
  – Use of these “heavyweight” representations is
    complicated, but seems to pay off in results
• Can simpler representations for classifiers
  work?


                                          Slides from Cohen & McCallum
  BWI: Learning to detect boundaries
                              [Freitag & Kushmerick, AAAI 2000]

• Another formulation: learn three probabilistic
  classifiers:
   – START(i) = Prob( position i starts a field)
   – END(j) = Prob( position j ends a field)
   – LEN(k) = Prob( an extracted field has length k)
• Then score a possible extraction (i,j) by
   START(i) * END(j) * LEN(j-i)

• LEN(k) is estimated from a histogram



                                            Slides from Cohen & McCallum
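A small sketch of the BWI scoring rule START(i) * END(j) * LEN(j-i); the start/end detectors and the length histogram are assumed to come from the learned boosted detectors and the training data, and the brute-force search and toy numbers below are purely illustrative.

```python
# A minimal sketch of scoring candidate extractions as START(i) * END(j) * LEN(j - i).
def best_extraction(start_prob, end_prob, len_hist, n_positions, max_len):
    best, best_score = None, 0.0
    for i in range(n_positions):
        for j in range(i + 1, min(i + max_len, n_positions) + 1):
            score = start_prob(i) * end_prob(j) * len_hist.get(j - i, 0.0)
            if score > best_score:
                best, best_score = (i, j), score
    return best, best_score

# Toy demo with made-up boundary probabilities over 6 positions:
sp = lambda i: [0.1, 0.7, 0.1, 0.05, 0.03, 0.02][i]
ep = lambda j: [0.0, 0.05, 0.1, 0.6, 0.2, 0.05, 0.0][j]
print(best_extraction(sp, ep, {2: 0.6, 3: 0.3}, 6, 4))   # -> ((1, 3), 0.252)
```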
    BWI: Learning to detect boundaries
• BWI uses boosting to find “detectors” for
  START and END
• Each weak detector has a BEFORE and
  AFTER pattern (on tokens before/after
  position i).
• Each “pattern” is a sequence of
  – tokens and/or
  – wildcards like: anyAlphabeticToken, anyNumber, …
• Weak learner for “patterns” uses greedy
  search (+ lookahead) to repeatedly extend a
  pair of empty BEFORE,AFTER patterns
                                        Slides from Cohen & McCallum
BWI: Learning to detect boundaries




                         Field             F1
                         Person Name:      30%
                         Location:         61%
                         Start Time:       98%

                           Slides from Cohen & McCallum
    Problems with Sliding Windows
        and Boundary Finders
• Decisions in neighboring parts of the input
  are made independently from each other.

  – Naïve Bayes Sliding Window may predict a
    “seminar end time” before the “seminar start time”.
  – It is possible for two overlapping windows to both
    be above threshold.

  – In a Boundary-Finding system, left boundaries are
    laid down independently from right boundaries,
    and their pairing happens as a separate step.

  Solution? Joint inference…
                                         Slides from Cohen & McCallum
    More Ambitious (Blue Sky) Approaches
• The information extraction         • Semantic web needs:
  tasks in fielded                        – Tagged data
  applications like                       – Background knowledge
  Citeseer/Libra are                 • (blue sky approaches to)
  narrowly focused                     automate both
   – We assume that we are                – Knowledge Extraction
     learning specific relations               • Extract base level
      (e.g. author/title etc)                     knowledge ("facts")
   – We assume that the                          directly from the web
     extracted relations will be          – Automated tagging
     put in a database for db-                 • Start with a background
     style look-up                               ontology and tag other
                                                 web pages
                                                    – Semtag/Seeker


                                   Let’s look at state of the feasible art
                                   before going to blue-sky..                82
       Extraction from Free Text involves
         Natural Language Processing

                            (Analogy to regex patterns on DOM trees for structured text)
• If extracting from automatically generated web
  pages, simple regex patterns usually work.
• If extracting from more natural, unstructured,
  human-written text, some NLP may help.
   – Part-of-speech (POS) tagging
      • Mark each word as a noun, verb, preposition, etc.
   – Syntactic parsing
      • Identify phrases: NP, VP, PP
   – Semantic word categories (e.g. from WordNet)
      • KILL: kill, murder, assassinate, strangle, suffocate
• Off-the-shelf software available to do this!
   – The "Brill" tagger
• Extraction patterns can use POS or phrase tags.                   83
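For a concrete feel of off-the-shelf POS tagging, here is a minimal sketch using NLTK's default tagger (a stand-in for the Brill tagger mentioned above); the example sentence is borrowed from the Apple press-release slide, and the required corpora must be downloaded once.

```python
# A minimal sketch of off-the-shelf POS tagging with NLTK.
# Requires the tokenizer and tagger resources, e.g.:
#   nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
import nltk

sentence = "Apple's first retail store in New York City will open on Thursday."
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# approx: [('Apple', 'NNP'), ("'s", 'POS'), ('first', 'JJ'), ('retail', 'JJ'), ...]
```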
      I. Generate-n-Test Architecture
Generic extraction patterns (Hearst '92):
• "…Cities such as Boston, Los Angeles, and Seattle…"


      ("C such as NP1, NP2, and NP3") =>
          IS-A(each(head(NPi)), C), …          ProperNoun(head(NP))
                                                          Template-Driven
                                                            Extraction
                                                        (where the template is
                                                         in terms of the
                                                          syntax tree)

   • "Detailed information for several countries
     such as maps, …"

   • "I listen to pretty much all music but prefer
     country such as Garth Brooks"
                                                             84
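A regex-only sketch of the generic "C such as NP1, NP2, and NP3" pattern; as the slide notes, the real template is over the syntax tree, so the capitalized-word heuristic below for NP heads is purely an assumption for illustration.

```python
# A minimal regex-only sketch of the Hearst "such as" pattern.
# A real implementation would use NP chunking / a parse tree.
import re

PATTERN = re.compile(
    r'(\w+)\s+such as\s+((?:[A-Z][\w.]*(?:\s+[A-Z][\w.]*)*)'
    r'(?:\s*,\s*(?:and\s+)?[A-Z][\w.]*(?:\s+[A-Z][\w.]*)*)*)'
)

def hearst_pairs(text):
    pairs = []
    for cls, nps in PATTERN.findall(text):
        for np in re.split(r'\s*,\s*(?:and\s+)?|\s+and\s+', nps):
            if np:
                pairs.append((np.strip(), cls))   # IS-A(np, class)
    return pairs

print(hearst_pairs("...cities such as Boston, Los Angeles, and Seattle..."))
# [('Boston', 'cities'), ('Los Angeles', 'cities'), ('Seattle', 'cities')]
```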
                   Test


Assess candidate extractions using Mutual
Information (PMI-IR) (Turney ’01).




      Many variations are possible…
                                            85
    ..but many things indicate "city"-ness
      Discriminator phrases fi:
      "x is a city"
      "x has a population of"
      "x is the capital of y"
      "x's baseball team…"


     • PMI = frequency of I & D co-occurrence
     • 5-50 discriminators Di
     • Each PMI for Di is a feature fi                        Keep the probabilities
     • Naïve Bayes evidence combination                        with the extracted facts


PMI is used for feature selection. NBC is used for learning. Hit counts are used for
assessing PMI as well as conditional probabilities.
                                                                              86
         Assessment In Action

 1.   I = "Yakima" (1,340,000 hits)
 2.   D = <class name>
 3.   I+D = "Yakima city" (2,760 hits)
 4.   PMI = 2760 / 1,340,000 ≈ 0.002
•I = "Avocado" (1,000,000 hits)
•I+D = "Avocado city" (10 hits)
      PMI = 0.00001 << 0.002
                                   87
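The assessment arithmetic above is just a hit-count ratio; below is a minimal sketch using the slide's numbers. Note it uses the slide's simplified ratio, not the full PMI formula, and the threshold value is an assumption.

```python
# A minimal sketch of the PMI-IR style check: how often does the instance
# co-occur with the discriminator phrase, relative to the instance alone?
def pmi(instance_hits, instance_plus_discriminator_hits):
    return instance_plus_discriminator_hits / instance_hits

def looks_like_city(instance_hits, joint_hits, threshold=0.001):
    # threshold is illustrative; in practice it would be tuned or fed to a classifier
    return pmi(instance_hits, joint_hits) >= threshold

print(pmi(1_340_000, 2_760))   # Yakima  -> ~0.002
print(pmi(1_000_000, 10))      # Avocado -> 0.00001
```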
         Some Sources of ambiguity

•   Time: "Clinton is the president" (in 1996).
•   Context: "common misconceptions.."
•   Opinion: Elvis…
•   Multiple word senses: Amazon, Chicago,
    Chevy Chase, etc.
    – Dominant senses can mask recessive ones!
    – Approach: unmasking. Query "Chicago -City"

                                                 88
       Chicago


City             Movie




                         89
 Chicago Unmasked

City sense    Movie sense




                            90
      Impact of Unmasking on PMI


Name          Recessive Original Unmask Boost
Washington     city        0.50   0.99  96%
Casablanca     city        0.41   0.93 127%
Chevy Chase    actor       0.09   0.58 512%
Chicago        movie       0.02   0.21 972%




                                                91
CBioC: Collaborative Bio-
Curation
   Motivation
       To extract information nuggets from articles and
        abstracts and store them in a database.
       The challenge is that the number of articles is
        huge and keeps growing, and natural language
        needs to be processed.
       The two existing approaches
           human curation and use of automatic information
            extraction systems
           They are not able to meet the challenge, as the first is
            expensive, while the second is error-prone.

                                                                       92
CBioC (cont'd)
   Approach: We propose a solution that is
    inexpensive, and that scales up.
       Our approach takes advantage of automatic information
        extraction methods as a starting point,
           Based on the premise that if there are a lot of articles, then
            there must be a lot of readers and authors of these articles.
       We provide a mechanism by which the readers of the
        articles can participate and collaborate in the curation of
        information.
       We refer to our approach as "Collaborative Curation".




                                                                             93
Using the C-BioCurator System
(cont'd)




                                94
What is the main difference between KnowItAll and CBioC?

    Assessment: KnowItAll does it via search-engine hit counts; CBioC does it by voting
                Annotation
“The Chicago Bulls announced yesterday that
  Michael Jordan will. . . ”

The <resource ref="http://tap.stanford.edu/
BasketballTeam_Bulls">Chicago Bulls</resource>
announced yesterday that <resource ref=
"http://tap.stanford.edu/AthleteJordan,_Michael">
Michael Jordan</resource> will..."

                                                96
            Semantic Annotation

       Named Entity Identification




  The simplest task of metadata extraction from
  NL is to establish a "type" relation between
  entities in the NL resources and concepts in
  ontologies.
                                                                                                    97
Picture from http://lsdis.cs.uga.edu/courses/SemWebFall2005/courseMaterials/CSCI8350-Metadata.ppt
                              Semantics
• Semantic Annotation
 - The content of annotation consists of some rich
   semantic information
 - Targeted not only at human reader of resources
   but also software agents
 - formal : metadata following structural standards
   informal : personal notes written in the margin while
              reading an article
 - explicit : carry sufficient information for interpretation
   tacit : many personal annotations (telegraphic and
           incomplete)

  http://www-scf.usc.edu/~csci586/slides/6
                                                                98
                 Uses of Annotation




                                           99
http://www-scf.usc.edu/~csci586/slides/8
              Objectives of Annotation
   • Generate Metadata for existing information
        – e.g., author-tag in HTML
        – RDF descriptions to HTML
        – Content description to Multimedia files
   • Employ metadata for
        – Improved search
        – Navigation
        – Presentation
        – Summarization of contents
http://www.aifb.uni-
karlsruhe.de/WBS/sst/Teaching/Intelligente%20System%20im%20WWW%20SS%202000/10-Annotation.pdf
                                                                                      100
                                   Annotation
Current practice of annotation for knowledge identification and extraction




      is time consuming            needs annotation by experts            is complex

            Reduce the burden of text annotation
            for Knowledge Management                                                            101
 www.racai.ro/EUROLAN-2003/html/presentations/SheffieldWilksBrewsterDingli/Eurolan2003AlexieiDingli.ppt
                SemTag & Seeker
   WWW-03 Best Paper Prize
   Seeded with TAP ontology (72k concepts)
       And ~700 human judgments
   Crawled 264 million web pages
   Extracted 434 million semantic tags
       Automatically disambiguated
               SemTag
• Uses broad, shallow knowledge base
• TAP – lexical and taxonomic information
  about popular objects
  – Music
  – Movies
  – Sports
  – Etc.


                                            104
                 SemTag
• Problem:
  – No write access to original document, so how
    do you annotate?


• Solution:
  – Store annotations in a web-available
    database



                                               105
                    SemTag
• Semantic Label Bureau
  – Separate store of semantic annotation
    information
  – HTTP server that can be queried for
    annotation information
  – Example
    • Find all semantic tags for a given document
    • Find all semantic tags for a particular object


                                                       106
                SemTag
• Methodology




                         107
                             SemTag
•    Three phases
    1.       Spotting Pass:
         –     Tokenize the document
         –     All instances plus a 20-word window
    2.       Learning Pass:
         –     Find corpus-wide distribution of terms at each internal node
               of taxonomy
         –     Based on a representative sample
    3.       Tagging Pass:
         –     Scan windows to disambiguate each reference
         –     Finally determined to be a TAP object

                                                                        108
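A minimal sketch of the spotting pass described above: find label occurrences and keep a fixed word window around each (here 10 words on each side, i.e. a 20-word window; the exact-match labelling and the example sentence are assumptions for illustration).

```python
# A minimal sketch of a spotting pass: locate taxonomy labels in a tokenized
# document and keep a context window around each spot.
def spot(tokens, labels, window=10):
    spots = []
    for i, tok in enumerate(tokens):
        if tok.lower() in labels:
            context = tokens[max(0, i - window):i + window + 1]
            spots.append((tok, i, context))
    return spots

doc = "the jaguar purred in the zoo while a Jaguar was parked outside".split()
print(spot(doc, {"jaguar"}))   # two spots, each with its surrounding context
```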
                  SemTag
•   Solution:
    –   Taxonomy Based Disambiguation (TBD)
•   TBD expectation:
    – Human tuned parameters used in small,
      critical sections
    – Automated approaches deal with bulk of
      information



                                               110
                    SemTag
• TBD methodology:
  – Each node in the taxonomy is associated with
    a set of labels
    • Cats, Football, Cars all contain “jaguar”
  – Each label in the text is stored with a window
    of 20 words – the context
  – Each node has an associated similarity
    function mapping a context to a similarity
    • Higher similarity  more likely to contain a
      reference

                                                     111
                  SemTag
• Similarity:
  – Built a 200,000-word lexicon (the 200,100 most
    common words minus the 100 most common)
  – 200,000 dimensional vector space
  – Training: spots (label, context) and correct
    node
  – Estimated the distribution of terms for nodes
  – Standard cosine similarity
  – TFIDF vectors (context vs. node)
                                                    112
                SemTag
• Some internal nodes very popular:
  – Associate a measurement of how accurate
    Sim is likely to be at a node
  – Also, how ambiguous the node is overall
    (consistency of human judgment)
• TBD Algorithm: returns 1 or 0 to indicate
  whether a particular context c is on topic
  for a node v
• 82% accuracy on 434 million spots
                                               114
                     Summary
• Information extraction can be motivated either as
  explicating more structure from the data or as an
  automated route to the Semantic Web
• Extraction complexity depends on whether the
  text you have is “templated” or “free-form”
  – Extraction from templated text can be done by regular
    expressions
  – Extraction from free form text requires NLP
     • Can be done in terms of parts-of-speech-tagging
• “Annotation” involves connecting terms in a free
  form text to items in the background knowledge
  – It too can be automated

                                                         116

								