					      Information Extraction

          2 sessions in the section “Web Search” of the course “Web Mining”
          at the École nationale supérieure des Télécommunications
          in Paris, France, in fall 2010

                 by Fabian M. Suchanek

                   This document is available under a
                   Creative Commons Attribution Non-Commercial License
                     Organisation
•  4h class on Information Extraction:
   2 sessions of 2h each

•  Small homework given at the end of each session,
   to be handed in for the next session
   (on paper or by email)

•  Web site: http://suchanek.name/ → Teaching
         Motivation




               Elvis Presley
               1935 - 1977

Will there ever be someone like him again?

                 Motivation
           Another Elvis



Elvis Presley: The Early Years
Elvis spent more weeks at the top of the charts than
any other artist.
www.fiftiesweb.com/elvis.htm




                       Motivation
                Another singer called Elvis, young



Personal relationships of Elvis Presley – Wikipedia
...when Elvis was a young teen.... another girl whom the singer's
mother hoped Presley would .... The writer called Elvis "a hillbilly cat”
en.wikipedia.org/.../Personal_relationships_of_Elvis_Presley




                Motivation

                                    SELECT * FROM person
                                    WHERE gName='Elvis'
Another Elvis                       AND occupation='singer'

                Information   GName   FName     Occupation
                Extraction    Elvis   Presley   singer
                              Elvis   Hunter    painter
                              ...     ...       ...

     ✗                              1: Elvis Presley
                                    2: Elvis ...
                                    3: Elvis ...
            Motivation: Definition
 Information Extraction (IE) is the process
 of extracting structured information (e.g., database tables)
 from unstructured machine-readable documents
 (e.g., Web documents).

  Elvis Presley was a
  famous rock singer.
  ...                        Information      GName   FName     Occupation
  Mary once remarked         Extraction       Elvis   Presley   singer
  that the only              ──────────►      Elvis   Hunter    painter
  attractive thing                            ...     ...       ...
  about the painter
  Elvis Hunter was his
  first name.
Motivation: Examples

    Title                        Type        Location
    Business strategy Associate  Part time   Palo Alto, CA
    Registered Nurse             Full time   Los Angeles
    ...                          ...         ...
Motivation: Examples

         Name           Birthplace   Birthdate
         Elvis Presley  Tupelo, MS   1935-01-08
         ...            ...          ...
                Motivation: Examples

Author    Publication                 Year
Grishman  Information Extraction...   2006
...       ...                         ...
           Motivation: Examples

Product    Type    Price
Dynex 32”  LCD TV  $1000
...        ...     ...
                Information Extraction and beyond

Information Extraction (IE) is the process
of extracting structured information (e.g., database tables)
from unstructured machine-readable documents
(e.g., Web documents).

The IE pipeline:

  Source Selection
  → Tokenization & Normalization        (05/01/67 → 1967-05-01)
  → Named Entity Recognition            (...married Elvis on 1967-05-01)
  → Instance Extraction                 (Elvis Presley: singer; Angela Merkel: politician)
  → Fact Extraction
  → Ontological Information Extraction
                    Sources: The Web

(1 trillion Web sites)

Languages on the Web (pie chart):
  English 71%, Japanese 6%, German 6%, Chinese 4%, French 3%,
  Spanish 3%, Russian 2%, Italian 2%, Portuguese 1%, Dutch 1%, Korean 1%

Source for the languages: http://www.clickz.com/clickz/stats/1697080/web-pages-language
(Need not be correct)
 Sources: Language detection
    Elvis Presley ist einer der größten
    Rockstars aller Zeiten.
    (German: “Elvis Presley is one of the greatest rock stars of all time.”)
                                              a b c ä ö ü ß ...
How can we find out the language of a document?

•  Watch for certain characters or scripts (umlauts, Chinese characters etc.)
   But: These are not always specific

•  Use the meta-information associated with a Web page
   But: This is usually not very reliable
•  Use a dictionary
   But: This is costly
•  Use frequent character signatures
   (Count how often each character appears in the document.
   Compare this histogram to the histogram computed on a
   large text document corpus of the language in question)
•  Extension: Make a histogram of character n-grams
   (n-gram: a sequence of n characters)
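The histogram idea above can be sketched in a few lines of Python. The tiny reference “corpora” below are stand-ins for the large per-language corpora that a real detector would use:

```python
from collections import Counter

def ngram_profile(text, n=2):
    """Relative frequencies of character n-grams in a text."""
    text = text.lower()
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}

def similarity(p, q):
    """Overlap of two n-gram profiles (dot product of the histograms)."""
    return sum(p[g] * q[g] for g in set(p) & set(q))

# Toy reference profiles -- real systems use large corpora per language
profiles = {
    "de": ngram_profile("der die das und ist ein eine nicht mit sich"),
    "en": ngram_profile("the a an and is was were not with of to in"),
}

sentence = "Elvis Presley ist einer der größten Rockstars aller Zeiten."
doc = ngram_profile(sentence)
best = max(profiles, key=lambda lang: similarity(doc, profiles[lang]))
print(best)  # the German profile overlaps more here
```

With larger reference corpora, the same comparison becomes robust even for short documents.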
                                       Sources: Scripts

Elvis Presley was a rock star.                        (Latin script)

[characters lost in extraction]                       (Chinese script, “simplified”)

‫אלביס היה כוכב רוק‬                                   (Hebrew script)

[characters lost in extraction]                       (Arabic script)

[characters lost in extraction]                       (Korean script)

Elvis Presley ถูกดาวร็อก                              (Thai script)

  Source: http://translate.bing.com
  (Probably not correct)
  Sources: Character Encodings
 100,000 different               ?            One byte with 8 bits
 characters                                   per character
 from 90 scripts                              (can store numbers 0-255)

How can we encode so many characters in 8 bits?
•  Ignore all non-English characters
    There are 26 uppercase letters + 26 lowercase letters + punctuation ≈ 100 chars
   ... 65=A, 66=B, 67=C, ...                                 ASCII standard ✓



•  Depending on the script (the so-called code page),           (Example)
   the numbers mean different characters
      Latin code page: ...., 65=A, 66=B, ...
     Greek code page: ...., 65=α, ...                    Code page model ✓

•  Invent special names for special characters
     &egrave; = è                                HTML entity encoding ✓


   Sources: Character Encodings
  100,000 different                ?            One byte with 8 bits
  characters                                    per character
  from 90 scripts                               (can store numbers 0-255)

How can we encode so many characters in 8 bits?

•  Use 4 bytes to represent a character                        (Example)
     ...65=A, 66=B, ..., 1001=α, ..., 2001=               Unicode standard       ✓

•  then compress them into 1-4 bytes
     ...65=A, 66=B, ..., 00+01=α, ..., 01+01=            UTF-8 standard      ✓

•  or refer to the characters by their number
    &#1001; = α                           HTML entity encoding (too)       ✓


                     Sources: UTF-8
•  Characters 0-0x7F, 7 bits: Latin alphabet, punctuation and numbers
    0xxxxxxx                        (i.e., equal to ASCII and most code pages)

•  Characters 0x80-0x7FF, 11 bits: Greek, Arabic, Hebrew, etc.
    110xxxxx 10xxxxxx              (i.e., marker byte + follower byte)

•  Characters 0x800-0xFFFF, 16 bits: Chinese, Chinese and Chinese (et al)
    1110xxxx 10xxxxxx 10xxxxxx (i.e., marker byte + 2 follower bytes)


 Advantages:
 •  common Western characters require only 1 byte
 •  backwards compatibility with ASCII
 •  stream readability (follower bytes cannot be confused with marker bytes)
 •  sorting compliance



          We will assume that the document is a sequence of characters
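The bit layout above can be reproduced by hand. A sketch covering the 1- to 3-byte cases from the slide (4-byte sequences for code points above 0xFFFF are omitted, as on the slide):

```python
def utf8_encode(codepoint):
    """Encode one Unicode code point into UTF-8 bytes (up to 0xFFFF)."""
    if codepoint <= 0x7F:            # 0xxxxxxx -- identical to ASCII
        return bytes([codepoint])
    if codepoint <= 0x7FF:           # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (codepoint >> 6),
                      0x80 | (codepoint & 0x3F)])
    if codepoint <= 0xFFFF:          # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (codepoint >> 12),
                      0x80 | ((codepoint >> 6) & 0x3F),
                      0x80 | (codepoint & 0x3F)])
    raise ValueError("code points above 0xFFFF need a 4-byte sequence")

assert utf8_encode(ord("A")) == "A".encode("utf-8")    # 1 byte
assert utf8_encode(ord("α")) == "α".encode("utf-8")    # 2 bytes
assert utf8_encode(ord("中")) == "中".encode("utf-8")  # 3 bytes
```

Note how the marker bytes (110…, 1110…) announce the sequence length, while follower bytes always start with 10… — this is what makes a UTF-8 stream readable from any position.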
                       Sources: Structured

Name         Number
D. Johnson   30714
J. Smith     20934
S. Shenker   20259
Y. Wang      19471
J. Lee       18969
A. Gupta     18884
R. Rivest    18038
H. Zhang     17902
L. Zhang     17800
J. Ullman    16804

         │ Information
         │ Extraction
         ▼
Name         Citations
D. Johnson   30714
J. Smith     20934
...          ...

 TSV file
 (tab separated values)

 Related: CSV (comma separated values)
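Reading such a file is straightforward with Python’s standard csv module; here the file content is inlined as a string so the example is self-contained:

```python
import csv
import io

# An excerpt of the TSV data above, inlined for the example
tsv = "D. Johnson\t30714\nJ. Smith\t20934\nS. Shenker\t20259\n"

reader = csv.reader(io.StringIO(tsv), delimiter="\t")
table = [(name, int(number)) for name, number in reader]
print(table)
# [('D. Johnson', 30714), ('J. Smith', 20934), ('S. Shenker', 20259)]
```

This is why structured sources are the easy case for IE: the table structure is already explicit in the file format.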
                    Sources: Semi-Structured

<catalog>
    <cd>
        <title>Empire Burlesque</title>
        <artist>
            <firstName>Bob</firstName>
            <lastName>Dylan</lastName>
        </artist>
    </cd>
    ...
</catalog>

         │ Information
         │ Extraction
         ▼
Title             Artist
Empire Burlesque  Bob Dylan
...               ...

     XML file
     (Extensible Markup Language)

     Related: YAML (YAML Ain’t Markup Language)
              Sources: Semi-Structured

<table>
    <tr>
        <td>2008-11-24
        <td>Miles away
        <td>7
    <tr>
    ...
</table>

         │ Information
         │ Extraction
         ▼
Title       Date
Miles away  2008-11-24
...         ...

HTML file with table                     Wiki file with table
(Hypertext Markup Language)              (a markup language used in Wikipedia)
         Sources: “Unstructured”

Founded in 1215 as a colony of Genoa, Monaco
has been ruled by the House of Grimaldi since
1297, except when under French control from
1789 to 1814. Designated as a protectorate of
Sardinia from 1815 until 1860 by the Treaty of
Vienna, Monaco's sovereignty was recognized
by the Franco-Monegasque Treaty of 1861. The
Prince of Monaco was an absolute ruler until
a constitution was promulgated in 1911.

         │ Information
         │ Extraction
         ▼
Event       Date
Foundation  1215
...         ...

HTML file or text file or word processing document
                                Sources: Mixed

<table>
    <tr>
        <td>Professor.
            Computational
            Neuroscience, ...
    ...
...

         │ Information
         │ Extraction
         ▼
Name   Title
Barte  Professor
...    ...

HTML file or text file or word processing document

              Different IE approaches work with different types of sources
                Sources: Domain

Restricted to          Restricted to         Restricted to
one Internet domain    one thematic          one language
(e.g., Amazon.com)     domain                (e.g., English)
                       (e.g., biographies)

                                  (Slide taken from William Cohen)
Sources: Finding the Sources
                                 Information
                                 Extraction
                       ?                            ... ...   ...

How can we find the documents to extract information from?
  •  The document collection can be given a priori
     (Closed Information Extraction)
      e.g., a given Internet domain, all files on my computer, ...
  •  We can aim to extract information from the entire Web
     (Open Information Extraction)

 •  The system can find the source documents by itself,
     e.g., by using an Internet search engine such as Google
                Information Extraction and beyond

Information Extraction (IE) is the process
of extracting structured information (e.g., database tables)
from unstructured machine-readable documents
(e.g., Web documents).

The IE pipeline:

  ✓ Source Selection
  → Tokenization & Normalization        (05/01/67 → 1967-05-01)
  → Named Entity Recognition            (...married Elvis on 1967-05-01)
  → Instance Extraction                 (Elvis Presley: singer; Angela Merkel: politician)
  → Fact Extraction
  → Ontological Information Extraction
                         Tokenization
Tokenization is the process of splitting a text into tokens (i.e., words,
punctuation symbols, identifiers and literals).

  On 2010-01-01 , President Sarkozy spoke this example sentence .


Challenges:
•  In some languages (Chinese, Japanese),
   words are not separated by white spaces

•  We have to deal consistently with URLs, acronyms, etc.
         http://example.com, 2010-09-24, U.S.A.

•  We have to deal consistently with compound words
          hostname, host-name, host name


         ⇒  Solution depends on the language and the domain.

         Naive solution: split by white spaces and punctuation
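The naive solution, slightly refined, can be written as a single regular expression. The pattern below keeps ISO dates and dotted abbreviations (like U.S.A.) together and splits off other punctuation; it is one possible compromise, not *the* tokenizer:

```python
import re

def tokenize(text):
    """Naive tokenizer: ISO dates and dotted abbreviations stay whole;
    everything else splits into words and single punctuation symbols."""
    pattern = r"\d{4}-\d{2}-\d{2}|(?:\w\.){2,}|\w+|[^\w\s]"
    return re.findall(pattern, text)

print(tokenize("On 2010-01-01, President Sarkozy spoke this example sentence."))
# ['On', '2010-01-01', ',', 'President', 'Sarkozy', 'spoke',
#  'this', 'example', 'sentence', '.']
```

Note the alternation order: the more specific patterns (date, abbreviation) must come before the generic word pattern, because the regex engine tries the branches left to right.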
             Normalization: Strings
Problem: We might extract strings that differ only slightly
         and mean the same thing.

             Elvis Presley     singer
             ELVIS PRESLEY     singer

Solution: Normalize strings, i.e., convert strings that mean the same to
one common form

•  Lowercasing, i.e., converting all characters to lower case
        May be too strong: “President Bush” == “president bush”
•  Removing accents and umlauts
       résumé → resume, Universität → Universitaet
•  Normalizing abbreviations
       U.S.A. → USA, US → USA
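One possible normalization scheme, sketched with Python’s standard unicodedata module. Note a simplification relative to the slide: Unicode decomposition maps ü to u, not to ue:

```python
import unicodedata

def normalize(s):
    """One possible normalization: lower-case, strip accents,
    drop abbreviation dots. (Maps ü to u, not ue.)"""
    s = s.lower()
    s = unicodedata.normalize("NFKD", s)                        # é -> e + combining accent
    s = "".join(c for c in s if not unicodedata.combining(c))   # drop the combining marks
    return s.replace(".", "")                                   # U.S.A. -> usa

assert normalize("ELVIS PRESLEY") == normalize("Elvis Presley")
assert normalize("résumé") == "resume"
assert normalize("U.S.A.") == "usa"
```

The right scheme depends on the language and the application: for German, a dictionary-based ü → ue rule may be preferable.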
            Normalization: Literals
Problem: We might extract different literals (numbers, dates, etc.)
         that mean the same.

              Elvis Presley        1935-01-08
              Elvis Presley        08/01/35

Solution: Normalize the literals

   08/01/35                                           1.67m
   01/08/35                                           1.67 meters
   8th Jan. 1935                                      167 cm
   January 8th, 1935                                  6 feet 5 inches
   ...                                                3 feet 2 toenails
                                                      ...
      ↓                                                  ↓
   1935-01-08                                         1.67m
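For dates, the normalization can be sketched by trying a list of known formats. The day/month order and the 20th-century assumption for two-digit years are heuristics that a real extractor would have to justify from context:

```python
from datetime import datetime

def normalize_date(s):
    """Normalize a few date formats to ISO 8601 (YYYY-MM-DD).
    Assumes day/month order and the 20th century for 2-digit years."""
    for fmt in ("%d/%m/%y", "%dth %b. %Y", "%B %dth, %Y"):
        try:
            d = datetime.strptime(s, fmt)
            if d.year > 2000 and fmt.endswith("%y"):   # 35 -> 1935, not 2035
                d = d.replace(year=d.year - 100)
            return d.strftime("%Y-%m-%d")
        except ValueError:
            pass
    return None   # unknown format: leave unnormalized

assert normalize_date("08/01/35") == "1935-01-08"
assert normalize_date("8th Jan. 1935") == "1935-01-08"
assert normalize_date("January 8th, 1935") == "1935-01-08"
```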
                Information Extraction and beyond

Information Extraction (IE) is the process
of extracting structured information (e.g., database tables)
from unstructured machine-readable documents
(e.g., Web documents).

The IE pipeline:

  ✓ Source Selection
  ✓ Tokenization & Normalization        (05/01/67 → 1967-05-01)
  → Named Entity Recognition            (...married Elvis on 1967-05-01)
  → Instance Extraction                 (Elvis Presley: singer; Angela Merkel: politician)
  → Fact Extraction
  → Ontological Information Extraction
      Named Entity Recognition
Named Entity Recognition (NER) is the process of finding entities
(people, cities, organizations, ...) in a text.

      Elvis Presley was born in 1935 in East Tupelo, Mississippi.



We can extract different types of entities:
•  Entities for which we have an exhaustive dictionary (closed set extraction)

   ... in Tupelo, Mississippi, but ...                   States of the USA


   ... while Germany and France were                     Countries of the World (?)
   opposed to a 3rd World War, ...


  May not always be trivial...
      ... was a great fan of France Gall,
      whose songs...
      Named Entity Recognition
Named Entity Recognition (NER) is the process of finding entities
(people, cities, organizations, ...) in a text.

      Elvis Presley was born in 1935 in East Tupelo, Mississippi.



We can extract different types of entities:
•  Entities for which we have an exhaustive dictionary (closed set extraction)
•  Proper names (open set extraction)
   ... together with the software
   engineer Bob “the coder” Miller...                   People

   ... The region of Northern Urzykistan has been at war             Locations
   with Southern Urzykistan ever since 1208, when...


   ... BrightFridge Inc. presented their new product, the           Organizations
   self-reloading fridge, at this year’s exposition in Paris...
      Named Entity Recognition
Named Entity Recognition (NER) is the process of finding entities
(people, cities, organizations, ...) in a text.

      Elvis Presley was born in 1935 in East Tupelo, Mississippi.



We can extract different types of entities:
•  Entities for which we have an exhaustive dictionary (closed set extraction)
•  Proper names (open set extraction)
•  Entities that follow a certain pattern
    ... was born in 1935. His mother...
    ... started playing guitar in 1937, when...    Years
    ... had his first concert in 1939, although... (4 digit numbers)


   Office: 01 23 45 67 89                              Phone numbers
   Mobile: 06 19 35 01 08                              (groups of digits)
   Home: 09 77 12 94 65

                               NER: Patterns
A pattern is a string that generalizes a set of strings.

  sequences of the letter ‘a’:             ‘a’, followed by ‘b’s:
       a+                                       ab+
  a, aa, aaaa, aaaaaa, aaaaaaa             ab, abbb, abbbb, abbbbbb

  a digit:                                 a sequence of digits:
  0|1|2|3|4|5|6|7|8|9                      (0|1|2|3|4|5|6|7|8|9)+
  0, 1, 2, 3, 4, 5, 6, 7, 8, 9             987, 5321, 5643, 6543

                               => Let’s find a systematic way of expressing patterns
         NER: Regular Expressions
A regular expression (regex) over a set of symbols Σ is:
1. the empty string
2. or the string consisting of an element of Σ (a single character)
3. or the string AB where A and B are regular expressions (concatenation)
4. or a string of the form (A|B), where A and B are regular expressions
  (alternation)
5. or a string of the form (A)*, where A is a regular expression (Kleene star)


For example, with Σ={a,b}, the following strings are regular expressions:

   a         b         ab         aba         (a|b)




            NER: Regular Expressions
Matching
•  a string matches a regex of a single character
   if the string consists of just that character

      regular expressions:   a    b
      matching strings:      a    b

•  a string matches a regular expression of the form (A)*
   if it consists of zero or more parts that match A

      regular expression:    (a)*
      matching strings:      a, aa, aaaaa (and the empty string)
           NER: Regular Expressions
Matching
•  a string matches a regex of the form (A|B)
   if it matches either A or B

      regular expressions:   (a|b)     (a|(b)*)
      matching strings:      a, b      a, bb, bbbb

•  a string matches a regular expression of the form AB
   if it consists of two parts, where the first part matches A
   and the second part matches B

      regular expressions:   ab     b(a)*
      matching strings:      ab     b, baa, baaaaa
       NER: Regular Expressions
Given an ordered set of symbols Σ, we define

•  [x-y] for two symbols x and y, x<y, to be the alternation
      x|...|y      (meaning: any of the symbols in the range)
                                               [0-9] = 0|1|2|3|4|5|6|7|8|9
•  A+ for a regex A to be
   A(A)*        (meaning: one or more A’s)
                                                            [0-9]+ = [0-9][0-9]*

•  A{x,y} for a regex A and integers x ≤ y to be
     A...A|A...A|...|A...A       (meaning: x to y A’s)
                                                          f{4,6} = ffff|fffff|ffffff

•  A? for a regex A to be
     (|A)                       (meaning: an optional A)
                                                           ab? = a(|b)
•  . to be an arbitrary symbol from Σ

          NER: Regular Expressions
A|B        Either A or B
A*         Zero or more occurrences of A
A+         One or more occurrences of A
A{x,y}     x to y occurrences of A
A?         an optional A
[a-z]      One of the characters in the range
.          An arbitrary symbol

 Examples:
 •  A digit
 •  A digit or a letter
 •  A sequence of 8 digits
 •  Numbers in scientific format
 •  HTML attributes
 •  Dates
 •  5 pairs of digits, separated by space
 •  5 pairs of digits, separated by a space or a hyphen
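One possible set of answers for some of these examples, written in Python’s re syntax (the exact patterns on the original slide are not recoverable here, so these are reconstructions):

```python
import re

patterns = {
    "a digit":                      r"[0-9]",
    "a digit or a letter":          r"[0-9a-zA-Z]",
    "a sequence of 8 digits":       r"[0-9]{8}",
    "number in scientific format":  r"-?[0-9]+(\.[0-9]+)?[eE]-?[0-9]+",
    "5 pairs of digits, space":     r"[0-9]{2}( [0-9]{2}){4}",
    "5 pairs, space or hyphen":     r"[0-9]{2}([ -][0-9]{2}){4}",
}

# fullmatch requires the *whole* string to match the pattern
assert re.fullmatch(patterns["a sequence of 8 digits"], "20100924")
assert re.fullmatch(patterns["number in scientific format"], "1.5e-10")
assert re.fullmatch(patterns["5 pairs of digits, space"], "01 23 45 67 89")
assert re.fullmatch(patterns["5 pairs, space or hyphen"], "06-19-35-01-08")
```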
         NER: Regular Expressions
A regex can be matched efficiently by a Finite State Machine
(Finite State Automaton, FSA, FSM)

A FSM is a quintuple of
•  a set Σ of symbols (the alphabet)
•  a set S of states
•  an initial state s0 ∈ S
•  a state transition function δ: S × Σ → S
•  a set of accepting states F ⊆ S

             Regex: ab*c

                a               c
        s0 ───────► s1 ───────► s3
                    ↺ b

 Accepting states are usually depicted with a double ring.
 Implicitly: all unmentioned inputs go to some artificial failure state.
         NER: Regular Expressions
A FSM accepts an input string if there exists a sequence of states, such that
•  it starts with the start state
•  it ends with an accepting state
•  the i-th state, si, is followed by the state δ(si, input.charAt(i))

             Regex: ab*c                Sample inputs:

                a               c         abbbc   (accepted)
        s0 ───────► s1 ───────► s3        ac      (accepted)
                    ↺ b                   aabbbc  (rejected)
                                          def     (rejected)
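The machine above is easy to simulate. In the sketch below the implicit failure state is represented as None:

```python
# Transition function of the FSM for ab*c; missing entries lead to the
# implicit failure state (None)
delta = {("s0", "a"): "s1", ("s1", "b"): "s1", ("s1", "c"): "s3"}
accepting = {"s3"}

def accepts(s):
    state = "s0"                       # start state
    for ch in s:
        state = delta.get((state, ch))
        if state is None:              # fell into the failure state
            return False
    return state in accepting

assert accepts("abbbc") and accepts("ac")
assert not accepts("aabbbc") and not accepts("def")
```

Each input character is processed exactly once, which is why FSM matching runs in time linear in the length of the input.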
         NER: Regular Expressions
A non-deterministic FSM has a transition function that maps to a set of states.

A FSM accepts an input string if there exists a sequence of states, such that
•  it starts with the start state
•  it ends with an accepting state
•  the i-th state, si, is followed by a state in the set δ(si, input.charAt(i))

FSMs can be transformed and simplified while maintaining equivalence;
in particular, every non-deterministic FSM can be made deterministic.

             Regex: ab*c|ab             Sample inputs:

                a               c         abbbc
        s0 ───────► s1 ───────► s3        ab
         │          ↺ b         ▲         abc
         │ a                    │ b
         └────────► s4 ─────────┘
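A non-deterministic FSM can be simulated by tracking the *set* of states the machine could be in. A sketch for ab*c|ab (the state names follow the diagram above, which is itself a reconstruction):

```python
# NFA for ab*c|ab: delta maps to *sets* of states
delta = {("s0", "a"): {"s1", "s4"},
         ("s1", "b"): {"s1"},
         ("s1", "c"): {"s3"},
         ("s4", "b"): {"s3"}}
accepting = {"s3"}

def nfa_accepts(s):
    """Track the set of states the machine could be in after each symbol."""
    states = {"s0"}
    for ch in s:
        states = set().union(*(delta.get((q, ch), set()) for q in states))
    return bool(states & accepting)

assert nfa_accepts("abbbc") and nfa_accepts("ab") and nfa_accepts("abc")
assert not nfa_accepts("a") and not nfa_accepts("abb")
```

This set-of-states simulation is exactly the idea behind the subset construction that turns any NFA into an equivalent deterministic FSM.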
         NER: Regular Expressions
A|B      Either A or B
A*       Zero or more occurrences of A
A+       One or more occurrences of A
A{x,y}   x to y occurrences of A
A?       an optional A
[a-z]    One of the characters in the range
.        An arbitrary symbol


Regular expressions
•  can express a wide range of patterns
•  can be matched efficiently
•  are employed in a wide variety of applications
   (e.g., in text editors, NER systems, normalization, UNIX grep tool etc.)


 Input:                                        Condition:
 •  Manual design of the regex                 •  Entities follow a syntactic pattern

            NER: Sliding Windows
Alright, what if we do not want to specify regexes by hand?
Use sliding windows...

      Information Extraction: Tuesday 10:00 am, Rm 407b



      For each position, ask: Is the current window a named entity?

      Window size = 1




            NER: Sliding Windows
Alright, what if we do not want to specify regexes by hand?
Use sliding windows of different sizes

      Information Extraction: Tuesday 10:00 am, Rm 407b



      For each position, ask: Is the current window a named entity?

      Window size = 2




          NER: Sliding Windows

    Information Extraction: Tuesday 10:00 am, Rm 407b

                       Prefix     Content Postfix
                       window     window window

Choose certain features (properties) of windows that could be important:
•  window contains colon, comma, or digits
•  window contains week day, or certain other words
•  window starts with lowercase letter
•  window contains only lowercase letters
•  ...




         NER: Sliding Windows

   Information Extraction: Tuesday 10:00 am, Rm 407b

                        Prefix   Content Postfix
                        window   window window

Prefix colon      1
Prefix comma      0
...               ...               The feature vector represents the
Content colon     1                 presence or absence of features
Content comma     0                 of one content window (and its
...               ...               prefix window and postfix window)
Postfix colon     0
Postfix comma     1
...               ...


  Features      Feature Vector                                    47
              NER: Sliding Windows
  Now, we need a corpus (set of documents) in which the entities of interest
  have been manually labeled.
                                 time!                     location!
       NLP class: Wednesday, 7:30am and Thursday all day, room 667


   From this corpus, we can compute a set of feature vectors with labels:


          1              1              1             1                1
          0              1              0             0                0
          0              0              1             0                1
          0              0              1     ...     0       ...      0
          1      ...     0       ...    1             1                1
          1              0              1             1                0
          1              1              1             0                1
          0              0              0             1                1


Label: Nothing         Nothing         Time         Nothing         Location
                                                                               48
             NER: Sliding Windows
                    Information Extraction: Tuesday 10:00 am, Rm 407b



Use the labeled feature vectors as
training data for Machine Learning


        1       1                                            1
        0       1                                            0
        0       0                                            1   Result
                                                  classify                 Time
        0       0                                            0
        1       1                                            1
        1       0                                            0
        1       1       Machine Learning                     1
        0       0       (go to the other course              1
                        to see what that is)



    Nothing Location
                                                                          49
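The window-plus-features pipeline above can be sketched in a few lines of Python. The feature set and the 1-nearest-neighbour "classifier" are illustrative stand-ins: a real system would use many more features and a proper learner.

```python
import re

WEEKDAYS = {"monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"}

def features(tokens, start, size):
    """Binary feature vector for the content window tokens[start:start+size],
    with a one-token prefix and postfix window as context."""
    content = tokens[start:start + size]
    prefix = tokens[start - 1] if start > 0 else ""
    postfix = tokens[start + size] if start + size < len(tokens) else ""
    return (
        any(ch.isdigit() for tok in content for ch in tok),  # contains digits
        any(tok.lower() in WEEKDAYS for tok in content),     # contains a weekday
        content[0][:1].islower() if content else False,      # starts lowercase
        prefix.endswith(":"),                                # prefix ends in colon
        postfix == ",",                                      # postfix is a comma
    )

def sliding_windows(text, max_size=3):
    """Yield (window, feature vector) for all windows of up to max_size tokens."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    for size in range(1, max_size + 1):
        for start in range(len(tokens) - size + 1):
            yield tokens[start:start + size], features(tokens, start, size)

def classify(vector, training):
    """1-nearest-neighbour over labeled training vectors, standing in for
    the 'Machine Learning' box on the slide."""
    best = min(training,
               key=lambda tv: sum(a != b for a, b in zip(tv[0], vector)))
    return best[1]
```

Every window of the announcement sentence is turned into such a vector and then labeled Time, Location, or Nothing by the learned model.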
               NER: Sliding Windows
 The Sliding Windows Technique can be used for Named Entity Recognition
 for nearly arbitrary entities




Input:                                            Condition:
•  a labeled corpus                               •  The entities share some
•  a set of features                                 syntactic similarities
   The features can be
   arbitrarily complex and
   the result depends a
   lot on this choice


   The technique can be refined by using better features, taking into
   account more of the context (not just the prefix and postfix windows) and using
   advanced Machine Learning techniques (HMMs, CRFs,...).
                                                                           50
      Named Entity Recognition
Named Entity Recognition (NER) is the process of finding entities
(people, cities, organizations, ...) in a text.



We have seen different techniques
•  Closed-set extraction (if the set of entities is known)
•  Extraction with Regular Expressions (if the entities follow a pattern)
•  Extraction with sliding windows / Machine Learning
   (if the entities share some syntactic features)




                                                                            51
                Information Extraction and beyond
                                                                     Ontological
Information Extraction (IE) is the process
                                                                     Information
of extracting structured information (e.g., database tables)
                                                                     Extraction
from unstructured machine-readable documents
(e.g., Web documents).
                                                                Fact
                                                                Extraction

                                                     Instance
                                       ✓             Extraction
                                 Named Entity
                                                     Elvis Presley      singer
                    ✓            Recognition
    ✓                                                Angela Merkel politician
                 Tokenization&   ...married Elvis
Source           Normalization   on 1967-05-01
Selection         05/01/67
                  
            ?     1967-05-01
                                                                             52
               Instance Extraction
Instance Extraction is the process of extracting entities with their class (i.e.,
concept, set of similar entities)


Elvis was a great artist, but while        Entity             Class
all of Elvis’ colleagues loved the
                                           Elvis              artist
song “Oh yeah, honey”, Elvis
did not perform that song at his           Oh yeah, honey     song
concert in Hintertuepflingen.              Hintertuepflingen location




      ...some of the class assignments might already be provided by
      Named Entity Recognition.


                                                                               53
Instance Extraction: Hearst Patterns
Instance Extraction is the process of extracting entities with their class (i.e.,
concept, set of similar entities)

                                           Idea (by Hearst):
Elvis was a great artist, but while
all of Elvis’ colleagues loved the
                                           Sentences express class membership
song “Oh yeah, honey”, Elvis
                                           in very predictable patterns. Use these
did not perform that song at his
                                           patterns for instance extraction.
concert in Hintertuepflingen.
                                           Hearst patterns:
                                           •  X was a great Y




 Entity             Class
 Elvis              artist

                                                                               54
Instance Extraction: Hearst Patterns


Elvis was a great artist                       Idea (by Hearst):

                                               Sentences express class membership
                                               in very predictable patterns. Use these
   Many scientists, including                  patterns for instance extraction.
   Einstein, started to believe that
   matter and energy could be
   equated.                                    Hearst patterns:
                                               •  X was a Y
                                               •  Ys, such as X1, X2, ...
He adored Madonna, Celine                      •  X1, X2, ... and other Y
Dion and other singers, but                    •  many Ys, including X,
never got an autograph from
any of them.

                                   Many US citizens have never
                                   heard of countries such as
                                   Guinea, Belize or Germany.
                                                                                55
Instance Extraction: Hearst Patterns
   Hearst Patterns on Google




                               Idea (by Hearst):

                               Sentences express class membership
                               in very predictable patterns. Use these
                               patterns for instance extraction.

                               Hearst patterns:
   Wildcards on Google         •  X was a Y
                               •  Ys, such as X1, X2, ...
                               •  X1, X2, ... and other Y
                               •  many Ys, including X,




                                                                56
Instance Extraction: Hearst Patterns
Hearst Patterns can extract
instances from natural
language documents
                                      Idea (by Hearst):

Input:                                Sentences express class membership
•  Hearst patterns for the language   in very predictable patterns. Use these
   (easily available for English)     patterns for instance extraction.

                                      Hearst patterns:
Condition:
                                      •  X was a Y
•  Text documents contain
                                      •  Ys, such as X1, X2, ...
   class + entity explicitly in
                                      •  X1, X2, ... and other Y
   defining phrases
                                      •  many Ys, including X,




                                                                       57
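A minimal sketch of Hearst-pattern matching with plain regular expressions. This is a simplification: real implementations match over POS-tagged text and noun-phrase chunks and cover more patterns; note also that the class name comes out exactly as it appears in the text (possibly in the plural).

```python
import re

def hearst_instances(sentence):
    """Return (entity, class) pairs found by two simple Hearst patterns."""
    pairs = []
    # Pattern: "X was a (great) Y"
    m = re.search(r"^(\w+) was a (?:great )?(\w+)", sentence)
    if m:
        pairs.append((m.group(1), m.group(2)))
    # Pattern: "Ys, such as X1, X2, ... and/or Xn"
    m = re.search(r"(\w+),? such as ([\w, ]+?)(?:\.|$)", sentence)
    if m:
        cls = m.group(1)
        for x in re.split(r",\s*|\s+(?:and|or)\s+", m.group(2)):
            if x:
                pairs.append((x.strip(), cls))
    return pairs
```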
Instance Extraction: Classification
Suppose that we already have the seed sets
scientists={Einstein, Bohr}
musician={Elvis, Madonna}                                     Rengstorff made
                                                              multiple important
When Einstein                            Elvis played the     discoveries, among
                        In 1940, Bohr                         others the theory of
discovered the U86                       guitar, the piano,
                        discovered the                        recursive
plutonium                                the flute, the
                        CO2H3X.                               subjunction.
hypercarbonate...                        harpsichord,...


Stemmed context of the entity without stop words:


  {discover             {1940,               {play,               {make,
  U86                   discover,            guitar,              important,
  plutonium}            CO2H3X}              piano}               discover}




    Scientist            Scientist           Musician                What is
                                                                     Rengstorff?
                                                                         58
  Instance Extraction: Classification
  Suppose that we already have the seed sets
  scientists={Einstein, Bohr}
  musician={Elvis, Madonna}                                    Rengstorff made
                                                               multiple important
  When Einstein                           Elvis played the     discoveries, among
                         In 1940, Bohr                         others the theory of
  discovered the U86                      guitar, the piano,
                         discovered the                        recursive
  plutonium                               the flute, the
                         CO2H3X.                               subjunction.
  hypercarbonate...                       harpsichord,...



discover     1                1                 0                           1
U86          1                0                 0                           0
plutonium    1                0                 0                           0
1940         0                1                 0              classify     0
CO2H3X       0                1                 0                           0
play         0                0                 1                           0
guitar       0                0                 1                           0
piano        0                0                 1                           0


      Scientist           Scientist            Musician
                                                                           59
                                                                       Scientist
Instance Extraction: Classification
Classification can extract instances from text corpora
without defining phrases.
                                                                     Rengstorff made
                                                                     multiple important
When Einstein                                   Elvis played the     discoveries, among
                             In 1940, Bohr                           others the theory of
discovered the U86                              guitar, the piano,
                             discovered the                          recursive
plutonium                                       the flute, the
                             CO2H3X.                                 subjunction.
hypercarbonate...                               harpsichord,...

Condition:                                                                  Training
•  The texts have to be homogeneous
Input: Known classes and either of
                                                      + documents                0
                scientists={Einstein, Bohr}                                      1
•  seed sets
                musician={Elvis, Madonna}                                        0
                                                scientist!                       0
•  or manually labeled documents         When Einstein                           1
                                         discovered ...                          1
                                                                                 1
•  or context        scientist={discover,theorem,...}                            0
                                                                                  60
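The context-based classification can be sketched as follows. The stemmer, the stop-word list, and the raw overlap score are crude stand-ins for the real components (a proper stemmer, feature vectors, and a trained classifier):

```python
STOPWORDS = {"the", "in", "a", "of", "and", "when", "among"}

def stem(word):
    """Crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ies", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def context(sentence, entity):
    """Stemmed context of the entity without stop words."""
    words = [w.strip(".,").lower() for w in sentence.split()]
    return {stem(w) for w in words
            if w not in STOPWORDS and w != entity.lower()}

def classify_entity(sentence, entity, profiles):
    """Assign the class whose seed-derived context overlaps most."""
    ctx = context(sentence, entity)
    return max(profiles, key=lambda cls: len(ctx & profiles[cls]))

# Class profiles built from the seed sentences on the slide:
profiles = {
    "scientist": context("When Einstein discovered the U86 plutonium hypercarbonate", "Einstein")
               | context("In 1940, Bohr discovered the CO2H3X.", "Bohr"),
    "musician":  context("Elvis played the guitar, the piano, the flute, the harpsichord", "Elvis"),
}
```

With these profiles, the Rengstorff sentence shares the stem "discover" with the scientist profile and nothing with the musician profile, so Rengstorff is classified as a scientist.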
  Instance Extraction: Iteration
    Seed set: {Einstein, Bohr}




Result set: {Einstein, Bohr, Planck}

                                       61
  Instance Extraction: Iteration
    Seed set: {Einstein, Bohr, Planck}




            One day, Roosevelt met
            Einstein, who had
            discovered the U68




Result set: {Einstein, Bohr, Planck, Roosevelt}

                                                  62
 Instance Extraction: Iteration
   Seed set: {Einstein,Bohr, Planck, Roosevelt}




                                        Semantic Drift is a problem
                                        that can appear in any
                                        system that reuses its output




Result set: {Einstein, Bohr, Planck,
Roosevelt, Kennedy, Bush, Obama, Clinton}
                                                                        63
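A toy version of this bootstrapping loop, with a made-up corpus. Rule (2) below is deliberately greedy, to show how reusing the output pulls in Roosevelt:

```python
import re

PATTERN = re.compile(r"([A-Z]\w+) discovered")  # pattern derived from the seeds

def bootstrap_round(corpus, entities):
    """One bootstrapping round: (1) the discovery pattern proposes new
    entities anywhere in the corpus; (2) a capitalized word two tokens
    before a known entity is proposed as well. Rule (2) is what drags
    Roosevelt into the scientist set: semantic drift."""
    new = set(entities)
    for sentence in corpus:
        new |= set(PATTERN.findall(sentence))                      # rule (1)
        for e in entities:
            if e in sentence:                                      # rule (2)
                new |= set(re.findall(r"([A-Z]\w{3,}) \w+ " + re.escape(e),
                                      sentence))
    return new
```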
Instance Extraction: Set Expansion
    Seed set: {Russia, USA, Australia}




Result set: {Russia, Canada, China, USA, Brazil,
Australia, India, Argentina, Kazakhstan, Sudan}
                                                   64
Instance Extraction: Set Expansion


        Most corrupt countries




Result set: {Russia, Canada, China, USA, Brazil,
Australia, India, Argentina, Kazakhstan, Sudan}
                                                   65
Instance Extraction: Set Expansion
Seed set: {Russia, Canada, China, USA, Brazil,   Try, e.g., Google sets:
Australia, India, Argentina, Kazakhstan, Sudan}   http://labs.google.com/sets

        Most corrupt countries




           Result set: {Uzbekistan,
           Chad, Iraq,...}
                                                                      66
Instance Extraction: Set Expansion
                  Set Expansion can extract instances
                  from tables or lists.


                  Input:
                  •  seed pairs

                   Condition:
                   •  a corpus full of tables




                                                67
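Set expansion over a table corpus can be sketched as: find columns that share enough cells with the seed set, then add the columns' remaining cells. The table representation (a table is a list of columns) and the overlap threshold are illustrative choices:

```python
def expand(seeds, tables, min_overlap=2):
    """Expand the seed set with cells from columns that contain at least
    min_overlap seeds; columns with too little overlap are ignored."""
    result = set(seeds)
    for table in tables:          # a table is a list of columns (cell lists)
        for column in table:
            if len(set(column) & set(seeds)) >= min_overlap:
                result |= set(column)
    return result
```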
Instance Extraction: Cleaning
Information Extraction nearly always produces noise (minor false outputs)

Approaches:
 •  Thresholding
              Einstein                          (number of times extracted)
              Bohr
              Planck
              Roosevelt
              Kennedy
              Elvis

 •  Heuristics (rules of thumb that work well in practice, without formal guarantees)

       Accept an output only if it appears on different pages,
       merge entities that look similar (Einstein, EINSTEIN), ...



                                                                        68
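Both cleaning approaches fit in a few lines: lowercase variants are merged first (a heuristic standing in for real entity normalization), then extractions below a count threshold are dropped:

```python
from collections import Counter

def clean(extractions, threshold=2):
    """Merge case variants (Einstein / EINSTEIN), then keep only outputs
    extracted at least `threshold` times."""
    canonical = {}                      # lowercased form -> first surface form
    counts = Counter()
    for e in extractions:
        key = e.lower()
        canonical.setdefault(key, e)
        counts[key] += 1
    return {canonical[k] for k, n in counts.items() if n >= threshold}
```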
  Instance Extraction: Evaluation
In science, every system, algorithm or theory should be evaluated,
i.e., its output should be compared to a gold standard (the ideal output).

             Algorithm output:
             O = {Einstein, Bohr, Planck, Clinton, Obama}
                     ✓       ✓       ✓        ✗      ✗

             Gold standard:
             G = {Einstein, Bohr, Planck, Heisenberg}
                    ✓       ✓     ✓         ✗


    Precision:                                          Recall:
    What proportion of the                              What proportion of the
    output is correct?                                  gold standard did we get?

    |O ∩ G| / |O|                                          |O ∩ G| / |G|
                                                                            69
  Instance Extraction: Evaluation
Explorative algorithms extract everything they find.
                                                            (very low threshold)

            Algorithm output:
            O = {Einstein, Bohr, Planck, Clinton, Obama, Elvis, Heisenberg, ...}


            Gold standard:
            G = {Einstein, Bohr, Planck, Heisenberg}



    Precision:                                         Recall:
    What proportion of the                             What proportion of the
    output is correct?                                 gold standard did we get?

     BAD                                                      GREAT


                                                                              70
  Instance Extraction: Evaluation
Conservative algorithms extract only things about which they are very certain
                                                          (very high threshold)

           Algorithm output:
           O = {Einstein}


           Gold standard:
           G = {Einstein, Bohr, Planck, Heisenberg}



    Precision:                                        Recall:
    What proportion of the                            What proportion of the
    output is correct?                                gold standard did we get?

     GREAT                                                  BAD


                                                                           71
Instance Extraction: Evaluation
You can’t get it all...



  (figure: trade-off curve with precision on the y-axis and recall on
   the x-axis, both ranging from 0 to 1)


   The F1-measure combines precision and recall as the harmonic mean:

   F1 = 2 * precision * recall / (precision + recall)




                                                                   72
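Precision, recall, and F1 for set-valued output, checked against the example from the previous slides (3 of 5 outputs correct, 3 of 4 gold entities found):

```python
def evaluate(output, gold):
    """Return (precision, recall, F1) for a set-valued extraction result."""
    correct = len(output & gold)
    precision = correct / len(output) if output else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```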
               Instance Extraction
Instance Extraction is the process of extracting entities with their class (i.e.,
concept, set of similar entities)


 Approaches:
 •  Hearst Patterns (work on natural language corpora)
 •  Classification (if the entities appear in homogeneous contexts)
 •  Set Expansion (for tables and lists)
 •  ...many others...

  On top of that:
  •  Iteration
  •  Cleaning


  And finally:
  •  Evaluation


                                                                               73
                Information Extraction and beyond
                                                                     Ontological
Information Extraction (IE) is the process
                                                                     Information
of extracting structured information (e.g., database tables)
                                                                     Extraction
from unstructured machine-readable documents
(e.g., Web documents).
                                                                Fact
                                                                Extraction
                                                        ✓
                                                     Instance
                                       ✓             Extraction
                                 Named Entity
                                                     Elvis Presley      singer
                    ✓            Recognition
    ✓                                                Angela Merkel politician
                 Tokenization&   ...married Elvis
Source           Normalization   on 1967-05-01
Selection         05/01/67
                  
            ?     1967-05-01
                                                                             74
                Information Extraction
  Information Extraction (IE) is the process                       and beyond
  of extracting structured information (e.g., database tables)
  from unstructured machine-readable documents
  (e.g., Web documents).


                                                   Ontological
                        Fact                       Information
                        Extraction                 Extraction
   Instance
   Extraction   ✓   Person           Nationality
                    Angela Merkel German                    nationality
Named Entity
Recognition
                ✓

Tokenization&
Normalization   ✓
Source      ✓                                                             75
Selection
                     Fact Extraction
Fact Extraction is the process of extracting pairs (triples,...) of entities
together with the relationship that holds between them.




          Event                Time                 Location
          Costello sings...    2010-10-01,          Great
                               23:00                American...                76
    Fact Extraction: From Tables
Fact Extraction is the process of extracting pairs (triples,...) of entities
together with the relationship that holds between them.




                    Date                 City                 Time
                    1969-07-01           Las Vegas, NV        22:15
                    1969-08-01           Las Vegas, NV        20:15            77
Fact Extraction: From Tables




      Dates               Cities   Times


      Date         City             Time
      1969-07-01   Las Vegas, NV    22:15
      1969-08-01   Las Vegas, NV    20:15   78
Fact Extraction: From Tables

New York        Broadway           Jan. 8th, 1970   10pm
New York        Hotel California   Jan 9th, 1970    10pm
San Francisco   Pier 39            Jan 11th, 1970   7pm
Mountain View   Castro Str.        Jan 12th, 1970   8pm
Mountain View   Castro Str.        Jan 12th, 1970   9pm
Mountain View   Castro Str.        Jan 12th, 1970   10pm
Mountain View   Castro Str.        Jan 12th, 1970   11pm
Mountain View   Castro Str.        Jan 13th, 1970   12am




    Cities                             Dates         Times


                Date                City              Time
                1969-07-01          Las Vegas, NV     22:15
                1969-08-01          Las Vegas, NV     20:15   79
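Mapping the columns to the schema can be sketched as column typing by majority vote: each cell is recognized individually (here with toy regexes standing in for real NER), and the column gets the most frequent cell type. The "City" fallback is specific to this example schema:

```python
import re
from collections import Counter

RECOGNIZERS = [
    ("Date", re.compile(r"\w+\.? \d{1,2}(?:st|nd|rd|th)?, \d{4}$")),
    ("Time", re.compile(r"\d{1,2}(?::\d{2})? ?[ap]m$", re.I)),
]

def cell_type(cell):
    for label, regex in RECOGNIZERS:
        if regex.search(cell):
            return label
    return "City"  # fallback for this toy schema

def column_type(column):
    """The column's type is the most frequent type among its cells."""
    return Counter(cell_type(c) for c in column).most_common(1)[0][0]
```

The majority vote makes the typing robust against a few misrecognized cells within a column.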
  Fact Extraction: From Tables
Challenges:
 •  We need reliable Named Entity Recognition/
    instance extraction for the columns




                                    (Cities in Cameroon)




                                                           80
  Fact Extraction: From Tables
Challenges:
 •  We need reliable Named Entity Recognition/
    instance extraction for the columns

•  The tables are not always structured in columns




                                                     81
  Fact Extraction: From Tables
Challenges:
 •  We need reliable Named Entity Recognition/
    instance extraction for the columns

•  The tables are not always structured in columns

•  What if the columns are ambiguous?




                 (Presidents with their vice presidents)
                                                           82
  Fact Extraction: From Tables
Challenges:
 •  We need reliable Named Entity Recognition/
    instance extraction for the columns

•  The tables are not always structured in columns

•  What if the columns are ambiguous?

•  Tables are not always tables

                                                           <table>
                                                           ...
                                                           </table>



                                     <li> blah (<a>blub</a>)

                                                                  83
  Fact Extraction: From Tables
Challenges:
 •  We need reliable Named Entity Recognition/
    instance extraction for the columns

•  The tables are not always structured in columns

•  What if the columns are ambiguous?

•  Tables are not always tables

•  Most importantly: We need to find the tables that we want to target

                                        Web page contributors:
                                        Bob Miller
                                        Carla Hunter
                                        Sophie Jackson


                                                                     84
  Fact Extraction: From Tables
Challenges:
 •  We need reliable Named Entity Recognition/
    instance extraction for the columns

•  The tables are not always structured in columns

•  What if the columns are ambiguous?

•  Tables are not always tables

•  Most importantly: We need to find the tables that we want to target

Input:
•  The relations with their types (schema)


Condition:
•  the corpus contains lots of (clean) tables
                                                                     85
Fact Extraction: Wrapper Induction
 Observation: On Web pages of a certain domain, the information is
 often in the same spot.




                                                                     86
 Fact Extraction: Wrapper Induction
    Observation: On Web pages of a certain domain, the information is
    often in the same spot.

    Idea: Describe this spot in a general manner.
    A description of one spot or multiple spots on a page is called a wrapper.




<html>
<body>                                        A wrapper can be similar
<div>                                         to an XPath expression:
  ...
  <div>                                           html  div[1]  div[2]  b[1]
  ...                                          It can also be a search text/regex
  <div>
    ...                                            >.*</b>(TV
    <b>Elvis: Aloha from Hawaii</b> (TV...
                                                                          87
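A wrapper of the path kind can be sketched with the standard XML library; real pages are messy HTML, so real systems use an HTML parser and XPath, and the page snippet below is made up for illustration:

```python
import xml.etree.ElementTree as ET

def apply_wrapper(page, path):
    """Follow a wrapper path of (tag, index) steps, 1-based as on the
    slide, and return the text of the node it points to."""
    node = ET.fromstring(page)
    for tag, index in path:
        node = [child for child in node if child.tag == tag][index - 1]
    return node.text

page = ("<html><body><div>filler</div>"
        "<div><b>Elvis: Aloha from Hawaii</b> (TV)</div>"
        "</body></html>")

# Wrapper body[1] → div[2] → b[1], starting from the <html> root:
title = apply_wrapper(page, [("body", 1), ("div", 2), ("b", 1)])
```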
 Fact Extraction: Wrapper Induction
    We manually label the fields to be extracted, and produce
    the corresponding wrappers (usually with a GUI tool).




             title!
                                                                        Try it out



<html>                                                 Title:
<body>                                                 div[1] → div[2]
<div>
  ...                                                  Rating:
  <div>                                                div[7] → span[2] → b[1]
  ...
  <div>                                                ReleaseDate:
    ...                                                div[10] → i[1]
    <b>Elvis: Aloha from Hawaii</b> (TV...
                                                                             88
Fact Extraction: Wrapper Induction
   We manually label the fields to be extracted, and produce
   the corresponding wrappers.

    Then we apply the wrappers to all pages in the domain
    (i.e., we determine the spots of the pages that the wrappers point to).




                                                         Title:
                                                         div[1] → div[2]

                                                         Rating:
                                                         div[7] → span[2] → b[1]

                                                         ReleaseDate:
                                                         div[10] → i[1]
Title              Rating             ReleaseDate
                                                                              89
Titanic            7.4                1998-01-07
Fact Extraction: Wrapper Induction
 Wrappers can also work inside one page, if the content is repetitive.




                                                                         90
Fact Extraction: Wrapper Induction
 Wrappers can also work inside one page, if the content is repetitive.




                in stock




 Problem:
 some parts of the repetitive items may be optional or themselves repetitive
 ⇒  learn a stable wrapper




                                                                          91
Fact Extraction: Wrapper Induction


                in stock




 Problem:
 some parts of the repetitive items may be optional or themselves repetitive
 ⇒  learn a stable wrapper




Sample system:
RoadRunner
http://www.dia.uniroma3.it/db/roadRunner/
                                                                          92
Fact Extraction: Wrapper Induction
 Wrapper induction can extract entities and relations from
 a set of similarly structured pages.

 Input:                                     Condition:
 •  Choice of the domain                    •  All pages are of the same
 •  (Human) labeling of some pages             structure
 •  Wrapper design choices


        Can the wrapper say things like
          “The last child element of this element”
          “The second element, if the first element contains XYZ”
        ?

        If so, how do we generalize the wrapper?




                                                                       93
Fact Extraction: Pattern Matching
                                 Known facts (seed pairs)
                                Person           Discovery
Einstein ha scoperto il K68,
quando aveva 4 anni.            Einstein         K68




       X ha scoperto il Y          The patterns can either
                                   •  be specified by hand
                                   •  or come from annotated text
                                   •  or come from seed pairs + text
Bohr ha scoperto il K69 nel
anno 1960.




Person              Discovery
Bohr                K69
                                                                94
Fact Extraction: Pattern Matching
                                Person            Discovery
Einstein ha scoperto il K68,
quando aveva 4 anni.            Einstein          K68




       X ha scoperto il Y           The patterns can be more
                                    complex, e.g.
                                    •  regular expressions
                                          X discovered the .{0,20} Y
                                    •  POS patterns
Bohr ha scoperto il K69 nel               X discovered the ADJ? Y
anno 1960.                          •  Parse trees
                                                (S (NP (PN X))
                                                   (VP (V discovered)
                                                       (NP (PN Y))))

Person              Discovery
Bohr                K69
                                                                        95
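The seed-pair route can be sketched as follows: the text between the two seed entities becomes the pattern, which is then applied to new sentences. The Italian sentences are the slide's own examples; real systems generalize the infix (regexes, POS patterns, parse trees) rather than matching it verbatim:

```python
import re

def learn_infix(sentence, x, y):
    """Turn a sentence containing the seed pair (x, y) into a pattern:
    the literal infix between the two entities."""
    start = sentence.index(x) + len(x)
    return sentence[start:sentence.index(y, start)]

def apply_infix(sentence, infix):
    """Match 'X <infix> Y' in a new sentence and return the pair."""
    m = re.search(r"(\w+)" + re.escape(infix) + r"(\w+)", sentence)
    return (m.group(1), m.group(2)) if m else None
```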
Fact Extraction: Pattern Matching
                                Person             Discovery
Einstein ha scoperto il K68,
quando aveva 4 anni.            Einstein           K68




       X ha scoperto il Y


                                     First system to
                                     use iteration:
Bohr ha scoperto il K69 nel          Snowball
anno 1960.

                                                 Watch out for
                                                 semantic drift:
                                                 Einstein liked the K68
Person              Discovery
Bohr                K69
                                                                  96
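A toy single-iteration bootstrap in the spirit of Snowball (seed pair, induced patterns, new pairs). The corpus is invented; note that the drift sentence induces a bad pattern, which real systems suppress by scoring patterns:

```python
import re

# Toy corpus; the third sentence is a semantic-drift trap.
corpus = [
    "Einstein discovered the K68 when he was 4 years old.",
    "Bohr discovered the K69 in the year 1960.",
    "Einstein liked the K68.",
    "Curie discovered the K70.",
]

seeds = {("Einstein", "K68")}

# Step 1: induce a pattern (the text between X and Y) from every
# sentence that mentions a seed pair.
patterns = set()
for x, y in seeds:
    for sentence in corpus:
        if x in sentence and y in sentence:
            patterns.add(sentence[sentence.index(x) + len(x):sentence.index(y)])

# Step 2: apply each pattern to harvest new pairs. The drifted
# pattern " liked the " was induced, too; real systems score
# patterns by how many distinct seed pairs they reproduce.
pairs = set(seeds)
for p in patterns:
    regex = re.compile(r"(\w+)" + re.escape(p) + r"(K\d+)")
    for sentence in corpus:
        pairs.update(regex.findall(sentence))

print(sorted(pairs))  # [('Bohr', 'K69'), ('Curie', 'K70'), ('Einstein', 'K68')]
```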
Fact Extraction: Pattern Matching

Einstein discovered the K68
when he was 4 years old.


 Pattern matching can extract facts from natural language text corpora.

  Input:
  •  a known relation
  •  seed pairs or labeled documents or patterns


                                         Condition:
                                         •  The texts are homogeneous
                                             (express facts in a similar way)
                                         •  Entities that stand in the relation
                                            do not stand in another relation
                                            as well
                                                                           97
Fact Extraction: Pattern Matching




                  Try this out:
                  http://viewer.opencalais.com/
                                              98
       Fact Extraction: Cleaning
Fact Extraction commonly produces huge amounts of garbage.

                                                      Web page contains
    Web page contains bogus information               misleading items
                                                      (advertisements,
        Deviation in iteration                        error messages)

                                    Regularity in the training set that
  Formatting problems               does not appear in the real world
  (bad HTML, character
  encoding mess)

    Different thematic domains        Something has changed over time
    or Internet domains behave        (facts or page formatting)
    in a completely different way

        ⇒  Cleaning is usually necessary,
           e.g., through thresholding or heuristics
                                                                          99
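A minimal thresholding sketch. The raw extractions are invented; the point is that correct facts recur across pages, while garbage usually has low support:

```python
from collections import Counter

# Invented raw extractions; correct facts recur across pages,
# garbage (error messages, ads) usually occurs only once.
raw = [
    ("Elvis", "1935"), ("Elvis", "1935"), ("Elvis", "1935"),
    ("Elvis", "404 Not Found"),
    ("Bohr", "1885"), ("Bohr", "1885"),
]

MIN_SUPPORT = 2   # threshold: keep facts extracted at least twice

counts = Counter(raw)
clean = [fact for fact, n in counts.items() if n >= MIN_SUPPORT]
print(clean)  # [('Elvis', '1935'), ('Bohr', '1885')]
```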
       Fact Extraction: Summary
Fact Extraction is the process of extracting pairs (triples, ...) of entities
together with the relationship between the entities.


Approaches:
•  Fact extraction from tables (if the corpus contains lots of tables)
•  Wrapper induction (for extraction from one Internet domain)
•  Pattern matching (for extraction from natural language documents)
•  ... and many others...




                                                                               100
                Information Extraction
  Information Extraction (IE) is the process                       and beyond
  of extracting structured information (e.g., database tables)
  from unstructured machine-readable documents
  (e.g., Web documents).


                             ✓                     Ontological
                        Fact                       Information
                        Extraction                 Extraction
   Instance
   Extraction   ✓   Person           Nationality
                    Angela Merkel German                    nationality
Named Entity
Recognition
                ✓

Tokenization&
Normalization   ✓
Source Selection ✓                                                            101
                     Ontological IE
Ontological Information Extraction (IE) tries to create or
extend an ontology through information extraction.




                          nationality




Angela Merkel is the German
chancellor....                                Person          Nationality
...Merkel was born in                         Angela Merkel   German
Germany...
                                              Merkel          Germany

...A. Merkel has French                       A.  Merkel      French
nationality...

                                                                            102
                     Ontological IE
Ontological Information Extraction (IE) tries to create or
extend an ontology through information extraction.




                           nationality
                                                Challenges:
                                                1.  Map entity names to
                           has nationality
                                                    ontological entities
                           has citizenship
                           is citizen of        2. Disambiguate entity names
      Merkel
               A. Merkel                        3.  Use the relationships from the
          Angie                                     ontology
                                                4. Make the ontology consistent



                                                                           103
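Challenges 1 and 2 can be sketched with a hypothetical "means" dictionary that maps surface names to candidate entities with confidence scores (analogous to the means() facts that SOFIE reasons over later); all names and scores here are invented:

```python
# Hypothetical "means" dictionary: surface name -> (entity, confidence).
MEANS = {
    "Merkel":    [("Angela_Merkel", 0.7), ("Una_Merkel", 0.3)],
    "A. Merkel": [("Angela_Merkel", 0.9)],
    "Angie":     [("Angela_Merkel", 0.6)],
}

def disambiguate(name):
    """Pick the highest-confidence candidate entity for a surface name.
    A real disambiguator would also score overlap with the context."""
    candidates = MEANS.get(name, [])
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[1])[0]

print(disambiguate("Merkel"))  # Angela_Merkel
```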
    Ontological IE: Wikipedia

                  Wikipedia is a free online encyclopedia
                  •  3.4 million articles in English
                  •  16 million articles in dozens of languages



Why is Wikipedia good for information extraction?
•  It is a huge but homogeneous resource (more homogeneous than the Web)
•  It is considered authoritative and covers many different aspects
   (more authoritative than a random Web page)
•  It is well-structured with infoboxes and categories
•  It provides a wealth of meta information
   (inter article links, inter language links, user discussion,...)




                                                                  104
      Ontological IE: Wikipedia

                     Wikipedia is a free online encyclopedia
                     •  3.4 million articles in English
                     •  16 million articles in dozens of languages




Every article is (should be) unique
=> We get a set of unique entities that cover numerous areas of interest




                                    Germany
Angela_Merkel
                                                    Theory_of_Relativity
                  Una_Merkel
                                                                           105
          Ontological IE: Wikipedia
                                        | bornOnDate = 1935

                                            (hello regexes!)




              Elvis Presley



Blah blah blub        ~Infobox~                          born
fasel (do not         Born: 1935                                1935
read this, better
listen to the talk)   ...
blah blah Elvis
blub (you are still
reading this) blah
Elvis blah blub                    Exploit Infoboxes
later became
astronaut blah
Categories: Rock singers
                                                                   106
             Ontological IE: Wikipedia
                                                 Idea:
American rock singers of German origin ✔
                                                 •  shallow noun phrase parsing
                                                    to determine the head noun
American rock music of Great Quality    ✗        •  take only plural nouns




                                               Rock Singer
                 Elvis Presley
                                                     type


   Blah blah blub        ~Infobox~                           born
   fasel (do not         Born: 1935                                    1935
   read this, better
   listen to the talk)   ...
   blah blah Elvis
   blub (you are still
   reading this) blah
   Elvis blah blub                     Exploit Infoboxes
   later became
   astronaut blah                      Exploit conceptual categories
    Categories: Rock singers
                                                                          107
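A crude sketch of the category heuristic; a real system would use shallow noun-phrase parsing and a dictionary-based plural test instead of these string hacks:

```python
# Crude approximation of the heuristic: split off "of"/"from" modifiers,
# take the last word as head noun, and require a (naively tested) plural.
def category_to_class(category):
    head_phrase = category.split(" of ")[0].split(" from ")[0]
    head = head_phrase.split()[-1]
    if head.endswith("s") and not head.endswith("ss"):
        return head_phrase   # conceptual category -> class candidate
    return None              # thematic category, rejected

print(category_to_class("American rock singers of German origin"))  # American rock singers
print(category_to_class("American rock music of Great Quality"))    # None
```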
          Ontological IE: Wikipedia
           WordNet                           Person         Disambiguation
       Person                                    subclassOf
              subclassOf                     Singer
       Singer
                                                      subclassOf

                                           Rock Singer
              Elvis Presley
                                                 type


Blah blah blub        ~Infobox~                          born
fasel (do not         Born: 1935                                   1935
read this, better
listen to the talk)   ...
blah blah Elvis
blub (you are still
reading this) blah
Elvis blah blub                    Exploit Infoboxes
later became
astronaut blah                     Exploit conceptual categories
Categories: Rock singers
                                                                      108
Ontological IE: Consistency Checks
                                             Person

                                                 subclassOf
                                             Singer

                                                      subclassOf

          Guitar      Guitarist          Rock Singer
                                                 type



                   Physics        born                  born
                                                                   1935


     Check uniqueness of entities and functional arguments
     Check domains and ranges of relations
     Check type coherence

                              YAGO & SOFIE                                109
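Two of the checks (uniqueness of functional arguments, domains/ranges) can be sketched over toy candidate facts; the type assignments and relation signatures below are invented for illustration:

```python
# Toy knowledge for the checks; types and signatures are invented.
FUNCTIONAL = {"born"}                     # at most one object per subject
SIGNATURE = {"born": ("person", "year")}  # domain and range per relation
TYPES = {"Elvis_Presley": "person", "Guitar": "artifact",
         "1935": "year", "1970": "year"}

def check(facts):
    """Keep only facts that respect domain/range and functionality."""
    accepted, seen = [], set()
    for rel, subj, obj in facts:
        dom, rng = SIGNATURE[rel]
        if TYPES.get(subj) != dom or TYPES.get(obj) != rng:
            continue                      # domain/range (type) violation
        if rel in FUNCTIONAL and (rel, subj) in seen:
            continue                      # second value for a functional relation
        seen.add((rel, subj))
        accepted.append((rel, subj, obj))
    return accepted

facts = [("born", "Elvis_Presley", "1935"),
         ("born", "Elvis_Presley", "1970"),  # dropped: born is functional
         ("born", "Guitar", "1935")]         # dropped: a guitar is not a person
print(check(facts))  # [('born', 'Elvis_Presley', '1935')]
```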
 Ontological IE: Wikipedia
Example: Elvis on Wikipedia

                  |Birth_name = Elvis Aaron Presley
                  |Alias =
                  |Born = {{Birth date|1935|1|8}}<br />[[Tupelo, Mississippi|Tupelo]],
                   [[Mississippi]],<br />United States
                  |Died = {{Death date and age|mf=yes|1977|08|16|1935|01|08}}<br />
                  [[Memphis, Tennessee|Memphis]], [[Tennessee]],<br />United States
                  |Genre = [[Rock and roll]], [[pop music|pop]], [[rockabilly]],
                   [[country music|country]], [[blues]], [[gospel music|gospel]],
                   [[rhythm and blues|R&B]]
                  |Associated_acts = [[The Blue Moon Boys]], [[The Jordanaires]], [[The Imperials]]
                  |Occupation = Musician, actor
                  |Instrument = Vocals, guitar, piano
                  |Years_active = 1954–77
                  |Label = [[Sun Records|Sun]], [[RCA Records|RCA Victor]]
                  |URL = [http://www.Elvis.com www.elvis.com]
                  }}
                  '''Elvis Aaron Presley'''{{ref|fn_a|a}} (January 8, 1935 – August 16, 1977)
                  was one of the most popular American singers of the 20th century....
                  ...
                  [[Category:Actors from Mississippi]]
                  [[Category:Actors from Tennessee]]
                  [[Category:Number-one single or album artist in the UK]]
                  [[Category:American baritones]]
                  [[Category:American country singers]]


                                                                                          110
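Extracting the |Attribute = value lines of such markup is a small regex exercise; the sketch below handles only the common [[target|label]] wiki-link case, and the mapping from attribute names to relations would still have to be defined by hand or learned:

```python
import re

# A fragment of the infobox markup shown on the slide.
infobox = """\
|Birth_name = Elvis Aaron Presley
|Occupation = Musician, actor
|Instrument = Vocals, guitar, piano
|Label = [[Sun Records|Sun]], [[RCA Records|RCA Victor]]
"""

# [[target|label]] -> label, [[target]] -> target (common cases only)
LINK = re.compile(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]")

attributes = {}
for line in infobox.splitlines():
    if line.startswith("|") and "=" in line:
        key, _, value = line[1:].partition("=")
        attributes[key.strip()] = LINK.sub(r"\1", value).strip()

print(attributes["Label"])  # Sun, RCA Victor
```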
 Ontological IE: Wikipedia
Example: Elvis in YAGO




                             111
  Ontological IE: Wikipedia
                              YAGO
                              •  3m entities, 28m facts
                              •  focus on precision     95%
                                 (automatic checking of facts,
                                  manual relations,
                                  link with WordNet)
                                   http://mpii.de/yago
              DBpedia
              •  3.4m entities
              •  1b facts (also from non-English Wikipedia)
              •  large community
              http://dbpedia.org

                 Community project on top of Wikipedia data
                 (bought by Google, but still open)
                 http://freebase.com


KYLIN/KOG   Part of the Intelligence in Wikipedia project
                                                                 112
        Ontological IE: Reasoning
Goal:                                             born
                                                               1935
Extract ontological
information from natural
language documents
                                         "Elvis was born in 1935"



Main Challenges:
                                         died in, perished in, was killed in
•  deliver canonic relations
                                          Elvis, Elvis Presley, The King
•  deliver canonic entities
                                               born (Elvis, 1970)
•  deliver consistent facts                    born (Elvis, 1935)


   Idea: These problems are interleaved, solve all of them together.
                                   113
           Ontological IE: Reasoning
Ontology                 First Order Logic Formulae
                         type(Elvis_Presley,singer)
                         subclassof(singer,person)           New facts with
                         ...                                 1. canonic relations
                                                             2. canonic entities
                                                             3. consistency with
                         appears(“Elvis”,”was born in”,         the ontology
NL documents                      ”1935”)
Elvis was born in 1935   ...
                         means(“Elvis”,Elvis_Presley,0.8)                born
                                                                                1935
                         means(“Elvis”,Elvis_Costello,0.2)
                         ...
Semantic Rules
                         born(X,Y) & died(X,Z) => Y<Z
birthdate<deathdate      appears(A,P,B) & R(A,B)
                               => expresses(P,R)
                         appears(A,P,B) & expresses(P,R)
                               => R(A,B)
                         ...                                    SOFIE
                                                                system
      Ontological IE: Reasoning
A Weighted Maximum Satisfiability Problem
is a set of propositional logic formulae with weights
(can be generalized to first order logic)

  A      [10]
  A => B [5]                    A solution to a WMAXSAT problem
  -B     [10]                   is an assignment of the variables to truth values

                                The optimal solution is a solution
Solution:                       that maximizes the sum of the weights of the
A=true                          satisfied formulae.
B=true
Weight: 10+5=15

                                 Computing the optimal solution is NP-hard
Solution:
A=true                          => use a (smart) approximation algorithm
B=false
Weight: 10+10=20
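For the toy instance above, a brute-force solver makes the definition concrete (enumerating all assignments is exponential; real solvers approximate, since the problem is NP-hard):

```python
from itertools import product

# Brute force over the toy WMAXSAT instance from the slide.
# Each formula is (truth function over an assignment, weight).
formulas = [
    (lambda v: v["A"],                 10),  # A        [10]
    (lambda v: (not v["A"]) or v["B"],  5),  # A => B   [5]
    (lambda v: not v["B"],             10),  # -B       [10]
]

def weight(assignment):
    """Sum of the weights of the satisfied formulae."""
    return sum(w for f, w in formulas if f(assignment))

best = max(({"A": a, "B": b} for a, b in product([True, False], repeat=2)),
           key=weight)
print(best, weight(best))  # {'A': True, 'B': False} 20
```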
        Ontological IE: Reasoning
 A Markov Logic Program
 is a set of propositional logic formulae with weights
 (can be generalized to first order logic)
                                              ... with a probabilistic interpretation:
    A         [10]                            Every solution (possible world) has
    A => B [5]                                a certain probability
    -B        [10]
                                    sat(i,X): number of satisfied       wi: weight of
                                    instances of the ith formula        the ith formula

                                     P(X) ∝ Π e^( sat(i,X) · wi )

 [figure: probability P of the       argmax_X  Π e^( sat(i,X) · wi )
  two worlds bornIn(Elvis,         = argmax_X  log( Π e^( sat(i,X) · wi ) )
  Tupelo) = false / true]          = argmax_X  Σ sat(i,X) · wi

                                     =>  a Weighted MAX SAT problem
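The same toy formulae under the probabilistic reading: each world gets probability proportional to e^(Σ sat(i,X)·wi), and the most probable (MAP) world coincides with the WMAXSAT optimum:

```python
import math
from itertools import product

# Same toy formulae as in the WMAXSAT example.
formulas = [
    (lambda v: v["A"],                 10),
    (lambda v: (not v["A"]) or v["B"],  5),
    (lambda v: not v["B"],             10),
]

def score(world):
    """Sum_i sat(i, world) * w_i  (each formula has one instance here)."""
    return sum(w for f, w in formulas if f(world))

worlds = [{"A": a, "B": b} for a, b in product([True, False], repeat=2)]
Z = sum(math.exp(score(w)) for w in worlds)        # normalization constant
probs = {(w["A"], w["B"]): math.exp(score(w)) / Z for w in worlds}

print(max(probs, key=probs.get))  # (True, False) -- MAP world = WMAXSAT optimum
```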
        Ontological IE: Reasoning
   Reasoning-based approaches use logical rules to extract knowledge
   from natural language documents.

   Current approaches use either
   •  Weighted MAX SAT
   •  or Datalog
   •  or Markov Logic



                        Input:                                 born
                                                                        1935
                        •  often an ontology
                        •  manually designed rules
Semantic Rules

birthdate<deathdate
                        Condition:
                        •  homogeneous corpus helps



                                                                       117
                     Ontological IE
Ontological Information Extraction (IE) tries to create or
extend an ontology through information extraction.



                                 nationality




  Current hot approaches:
  •  extraction from Wikipedia
  •  reasoning-based approaches




                                                             118
                Information Extraction
  Information Extraction (IE) is the process                       and beyond
  of extracting structured information (e.g., database tables)
  from unstructured machine-readable documents
  (e.g., Web documents).

                                                        ✓
                             ✓                     Ontological
                        Fact                       Information
                        Extraction                 Extraction
   Instance
   Extraction   ✓   Person           Nationality
                    Angela Merkel German                    nationality
Named Entity
Recognition
                ✓

Tokenization&
Normalization   ✓
Source Selection ✓                                                            119
   Open Information Extraction
Information Extraction (IE) is the process
of extracting structured information (e.g., database tables)
from unstructured machine-readable documents
(e.g., Web documents).

Open Information Extraction/Machine Reading/Macro Reading
aims at information extraction from the entire Web.

Vision of Open Information Extraction:
•  the system runs perpetually, constantly gathering new information
•  the system creates meaning on its own from the gathered data
•  the system learns and becomes more intelligent,
   i.e. better at gathering information

Rationale for Open Information Extraction:
•  We do not need to care about every single sentence, just the ones
   we understand
•  The size of the Web generates redundancy
•  The size of the Web can generate synergies                            120
          Open IE: KnowItAll & Co
KnowItAll, KnowItNow and TextRunner are projects
at the University of Washington (in Seattle, WA).




Subject      Verb            Object       Count              Valuable
Egyptians    built           pyramids     400                common sense
                                                             knowledge
Americans    built           pyramids     20
                                                             (if filtered)
...          ...             ...          ...



                   http://www.cs.washington.edu/research/textrunner/   121
        Open IE: KnowItAll & Co
KnowItAll, KnowItNow and TextRunner are projects
at the University of Washington (in Seattle, WA).




               http://www.cs.washington.edu/research/textrunner/   122
        Open IE: Read the Web
“Read the Web” is a project at the Carnegie Mellon University
in Pittsburgh, PA.
                   Initial Ontology

                                 Table Extractor
   Natural Language              Krzewski       Blue Devils
   Pattern Extractor             Miller         Red Angels

   Krzewski coaches
    the Blue Devils.
                                            Mutual exclusion
                                            Learner
                                            sports coach != scientist ?
 Rule Learner
 coaches => is paid by?
                                Type Check Learner
                                 If I coach, am I a coach?

                                                  http://rtw.ml.cmu.edu/rtw/
                                                                        123
Open IE: Read the Web




              http://rtw.ml.cmu.edu/rtw/
                                    124
  Open Information Extraction
Open Information Extraction/Machine Reading/Macro Reading
aims at information extraction from the entire Web.


Main hot projects
•  TextRunner
•  Read the Web                 Input:
                                •  The Web
                                •  Read the Web: Manual rules
                                •  Read the Web: initial ontology

       Conditions
       •  none




                                                                    125
                Information Extraction
                                                                           ✓
  Information Extraction (IE) is the process                       and beyond
  of extracting structured information (e.g., database tables)
  from unstructured machine-readable documents
  (e.g., Web documents).

                                                        ✓
                             ✓                     Ontological
                        Fact                       Information
                        Extraction                 Extraction
   Instance
   Extraction   ✓   Person           Nationality
                    Angela Merkel German                    nationality
Named Entity
Recognition
                ✓

Tokenization&
Normalization   ✓
Source Selection ✓                                                            126
                       Homework
•  Write a regular expression that can recognize person names like

      Elvis Presley
      Elvis-Aaron Presley                 (if there are design choices,
      Dr. Elvis Presley                    make them and explain them)
      Prof. Dr. Elvis Presley

•  What features would you choose to recognize person names
   with the sliding window technique?

•  Name 3 different prototypical cases where a Hearst pattern extracts
   a wrong fact from a correct sentence




                                                                     127
                       Homework
•  Explain why it is a good idea (or a bad idea) to do set expansion for
     “Nuclear physicist”

•  Assume that Nostradamus predicted a world war for every
   century from 1500 to 2000 (incl.). What is his precision, what is his recall?
   (assuming that there will be no more world wars)

•  TextRunner extracts words and phrases, not entities and relations.
  Which techniques would you propose to achieve more ontological
  output?




                                                                           128