                CSA4050: Advanced Topics in NLP
                Information Extraction II
                Named Entity Recognition


December 2004        CSA4050: Information Extraction II   1
                       Sources
     – D. Appelt and D. Israel, Introduction to IE
       Technology, tutorial given at IJCAI-99
     – Mikheev et al., EACL 1999: Named Entity
       Recognition without Gazetteers
     – Daniel M. Bikel, Richard Schwartz and Ralph
       M. Weischedel. 1999. An Algorithm that
       Learns What’s in a Name


December 2004     CSA4050: Information Extraction II   2
                      Outline
• NER – what is involved
• The MUC6/7 task definition
• Two approaches:
     – Mikheev 1999 (Rule Based)
     – Bikel 1999 (NER Based on HMMs)




December 2004    CSA4050: Information Extraction II   3
   The Named Entity Recognition Task
• Named Entity task introduced as part of
  MUC-6 (1995), and continued at MUC-7
  (1998)
• Different kinds of named entity:
     – temporal expressions
     – numeric expressions
     – name expressions

December 2004     CSA4050: Information Extraction II   4
                Temporal Expressions
                   (TIMEX tag)
• DATE: complete or partial date expression
• TIME: complete or partial expression of
  time of day
• Absolute temporal expressions only, e.g.
     – "Monday"
     – "10th of October"
     – but not "first day of the month".

December 2004       CSA4050: Information Extraction II   5
            More TIMEX Examples
• "twelve o'clock noon"
  <TIMEX TYPE="TIME">twelve o'clock
  noon</TIMEX>
• "January 1990"
  <TIMEX TYPE="DATE">January 1990</TIMEX>
• "third quarter of 1991"
  <TIMEX TYPE="DATE">third quarter of
  1991</TIMEX>
• "the fourth quarter ended Sept. 30"
  <TIMEX TYPE="DATE">the fourth quarter ended
  Sept. 30</TIMEX>
December 2004   CSA4050: Information Extraction II   6
    Time Expressions - Difficulties
• Problems interpreting some task specs:
  “Relative time expressions are not to be tagged,
  but any absolute times expressed as part of the
  entire expression are to be tagged”
  – this <TIMEX TYPE="DATE">June</TIMEX>
  – thirty days before the end of the year (no markup)
  – the end of <TIMEX TYPE="DATE">1991</TIMEX>



 December 2004     CSA4050: Information Extraction II    7
                Temporal Expressions
• DATE/TIME distinction relatively
  straightforward to handle
• Can typically be captured by Regular
  Expressions
• Need to handle missing elements properly,
  e.g. Jan 21st → Jan 21st 2002
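
As an illustration of the regular-expression approach, here is a minimal Python sketch (the pattern, function name and the default year are invented for this example, not part of any MUC system): it finds simple month-day(-year) dates and fills in a default year when the year is missing.

  # Minimal sketch of regex-based DATE recognition with a default year
  # supplied for missing elements (illustrative only).
  import re

  MONTH = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?"
  DATE = re.compile(rf"\b({MONTH})\s+(\d{{1,2}})(?:st|nd|rd|th)?(?:,?\s+(\d{{4}}))?")

  def find_dates(text, default_year="2002"):
      # Return (surface string, normalised form) pairs; a real system would
      # take the default year from document context, not a constant.
      out = []
      for m in DATE.finditer(text):
          month, day, year = m.group(1), m.group(2), m.group(3) or default_year
          out.append((m.group(0), f"{day} {month} {year}"))
      return out

  print(find_dates("The report is due Jan 21st; the quarter ended Sept. 30 1991."))
  # [('Jan 21st', '21 Jan 2002'), ('Sept. 30 1991', '30 Sept. 1991')]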


December 2004       CSA4050: Information Extraction II   8
                Number Expressions
                   (NUMEX)
• Monetary expressions
• Percentages.
• Numbers may be expressed in either
  numeric or alphabetic form.
• Categorized as “MONEY” or “PERCENT”
  via the TYPE attribute.


December 2004       CSA4050: Information Extraction II   9
                 NUMEX Tag
• The entire string is to be tagged.
  <NUMEX TYPE="MONEY">20 million New
  Pesos</NUMEX>
• Modifying words are to be excluded from the NUMEX
  tag.
   over <NUMEX TYPE="MONEY">$90,000</NUMEX>
• Nested tags allowed
   <NUMEX TYPE="MONEY"><ENAMEX
   TYPE="LOCATION">US</ENAMEX>$43.6 million</NUMEX>
• Numeric expressions that do not use currency/percentage
  terms are not to be tagged.
  12 points (no markup)
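
A minimal sketch of how MONEY and PERCENT spans could be wrapped in NUMEX markup (the patterns are illustrative and cover far less than the MUC-6/7 specification). Modifiers such as "over" or "about" fall outside the patterns and so stay outside the tag, as the guideline requires.

  # Illustrative NUMEX tagging for money and percentage expressions.
  import re

  PERCENT = re.compile(r"\d+(?:\.\d+)?\s?%")
  MONEY = re.compile(r"\$\s?\d[\d,]*(?:\.\d+)?(?:\s(?:million|billion))?")

  def tag_numex(text):
      text = PERCENT.sub(lambda m: f'<NUMEX TYPE="PERCENT">{m.group(0)}</NUMEX>', text)
      text = MONEY.sub(lambda m: f'<NUMEX TYPE="MONEY">{m.group(0)}</NUMEX>', text)
      return text

  print(tag_numex("Profits fell about 5% to over $90,000."))
  # Profits fell about <NUMEX TYPE="PERCENT">5%</NUMEX>
  # to over <NUMEX TYPE="MONEY">$90,000</NUMEX>.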


December 2004      CSA4050: Information Extraction II       10
                NUMEX Examples
• "about 5%"
  about <NUMEX TYPE="PERCENT">5%</NUMEX>
• "over $90,000"
  over <NUMEX TYPE="MONEY">$90,000</NUMEX>
• "several million dollars"
  <NUMEX TYPE="MONEY" ALT="million
  dollars">several million dollars</NUMEX>
• "US$43.6 million"
  <NUMEX TYPE="MONEY">
  <ENAMEX TYPE="LOCATION">US</ENAMEX>
  $43.6 million</NUMEX>

December 2004     CSA4050: Information Extraction II   11
                Name Expressions
• Two related subtasks:
     – Identification – which piece of text
     – Classification – what kind of name




December 2004      CSA4050: Information Extraction II   12
             Name Recognition
      Identification and Classification
• The delegation, which included the commander of
  the U.N. troops in Bosnia, Lt. Gen. Sir Michael
  Rose, went to the Serb stronghold of Pale, near
  Sarajevo, for talks with Bosnian Serb leader
  Radovan Karadzic.
     – Locations
     – Persons
     – Organizations


December 2004          CSA4050: Information Extraction II   13
                Annotator Guidelines
   TYPE                       DESCRIPTION
Organisation       named corporate, governmental, or
                   other organizational entity
Person             named person or family
Location           name of a politically or
                   geographically defined location
                   (cities, provinces, countries,
                   international regions, bodies of
                   water, mountains, etc.)


December 2004       CSA4050: Information Extraction II   14
                 MUC-6 Output Format

• Output in terms of SGML markup
 <ENAMEX TYPE="ORGANIZATION">Taga Co.</ENAMEX>


       (ENAMEX is the tag; TYPE="ORGANIZATION" is the type attribute)




 December 2004       CSA4050: Information Extraction II   15
                Name Expressions
                   Problems
• Recognition
     – Sentence initial uppercase is unreliable
• Delimitation
     – Conjunctions: to bind or not to bind
       Victoria and Albert (Museum)
• Type Ambiguity
     – Persons versus Organisations versus Locations, e.g.
       J. Arthur Rank
       Washington

December 2004         CSA4050: Information Extraction II     16
                 Example 2
1. MATSUSHITA ELECTRIC INDUSTRIAL CO . HAS REACHED AGREEMENT …
2. IF ALL GOES WELL, MATSUSHITA AND ROBERT BOSCH WILL …
3. VICTOR CO. OF JAPAN ( JVC ) AND SONY CORP.
4. IN A FACTORY OF BLAUPUNKT WERKE , A ROBERT BOSCH SUBSIDIARY , …
5. TOUCH PANEL SYSTEMS , CAPITALIZED AT 50 MILLION YEN, IS OWNED …
6. MATSUSHITA EILL DECIDE ON THE PRODUCTION SCALE. …

December 2004   CSA4050: Information Extraction II    17
                 Example 2
1. EASY – keyword present
2. EASY – shortened form is computable
3. EASY – acronym is computable
4. HARD – difficult to tell ROBERT BOSCH is an organisation name
5. HARD – cf. 4.
6. HARD – spelling error difficult to spot.
December 2004   CSA4050: Information Extraction II   18
                 Name Expressions:
                Sources of Information
• Occurrence specific
     – capitalisation; presence of immediately
       surrounding clue words (e.g. Mr.)
• Document specific
     – Previous mention of a name (cf. symbol tables)
     – same document; same collection
• External
     – Gazetteers: e.g. person names; place names; zip
       codes.
December 2004        CSA4050: Information Extraction II   19
                 Gazetteers
• System that recognises only entities stored
  in its lists (gazetteers).
• Advantages - Simple, fast, language
  independent, easy to retarget (just create
  lists)
• Disadvantages – impossible to enumerate
  all names, cannot deal with name variants,
  cannot resolve ambiguity.
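
A minimal sketch of pure gazetteer lookup (longest match wins); the lists and the sentence are invented. It also shows the ambiguity problem: lookup alone cannot decide whether "Washington" is a location, a person or an organisation.

  # Gazetteer-only tagging: longest known span starting at each position wins.
  GAZETTEER = {
      "Washington": "LOCATION",          # also a person/organisation name
      "New York": "LOCATION",
      "Robert Bosch": "ORGANIZATION",
      "Radovan Karadzic": "PERSON",
  }

  def gazetteer_tag(tokens):
      tags, i = [], 0
      while i < len(tokens):
          for j in range(len(tokens), i, -1):   # try the longest span first
              span = " ".join(tokens[i:j])
              if span in GAZETTEER:
                  tags.append((span, GAZETTEER[span]))
                  i = j
                  break
          else:
              tags.append((tokens[i], "O"))
              i += 1
      return tags

  print(gazetteer_tag("Mr. Washington visited New York".split()))
  # [('Mr.', 'O'), ('Washington', 'LOCATION'), ('visited', 'O'), ('New York', 'LOCATION')]
  # "Washington" is forced to LOCATION whether or not that is right here.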

December 2004   CSA4050: Information Extraction II   20
                 Gazetteers
• Limited availability
• Maintenance (organisations change)
• Criteria for building effective gazetteers
  unclear, e.g. size, but
• Better to use small gazetteers of well-
  known names than large ones of low-
  frequency names (Mikheev et al. 1999).

December 2004   CSA4050: Information Extraction II   21
   Sources for Creation of Gazetteers
• Yellow pages for person and organisation
  names.
• US GEOnet Names Server (GNS) data – 3.9
  million locations with 5.37 million names
  http://earth-info.nga.mil/gns/html/
• UN site: http://unstats.un.org/unsd/citydata
• Automatic collection from annotated
  training data
December 2004   CSA4050: Information Extraction II   22
                Recognising Names
• Two main approaches
• Rule Based System
     – Usually based on finite-state (FS) methods
• Automatically trained system
     – Usually based on HMMs
• Rule based systems tend to have a
  performance advantage
December 2004      CSA4050: Information Extraction II   23
                Mikheev et al 1999
• How important are gazetteers?
• Is it important that they are big?
• If gazetteers are important but their size
  isn't, what are the criteria for building
  them?


December 2004      CSA4050: Information Extraction II   24
                Mikheev – Experiment
• Learned List
     – Training data (200 articles from MUC7)
     – 1228 persons, 809 Organisations, 770
       Locations
• Common Lists
     – CIA World Factbook
     – 33K Organisations, 27K persons, 5K Locations
• Combined
December 2004       CSA4050: Information Extraction II   25
Mikheev – Results of Experiment




December 2004   CSA4050: Information Extraction II   26
                Mikheev’s System
• Hybrid approach – c. 100 rules
• Rules make heavy use of capitalisation
• Rules based on internal structure which reveals
  the type e.g.
  Word Word plc
  Prof. Word Word
• Modest but well-chosen gazetteer - 5000
  Company Names, 1000 Human Names, 20,000
  Locations, 2-3 weeks effort

December 2004      CSA4050: Information Extraction II   27
  Mikheev et al. (1999): Architecture
Processing stages (applied in sequence):
1. Sure-fire Rules
2. Partial Match 1
3. Rule Relaxation
4. Partial Match 2
5. Title Assignment
December 2004        CSA4050: Information Extraction II      28
                    Sure-Fire Rules
• Fire when a possible candidate expression is
  surrounded by a suggestive context
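
The table of sure-fire contexts is not reproduced on this slide; the patterns below are invented to show the general shape of such a rule (a capitalised candidate plus an unambiguous surrounding context), not the published rule set.

  # Illustrative "sure-fire" contextual rules.
  import re

  CAND = r"(?:[A-Z][A-Za-z.&-]+ ?)+"
  SURE_FIRE = [
      # candidate followed by an appositive naming its type
      (re.compile(rf"({CAND}), the \w+ (?:company|group)"), "ORGANIZATION"),
      # a job title immediately before the candidate
      (re.compile(rf"(?:chairman|president|analyst) ({CAND})"), "PERSON"),
  ]

  def apply_sure_fire(text):
      return [(m.group(1).strip(), label)
              for pattern, label in SURE_FIRE
              for m in pattern.finditer(text)]

  print(apply_sure_fire("Gerrard, the auction company, was sold; analyst Sally Smith agreed."))
  # [('Gerrard', 'ORGANIZATION'), ('Sally Smith', 'PERSON')]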




    December 2004     CSA4050: Information Extraction II   29
                Partial Match 1
• Collect all named entities already identified, e.g.:
  Adam Kluver Ltd.
• Generate all subsequences: Adam, Adam Kluver;
  Kluver, Kluver Ltd, Ltd.
• Check for occurrences of subsequences and mark
  as possible items of the same class as the original
  named entity
• Check against pre-trained maximum entropy
  model.
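
A sketch of the subsequence-generation step, using the slide's own example; the maximum entropy check that follows it is not shown here.

  # Generate every contiguous subsequence of an already identified name.
  def subsequences(name):
      words = name.split()
      return {" ".join(words[i:j])
              for i in range(len(words))
              for j in range(i + 1, len(words) + 1)} - {name}

  print(sorted(subsequences("Adam Kluver Ltd")))
  # ['Adam', 'Adam Kluver', 'Kluver', 'Kluver Ltd', 'Ltd']
  # Later occurrences of these strings are provisionally given the class of
  # the full name, then confirmed or rejected by the maximum entropy model.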

December 2004     CSA4050: Information Extraction II   30
         Maximum Entropy Model
• This model takes into account contextual
  information for named entities
     – sentence position
     – whether they exist in lowercase in general
     – used in lowercase elsewhere in the same document, etc.
• These features are passed to the model as
  attributes of the partially matched words.
• If the model provides a positive answer for a
  partial match, the system makes a definite
  assignment.
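
A sketch of the kind of feature bundle that might be passed to the filter for a partially matched word; the feature names are invented for illustration.

  # Illustrative contextual features for a partially matched word.
  def partial_match_features(word, sentence_index, document_text):
      return {
          "sentence_initial": sentence_index == 0,
          "seen_lowercase_in_document": word.lower() in document_text.split(),
          "is_capitalised": word[:1].isupper(),
      }

  doc = "Riders wore white hats . White was the team colour ."
  print(partial_match_features("White", sentence_index=0, document_text=doc))
  # {'sentence_initial': True, 'seen_lowercase_in_document': True, 'is_capitalised': True}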

December 2004        CSA4050: Information Extraction II     31
                Rule Relaxation
• More relaxed contextual constraints
• Make use of information from existing
  markup and from previous stages to
     – Resolve conjunctions within named entities
       e.g. China Import and Export Co.
     – Resolve ambiguity of e.g.
       Murdoch’s News Corp


December 2004      CSA4050: Information Extraction II   32
                Partial Match 2
• Handle single word names not covered by
  partial match 1 (e.g. Hughes – Hughes
  Communication Ltd)
• United States and Russia: if there is evidence
  for two items XXX and YYY and one of them has
  already been tagged “Location”, then XXX and
  YYY are likely to be of the same type. Hence
  conclude that United States is of type Location.
December 2004    CSA4050: Information Extraction II   33
                Title Assignment
• Newswire titles are uppercase
• Mark up entities in title by matching or
  partially matching entities found in text




December 2004     CSA4050: Information Extraction II   34
          Mikheev: System Results




December 2004   CSA4050: Information Extraction II   35
                Use of Gazetteers




December 2004     CSA4050: Information Extraction II   36
                Mikheev - Conclusions
• Locations suffer without gazetteers, but the
  addition of a small number of certain entries
  (e.g. country names) makes a big difference.
• Main point: relatively small gazetteers are
  sufficient to give good precision and recall.
• Experiments were carried out on one particular
  type of text (journalistic English with mixed case).

December 2004        CSA4050: Information Extraction II   37
      Bikel 99 - Trainable Systems
        Hidden Markov Models
• An HMM is a probabilistic model of a
  sequence of events – in this case words.
• Whether a word is part of a name is an event
  whose probability can be estimated from a
  training corpus.
• With an HMM we assume that there is an underlying
  probabilistic FSM that changes state with each
  input event.
• The probability that a word is part of a name is
  also conditional on the state of the machine.
December 2004    CSA4050: Information Extraction II   38
                Creating HMMs
• Constructing an HMM depends upon:
     – having a good hidden state model
     – having enough training data to estimate the
       probabilities of the state transitions given
       sequences of words.
• When the recogniser is run, it computes the
  maximum likelihood path through the hidden state
  model, given the input word sequence.
• The Viterbi algorithm finds this path.
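
A minimal Viterbi decoder over name-class states, with toy hand-set probabilities (purely illustrative; Bikel's model conditions on a richer history and on word features). On the "Mr. Jones eats" example used later in these slides it recovers the expected NOT-A-NAME, PERSON, NOT-A-NAME labelling.

  # Toy Viterbi decoding over name-class states.
  def viterbi(words, states, start_p, trans_p, emit_p):
      # best[t][s] = probability of the best path ending in state s at word t
      best = [{s: start_p[s] * emit_p[s].get(words[0], 1e-6) for s in states}]
      back = [{}]
      for t in range(1, len(words)):
          best.append({})
          back.append({})
          for s in states:
              prob, prev = max(
                  (best[t - 1][r] * trans_p[r][s] * emit_p[s].get(words[t], 1e-6), r)
                  for r in states)
              best[t][s], back[t][s] = prob, prev
      state = max(best[-1], key=best[-1].get)     # best final state
      path = [state]
      for t in range(len(words) - 1, 0, -1):      # trace back to the start
          state = back[t][state]
          path.insert(0, state)
      return path

  states = ("PERSON", "NOT-A-NAME")
  start_p = {"PERSON": 0.2, "NOT-A-NAME": 0.8}
  trans_p = {"PERSON": {"PERSON": 0.5, "NOT-A-NAME": 0.5},
             "NOT-A-NAME": {"PERSON": 0.3, "NOT-A-NAME": 0.7}}
  emit_p = {"PERSON": {"Mr.": 0.01, "Jones": 0.2, "eats": 0.001},
            "NOT-A-NAME": {"Mr.": 0.1, "Jones": 0.001, "eats": 0.05}}
  print(viterbi(["Mr.", "Jones", "eats"], states, start_p, trans_p, emit_p))
  # ['NOT-A-NAME', 'PERSON', 'NOT-A-NAME']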

December 2004    CSA4050: Information Extraction II   39
         The HMM for NER (Bikel)
   [State diagram] States: start-of-sentence, person,
   organisation, (other name classes), not-a-name,
   end-of-sentence.

 December 2004      CSA4050: Information Extraction II               40
                Name Class Categories
• Eight Name Classes + not-a-name (NAN).
• Within each name class, a bigram
  language model is used (the number of states
  in each class is |V|, the vocabulary size).
• Aim, for a given sentence, is to find the
  most likely sequence of name-classes (NC)
  given a sequence of words (W):
• NC = argmax(P(NC|W))
December 2004        CSA4050: Information Extraction II   41
        Model of Word Production
• Select a name class NC, conditioning on the previous
  name-class (NC-1) and the previous word (w-1).
• Generate the first word inside NC, conditioning on
  NC and NC-1.
• Generate all subsequent words inside NC, where each
  subsequent word is conditioned on its immediate
  predecessor (a standard bigram language model).


December 2004   CSA4050: Information Extraction II   42
                     Example
• Sentence: Mr. Jones eats
• According to MUC-6 rules, correct
  labelling is
   Mr. <ENAMEX TYPE=PERSON>Jones</ENAMEX> eats.
   NAN                     PERSON          NAN
• According to the model, the likelihood of this
  word/name-class sequence is given by the
  following expression (which should turn out to
  be the most likely, given sufficient training).
December 2004     CSA4050: Information Extraction II   43
      Likelihood Under the Model
Pr(NOT-A-NAME | START-OF-SENTENCE, “+end+”) *
Pr(“Mr.” | NOT-A-NAME, START-OF-SENTENCE) *
Pr(+end+ | “Mr.”, NOT-A-NAME) *
Pr(PERSON | NOT-A-NAME, “Mr.”) *
Pr(“Jones” | PERSON, NOT-A-NAME) *
Pr(+end+ | “Jones”, PERSON) *
Pr(NOT-A-NAME | PERSON, “Jones”) *
Pr(“eats” | NOT-A-NAME, PERSON) *
Pr(“.” | “eats”, NOT-A-NAME) *
Pr(+end+ | “.”, NOT-A-NAME) *
Pr(END-OF-SENTENCE | NOT-A-NAME, “.”)
December 2004   CSA4050: Information Extraction II   44
         Words and Word Features
• Word features are a language dependent part of the model

twoDigitNum              90                    Two digit year
fourDigitNum             1990                  Four digit year
containsDigitAndAlpha    A8956-67              Product code
containsDigitAndDash     09-96                 Date
containsDigitAndSlash    11/9/89               Date
containsDigitAndComma    23,000.00             Monetary amount
containsDigitAndPeriod   1.00                  Monetary amount
allCaps                  BBN                   Organization
capPeriod                M.                    Person name initial
initCap                  Sally                 Capitalized word
other                    ,                     punctuation, all other words
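
A sketch of the feature classes in the table above as a single first-match-wins function; the paper's full feature set is slightly larger (it also distinguishes, for example, lower-case words and the first word of a sentence).

  # Word-feature classifier covering the classes listed in the table.
  import re

  def word_feature(word):
      # Order matters: the first matching class wins.
      if re.fullmatch(r"\d{2}", word):                       return "twoDigitNum"
      if re.fullmatch(r"\d{4}", word):                       return "fourDigitNum"
      if any(c.isdigit() for c in word) and any(c.isalpha() for c in word):
          return "containsDigitAndAlpha"
      if any(c.isdigit() for c in word) and "-" in word:     return "containsDigitAndDash"
      if any(c.isdigit() for c in word) and "/" in word:     return "containsDigitAndSlash"
      if any(c.isdigit() for c in word) and "," in word:     return "containsDigitAndComma"
      if any(c.isdigit() for c in word) and "." in word:     return "containsDigitAndPeriod"
      if word.isalpha() and word.isupper():                  return "allCaps"
      if re.fullmatch(r"[A-Z]\.", word):                     return "capPeriod"
      if word[:1].isupper() and word[1:].islower():          return "initCap"
      return "other"

  for w in ["90", "1990", "A8956-67", "09-96", "11/9/89",
            "23,000.00", "1.00", "BBN", "M.", "Sally", ","]:
      print(w, "->", word_feature(w))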

December 2004       CSA4050: Information Extraction II                       45
                Three Sub Models
• Model to generate a name class
• Model to generate first word
• Model to generate subsequent words




December 2004     CSA4050: Information Extraction II   46
                How the Model Works


                  1. Model to generate a name class
                  2. Model to generate the first word in the class
                  3. Model to generate subsequent words in the class




December 2004         CSA4050: Information Extraction II   47
        Generate First Word in NC
• Likelihood =
  P(transition from NC-1 to NC) *
  P(generate first word w)
  = P(NC | NC-1, w-1) * P(<w,f>first | NC, NC-1)
• N.B. Underlying Intuitions
     – Transition to NC strongly influenced by previous word
       and previous word class
     – First word of a name class strongly influenced by
       preceding word class.

December 2004        CSA4050: Information Extraction II    48
          Generate Subsequent Words
                in Name Class
• Here there are two cases:
     – Normal – likelihood of w following w-1 within
       a particular NC.
       P(<w,f> | <w,f>-1,NC )
     – Final word – likelihood of w in NC being the
       final word of the class. This uses a
       distinguished “+end+” word with features
       “other”
       P(<+end+,other> | <w,f>final,NC)

December 2004     CSA4050: Information Extraction II   49
            Estimating Probabilities
• P(NC|NC-1,w-1) =
  c(NC,NC-1,w-1) / c(NC-1,w-1)
• P(<w,f>first|NC,NC-1) =
  c(<w,f>first,NC,NC-1)/c(NC,NC-1)
• P(<w,f>|<w,f>-1,NC) =
  c(<w,f>,<w,f>-1,NC)/c(<w,f>-1,NC)
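
These maximum-likelihood estimates are just ratios of counts. A minimal sketch over a toy list of training events (illustrative data only):

  # Count-based estimate of P(NC | NC-1, w-1).
  from collections import Counter

  # training events: (name_class, previous_class, previous_word)
  events = [("PERSON", "NOT-A-NAME", "Mr."),
            ("PERSON", "NOT-A-NAME", "Mr."),
            ("NOT-A-NAME", "NOT-A-NAME", "Mr."),
            ("ORGANIZATION", "NOT-A-NAME", "at")]

  c_full = Counter(events)                              # c(NC, NC-1, w-1)
  c_hist = Counter((nc1, w1) for _, nc1, w1 in events)  # c(NC-1, w-1)

  def p_name_class(nc, nc_prev, w_prev):
      return c_full[(nc, nc_prev, w_prev)] / c_hist[(nc_prev, w_prev)]

  print(p_name_class("PERSON", "NOT-A-NAME", "Mr."))    # 2/3 ≈ 0.67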


December 2004     CSA4050: Information Extraction II   50
  Backoff Models and Smoothing
• System knows about all words/bigrams
  encountered during training.
• However, in real applications, unknown
  words are also encountered, and mapped to
  _UNK_
• System must therefore handle bigram
  probabilities involving _UNK_:
• as first word, as second word, as both.
December 2004   CSA4050: Information Extraction II   51
Constructing Unknown Word Model
• Based on "held out" data.
• Divide data into 2 halves.
• Use first half to create vocabulary, and train
  on second half.
• When performing name recognition, the
  unknown word model is used whenever
  either or both words of a bigram are
  unknown.
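
A sketch of the held-out construction: the vocabulary is taken from one half of the training data, and every out-of-vocabulary word in the other half is mapped to _UNK_ (toy data, illustrative only).

  # Map out-of-vocabulary words in the held-out half to _UNK_.
  def map_unknowns(first_half_tokens, second_half_tokens):
      vocabulary = set(first_half_tokens)
      return [w if w in vocabulary else "_UNK_" for w in second_half_tokens]

  half1 = "Mr. Jones eats at the cafe".split()
  half2 = "Mr. Smith eats at the diner".split()
  print(map_unknowns(half1, half2))
  # ['Mr.', '_UNK_', 'eats', 'at', 'the', '_UNK_']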
December 2004   CSA4050: Information Extraction II   52
                Backoff Strategy
• However, even with the unknown word model, it
  is possible to be faced with a bigram that has
  never been encountered. In this case a backoff
  strategy is used.
• Underlying such a strategy is a series of
  fallback models.
• Data for successive members of the series
  are easier to obtain, but of lower quality.

December 2004     CSA4050: Information Extraction II   53
                 Backoff Models for
                Names Class Bigrams
                   P(NC | NC-1,w-1)
                           |
                     P(NC | NC-1)
                           |
                        P(NC)
                           |
              1/(number of name classes)
December 2004       CSA4050: Information Extraction II   54
                Backoff Weighting
• The weight for each backoff model is
  computed on the fly.
• If computing P(X|Y), assign weight λ to the
  direct estimate and weight (1 − λ) to the
  backoff model, where

   λ = (1 − old c(Y)/c(Y)) × 1/(1 + unique outcomes of Y / c(Y))
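
A sketch of the interpolation, under the reading of λ given above; the argument names and the toy numbers are illustrative, not values from the paper.

  # Interpolate a direct estimate with its backoff estimate using λ.
  def backoff_weight(count_y, old_count_y, unique_outcomes_y):
      return (1 - old_count_y / count_y) / (1 + unique_outcomes_y / count_y)

  def smoothed_probability(direct_estimate, backoff_estimate,
                           count_y, old_count_y, unique_outcomes_y):
      lam = backoff_weight(count_y, old_count_y, unique_outcomes_y)
      return lam * direct_estimate + (1 - lam) * backoff_estimate

  # Toy numbers: history Y seen 100 times, 20 in the "old" model, 10 outcomes.
  print(smoothed_probability(0.5, 0.1, count_y=100,
                             old_count_y=20, unique_outcomes_y=10))  # ≈ 0.39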


December 2004       CSA4050: Information Extraction II    55
                 Results of Evaluation
                   Language             Best Rules Identifinder

Mixed Case          En (WSJ)                  96.4        94.9

Upper Case          En (WSJ)                    89        93.6

Speech Form         En (WSJ)                    74        90.7

Mixed Case             Sp                       93        90


 December 2004       CSA4050: Information Extraction II          56
       How Much Data is Needed?
• Performance increase of 1.5 F-points for
  each doubling in the quantity of training
  data.
• 1.2 million words of training data (about 200
  hours of broadcast news, or 1777 Wall Street
  Journal articles) represent roughly 20
  person-weeks of annotation effort.


December 2004   CSA4050: Information Extraction II   57
                Bikel - Conclusion
•   Old-fashioned techniques
•   A simple probabilistic model
•   Near-human performance
•   Higher F-measure than any other system
    when case information is missing.



December 2004      CSA4050: Information Extraction II   58

				