CSA4050: Advanced Topics in NLP
Information Extraction II
Named Entity Recognition
December 2004 CSA4050: Information Extraction II 1
Sources
– D. Appelt and D. Israel, Introduction to IE
Technology, tutorial given at IJCAI99
– Mikheev et al EACL 1999: Named Entity
Recognition without Gazetteers
– Daniel M. Bikel, Richard Schwartz and Ralph
M. Weischedel. 1999. An Algorithm that
Learns What’s in a Name
Outline
• NER – what is involved
• The MUC6/7 task definition
• Two approaches:
– Mikheev 1999 (Rule Based)
– Bikel 1999 (NER Based on HMMs)
The Named Entity Recognition Task
• Named Entity task introduced as part of
MUC-6 (1995), and continued at MUC-7
(1998)
• Different kinds of named entity:
– temporal expressions
– numeric expressions
– name expressions
Temporal Expressions
(TIMEX tag)
• DATE: complete or partial date expression
• TIME: complete or partial expression of
time of day
• Absolute temporal expressions only, i.e.
– "Monday"
– "10th of October"
– but not "first day of the month".
More TIMEX Examples
• "twelve o'clock noon"
<TIMEX TYPE="TIME">twelve o'clock
noon</TIMEX>
• "January 1990"
<TIMEX TYPE="DATE">January 1990</TIMEX>
• "third quarter of 1991"
<TIMEX TYPE="DATE">third quarter of
1991</TIMEX>
• "the fourth quarter ended Sept. 30"
<TIMEX TYPE="DATE">the fourth quarter ended
Sept. 30</TIMEX>
Time Expressions - Difficulties
• Problems interpreting some task specs:
“Relative time expressions are not to be tagged,
but any absolute times expressed as part of the
entire expression are to be tagged”
– this <TIMEX TYPE="DATE">June</TIMEX>
– thirty days before the end of the year (no markup)
– the end of <TIMEX TYPE="DATE">1991</TIMEX>
Temporal Expressions
• DATE/TIME distinction relatively
straightforward to handle
• Can typically be captured by Regular
Expressions
• Need to handle missing elements properly
e.g. "Jan 21st" as well as "Jan 21st 2002"
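A minimal sketch of the regular-expression approach, assuming a hand-written month pattern and treating the day and the year as the optional "missing elements". The names MONTH, DAY, YEAR, DATE_RE and tag_dates are illustrative and do not come from the MUC guidelines; expressions such as "third quarter of 1991" are not covered.

```python
import re

# Hypothetical patterns for illustration only.
MONTH = (r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|"
         r"Jul(?:y)?|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|"
         r"Nov(?:ember)?|Dec(?:ember)?)\.?")
DAY = r"\d{1,2}(?:st|nd|rd|th)?"
YEAR = r"\d{4}"

# The month is required; the day and the year may each be missing.
DATE_RE = re.compile(rf"\b{MONTH}(?:\s+{DAY})?(?:,?\s+{YEAR})?(?!\w)")

def tag_dates(text):
    """Wrap every matched date expression in a TIMEX tag of type DATE."""
    return DATE_RE.sub(
        lambda m: f'<TIMEX TYPE="DATE">{m.group(0)}</TIMEX>', text
    )

print(tag_dates("The talks on Jan 21st 2002 followed those of Jan 21st."))
```

A pattern like this will also tag "May" when it is used as a modal verb at the start of a sentence, which is one reason real systems add context checks.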
Number Expressions
(NUMEX)
• Monetary expressions
• Percentages.
• Numbers may be expressed in either
numeric or alphabetic form.
• Categorized as “MONEY” or “PERCENT”
via the TYPE attribute.
NUMEX Tag
• The entire string is to be tagged.
<NUMEX TYPE="MONEY">20 million New
Pesos</NUMEX>
• Modifying words are to be excluded from the NUMEX
tag.
over <NUMEX TYPE="MONEY">$90,000</NUMEX>
• Nested tags allowed
<NUMEX TYPE="MONEY"><ENAMEX
TYPE="LOCATION">US</ENAMEX>$43.6 million</NUMEX>
• Numeric expressions that do not use currency/percentage
terms are not to be tagged.
12 points (no markup)
NUMEX Examples
• "about 5%"
about <NUMEX TYPE="PERCENT">5%</NUMEX>
• "over $90,000"
over <NUMEX TYPE="MONEY">$90,000</NUMEX>
• "several million dollars"
<NUMEX TYPE="MONEY" ALT="million
dollars">several million dollars</NUMEX>
• "US$43.6 million"
<NUMEX TYPE="MONEY">
<ENAMEX TYPE="LOCATION">US</ENAMEX>
$43.6 million</NUMEX>
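The simpler NUMEX cases can be approximated with regular expressions too. The sketch below is an assumption-laden illustration: NUMBER, SCALE, MONEY_RE, PERCENT_RE and tag_numex are hypothetical names, only dollar amounts and percentages written with digits are handled, and nested cases such as "US$43.6 million" or alphabetic forms such as "several million dollars" are out of scope.

```python
import re

# Hypothetical, deliberately small patterns.
NUMBER = r"\d+(?:,\d{3})*(?:\.\d+)?"
SCALE = r"(?:\s+(?:thousand|million|billion))?"
MONEY_RE = re.compile(rf"\$\s?{NUMBER}{SCALE}|{NUMBER}{SCALE}\s+dollars")
PERCENT_RE = re.compile(rf"{NUMBER}\s?(?:%|percent\b)")

def tag_numex(text):
    """Tag money and percentage expressions; modifiers such as 'over' stay outside."""
    text = MONEY_RE.sub(
        lambda m: f'<NUMEX TYPE="MONEY">{m.group(0)}</NUMEX>', text
    )
    return PERCENT_RE.sub(
        lambda m: f'<NUMEX TYPE="PERCENT">{m.group(0)}</NUMEX>', text
    )

print(tag_numex("over $90,000"))
print(tag_numex("about 5%"))
print(tag_numex("12 points"))  # no currency or percentage term, so no markup
```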
Name Expressions
• Two related subtasks:
– Identification – which piece of text
– Classification – what kind of name
Name Recognition
Identification and Classification
• The delegation, which included the commander of
the U.N. troops in Bosnia, Lt. Gen. Sir Michael
Rose, went to the Serb stronghold of Pale, near
Sarajevo, for talks with Bosnian Serb leader
Radovan Karadzic.
– Locations
– Persons
– Organizations
Annotator Guidelines
TYPE          DESCRIPTION
Organisation  Named corporate, governmental, or other organizational entity
Person        Named person or family
Location      Name of a politically or geographically defined location
              (cities, provinces, countries, international regions,
              bodies of water, mountains, etc.)
MUC-6 Output Format
• Output in terms of SGML markup
<ENAMEX TYPE="ORGANIZATION">Taga Co.</ENAMEX>
– tag: ENAMEX; attribute: TYPE; type: "ORGANIZATION"
Name Expressions
Problems
• Recognition
– Sentence initial uppercase is unreliable
• Delimitation
– Conjunctions: to bind or not to bind
Victoria and Albert (Museum)
• Type Ambiguity
– Persons versus Organisations versus Locations, e.g.
J. Arthur Rank
Washington
Example 2
1. MATSUSHITA ELECTRIC INDUSTRIAL CO. HAS REACHED AGREEMENT …
2. IF ALL GOES WELL, MATSUSHITA AND ROBERT BOSCH WILL …
3. VICTOR CO. OF JAPAN (JVC) AND SONY CORP.
4. IN A FACTORY OF BLAUPUNKT WERKE, A ROBERT BOSCH SUBSIDIARY, …
5. TOUCH PANEL SYSTEMS, CAPITALIZED AT 50 MILLION YEN, IS OWNED …
6. MATSUSHITA EILL DECIDE ON THE PRODUCTION SCALE. …
Example 2
1. EASY – keyword present
2. EASY – shortened form is computable
3. EASY – acronym is computable
4. HARD – difficult to tell ROBERT BOSCH is an organisation name
5. HARD – cf. 4.
6. HARD – spelling error difficult to spot.
Name Expressions:
Sources of Information
• Occurrence specific
– capitalisation; presence of immediately
surrounding clue words (e.g. Mr.)
• Document specific
– Previous mention of a name (cf. symbol tables)
– same document; same collection
• External
– Gazetteers: e.g. person names; place names; zip
codes.
Gazetteers
• System that recognises only entities stored
in its lists (gazetteers).
• Advantages - Simple, fast, language
independent, easy to retarget (just create
lists)
• Disadvantages – impossible to enumerate
all names, cannot deal with name variants,
cannot resolve ambiguity.
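A gazetteer-only recogniser amounts to list lookup. The sketch below uses greedy longest-match lookup over tokens; the GAZETTEER entries, MAX_LEN and lookup are made-up illustrations, not data from any of the cited systems.

```python
# Hypothetical toy gazetteer: token tuples mapped to an entity type.
GAZETTEER = {
    ("Washington",): "LOCATION",
    ("Sarajevo",): "LOCATION",
    ("Robert", "Bosch"): "ORGANIZATION",
}
MAX_LEN = max(len(key) for key in GAZETTEER)

def lookup(tokens):
    """Return (start, end, type) spans found by greedy longest-match lookup."""
    spans, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            entity_type = GAZETTEER.get(tuple(tokens[i:i + n]))
            if entity_type:
                spans.append((i, i + n, entity_type))
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return spans

print(lookup("Robert Bosch met Washington".split()))
```

The example also shows the ambiguity problem: "Washington" is always returned as a location, even when the text refers to a person.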
Gazetteers
• Limited availability
• Maintenance (organisations change)
• Criteria for building effective gazetteers
(e.g. size) are unclear, but it is better to use
small gazetteers of well-known names than
large ones of low-frequency names
(Mikheev et al. 1999).
Sources for Creation of Gazetteers
• Yellow pages for person and organisation
names.
• US GEOnet Names Server (GNS) data – 3.9
million locations with 5.37 million names
http://earth-info.nga.mil/gns/html/
• UN site: http://unstats.un.org/unsd/citydata
• Automatic collection from annotated
training data
Recognising Names
• Two main approaches
• Rule Based System
– Usually based on finite-state (FS) methods
• Automatically trained system
– Usually based on HMMs
• Rule based systems tend to have a
performance advantage
Mikheev et al 1999
• How important are gazetteers?
• Is it important that they are big?
• If gazetteers are important but their size
isn't, what are the criteria for building
gazetteers?
Mikheev – Experiment
• Learned List
– Training data (200 articles from MUC-7)
– 1228 persons, 809 Organisations, 770
Locations
• Common Lists
– CIA World Factbook
– 33K Organisations, 27K persons, 5K Locations
• Combined
Mikheev – Results of Experiment
Mikheev’s System
• Hybrid approach – c. 100 rules
• Rules make heavy use of capitalisation
• Rules based on internal structure which reveals
the type e.g.
Word Word plc
Prof. Word Word
• Modest but well-chosen gazetteer - 5000
Company Names, 1000 Human Names, 20,000
Locations, 2-3 weeks effort
Mikheev et al. (1999): Architecture
1. Sure-fire Rules
2. Partial Match 1
3. Rule Relaxation
4. Partial Match 2
5. Title Assignment
Sure-Fire Rules
• Fire when a possible candidate expression is
surrounded by a suggestive context
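As an illustration only, two such rules can be sketched with regular expressions, modelled on the "Word Word plc" and "Prof. Word Word" patterns quoted earlier in these slides. WORD, RULES and sure_fire are hypothetical names; the actual system uses a richer rule formalism, which is not reproduced here.

```python
import re

WORD = r"[A-Z][a-z]+"

# Two hypothetical rules: a company designator after the name,
# and a title before the name.
RULES = [
    (re.compile(rf"\b({WORD}(?: {WORD})* plc)\b"), "ORGANIZATION"),
    (re.compile(rf"\bProf\. ({WORD}(?: {WORD})*)"), "PERSON"),
]

def sure_fire(text):
    """Return (name, type) pairs for every rule that fires."""
    found = []
    for pattern, entity_type in RULES:
        found.extend((m.group(1), entity_type) for m in pattern.finditer(text))
    return found

print(sure_fire("Prof. Adam Kluver joined Hanson Trust plc yesterday."))
```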
Partial Match 1
• Collect all named entities already identified, e.g.
Adam Kluver Ltd.
• Generate all subsequences: Adam; Adam Kluver;
Kluver; Kluver Ltd; Ltd.
• Check for occurrences of subsequences and mark
as possible items of the same class as the original
named entity
• Check against pre-trained maximum entropy
model.
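The subsequence-generation step can be sketched as follows. The function name sub_names is illustrative, and the maximum entropy check that follows in the real system is not modelled.

```python
def sub_names(name):
    """Return every contiguous proper sub-sequence of the words in a name."""
    words = name.split()
    n = len(words)
    return [
        " ".join(words[i:j])
        for i in range(n)
        for j in range(i + 1, n + 1)
        if j - i < n  # skip the full name itself
    ]

print(sub_names("Adam Kluver Ltd"))
# ['Adam', 'Adam Kluver', 'Kluver', 'Kluver Ltd', 'Ltd']
```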
Maximum Entropy Model
• This model takes into account contextual
information for named entities
– sentence position
– whether the word exists in lowercase in general
– whether it is used in lowercase elsewhere in the same document, etc.
• These features are passed to the model as
attributes of the partially matched words.
• If the model provides a positive answer for a
partial match, the system makes a definite
assignment.
Rule Relaxation
• More relaxed contextual constraints
• Make use of information from existing
markup and from previous stages to
– Resolve conjunctions within named entities
e.g. China Import and Export Co.
– Resolve ambiguity of e.g.
Murdoch’s News Corp
Partial Match 2
• Handle single word names not covered by
partial match 1 (eg Hughes – Hughes
Communication Ltd)
• United States and Russia: if there is evidence for two
conjoined items (XXX and YYY) and one has already been
tagged “Location”, then it is likely that both
are of the same type. Hence conclude that
United States is of type Location
Title Assignment
• Newswire titles are uppercase
• Mark up entities in title by matching or
partially matching entities found in text
Mikheev: System Results
Use of Gazetteers
Mikheev - Conclusions
• Locations suffer without gazetteers, but the
addition of a small number of certain entries
(e.g. country names) makes a big difference.
• Main point: relatively small gazetteers are
sufficient to give good precision and recall.
• Experiments were based on a particular text type
(journalistic English with mixed case)
Bikel 99 - Trainable Systems
Hidden Markov Models
• An HMM is a probabilistic model based on a
sequence of events – in this case, words.
• Whether a word is part of a name is an event with
an estimable probability that can be determined
from a training corpus.
• With HMM we assume that there is an underlying
probabilistic FSM that changes state with each
input event.
• Probability that a word is part of a name is
conditional also on the state of the machine.
Creating HMMs
• Constructing an HMM depends upon
– having a good hidden state model
– having enough training data to estimate the
probabilities of the state transitions given
sequences of words.
• When the recogniser is run, it computes the
maximum likelihood path through the hidden state
model, given the input word sequence.
• Viterbi Algorithm finds the path.
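For reference, a generic Viterbi decoder is sketched below. It works on a plain HMM with separate start, transition and emission tables, which is a simplification and not the lexicalised model of Bikel et al.; all parameter names are illustrative and every probability is assumed to be non-zero.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for the observations."""
    # best[t][s] = (log-probability of the best path ending in s, previous state)
    best = [{
        s: (math.log(start_p[s]) + math.log(emit_p[s][observations[0]]), None)
        for s in states
    }]
    for obs in observations[1:]:
        column = {}
        for s in states:
            score, prev = max(
                (best[-1][p][0] + math.log(trans_p[p][s]), p) for p in states
            )
            column[s] = (score + math.log(emit_p[s][obs]), prev)
        best.append(column)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]
```

Working in log space avoids numerical underflow on long sentences; a real system would also need smoothing so that no probability is exactly zero.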
The HMM for NER (Bikel)
[State diagram: start-of-sentence → person, organisation, (other name classes), not-a-name → end-of-sentence]
Name Class Categories
• Eight Name Classes + not-a-name (NAN).
• Within each category, use a bigram
language model (number of states in each
category is |V|, the vocabulary size).
• Aim, for a given sentence, is to find the
most likely sequence of name-classes (NC)
given a sequence of words (W):
• NC = argmax(P(NC|W))
Model of Word Production
• Select a name class NC, conditioning on the previous
name-class (NC-1) and previous word w-1.
• Generate the first word inside NC, conditioning on
the NC and NC-1.
• Generate all subsequent words inside NC, where each
subsequent word is conditioned on its immediate
predecessor (using standard bigram language model).
Example
• Sentence: Mr. Jones eats
• According to MUC-6 rules, correct
labelling is
Mr. <ENAMEX TYPE="PERSON">Jones</ENAMEX> eats.
NAN PERSON NAN
• According to model, the likelihood of this
word/name-class sequence is given by the
following expression (which should turn out
to be most likely, given sufficient training).
Likelihood Under the Model
Pr(NOT-A-NAME | START-OF-SENTENCE, “+end+”) *
Pr(“Mr.” | NOT-A-NAME, START-OF-SENTENCE) *
Pr(+end+ | “Mr.”, NOT-A-NAME) *
Pr(PERSON | NOT-A-NAME, “Mr.”) *
Pr(“Jones” | PERSON, NOT-A-NAME) *
Pr(+end+ | “Jones”, PERSON) *
Pr(NOT-A-NAME | PERSON, “Jones”) *
Pr(“eats” | NOT-A-NAME, PERSON) *
Pr(“.” | “eats”, NOT-A-NAME) *
Pr(+end+ | “.”, NOT-A-NAME) *
Pr(END-OF-SENTENCE | NOT-A-NAME, “.”)
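The factor list above can be generated mechanically from a labelled sentence. The sketch below assumes that a change of label marks a name boundary, so two adjacent names of the same class are not distinguished; the function name factors and the string format are illustrative only.

```python
START, END = "START-OF-SENTENCE", "END-OF-SENTENCE"

def factors(words, classes):
    """List the probability factors for one labelled sentence, in order."""
    out = []
    prev_nc, prev_w = START, "+end+"
    for i, (w, nc) in enumerate(zip(words, classes)):
        if nc != prev_nc:
            # Entering a new name class: transition factor, then first-word factor.
            out.append(f"Pr({nc} | {prev_nc}, {prev_w})")
            out.append(f"Pr({w} | {nc}, {prev_nc})")
        else:
            # Staying inside the class: ordinary bigram factor.
            out.append(f"Pr({w} | {prev_w}, {nc})")
        last_in_class = i + 1 == len(words) or classes[i + 1] != nc
        if last_in_class:
            out.append(f"Pr(+end+ | {w}, {nc})")
        prev_nc, prev_w = nc, w
    out.append(f"Pr({END} | {prev_nc}, {prev_w})")
    return out

sentence = ["Mr.", "Jones", "eats", "."]
labels = ["NOT-A-NAME", "PERSON", "NOT-A-NAME", "NOT-A-NAME"]
for factor in factors(sentence, labels):
    print(factor)
```

For this sentence the function yields eleven factors in the same order as the expression above.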
Words and Word Features
• Word features are a language-dependent part of the model
Feature                  Example     Intuition
twoDigitNum              90          Two-digit year
fourDigitNum             1990        Four-digit year
containsDigitAndAlpha    A8956-67    Product code
containsDigitAndDash     09-96       Date
containsDigitAndSlash    11/9/89     Date
containsDigitAndComma    23,000.00   Monetary amount
containsDigitAndPeriod   1.00        Monetary amount
allCaps                  BBN         Organization
capPeriod                M.          Person name initial
initCap                  Sally       Capitalized word
other                    ,           Punctuation marks, all other words
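A possible feature function covering only the labels listed on this slide is sketched below. The order of the checks and the handling of tokens the slide does not mention (for example a three-digit number, which falls through to "other") are assumptions.

```python
import re

def word_feature(token):
    """Map a token to one of the word-feature labels listed on the slide."""
    if re.fullmatch(r"\d{2}", token):
        return "twoDigitNum"
    if re.fullmatch(r"\d{4}", token):
        return "fourDigitNum"
    if any(c.isdigit() for c in token):
        if any(c.isalpha() for c in token):
            return "containsDigitAndAlpha"
        for char, label in (("-", "containsDigitAndDash"),
                            ("/", "containsDigitAndSlash"),
                            (",", "containsDigitAndComma"),
                            (".", "containsDigitAndPeriod")):
            if char in token:
                return label
        return "other"
    if re.fullmatch(r"[A-Z]\.", token):
        return "capPeriod"
    if token.isalpha() and token.isupper():
        return "allCaps"
    if token[:1].isupper():
        return "initCap"
    return "other"

for example in ["90", "1990", "A8956-67", "23,000.00", "BBN", "M.", "Sally", ","]:
    print(example, word_feature(example))
```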
Three Sub Models
• Model to generate a name class
• Model to generate first word
• Model to generate subsequent words
Generate First Word in NC
• Likelihood
= P(transition from NC-1 to NC) * P(generate word w)
= P(NC | NC-1, w-1) * P(<w,f> | NC, NC-1)
• N.B. Underlying Intuitions
– Transition to NC strongly influenced by previous word
and previous word class
– First word of a name class strongly influenced by
preceding word class.
Generate Subsequent Words
in Name Class
• Here there are two cases:
– Normal – likelihood of w following w-1 within
a particular NC.
P(<w,f> | <w,f>-1,NC )
– Final word – likelihood of w in NC being the
final word of the class. This uses a
distinguished “+end+” word with features
“other”
P(<+end+,other> | <w,f>final,NC)
Estimating Probabilities
• P(NC|NC-1,w-1) =
c(NC,NC-1,w-1) / c(NC-1,w-1)
• P(<w,f>first|NC,NC-1) =
c(<w,f>first,NC,NC-1)/c(NC,NC-1)
• P(<w,f>|<w,f>-1,NC) =
c(<w,f>,<w,f>-1,NC)/c(<w,f>-1,NC)
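These relative-frequency estimates are simple ratios of counts. A minimal sketch for the first of them, P(NC | NC-1, w-1), is given below; the input format (a list of triples) and the function name are assumptions made for illustration, and no smoothing is applied.

```python
from collections import Counter

def estimate_name_class_model(events):
    """events: iterable of (nc, prev_nc, prev_word) triples seen in training."""
    joint = Counter()    # c(NC, NC-1, w-1)
    context = Counter()  # c(NC-1, w-1)
    for nc, prev_nc, prev_word in events:
        joint[(nc, prev_nc, prev_word)] += 1
        context[(prev_nc, prev_word)] += 1
    # Divide each joint count by the count of its conditioning context.
    return {key: count / context[key[1:]] for key, count in joint.items()}
```

The other two estimates follow the same pattern with different events and conditioning contexts.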
Backoff Models and Smoothing
• System knows about all words/bigrams
encountered during training.
• However, in real applications, unknown
words are also encountered, and mapped to
_UNK_
• System must therefore handle bigram
probabilities involving _UNK_:
• as first word, as second word, as both.
Constructing Unknown Word Model
• Based on "held out" data.
• Divide data into 2 halves.
• Use first half to create vocabulary, and train
on second half.
• When performing name recognition, the
unknown word model is used whenever
either or both words of a bigram are
unknown.
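The held-out construction can be sketched as follows: the vocabulary comes from the first half, and every word of the second half that is outside that vocabulary is replaced by _UNK_ before bigrams are collected. The function name and the token-list input are illustrative assumptions.

```python
UNK = "_UNK_"

def unknown_word_bigrams(first_half, second_half):
    """Bigrams from the second half, with unseen words mapped to _UNK_."""
    vocabulary = set(first_half)
    mapped = [w if w in vocabulary else UNK for w in second_half]
    return list(zip(mapped, mapped[1:]))

print(unknown_word_bigrams(["Mr.", "Jones", "eats"], ["Mr.", "Smith", "eats"]))
# [('Mr.', '_UNK_'), ('_UNK_', 'eats')]
```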
Backoff Strategy
• However, even with UWM, it is possible to
be faced with a bigram that has never been
encountered. In this case a backoff strategy
is used.
• Underlying such a strategy is a series of
fallback models.
• Data for successive members of the series
are easier to obtain, but of lower quality.
Backoff Models for
Name Class Bigrams
P(NC | NC-1,w-1)
|
P(NC | NC-1)
|
P(NC)
|
1 / (number of name classes)
Backoff Weighting
• The weight for each backoff model is
computed on the fly
• If computing P(X|Y), assign weight λ to the
direct estimate and a weight (1 – λ) to the
backoff model, where
λ = (1 – old c(Y)/c(Y)) / (1 + (unique outcomes of Y)/c(Y))
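A small sketch of the weighting, under one reading of the formula on this slide: λ = (1 – old c(Y)/c(Y)) / (1 + unique outcomes of Y / c(Y)). Treating old c(Y) as the sample size of the model being backed off from, and returning 0 when c(Y) is 0, are assumptions; the function names are illustrative.

```python
def backoff_weight(old_count, count, unique_outcomes):
    """Weight given to the direct estimate of P(X|Y)."""
    if count == 0:
        return 0.0  # no direct evidence: rely entirely on the backoff model
    return (1 - old_count / count) / (1 + unique_outcomes / count)

def interpolate(direct, backoff, weight):
    """Mix the direct estimate with the backoff estimate."""
    return weight * direct + (1 - weight) * backoff

lam = backoff_weight(old_count=0, count=10, unique_outcomes=5)
print(lam, interpolate(0.5, 0.1, lam))
```

The weight grows with the amount of evidence c(Y) and shrinks when Y has many distinct outcomes, which is when the direct estimate is least reliable.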
Results of Evaluation
Language                Best Rules   IdentiFinder
Mixed Case En (WSJ)     96.4         94.9
Upper Case En (WSJ)     89           93.6
Speech Form En (WSJ)    74           90.7
Mixed Case Sp           93           90
How Much Data is Needed?
• Performance increase of 1.5 F-points for
each doubling in the quantity of training
data.
• 1.2 million words of training data = 200
hours of broadcast news or 1777 Wall Street
Journal articles = 20 person-weeks
Bikel - Conclusion
• Old fashioned techniques
• Simple probabilistic
• Near human performance
• Higher F-measure than any other system
when case information is missing.