					Named Entity Recognition
  http://gate.ac.uk/   http://nlp.shef.ac.uk/

             Hamish Cunningham
              Kalina Bontcheva

RANLP, Borovets, Bulgaria, 8th September 2003
    Structure of the Tutorial
• task definition
• applications
• corpora, annotation
• evaluation and testing
• how to
   – preprocessing
   – approaches to NE
   – baseline
   – rule-based approaches
   – learning-based approaches
• multilinguality
• future challenges
       Information Extraction


• Information Extraction (IE) pulls facts and
  structured information from the content of
  large text collections.
• IR - IE - NLU
• MUC: Message Understanding
  Conferences
• ACE: Automatic Content Extraction
                     3(109)
               MUC-7 tasks


•   NE: Named Entity recognition and typing
•   CO: co-reference resolution
•   TE: Template Elements
•   TR: Template Relations
•   ST: Scenario Templates



                      4(109)
                      An Example
Text: "The shiny red rocket was fired on Tuesday. It is the brainchild of
Dr. Big Head. Dr. Head is a staff scientist at We Build Rockets Inc."

• NE: entities are "rocket", "Tuesday", "Dr. Head" and "We Build Rockets"
• CO: "it" refers to the rocket; "Dr. Head" and "Dr. Big Head" are the same
• TE: the rocket is "shiny red" and Head's "brainchild"
• TR: Dr. Head works for We Build Rockets Inc.
• ST: a rocket launching event occurred with the various participants



                                     5(109)
        Performance levels

• Vary according to text type, domain,
  scenario, language
• NE: up to 97% (tested in English, Spanish,
  Japanese, Chinese)
• CO: 60-70% resolution
• TE: 80%
• TR: 75-80%
• ST: 60% (but: human level may be only
  80%)
                    6(109)
    What are Named Entities?

• NER involves identification of proper
  names in texts, and classification into a set
  of predefined categories of interest
• Person names
• Organizations (companies, government
  organisations, committees, etc)
• Locations (cities, countries, rivers, etc)
• Date and time expressions
                      7(109)
   What are Named Entities (2)

• Other common types: measures (percent,
   money, weight etc), email addresses, Web
   addresses, street addresses, etc.
• Some domain-specific entities: names of drugs,
   medical conditions, names of ships,
   bibliographic references etc.
• MUC-7 entity definition guidelines (Chinchor’97)
  http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/ne_task.html

                          8(109)
    What are NOT NEs (MUC-7)

• Artefacts – Wall Street Journal
• Common nouns, referring to named entities –
  the company, the committee
• Names of groups of people and things named
  after people – the Tories, the Nobel prize
• Adjectives derived from names – Bulgarian,
  Chinese
• Numbers which are not times, dates,
  percentages, and money amounts
                      9(109)
       Basic Problems in NE

• Variation of NEs – e.g. John Smith, Mr
  Smith, John.
• Ambiguity of NE types: John Smith
  (company vs. person)
  – May (person vs. month)
  – Washington (person vs. location)
  – 1945 (date vs. time)
• Ambiguity with common words, e.g. "may"

                      10(109)
   More complex problems in NE

• Issues of style, structure, domain, genre etc.
• Punctuation, spelling, spacing, formatting, ...
  all have an impact:
Dept. of Computing and Maths
Manchester Metropolitan University
Manchester
United Kingdom

 Tell me more about Leonardo
 Da Vinci

                        11(109)
    Structure of the Tutorial
• task definition
• applications
• corpora, annotation
• evaluation and testing
• how to
   – preprocessing
   – approaches to NE
   – baseline
   – rule-based approaches
   – learning-based approaches
• multilinguality
• future challenges
                       12(109)
                  Applications

• Can help summarisation, ASR and MT
• Intelligent document access
  – Browse document collections by the entities that
    occur in them
  – Formulate more complex queries than IR can
    answer
  – Application domains:
     • News
      • Scientific articles, e.g., MEDLINE abstracts

                           13(109)
          Application Example - Threat Tracker
http://www.alias-i.com/iraq/feature_description/entity_search.html




                               14(109)
Application Example - KIM
  http://www.ontotext.com/kim




                 15(109)
    Application Example - KIM
Ontotext’s KIM query and results




                      16(109)
Application Example - Perseus
     http://www.perseus.tufts.edu/




                  17(109)
    Structure of the Tutorial
• task definition
• applications
• corpora, annotation
• evaluation and testing
• how to
   – preprocessing
   – approaches to NE
   – baseline
   – rule-based approaches
   – learning-based approaches
• multilinguality
• future challenges
                       18(109)
     Some NE Annotated Corpora
• MUC-6 and MUC-7 corpora - English
• CONLL shared task corpora
  http://cnts.uia.ac.be/conll2003/ner/ - NEs in English
  and German
  http://cnts.uia.ac.be/conll2002/ner/ - NEs in Spanish
  and Dutch
• TIDES surprise language exercise (NEs in Cebuano
  and Hindi)
• ACE – English -
  http://www.ldc.upenn.edu/Projects/ACE/


                          19(109)
             The MUC-7 corpus

•   100 documents in SGML
•   News domain
•   1880 Organizations (46%)
•   1324 Locations (32%)
•   887 Persons (22%)
•   http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/muc_7_proceedings/marsh_slides.pdf
                        20(109)
            The MUC-7 Corpus (2)
<ENAMEX TYPE="LOCATION">CAPE CANAVERAL</ENAMEX>,
   <ENAMEX TYPE="LOCATION">Fla.</ENAMEX> &MD; Working in
   chilly temperatures <TIMEX TYPE="DATE">Wednesday</TIMEX>
   <TIMEX TYPE="TIME">night</TIMEX>, <ENAMEX
   TYPE="ORGANIZATION">NASA</ENAMEX> ground crews readied
   the space shuttle Endeavour for launch on a Japanese satellite retrieval
   mission.
<p>
Endeavour, with an international crew of six, was set to blast off from the
   <ENAMEX TYPE="ORGANIZATION|LOCATION">Kennedy Space
   Center</ENAMEX> on <TIMEX TYPE="DATE">Thursday</TIMEX> at
   <TIMEX TYPE="TIME">4:18 a.m. EST</TIMEX>, the start of a 49-
   minute launching period. The <TIMEX TYPE="DATE">nine
   day</TIMEX> shuttle flight was to be the 12th launched in darkness.




                                  21(109)
NE Annotation Tools - Alembic




             22(109)
NE Annotation Tools – Alembic (2)




               23(109)
NE Annotation Tools - GATE




           24(109)
 Corpora and System Development

• Corpora are divided typically into a training and
  testing portion
• Rules/Learning algorithms are trained on the training
  part
• Tuned on the testing portion in order to optimise
   – Rule priorities, rule effectiveness, etc.
   – Parameters of the learning algorithm and the features used
• Evaluation set – the best system configuration is run
  on this data and the system performance is obtained
• No further tuning once evaluation set is used!
                             25(109)
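The split described above can be made concrete with a small sketch (ours, not part of the original tutorial; the proportions and function name are assumptions):

import random

def split_corpus(documents, train_frac=0.6, test_frac=0.2, seed=42):
    """Shuffle the documents and return (training, testing, evaluation) portions."""
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    n_train = int(len(docs) * train_frac)
    n_test = int(len(docs) * test_frac)
    return (docs[:n_train],                  # train rules / learning algorithms here
            docs[n_train:n_train + n_test],  # tune priorities, features, parameters here
            docs[n_train + n_test:])         # run the best configuration once, report results

training, testing, evaluation = split_corpus(range(100))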
    Structure of the Tutorial
• task definition
• applications
• corpora, annotation
• evaluation and testing
• how to
   – preprocessing
   – approaches to NE
   – baseline
   – rule-based approaches
   – learning-based approaches
• multilinguality
• future challenges
                       26(109)
       Performance Evaluation

• Evaluation metric – mathematically defines
  how to measure the system’s performance
  against a human-annotated, gold standard
• Scoring program – implements the metric and
  provides performance measures
  – For each document and over the entire corpus
  – For each type of NE



                        27(109)
         The Evaluation Metric

• Precision = correct answers/answers produced
• Recall = correct answers/total possible correct
  answers
• Trade-off between precision and recall
• F-Measure = (β² + 1)·P·R / (β²·R + P)
  [van Rijsbergen 75]
• β reflects the weighting between precision and
  recall, typically β=1
                       28(109)
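A minimal sketch of the metric above (the function name is ours; the denominator follows the slide's formula, which for β=1 coincides with the usual balanced F):

def f_measure(precision, recall, beta=1.0):
    """F = (beta^2 + 1) * P * R / (beta^2 * R + P), as on the slide."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * recall + precision)

print(f_measure(0.88, 0.84))   # 0.859..., the balanced F for P=88%, R=84%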
      The Evaluation Metric (2)

• Precision = (Correct + ½ Partially correct) /
              (Correct + Incorrect + Partial)
• Recall = (Correct + ½ Partially correct) /
           (Correct + Missing + Partial)

• Why: NE boundaries are often misplaced, so
  some results are only partially correct
                     29(109)
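These lenient scores can be transcribed literally (a sketch under the slide's definitions; the count names are ours):

def lenient_precision(correct, partial, incorrect):
    produced = correct + incorrect + partial
    return (correct + 0.5 * partial) / produced if produced else 0.0

def lenient_recall(correct, partial, missing):
    possible = correct + missing + partial
    return (correct + 0.5 * partial) / possible if possible else 0.0

# e.g. 9 correct, 1 partial, 2 incorrect, 3 missing
print(lenient_precision(9, 1, 2), lenient_recall(9, 1, 3))   # 0.79..., 0.73...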
              The MUC scorer (1)
Document: 9601020572
------------------------------------------------------------------------
                 POS   ACT | COR  PAR  INC | MIS  SPU  NON | REC  PRE
---------------------------+---------------+---------------+-----------
SUBTASK SCORES             |               |               |
enamex                     |               |               |
  organization    11    12 |   9    0    0 |   2    3    0 |  82   75
  person          24    26 |  24    0    0 |   0    2    0 | 100   92
  location        27    31 |  25    0    0 |   2    6    0 |  93   81
…

* * * SUMMARY SCORES * * *
------------------------------------------------------------------------
                 POS   ACT | COR  PAR  INC | MIS  SPU  NON | REC  PRE
---------------------------+---------------+---------------+-----------
TASK SCORES                |               |               |
enamex                     |               |               |
  organization  1855  1757 |1553    0   37 | 265  167   30 |  84   88
  person         883   859 | 797    0   13 |  73   49    4 |  90   93
  location      1322  1406 |1199    0   13 | 110  194    7 |  91   85

                              30(109)
              The MUC scorer (2)

• Tracking errors in each document, for
  each instance in the text

ENAMEX   cor inc   PERSON PERSON "Wernher von Braun" "Braun"
ENAMEX   cor inc   PERSON PERSON "von Braun"         "Braun"
ENAMEX   cor cor   PERSON PERSON "Braun"             "Braun"
…
ENAMEX   cor cor   LOCATI LOCATI "Saturn"           "Saturn"
…




                              31(109)
The GATE Evaluation Tool




          32(109)
           Regression Testing

• Need to track the system’s performance over time
• When a change is made to the system, we want
  to know what the implications are over the entire
  corpus
• Why: because an improvement in one case can
  lead to problems in others
• GATE offers an automated tool to help with the
  NE development task over time

                      33(109)
        Regression Testing (2)
At corpus level – GATE’s corpus benchmark tool tracks
the system’s performance over time




                          34(109)
    Structure of the Tutorial
• task definition
• applications
• corpora, annotation
• evaluation and testing
• how to
   – preprocessing
   – approaches to NE
   – baseline
   – rule-based approaches
   – learning-based approaches
• multilinguality
• future challenges
                       35(109)
Pre-processing for NE Recognition

• Format detection
• Word segmentation (for languages like
  Chinese)
• Tokenisation
• Sentence splitting
• POS tagging



                    36(109)
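To make the pre-processing steps concrete, here is a toy regex-based tokeniser and sentence splitter for space-separated text (an illustration of ours, not the tutorial's GATE components):

import re

TOKEN = re.compile(r"\w+|[^\w\s]")            # words, or single punctuation marks
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")   # naive: split after ., ! or ?

def sentences(text):
    return SENTENCE_END.split(text.strip())

def tokens(sentence):
    return TOKEN.findall(sentence)

for s in sentences("Dr. Head works at We Build Rockets Inc. It was founded in 1990."):
    print(tokens(s))
# Note how the naive splitter breaks after "Dr." and "Inc." - exactly the kind of
# case real tokenisers and sentence splitters have to handle.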
   Two kinds of NE approaches
Knowledge Engineering
• rule based
• developed by experienced language engineers
• make use of human intuition
• requires only small amount of training data
• development could be very time consuming
• some changes may be hard to accommodate

Learning Systems
• use statistics or other machine learning
• developers do not need LE expertise
• requires large amounts of annotated training data
• some changes may require re-annotation of the entire training corpus
• annotators are cheap (but you get what you pay for!)



                              37(109)
   List lookup approach - baseline

• System that recognises only entities stored in
  its lists (gazetteers).
• Advantages - Simple, fast, language
  independent, easy to retarget (just create lists)
• Disadvantages - collection and maintenance of
  lists, cannot deal with name variants, cannot
  resolve ambiguity


                        38(109)
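A baseline of this kind fits in a few lines; the sketch below (our illustration, with invented gazetteer entries) simply reports every known string it finds:

GAZETTEERS = {
    "Person":       {"John Smith", "David Walton"},
    "Organization": {"NASA", "Goldman Sachs"},
    "Location":     {"Sheffield", "Borovets", "Bulgaria"},
}

def lookup_entities(text):
    """Return (start, end, type) for every gazetteer string found in the text."""
    found = []
    for ne_type, names in GAZETTEERS.items():
        for name in names:
            start = text.find(name)
            while start != -1:
                found.append((start, start + len(name), ne_type))
                start = text.find(name, start + 1)
    return sorted(found)

print(lookup_entities("NASA and Goldman Sachs met in Sheffield."))
# Fast and language independent, but it cannot handle variants ("Goldman") or
# ambiguity ("May" the person vs. the month).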
         Creating Gazetteer Lists

• Online phone directories and yellow pages for person
  and organisation names (e.g. [Paskaleva02])
• Locations lists
   – US GEOnet Names Server (GNS) data – 3.9 million locations
     with 5.37 million names (e.g., [Manov03])
   – UN site: http://unstats.un.org/unsd/citydata
   – Global Discovery database from Europa technologies Ltd,
     UK (e.g., [Ignat03])
• Automatic collection from annotated training data


                           39(109)
    Structure of the Tutorial
• task definition
• applications
• corpora, annotation
• evaluation and testing
• how to
   – preprocessing
   – approaches to NE
   – baseline
   – rule-based approaches
   – learning-based approaches
• multilinguality
• future challenges
                       40(109)
          Shallow Parsing Approach
             (internal structure)
• Internal evidence – names often have internal structure. These
  components can be either stored or guessed, e.g. location:
• Cap. Word + {City, Forest, Center, River}
  e.g. Sherwood Forest
• Cap. Word + {Street, Boulevard, Avenue, Crescent, Road}
  e.g. Portobello Street


                              41(109)
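A rough sketch of such internal-evidence patterns, written as plain regular expressions rather than in the grammar formalisms discussed later (our illustration):

import re

KEYWORD = r"(?:City|Forest|Center|River|Street|Boulevard|Avenue|Crescent|Road)"
CAPWORD = r"[A-Z][a-z]+"
LOCATION = re.compile(rf"\b(?:{CAPWORD}\s+)+{KEYWORD}\b")

for text in ["He hid in Sherwood Forest.", "She lives on Portobello Street."]:
    print(LOCATION.findall(text))   # ['Sherwood Forest'], then ['Portobello Street']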
    Problems with the shallow parsing
               approach
• Ambiguously capitalised words (first word in sentence)
  [All American Bank] vs. All [State Police]
• Semantic ambiguity
  "John F. Kennedy" = airport (location)
  "Philip Morris" = organisation
• Structural ambiguity
  [Cable and Wireless] vs. [Microsoft] and [Dell];
  [Center for Computational Linguistics] vs.
  message from [City Hospital] for [John Smith]




                          42(109)
    Shallow Parsing Approach with
               Context
• Use of context-based patterns is helpful in
  ambiguous cases
• "David Walton" and "Goldman Sachs" are
  indistinguishable
• But with the phrase "David Walton of Goldman
  Sachs" and the Person entity "David Walton"
  recognised, we can use the pattern "[Person] of
  [Organization]" to identify "Goldman Sachs“
  correctly.




                       43(109)
       Identification of Contextual
                Information
• Use KWIC index and concordancer to find
  windows of context around entities
• Search for repeated contextual patterns of
  either strings, other entities, or both
• Manually post-edit list of patterns, and
  incorporate useful patterns into new rules
• Repeat with new entities


                      44(109)
       Examples of context patterns
•   [PERSON] earns [MONEY]
•   [PERSON] joined [ORGANIZATION]
•   [PERSON] left [ORGANIZATION]
•   [PERSON] joined [ORGANIZATION] as [JOBTITLE]
•   [ORGANIZATION]'s [JOBTITLE] [PERSON]
•   [ORGANIZATION] [JOBTITLE] [PERSON]
•   the [ORGANIZATION] [JOBTITLE]
•   part of the [ORGANIZATION]
•   [ORGANIZATION] headquarters in [LOCATION]
•   price of [ORGANIZATION]
•   sale of [ORGANIZATION]
•   investors in [ORGANIZATION]
•   [ORGANIZATION] is worth [MONEY]
•   [JOBTITLE] [PERSON]
•   [PERSON], [JOBTITLE]



                                 45(109)
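As a small illustration of how one such pattern can be applied (our sketch; the names are invented), the "[PERSON] of [ORGANIZATION]" pattern from the earlier slide promotes an unknown capitalised sequence to an organisation when it follows a known person:

import re

KNOWN_PERSONS = {"David Walton"}
CAP_SEQ = r"(?:[A-Z][A-Za-z&]*\s?)+"          # a run of capitalised words

def organizations_from_context(text):
    orgs = []
    for person in KNOWN_PERSONS:
        pattern = re.compile(rf"{re.escape(person)} of ({CAP_SEQ})")
        orgs += [m.group(1).strip() for m in pattern.finditer(text)]
    return orgs

print(organizations_from_context("A message from David Walton of Goldman Sachs."))
# ['Goldman Sachs']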
                  Caveats

• Patterns are only indicators based on
  likelihood
• Can set priorities based on frequency
  thresholds
• Need training data for each domain
• More semantic information would be useful
  (e.g. to cluster groups of verbs)


                     46(109)
       Rule-based Example: FACILE
• FACILE - used in MUC-7 [Black et al 98]
• Uses Inxight’s LinguistiX tools for tagging and
  morphological analysis
• Database for external information, role similar to a
  gazetteer
• Linguistic info per token, encoded as feature vector:
   –   Text offsets
   –   Orthographic pattern (first/all capitals, mixed, lowercase)
   –   Token and its normalised form
   –   Syntax – category and features
   –   Semantics – from database or morphological analysis
   –   Morphological analyses
• Example:
  (1192 1196 10 T C "Mrs." "mrs." (PROP TITLE) (ˆPER_CIV_F)
  (("Mrs." "Title" "Abbr")) NIL)
  PER_CIV_F – female civilian (from database)
                                  47(109)
                        FACILE (2)

• Context-sensitive rules written in special rule notation,
  executed by an interpreter
• Writing rules in PERL is too error-prone and hard
• Rules of the kind:
  A => B\C/D, where:
   – A is a set of attribute-value expressions and optional score, the
     attributes refer to elements of the input token feature vector
   – B and D are left and right context respectively and can be empty
   – B, C, D are sequences of attribute-value pairs and Kleene regular
     expression operations; variables are also supported
• [syn=NP, sem=ORG] (0.9) =>
  \ [norm="university"],
  [token="of"],
  [sem=REGION|COUNTRY|CITY] / ;

                                 48(109)
                       FACILE (3)
# Rule for the mark up of person names when the first name is not
# present or known from the gazetteers: e.g 'Mr J. Cass',


 [SYN=PROP,SEM=PER, FIRST=_F, INITIALS=_I, MIDDLE=_M,
   LAST=_S] #_F, _I, _M, _S are variables, transfer info from RHS
=>
 [SEM=TITLE_MIL|TITLE_FEMALE|TITLE_MALE]
\[SYN=NAME, ORTH=I|O, TOKEN=_I]?,
 [ORTH=C|A, SYN=PROP, TOKEN=_F]?,
 [SYN=NAME, ORTH=I|O, TOKEN=_I]?,
 [SYN=NAME, TOKEN=_M]?,
 [ORTH=C|A|O,SYN=PROP,TOKEN=_S, SOURCE!=RULE]
 #proper name, not recognised by a rule
/;

                                 49(109)
                   FACILE (4)

• Preference mechanism:
  – The rule with the highest score is preferred
  – Longer matches are preferred to shorter matches
  – Results are always one semantic categorisation of
    the named entity in the text
• Evaluation (MUC-7 scores):
  –   Organization: 86% precision, 66% recall
  –   Person: 90% precision, 88% recall
  –   Location: 81% precision, 80% recall
  –   Dates: 93% precision, 86% recall
                          50(109)
Example Rule-based System - ANNIE

• Created as part of GATE
• GATE – Sheffield’s open-source infrastructure
  for language processing
• GATE automatically deals with document
  formats, saving of results, evaluation, and
  visualisation of results for debugging
• GATE has a finite-state pattern-action rule
  language, used by ANNIE

                     51(109)
               NE Components
The ANNIE system – a reusable and easily extendable set of
components




                            52(109)
  Gazetteer lists for rule-based NE

• Needed to store the indicator strings for the
  internal structure and context rules
• Internal location indicators – e.g., {river,
  mountain, forest} for natural locations; {street,
  road, crescent, place, square, …} for address
  locations
• Internal organisation indicators – e.g.,
  company designators {GmbH, Ltd, Inc, …}
• Produces Lookup results of the given kind

                       53(109)
    The Named Entity Grammars

• Phases run sequentially and constitute a cascade of
  FSTs over the pre-processing results
• Hand-coded rules applied to annotations to identify
  NEs
• Annotations from format analysis, tokeniser, sentence
  splitter, POS tagger, and gazetteer modules
• Use of contextual information
• Finds person names, locations, organisations, dates,
  addresses.

                         54(109)
                 NE Rule in JAPE
JAPE: a Java Annotation Patterns Engine
• Light, robust regular-expression-based processing
• Cascaded finite state transduction
• Low-overhead development of new components
• Simplifies multi-phase regex processing

 Rule: Company1
 Priority: 25
   (
     ( {Token.orthography == upperInitial} )+ //from tokeniser
     {Lookup.kind == companyDesignator} //from gazetteer lists
   ):match
 -->
    :match.NamedEntity = { kind=company, rule="Company1" }
                              55(109)
          Named Entities in GATE




56(109)
      Using co-reference to classify
            ambiguous NEs
• Orthographic co-reference module that matches
  proper names in a document
• Improves NE results by assigning entity type to
  previously unclassified names, based on
  relations with classified NEs
• May not reclassify already classified entities
• Classification of unknown entities very useful for
  surnames which match a full name, or abbreviations,
  e.g. [Bonfield] will match [Sir Peter Bonfield];
  [International Business Machines Ltd.] will
  match [IBM]

                         57(109)
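A simplified sketch of this idea (ours, not GATE's actual orthomatcher): an unclassified name inherits the type of a classified name that it matches as a surname or as an acronym.

DESIGNATORS = {"ltd", "ltd.", "inc", "inc.", "plc"}   # ignored when building acronyms

def acronym(name):
    words = [w for w in name.split() if w.lower() not in DESIGNATORS]
    return "".join(w[0] for w in words if w[0].isupper())

def classify_by_coreference(unknown, classified):
    """classified: dict mapping an already classified full name to its entity type."""
    for full_name, ne_type in classified.items():
        if unknown == full_name.split()[-1] or unknown == acronym(full_name):
            return ne_type
    return None

known = {"Sir Peter Bonfield": "Person",
         "International Business Machines Ltd.": "Organization"}
print(classify_by_coreference("Bonfield", known))   # Person
print(classify_by_coreference("IBM", known))        # Organization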
Named Entity Coreference




           58(109)
DEMO




 59(109)
    Structure of the Tutorial
• task definition
• applications
• corpora, annotation
• evaluation and testing
• how to
   – preprocessing
   – approaches to NE
   – baseline
   – rule-based approaches
   – learning-based approaches
• multilinguality
• future challenges
                      60(109)
   Machine Learning Approaches
• ML approaches frequently break down the NE task into
  two parts:
   – Recognising the entity boundaries
   – Classifying the entities in the NE categories
• Some work is only on one task or the other
• Tokens in text are often coded with the IOB scheme
   – O – outside, B-XXX – first word in NE, I-XXX – all other
     words in NE
   – Easy to convert to/from inline MUC-style markup
   – Argentina       B-LOC
     played          O
     with            O
     Del             B-PER
     Bosque          I-PER
                              61(109)
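A small sketch (ours) of the IOB coding: converting between token-offset spans and B-XXX / I-XXX / O tags.

def to_iob(tokens, spans):
    """spans: list of (start, end, type) token offsets, end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end, ne_type in spans:
        tags[start] = "B-" + ne_type
        for i in range(start + 1, end):
            tags[i] = "I-" + ne_type
    return tags

def from_iob(tags):
    spans, start = [], None
    for i, tag in enumerate(tags + ["O"]):      # the sentinel flushes the last span
        if start is not None and not tag.startswith("I-"):
            spans.append((start, i, tags[start][2:]))
            start = None
        if tag.startswith("B-"):
            start = i
    return spans

tokens = ["Argentina", "played", "with", "Del", "Bosque"]
tags = to_iob(tokens, [(0, 1, "LOC"), (3, 5, "PER")])
print(tags)            # ['B-LOC', 'O', 'O', 'B-PER', 'I-PER']
print(from_iob(tags))  # [(0, 1, 'LOC'), (3, 5, 'PER')]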
       IdentiFinder [Bikel et al 99]
• Based on Hidden Markov Models
• Their HMM has 7 regions – one for each MUC type,
  not-name, begin-sentence and end-sentence
• Features
  –   Capitalisation
  –   Numeric symbols
  –   Punctuation marks
  –   Position in the sentence
  –   14 features in total, combining above info, e.g.,
      containsDigitAndDash (09-96), containsDigitAndComma
      (23,000.00)

                           62(109)
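Illustrative versions of a few such word features (our sketch; the real system combines 14 of them into a single word-feature class):

import re

def word_features(token):
    return {
        "initial_cap":              token[:1].isupper() and token[1:].islower(),
        "all_caps":                 token.isupper(),
        "contains_digit_and_dash":  bool(re.fullmatch(r"\d+-\d+", token)),   # 09-96
        "contains_digit_and_comma": bool(re.search(r"\d,\d", token)),        # 23,000.00
        "four_digit_number":        bool(re.fullmatch(r"\d{4}", token)),     # 1996
    }

print(word_features("09-96"))
print(word_features("23,000.00"))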
               IdentiFinder (2)

•   Back-off models and smoothing
•   Unknown words
•   Further back-off and smoothing
•   Different strategies for name-class bigrams,
    first-word bigrams and non-first-word bigrams




                        63(109)
                  IdentiFinder (3)
• MUC-6 (English) and MET-1(Spanish) corpora used for
  evaluation
• Mixed case English
   – IdentiFinder - 94.9% f-measure
   – Best rule-based – 96.4%
• Spanish mixed case
   – IdentiFinder – 90%
   – Best rule-based - 93%
   – Lower case names, noisy training data, less training data
• Training data: 650,000 words, but similar performance
  with half of the data. Less than 100,000 words reduces
  the performance to below 90% on English
                              64(109)
     MENE [Borthwick et al 98]

• Combining rule-based and ML NE to achieve
  better performance
• Tokens tagged as: XXX_start, XXX_continue,
  XXX_end, XXX_unique, other (non-NE),
  where XXX is an NE category
• Uses Maximum Entropy
  – One only needs to find the best features for the
    problem
  – ME estimation routine finds the best relative
    weights for the features
                         65(109)
                   MENE (2)

• Features
  – Binary features – “token begins with capitalised
    letter”, “token is a four-digit number”
  – Lexical features – dependencies on the
    surrounding tokens (window ±2) e.g., “Mr” for
    people, “to” for locations
  – Dictionary features – equivalent to gazetteers (first
    names, company names, dates, abbreviations)
  – External systems – whether the current token is
    recognised as an NE by a rule-based system

                          66(109)
                    MENE (3)
• MUC-7 formal run corpus
  – MENE – 84.2% f-measure
  – Rule-based systems it uses – 86%–91%
  – MENE + rule-based systems – 92%
• Learning curve
  –   20 docs – 80.97%
  –   40 docs – 84.14%
  –   100 docs – 89.17%
  –   425 docs – 92.94%

                          67(109)
   Fine-grained Classification of NEs
            [Fleischman 02]
• Finer-grained categorisation needed for applications like
  question answering
• Person classification into 8 sub-categories – athlete,
  politician/government, clergy, businessperson,
  entertainer/artist, lawyer, doctor/scientist, police.
• Supervised approach using local context and global
  semantic information such as WordNet
• Used a decision list classifier and IdentiFinder to
  automatically construct a training set from untagged data
• Held-out set of 1300 instances hand annotated
                          68(109)
Fine-grained Classification of NEs (2)

• Word frequency features
  – For each of the 8 categories, 10 distinct word positions = 80
    features per instance
  – 3 words before & after the instance
  – The two-word bigrams immediately before and after the
    instance
  – The three-word trigrams before/after the instance
• Topic signatures and WordNet information
  – Compute lists of terms that signal relevance to a
    topic/category [Lin&Hovy 00] & expand with WordNet
    synonyms to counter unseen examples
  – Politician – campaign, republican, budget
                           69(109)
Fine-grained Classification of NEs (3)

• Due to differing contexts, instances of the same name
  in a single text were classified differently
• MemRun chooses the prevailing sub-category based on
  their most frequent classification
• An orthomatching-like algorithm is developed to match
  George Bush, Bush, and George W. Bush
• Experiments with k-NN, Naïve Bayes, SVMs, Neural
  Networks and C4.5 show that C4.5 is best
• Future work: treating finer grained classification as a
  WSD task (categories are different senses of a person)
                         70(109)
    Structure of the Tutorial
• task definition
• applications
• corpora, annotation
• evaluation and testing
• how to
   – preprocessing
   – approaches to NE
   – baseline
   – rule-based approaches
   – learning-based approaches
• multilinguality
• future challenges
                       71(109)
      Multilingual Named Entity
             Recognition
• Recent experiments are aimed at NE
  recognition in multiple languages
• TIDES surprise language evaluation exercise
  measures how quickly researchers can develop
  NLP components in a new language
• CONLL’02, CONLL’03 focus on language-
  independent NE recognition

                     72(109)
  Analysis of the NE Task in Multiple
    Languages [Palmer&Day 97]
Language     NEs    Time/Date   Numeric exprs.   Org/Per/Loc
Chinese      4454   17.2%        1.8%            80.9%
English      2242   10.7%        9.5%            79.8%
French       2321   18.6%        3%              78.4%
Japanese     2146   26.4%        4%              69.6%
Portuguese   3839   17.7%       12.1%            70.3%
Spanish      3579   24.6%        3%              72.5%
                    73(109)
   Analysis of Multilingual NE (2)

• Numerical and time expressions are very easy
  to capture using rules
• Together they constitute about 20-30% of all NEs
• All numerical expressions in the 6 languages
  required only 5 patterns
• Time expressions similarly require only a few
  rules (less than 30 per language)
• Many of these rules are reusable across the
  languages
                      74(109)
  Analysis of Multilingual NE (3)
• Suggests a method for calculating the lower bound for
  system performance given a corpus in the target
  language
• Conclusion: Much of the NE task can be achieved by
  simple string analysis and common phrasal contexts
• Zipf’s law: the prevalence of frequent phenomena
  allows high scores to be achieved directly from the
  training data
• Chinese, Japanese, and Portuguese corpora had a
  lower bound above 70%
• Substantial further advances require language
  specificity
                         75(109)
What is needed for multilingual NE

• Extensive support for non-Latin scripts and
  text encodings, including conversion utilities
  – Automatic recognition of encoding [Ignat et al 03]
  – Occupied up to 2/3 of the TIDES Hindi effort
• Bi-lingual dictionaries
• Annotated corpus for evaluation
• Internet resources for gazetteer list collection
  (e.g., phone books, yellow pages, bi-lingual
  pages)
                        76(109)
    Multilingual support - Alembic



Japanese
example




                 77(109)
           Editing Multilingual Data
GATE Unicode Kit (GUK)
    Complements Java’s facilities



• Support for defining
  Input Methods (IMs)

• currently 30 IMs
  for 17 languages

• Pluggable in other
  applications (e.g.
  JEdit)
                            78(109)
         Multilingual Data - GATE
All processing, visualisation and editing tools use GUK




                             79(109)
     Gazetteer-based Approach to
    Multilingual NE [Ignat et al 03]
• Deals with locations only
• Even more ambiguity than in one language:
  – Multiple places that share the same name, such as
    the fourteen cities and villages in the world called
    ‘Paris’
  – Place names that are also words in one or more
    languages, such as ‘And’ (Iran), ‘Split’ (Croatia)
  – Places have varying names in different languages
    (Italian ‘Venezia’ vs. English ‘Venice’, German
    ‘Venedig’, French ‘Venise’)
                          80(109)
 Gazetteer-based multilingual NE (2)

• Disambiguation module applies heuristics
  based on location size and country mentions
  (prefer the locations from the country
  mentioned most)
• Performance evaluation:
  – 853 locations from 80 English texts
  – 96.8% precision
  – 96.5% recall

                        81(109)
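A hedged sketch of the heuristic above (the gazetteer entries and numbers are invented): among the candidate places for an ambiguous name, prefer the country mentioned most often in the document, then the larger place.

from collections import Counter

CANDIDATES = {  # toy gazetteer: place name -> [(country, population), ...]
    "Paris": [("France", 2_100_000), ("US", 25_000)],
}

def disambiguate(name, country_mentions):
    counts = Counter(country_mentions)
    return max(CANDIDATES[name],
               key=lambda c: (counts[c[0]], c[1]))  # most-mentioned country, then size

print(disambiguate("Paris", ["France", "France", "US"]))   # ('France', 2100000)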
Machine Learning for Multilingual NE
• The CONLL’2002 and 2003 shared tasks were NE recognition
  in Spanish, Dutch, English, and German
• The most popular ML techniques used:
  – Maximum Entropy (5 systems)
  – Hidden Markov Models (4 systems)
  – Connectionist methods (4 systems)
• Combining ML methods has been shown to
  boost results


                       82(109)
        ML for NE at CONLL (2)

• The choice of features is at least as important as
  the choice of ML algorithm
  –   Lexical features (words)
  –   Part-of-speech
  –   Orthographic information
  –   Affixes
  –   Gazetteers
• External, unmarked data is useful to derive
  gazetteers and for extracting training instances

                         83(109)
       ML for NE at CONLL (3)
• English (f-measure)
   – Baseline 59.5%
   – Systems – between 60.2% and 88.76%
• German (f-measure)
   – Baseline – 30.3%
   – Systems – between 47.7% and 72.4%
• Spanish (f-measure)
   – Baseline – 35.9%
   – Systems – between 60.9% and 81.4%
• Dutch (f-measure)
   – Baseline – 53.1%
   – Systems – between 56.4% and 77%
                          84(109)
Language Independent NE Recognition
      [Cucerzan&Yarowsky 02]
• Uses iterative learning and re-estimation of
  contextual and morphological patterns,
  using trie models
• Learns from unannotated text and requires
  only small list of labelled names, without
  using other language-specific tools
• Word internal features:
  – Some prefixes and suffixes are good indicators
  – For example -escu, -wski, -ova, -ov for person
    names
                      85(109)
Language Independent NE Recognition
      [Cucerzan&Yarowsky 02]
• Classify all occurrences of an entity in the
  text together by combining the contextual
  and morphological clues from each instance
• “One NE class per document/discourse”
  assumption similar to the “one sense per
  discourse” assumption used in word sense
  disambiguation [Gale,Church&Yarowsky 92]
• 70.5% - 75.4% f-measure for Romanian
• Measured on two tasks – NE identification and
  classification (NE boundaries are pre-defined)
                       86(109)
 TIDES surprise language exercise

• Collaborative effort between a number of sites
  to develop resources and tools for various LE
  tasks on a surprise language
• Tasks: IE (including NE), machine translation,
  summarisation, cross-language IR
• Dry-run lasted 10 days on the Cebuano
  language from the Philippines
• Surprise language was Hindi, announced at the
  start of June 2003; duration 1 month

                      87(109)
        Language categorisation

• LDC – survey of 300 largest languages (by
  population) to establish what resources are
  available
• http://www.ldc.upenn.edu/Projects/TIDES/language-summary-table.html
• Classification dimensions:
  – Dictionaries, news texts, parallel texts, e.g., Bible
  – Script, orthography, words separated by spaces

                          88(109)
          The Surprise Languages

• Cebuano:
  – Latin script and words are spaced, but
  – Few resources and little work, so
  – Medium difficulty
• Hindi
  – Non-Latin script, different encodings used, words
    are spaced, no capitalisation
  – Many resources available
  – Medium difficulty

                         89(109)
 Named Entity Recognition for TIDES

• Information on other systems and results from
  TIDES is still unavailable to non-TIDES
  participants
• Will be made available by the end of 2003 in a
  special issue of ACM Transactions on Asian
  Language Information Processing (TALIP):
  “Rapid Development of Language Capabilities:
  The Surprise Languages”
• The Sheffield approach is presented below,
  because it is not subject to these restrictions
                       90(109)
  Dictionary-based Adaptation of
      an English POS tagger
• Substituted Hindi/Cebuano lexicon for English
  one in a Brill-like tagger
• Hindi/Cebuano lexicon derived from a bi-lingual
  dictionary
• Used empty ruleset since no training data
  available
• Used default heuristics (e.g. return NNP for
  capitalised words)
• Very experimental, but reasonable results
                      91(109)
       Evaluation of the Tagger

• No formal evaluation was possible
• Estimate around 67% accuracy on Hindi –
  evaluated by a native speaker on 1000 words
• Created in 2 person days
• Results and a tagging service made available
  to other researchers in TIDES
• Important pre-requisite for NE recognition

                     92(109)
               NE grammars
• Most English JAPE rules based on POS tags
  and gazetteer lookup
• Grammars can be reused for languages with
  similar word order, orthography etc.
• No time to make detailed study of Cebuano,
  but very similar in structure to English
• Most of the rules left as for English, but some
  adjustments to handle especially dates
• Used both English and Cebuano grammars
  and gazetteers, because NEs appear in both
  languages
                        93(109)
94(109)
                Evaluation Results
             Cebuano                English baseline
Entity       P     R     F          P      R      F
Person      71    65    68         36     36     36
Org         75    71    73         31     47     38
Location    73    78    76         65      7     12
Date        83   100    92         42     58     49
Total       76    79    77.5       45     41.7   43
                           95(109)
    Structure of the Tutorial
• task definition
• applications
• corpora, annotation
• evaluation and testing
• how to
   – preprocessing
   – approaches to NE
   – baseline
   – rule-based approaches
   – learning-based approaches
• multilinguality
• future challenges
                       96(109)
            Future challenges

• Towards semantic tagging of entities
• New evaluation metrics for semantic entity
  recognition
• Expanding the set of entities recognised – e.g.,
  vehicles, weapons, substances (food, drug)
• Finer-grained hierarchies, e.g., types of
  Organizations (government, commercial,
  educational, etc.), Locations (regions,
  countries, cities, water, etc)

                       97(109)
          Future challenges (2)

• Standardisation of the annotation formats
  – [Ide & Romary 02] – RDF-based annotation
    standards
  – [Collier et al 02] – multi-lingual named entity
    annotation guidelines
  – Aimed at defining how to annotate in order to
    make corpora more reusable and lower the
    overhead of writing format conversion tools
     • MUC used inline markup
     • TIDES and ACE used stand-off markup, but two
       different kinds (XML vs one-word per line)
                        98(109)
Towards Semantic Tagging of Entities

• The MUC NE task tagged selected segments of text
  whenever that text represents the name of an entity.

• In ACE (Automated Content Extraction), these names
  are viewed as mentions of the underlying entities. The
  main task is to detect (or infer) the mentions in the text
  of the entities themselves.
• ACE focuses on domain- and genre-independent
  approaches
• ACE corpus contains newswire, broadcast news (ASR
  output and cleaned), and newspaper reports (OCR
  output and cleaned)
                           99(109)
                 ACE Entities

• Dealing with
  – Proper names – e.g., England, Mr. Smith, IBM
  – Pronouns – e.g., he, she, it
  – Nominal mentions – the company, the spokesman
• Identify which mentions in the text refer to
  which entities, e.g.,
  – Tony Blair, Mr. Blair, he, the prime minister, he
  – Gordon Brown, he, Mr. Brown, the chancellor

                         100(109)
                    ACE Example
<entity ID="ft-airlines-27-jul-2001-2"
     GENERIC="FALSE"
     entity_type = "ORGANIZATION">
     <entity_mention ID="M003"
                     TYPE = "NAME"
                     string = "National Air Traffic Services">
     </entity_mention>
     <entity_mention ID="M004"
                     TYPE = "NAME"
                     string = "NATS">
     </entity_mention>
     <entity_mention ID="M005"
                     TYPE = "PRO"
                     string = "its">
     </entity_mention>
     <entity_mention ID="M006"
                     TYPE = "NAME"
                     string = "Nats">
     </entity_mention>
  </entity>

                              101(109)
            ACE Entities (2)

• Some entities can have different roles, i.e.,
  behave as Organizations, Locations, or Persons
  – GPEs (Geo-political entities)
• New York [GPE – role: Person], flush with
  Wall Street money, has a lot of loose change
  jangling in its pockets.
• All three New York [GPE – role: Location]
  regional commuter train systems were found to
  be punctual more than 90 percent of the time.
                      102(109)
    Further information on ACE

• ACE is a closed-evaluation initiative, which
  does not allow the publication of results
• Further information on guidelines and corpora
  is available at:
• http://www.ldc.upenn.edu/Projects/ACE/
• ACE also includes other IE tasks, for further
  details see Doug Appelt’s presentation:
  http://www.clsp.jhu.edu/ws03/groups/sparse/presentations/doug.ppt

                     103(109)
   Evaluating Richer NE Tagging
• Need for new metrics
  when evaluating
  hierarchy/ontology-
  based NE tagging
• Need to take into
  account distance in the
  hierarchy
• Tagging a company as a
  charity is less wrong
  than tagging it as a
  person
                        104(109)
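One possible way to express such a distance-sensitive score (our assumption, not an agreed metric): count the steps separating the gold class and the assigned class in the hierarchy.

PARENT = {"Company": "Organization", "Charity": "Organization",
          "Organization": "Entity", "Person": "Entity", "Entity": None}

def path_to_root(node):
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def hierarchy_distance(gold, answer):
    gold_path, answer_path = path_to_root(gold), path_to_root(answer)
    shared = len(set(gold_path) & set(answer_path))   # common ancestors (incl. the root)
    return (len(gold_path) - shared) + (len(answer_path) - shared)

print(hierarchy_distance("Company", "Charity"))   # 2 - siblings under Organization
print(hierarchy_distance("Company", "Person"))    # 3 - they share only the root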
THANK YOU!




   105(109)
                           Further Reading
•   Aberdeen J., Day D., Hirschman L., Robinson P. and Vilain M. 1995. MITRE:
    Description of the Alembic System Used for MUC-6. MUC-6 proceedings.
    Pages 141-155. Columbia, Maryland. 1995.
•   Black W.J., Rinaldi F., Mowatt D. Facile: Description of the NE System Used For
    MUC-7. Proceedings of 7th Message Understanding Conference, Fairfax, VA, 19
    April - 1 May, 1998.
•   Borthwick. A. A Maximum Entropy Approach to Named Entity Recognition.
    PhD Dissertation. 1999
•   Bikel D., Schwartz R., Weischedel R. An algorithm that learns what’s in a name.
    Machine Learning 34, pp.211-231, 1999
•   Carreras X., Màrquez L., Padró L. 2002. Named Entity Extraction using AdaBoost.
    The 6th Conference on Natural Language Learning. 2002
•   Chang J.S., Chen S. D., Zheng Y., Liu X. Z., and Ke S. J. Large-corpus-based
    methods for Chinese personal name recognition. Journal of Chinese Information
    Processing, 6(3):7-15, 1992
•   Chen H.H., Ding Y.W., Tsai S.C. and Bian G.W. Description of the NTU System
    Used for MET2. Proceedings of 7th Message Understanding Conference, Fairfax,
    VA, 19 April - 1 May, 1998.
•   Chinchor N. MUC-7 Named Entity Task Definition Version 3.5.
    Available from ftp.muc.saic.com/pub/MUC/MUC7-guidelines, 1997


                                      106(109)
                           Further reading (2)
•   Collins M., Singer Y. Unsupervised models for named entity classification
    In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural
    Language Processing and Very Large Corpora, 1999
•   Collins M. Ranking Algorithms for Named-Entity Extraction: Boosting and the Voted
    Perceptron. Proceedings of the 40th Annual Meeting of the ACL, Philadelphia, pp.
    489-496, July 2002.
•   Gotoh Y., Renals S. Information extraction from broadcast news. Philosophical
    Transactions of the Royal Society of London, series A: Mathematical, Physical and
    Engineering Sciences, 2000.
•   Grishman R. The NYU System for MUC-6 or Where's the Syntax? Proceedings of the
    MUC-6 workshop, Washington. November 1995.
•   [Ign03a] C. Ignat and B. Pouliquen and A. Ribeiro and R. Steinberger. Extending an
    Information Extraction Tool Set to Eastern-European Languages. Proceedings of
    Workshop on Information Extraction for Slavonic and other Central and Eastern
    European Languages (IESL'03). 2003.
•   Krupka G. R., Hausman K. IsoQuest Inc.: Description of the NetOwlTM Extractor
    System as Used for MUC-7. Proceedings of 7th Message Understanding
    Conference, Fairfax, VA, 19 April - 1 May, 1998.
•   McDonald D. Internal and External Evidence in the Identification and Semantic
    Categorization of Proper Names. In B.Boguraev and J. Pustejovsky editors: Corpus
    Processing for Lexical Acquisition. Pages 21-39. MIT Press. Cambridge, MA. 1996
•   Mikheev A., Grover C. and Moens M. Description of the LTG System Used for MUC-
    7. Proceedings of 7th Message Understanding Conference, Fairfax, VA, 19 April - 1
    May, 1998
•   Miller S., Crystal M., et al. BBN: Description of the SIFT System as Used for MUC-7.
    Proceedings of 7th Message Understanding Conference, Fairfax, VA, 19 April - 1
    May, 1998
                                        107(109)
                         Further reading (3)
•   Palmer D., Day D.S. A Statistical Profile of the Named Entity Task.
    Proceedings of the Fifth Conference on Applied Natural Language
    Processing, Washington, D.C., March 31- April 3, 1997.
•   Sekine S., Grishman R. and Shinou H. A decision tree method for finding
    and classifying names in Japanese texts. Proceedings of the Sixth
    Workshop on Very Large Corpora, Montreal, Canada, 1998
•   Sun J., Gao J.F., Zhang L., Zhou M., Huang C.N. Chinese Named Entity
    Identification Using Class-based Language Model. In proceeding of the
    19th International Conference on Computational Linguistics
    (COLING2002), pp.967-973, 2002.
•   Takeuchi K., Collier N. Use of Support Vector Machines in Extended
    Named Entity Recognition. The 6th Conference on Natural Language
    Learning. 2002
•   D.Maynard, K. Bontcheva and H. Cunningham. Towards a semantic
    extraction of named entities. Recent Advances in Natural Language
    Processing, Bulgaria, 2003.
•   M. M. Wood and S. J. Lydon and V. Tablan and D. Maynard and H.
    Cunningham. Using parallel texts to improve recall in IE. Recent Advances
    in Natural Language Processing, Bulgaria, 2003.
•   D.Maynard, V. Tablan and H. Cunningham. NE recognition without training
    data on a language you don't speak. ACL Workshop on Multilingual and
    Mixed-language Named Entity Recognition: Combining Statistical and
    Symbolic Models, Sapporo, Japan, 2003.
                                   108(109)
                          Further reading (4)
•   H. Saggion, H. Cunningham, K. Bontcheva, D. Maynard, O. Hamza, Y. Wilks.
    Multimedia Indexing through Multisource and Multilingual Information
    Extraction; the MUMIS project. Data and Knowledge Engineering, 2003.
•   D. Manov and A. Kiryakov and B. Popov and K. Bontcheva and D. Maynard, H.
    Cunningham. Experiments with geographic knowledge for information
    extraction. Workshop on Analysis of Geographic References, HLT/NAACL'03,
    Canada, 2003.
•   H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan. GATE: A Framework
    and Graphical Development Environment for Robust NLP Tools and
    Applications. Proceedings of the 40th Anniversary Meeting of the Association
    for Computational Linguistics (ACL'02). Philadelphia, July 2002.
•   H. Cunningham. GATE, a General Architecture for Text Engineering.
    Computers and the Humanities, volume 36, pp. 223-254, 2002.
•   D. Maynard, H. Cunningham, K. Bontcheva, M. Dimitrov. Adapting A Robust
    Multi-Genre NE System for Automatic Content Extraction. Proc. of the 10th
    International Conference on Artificial Intelligence: Methodology, Systems,
    Applications (AIMSA 2002), 2002.
•   E. Paskaleva and G. Angelova and M.Yankova and K. Bontcheva and H.
    Cunningham and Y. Wilks. Slavonic Named Entities in GATE. 2003. CS-02-
    01.
•   K. Pastra, D. Maynard, H. Cunningham, O. Hamza, Y. Wilks. How feasible is
    the reuse of grammars for Named Entity Recognition? Language Resources
    and Evaluation Conference (LREC'2002), 2002.
                                     109(109)

				