Corpus Annotation for Computational Linguistics

Managing Linguistic Data




              Heng Ji
        hengji@cs.qc.cuny.edu
             April 29, 2011




                                1
Outline
 How do we design a new language resource
  and ensure that its coverage, balance, and
  documentation support a wide range of uses?
 When existing data is in the wrong format for
  some analysis tool, how can we convert it to a
  suitable format?
 What is a good way to document the
  existence of a resource we have created so
  that others can easily find it?

                                               2
      Why Bother with Corpora

  Human Expert Speakers                  vs.  Corpora
  Partial knowledge of a language        vs.  More comprehensive and balanced
  Notice the unusual and think of
  what is possible                       vs.  Show what is common and typical
  Cannot quantify knowledge              vs.  Give more accurate statistics
  Cannot remember everything             vs.  Store and recall all the information
  Cannot make up natural examples        vs.  Provide many real examples
  Cannot keep up with language change    vs.  Can be constantly updated to
                                              reflect even recent changes
  Lack authority                         vs.  Encompass actual language use of
                                              many expert speakers

                                                                      3
What is Corpus Annotation

 Corpus
     A collection of machine-readable linguistic
      data, including written, spoken, or both, in
      one or multiple (e.g. parallel) languages

 Annotated Corpus
     Corpus with explicit linguistic information
      through concrete annotation



                                                     4
Why Corpus Annotation is Useful
 Corpus Linguistics
      Corpus-based approaches: hypotheses are checked
       against a corpus (as sources of examples)
      Corpus-driven approaches: hypotheses are drawn from
       the corpus (to gather statistics)

 Computational Linguistics (CL)
    Statistical and/or rule-based modeling of natural
     language from a computational perspective
     Such rules or statistical models have to be 'trained' for
     a particular domain and analysis task
    Need common data sets for system comparison


                                                                 5
Annotation Types
 Text Annotation
    Lexical: tokenization, part of speech, head, lemmas
    Parsing and chunking
    Semantic tagging: semantic role, word sense
    Certain expressions: named entities
    Discourse: coreference, discourse segments

 Speech Annotation
    Phonetic transcription
     Segmentation (punctuation)
    Prosody

 Problem-oriented tagging

                                                           6
Corpus Structure: A case study on
TIMIT




                                    7
Phone Check
 Sentence:
     a. she had your dark suit in greasy wash water
      all year
     b. don't ask me to carry an oily rag like that
 Phones
   >>> phonetic = nltk.corpus.timit.phones('dr1-fvmh0/sa1')
   >>> phonetic
    ['h#', 'sh', 'iy', 'hv', 'ae', 'dcl', 'y', 'ix', 'dcl', 'd', 'aa',
    'kcl', 's', 'ux', 'tcl', 'en', 'gcl', 'g', 'r', 'iy', 's', 'iy', 'w',
    'aa', 'sh', 'epi', 'w', 'aa', 'dx', 'ax', 'q', 'ao', 'l', 'y', ...]
                                                                        8
Lexicon Check
 >>> timitdict = nltk.corpus.timit.transcription_dict()
 >>> timitdict['greasy'] + timitdict['wash'] + timitdict['water']
   ['g', 'r', 'iy1', 's', 'iy', 'w', 'ao1', 'sh', 'w', 'ao1', 't',
   'axr']
 >>> phonetic[17:30]
  ['g', 'r', 'iy', 's', 'iy', 'w', 'aa', 'sh', 'epi', 'w', 'aa',
  'dx', 'ax']

                                                                    9
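The dictionary pronunciation and the phones actually observed differ in systematic ways (stress digits dropped, vowel quality shifts, an inserted 'epi'). A small standard-library sketch makes the mismatches explicit, with difflib standing in for any NLTK comparison utility; the two phone lists are copied from this slide:

```python
import difflib

# Dictionary pronunciation of "greasy wash wa..." vs. the phones
# observed in the TIMIT recording (both sequences from the slide).
dict_phones = ['g', 'r', 'iy1', 's', 'iy', 'w', 'ao1', 'sh',
               'w', 'ao1', 't', 'axr']
observed = ['g', 'r', 'iy', 's', 'iy', 'w', 'aa', 'sh', 'epi',
            'w', 'aa', 'dx', 'ax']

# SequenceMatcher aligns the two phone sequences; non-'equal'
# opcodes mark where pronunciation diverges from the lexicon.
sm = difflib.SequenceMatcher(None, dict_phones, observed)
for tag, i1, i2, j1, j2 in sm.get_opcodes():
    if tag != 'equal':
        print(tag, dict_phones[i1:i2], '->', observed[j1:j2])
```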
Speaker Check
 >>> nltk.corpus.timit.spkrinfo('dr1-fvmh0')
 SpeakerInfo(id='VMH0', sex='F', dr='1', use='TRN',
     recdate='03/11/86', birthdate='01/08/60', ht='5\'05"',
     race='WHT', edu='BS',
     comments='BEST NEW ENGLAND ACCENT SO FAR')



                                                 10
Notable Design Features
 the corpus contains two layers of annotation, at the
    phonetic and orthographic levels
   balance across multiple dimensions of variation, for
    coverage of dialect regions and diphones
   there is a sharp division between the original
    linguistic event captured as an audio recording and
    the annotations of that event
    hierarchical structure (more on the next slides)
   a split between training and testing sets
   transcriptions and associated data available

                                                           11
Hierarchical Structure




                         12
Fundamental Data Types




                         13
Corpus Creation Scenario 1
 the design unfolds in the course of the
  creator's explorations
 The resulting corpus is then used during
  subsequent years of research
 may serve as an archival resource indefinitely




                                                 14
Corpus Creation Scenario 2
 experimental research where a body of
  carefully designed material is collected from a
  range of human subjects, then analyzed to
  evaluate a hypothesis or develop a
  technology
 shared and reused within a laboratory or
  company, and often to be published more
  widely
 Shared task or evaluation

                                                15
Corpus Creation Scenario 3
 “reference corpus” for a particular language,
  such as the American National Corpus (ANC)
  and the British National Corpus (BNC)
 produce a comprehensive record of the many
  forms, styles, and uses of a language
 a heavy reliance on automatic annotation
  tools together with post-editing to fix any
  errors


                                                  16
 General Procedure
1.   Electronic form (Data conversion, speech transcription)

2.   Size and genre selection (novel, technical report,
     broadcast news, web blog…), permission

3.   Cleaning, spell-checking

4.   Writing annotation guidelines

5.   Annotation (Encoding)

6.   Annotation Evaluation

7.   Distribution of corpora
                                                          17
Issues in Corpus Annotation
 In the past (before 2001) there were many markup
  standards
 Corpus construction as a scientific experiment:
      Ensuring the corpus is an appropriate SAMPLE
      Ensuring the annotation is done RELIABLY
           Addressing the problem of AMBIGUITY and OVERLAP

 Corpus construction as resource building:
      Finding the appropriate MARKUP METHOD
           Annotation should be separable
           Makes REUSE & EXCHANGE easy
      Need Corpus Annotation Standard

                                                              18
Corpus Encoding Standard (CES)

 Goal
     Identify a minimal and widely accepted set of encoding
      standards for corpora across the language engineering
      community

 Content
     Use XML for markup language
     Meta-language level recommendation (character
      set, …)
     Tagsets and recommendations for documentation,
      data selection and annotation

                                                           19
Text Encoding Initiative (TEI)

 Goal
     Develop standards and guidelines for data
      interchange

 Content
     400 different textual components and concepts
     Defined by DTD/XML schema
     Use Unicode
     Can be customized for particular research purpose
      (e.g. TEI Lite)


                                                          20
Annotation Guideline Documentation

 Goal
      Definitions for each element used in the markup scheme
      Any additional information needed to correctly process and
       interpret the markup scheme
 Principle
     “God's truth” standard doesn't exist
      New projects respect the outcomes/lessons of earlier
       projects
      Detailed, specific, explicit
 Example: What should be marked as a 'person' name?
      General definition and example
      Titles, honorifics and positions?
      Fictional characters, names of animals, names of fictional
       animals?
                                                                    21
Corpus Annotation Pipelines

  Pipeline A (dual annotation):
      Raw Corpus -> Single Annotator 1
      Raw Corpus -> Single Annotator 2
      both outputs -> Expert Adjudication -> Annotated Corpus

  Pipeline B (tiered annotation):
      Raw Corpus -> Junior Annotator -> Senior Annotator
      -> Expert Quality Control -> Annotated Corpus
                                                             22
Corpus Evaluation: Consistency
(Inter-Annotator Agreement)

  Single Annotator 1 produces the set A of annotations; Single
  Annotator 2 produces the set C; B is the subset on which the
  two annotators agree.

   Precision (P) = #B / #A
   Recall (R) = #B / #C
   F-Measure (F) = 2*P*R / (P+R)
   Random-Agree-Rate (E): the proportion of times annotators
                             agree by chance
   Kappa Coefficient (K) = (F-E) / (1-E)
                     K = 0: no agreement
                     0.6 <= K < 0.8: tentative agreement
                     0.8 <= K <= 1: good agreement
                                                          23
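The formulas above can be checked with hypothetical counts. This is a sketch only: the counts are invented, and E is simply assumed rather than estimated from the label distribution as Cohen's kappa would do.

```python
# Hypothetical counts: annotator 1 marked 100 items (A), annotator 2
# marked 90 (C), and they agreed on 80 of them (B).
n_a, n_b, n_c = 100, 80, 90

p = n_b / n_a                # Precision = #B / #A
r = n_b / n_c                # Recall    = #B / #C
f = 2 * p * r / (p + r)      # F-Measure

e = 0.5                      # assumed chance-agreement rate
kappa = (f - e) / (1 - e)    # Kappa = (F - E) / (1 - E)

print("F = %.3f, K = %.3f" % (f, kappa))
```

With these numbers F falls in the "good agreement" band (0.842) while kappa, after discounting chance agreement, is only tentative (0.684), which is exactly why kappa is reported alongside raw agreement.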
Corpus Evaluation: Accuracy



  The Single Annotator produces the set A of annotations; the
  Final (gold-standard) Annotation is the set C; B is the subset
  on which they agree.

   Precision (P) = #B / #A
   Recall (R) = #B / #C
   F-Measure (F) = 2*P*R / (P+R)

                                          24
Evolution of a Corpus Over Time




                                  25
Obtain Data from the Web
import re
import nltk

def lexical_data(html_file):
    SEP = '_ENTRY'
    html = open(html_file).read()
    html = re.sub(r'<p', SEP + '<p', html)
    text = nltk.clean_html(html)
    text = ' '.join(text.split())
    for entry in text.split(SEP):
        if entry.count(' ') > 2:
            yield entry.split(' ', 3)

>>> import csv
>>> writer = csv.writer(open("dict1.csv", "wb"))
>>> writer.writerows(lexical_data("dict.htm"))
                                                   26
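A round-trip check of the CSV step above, written in Python 3 style (newline='' replaces the Python 2 "wb" mode on the slide); the sample entries are invented stand-ins for what lexical_data() would yield:

```python
import csv

# Hypothetical 4-field rows mirroring the entry.split(' ', 3)
# output of lexical_data() above (the entries are invented).
entries = [
    ['sleep', 'n', 'rest:', 'the condition of body and mind that recurs'],
    ['walk', 'v', 'move:', 'by lifting and setting down each foot'],
]

# Write the rows out, then read them back to confirm the round trip.
with open('dict1.csv', 'w', newline='') as f:
    csv.writer(f).writerows(entries)

with open('dict1.csv', newline='') as f:
    rows = list(csv.reader(f))

assert rows == entries
```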
Format Conversion
   >>> idx = nltk.Index((defn_word, lexeme)              # construct index
   ...                  for (lexeme, defn) in pairs
   ...                  for defn_word in nltk.word_tokenize(defn)  # tokenize
   ...                  if len(defn_word) > 3)           # discard short words
   >>> idx_file = open("dict.idx", "w")
   >>> for word in sorted(idx):
   ...     idx_words = ', '.join(idx[word])
   ...     idx_line = "%s: %s\n" % (word, idx_words)
   ...     idx_file.write(idx_line)
   >>> idx_file.close()




                                                                 27
Output
   body: sleep
   cease: wake
   condition: sleep
   down: walk
   each: walk
   foot: walk
   lifting: walk
   mind: sleep
   progress: walk
   setting: walk
   sleep: wake
                       28
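The index construction and the output above can be approximated with the standard library alone. A sketch under stated assumptions: the (lexeme, definition) pairs are invented, str.split() stands in for nltk.word_tokenize, and defaultdict(list) stands in for nltk.Index:

```python
from collections import defaultdict

# Invented (lexeme, definition) pairs like those fed to nltk.Index.
pairs = [
    ("sleep", "the condition of body and mind that recurs at night"),
    ("walk", "move by lifting and setting down each foot in turn"),
    ("wake", "cease to sleep"),
]

idx = defaultdict(list)
for lexeme, defn in pairs:
    for defn_word in defn.split():      # crude stand-in for word_tokenize
        if len(defn_word) > 3:          # discard short words
            idx[defn_word].append(lexeme)

for word in sorted(idx):
    print("%s: %s" % (word, ", ".join(idx[word])))
```

This prints definition-word -> lexeme lines of the same shape as the output slide, e.g. "body: sleep" and "cease: wake".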
Standards and Tools




                      29
Using Markup Languages:
SGML/XML
 SGML was developed as a universal markup
  format; XML is a simplified version with a
  standard format for attributes

 Why use XML annotation in Computational
  Linguistics
     Flexible to extend attributes
     Support multi-functionality (multi-layer annotations)
     Easy to extract target information for statistics
     Format: Inline/Stand-off

                                                              30
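The inline vs. stand-off distinction on this slide can be sketched in a few lines; the sentence, entity labels, and character offsets here are invented for illustration:

```python
# A plain text to be annotated (invented example).
text = "Heng Ji works in New York"

# Inline: markup is interleaved with the text itself.
inline = "<PER>Heng Ji</PER> works in <GPE>New York</GPE>"

# Stand-off: the text stays untouched; annotations point into it by
# character offsets, so several annotation layers can coexist over
# one source document.
standoff = [(0, 7, "PER"), (17, 25, "GPE")]
for start, end, label in standoff:
    print(text[start:end], "->", label)
```

Stand-off markup is what makes annotations "separable" in the sense of the previous slide: the raw corpus is never modified, and each layer can be exchanged or replaced independently.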
The ElementTree Interface
  >>> merchant_file = nltk.data.find('corpora/shakespeare/merchant.xml')
  >>> raw = open(merchant_file).read()
  >>> print raw[0:168]
  <?xml version="1.0"?>
  <?xml-stylesheet type="text/css" href="shakes.css"?>
  <!-- <!DOCTYPE PLAY SYSTEM "play.dtd"> -->
  <PLAY>
  <TITLE>The Merchant of Venice</TITLE>

  >>> print raw[1850:2075]
  <TITLE>ACT I</TITLE>
  <SCENE><TITLE>SCENE I. Venice. A street.</TITLE>
  <STAGEDIR>Enter ANTONIO, SALARINO, and
     SALANIO</STAGEDIR>
  <SPEECH>
  <SPEAKER>ANTONIO</SPEAKER>
  <LINE>In sooth, I know not why I am so sad:</LINE>

                                                                           31
The ElementTree Interface
>>> from nltk.etree.ElementTree import ElementTree
>>> merchant = ElementTree().parse(merchant_file)
>>> merchant
<Element PLAY at 22fa800>
>>> merchant[0]
<Element TITLE at 22fa828>
>>> merchant[0].text
'The Merchant of Venice'
>>> merchant.getchildren()
[<Element TITLE at 22fa828>, <Element PERSONAE at 22fa7b0>,
   <Element SCNDESCR at 2300170>,
<Element PLAYSUBT at 2300198>, <Element ACT at 23001e8>,
   <Element ACT at 234ec88>,
<Element ACT at 23c87d8>, <Element ACT at 2439198>,
   <Element ACT at 24923c8>]
                                                          32
The ElementTree Interface
>>> merchant[-2][0].text
'ACT IV'
>>> merchant[-2][1]
<Element SCENE at 224cf80>
>>> merchant[-2][1][0].text
'SCENE I. Venice. A court of justice.'
>>> merchant[-2][1][54]
<Element SPEECH at 226ee40>
>>> merchant[-2][1][54][0]
<Element SPEAKER at 226ee90>
>>> merchant[-2][1][54][0].text
'PORTIA'
>>> merchant[-2][1][54][1]
<Element LINE at 226eee0>
>>> merchant[-2][1][54][1].text
"The quality of mercy is not strain'd,"

                                          33
The ElementTree Interface
>>> for i, act in enumerate(merchant.findall('ACT')):
...     for j, scene in enumerate(act.findall('SCENE')):
...         for k, speech in enumerate(scene.findall('SPEECH')):
...             for line in speech.findall('LINE'):
...                 if 'music' in str(line.text):
...                     print "Act %d Scene %d Speech %d: %s" % (
...                         i+1, j+1, k+1, line.text)
Act 3 Scene 2 Speech 9: Let music sound while he doth make his
     choice;
Act 3 Scene 2 Speech 9: Fading in music: that the comparison
Act 3 Scene 2 Speech 9: And what is music then? Then music is
Act 5 Scene 1 Speech 23: And bring your music forth into the air.
Act 5 Scene 1 Speech 23: Here will we sit and let the sounds of
     music


                                                                    34
The ElementTree Interface

>>> speaker_seq = [s.text for s in
   merchant.findall('ACT/SCENE/SPEECH/SPEAKER')]
>>> speaker_freq = nltk.FreqDist(speaker_seq)
>>> top5 = speaker_freq.keys()[:5]
>>> top5
['PORTIA', 'SHYLOCK', 'BASSANIO', 'GRATIANO',
   'ANTONIO']




                                               35
The ElementTree Interface
>>> mapping = nltk.defaultdict(lambda: 'OTH')
>>> for s in top5:
...     mapping[s] = s[:4]
...
>>> speaker_seq2 = [mapping[s] for s in speaker_seq]
>>> cfd = nltk.ConditionalFreqDist(nltk.ibigrams(speaker_seq2))
>>> cfd.tabulate()
         ANTO BASS GRAT  OTH PORT SHYL
   ANTO     0   11    4   11    9   12
   BASS    10    0   11   10   26   16
   GRAT     6    8    0   19    9    5
   OTH      8   16   18  153   52   25
   PORT     7   23   13   53    0   21
   SHYL    15   15    2   26   21    0

                                                                  36
Resources: Existing Corpora

 Brown corpus, LOB Corpus, British National Corpus
 Bank of English
 Wall Street Journal, Penn Tree Bank, Nombank,
    Propbank, BNC, ANC, ICE, WBE, Reuters Corpus
   Canadian Hansard: parallel corpus English-French
   York-Helsinki Parsed corpus of Old Poetry
   Tiger corpus – German
   CORII/CODIS - contemporary written Italian
   MULTEX 1984 and The Republic in many languages
   MapTask

                                                      37
Distributors of Corpora

 LDC (Linguistic Data Consortium)
 ELRA (European Language Resources
  Association)
 TRACTOR (TELRI Research Archive of
  Computational Tools and Resources)
 ICAME (International Computer Archive of
  Modern and Medieval English)



                                             38
Annotation Tools
 Uniform CL Annotation Platform
       UIMA (IBM NLP platform): http://incubator.apache.org/uima/svn.html
       Mallet (UMASS): http://mallet.cs.umass.edu/index.php/Main_Page
       MinorThird (CMU): http://minorthird.sourceforge.net/

 POS Tagging
       CLAWS: http://www.comp.lancs.ac.uk/computing/research/ucrel/claws/tagservice.html
       Brill tagger: http://www.cs.jhu.edu/~brill/code.html
       Lingpipe: http://alias-i.com/lingpipe/web/demo-pos.html

 Information Extraction
       Jet (NYU): http://www.cs.nyu.edu/cs/faculty/grishman/jet/license.html
       Gate (University of Sheffield): http://gate.ac.uk/download/index.html

 Machine Translation
       Compara: http://adamastor.linguateca.pt/COMPARA/Welcome.html
       ISI decoder: http://www.isi.edu/licensed-sw/rewrite-decoder/



                                                                                 39

								