Prague Dependency Treebank 1.0

Functional Generative Description

    theoretical framework based on the findings of European
     structural linguistics, esp. of the classical Prague School
    methodological requirements of a formal description
    levels:
      tectogrammatical (underlying) representations (TRs) with
        dependency-based syntax
      morphemics
      phonemics and phonetics

    TRs (see Sgall, Hajičová and Panevová 1986, formally specified by
     Petkevič, also in a declarative way)
The Language Layers

   Phonemic
   Morphonological
   Morphemic
   Analytical (surface syntax)
   Tectogrammatical (deep syntax)
Dependency Tree

My younger brother arrived there yesterday.

Linearized form, one-to-one relation:
((I)Appurt (younger)Rstr brother)Act arrive.Pret.Indic (Dir there) (Temp yesterday)
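
The one-to-one relation between the tree and its linearized form can be sketched in a few lines of Python. This is only an illustration: the `Node` class and the `linearize` function are hypothetical names, and the rule that the functor is written on the side of the parenthesis facing the head is read off the example above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node of a dependency tree (hypothetical helper type)."""
    label: str                                   # lexical label with grammatemes
    functor: str = ""                            # functor relative to the head
    left: list = field(default_factory=list)     # dependents preceding the node
    right: list = field(default_factory=list)    # dependents following the node


def linearize(node, side=None):
    """Return the bracketed notation; `side` is the node's position relative to its head."""
    parts = [linearize(d, "L") for d in node.left]
    parts.append(node.label)
    parts += [linearize(d, "R") for d in node.right]
    body = " ".join(parts)
    if side == "L":                  # head is to the right: functor after ")"
        return f"({body}){node.functor}"
    if side == "R":                  # head is to the left: functor after "("
        return f"({node.functor} {body})"
    return body                      # the root carries no functor


tree = Node(
    "arrive.Pret.Indic",
    left=[Node("brother", "Act", left=[Node("I", "Appurt"), Node("younger", "Rstr")])],
    right=[Node("there", "Dir"), Node("yesterday", "Temp")],
)
print(linearize(tree))
# ((I)Appurt (younger)Rstr brother)Act arrive.Pret.Indic (Dir there) (Temp yesterday)
```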
Dependency Tree

   labels - lexical meanings (abstract symbols) with indices
       functors
           subscripts at parentheses oriented towards head
       grammatemes - values of morphological categories
           Tense, Modality, Number, Definiteness, etc.
   projectivity
   valency
       arguments (inner participants) and adjuncts (circumstantials or 'free modifications')
       obligatory and optional with a given head
       deletable or not
Dependency Tree

   Arguments/participants of verbs
       Actor/Bearer (underlying subject)
       Objective (Patient, underlying direct object)
       Addressee (underlying indirect object)
       Effect ('second' object: to choose so. as sth.)
       Origin (to make sth. out of sth.)
   Adjuncts
       Locative, several Directional and Temporal modifications
       Condition, Means, Manner, etc.
Dependency Tree

Complementations dependent mainly on nouns
   Arguments (inner participants)
       Material (Partitive): two baskets of sth.
       Identity: the river Danube; the notion of operator
   Adjuncts (free modifications)
       Possession (Appurtenance): my table; Jim's brother
       Restrictive: rich man
       Descriptive: the Swedes, who are a Scandinavian nation
Dependency Tree

   syntactic grammatemes
       Loc, Dir - in, on, under, between...
       Regard - with, without
   operational (testable) criteria
       for distinguishing arguments from adjuncts, and from each other
       deletability (dialogue test)
Simplified valency frames

    read V Act Addr Obj
    change V Act Obj Orig Eff
    give V Act Addr Obj
    brother N Appurt
    man N
    glass N Material
    full A Material

obligatory complementations in blue
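
A minimal sketch of how such frames could be stored and consulted. The dictionary copies the frames listed above; `VALENCY_FRAMES` and `split_by_frame` are hypothetical names, and because the colour coding of obligatory complementations does not survive in this text version, the sketch does not model obligatoriness.

```python
# Simplified valency frames copied from the slide: head -> (part of speech, functors).
# Obligatory vs. optional slots are NOT distinguished here (the colour coding is lost).
VALENCY_FRAMES = {
    "read": ("V", ["Act", "Addr", "Obj"]),
    "change": ("V", ["Act", "Obj", "Orig", "Eff"]),
    "give": ("V", ["Act", "Addr", "Obj"]),
    "brother": ("N", ["Appurt"]),
    "man": ("N", []),
    "glass": ("N", ["Material"]),
    "full": ("A", ["Material"]),
}


def split_by_frame(head, functors):
    """Split the functors observed under `head` into frame slots and everything else."""
    _, frame = VALENCY_FRAMES[head]
    in_frame = [f for f in functors if f in frame]
    outside = [f for f in functors if f not in frame]
    return in_frame, outside


print(split_by_frame("change", ["Act", "Obj", "Temp"]))
# (['Act', 'Obj'], ['Temp'])
```

Functors that fall outside the frame (here `Temp`) are candidates for free modifications; deciding that properly requires the operational criteria mentioned earlier, which this lookup does not implement.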
Topic-focus articulation

   contextual boundness
       main verb CB/NB (T/F)
       dependents to the left/right
   communicative dynamism
       left-right (mother, sisters, transitive)
       partial ordering
   underlying word order
       left-right
       linear ordering

[Figure: dependency tree of the example sentence; visible labels: T, there, young]

The left-to-right order of nodes, together with the index T or (prototypically) F,
indicates the TFA of the sentence (of the TR).
Topic-focus articulation


[Figure: dependency tree with topic (T) and focus (F) marked; visible labels: there, young, yesterday]

   TFA - one of the basic aspects of underlying structures
Complex sentence

My brother, whom you know, arrived there yesterday.

   a subordinated (dependent) clause (i.e. its main verb)
    depends on a word contained in its governing clause
Complex sentence

Martin came there late, since he had to accompany his sick mother.

   function words (synsemantic) are viewed as function
    morphemes, syntactically fixed to certain lexical (autosemantic)
    words - prepositions and articles to nouns, conjunctions and
    auxiliaries to verbs
Complex sentence

Martin arrived late to the session, since he had to accompany his sick mother.

schematically (morphemes):
Martin arrive.ed late to the session since he have.ed to accompany
he.s sick mother.

dot - close connection of morphemes ('semes')
   deleted items restored
     order of items - difference between 'underlying' and surface
      (morphemic) word order
     transductive components - Panevová, Oliva, Borota


   coordination (multidimensional)
     Jim and Mary, who have two children, went to Boston.

     the linearized notation is adequate:
     ((Jim Mary)Conj ((who)Act have (Pat (two)Rstr children)))Act went (Dir Boston)

   structures close to Boolean, i.e. no complex 'innate properties'
    specific for natural language are needed.
Prague Dependency Treebank - corpus annotation
   an intermediate level - 'analytical' representations
       dependency trees, not always projective
       nodes for all word tokens, even for punctuation marks
   tectogrammatical tree: coordinating conjunction as the head
Prague Dependency Treebank 1.0
Morphological Layer

ANNOTATED CORPORA

   PDT version 1.0, 2000 (1996 - 2000)
       (currently) ver. 2
   Penn Treebank, release 3, 1999 (1989 - 1999)
       PropBank (currently)
The Levels in PDT

   Morphemic
   Analytical
   Tectogrammatical
TAG SETs

   Czech - ambiguous inflective language
nový, nového, novému, novém, novým, nová, nové, novou, nových,
novým, novými, … novější, novějšího, novějšímu, novějším, …,
nejnovější, nejnovějšího, nejnovějšímu, nejnovějším, … nejnovějších,
nejnovějším, …

   English - language with poor inflection
work, works, worked, working
TEXT SOURCES

   PDT (taken from the Czech National Corpus)
       Lidové noviny
       Mladá Fronta Dnes
       Vesmír
       Českomoravský Profit
   Penn Treebank
       '88, '89 WSJ articles
       Air Travel Information System transcripts
       Brown Corpus
       Switchboard transcripts
ANNOTATION STRATEGY - Penn Treebank

   text
   -> Ken Church's stochastic tagger, Eric Brill's transformation tagger
   -> corrections by annotator (GNU Emacs Lisp based package)
ANNOTATION STRATEGY - PDT

   Automatic Morphological Analyzer (AMA)
   -> two independent annotators; Linux, Win tools
   -> differences resolved by a third annotator
   -> comparison with the current AMA; manual resolution; Win tools
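
The double-annotation step can be illustrated with a short sketch that lines up two annotators' tag sequences and collects the positions a third annotator would need to resolve. The function name and the data layout are assumptions made for illustration; the actual PDT tools are not described in this deck. The second annotator's tag for "o" below is invented to produce a disagreement.

```python
def compare_annotations(tokens, tags_a, tags_b):
    """Return (agreed, disputed) token lists for two parallel tag sequences."""
    if not (len(tokens) == len(tags_a) == len(tags_b)):
        raise ValueError("token and tag sequences must have the same length")
    agreed, disputed = [], []
    for index, (token, tag_a, tag_b) in enumerate(zip(tokens, tags_a, tags_b)):
        if tag_a == tag_b:
            agreed.append((token, tag_a))
        else:
            # Left for the third annotator: position, token, and both proposals.
            disputed.append((index, token, tag_a, tag_b))
    return agreed, disputed


tokens = ["Pokus", "o", "zázrak", "."]
annotator_a = ["NNIS1-----A----", "RR--4----------", "NNIS4-----A----", "Z:-------------"]
annotator_b = ["NNIS1-----A----", "RR--6----------", "NNIS4-----A----", "Z:-------------"]

agreed, disputed = compare_annotations(tokens, annotator_a, annotator_b)
print(disputed)
# [(1, 'o', 'RR--4----------', 'RR--6----------')]
```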
INTERNAL FORMAT

   PDT: SGML coding, csts dtd
   Penn Treebank: word/tag(|tag)*

SAMPLES

<s id="ln95040:020-p1s1">
<f>Pokus<l>pokus<t>NNIS1-----A----
<f>o<l>o<t>RR--4----------
<f>zázrak<l>zázrak<t>NNIS4-----A----
<d>.<l>.<t>Z:-------------

The/DT envelope/NN arrives/VBZ in/IN the/DT mail/NN ./.
CONVERSION

   SGML coding -> pdt2wsj.pl -> word/tag
   SGML coding -> pdt2wsjFLT.pl -> word/lemma/tag
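
As a rough illustration of what such a conversion involves, the sketch below turns token lines of the kind shown in the sample into the slash-separated form. It is not a port of pdt2wsj.pl or pdt2wsjFLT.pl, whose exact behaviour is not documented here; the regular expression covers only the `<f>`/`<d>`, `<l>`, `<t>` pattern visible in the sample, and `csts_to_slash` is a hypothetical name.

```python
import re

# Matches one token line: <f> or <d> (form), <l> (lemma), <t> (tag).
TOKEN_LINE = re.compile(r"<[fd][^>]*>([^<]+)<l>([^<]+)<t>(\S+)")


def csts_to_slash(lines, with_lemma=False):
    """Convert token lines to 'word/tag' or 'word/lemma/tag', space-separated."""
    converted = []
    for line in lines:
        match = TOKEN_LINE.match(line.strip())
        if match is None:          # e.g. the <s id="..."> sentence header
            continue
        form, lemma, tag = match.groups()
        converted.append(f"{form}/{lemma}/{tag}" if with_lemma else f"{form}/{tag}")
    return " ".join(converted)


sample = [
    '<s id="ln95040:020-p1s1">',
    "<f>Pokus<l>pokus<t>NNIS1-----A----",
    "<f>o<l>o<t>RR--4----------",
    "<f>zázrak<l>zázrak<t>NNIS4-----A----",
    "<d>.<l>.<t>Z:-------------",
]
print(csts_to_slash(sample))
# Pokus/NNIS1-----A---- o/RR--4---------- zázrak/NNIS4-----A---- ./Z:-------------
```

Passing `with_lemma=True` yields the three-field variant, for example `Pokus/pokus/NNIS1-----A----` for the first token.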
DATA SIZE

                           # word tokens   # sentences
PDT 1.0                    1 730K          112K
Penn Treebank release 3    4 600K          350K
DATA SETs of MORPHOLOGICALLY ANNOTATED DATA

                                   # tokens / # sentences
for tagging only
   training data                   1 470K / 95K
   development test data           130K / 8K
   evaluation test data            127K / 8K
for parsing (preprocessing step)
   training data                   475K / 29K
   development test data           130K / 8K
   evaluation test data            127K / 8K
TOOLS

   Automatic Morphological Analyser/Generator of Czech
       HMAnalyze.pl, HMGenerate.pl
       Dictionary: CZE_a
       Remote Access
   Czech Taggers
       HMM
       Exponential
Prague Dependency Treebank 1.0
Analytical Layer in PDT

Introduction

   Input: morphologically tagged sentences
   Graph Editor: "user-friendly" software
   Output: ATS structure
       "surface" syntax tree structure
       nodes labelled by the analytical functions
Analytical Functions
   Pred    - Predicate if it depends on the tree root
   Sb      - Subject
   Obj     - Object
   Adv     - Adverbial
   Atv     - Complement
   AtvV    - Complement, if one governor is present
   Atr     - Attribute
   Pnom    - Nominal predicate's nominal part, depends on the
                      copula "to be"
   AuxV    - Auxiliary verb "to be"
   Coord   - Coordination node
   Apos    - Apposition node
   AuxR    - Reflexive particle, which is neither Obj nor AuxT
                      (passive)
   AuxT    - Reflexive particle, lexically bound to the verb
Analytical Functions
   AuxP         - Preposition or a part of a compound preposition
   AuxC         - Subordinate conjunction
   AuxO         - (Superfluously) referring particle or emotional particle
   AuxZ         - Rhematizer or another node acting on another
                           constituent
   AuxX         - Comma, but not the main coordinating comma
   AuxG         - Other graphical symbols not classified as AuxK
   AuxY         - Other words, such as particles without a specific
                           syntactic function, parts of lexical idioms, etc.
   AuxS         - Sentence holder (the only added root of the tree)
   AuxK         - Punctuation at the end of the sentence
                           or direct speech or citation clause
   ExD          - Ellipsis handling: function for nodes which pseudo-depend
                           on a node on which they would not
                           depend if there were no ellipsis
   AtrAtr, AtrAdv, AdvAtr, AtrObj, ObjAtr + *_Co, *_Pa, *_Ap
Two stages (chronologically)

   (A) manual "analytic" annotation (ATS)
       training data for (B)(a)
   (B)
       (a) semiautomatic procedure (Collins's parser)
       (b) manual correction of (B)(a)
Constraints and limitations

   any string has a node of its own
       word-form, punctuation mark, etc.
       AuxV, AuxP, AuxC, AuxX, AuxG…
   reflecting the coordination and apposition relations
       the so-called third dimension of the graph in the plain tree
       (X_Co, X_Ap, X_Pa, where X is one of the analytic functions,
       such as Sb, Obj, Adv, etc.)
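
The suffix convention can be made concrete with a small helper that separates the analytic function from its relation suffix. `split_afun` is a hypothetical name; the slides do not gloss `_Pa`, so reading it as "parenthesis" follows its usual meaning in PDT and should be treated as an assumption here. Labels carrying more than one suffix are not handled.

```python
# Suffix -> relation it encodes. "_Pa" is not glossed on the slides;
# "parenthesis" is its usual PDT reading (assumption).
RELATION_SUFFIXES = {"Co": "coordination", "Ap": "apposition", "Pa": "parenthesis"}


def split_afun(label):
    """Split e.g. 'Sb_Co' into ('Sb', 'coordination'); plain labels get None."""
    base, separator, suffix = label.rpartition("_")
    if separator and suffix in RELATION_SUFFIXES:
        return base, RELATION_SUFFIXES[suffix]
    return label, None


print(split_afun("Sb_Co"))   # ('Sb', 'coordination')
print(split_afun("Obj"))     # ('Obj', None)
```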
Constraints and limitations

   no nodes missing on the surface can be added
       the analytic function ExD is used instead
   relation between the semi-automatic and the manual procedure
       80% of edges are established correctly by the automatic procedure
Project organization

   team consisting of 5-6 annotators
   handbook for ATS structure annotation
   100,000 sentences annotated at the ATS level
   tectogrammatical annotation follows
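Projectivity/Non-projectivity/Surface Order
   A(B, C)
   [Figure: three drawings of the tree A(B, C) with different surface orders of A, B, and C]
Projectivity/Non-projectivity/Surface Order
   A(B(C))
   [Figure: three drawings of the tree A(B(C)) with different surface orders of A, B, and C]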
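První restituční zákon českého parlamentu se do sněmovních
lavic může vrátit jako bumerang.
The first restitution law of the Czech parliament may return to the
parliamentary benches like a boomerang.

[Figure: analytical tree of the sentence; visible labels: AuxT, Adv]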
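
Projectivity can be checked mechanically once the tree and the surface order are given. The sketch below uses the standard definition (an edge is projective if every word lying between the head and the dependent is dominated by the head) and encodes a sentence as a list of head positions; this encoding and the function name are choices made for illustration, and the concrete orders tested are not necessarily the ones drawn in the original figures.

```python
def is_projective(heads):
    """heads[i] is the surface position of word i's head, or -1 for the root."""

    def dominated_by(node, ancestor):
        # Walk up from `node`; True if `ancestor` is reached before the root.
        while node != -1:
            if node == ancestor:
                return True
            node = heads[node]
        return False

    for dependent, head in enumerate(heads):
        if head == -1:
            continue
        low, high = sorted((dependent, head))
        for between in range(low + 1, high):
            if not dominated_by(between, head):
                return False
    return True


# Tree A(B(C)) in surface order "A B C": every edge joins neighbours.
print(is_projective([-1, 0, 1]))   # True
# Same tree in surface order "B A C": A stands between B and its dependent C.
print(is_projective([1, -1, 0]))   # False
```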
Prague Dependency Treebank 1.0
From the Analytical towards the Tectogrammatical layer

Introduction

   ATS annotation
       nodes: word forms, punctuation, graphical symbols
       edges: surface relations
   TGTS annotation
       nodes: autosemantic words, deletions
       edges: deep layer functions
Annotation process

   Input: Czech sentence
   -> Tokenization
   -> Morphological tagging and lexical disambiguation
   -> Syntactic parsing and analytic function assignment
   -> ATS (PDT 1.0)
   -> Tree structure pruning
   -> Attribute assignments
   -> TGTS
Transition procedure

   deterministic procedure operating on trees
   macro language for Graph Editor (Perl)
   automatic changes & tools for annotators
   Requirements
       new attributes for the tectogrammatical layer
       ATS is recoverable from TGTS
       automated to the highest possible degree
New attributes
   trlemma - lemma of the original node or lemma composed
    of joined nodes
   morphological grammatemes
       gender, number, degree of comparison, tense,
       aspect, iterativeness, verbal modality, deontic
        modality, sentence modality
   position of the node
       functor, topic-focus articulation, syntactic grammateme,
       type of relation (dependency, coordination, apposition),
       phraseme, deletion, quoted word, direct speech,
       coreference, antecedent
Tree Structure Pruning

   U toho, kdo začíná opravdu od nuly, není daňový výnos pro stát
    podstatný.
   For someone who really starts from zero, the tax yield for the
    state is not substantial.

[Figure: pruned tree of the sentence above; visible label: REG]
Verbal Nodes

   … podnikatelé by měli mít daně …
   … entrepreneurs should have (their) taxes …

[Figure: verbal node labelled PRED with verbmod=CDN and deontmod=HRT]
Attribute Assignments

   prepositions stored as fw attribute
   quoted words
       clause in quotes -> DSP
       one pair of quotes in the sentence -> DSPP
       string in quotes -> QUOT
   gender, number, tense, degcmp, aspect
   default values
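
To make the first item concrete, the sketch below removes preposition nodes from an analytical tree and records each preposition in the `fw` attribute of the node it governed. It is a simplified illustration, not the Graph Editor macro used in the project: the `ANode` class and `prune_prepositions` are hypothetical, only `AuxP` nodes are handled, and the tiny tree for "výnos pro stát" is built by hand.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ANode:
    """A node of an analytical tree (hypothetical helper type)."""
    form: str
    afun: str
    children: list = field(default_factory=list)
    fw: Optional[str] = None   # function word absorbed during pruning


def prune_prepositions(node):
    """Remove AuxP nodes, lifting their children and storing the preposition in `fw`."""
    kept = []
    for child in node.children:
        prune_prepositions(child)            # prune the subtree first
        if child.afun == "AuxP":
            for grandchild in child.children:
                grandchild.fw = child.form   # remember the pruned preposition
                kept.append(grandchild)
        else:
            kept.append(child)
    node.children = kept
    return node


# "výnos pro stát" (yield for the state): výnos -> pro (AuxP) -> stát
root = ANode("výnos", "Sb", [ANode("pro", "AuxP", [ANode("stát", "Atr")])])
prune_prepositions(root)
print(root.children[0].form, root.children[0].fw)   # stát pro
```

Because the preposition is kept in `fw`, the pruned node can be restored later, in line with the requirement that the ATS be recoverable from the TGTS.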
Macros for Annotators

   keyboard shortcuts (in Graph editor)
       structure changes
           hide/recover nodes
           merge nodes
       add new nodes
       functor assignments
Manual annotation

   structure checking
   functors
   deletions of obligatory modifications

   feedback for formulating the handbook for
    annotators
Prague Dependency Treebank 1.0
Tectogrammatical Layer

[Figure: tectogrammatical tree of the sentence below, with nodes marked T, F, and C]
   Jirka se      včera    opil do němoty a Honza dneska.
   George himself yesterday drank to silence and Honza today.
Attributes of Coreferential relations

   only in MC
       attribute values
        coref    the lemma of the antecedent
        corsnt   NIL - in the same sentence
                 PREV1 ... PREVi - position of the sentence which
                 includes the antecedent
       grammatical coreference
        antec    the functor of the antecedent
Example

      coref:    Honza
      corsnt:   NIL
      cornum:   1
      antec:    ACT
   Honza    slíbil    přijít včas.
   Honza    promised to come in time.
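
The way these attributes point back to an antecedent can be sketched as a lookup. Several things here are assumptions: sentences are reduced to lists of (lemma, functor) pairs, `PREVi` is read as "the i-th sentence before the current one", `cornum` is ignored because the slides do not explain it, and the functors given to "slíbit" and "přijít" are illustrative. The function names are hypothetical.

```python
import re


def antecedent_sentence_index(current_index, corsnt):
    """Map NIL / PREVi to a sentence index (assumes PREVi = i sentences back)."""
    if corsnt == "NIL":
        return current_index
    match = re.fullmatch(r"PREV(\d+)", corsnt)
    if match is None:
        raise ValueError(f"unexpected corsnt value: {corsnt!r}")
    index = current_index - int(match.group(1))
    if index < 0:
        raise ValueError("corsnt points before the start of the document")
    return index


def find_antecedent(sentences, current_index, coref, corsnt, antec):
    """Return the (lemma, functor) pair matching coref and antec, or None."""
    target = sentences[antecedent_sentence_index(current_index, corsnt)]
    for lemma, functor in target:
        if lemma == coref and functor == antec:
            return (lemma, functor)
    return None


# "Honza slíbil přijít včas." reduced to (lemma, functor) pairs (illustrative).
document = [[("Honza", "ACT"), ("slíbit", "PRED"), ("přijít", "PAT"), ("včas", "TWHEN")]]
print(find_antecedent(document, 0, coref="Honza", corsnt="NIL", antec="ACT"))
# ('Honza', 'ACT')
```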

				