Prague Dependency Treebank 1.0


Functional Generative Description

    theoretical framework based on the findings of European
     structural linguistics, esp. of the classical Prague School
    methodological requirements of a formal description
    levels:
      tectogrammatical (underlying) representations (TRs) with
        dependency-based syntax
      morphemics
      phonemics and phonetics

    TRs (see Sgall, Hajičová and Panevová 1986, formally specified by
     Petkevič, also in a declarative way)
The Language Layers

   Phonemic,
   Morphonological,
   Morphemic,
   Analytical (surface syntax)
   Tectogrammatical (deep syntax).
   Dependency tree

                               My younger brother arrived there yesterday.

Linearized form, one-to-one relation:
((I)Appurt (younger)Rstr brother)Act arrive.Pret.Indic (Dir there) (Temp yesterday)
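
The linearized form can be produced mechanically from the tree. The following is a minimal sketch, not part of FGD or of the PDT tools: the `Node` class and the `linearize` function are hypothetical names, and the placement of the functor (after the closing parenthesis for left dependents, after the opening parenthesis for right dependents) is simply read off the example above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A tree node: a label plus its ordered left and right dependents."""
    label: str                      # lexical meaning with grammatemes
    functor: Optional[str] = None   # relation to the head; None for the root
    left: List["Node"] = field(default_factory=list)
    right: List["Node"] = field(default_factory=list)


def linearize(node: Node) -> str:
    """Write the subtree in bracket notation, functor placed on the head's side."""
    parts = []
    for dep in node.left:
        parts.append(f"({linearize(dep)}){dep.functor}")
    parts.append(node.label)
    for dep in node.right:
        parts.append(f"({dep.functor} {linearize(dep)})")
    return " ".join(parts)


brother = Node("brother", "Act",
               left=[Node("I", "Appurt"), Node("younger", "Rstr")])
root = Node("arrive.Pret.Indic",
            left=[brother],
            right=[Node("there", "Dir"), Node("yesterday", "Temp")])
print(linearize(root))
```

Because every dependent carries exactly one functor and one position relative to its head, the string and the tree determine each other, which is the one-to-one relation mentioned above.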
Dependency Tree

   labels - lexical meanings (abstract symbols) with indices
     functors

           subscripts at parentheses oriented towards head
       grammatemes - values of morphological categories
           Tense, Modality, Number, Definiteness, etc.
   projectivity
   valency
     arguments (inner participants) and
      adjuncts (circumstantials or 'free modifications')
     obligatory and optional with a given head,

     deletable or not
Dependency Tree

   Arguments/participants of verbs
       Actor/Bearer (underlying subject)
       Objective (Patient, underlying direct object)
       Addressee (underlying indirect object)
       Effect ('second' object: to choose so. as sth.)
       Origin (to make sth. out of sth.)
   Adjuncts
       Locative, several Directional and Temporal modifications
       Condition, Means, Manner, etc.
Dependency Tree

Complementations dependent mainly on nouns
   Arguments (inner participants)
       Material (Partitive): two baskets of sth.
       Identity: the river Danube; the notion of operator
   Adjuncts (free modifications)
       Possession (Appurtenance): my table; Jim's brother
       Restrictive: rich man
       Descriptive: the Swedes, who are a Scandinavian nation
Dependency Tree

   syntactic grammatemes
       Loc, Dir - in, on, under, between...
       Regard - with, without
   operational (testable) criteria
       for distinguishing
           arguments from adjuncts,
           from each other
       deletability (dialogue test)
Simplified valency frames

    read     V  Act Addr Obj
    change   V  Act Obj Orig Eff
    give     V  Act Addr Obj
    brother  N  Appurt
    man      N
    glass    N  Material
    full     A  Material

obligatory complementations in blue
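
The frames above can be held in a small lookup table. This is only an illustrative sketch: `FRAMES`, `unfilled_slots` and `outside_frame` are hypothetical names, and the obligatory/optional distinction is left out because the colour coding does not survive in this text version.

```python
# Frames copied from the list above: lemma -> (part of speech, frame slots).
# Obligatoriness is NOT modelled here (an acknowledged simplification).
FRAMES = {
    "read": ("V", ["Act", "Addr", "Obj"]),
    "change": ("V", ["Act", "Obj", "Orig", "Eff"]),
    "give": ("V", ["Act", "Addr", "Obj"]),
    "brother": ("N", ["Appurt"]),
    "man": ("N", []),
    "glass": ("N", ["Material"]),
    "full": ("A", ["Material"]),
}


def unfilled_slots(lemma, present_functors):
    """Frame slots of `lemma` that no dependent fills."""
    _pos, slots = FRAMES[lemma]
    present = set(present_functors)
    return [slot for slot in slots if slot not in present]


def outside_frame(lemma, present_functors):
    """Functors on the dependents that the frame of `lemma` does not list."""
    _pos, slots = FRAMES[lemma]
    return [f for f in present_functors if f not in slots]


print(unfilled_slots("give", ["Act", "Obj"]))   # slots left open
print(outside_frame("read", ["Act", "Temp"]))   # not in the frame
```

A functor reported by `outside_frame` is not an error in itself; it is a complementation that the frame does not mention.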
  Topic-focus articulation

   contextual boundness
       main verb CB/NB (T/F)
       dependents to the left/right
   communicative dynamism
       left-right (mother, sisters,
       partial ordering
   underlying word order
       left-right
       linear ordering

left-to-right order of nodes together with the index T or (prototypically) F
indicates the TFA of the sentence (of the TR)
Topic-focus articulation

                      T                                        F

             there                                 yesterday


   TFA - one of the basic aspects of underlying
    Complex sentence

          My brother, whom you know, arrived there yesterday.

   a subordinated (dependent) clause (i.e. its main verb)
    depends on a word contained in its governing clause
    Complex sentence

    Martin came there late, since he had to accompany his sick mother.

   function words (synsemantic) are viewed as function
    morphemes, syntactically fixed to certain lexical (autosemantic)
    words - prepositions and articles to nouns, conjunctions and
    auxiliaries to verbs
  Complex sentence

Martin arrived late to the session, since he had to accompany his sick mother.

schematically (morphemes):
Martin arrive.ed late to the session since he have.ed to accompany
he.s sick mother.

dot - close connection of morphemes ('semes')
   deleted items restored
     order of items - difference between 'underlying' and surface
      (morphemic) word order
     transductive components - Panevová, Oliva, Borota

   coordination (multidimensional)
     Jim and Mary, who have two children, went to Boston.

     the linearized notation is adequate:

     ((Jim Mary)Conj ((who)Act have (Pat (two)Rstr children)))Act went (Dir Boston)

   structures close to Boolean, i.e. no complex 'innate properties'
    specific for natural language are needed.
Prague Dependency Treebank -
corpus annotation
   an intermediate level - 'analytical'
       dependency trees, not always projective
       nodes for all word tokens, even for punctuation
   tectogrammatical tree: coordinating
    conjunction as the head
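
Since the analytical trees are not always projective, it can be useful to test a tree for projectivity. The sketch below uses the standard definition (every node lying between a head and its dependent in the word order must be dominated by that head). The encoding is an assumption made for illustration: `heads[i]` is the index of the head of word `i`, and `-1` marks the root.

```python
def is_projective(heads):
    """Return True if the dependency tree encoded by `heads` is projective."""

    def dominated_by(node, ancestor):
        # Climb from `node` towards the root, looking for `ancestor`.
        while node != -1:
            if node == ancestor:
                return True
            node = heads[node]
        return False

    for dep, head in enumerate(heads):
        if head == -1:
            continue
        lo, hi = sorted((dep, head))
        # Every word strictly between head and dependent must sit under the head.
        for between in range(lo + 1, hi):
            if not dominated_by(between, head):
                return False
    return True


# My(0) younger(1) brother(2) arrived(3) there(4) yesterday(5)
print(is_projective([2, 2, 3, -1, 3, 3]))
# A toy non-projective tree: word 1 depends on word 3 across the root (word 2).
print(is_projective([2, 3, -1, 2]))
```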
Prague Dependency Treebank 1.0
 Morphological Layer

    PDT version 1.0, 2000
         (1996 - 2000)
        (currently) ver. 2

Penn Treebank, release 3, 1999
         (1989 - 1999)
    PropBank (currently)
The Levels in PDT

   Morphemic
   Analytical
   Tectogrammatical
                       TAG SETs

    Czech - ambiguous inflective language
nový, nového, novému, novém, novým, nová, nové, novou, nových,
novým, novými, … novější, novějšího, novějšímu, novějším, …,
nejnovější, nejnovějšího, nejnovějšímu, nejnovějším, … nejnovějších,
nejnovějším, …

     English - language with poor inflection
work, works, worked, working
               TEXT SOURCES

   PDT                        Penn Treebank
   Lidové noviny              '88, '89 WSJ articles
   Mladá Fronta Dnes          Air Travel Information System transcripts
   Vesmír                     Brown Corpus
   Českomoravský Profit       Switchboard transcripts
   ...taken from Czech
    National Corpus
              Penn Treebank

    Ken Church's stochastic tagger,
    Eric Brill's transformation tagger
        corrections by annotator (GNU Emacs Lisp based package)

              PDT

    Automatic Morphological Analyzer (AMA)
    two independent annotators; Linux, Win tools
        differences resolved by third annotator
        comparison with the current AMA; manual resolution; Win tools

                               word/tag(|tag)*
   SGML coding, csts dtd

<s id="ln95040:020-p1s1">

The/DT envelope/NN arrives/VBZ in/IN the/DT mail/NN ./.
   SGML coding        word/tag


   SGML coding        word/lemma/tag
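
The `word/tag(|tag)*` pattern can be read with a few lines of code. This is a minimal sketch, not a tool from either treebank: `parse_tagged` is a hypothetical name, and the second input string is an invented example of a token that keeps two alternative tags.

```python
def parse_tagged(line):
    """Split a 'word/tag(|tag)*' line into (word, [tags]) pairs."""
    tokens = []
    for item in line.split():
        # Split at the LAST slash so that a token such as './.' still works.
        word, _, tags = item.rpartition("/")
        tokens.append((word, tags.split("|")))
    return tokens


sentence = "The/DT envelope/NN arrives/VBZ in/IN the/DT mail/NN ./."
for word, tags in parse_tagged(sentence):
    print(word, tags)

# Invented token with two candidate tags, only to show the '|' separator.
print(parse_tagged("x/A|B"))
```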
                DATA SIZE

                          # words   # sentences
PDT 1.0                   1 730K       112K
Penn Treebank release 3   4 600K       350K
for tagging only                   #tokens/sentences

training data                         1 470K/95K

development test data                  130K/8K

evaluation test data                   127K/8K

for parsing (preprocessing step)

training data                          475K/29K

development test data                  130K/8K

evaluation test data                   127K/8K
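
As a quick consistency check, the tagging split can be added up and compared with the overall PDT 1.0 size. The snippet only does arithmetic on the figures given above (in thousands); the variable names are illustrative.

```python
# Figures in thousands (K), copied from the tables above.
PDT_TOTAL = {"tokens": 1730, "sentences": 112}
TAGGING_SPLIT = {
    "training": (1470, 95),
    "development test": (130, 8),
    "evaluation test": (127, 8),
}

tokens_sum = sum(tokens for tokens, _ in TAGGING_SPLIT.values())
sentences_sum = sum(sentences for _, sentences in TAGGING_SPLIT.values())
training_share = TAGGING_SPLIT["training"][0] / tokens_sum
avg_sentence_length = PDT_TOTAL["tokens"] / PDT_TOTAL["sentences"]

print(tokens_sum, sentences_sum)        # 1727 111
print(round(training_share, 2))         # 0.85
print(round(avg_sentence_length, 1))    # 15.4
```

The three parts sum to 1 727K tokens and 111K sentences, in line with the rounded totals of 1 730K and 112K, and the training part is about 85% of the tagged data.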
   Automatic Analyser/Generator of
       Dictionary: CZE_a
       Remote Access
   Czech Taggers
       HMM
       Exponential
Prague Dependency Treebank 1.0

           Analytical Layer in PDT

   Input: morphologically tagged sentences

   Graph Editor: “user-friendly” software

   Output: ATS structure
       "surface" syntax tree structure
       nodes labelled by the analytical functions
Analytical Functions
   Pred    - Predicate if it depends on the tree root
   Sb      - Subject
   Obj     - Object
   Adv     - Adverbial
   Atv     - Complement
   AtvV    - Complement, if one governor is present
   Atr     - Attribute
   Pnom    - Nominal predicate's nominal part, depends on the
                      copula "to be"
   AuxV    - Auxiliary verb "to be"
   Coord   - Coordination node
   Apos    - Apposition node
   AuxR    - Reflexive particle, which is neither Obj nor AuxT
   AuxT    - Reflexive particle, lexically bound to the verb
Analytical Functions
   AuxP         - Preposition or a part of compound preposition
   AuxC         - Subordinate conjunction
   AuxO         - (Superfluously) referring particle or emotional particle
   AuxZ         - Rhematizer or another node acting to another
   AuxX         - Comma, but not the main coordinating comma
   AuxG         - Other graphical symbols not classified as AuxK
   AuxY         - Other words, such as particles without a specific
                           syntactic function, parts of lexical idioms, etc.
   AuxS         - Sentence holder (the only added root to the tree)
   AuxK         - Punctuation at the end of the sentence
                           or direct speech or citation clause
   ExD          - Ellipsis handling: function for nodes which pseudo-depend
                           on a node on which they would not
                           depend if there were no ellipsis
   AtrAtr, AtrAdv, AdvAtr, AtrObj, ObjAtr + *_Co, *_Pa, *_Ap
Two stages (chronologically)

   (A) manual "analytic" annotation (ATS)
       training data for (B)(a)

   (B)
       (a) semiautomatic procedure (Collins' parser)
       (b) manual correcting of (B)(a)
Constraints and limitations

   any string has a node of its own
     word-form, punctuation mark, etc.

       AuxV, AuxP, AuxC, AuxX, AuxG…

   reflecting the coordination and apposition relations
     so called third dimension of the graph in the plain tree

       (X_Co, X_Ap, X_Pa, where X is one of analytic functions,
       such as Sb, Obj, Adv, etc.)
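
A label of the form X_Co, X_Ap or X_Pa can be split back into the analytical function and the suffix. The sketch below is illustrative only: `split_afun` is a hypothetical helper, and the glosses attached to the three suffixes are an assumption, since the slides say only that they reflect coordination and apposition relations.

```python
# Suffix glosses are an assumption added for readability, not taken from the slides.
SUFFIXES = {
    "Co": "member of a coordination",
    "Ap": "member of an apposition",
    "Pa": "parenthesis",
}


def split_afun(label):
    """Split e.g. 'Sb_Co' into ('Sb', 'Co'); plain labels get suffix None."""
    base, sep, suffix = label.rpartition("_")
    if sep and suffix in SUFFIXES:
        return base, suffix
    return label, None


for label in ["Sb_Co", "Obj_Ap", "Adv_Pa", "Pred", "AuxP"]:
    print(label, split_afun(label))
```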
Constraints and limitations

   no missing nodes (on the surface) can be added
     analytic function ExD is used

   relations between semi-automatic and manual procedure

       80% of edges are established correctly by the automatic procedure
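
A figure of this kind can be obtained by comparing, node by node, the head chosen automatically with the head in the manually corrected tree. The sketch below shows one plausible way to compute it; how the 80% was actually measured is not stated in the slides, and the function name, the head-index encoding and the toy data are assumptions.

```python
def edge_accuracy(gold_heads, predicted_heads):
    """Fraction of nodes whose predicted head equals the gold head."""
    if len(gold_heads) != len(predicted_heads):
        raise ValueError("both trees must cover the same nodes")
    if not gold_heads:
        raise ValueError("empty tree")
    correct = sum(g == p for g, p in zip(gold_heads, predicted_heads))
    return correct / len(gold_heads)


# Toy example: heads[i] is the index of the head of node i, -1 marks the root.
gold = [1, -1, 1, 2, 3]
predicted = [1, -1, 1, 2, 2]   # the last node is attached to the wrong head
print(edge_accuracy(gold, predicted))
```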
Project organization

   team consisting of 5-6 annotators
   handbook for ATS structure annotation
   100000 sentences on ATS
   tectogrammatical annotation follows
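   A(B, C)
       [figure: drawings of a tree with root A and two dependents, B and C]
   A(B( C ))
       [figure: drawings of a tree in which B depends on A and C depends on B]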
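
The bracket notation A(B, C) and A(B( C )) can be turned into a nested structure with a small recursive parser. This is an illustrative sketch under the assumption that a label is a run of letters or digits and that dependents are separated by commas; `parse_tree` is a hypothetical name.

```python
import re


def parse_tree(text):
    """Parse 'A(B, C)' into (label, [children]) tuples."""
    tokens = re.findall(r"\w+|[(),]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def node():
        nonlocal pos
        label = peek()
        if label is None or label in "(),":
            raise ValueError("expected a node label")
        pos += 1
        children = []
        if peek() == "(":
            pos += 1                      # consume '('
            children.append(node())
            while peek() == ",":
                pos += 1                  # consume ','
                children.append(node())
            if peek() != ")":
                raise ValueError("expected ')'")
            pos += 1                      # consume ')'
        return (label, children)

    tree = node()
    if pos != len(tokens):
        raise ValueError("unexpected trailing input")
    return tree


print(parse_tree("A(B, C)"))
print(parse_tree("A(B( C ))"))
```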
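První restituční zákon českého parlamentu se do sněmovních
lavic může vrátit jako bumerang.
The first restitution law of the Czech parliament may come back to the
benches of the Chamber like a boomerang.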


Prague Dependency Treebank 1.0

      From the Analytical
  to the Tectogrammatical Layer

   ATS annotation
       nodes: word forms, punctuation, graphical symbols
       edges: surface relations
   TGTS annotation
       nodes: autosemantic words, deletions
       edges: deep layer functions
Annotation process
   Input: Czech sentence
   -> Tokenization
   -> Morphological tagging and lexical disambiguation
   -> Syntactic parsing and analytic function assignment
   -> ATS (PDT 1.0)
   -> Tree structure pruning
   -> Attribute assignments
   -> TGTS
Transition procedure

   deterministic procedure operating on trees
   macro language for Graph Editor (Perl)
   automatic changes & tools for annotators

   Requirements
       new attributes for tectogrammatical layer
       ATS is recoverable from TGTS
       automatized to a maximally high degree
New attributes
   trlemma - lemma of the original node or lemma composed
    of joined nodes
   morphological grammatemes
       gender, number, degree of comparison, tense,
       aspect, iterativeness, verbal modality, deontic
        modality, sentence modality
   position of the node
       functor, topic-focus articulation, syntactic grammateme,
       type of relation (dependency, coordination, apposition),
       phraseme, deletion, quoted word, direct speech,
       coreference, antecedent
Tree Structure Pruning

   U toho, kdo začíná opravdu od nuly, není daňový výnos pro stát
   For those, who start actually at zero, the tax outcome for the
    state is not substantial.

 Verbal Nodes



•… podnikatelé by měli mít daně …
•… entrepreneurs should have (their) taxes …
Attribute Assignments

   prepositions stored as fw attribute
   quoted words
       clause in quotes -> DSP
       one pair of quotes in the sentence -> DSPP
       string in quotes -> QUOT
   gender, number, tense, degcmp, aspect
   default values
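
The pruning step and the `fw` attribute can be illustrated on the phrase "výnos pro stát" from the example sentence. This is a hedged sketch and not the actual Graph Editor macros: the `Node` class, the `prune` function and the choice of which punctuation labels to drop are assumptions. The only behaviour taken from the slides is that the preposition disappears as a node and is stored as `fw` on the word it introduces.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    form: str
    afun: str                                   # analytical function
    children: List["Node"] = field(default_factory=list)
    fw: List[str] = field(default_factory=list)  # function words absorbed by this node


# Assumption: punctuation nodes with these labels are simply removed.
DROPPED = {"AuxX", "AuxG", "AuxK"}


def prune(node):
    """Return the list of nodes that replace `node` after pruning."""
    kept = []
    for child in node.children:
        kept.extend(prune(child))
    node.children = kept
    if node.afun == "AuxP":
        # The preposition is recorded on its dependents, which move up a level.
        for child in node.children:
            child.fw.insert(0, node.form)
        return node.children
    if node.afun in DROPPED:
        return node.children
    return [node]


stat = Node("stát", "Atr")
pro = Node("pro", "AuxP", [stat])
vynos = Node("výnos", "Sb", [Node("daňový", "Atr"), pro])
(pruned,) = prune(vynos)
print([child.form for child in pruned.children], stat.fw)
```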
Macros for Annotators

   keyboard shortcuts (in Graph editor)
       structure changes
           hide/recover nodes
           merge nodes
       add new nodes
       functor assignments
Manual annotation

   structure checking
   functors
   deletions of obligatory modifications

   feedback for formulating the handbook for
Prague Dependency Treebank 1.0

  Tectogrammatical Layer

                T       T       F

C   T   T   T       T

   Jirka se      včera    opil do němoty a Honza dneska.
   George himself yesterday drank to silence and Honza today.
Attributes of Coreferential relations

   only in MC
       attribute values
        coref     the lemma of the antecedent
        corsnt    NIL - in the same sentence
                  PREV1 ... PREVi - position of the sentence which
                  includes the antecedent

       grammatical coreference
        antec     the functor of the antecedent

      coref:    Honza
      corsnt:   NIL
      cornum:   1
      antec:    ACT
   Honza    slíbil    přijít včas.
   Honza    promised to come in time.
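
The attribute values shown for this example can be held in a small record. The sketch below is illustrative only: the `Coref` class and `antecedent_sentence` are hypothetical names, the reading of PREVi as "i sentences back" is an assumption, and `cornum` is stored as given because its meaning is not explained in the slides.

```python
from dataclasses import dataclass


@dataclass
class Coref:
    coref: str    # lemma of the antecedent
    corsnt: str   # "NIL" or "PREV1", "PREV2", ...
    cornum: int   # copied from the example; not explained in the slides
    antec: str    # functor of the antecedent


def antecedent_sentence(current_index, corsnt):
    """Index of the sentence holding the antecedent (assumes PREVi = i sentences back)."""
    if corsnt == "NIL":
        return current_index
    if corsnt.startswith("PREV") and corsnt[4:].isdigit():
        return current_index - int(corsnt[4:])
    raise ValueError(f"unknown corsnt value: {corsnt!r}")


example = Coref(coref="Honza", corsnt="NIL", cornum=1, antec="ACT")
print(antecedent_sentence(7, example.corsnt))   # same sentence
print(antecedent_sentence(7, "PREV2"))          # two sentences back
```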
