

                          Sinica Treebank:
    Design Criteria, Annotation Guidelines, and On-line Interface

Chu-Ren Huang 1, Feng-Yi Chen 2, Keh-Jiann Chen 2, Zhao-ming Gao 3, &
                         Kuang-Yu Chen 2

  1Institute of Linguistics, Academia Sinica, Taipei, Taiwan
  2Institute of Information Science, Academia Sinica, Taipei, Taiwan
  3Dept. of Foreign Languages & Literatures, National Taiwan University, Taipei, Taiwan

Abstract

     This paper describes the design criteria and annotation guidelines of
the Sinica Treebank. The three design criteria are: Maximal Resource
Sharing, Minimal Structural Complexity, and Optimal Semantic Information.
One of the important design decisions following these criteria is the
encoding of thematic role information. An on-line interface facilitating
empirical studies of Chinese phrase structure is also described.

1. Introduction

     The Penn Treebank (Marcus et al. 1993) initiated a new paradigm in
corpus-based research. The English Penn Treebank has enabled and motivated
corpus and computational linguistic research based on information
extractable from structurally annotated corpora. Recently, the research has
focused on the following two issues: first, when and how can a structurally
annotated corpus of language X be built? Second, what information should or
can be annotated? A good sample of issues in these two directions can be
found in the papers collected in Abeille (1999).

     The construction of the Sinica Treebank deals with both issues. First,
it is one of the first structurally annotated corpora in Mandarin Chinese.
Second, as a design feature, the Sinica Treebank annotation includes
thematic role information in addition to syntactic categories. In this
paper, we will discuss the design criteria and annotation guidelines of the
Sinica Treebank. We will also give a preliminary research result based on
the Sinica Treebank.

2. Design Criteria

     There are three important design criteria for the Sinica Treebank:
maximal resource sharing, minimal structural complexity, and optimal
semantic information.

     First, to achieve maximal resource sharing, the construction of the
Sinica Treebank is bootstrapped from existing

Chinese computational linguistic resources. The textual material is
extracted from the tagged Sinica Corpus (http://..., Chen et al. 1996). In
other words, the tasks and issues involving tokenization / word
segmentation and category assignment are previously resolved. It is worth
noting that the segmentation and tagging of Sinica Corpus have undergone
vigorous post-editing. Hence the precision of category-assignment is much
higher than in an automatically tagged corpus. In addition, since the same
research team carried out the tagging of Sinica Corpus and annotation of
Sinica Treebank, consistency of the interpretation of texts and tags is
ensured. For structure-assignment, an automatic parser (Chen 1996) is
applied before human post-editing.

     Second, the criterion of minimal structural complexity is motivated to
ensure that the assigned structural information can be shared regardless of
users' theoretical presupposition. It is observed that theory-internal
motivations often require abstract intermediate phrasal levels (such as in
various versions of the X-bar theory). Other theories may also call for an
abstract covert phrasal category (such as INFL in the GB theory for
Chinese). In either case, although the phrasal categories are
well-motivated within the theory, their significance cannot be maintained
in the context of other theoretical frameworks. Since a primary goal of
annotated corpora is to serve as the empirical base of linguistic
investigations, it is desirable to annotate structure divisions that are
the most commonly shared among theories. We came to the conclusion that the
minimal basic level structures are the ones that are shared by all
theories. Thus our annotation is designed to achieve minimal structural
complexity. All abstract phrasal levels are eliminated and only canonical
phrasal categories are marked.

     Third, a critical issue involving Treebank construction as well as
theories of NLP is how much semantic information, if any, should be
incorporated. The original Penn Treebank took a fairly straightforward
syntactic approach. A purely semantic approach, though tempting in terms of
theoretical and practical considerations, has not yet been attempted. A
third approach is to annotate partial semantic information, especially that
pertaining to argument-relations. This is an approach shared by us and the
Prague Dependency Treebank (e.g. Bohmova and Hajicova 1999). In this
approach, the thematic relation between a predicate and an argument is
marked in addition to grammatical category. Note that the
predicate-argument relation is usually grammatically instantiated and
generally considered to be the semantic relation that interacts most
closely with syntactic behavior. This allows optimal semantic

information to be encoded without going far beyond the partially automatic
process of argument identification.

3. Annotation Guidelines I: Category and Hierarchy

     The basic structure of a tree in a treebank is a hierarchy of nodes
with categorical denotation. As in any standard phrase structure grammar,
the lexical (i.e. terminal) symbols are defined in the lexicon (CKIP 1992).
And following the recent lexicon-driven and information-based trends in
linguistic theory, linguistic information will be projected from encoded
lexical information. Please refer to CKIP (1993) for the definition of
lexical categories that we followed. We will give below the inventory of
the restricted set of phrasal categories used and their interpretation.
This set defines the domain of expressed syntactic information (instead of
projected or inherited information). Readers can also consult Chen et al.'s
(2000) general description of how the Sinica Treebank is constructed for a
more complete list of tags as well as explanation in Chinese.

3.1. Defining Phrasal Categories

     There are only 6 non-terminal phrasal categories annotated in the
Sinica Treebank.

(1) Phrasal Categories

1. S: An S is a complete tree headed by a predicate (i.e. S is the start
   symbol).
2. VP: A VP is a phrase headed by a predicate. However, it lacks a subject
   and cannot function alone.
3. NP: An NP is headed by an N.
4. GP: A GP is a phrase headed by a locational noun or locational adjunct.
   Since the thematic role is often determined by the governing predicate
   and not encoded locally, nominal phrases are given a tentative role of
   DUMMY so that they can inherit the correct role from the main predicate.
5. PP: A PP is headed by a preposition. The thematic role of its argument
   is inherited from the mother, hence its argument is marked with a DUMMY.
6. XP: An XP is a conjunctive phrase that is headed by a conjunction. Its
   syntactic head is the conjunction. However, since the actual category
   depends on the interactive inheritance from possibly non-identical
   conjoined elements, X in XP stands for an under-specified category.

3.2. Defining Inheritance Relations

     Following unification-based grammatical theories, categorical
assignments in Sinica Treebank are both lexicon-driven and head-driven. In
principle, all grammatical information is lexically encoded. Structurally,
heads indicate the direction of information inheritance and define possible
predicate-argument relations. However, since the notion 'head' can have
several

different linguistic definitions, we attempt to allow at least for the
discrepancy between syntactic and semantic heads. In Sinica Treebank, three
different kinds of grammatical heads are annotated.

(2) Heads

1. Head: indicates a grammatical head in an endocentric phrasal category.
   Unless a different semantic head is explicitly marked, a Head marks a
   category that serves simultaneously as the syntactic and semantic head
   of the phrase.
2. head: indicates a semantic head which does not simultaneously function
   as a syntactic head. For instance, in constructions involving
   grammaticalized 'particles,' such as in the 'VP-de' construction, the
   grammatical head ('de' in this case) does not carry any semantic
   information. In these cases, the head marks the semantic head ('VP' in
   this case) to indicate the flow of content information.
3. DUMMY: indicates the semantic head(s) whose categorical or thematic
   identity cannot be locally determined. The two most likely scenarios
   involving DUMMY are (a) in a coordination construction, where the head
   category depends on the sum of all conjuncts, and (b) in a non-NP
   argument phrase, such as PP, where the semantic head carries a thematic
   role assigned not by the immediate governing syntactic head ('P' in this
   case), but by a higher predicate. In these cases, DUMMY allows a parser
   to determine the correct categorical / thematic relation later, while
   maintaining identical local structures.

3.3. Beyond Simple Inheritance

     When simple inheritance fails, the following principles derived from
our design criteria serve to predict the structural assignments of a
phrasal category: default inheritance, sisters only, and left first.

3.3.1. Default Inheritance

     This principle deals primarily and most effectively with coordinations
and conjunctions. The theoretical motivation of this account follows Sag et
al.'s (1985) proposal. In essence, the category of a conjunctive
construction must be inherited from its semantic heads. However, since
conjunctions are not restricted to the same categories, languages must have
principled ways to determine the categorical identity when different
semantic heads carry different information.

     First, in the trivial case when all head daughters are of the same
category, the mother will inherit that category.

     Second, when the different head daughters are elaborations of the same
basic category (e.g. both Nd and Ne are elaborations of N), then the basic
category is the default inheritance category for the mother. This can be
illustrated by (3).

(3) [[[da4]VH11 [er2]Caa [yuan2]VH13]]VP
       big       and      round

     Third, when other inheritance mechanisms fail to provide a clear
categorical choice, the default inheritance is activated. There are two
default hierarchies. The first one applies when the head daughters are all
lexical categories (4a), and the second one applies when they are all
phrasal categories (4b). If there is a disparity between lexical and
phrasal categories, then a lexical category will be expanded to a phrasal
category first.

(4) Default Inheritance Hierarchy for Categories
  a) Lexical Categories: V > N > P > Ng
  b) Phrasal Categories: S > VP > NP > PP > GP

When phrasal conjuncts are involved, S is the privileged category since it
is the start symbol of the grammar. VP comes next since its structural
composition is identical to that of S. If the structure involved is not a
predicate (i.e. head of a sentence), then it must be a role. For argument
roles, NP's are more privileged than PP's, and PP's are more privileged
than GP's. (5) is an instance of the application of this default hierarchy.

(5) [[[da4liang4]Neqa [er2]Caa [feng1sheng4]VH11]V]VP
       big-quantity    and      bountiful
     'bountiful and of big quantity'

When lexical conjuncts are involved, the same principle is used. The
priority is given to the predicate head of the sentence. Among possible
argument roles, the nominal category is the default. An illustrative
example can be found in (6).

(6) [[wei4lan2de tian1kong1]NP [yu3]Caa [zhu1qun2 biao1han4]S]S
      aqua-blue DE sky          and      people   ferocious
     'That the sky being aqua blue and that the people being ferocious...'

3.3.2. Sisters Only

     Following most current linguistic theory, argument roles and adjunct
complements must be sisters of a lexical head. However, driven by our
design criterion of minimal structural complexity, no same-level iteration
is allowed. Thus these arguments and adjuncts can be located by the
straightforward definition of sisterhood: that they share the same
mother-daughter relation with the head. The result is a flat structure.

3.3.3. Left First

     This principle is designed to account for possible internal structure
when there are more than two sisters, without having to add on hierarchical
complexity. Hence, the default interpretation of internal structure of
multiple sisters is that the internal association starts from left to
right.

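The three steps of default inheritance, together with the hierarchies in
(4), can be sketched in Python. This is only an illustrative sketch under
stated assumptions, not the procedure actually used to build the treebank:
the function names, the rule that reduces a fine-grained tag to its basic
category, and the lexical-to-phrasal mapping are all introduced here for
illustration. Only the two orderings come from (4).

```python
# Orderings taken from (4); earlier entries are more privileged.
LEXICAL_ORDER = ["V", "N", "P", "Ng"]
PHRASAL_ORDER = ["S", "VP", "NP", "PP", "GP"]

# Assumption: how a lexical category is expanded to a phrasal one
# when lexical and phrasal conjuncts are mixed.
LEXICAL_TO_PHRASAL = {"V": "VP", "N": "NP", "P": "PP", "Ng": "GP"}


def basic_category(tag):
    """Reduce a tag such as 'VH11' or 'Neqa' to its basic category.

    Assumption: the basic category is the first letter of the tag,
    except that Ng is kept apart from N because (4a) ranks it separately.
    """
    if tag in PHRASAL_ORDER:
        return tag
    if tag.startswith("Ng"):
        return "Ng"
    if tag[:1] in ("V", "N", "P"):
        return tag[:1]
    raise ValueError(f"no basic category known for tag {tag!r}")


def mother_category(head_daughters):
    """Pick the category a coordination inherits from its head daughters."""
    if not head_daughters:
        raise ValueError("at least one head daughter is required")
    # Step 1: identical categories are inherited unchanged.
    if len(set(head_daughters)) == 1:
        return head_daughters[0]
    # Step 2: elaborations of one basic category yield that basic category.
    basics = [basic_category(tag) for tag in head_daughters]
    if len(set(basics)) == 1:
        return basics[0]
    # Step 3: default hierarchy; lexical categories are expanded first
    # if any conjunct is phrasal.
    if any(b in PHRASAL_ORDER for b in basics):
        basics = [b if b in PHRASAL_ORDER else LEXICAL_TO_PHRASAL[b]
                  for b in basics]
        order = PHRASAL_ORDER
    else:
        order = LEXICAL_ORDER
    return min(basics, key=order.index)
```

Under these assumptions, the conjuncts of (5), Neqa and VH11, resolve to V,
and the conjuncts of (6), NP and S, resolve to S. The sketch returns the
category of the coordination itself and does not model any further
projection, such as the outer VP in (3) and (5).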
4. Annotation Guidelines II: Structural Annotation of Thematic Information

     A thematic relation contains a compact bundle of syntactic and
semantic information. Although thematic relations are lexically encoded on
a predicate, they can only be instantiated when that information is
projected to phrasal arguments. In other words, the only empirical evidence
for the existence of a thematic relation is a realized argument. However, a
realized argument cannot by itself determine the thematic relation. The
exact nature of the relation must be determined based on the lexical
information from the predicate as well as a check of the compatibility of
that realized argument. Since structural information alone cannot determine
thematic relations, prototypical structural annotation, such as in the
original Penn Treebank, does not include thematic roles, since they contain
non-structural information.

     On the other hand, in theories where lexical heads drive the
structural derivation / construction (e.g. ICG, HPSG, and LFG), thematic
relations are critical. Hence, we decided to encode realized thematic
relations on each phrasal argument. The list of thematic relations encoded
on the head predicate is consulted whenever a phrasal argument is
constructed, and a contextually appropriate relation sanctioned by the
lexical information is encoded. It is worth noting that in our account, we
not only mark the thematic relations of a verbal predicate, but we also
mark the thematic relations governed by a deverbal noun, among others. Also
note that an argument of a preposition is marked as a placeholder DUMMY.
This is because a preposition only governs an argument syntactically, while
its thematic relation is determined by a higher verb.

(7) Thematic Roles: Classification and

[Figure: tree diagram rooted at THEMATIC ROLES, with DUMMY as one of its
branches. Role labels that remain legible include experiencer, location,
benefactor, condition, conjunction, evaluation, negation, exclusion,
inclusion, frequency, imperative, quantifier, quantity, standard, degree,
hypothesis, conclusion, avoidance, and purpose.]
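The role-assignment procedure described in this section (consult the
thematic relations listed on the head predicate, check that the realized
argument is compatible, and mark the argument of a preposition as DUMMY)
can be sketched in Python. This is a minimal sketch under loud assumptions:
the lexicon format, the single entry in it, and the function name are
hypothetical and are not taken from the CKIP lexicon. The verb lihui with
an agent and a goal is borrowed from the sample sentence in the Appendix.

```python
# Hypothetical mini-lexicon: predicate -> {role: compatible categories}.
# The real lexical entries are not given in this paper.
LEXICON = {
    "lihui": {"agent": {"NP"}, "goal": {"NP"}},
}


def assign_role(governor_tag, predicate, arg_category, proposed_role):
    """Return the thematic role to annotate on a phrasal argument.

    governor_tag  -- category of the immediate syntactic governor
    predicate     -- lexical predicate whose entry is consulted
    arg_category  -- phrasal category of the realized argument
    proposed_role -- the contextually appropriate candidate role
    """
    # A preposition governs its argument only syntactically, so the
    # argument receives the placeholder and is resolved by a higher verb.
    if governor_tag == "P":
        return "DUMMY"
    sanctioned = LEXICON.get(predicate, {})
    # The role must be sanctioned by the predicate's lexical information.
    if proposed_role not in sanctioned:
        raise ValueError(f"{predicate!r} does not sanction {proposed_role!r}")
    # The realized argument must be compatible with that role.
    if arg_category not in sanctioned[proposed_role]:
        raise ValueError(f"{arg_category} cannot realize {proposed_role!r}")
    return proposed_role
```

In this sketch the choice of the candidate role is left to the caller,
because the paper states only that the relation must be contextually
appropriate and does not give a procedure for choosing it.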

5. Current Status of the Sinica Treebank and On-line Interface

     Following the above criteria and principles, we have already finished
Sinica Treebank 1.0. It contains annotations of 38,725 Chinese structural
trees containing 239,532 words. It covers subject areas that include
politics, traveling, sports, finance, society, etc. This version of the
Sinica Treebank will be released in the near future, as soon as the
licensing documents are cleared by the legal department of Academia Sinica.
A small subset of it (1,000 sentences) is already available for researchers
to download from the website
http://godel.iis.sinica.edu.tw/CKIP/trees1000.htm. A searchable interface
is also being developed and tested for researchers so that they can
directly access the complete treebank information.

     As an annotated corpus, one of the most important roles that a
treebank can play is that it can serve as a shared source of data for
linguistic, especially syntactic, studies. Following the example of the
successful Sinica Corpus, we have developed an on-line interface for
extraction of grammatical information from the Sinica Treebank. Although
the users that we have in mind are theoretical linguists who do not
necessarily have a computational background, we hope that non-linguists can
also benefit from the ready availability of such grammatical information.
And of course, computational linguists should be able to use this interface
for quick reference before going into a more in-depth study of the
annotated corpus.

     Currently, the beta site allows users to specify a variety of
conditions to search for structurally annotated sentences. Conditions can
be specified in terms of keywords, grammatical tags (lexical or phrasal),
thematic relations, or any boolean combination of the above elements. The
search result can be presented as either annotated structure or simply the
example sentences. Simple statistics, based on either straightforward
frequency count or mutual information, are also available. For
linguistically interesting information, such as the heads of various
phrasal constructions, a user can simply look up the explicitly marked
syntactic Head or semantic head, as well as DUMMY when it serves as a head
placeholder. The website of this interface, as well as the general release
of the Sinica Treebank 1.0, is scheduled to be announced at the second ACL
workshop on Chinese Language Processing in October 2000.

6. Conclusion

     The construction of the Sinica Treebank is only a first step towards
application of structurally annotated corpora. Continuing expansion and
correction will make this database an invaluable resource for linguistic
and computational studies of Chinese.

References

1. ABEILLE, Anne. 1999. Ed. Proceedings of ATALA Workshop - Treebanks.
   Paris, June 18-19, 1999. Univ. de Paris VII.
2. BOHMOVA, Alla and Eva Hajicova. 1999. How Much of the Underlying
   Syntactic Structure Can be Tagged Automatically? In Abeille (Ed). 1999.
   31-40.
3. CHEN, Feng-Yi, Pi-Fang Tsai, Keh-Jiann Chen, and Chu-Ren Huang. 2000.
   Sinica Treebank. [in Chinese] Computational Linguistics and Chinese
   Language Processing. 4.2.87-103.
4. CHEN, Keh-Jiann. 1996. A Model for Robust Chinese Parser. Computational
   Linguistics and Chinese Language Processing. 1.1.183-204.
5. CHEN, Keh-Jiann, Chu-Ren Huang. 1996. Information-based Case Grammar: A
   Unification-based Formalism for Parsing Chinese. In Journal of Chinese
   Linguistics Monograph Series No. 9. Chu-Ren Huang, Keh-Jiann Chen, and
   Benjamin K. T'sou Eds. Readings in Chinese Natural Language Processing.
   23-45. Berkeley: JCL.
6. CHEN, Keh-Jiann, Chu-Ren Huang, Li-Ping Chang, Hui-Li Hsu. 1996.

   Sinica Corpus: Design Methodology for Balanced Corpora. Proceedings of
   the 11th Pacific Asia Conference on Language, Information, and
   Computation (PACLIC 11). Seoul, Korea. 167-176.
7. CHEN, Keh-Jiann and Shing-Huan Liu. 1992. Word Identification for
   Mandarin Chinese Sentences. Proceedings of COLING-92. 101-105.
8. CHEN, Keh-Jiann, Shing-Huan Liu, Li-Ping Chang, Yeh-Hao Chin. 1994. A
   Practical Tagger for Chinese Corpora. Proceedings of ROCLING VII.
   111-126.
9. CHEN, Keh-Jiann, Chi-Ching Luo, Zhao-Ming Gao, Ming-Chung Chang, Feng-Yi
   Chen, and Chao-Ran Chen. 1999. The CKIP Chinese Treebank: Guidelines for
   Annotation. In Abeille (Ed). 1999. 85-96.
10. CKIP (Chinese Knowledge Information Processing). 1993. The Categorical
   Analysis of Chinese. CKIP Technical Report 93-05. Nankang: Academia
   Sinica.
11. HUANG, Chu-Ren, Keh-Jiann Chen, Feng-Yi Chen, and Li-Li Chang. 1997.
   Segmentation Standard for Chinese Natural Language Processing.
   Computational Linguistics and Chinese Language Processing. 2.2.47-62.
12. LIN, Fu-Wen. 1992. Some Reflections on the Thematic System of
   Information-based Case Grammar (ICG). CKIP Technical Report 92-01.
   Nankang: Academia Sinica.
13. MARCUS, Mitch P., Beatrice Santorini, and M. A. Marcinkiewicz. 1993.
   Building a Large Annotated Corpus of English: The Penn Treebank.
   Computational Linguistics. 19.2.313-330.
14. SAG, Ivan, Gerald Gazdar, Thomas Wasow, and Steven Weisler. 1985.
   Coordination and How to Distinguish Categories. Natural Language and
   Linguistic Theory. 117-171.

Appendix

1. Lexical Categories

(1) NON-PREDICATIVE ADJECTIVE: A
(2) CONJUNCTION: C
(3) ADVERB: D
(4) INTERJECTION: I
(5) NOUN: N
(6) DETERMINATIVES: Ne
(7) MEASURE WORD / CLASSIFIER: Nf
(8) POSTPOSITION WORD: Ng
(9) PRONOUN: Nh
(10) PREPOSITION: P
(11) PARTICLES: T
(12) VERB: V

2. Sample Sentence and Tree

nage wanfi de nyuren baifa zhihou bian buzai lihui
that hair-style DE woman white-hair after then never pay-attention

ting qian tingting yuli de qingcao
courtyard front slenderly standing-erect DE green-grass

'After her hair had turned white, that coiffured woman never paid any more
attention to the nicely standing green grass in the front courtyard.'

S(agent:NP(quantifier:DM:nage|property:VP·de(head:VP(Head:VA4:wanfi)|
Head:DE:de)|Head:Nab:nyuren)|time:GP(…)|…|time:Dd:…|Head:VC2:lihui|
goal:NP(property:VP·de(head:VP(location:NP(property:Ncb:ting|
Head:Ncda:qian)|Head:VH11:tingting yuli)|Head:DE:de)|Head:Nab:qingcao))
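The bracketed notation of the sample tree (role:category(daughters) for
phrases, role:category:word for terminals, with | separating sisters) lends
itself to a small recursive reader. The following Python sketch is an
illustration only: the function names and the dictionary layout are
assumptions made here, and the input string is a simplified, hypothetical
reduction of the sample tree rather than actual treebank data. It also
shows how explicitly marked Heads could be looked up once a tree is read.

```python
def parse_tree(text):
    """Parse a tree such as 'S(agent:NP(Head:Nab:x)|Head:VC2:y)'."""
    node, pos = _parse_node(text, 0)
    if pos != len(text):
        raise ValueError(f"unexpected material at position {pos}")
    return node


def _parse_node(s, i):
    # Read the colon-separated labels up to the next structural character.
    j = i
    while j < len(s) and s[j] not in "(|)":
        j += 1
    labels = s[i:j].split(":")
    if j < len(s) and s[j] == "(":
        # Phrasal node: 'role:CAT(' or, for the root, just 'CAT('.
        role = labels[0] if len(labels) == 2 else None
        children = []
        j += 1
        while True:
            child, j = _parse_node(s, j)
            children.append(child)
            if j >= len(s):
                raise ValueError("unbalanced parentheses")
            if s[j] == "|":      # another sister follows
                j += 1
            elif s[j] == ")":    # end of this phrase
                j += 1
                break
            else:
                raise ValueError(f"unexpected {s[j]!r} at position {j}")
        return {"role": role, "cat": labels[-1], "children": children}, j
    # Terminal node: 'role:CAT:word'.
    if len(labels) != 3:
        raise ValueError(f"malformed terminal {s[i:j]!r}")
    role, cat, word = labels
    return {"role": role, "cat": cat, "word": word}, j


def find_by_role(node, role):
    """Collect every node, at any depth, annotated with the given role."""
    found = [node] if node["role"] == role else []
    for child in node.get("children", []):
        found.extend(find_by_role(child, role))
    return found


# Simplified, hypothetical reduction of the sample tree above.
SAMPLE = "S(agent:NP(Head:Nab:nyuren)|Head:VC2:lihui|goal:NP(Head:Nab:qingcao))"
tree = parse_tree(SAMPLE)
head_words = [n["word"] for n in find_by_role(tree, "Head")]
```

The flat list of sisters under each node mirrors the flat structure that
results from the sisters-only principle.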

