

           Using GATE to Automatically Annotate Learner Texts

           Peter Wood, University of Waterloo



iicall 2       Using GATE to Automatically Annotate Learner Texts   December 2007
                           Structure


    ● learner models
    ● discourse analytic measures
    ● annotations
    ● manual annotating using a Perl application
    ● automatically annotating using GATE
    ● new processing resources for GATE
    ● advantages of using GATE



                 Learner Models

    Learner models are used to represent students' competence in language.

    Areas that can/should be modelled:
    ● vocabulary (active and passive)
    ● morphology
    ● syntax
    ● semantics
    ● pragmatics
    ● ...

               Discourse Analytic Measures

           ● used mainly in the study of spoken learner language
           ● aim to express proficiency levels as quantitative measures
           ● try to measure:
                     ● fluency
                     ● accuracy
                     ● complexity



           The “Fehlerquotient” (Error Quotient)




\[
\frac{\text{Mistakes} \times 100}{\sum \text{Words}}
\]
            Accuracy: The Error Quotient




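\[
EQ = \frac{\text{Mistakes}}{\sum \text{Words}}
\]

\[
\lim_{\text{Mistakes} \to 0} EQ = 0 \quad \text{(good accuracy rating)}
\]

\[
\lim_{\text{Mistakes} \to \infty} EQ = \infty \quad \text{(bad accuracy rating)}
\]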


           Accuracy: Erroneous Sentences /
                   Total Sentences
                            limSQ =0good accuracyrating
∑ *sent                    ∑ *sent 0



∑ sent                       limSQ =1badaccuracyrating
                         ∑ *sent ∑ sent

                      Accuracy:
           Erroneous Clauses / Total Clauses



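\[
CQ = \frac{\sum \text{*clause}}{\sum \text{clause}}
\]

\[
\lim_{\sum \text{*clause} \to 0} CQ = 0 \quad \text{(good accuracy rating)}
\]

\[
\lim_{\sum \text{*clause} \to \sum \text{clause}} CQ = 1 \quad \text{(bad accuracy rating)}
\]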

           Measuring Complexity




              Complexity:
           Token / Type Ratio



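\[
TT = \frac{\sum \text{token}}{\sum \text{type}}
\]

\[
\lim_{\sum \text{token} \to \sum \text{type}} TT = 1 \quad \text{(good complexity rating)}
\]

\[
\lim_{\sum \text{token} \to n} TT = \infty, \quad \text{where } \sum \text{type} \ll n \to \infty \quad \text{(bad complexity rating)}
\]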
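
A minimal sketch of the token / type ratio, assuming the text has already been lemmatized so that each distinct lemma counts as one type. The function name and the sample lemmas are illustrative only.

```python
def token_type_ratio(lemmas: list[str]) -> float:
    """Number of tokens divided by number of distinct types (lemmas)."""
    if not lemmas:
        raise ValueError("at least one token is required")
    tokens = len(lemmas)          # every occurrence counts
    types = len(set(lemmas))      # each distinct lemma counts once
    return tokens / types


# Hypothetical lemmatized text: "der" occurs twice.
lemmas = ["der", "Hund", "sehen", "der", "Katze"]
print(token_type_ratio(lemmas))
```

A value of exactly 1 means that no lemma is repeated; the more often the learner reuses the same words, the larger the ratio becomes.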


                 Complexity:
           Clauses / Sentences Ratio



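\[
CS = \frac{\sum \text{clause}}{\sum \text{sent}}
\]

\[
\lim_{\sum \text{clause} \to \infty} CS = \infty \quad \text{(good complexity rating)}
\]

\[
\lim_{\sum \text{clause} \to \sum \text{sent}} CS = 1 \quad \text{(bad complexity rating)}
\]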
                           Annotations

           To calculate the measures, texts must be annotated for:

           ● time needed for production (fluency)
           ● sentence boundaries (complexity)
           ● clause boundaries (complexity)
           ● lemmas (complexity)
           ● various kinds of errors (accuracy)
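
One way to picture such annotations is as labelled character spans stored alongside the text. The sketch below is an assumption about a convenient representation, not the format used in the talk; the class name, field names, and example sentence are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A labelled character span plus optional features."""
    kind: str                 # e.g. "sentence", "clause", "lemma", "error"
    start: int                # offset of the first character
    end: int                  # offset just past the last character
    features: dict = field(default_factory=dict)


def covered_text(text: str, annotation: Annotation) -> str:
    """Return the stretch of text an annotation refers to."""
    return text[annotation.start:annotation.end]


def count_by_kind(annotations: list[Annotation]) -> Counter:
    """Count annotations per kind, e.g. to feed the measures above."""
    return Counter(a.kind for a in annotations)


text = "Ich habe ein Hund. Er ist groß."
annotations = [
    Annotation("sentence", 0, 18),
    Annotation("sentence", 19, 31),
    Annotation("error", 9, 12, {"category": "morphological"}),
]
print(covered_text(text, annotations[2]), count_by_kind(annotations))
```

Keeping the annotations separate from the text leaves the learner's original wording untouched, and the counts per kind supply the numerators and denominators of the quotients.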



           Manual Annotating Using a Perl
                    Application




                        Preprocessing

           ● convert MS Word to plain text




                          Preprocessing
           ● tokenization
           ● sentence splitting
           ● convert to XML
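
These three steps can be sketched with the Python standard library. This is not the Perl application from the talk: the regular expressions are deliberately naive, and the element names `text`, `s`, and `w` are assumptions chosen for the example.

```python
import re
import xml.etree.ElementTree as ET


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence: str) -> list[str]:
    """Words become tokens; each punctuation mark is its own token."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def to_xml(text: str) -> str:
    """Wrap every sentence in <s> and every token in <w>."""
    root = ET.Element("text")
    for number, sentence in enumerate(split_sentences(text), start=1):
        s_element = ET.SubElement(root, "s", id=str(number))
        for token in tokenize(sentence):
            ET.SubElement(s_element, "w").text = token
    return ET.tostring(root, encoding="unicode")


print(to_xml("Ich bin müde. Du auch?"))
```

A splitter this simple will stumble over abbreviations and ordinal numbers, which is one reason to prefer tested preprocessing components.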




                  Automatic Annotation

           ● type identification using MORPHY word form list
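
The idea of identifying types through a word form list can be sketched as a simple look-up from inflected form to lemma. The tiny dictionary below is a hypothetical stand-in and does not reflect the actual format or content of the MORPHY list.

```python
# Hypothetical excerpt of a word form list: inflected form -> lemma.
WORD_FORMS = {
    "geht": "gehen",
    "ging": "gehen",
    "haus": "Haus",
    "häuser": "Haus",
}


def identify_types(tokens: list[str]) -> tuple[set[str], list[str]]:
    """Return the set of lemmas found and the tokens with no entry."""
    types: set[str] = set()
    unknown: list[str] = []
    for token in tokens:
        lemma = WORD_FORMS.get(token.lower())
        if lemma is None:
            unknown.append(token)   # left for manual inspection
        else:
            types.add(lemma)
    return types, unknown


types, unknown = identify_types(["Ging", "geht", "Häuser", "Hauz"])
print(types, unknown)
```

Mapping several word forms to one lemma is what lets "ging" and "geht" count as a single type in the token / type ratio.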




                    Manual Annotation
           ● clause boundaries
           ● lexical errors
           ● morphological errors
           ● syntactic errors
           ● other errors




                    Evaluation




           Automatically Annotating Using
                       GATE




           Problems with the Perl Application
           ● requires a web server
           ● limited support for different document formats
           ● limited annotation options
           ● many functions run very slowly
           ● too much “roll your own”




                                     GATE
           ● preprocessing
           ● annotating
           ● evaluation




                          Preprocessing
           ● language resources can be created from a variety of different file formats (xml, html, txt, ...)
           ● offers: tokenizer, sentence splitter
           ● provides a wrapper for a German POS-tagger by the University of Stuttgart


                  Automatic Annotation
           ● POS-tagging
           ● ...




                   Automatic Annotation ...
               Adding New Processing Resources
                          to GATE
           ● port the German lemmatizer from Perl
           ● shallow parser using the POS-tags
           ● gazetteer lists and JAPE grammars
           ● dictionary look-up

           would enable us to ...
                             lemmatizer:
           ● number of types
           ● number of tokens
           ● type / token ratio




                                “parser”:

           ● identify clauses and phrases
           ● calculate clause sentence ratio
           ● calculate average complexity of
                      • sentences
                      • clauses
                      • phrases




                 gazetteer lists and JAPE
                       grammars:
           ● store vocabulary covered at different language levels in different lists
           ● analyse what vocabulary the student is using at a certain stage
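
The list-based vocabulary analysis can be pictured as follows. This is a plain Python sketch rather than a gazetteer or a JAPE grammar; the level names and word lists are hypothetical, and the input is assumed to be lemmatized.

```python
from collections import Counter

# Hypothetical vocabulary lists, one per language level.
LEVEL_LISTS = {
    "level_1": {"haus", "gehen", "gut"},
    "level_2": {"wohnung", "besuchen"},
}


def vocabulary_profile(lemmas: list[str]) -> Counter:
    """Count how many of the learner's lemmas fall into each level list."""
    profile: Counter = Counter()
    for lemma in lemmas:
        for level, words in LEVEL_LISTS.items():
            if lemma.lower() in words:
                profile[level] += 1
                break
        else:
            profile["unlisted"] += 1   # not covered by any list
    return profile


profile = vocabulary_profile(["Haus", "gehen", "Wohnung", "Fahrrad"])
print(profile)
```

Comparing such profiles for texts written at different points in a course would show when vocabulary from a higher level starts to appear.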


                   dictionary look-ups:
           ● mark words not found as potentially erroneous to assist in manual annotation process
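
A minimal sketch of this look-up, assuming the dictionary is available as a set of lower-cased word forms. The function name and the sample dictionary are hypothetical.

```python
def flag_unknown_words(tokens: list[str], dictionary: set[str]) -> list[tuple[int, str]]:
    """Return (position, token) for every word missing from the dictionary."""
    flagged = []
    for position, token in enumerate(tokens):
        # Punctuation and numbers are skipped; only alphabetic tokens are checked.
        if token.isalpha() and token.lower() not in dictionary:
            flagged.append((position, token))
    return flagged


dictionary = {"ich", "habe", "eine", "katze"}
tokens = ["Ich", "habe", "eine", "Kazte", "."]
print(flag_unknown_words(tokens, dictionary))
```

A flagged word is only a candidate: proper names and rare but correct words will also be missing from the dictionary, so the final decision stays with the human annotator.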




                Advantages of Using GATE
           ● easy import of all sorts of documents
           ● easy creation of corpora
           ● sophisticated storage
           ● a number of preprocessing tools are available
           ● some support for German
           ● easy design and sharing of new resources
           ...
           Advantages of Using GATE
           ● manual annotation is straightforward
           ● cross-platform compatibility
           ● open-source
           ● aids with evaluation




                              Evaluation
           ● GATE can compare human annotations to automatic annotations
           ● all annotations are stored as XML and are accessible via Java classes
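
To illustrate what such a comparison yields, the sketch below scores automatic annotations against human ones with precision, recall, and F1. It is not GATE's own implementation: annotations are represented here as hypothetical `(kind, start, end)` tuples, and only exact matches count as correct.

```python
Span = tuple[str, int, int]   # (kind, start offset, end offset)


def compare_annotations(human: set[Span], automatic: set[Span]) -> dict[str, float]:
    """Score automatic annotations against the human gold standard."""
    correct = len(human & automatic)          # identical kind and offsets
    precision = correct / len(automatic) if automatic else 0.0
    recall = correct / len(human) if human else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


human = {("sentence", 0, 18), ("sentence", 19, 31), ("error", 9, 12)}
automatic = {("sentence", 0, 18), ("sentence", 19, 31), ("error", 13, 17)}
result = compare_annotations(human, automatic)
print(result)
```

Exact matching is the strictest choice; a more lenient comparison could also give credit to annotations whose spans merely overlap.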


