									                             AECOTA
                          Introduction
Automatic PoS-Tagging Errors Detection
                            Conclusion




Automatic Error Correction of Treebank Annotation

                          Ekaterina Volkova

                        Tuebingen University, ISCL


                         November 29, 2007




AECOTA

Introduction
    Papers
    Treebanks and Errors they hide
    Approaches to the problem

Automatic PoS-Tagging Errors Detection
   T. Brants and W. Skut
   H. v. Halteren
   M. Dickinson and W. D. Meurers
   P. Kveton and K. Oliva

Conclusion


Works used for this presentation

   The following papers were used for this presentation:
       Automation of Treebank Annotation, Thorsten Brants and Wojciech Skut
       The Detection of Inconsistency in Manually Tagged Text, Hans van Halteren
       Detecting Inconsistencies in Treebanks, Markus Dickinson and W. Detmar Meurers
       Detecting Errors in Part-of-Speech Annotation, Markus Dickinson and W. Detmar Meurers
       (Semi-)Automatic Detection of Errors in PoS-Tagged Corpora, Pavel Kveton and Karel Oliva


Treebanks



   Treebanks:
       are (large) corpora that result from a (semi-)manual mark-up process;
       can contain annotation errors from:
            automatic preprocessing (flaws in the tagging algorithm)
            human inconsistency (post-editing or annotation)




Errors




   What are errors in tagged corpora?
   The definition depends on the intended usage of the corpus.




Errors, cont




   Suggestions?




Errors, cont




   Two main types of errors:
       errors in the assignment of PoS tags (PE)
       ungrammatical constructions (UC)




Errors, cont




    Usage                           Errors allowed
    Training statistical taggers    None!
    Testing NLP systems             UC
    Linguistic research             UC and PE (but they must be properly marked)




Errors, cont



   Errors in PoS corpora result in:
       distorted probability distributions for correct text,
       so that the possible becomes impossible;
       positive evidence for incorrect structures,
       so that the impossible becomes possible.
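
   A toy illustration of the second effect, assuming simple relative-frequency
   estimates of P(tag | word). The corpus, the tag names and the helper
   tag_distribution are invented for this sketch and do not come from any of
   the papers discussed here:

```python
from collections import Counter
from typing import Dict, List, Tuple


def tag_distribution(corpus: List[Tuple[str, str]], word: str) -> Dict[str, float]:
    """Relative-frequency estimate of P(tag | word) from (word, tag) pairs."""
    counts = Counter(tag for w, tag in corpus if w == word)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()} if total else {}


# Hypothetical corpus: "the" is always a determiner (DT).
clean = [("the", "DT")] * 99
# The same corpus plus one annotation error: "the" tagged as a noun (NN).
noisy = clean + [("the", "NN")]

print(tag_distribution(clean, "the"))
print(tag_distribution(noisy, "the"))
```

   In the clean corpus the tag NN has no probability mass for "the"; after a
   single mistagged token it does, so a tagger trained on the noisy corpus has
   positive evidence for a structure that should be impossible.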




Approaches to the problem:



   Various solutions were suggested by:
       Thorsten Brants and Wojciech Skut;
       Hans van Halteren;
       Markus Dickinson and W. Detmar Meurers;
       Pavel Kveton and Karel Oliva;




Questions?




General approach



      Automatic annotation + human supervision and correction
      "Crossing branches"
      Two annotators for training corpus
      Markov model
      Bootstrapping




Semi-automatic strategy




   A semi-automatic annotation strategy is:
       superior to purely manual annotation in:
            accuracy
            efficiency




Assigning grammatical functions:
     standard part-of-speech tagging techniques:
         lexical and contextual probability measures P_Q,
         depending on the category Q of the mother node
    e.g.: each category (S, VP, NP, PP, ...) defines a separate
    Markov model
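
    A minimal sketch of this idea, not the implementation of Brants and Skut.
    It assumes a first-order model (chosen for brevity), add-one smoothing, and
    invented names (FunctionTagger, START, the toy training data). Each mother
    category Q gets its own transition ("contextual") and emission ("lexical")
    counts, and the function labels of the children are found with Viterbi
    search:

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# One training example: (mother category Q, [(child category, function label), ...])
Example = Tuple[str, List[Tuple[str, str]]]
START = "<s>"  # hypothetical start-of-sequence marker


def _prob(table: Dict[str, int], key: str, size: int) -> float:
    """Add-one smoothed relative frequency over `size` possible outcomes."""
    return (table.get(key, 0) + 1) / (sum(table.values()) + size)


class FunctionTagger:
    """One first-order Markov model per mother category (illustrative only)."""

    def __init__(self) -> None:
        # trans[Q][previous function][function] -> count (contextual part of P_Q)
        self.trans = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
        # emit[Q][function][child category] -> count (lexical part of P_Q)
        self.emit = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
        self.funcs: Dict[str, List[str]] = defaultdict(list)
        self.cats: Dict[str, Set[str]] = defaultdict(set)

    def train(self, examples: List[Example]) -> None:
        for mother, children in examples:
            prev = START
            for cat, func in children:
                self.trans[mother][prev][func] += 1
                self.emit[mother][func][cat] += 1
                if func not in self.funcs[mother]:
                    self.funcs[mother].append(func)
                self.cats[mother].add(cat)
                prev = func

    def tag(self, mother: str, cats: List[str]) -> List[str]:
        """Most probable function labels for the children of a `mother` node."""
        funcs = self.funcs.get(mother, [])
        if not funcs or not cats:
            return []
        n_funcs = len(funcs)
        n_cats = len(self.cats[mother]) + 1  # one extra slot for unseen categories
        # best[f] = (probability of the best path ending in f, that path)
        best = {START: (1.0, [])}
        for cat in cats:
            new_best = {}
            for f in funcs:
                p_emit = _prob(self.emit[mother][f], cat, n_cats)
                cands = [
                    (p * _prob(self.trans[mother][prev], f, n_funcs) * p_emit, path + [f])
                    for prev, (p, path) in best.items()
                ]
                new_best[f] = max(cands, key=lambda c: c[0])
            best = new_best
        return max(best.values(), key=lambda c: c[0])[1]


# Toy usage with made-up training data (labels in the style of German treebanks).
tagger = FunctionTagger()
tagger.train([
    ("S", [("NP", "SB"), ("VVFIN", "HD"), ("NP", "OA")]),
    ("S", [("NP", "SB"), ("VVFIN", "HD"), ("NP", "OA")]),
    ("NP", [("ART", "NK"), ("NN", "NK")]),
])
print(tagger.tag("S", ["NP", "VVFIN", "NP"]))
```

    Because the counts are kept per mother category, the same child category
    (here NP) can receive different function labels under S than under another
    phrase type.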




Extending tagger




   Extension of the grammatical function tagger for:
       recognition of phrasal categories,
       recognition of syntactic structures.




Additional reliability check



   Calculate:
       the best assignment A1 and its probability PA1
       the second-best alternative A2 and its probability PA2
       if PA2 comes very close to PA1, mark the choice as unreliable
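
A minimal sketch of such a check is given below. The slides do not say how "very close" is measured, so the ratio test and the min_ratio threshold are assumptions of this example, as is the function name reliability_check.

```python
def reliability_check(scored, min_ratio=2.0):
    """Return (A1, A2, reliable) for a list of (assignment, probability) pairs.

    min_ratio is a hypothetical threshold: the choice counts as reliable
    only if the best probability is at least min_ratio times the second best.
    """
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two candidate assignments")
    (a1, p1), (a2, p2) = ranked[0], ranked[1]
    reliable = p2 == 0 or p1 / p2 >= min_ratio
    return a1, a2, reliable
```

Assignments flagged as unreliable would then be the ones handed to a human annotator for inspection.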




Flow of program




   For each experiment, the corpus was divided into two disjoint
   parts: 90% training data and 10% test data. This procedure was
   repeated ten times, and the results were averaged.
   Resulting accuracy: 97% overall and 99% for NPs and PPs
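
The evaluation procedure can be sketched as below. It assumes that each of the ten repetitions holds out a different tenth of the corpus, which the slide does not state; train_fn and accuracy_fn are hypothetical callables standing in for the real tagger and its scorer.

```python
import random

def ten_fold_accuracy(sentences, train_fn, accuracy_fn, folds=10, seed=0):
    """Average accuracy over `folds` disjoint 90%/10% train/test splits."""
    data = list(sentences)
    random.Random(seed).shuffle(data)  # fixed seed keeps the split reproducible
    scores = []
    for k in range(folds):
        test = data[k::folds]  # every folds-th item, starting at offset k
        train = [s for i, s in enumerate(data) if i % folds != k]
        model = train_fn(train)
        scores.append(accuracy_fn(model, test))
    return sum(scores) / len(scores)
```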




Questions?




Automatic PoS-Tagging Errors Detection: H. v. Halteren



Consistency




      When we say that somebody is consistent, we mean that
      if the same situation is encountered more than once, that
      person will take the same action each time.




Standards



      wordclass tagging requires strict standards BUT
      wordclass tagging corresponds to linguistic descriptive
      tradition THUS
      no tagging manual can ever be complete AND
      the standards are incomplete and unstable




Detecting inconsistency in a manually tagged corpus


   "Ideal Taggers"
        if we had
            an ideal tagger AND
            an ideally consistent training set,
       we should be able to
            replicate the tagging in the training set completely
       if this doesn’t happen
            there were inconsistencies in the training set AND/OR
            insufficiency of the training set




Result




         Generating a wordclass tagger from the entire tagged
         corpus and comparing its output with the original corpus
         turns out to be an efficient means of identifying
         inconsistency in the corpus tagging.
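
The train-on-everything-and-compare idea can be illustrated with a deliberately trivial tagger. The most-frequent-tag (unigram) tagger below is a stand-in chosen for brevity, not the tagger used in the paper; the function names are hypothetical.

```python
from collections import Counter, defaultdict

def train_unigram_tagger(corpus):
    """Map every word to the tag it carries most often in the corpus."""
    counts = defaultdict(Counter)
    for word, tag in corpus:
        counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def find_disagreements(corpus):
    """Retag the training corpus itself and list positions where the
    tagger output differs from the original annotation."""
    tagger = train_unigram_tagger(corpus)
    return [
        (i, word, tag, tagger[word])  # (position, word, corpus tag, tagger tag)
        for i, (word, tag) in enumerate(corpus)
        if tagger[word] != tag
    ]
```

Each reported position is only a candidate: it may be an annotation inconsistency, or a legitimate reading that this simple tagger cannot model.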




Questions?




Automatic PoS-Tagging Errors Detection: M. Dickinson and W. D. Meurers



Human post-editing




      can reduce the number of PoS errors
      e.g. Negra: from 3.3% to 1.2%
      annotation errors remain despite human post-editing




Detecting Errors in Part-of-Speech Annotation



   Three main methods for error correction:
       n-grams
       closed-class analysis
       finite-state tagging guide patterns




Error Correction




   Correcting errors involves:
       detecting which positions of the corpus are tagged wrongly
       finding a correct tag for those positions




Tagging




      For each word in a corpus → lexically determined set of tags
      The tagging process reduces this set to the correct tag for a
      specific corpus occurrence




Variation


   Variation — a particular word which:
       occurs more than once in a corpus AND
       can thus be assigned different tags in a corpus
       variation can appear due to
              ambiguity
              error
       we have to decide in each case which of the two it is
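
Collecting the variations themselves is straightforward. As a minimal sketch, assuming the corpus is available as a flat list of (word, tag) pairs and using a hypothetical helper name:

```python
from collections import defaultdict

def find_variations(corpus):
    """Return every word that carries more than one tag in the corpus,
    together with the sorted list of tags it was given."""
    tags_by_word = defaultdict(set)
    for word, tag in corpus:
        tags_by_word[word].add(tag)
    return {word: sorted(tags) for word, tags in tags_by_word.items() if len(tags) > 1}
```

The result says nothing yet about whether a variation is ambiguity or error; that decision needs additional evidence.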




Suggestions?




Error or ambiguity?

   A solution:
       classify the context
       the more similar the contexts of a variation → the more
       probable it is that the variation is an error
       ALSO
            tag assignment is dependent on the context
            natural languages favor local dependencies over non-local ones
       ALSO
            to keep n-grams from growing nonsensically →
            use structure boundaries
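
The context idea can be sketched with a fixed window of one word on each side. This is a simplification assumed for the example (the method described here lets n vary); sentence boundaries stand in for the structure boundaries, and the function name is hypothetical.

```python
from collections import defaultdict

def variation_trigrams(sentences):
    """Find words that receive different tags although the immediately
    preceding and following words are identical.

    sentences: list of sentences, each a list of (word, tag) pairs.
    Contexts never cross a sentence boundary.
    """
    seen = defaultdict(set)
    for sentence in sentences:
        for i in range(1, len(sentence) - 1):
            context = (sentence[i - 1][0], sentence[i][0], sentence[i + 1][0])
            seen[context].add(sentence[i][1])  # tag of the middle word
    return {ctx: sorted(tags) for ctx, tags in seen.items() if len(tags) > 1}
```

A word tagged differently inside the same three-word string is a stronger error candidate than a word that merely varies somewhere in the corpus.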



Closed Class Analysis


       Lexical categories:
            open classes (large, productive categories)
                    verbs
                    nouns
                    adjectives
            closed classes (can be enumerated)




                                                                                    37 / 62
                                       AECOTA      T. Brants and W. Skut
                                    Introduction   H. v. Halteren
          Automatic PoS-Tagging Errors Detection   M. Dickinson and W. D. Meurers
                                      Conclusion   P. Kveton and K. Oliva



Closed Class Analysis


       Lexical categories:
            open classes (large, productive categories)
                    verbs
                    nouns
                    adjectives
            closed classes (can be enumerated)
                    determiners




                                                                                    37 / 62
                                       AECOTA      T. Brants and W. Skut
                                    Introduction   H. v. Halteren
          Automatic PoS-Tagging Errors Detection   M. Dickinson and W. D. Meurers
                                      Conclusion   P. Kveton and K. Oliva



Closed Class Analysis


       Lexical categories:
            open classes (large, productive categories)
                    verbs
                    nouns
                    adjectives
            closed classes (can be enumerated)
                    determiners
                    prepositions




                                                                                    37 / 62
                                       AECOTA      T. Brants and W. Skut
                                    Introduction   H. v. Halteren
          Automatic PoS-Tagging Errors Detection   M. Dickinson and W. D. Meurers
                                      Conclusion   P. Kveton and K. Oliva



Closed Class Analysis


       Lexical categories:
            open classes (large, productive categories)
                    verbs
                    nouns
                    adjectives
            closed classes (can be enumerated)
                    determiners
                    prepositions
                    modal verbs




                                                                                    37 / 62
                                       AECOTA      T. Brants and W. Skut
                                    Introduction   H. v. Halteren
          Automatic PoS-Tagging Errors Detection   M. Dickinson and W. D. Meurers
                                      Conclusion   P. Kveton and K. Oliva



Closed Class Analysis


       Lexical categories:
            open classes (large, productive categories)
                    verbs
                    nouns
                    adjectives
            closed classes (can be enumerated)
                    determiners
                    prepositions
                    modal verbs
                    auxiliaries




                                                                                    37 / 62
                                       AECOTA      T. Brants and W. Skut
                                    Introduction   H. v. Halteren
          Automatic PoS-Tagging Errors Detection   M. Dickinson and W. D. Meurers
                                      Conclusion   P. Kveton and K. Oliva



Closed Class Analysis


       Lexical categories:
            open classes (large, productive categories)
                    verbs
                    nouns
                    adjectives
            closed classes (can be enumerated)
                    determiners
                    prepositions
                    modal verbs
                    auxiliaries
       up to 50% of the tags in a tagset belong to closed classes (CC)
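
   Because closed classes can be enumerated, a check along these lines
   is possible. This is a hedged sketch only: the tag names, the tiny
   word lists, and the function closed_class_suspects are assumptions
   made for illustration and are not taken from the paper.

```python
# Hypothetical, deliberately incomplete enumeration of two closed classes.
CLOSED_CLASS_LEXICON = {
    "DT": {"the", "a", "an", "this", "that"},
    "MD": {"can", "could", "may", "might", "must", "shall", "should", "will", "would"},
}


def closed_class_suspects(tagged_tokens, lexicon=CLOSED_CLASS_LEXICON):
    """Flag tokens that carry a closed-class tag but are not in that class's list."""
    suspects = []
    for position, (word, tag) in enumerate(tagged_tokens):
        members = lexicon.get(tag)  # None for open-class tags: nothing to check
        if members is not None and word.lower() not in members:
            suspects.append((position, word, tag))
    return suspects


tokens = [("The", "DT"), ("dog", "DT"), ("can", "MD"), ("bark", "VB")]
flagged = closed_class_suspects(tokens)
```

   Here "dog" is flagged because it is tagged DT but is not a listed
   determiner; open-class tags such as VB are skipped.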



Tagging rules




      case sensitive rules to fix annotation errors
      specify a number of specific patterns and state explicitly how
      they should be treated




Detecting variation in syntactic annotation




   What constitutes a nucleus as the unit of data for which we
   compare annotations?




   A single word as the unit doesn’t work, of course :(




Solution



   Algorithm:
       a series of runs with different nucleus sizes
       each run: detect variation in the annotation of strings of one
       specific length
       every string that occurs as a constituent is compared with the
       annotation of the other occurrences of that string




Problems




      only the category assigned to that entire string is compared
      sometimes a string is not assigned a syntactic category
      solution to the latter: assign all non-constituent occurrences
      of a string the special label NIL
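
   The runs over nucleus sizes and the NIL label can be sketched as
   below. This is an illustration under assumptions, not the authors'
   code: each sentence is assumed to be a token list plus a dictionary
   mapping (start, end) spans to constituent labels, and the names
   syntactic_variation and max_len are hypothetical.

```python
from collections import defaultdict


def syntactic_variation(sentences, max_len=3):
    """Return {word_tuple: labels} for strings annotated in more than one way."""
    variation = {}
    for n in range(1, max_len + 1):  # one run per nucleus size
        labels = defaultdict(set)
        for tokens, constituents in sentences:
            for start in range(len(tokens) - n + 1):
                nucleus = tuple(tokens[start:start + n])
                # Occurrences that are not a constituent get the special label NIL.
                labels[nucleus].add(constituents.get((start, start + n), "NIL"))
        for nucleus, seen in labels.items():
            if len(seen) > 1:  # more than one label, so at least one is not NIL
                variation[nucleus] = seen
    return variation


# Made-up toy treebank; spans are (start, end) with end exclusive.
treebank = [
    (["she", "saw", "the", "dog"], {(2, 4): "NP", (1, 4): "VP"}),
    (["he", "fed", "the", "dog"], {(1, 4): "VP"}),  # "the dog" left unbracketed
]
found = syntactic_variation(treebank)
```

   In this toy data "the dog" is an NP in one sentence and NIL in the
   other, so it is the only string reported.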




Questions?





Preliminary phase: Trivial Errors




       Errors that can be detected without any context at all
       Violations of morphological rules
       only the detection of the error is local
       correcting it requires a wider context




Preliminary phase: Trivial Errors, Negra examples




   Negra examples:
       table - IN
       12 000 - one CARD, not two




Medium Phase: Impossible bi-grams




   What are IBs?
       "impossible" or "negative" bi-grams are pairs of adjacent tags
       which constitute an incorrect configuration
       (e.g., the bi-gram ARTICLE - FINITE VERB)




Reasons for IBs


   How do IBs end up in a PoS-tagged corpus?
       Hand-tagged corpus:
            ill-formed text
            human error in tagging
       Statistical tagger:
            ill-formed text
            incorrectly tagged training data
            smoothing process




Smoothing process




   Smoothing process: non-zero probabilities are also assigned to
   bi-grams that were not seen in the learning phase
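
   A small sketch of why smoothing lets impossible bi-grams through.
   The slides do not say which smoothing method a given tagger uses;
   add-one (Laplace) smoothing is assumed here purely as the simplest
   example, and the tag names and training data are made up.

```python
from collections import Counter


def bigram_probability(tag_sequences, tagset):
    """Build an add-one smoothed estimate of P(second | first) for tag bi-grams."""
    bigrams = Counter()
    firsts = Counter()
    for tags in tag_sequences:
        for first, second in zip(tags, tags[1:]):
            bigrams[(first, second)] += 1
            firsts[first] += 1

    def prob(first, second):
        # Add-one smoothing: every bi-gram gets one extra pseudo-count.
        return (bigrams[(first, second)] + 1) / (firsts[first] + len(tagset))

    return prob


tagset = ["ART", "NOUN", "VFIN"]
training = [["ART", "NOUN", "VFIN"], ["ART", "NOUN"]]
p = bigram_probability(training, tagset)
p_seen = p("ART", "NOUN")    # observed twice in training
p_unseen = p("ART", "VFIN")  # never observed, yet not zero
```

   The bi-gram ARTICLE - FINITE VERB never occurs in the training data
   but still receives a positive probability, so the tagger can output it.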




Dealing with Errors at Medium Phase


   We need:
       An error-free and representative corpus
            Representative means: a bi-gram can occur in a grammatical
            sentence of the language if and only if it occurs at least once
            in the corpus
       An absolutely correct set of correct bi-grams (CB) built from this corpus
       The presence of an erroneous CB or the absence of a correct one is
       dramatic!
       IB = the complement of CB (every tag pair that is not in CB)




Dealing with Errors at Medium Phase, cont




   Checking the target corpus: go through it; as soon as you find an IB, it
   must be a mistake.
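
   The medium phase can be sketched as follows, assuming the ideal case
   of an error-free, representative clean corpus. The function names,
   the three-tag tagset, and the data are hypothetical and serve only
   as an illustration.

```python
from itertools import product


def correct_bigrams(clean_sentences):
    """CB: every pair of adjacent tags seen in the clean corpus."""
    return {pair for tags in clean_sentences for pair in zip(tags, tags[1:])}


def impossible_bigrams(tagset, cb):
    """IB: all tag pairs over the tagset that are not in CB."""
    return set(product(tagset, repeat=2)) - cb


def find_errors(target_sentences, ib):
    """Return (sentence index, position, bi-gram) for each IB in the target corpus."""
    hits = []
    for sent_no, tags in enumerate(target_sentences):
        for pos, pair in enumerate(zip(tags, tags[1:])):
            if pair in ib:
                hits.append((sent_no, pos, pair))
    return hits


tagset = ["ART", "NOUN", "VFIN"]
clean = [["ART", "NOUN", "VFIN"], ["NOUN", "VFIN", "ART", "NOUN"]]
cb = correct_bigrams(clean)
ib = impossible_bigrams(tagset, cb)
target = [["ART", "NOUN", "VFIN"], ["ART", "VFIN", "ART"]]
errors = find_errors(target, ib)
```

   Only ARTICLE - FINITE VERB in the second target sentence is
   reported. With a clean corpus this small, many legitimate pairs
   would also end up in IB, which is exactly why representativeness
   matters.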




      But: an error-free representative corpus is a myth!
      Bootstrapping technique:
          hand-clean a small sub-corpus
          prune CB
          generate IB
          prune CB
          check it on a bigger sub-corpus
          combine it with the small one
          repeat as often as you want




Advanced Phase: Variable-length n-grams



   Pros and cons of bi-grams:
       bi-grams are too local
       Any IB signals of a violation of a certain syntactic rule:
            V. of constituency:e.g. fr – PREP reiche – VERB




                                                                                    53 / 62
                                       AECOTA      T. Brants and W. Skut
                                    Introduction   H. v. Halteren
          Automatic PoS-Tagging Errors Detection   M. Dickinson and W. D. Meurers
                                      Conclusion   P. Kveton and K. Oliva



Advanced Phase: Variable-length n-grams



   Pros and cons of bi-grams:
       bi-grams are too local
       Any IB signals a violation of a certain syntactic rule:
            violation of constituency, e.g. für – PREP, reiche – VERB
            violation of feature co-occurrence rules (such as agreement,
            sub-categorization, etc.)




N-grams


  Reasoning for n-grams:
      components of an IB can get separated by material occurring in
      between them
      e.g.: if PREP VFIN is an IB, then PREP ADV VFIN is an IT
      for each bigram [1st, 2nd] in the CB collect all...
            trigrams [1st, 2nd, 3rd] occurring in the corpus
            possible tags in between, into the set PIT (Possible Inner Tags)
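
A minimal sketch of collecting Possible Inner Tags. It assumes one reading of the slide: for each pair of outer tags, PIT holds the tags that occur between them in trigrams found in the corpus. The function name and the data layout (sentences as lists of tags) are illustrative choices, not taken from the source.

```python
from collections import defaultdict


def possible_inner_tags(sentences):
    """Map each (first, last) tag pair to the set of tags observed
    directly between them in some trigram of the corpus."""
    pit = defaultdict(set)
    for sent in sentences:
        # zip over three offsets yields every tag trigram in the sentence
        for first, middle, last in zip(sent, sent[1:], sent[2:]):
            pit[(first, last)].add(middle)
    return dict(pit)
```

A pair with no entry in the result was never seen with exactly one tag in between, which is the situation the PREP ADV VFIN example describes.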




N-grams, cont



   Pros and cons of n-grams:
       – We again rely on an "error-free representative corpus"
       – Thus, the correctness of the resulting "impossible n-grams"
       has to be hand-checked
       ++ The resulting "impossible n-grams" are an extremely
       efficient tool for error detection




N-grams, cont


   Another trap:
       It is risky to judge an IN on the basis of I(N-1) data
       E.g.: an IT [1st, 2nd, 3rd] cannot be detected as an IT if
            (1st, 2nd)
            (2nd, 3rd)
            (1st, 3rd)
       are all possible bi-grams
       e.g.: [nominative-noun, main-verb, nominative-noun]
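
The trap can be illustrated with a small check. It is a sketch that assumes a single set of possible tag pairs is used for all three component pairs; the tag names are illustrative only.

```python
def detectable_from_bigrams(trigram, possible_bigrams):
    """True if bigram data alone can expose the trigram as impossible,
    i.e. at least one of its component pairs is not a possible bigram."""
    first, second, third = trigram
    pairs = [(first, second), (second, third), (first, third)]
    return any(pair not in possible_bigrams for pair in pairs)


# Every component pair is possible on its own, so bigram data cannot
# flag the trigram, whatever its actual status is.
possible = {("N-NOM", "VMAIN"), ("VMAIN", "N-NOM"), ("N-NOM", "N-NOM")}
print(detectable_from_bigrams(("N-NOM", "VMAIN", "N-NOM"), possible))
```

When the function returns `False`, the trigram has to be judged on trigram-level evidence.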







Results




   Negra corpus — 2661 errors corrected







Questions?




   Questions?




Results, cont.

   AEDOTC
      is necessary
            to receive correct data on probabilities
            to improve other parsers’ and taggers’ efficiency
            to detect inconsistencies in manual annotation and guidelines
      relies on such techniques as:
            bootstrapping
            Markov models
            additional reliability checks
            n-grams and context classification
      to fight:
            inconsistency
            variation





Open problems




  Open problems — detecting and tagging idioms







References
      Brants, T. and W. Skut (1998). Automation of treebank annotation.
      In Proceedings of New Methods in Language Processing (NeMLaP-98),
      Sydney, Australia.
      Dickinson, M. and W. D. Meurers (2003a). Detecting errors in
      part-of-speech annotation. In Proceedings of the 10th Conference
      of the European Chapter of the Association for Computational
      Linguistics (EACL-03), Budapest, pp. 107-114.
      Dickinson, M. and W. D. Meurers (2003b). Detecting
      inconsistencies in treebanks. In J. Nivre and E. Hinrichs (Eds.),
      Proceedings of the 2nd International Workshop on Treebanks and
      Linguistic Theories, Växjö, Sweden.
      Halteren, H. (2000). The detection of inconsistency in manually
      tagged text. In Proceedings of the Workshop on Linguistically
      Interpreted Corpora (LINC-2000), Luxembourg, pp. 48-55.
      Kveton, P. and K. Oliva (2002). (Semi-)Automatic Detection of
      Errors in PoS-Tagged Corpora, Wien, Austria.



The End




  Thank you for your attention!



