Part of Speech Tagger for Assamese Text


Navanath Saharia   Dhrubajyoti Das    Utpal Sharma            Jugal Kalita
Department of CSE Department of CSE  Department of CSE     Department of CS
Tezpur University  Tezpur University Tezpur University   University of Colorado
  India - 784028    India - 784028     India - 784028   Colorado Springs - 80918
    {nava tu,dhruba it06,utpal}

Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 33–36,
Suntec, Singapore, 4 August 2009. © 2009 ACL and AFNLP

Abstract

    Assamese is a morphologically rich, agglutinative and relatively
    free word order Indic language. Although spoken by nearly 30 million
    people, very little computational linguistic work has been done for
    this language. In this paper, we present our work on part of speech
    (POS) tagging for Assamese using the well-known Hidden Markov Model.
    Since no well-defined suitable tagset was available, we develop a
    tagset of 172 tags in consultation with experts in linguistics. For
    successful tagging, we examine relevant linguistic issues in
    Assamese. For unknown words, we perform simple morphological
    analysis to determine probable tags. Using a manually tagged corpus
    of about 10,000 words for training, we obtain a tagging accuracy of
    nearly 87% for test inputs.

1   Introduction

Part of Speech (POS) tagging is the process of marking up words and
punctuation characters in a text with appropriate POS labels. The
problems faced in POS tagging are many. Many words that occur in natural
language texts are not listed in any catalog or lexicon. A large
percentage of words also show ambiguity regarding lexical category.
   The challenges of our work on POS tagging for Assamese, an
Indo-European language, are compounded by the fact that very little
prior computational linguistic work exists for the language, though it
is a national language of India and spoken by over 30 million people.
Assamese is a morphologically rich, free word order, inflectional
language. Although POS-tagged corpora for some Indian languages such as
Hindi, Bengali and Telugu (SPSAL, 2007) have become available lately, a
POS-tagged corpus for Assamese was unavailable until we started creating
one for the work presented in this paper. Another problem was that a
clearly defined POS tagset for Assamese was unavailable to us. As a part
of the work reported in this paper, we have developed a tagset
consisting of 172 tags; using this tagset, we have manually tagged a
corpus of about ten thousand Assamese words.
   In the next section we provide a brief relevant linguistic background
of Assamese. Section 3 contains an overview of work on POS tagging.
Section 4 describes our experimental setup. In Section 5, we analyse the
results of our work and compare the performance with other models.
Section 6 concludes this paper.

2   Linguistic Characteristics of Assamese

In Assamese, secondary forms of words are formed through three
processes: affixation, derivation and compounding. Affixes play a very
important role in word formation. Affixes are used in the formation of
relational nouns and pronouns, and in the inflection of verbs with
respect to number, person, tense, aspect and mood. For example, Table 1
shows how a relational noun deutA (father) is inflected depending on
number and person (Goswami, 2003). Although Assamese has relatively free
word order, the predominant word order is subject-object-verb (SOV).
   The following paragraphs describe just a few of the many
characteristics of Assamese text that make the tagging task complex.

  • Depending on the context, even a common word may have different POS
    tags. For example, if kArane, dare, nimitte, hetu, etc., are placed
    after a pronominal adjective, they are considered conjunctions, and
    if placed after
    a noun or personal pronoun, they are considered particles. For
    example,
       TF: ei kArane moi nagalo.
       This + why + I + did not go.
       ET: This is why I did not go.
       TF: rAmar kArane moi nagalo.
       Ram's + because of + I + did not go.
       ET: I did not go because of Ram.
    (TF: transliterated Assamese form; ET: approximate English
    translation.)
    In the first sentence kArane is placed after the pronominal
    adjective ei, so kArane is considered a conjunction. But in the
    second sentence kArane is placed after the noun RAm, and hence
    kArane is considered a particle.

  • Some prepositions or particles are used as suffixes if they occur
    after a noun, personal pronoun or verb. For example,
       TF: sihe goisil.
       ET: Only he went.
    Actually he (only) is a particle, but it is merged with the personal
    pronoun si.

  • An affix denoting number, gender or person can be added to an
    adjective or other category word to create a noun word. For example,
       TF: dhuniyAjoni hoi aAhisA.
       ET: You are looking beautiful.
    Here dhuniyA (beautiful) is an adjective, but after adding the
    feminine suffix joni the whole constituent becomes a noun word.

  • Even conjunctions can be used as other parts of speech.
       TF: Hari aAru Jadu bhAyek kokAyek.
       ET: Hari and Jadu are brothers.
       TF: JowAkAli rAtir ghotonAtowe bishoitok aAru adhik
           rahashyajanak kori tulile.
       ET: Last night's incident has made the matter more mysterious.
    The word aAru (and) shows ambiguity in these two sentences. In the
    first, it is used as a conjunction (i.e. Hari and Jadu) and in the
    second, it is used as an adjective of adjective.

Table 1: Personal definitives are inflected on person and number
(transliterated forms with English glosses).

 Person          Singular                       Plural
 1st             mor deutA (my father)          aAmAr deutA (our father)
 2nd             tomAr deutArA (your father)    tomAlokar deutArA (your father)
 2nd, familiar   tor deutAr (your father)       tahator deutAr (your father)
 3rd             tAir deutAk (her father)       sihator deutAk (their father)

3   Related Work

Several approaches have been used for building POS taggers. The two main
approaches are supervised and unsupervised. Both supervised and
unsupervised tagging can be of three sub-types: rule based, stochastic
and neural network based. There are a number of pros and cons for each
of these methods. The most common stochastic tagging technique is the
Hidden Markov Model (HMM).
   During the last two decades, many different types of taggers have
been developed, especially for corpus-rich languages such as English.
Nevertheless, due to relatively free word order, agglutinative nature,
lack of resources and the general lateness in entering the computational
linguistics field in India, reported tagger development work on Indian
languages is relatively scanty. Among reported works, Dandapat (2007)
developed a hybrid model of POS tagging by combining both supervised and
unsupervised stochastic techniques. Avinesh and Karthik (2007) used
conditional random fields and transformation based learning. The heart
of the system developed by Singh et al. (2006) for Hindi was the
detailed linguistic analysis of morpho-syntactic phenomena, adroit
handling of suffixes, accurate verb group identification and learning of
disambiguation rules. Saha et al. (2004) developed a system for machine
assisted POS tagging of Bangla corpora. Pammi and Prahallad (2007)
developed a POS tagger and chunker using Decision Forests. This work
explored different methods for POS tagging of Indian languages using
sub-words as units. Generally, most POS taggers for Indian languages use
a morphological analyzer as a module. However, building a morphological
analyzer for a particular Indian language is a very difficult task.

4   Our Approach

We have used an Assamese text corpus (Corpus Asm) of nearly 300,000
words from the online version of the Assamese daily Asomiya Pratidin
(Sharma et al., 2008). The downloaded articles use a font-based encoding
called Luit. For our experiments we transliterate the texts to a
normalised Roman encoding using transliteration software developed by
us. We manually tag a part of this corpus, Tr, consisting of nearly
10,000 words for training. We use other portions of Corpus Asm for
testing the tagger.
   There was no tagset for Assamese before we started the project
reported in this paper. Due to the morphological richness of the
language, many words of Assamese occur in secondary forms in texts. This
increases the number of POS tags that are needed for the language. Also,
there are often differences of opinion among linguists on the tags that
may be associated with certain words in texts. We developed a tagset
after in-depth consultation with linguists and manually tagged text
segments of nearly 10,000 words according to their guidance. To make the
tagging process easier we have subcategorised each category of noun and
personal pronoun based on six case endings (viz., nominative,
accusative, instrumental, dative, genitive and locative) and two
numbers.
   We have used an HMM (Dermatas and Kokkinakis, 1995) and the Viterbi
algorithm (Viterbi, 1967) in developing our POS tagger. The HMM/Viterbi
approach is the most useful method when a pretagged corpus is not
available. First, in the training phase, we manually tag the Tr part of
the corpus using the tagset discussed above. Then, we build four
database tables using probabilities extracted from the manually tagged
corpus: a word-probability table, a previous-tag-probability table, a
starting-tag-probability table and an affix-probability table.
   For testing, we consider three text segments, A, B and C, each of
about 1000 words. First the input text is segmented into sentences. Each
sentence is parsed individually. Each word of a sentence is stored in an
array. After that, each word is searched in the word-probability table.
If the word is unknown, its possible affixes are extracted and searched
in the affix-probability table. From this search, we obtain the probable
tags and their corresponding probabilities for each word. All these
probable tags and the corresponding probabilities are stored in a two
dimensional array which we call the lattice of the sentence. If we do
not get probable tags and probabilities for a certain word from these
two tables, we assign the tag CN (Common Noun) and probability 1 to the
word, since the occurrence of CN is highest in the manually tagged
corpus. After forming the lattice, the Viterbi algorithm is applied to
the lattice, which yields the most probable tag sequence for that
sentence. After that, the next sentence is taken and the same procedure
is repeated.

5   Experimental Evaluation

Table 2: POS tagging results with small corpora. Size of training
corpus: 10,000 words. UWH: unknown word handling; UPH: unknown proper
noun handling.

 Test set   Size   Average accuracy   UWH accuracy   UPH accuracy
 A          992    84.68%             62.8%          42.0%
 B          1074   89.94%             67.54%         53.96%
 C          1241   86.05%             85.64%         26.47%

Table 3: Comparison of our result with other HMM based models.

 Author                        Language    Average accuracy
 Toutanova et al. (2003)       English     97.24%
 Banko and Moore (2004)        English     96.55%
 Dandapat and Sarkar (2006)    Bengali     84.37%
 Rao et al. (2007)             Hindi       76.34%
                               Bengali     72.17%
                               Telugu      53.17%
 Rao and Yarowsky (2007)       Hindi       70.67%
                               Bengali     65.47%
                               Telugu      65.85%
 Sastry et al. (2007)          Hindi       69.98%
                               Bengali     67.52%
                               Telugu      68.32%
 Ekbal et al. (2007)           Hindi       71.65%
                               Bengali     80.63%
                               Telugu      53.15%
 Ours                          Assamese    86.89%

The results using the three test segments are summarised in Table 2. The
evaluation of the results requires intensive manual verification effort.
A larger training corpus is likely to produce more accurate results.
More reliable results can be obtained using larger test corpora. Table 3
compares our result with other HMM based reported work. From the table
it is clear that
Toutanova et al. (2003) obtained the best result for English (97.24%).
Among HMM based experiments reported on Indian languages, we have
obtained the best result (86.89%). This work is ongoing and the corpus
size and the amount of tagged text are being increased on a regular
basis.
   The accuracy of a tagger depends on the size of the tagset used, the
vocabulary used, and the size, genre and quality of the corpus used. Our
tagset containing 172 tags is rather big compared to other Indian
language tagsets. A smaller tagset is likely to give more accurate
results, but may give less information about word structure and
ambiguity. The corpora for training and testing our tagger are taken
from an Assamese daily newspaper, Asomiya Pratidin; thus they are of the
same genre.

6   Conclusion & Future work

We have achieved good POS tagging results for Assamese, a fairly widely
spoken language which had very little prior computational linguistic
work. We have obtained an average tagging accuracy of nearly 87% using a
training corpus of just 10,000 words. Our main achievement is the
creation of the Assamese tagset, which was not available before starting
this project. We have implemented an existing method for POS tagging,
but our work is for a new language where an annotated corpus and a
pre-defined tagset were not available.
   We are currently working on developing a smaller and more compact
tagset. We propose the following additional work for improved
performance. First, the size of the manually tagged part of the corpus
will have to be increased. Second, a suitable procedure for handling
unknown proper nouns will have to be developed. Third, if this system
can be expanded to trigrams or even n-grams using a larger training
corpus, we believe that the tagging accuracy will increase.

Acknowledgement

We would like to thank Dr. Jyotiprakash Tamuli, Dr. Runima Chowdhary and
Dr. Madhumita Barbora for their help, especially in making the Assamese
tagset.

References

Avinesh PVS & Karthik G. POS tagging and chunking using Conditional
  Random Field and Transformation based learning. IJCAI-07 workshop on
  Shallow Parsing for South Asian Languages. 2007.

Banko, M., & Moore, R. Part of speech tagging in context. 20th
  International Conference on Computational Linguistics. 2004.

Dandapat, S. Part-of-Speech Tagging and Chunking with Maximum Entropy
  Model. Workshop on Shallow Parsing for South Asian Languages. 2007.

Dandapat, S., & Sarkar, S. Part-of-Speech Tagging for Bengali with
  Hidden Markov Model. NLPAI ML workshop on Part of speech tagging and
  Chunking for Indian language. 2006.

Dermatas, E., & Kokkinakis, G. Automatic stochastic tagging of natural
  language text. Computational Linguistics 21: 137-163. 1995.

Ekbal, A., Mandal, S., & Bandyopadhyay, S. POS tagging using HMM and
  rule based chunking. Workshop on Shallow Parsing for South Asian
  Languages. 2007.

Goswami, G. C. Asamīyā Vyākaraṇ Pravesh, Second edition. Bina Library,
  Guwahati. 2003.

IJCAI-07 workshop on Shallow Parsing for South Asian Languages.
  Hyderabad, India.

Pammi, S.C., & Prahallad, K. POS tagging and chunking using Decision
  Forests. Workshop on Shallow Parsing for South Asian Languages. 2007.

Rao, D., & Yarowsky, D. Part of speech tagging and shallow parsing of
  Indian languages. IJCAI-07 workshop on Shallow Parsing for South Asian
  Languages. 2007.

Rao, P.T., Ram, S.R., Vijaykrishna, R., & Sobha, L. A text chunker and
  hybrid pos tagger for Indian languages. IJCAI-07 workshop on Shallow
  Parsing for South Asian Languages. 2007.

Saha, G.K., Saha, A.B., & Debnath, S. Computer Assisted Bangla Words POS
  Tagging. Proc. International Symposium on Machine Translation NLP &
  TSS. 2004.

Sastry, G.M.R., Chaudhuri, S., & Reddy, P.N. A HMM based part-of-speech
  and statistical chunker for 3 Indian languages. IJCAI-07 workshop on
  Shallow Parsing for South Asian Languages. 2007.

Sharma, U., Kalita, J., & Das, R. K. Acquisition of Morphology of an
  Indic language from text corpus. ACM TALIP. 2008.

Singh, S., Gupta, K., Shrivastava, M., & Bhattacharyya, P. Morphological
  richness offsets resource demand - experiences in constructing a POS
  tagger for Hindi. COLING/ACL. 2006.

Toutanova, K., Klein, D., Manning, C.D., & Singer, Y. Feature-Rich
  part-of-speech tagging with a Cyclic Dependency Network. HLT-NAACL.
  2003.

Viterbi, A.J. Error bounds for convolutional codes and an asymptotically
  optimum decoding algorithm. IEEE Transactions on Information Theory
  13(2): 260-269. 1967.
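
The context rule described in Section 2 for words such as kArane (a conjunction after a pronominal adjective, a particle after a noun or personal pronoun) can be illustrated with a small sketch. The function name and the tag labels below are hypothetical placeholders chosen for readability; they are not tags from the 172-tag tagset, and in the tagger itself such ambiguities are resolved statistically rather than by a hand-written rule.

```python
# Illustrative only: a hand-written version of the context rule from Section 2.
# The tag labels are hypothetical placeholders, not tags from the paper's tagset.

# Words named in Section 2 whose category depends on the preceding word.
CONTEXT_WORDS = {"kArane", "dare", "nimitte", "hetu"}


def tag_context_word(word, previous_tag):
    """Return the category of a context-dependent word, or None if the rule is silent."""
    if word not in CONTEXT_WORDS:
        return None
    if previous_tag == "PRONOMINAL_ADJECTIVE":
        return "CONJUNCTION"
    if previous_tag in {"NOUN", "PERSONAL_PRONOUN"}:
        return "PARTICLE"
    return None


# "ei kArane moi nagalo": kArane follows the pronominal adjective ei.
first = tag_context_word("kArane", "PRONOMINAL_ADJECTIVE")
# "rAmar kArane moi nagalo": kArane follows a noun.
second = tag_context_word("kArane", "NOUN")
```

Under these assumptions, `first` is "CONJUNCTION" and `second` is "PARTICLE", matching the two example sentences in Section 2.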
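
The training and tagging procedure of Section 4 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments. The paper does not define how the table probabilities are estimated, so plain relative frequencies are assumed here; affixes are assumed to be word-final substrings of up to three characters; and the tags PPR and VB, the toy corpus and all function names are hypothetical. Only the fallback tag CN with probability 1 is taken from the text.

```python
from collections import Counter, defaultdict


def normalise(counter):
    """Turn raw counts into relative frequencies."""
    total = sum(counter.values())
    return {key: count / total for key, count in counter.items()}


def train(tagged_sentences, max_affix_len=3):
    """Build the four probability tables from sentences of (word, tag) pairs."""
    word_counts = defaultdict(Counter)   # word -> tag counts
    prev_counts = defaultdict(Counter)   # previous tag -> tag counts
    affix_counts = defaultdict(Counter)  # suffix -> tag counts (assumed affix model)
    start_counts = Counter()             # sentence-initial tag counts
    for sentence in tagged_sentences:
        previous = None
        for word, tag in sentence:
            word_counts[word][tag] += 1
            for n in range(1, min(max_affix_len, len(word) - 1) + 1):
                affix_counts[word[-n:]][tag] += 1
            if previous is None:
                start_counts[tag] += 1
            else:
                prev_counts[previous][tag] += 1
            previous = tag
    return {
        "word": {w: normalise(c) for w, c in word_counts.items()},
        "prev": {t: normalise(c) for t, c in prev_counts.items()},
        "affix": {a: normalise(c) for a, c in affix_counts.items()},
        "start": normalise(start_counts),
    }


def candidates(word, tables, max_affix_len=3):
    """One lattice column: probable tags and probabilities for a word."""
    if word in tables["word"]:
        return tables["word"][word]
    # Unknown word: try its suffixes, longest first.
    for n in range(min(max_affix_len, len(word) - 1), 0, -1):
        if word[-n:] in tables["affix"]:
            return tables["affix"][word[-n:]]
    return {"CN": 1.0}  # fallback described in Section 4


def viterbi(words, tables, floor=1e-6):
    """Most probable tag sequence over the lattice; floor is an assumed
    smoothing value for unseen starting tags and tag transitions."""
    if not words:
        return []
    lattice = [candidates(w, tables) for w in words]
    # best[i][tag] = (score of best path ending in tag, previous tag on that path)
    best = [{t: (tables["start"].get(t, floor) * p, None)
             for t, p in lattice[0].items()}]
    for column in lattice[1:]:
        scores = {}
        for tag, p in column.items():
            prev, score = max(
                ((pt, ps * tables["prev"].get(pt, {}).get(tag, floor))
                 for pt, (ps, _) in best[-1].items()),
                key=lambda item: item[1])
            scores[tag] = (score * p, prev)
        best.append(scores)
    # Follow the back-pointers from the best final tag.
    tag = max(best[-1], key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


# Toy corpus built from words that appear in Section 2; the tags are hypothetical.
corpus = [[("moi", "PPR"), ("nagalo", "VB")],
          [("si", "PPR"), ("goisil", "VB")]]
tables = train(corpus)
tags = viterbi(["moi", "goisil"], tables)
```

On this toy corpus `tags` is `["PPR", "VB"]`. Products of probabilities are used for brevity; for sentences of realistic length, summing log-probabilities would avoid numerical underflow.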
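
As a quick arithmetic check on Section 5, the overall figure of 86.89% can be recovered from the per-segment accuracies in Table 2. The sketch below assumes the overall figure is the unweighted mean of the three segment accuracies; the paper does not state how the average was taken, so the size-weighted mean is computed alongside it for comparison.

```python
# Sizes and average accuracies (in percent) of the three test segments in Table 2.
results = {"A": (992, 84.68), "B": (1074, 89.94), "C": (1241, 86.05)}

# Unweighted mean over the three segments (assumed averaging method).
macro = sum(acc for _, acc in results.values()) / len(results)

# Mean weighted by segment size, shown for comparison.
total_words = sum(size for size, _ in results.values())
weighted = sum(size * acc for size, acc in results.values()) / total_words

print(round(macro, 2))  # 86.89
print(round(weighted, 2))
```

Both ways of averaging give a value close to 86.9%, in line with the "nearly 87%" quoted in the abstract.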