Rule Based Hindi to English Transliteration System for Proper Names

					                                                      (IJCSIS) International Journal of Computer Science and Information Security,
                                                      Vol. 10, No. 8, August 2012




 Rule Based Hindi to English Transliteration System
                for Proper Names
                 Monika Bhargava (1), M. Kumar (2), Sujoy Das (3)

   (1) M.Tech Scholar, CSE Department, SIRT, Bhopal, India; monika_bhrgv@yahoo.co.in
   (2) Professor, CSE Department, SIRT, Bhopal, India; prof.mkumar@gmail.com
   (3) Associate Professor, Department of Computer Application, MANIT, Bhopal, India; sujdas@gmail.com


   Abstract- Cross Language Information Retrieval (CLIR) systems often use a bilingual dictionary to translate a user query from one language to another. A problem arises when a query term is not available in the bilingual dictionary. Such terms are called Out of Vocabulary (OOV) words and should be transliterated during the translation process. OOV words are mainly proper nouns, named entities, and technical terms. We have developed a rule based transliteration system from Hindi to English script. We have also created a database of specialized spellings, e.g. some city names, person names, etc., which has considerably improved the performance of our system.

Keywords- CLIR, OOV, Transliteration

                       I. INTRODUCTION
    In the past 20 years, the area of Information Retrieval (IR) has grown well beyond its primary goals of indexing text and searching for useful documents in a collection. Nowadays, research in IR includes modelling, document classification and categorization, data visualization, filtering, etc. The Web is becoming a universal repository of human knowledge and culture, which has allowed unprecedented sharing of ideas and information on a scale never seen before. The Web is now seen as a publishing medium accessible to everybody. Web content is growing very rapidly and contains information written in many languages. Often a Web user needs information that is written in an unfamiliar language but wishes to get it in his or her native language. This is possible through a process called Cross Language Information Retrieval (CLIR). Several methods are used to convert text from one language to another. Machine translation systems are available for many language pairs and are widely used. Bilingual dictionaries are also frequently used for this purpose. A major problem arises when a word of the text is not available in the bilingual dictionary. Such words are called Out of Vocabulary (OOV) words and should be transliterated.
    Transliteration is the task of transcribing a word or text from one writing system into another such that the pronunciation of the word remains the same and a person reading the transcribed word can pronounce it as in the original language. In other words, transliteration re-expresses a text written in its customary orthography in the script of another language. It is different from translation, which generates the meaning of the input text; e.g. the English word "book" is translated into Hindi as किताब but transliterated as बुक. OOV words are problematic in CLIR. A common source of error in CLIR is out of vocabulary words, named entities and technical terms.
    Among OOV words, proper nouns pose a major problem in transliteration. This is because a proper noun (e.g. the name of a person) is spelled differently by different people. This research has developed a rule based Hindi-English transliteration system, especially for proper nouns, with a fair degree of accuracy.

                       II. LITERATURE SURVEY
    The problem of transliteration has been studied by a number of researchers during the last decade. Knight and Graehl [1] use five probability distributions at various phases of transliteration for the language pair English to Katakana (a Japanese syllabary). Al-Onaizan and Knight [2] have studied a transliteration system from Arabic to English writing, which uses an existing named entity recognition system. Ekbal et al. [3] have considered a Bengali to English transliteration scheme and used a supervised training set to obtain a direct orthographic mapping. Lehal and Saini [4] have developed a Hindi to Urdu transliteration system by improving on the work of Bushra and Tafseer [5]. Lehal and Saini have claimed an accuracy of 99.46% when Hindi Unicode text is transliterated to Urdu.
    Huang et al. [6] have developed a system which extracts Hindi-English named entity pairs through alignment of a parallel corpus. Chinese-English pairs are first extracted using dynamic programming string matching; this model is then adapted for Hindi-English named entity pairs. Sinha [7] has developed a simple yet powerful method for mining Hindi-English names from a parallel text corpus. The Hindi text written in Devanagari is first converted to IITK-Roman form, which is a direct representation of the UTF-8 or ISCII-8 coding scheme; an accuracy of nearly 93% is claimed.
    In this paper, an effort is made to develop a rule based transliteration scheme for proper names of the Hindi-English language pair. The system is under extensive experimentation and test.




                                                                    18                             http://sites.google.com/site/ijcsis/
                                                                                                   ISSN 1947-5500


        III. OVERVIEW OF HINDI AND ENGLISH SCRIPTS
    Hindi, an official language of India, is an Indo-Aryan language with about 487 million speakers. Hindi is written in the Devanagari script, which uses 52 symbols to represent 10 vowels, 40 consonants and 2 modifiers. The vowels are written in two forms, independent and dependent. The dependent form is also known as a matraa. The independent form is used when the vowel appears at the beginning of a word or immediately follows another vowel. The dependent form is used when the vowel follows a consonant [8].
    English is the most widely used language in the world. Approximately 375 million people speak English. It has been referred to as a ‘world language’. English speakers have many different accents, which often signal the speaker’s native dialect or language. English belongs to the West Germanic branch of the Indo-European family. English has 21 consonants and 5 vowels.
    The Indian languages TRANSliteration (ITRANS) is an ASCII transliteration scheme for Indic scripts, particularly for the Devanagari script [9]. It is a pre-processor that converts English-encoded text into the scripts of various Indian languages and uses a 7-bit ASCII encoding scheme (see [10]).

Mapping from Hindi to English
    There is no one to one correspondence between the Hindi and English scripts. Tables I and II show the ITRANS mapping from the source language (Hindi) to the target language (English). We have used these mappings to transliterate proper names from Hindi to English.

                           TABLE I
            MAPPING OF VOWELS FROM HINDI TO ENGLISH

     S.No   Hindi Vowel         Hindi Vowel           English Vowel
            (Dependent Form)    (Independent Form)
     1      (none)              अ                     A
     2      ा                   आ                     AA
     3      ि                   इ                     I
     4      ी                   ई                     II
     5      ु                   उ                     U
     6      ू                   ऊ                     UU
     7      ृ                   ऋ                     RRI
     8      े                   ए                     E
     9      ै                   ऐ                     AI
     10     ो                   ओ                     O
     11     ौ                   औ                     AU
     12     ं                   अं                    AM
     13     ः                   अः                    AH

                           TABLE II
       MAPPING OF MOST CONSONANTS FROM HINDI TO ENGLISH

     S.No   Hindi   English
     1      क       KA
     2      ख       KHA
     3      ग       GA
     4      घ       GHA
     5      ङ       NA
     6      च       CHA
     7      छ       CHHA
     8      ज       JA
     9      झ       JHA
     10     ञ       NA
     11     ट       TA
     12     ठ       THA
     13     ड       DA
     14     ढ       DHA
     15     ण       NA
     16     त       TA
     17     थ       THA
     18     द       DA
     19     ध       DHA
     20     न       NA
     21     प       PA
     22     फ       PHA
     23     ब       BA
     24     भ       BHA
     25     म       MA
     26     य       YA
     27     र       RA
     28     ल       LA
     29     व       VA OR WA
     30     श       SHA
     31     ष       SHA OR SHHA
     32     स       SA
     33     ह       HA
     34     क्ष      KSHA
     35     ज्ञ      GYA
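    The character-by-character mapping of Tables I and II can be illustrated with a short sketch. This is a minimal Python illustration and not the authors' Java implementation. The function name map_itrans and the choice to store base consonants (K rather than KA) and add the inherent ‘A’ separately are assumptions made here for clarity, and only a small subset of both tables is included.

```python
# Subset of Tables I and II. Table II lists consonants together with their
# inherent vowel (KA, KHA, ...); here only the base letters are stored and
# the inherent 'A' is appended by the mapping function.
VIRAMA = "\u094d"  # halant: suppresses the inherent 'A' of a consonant
INDEPENDENT_VOWELS = {"अ": "A", "आ": "AA", "इ": "I", "ई": "II", "उ": "U", "ऊ": "UU"}
MATRAS = {"ा": "AA", "ि": "I", "ी": "II", "ु": "U", "ू": "UU"}
CONSONANTS = {"क": "K", "ण": "N", "त": "T", "न": "N", "प": "P",
              "भ": "BH", "य": "Y", "र": "R", "ल": "L", "व": "V"}


def map_itrans(word):
    """Map a Devanagari string to Roman letters, one code point at a time."""
    out = []
    for i, ch in enumerate(word):
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            # The consonant keeps its inherent 'A' unless a matra or virama follows.
            if nxt not in MATRAS and nxt != VIRAMA:
                out.append("A")
        elif ch in MATRAS:
            out.append(MATRAS[ch])
        elif ch in INDEPENDENT_VOWELS:
            out.append(INDEPENDENT_VOWELS[ch])
        elif ch != VIRAMA:
            raise ValueError(f"character not in the sample tables: {ch!r}")
    return "".join(out)


# The four names of Table III, "Through Mapping" column.
for name in ["पलक", "अभय", "भरत", "करण"]:
    print(map_itrans(name))  # PALAKA, ABHAYA, BHARATA, KARANA
```

    The trailing ‘A’ in each output is the inherent vowel of the final consonant. It is this kind of mismatch with the commonly used spelling (PALAK, ABHAY, ...) that the correction rules of the system are meant to remove.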






                          IV. EXPERIMENTAL SETUP
      We have developed a rule based transliteration system using the Java SE Development Kit (JDK), version 6, with MySQL Server 5.5 for the database. The program flow and system architecture of the Hindi to English rule based transliterator are shown in Fig. 1. The Hindi string is first searched in the database created for specialized spellings used in proper names. If a match is found in the database, its transliteration equivalent is produced. If the string is not found in the database, it is transliterated using the nine rules given below.

       Hindi Word
           |
       Database Look Up        <---  Special Spellings Database
           |
       Transliteration Rules   <---  Mapping Tables
           |
       English Output

Fig. 1. Program Flow of the Transliteration system

   Observations and Rules Creation
      Sample results of our experimentation are shown in Tables III-XI, which list sample Hindi strings, the expected English transliteration (commonly used English spelling), the transliteration through ITRANS mappings, and the result produced by our transliteration system.

                                    TABLE III
                      SAMPLE 1 (TOTAL 45 STRINGS TESTED)
     Hindi      Expected            Through             Rule Based
     String     Transliteration     Mapping             Transliteration
     पलक        PALAK               PALAKA              PALAK
     अभय        ABHAY               ABHAYA              ABHAY
     भरत        BHARAT              BHARATA             BHARAT
     करण        KARAN               KARANA              KARAN

   1) Rule1: The observation of Table III gives the rule that if the proper noun is 3 characters long and contains no vowel sign (matraa), then ‘A’ is removed from the last position.

                                    TABLE IV
                      SAMPLE 2 (TOTAL 45 STRINGS TESTED)
     Hindi      Expected            Through             Rule Based
     String     Transliteration     Mapping             Transliteration
     हरशद       HARSHAD             HARASHADA           HARSHAD
     तनमय       TANMAY              TANAMAYA            TANMAY
     अरनव       ARNAV               ARANAVA             ARNAV
     अकबर       AKBAR               AKABARA             AKBAR

   2) Rule2: The observation of Table IV gives the rule that if the proper noun is 4 characters long and contains no vowel sign (matraa), then ‘A’ is removed from the second and last positions.

                                    TABLE V
                      SAMPLE 3 (TOTAL 45 STRINGS TESTED)
     Hindi      Expected            Through             Rule Based
     String     Transliteration     Mapping             Transliteration
     कपिल       KAPIL               KAPILA              KAPIL
     अभिनव      ABHINAV             ABHINAVA            ABHINAV
     विपिन      VIPIN               VIPINA              VIPIN
     मुकुल       MUKUL               MUKULA              MUKUL

   3) Rule3: The observation of Table V gives the rule that if the proper noun ends with a consonant, then ‘A’ should be removed from the last position in the English spelling.

                                    TABLE VI
                      SAMPLE 4 (TOTAL 45 STRINGS TESTED)
     Hindi      Expected            Through             Rule Based
     String     Transliteration     Mapping             Transliteration
     विमला      VIMLA               VIMALAA             VIMLA
     पदमा       PADMA               PADAMAA             PADMA
     बुशरा       BUSHRA              BUSHARAA            BUSHRA
     बलदेव      BALDEV              BALADEVA            BALDEV

   4) Rule4: The observation of Table VI gives the rule that if two consonants occur in succession in a proper noun, the latter consonant is followed by a vowel, and the index of the first consonant is greater than 1, then ‘A’ is removed from the first consonant during transliteration.

                                    TABLE VII
                      SAMPLE 5 (TOTAL 45 STRINGS TESTED)
     Hindi      Expected            Through             Rule Based
     String     Transliteration     Mapping             Transliteration
     आसिफ       AASIF               AASIPHA             AASIF
     आमना       AAMNA               AAMANAA             AAMNA
     हिमानी      HIMANI              HIMAANII            HIMANI
     गरिमा      GARIMA              GARIMAA             GARIMA

   5) Rule5: The observation of Table VII gives the rule that if the proper noun begins with ‘आ’ it is replaced with ‘AA’, and if






its diacritical form ‘ा’ occurs in the middle or at the end, it is replaced with ‘A’ during transliteration.

                              TABLE VIII
                   SAMPLE 6 (TOTAL 45 STRINGS TESTED)
 Hindi    Expected          Through            Rule Based
 String   Transliteration   Mapping            Transliteration
 ईश्वर     ISHWAR            IISHWARA           ISHWAR
 करीना    KAREENA           KARIINAA           KAREENA
 दीपक     DEEPAK            DIIPAKA            DEEPAK
 मानसी    MANSI             MAANASII           MANSI

6) Rule6: The observation of Table VIII gives the rule that if the proper noun begins with ‘ई’ or ends with its diacritical form ‘ी’, it is replaced with ‘I’, and if the diacritical form occurs in the middle, it is replaced with ‘EE’ instead of ‘II’ during transliteration.

                               TABLE IX
                   SAMPLE 7 (TOTAL 45 STRINGS TESTED)
 Hindi    Expected          Through            Rule Based
 String   Transliteration   Mapping            Transliteration
 पूनम     POONAM            PUUNAMA            POONAM
 अपूर्वा    APOORVA           APUURVAA           APOORVA
 चारू     CHARU             CHAARUU            CHARU
 कुहू      KUHU              KUHUU              KUHU

7) Rule7: The observation of Table IX gives the rule that if the diacritical form of ऊ (ू) occurs in the middle of a proper noun, it is replaced with ‘OO’, and if it occurs at the end of the name, it is replaced with ‘U’ instead of ‘UU’ during transliteration.

                               TABLE X
                   SAMPLE 8 (TOTAL 45 STRINGS TESTED)
 Hindi    Expected             Through            Rule Based
 String   Transliteration      Mapping            Transliteration
 आफताब    AAFTAB               AAPHATAAB          AAFTAB
 फाल्गुनी   FALGUNI / PHALGUNI   PHAALGUNII         FALGUNI
 आफरीन    AAFREEN              AAPHARIINA         AAFREEN
 सैफ      SAIF                 SAIPHA             SAIF

8) Rule8: The observation of Table X gives the rule that whenever a name contains ‘फ’, the mapping of ‘फ’ becomes ‘F’ instead of ‘PH’.

                               TABLE XI
                   SAMPLE 9 (TOTAL 45 STRINGS TESTED)
 Hindi    Expected          Through            Rule Based
 String   Transliteration   Mapping            Transliteration
 ओजस्वी    OJASWI            OJASVII            OJASWI
 स्वपनिल   SWAPNIL           SVAPANILA          SWAPNIL
 श्वेता     SHWETA            SHVETAA            SHWETA
 विश्वा     VISHWA            VISHVAA            VISHWA

9) Rule9: The observation of Table XI gives the rule that whenever a name contains ‘स्’ or ‘श्’ followed by ‘व’, the mapping of ‘व’ becomes ‘W’ instead of ‘V’.

                              V. CONCLUSION
   Many authors have remarked that a rule based transliteration system is very complex to develop (see [4], [5] and [7]). However, in this paper we have presented a rule based transliteration system from Hindi to English for proper names. We have used the standard ITRANS mapping (shown in Tables I and II) for our transliteration system. We have performed experimentation with 45 strings of each type and produced nine rules for correcting the output so that it matches the expected (commonly used) spellings. We observed that several names, for example city names, train names, etc., have specialized spellings. We have created a database containing these specialized spellings. This has increased the performance of the system considerably. The transliteration system is under extensive test and some additional rules will be reported soon.

                               REFERENCES
 [1] Knight K. and Graehl J., “Machine Transliteration”, Computational Linguistics, 24(4), pp 599-612, 1998.
 [2] Al-Onaizan Y. and Knight K., “Translating Named Entities Using Monolingual and Bilingual Resources”, Proceedings of ACL 2002, pp 400-408, July 2002.
 [3] Ekbal Asif, Sudip Kumar Naskar and Sivaji Bandyopadhyay, “A Modified Joint Source-Channel Model for Transliteration”, Proceedings of ACL 2006, pp 191-198, 2006.
 [4] Lehal G.S. and Saini T.S., “A Hindi to Urdu Transliteration System”, Proceedings of ICON, pp 235-240, 2010.
 [5] Bushra and Tafseer, “Hindi to Urdu Conversion: Beyond Simple Transliteration”, Proceedings of the Conference on Language and Technology, pp 24-31, 2009.
 [6] Huang Fei, Stephan Vogel and Alex Waibel, “Extracting Named Entity Translingual Equivalence with Limited Resources”, ACM Transactions on Asian Language Information Processing (TALIP), 2(2), pp 124-129, 2003.
 [7] R. Mahesh K. Sinha, “Automated Mining of Names Using Parallel Hindi-English Corpus”, 7th Workshop on Asian Language Resources, ACL-IJCNLP 2009, pp 48-54, 2009.
 [8] http://hindilanguage.info/devanagari/
 [9] http://en.wikipedia.org/wiki/ITRANS
[10] http://www.aczoom.com/itrans
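
                     ILLUSTRATIVE SKETCH OF THE PROGRAM FLOW
   As a closing illustration of the flow described in Section IV (look-up in the special spellings database first, then mapping with correction rules), a minimal Python sketch follows. It is not the authors' Java and MySQL implementation. The function name transliterate, the in-memory dictionary standing in for the database, and its single entry are assumptions made for illustration. Only Rules 3, 5, 6, 7, 8 and 9 are approximated (Rule 1 is covered by Rule 3 here). Rules 2 and 4 are omitted, so names such as हरशद or तनमय would not yet come out as in Table IV.

```python
VIRAMA = "\u094d"  # halant: suppresses the inherent 'A' of a consonant

# Hypothetical stand-in for the special spellings database (entry is illustrative).
SPECIAL_SPELLINGS = {"दिल्ली": "DELHI"}

# Rule 6: word-initial 'ई' maps to 'I' (simplified here to every position).
INDEPENDENT = {"अ": "A", "आ": "AA", "इ": "I", "ई": "I", "उ": "U",
               "ऊ": "UU", "ए": "E", "ऐ": "AI", "ओ": "O", "औ": "AU"}
# Rule 5: the matra of 'आ' maps to 'A' instead of 'AA'.
MATRAS = {"ा": "A", "ि": "I", "ु": "U", "े": "E", "ै": "AI", "ो": "O", "ौ": "AU"}
# Rules 6 and 7: (spelling in the middle of the name, spelling at the end).
POSITIONAL = {"ी": ("EE", "I"), "ू": ("OO", "U")}
# Rule 8: 'फ' maps to 'F' instead of 'PH'. Subset of Table II, base letters only.
CONSONANTS = {"क": "K", "ग": "G", "ज": "J", "ण": "N", "त": "T", "द": "D",
              "न": "N", "प": "P", "फ": "F", "ब": "B", "भ": "BH", "म": "M",
              "य": "Y", "र": "R", "ल": "L", "व": "V", "श": "SH", "स": "S",
              "ह": "H"}


def transliterate(word, special=SPECIAL_SPELLINGS):
    """Database look-up first; otherwise map with a subset of the rules."""
    if word in special:
        return special[word]
    out = []
    last = len(word) - 1
    for i, ch in enumerate(word):
        nxt = word[i + 1] if i < last else ""
        if ch in CONSONANTS:
            # Rule 9: 'व' directly after half 'स' or half 'श' becomes 'W'.
            after_half_s = i >= 2 and word[i - 1] == VIRAMA and word[i - 2] in "सश"
            out.append("W" if ch == "व" and after_half_s else CONSONANTS[ch])
            # Inherent 'A' only when another letter follows (Rule 3 drops it
            # at the end of the name) and no matra or virama overrides it.
            if nxt and nxt != VIRAMA and nxt not in MATRAS and nxt not in POSITIONAL:
                out.append("A")
        elif ch in POSITIONAL:
            medial, final = POSITIONAL[ch]
            out.append(final if i == last else medial)
        elif ch in MATRAS:
            out.append(MATRAS[ch])
        elif ch in INDEPENDENT:
            out.append(INDEPENDENT[ch])
        elif ch != VIRAMA:
            raise ValueError(f"unmapped character: {ch!r}")
    return "".join(out)


for name in ["कपिल", "दीपक", "पूनम", "आसिफ", "श्वेता"]:
    print(transliterate(name))  # KAPIL, DEEPAK, POONAM, AASIF, SHWETA
```

   Folding the rules into the mapping pass keeps the sketch short. A closer rendering of the paper's design would presumably apply each rule as a separate correction step on the ITRANS output.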



