(IJCSIS) International Journal of Computer Science and Information Security,
Vol. 10, No. 8, August 2012
Rule Based Hindi to English Transliteration System
for Proper Names
Monika Bhargava #1, M. Kumar *2, Sujoy Das #3
#1 M.Tech Scholar, CSE Department, SIRT, Bhopal, India
*2 Professor, CSE Department, SIRT, Bhopal, India
#3 Associate Professor, Department of Computer Application, MANIT, Bhopal, India
1 monika_bhrgv@yahoo.co.in
2 prof.mkumar@gmail.com
3 sujdas@gmail.com
Abstract- Some Cross Language Information Retrieval (CLIR) systems use a bilingual dictionary to translate the user query from one language to another. A problem arises when a query term is not available in the bilingual dictionary. Such words are called Out of Vocabulary (OOV) words and should be transliterated during the translation process. OOV words are mainly proper nouns, named entities, and technical terms. We have developed a rule based transliteration system from Hindi to English script. We have also created a database of specialized spellings, e.g. some city names, person names, etc., which has considerably improved the performance of our system.

Keywords- CLIR, OOV, Transliteration

I. INTRODUCTION
In the past 20 years the area of Information Retrieval (IR) has grown well beyond its primary goals of indexing text and searching for useful documents in a collection. Nowadays, research in IR includes modelling, document classification and categorization, data visualization, filtering, etc. The Web is becoming a universal repository of human knowledge and culture, which has allowed unprecedented sharing of ideas and information on a scale never seen before. The Web is now seen as a publishing medium accessible to everybody. Web content is growing very rapidly and contains information written in many languages. Often a Web user needs information that is written in an unfamiliar language but wishes to get it in his or her native language. This is possible through a process called Cross Language Information Retrieval. Several methods are used to convert text from one language to another. Machine translation systems are available and widely used for many language pairs. Bilingual dictionaries are also frequently used to convert text from one language to another. A major problem arises when a word of the text is not available in the bilingual dictionary. Such words are called Out of Vocabulary (OOV) words and should be transliterated.

Transliteration is the task of transcribing a word or text from one writing system into another writing system such that the pronunciation of the word remains the same and a person reading the transcribed word can read it as in the original language. In other words, transliteration converts a text into the customary orthography of another script. It is different from translation, which conveys the meaning of the input text, e.g. book is translated as किताब but transliterated as बुक. OOV words are problematic in Cross Lingual Information Retrieval: OOV words such as named entities and technical terms are a common source of error in CLIR.

Among OOV words, proper nouns pose a major problem in transliteration. This is because a proper noun (such as the name of a person) is written by different people with different spellings. This research has developed a rule based Hindi-English transliteration system, especially for proper nouns, with a fair degree of accuracy.

II. LITERATURE SURVEY
The problem of transliteration has been studied by a number of researchers during the last decade. Knight and Graehl [1] use five probability distributions at various phases of transliteration for the English-Katakana language pair (Katakana is one of the writing systems of Japanese). Al-Onaizan and Knight [2] have studied a transliteration system from Arabic to English writing, which uses an existing named entity recognition system. Ekbal et al. [3] have considered a Bengali to English transliteration scheme and used a supervised training set to obtain a direct orthographic mapping. Lehal and Saini [4] have developed a Hindi to Urdu transliteration system by improving on the work of Bushra and Tafseer [5]. Lehal and Saini have claimed an accuracy of 99.46% when Hindi Unicode text is transliterated to Urdu.

Huang et al. [6] have developed a system which extracts Hindi-English named entity pairs through alignment of a parallel corpus. Here, Chinese-English pairs are first extracted using dynamic programming string matching. This model is then adapted for Hindi-English named entity pairs. Sinha [7] has developed a simple yet powerful method for mining Hindi-English names from a parallel text corpus. The Hindi text written in Devanagari is first converted to IITK-Roman form, which is a direct representation of the UTF-8 or ISCII-8 coding scheme; an accuracy of nearly 93% is claimed.

In this paper, an effort is made to develop a rule based transliteration scheme for proper names for the Hindi-English language pair. The system is under extensive experimentation and testing.
18 http://sites.google.com/site/ijcsis/
ISSN 1947-5500
III. OVERVIEW OF HINDI AND ENGLISH SCRIPTS
Hindi, an official language of India, is an Indo-Aryan language with about 487 million speakers. Hindi is written in the Devanagari script, which uses 52 symbols to represent 10 vowels, 40 consonants and 2 modifiers. The vowels are written in two forms, i.e. the independent form and the dependent form. The dependent form is also known as matraa. The former is used when the vowel letter appears alone at the beginning of a word or immediately follows another vowel. The latter is used when the vowel follows a consonant [8].

English is the most widely used language in the world. Approximately 375 million people speak English. It has been referred to as a ‘world language’. English speakers have many different accents, which often signal the speaker’s native dialect or language. English belongs to the West Germanic branch of the Indo-European family. English has 21 consonants and 5 vowels.

The Indian languages TRANSliteration (ITRANS) is an ASCII transliteration scheme for Indic scripts, particularly for the Devanagari script [9]. It is a pre-processor that converts English-encoded text into various Indian language scripts and uses 7-bit ASCII encoding schemes (see [10]).

Mapping from Hindi to English
There is no one to one correspondence between the Hindi and English scripts. Tables I and II show the ITRANS mapping from the source language (Hindi) to the target language (English). We have used these mappings to transliterate proper names from Hindi to English.

TABLE I
MAPPING OF VOWELS FROM HINDI TO ENGLISH
S.No   Hindi Vowel                          English Vowel
       Dependent Form   Independent Form
1      -                अ                   A
2      ा                आ                   AA
3      ि                इ                   I
4      ी                ई                   II
5      ु                उ                   U
6      ू                ऊ                   UU
7      ृ                ऋ                   RRI
8      े                ए                   E
9      ै                ऐ                   AI
10     ो                ओ                   O
11     ौ                औ                   AU
12     ं                अं                  AM
13     ः                अः                  AH

TABLE II
MAPPING OF MOST CONSONANTS FROM HINDI TO ENGLISH
S.No   Hindi   English
1      क       KA
2      ख       KHA
3      ग       GA
4      घ       GHA
5      ङ       NA
6      च       CHA
7      छ       CHHA
8      ज       JA
9      झ       JHA
10     ञ       NA
11     ट       TA
12     ठ       THA
13     ड       DA
14     ढ       DHA
15     ण       NA
16     त       TA
17     थ       THA
18     द       DA
19     ध       DHA
20     न       NA
21     प       PA
22     फ       PHA
23     ब       BA
24     भ       BHA
25     म       MA
26     य       YA
27     र       RA
28     ल       LA
29     व       VA OR WA
30     श       SHA
31     ष       SHA OR SHHA
32     स       SA
33     ह       HA
34     क्ष      KSHA
35     ज्ञ      GYA
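
As a rough illustration of how Tables I and II can be applied, the following Python sketch maps a Devanagari string character by character and produces the raw form that the later tables report under "Through Mapping". It is only a sketch under stated assumptions, not the authors' Java implementation: the names VOWELS, MATRAS, SIGNS, CONSONANTS and map_itrans are hypothetical, VA and SHA are picked where Table II lists alternatives, and the conjunct ज्ञ (GYA) is not special-cased.

```python
# Independent vowels (Table I).
VOWELS = {"अ": "A", "आ": "AA", "इ": "I", "ई": "II", "उ": "U", "ऊ": "UU",
          "ऋ": "RRI", "ए": "E", "ऐ": "AI", "ओ": "O", "औ": "AU"}

# Dependent vowel signs (matraas), written as escapes to keep them unambiguous.
MATRAS = {"\u093e": "AA", "\u093f": "I", "\u0940": "II", "\u0941": "U",
          "\u0942": "UU", "\u0943": "RRI", "\u0947": "E", "\u0948": "AI",
          "\u094b": "O", "\u094c": "AU"}

# Anusvara and visarga (rows 12 and 13 of Table I).
SIGNS = {"\u0902": "M", "\u0903": "H"}

# Consonant stems: the Table II value without its inherent 'A'.
CONSONANTS = {
    "क": "K", "ख": "KH", "ग": "G", "घ": "GH", "ङ": "N",
    "च": "CH", "छ": "CHH", "ज": "J", "झ": "JH", "ञ": "N",
    "ट": "T", "ठ": "TH", "ड": "D", "ढ": "DH", "ण": "N",
    "त": "T", "थ": "TH", "द": "D", "ध": "DH", "न": "N",
    "प": "P", "फ": "PH", "ब": "B", "भ": "BH", "म": "M",
    "य": "Y", "र": "R", "ल": "L", "व": "V",   # assumption: VA rather than WA
    "श": "SH", "ष": "SH",                      # assumption: SHA rather than SHHA
    "स": "S", "ह": "H",
}

HALANT = "\u094d"  # suppresses the inherent 'A' of the preceding consonant


def map_itrans(word):
    """Return the raw table-driven transliteration of a Devanagari word."""
    out = []
    for i, ch in enumerate(word):
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            # The inherent 'A' is kept unless a matraa or halant follows.
            if nxt not in MATRAS and nxt != HALANT:
                out.append("A")
        elif ch in MATRAS:
            out.append(MATRAS[ch])
        elif ch in VOWELS:
            out.append(VOWELS[ch])
        elif ch in SIGNS:
            out.append(SIGNS[ch])
        # The halant itself and any unknown character contribute nothing.
    return "".join(out)


print(map_itrans("पलक"))  # PALAKA
```

The output still carries the trailing inherent 'A' and the doubled long vowels, which is why a separate rule stage is needed to reach the commonly used spelling.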
IV. EXPERIMENTAL SETUP
We have developed a rule based transliteration system using the Java SE Development Kit (JDK), version 6, with MySQL Server 5.5 as the database. The program flow and system architecture of the Hindi to English rule based transliterator are shown in Fig. 1. The Hindi string is first searched in the database created for specialized spellings used in proper names. If a match is found in the database, its transliteration equivalent is produced. If the string is not found in the database, it is transliterated using the mapping tables and the nine rules given below.

[Figure: Hindi Word -> Database Look Up (Special Spellings Database) -> Transliteration (Mapping Tables, Rules) -> English Output]
Fig. 1. Program Flow of the Transliteration system

Observations and Rules Creation
Sample results of our experimentation are shown in Tables III-XI. Each table lists sample Hindi strings, the expected English transliteration (commonly used English spelling), the transliteration obtained through the ITRANS mappings, and the result produced by our transliteration system.

TABLE III
SAMPLE 1 (TOTAL 45 STRINGS TESTED)
Hindi String   Expected Transliteration   Through Mapping   Rule Based Transliteration
पलक            PALAK                      PALAKA            PALAK
अभय            ABHAY                      ABHAYA            ABHAY
भरत            BHARAT                     BHARATA           BHARAT
करण            KARAN                      KARANA            KARAN

1) Rule 1: The observation of Table III gives the rule that if the proper noun is 3 characters long and contains no vowel sign (matraa), then the ‘A’ is removed from the last position.

TABLE IV
SAMPLE 2 (TOTAL 45 STRINGS TESTED)
Hindi String   Expected Transliteration   Through Mapping   Rule Based Transliteration
हरशद           HARSHAD                    HARASHADA         HARSHAD
तनमय           TANMAY                     TANAMAYA          TANMAY
अरनव           ARNAV                      ARANAVA           ARNAV
अकबर           AKBAR                      AKABARA           AKBAR

2) Rule 2: The observation of Table IV gives the rule that if the proper noun is 4 characters long and contains no vowel sign (matraa), then the ‘A’ is removed from the second and the last position.

TABLE V
SAMPLE 3 (TOTAL 45 STRINGS TESTED)
Hindi String   Expected Transliteration   Through Mapping   Rule Based Transliteration
कपिल           KAPIL                      KAPILA            KAPIL
अभिनव          ABHINAV                    ABHINAVA          ABHINAV
विपिन          VIPIN                      VIPINA            VIPIN
मुकुल          MUKUL                      MUKULA            MUKUL

3) Rule 3: The observation of Table V gives the rule that if the proper noun ends with a consonant, then the ‘A’ should be removed from the last position in the English spelling.

TABLE VI
SAMPLE 4 (TOTAL 45 STRINGS TESTED)
Hindi String   Expected Transliteration   Through Mapping   Rule Based Transliteration
विमला          VIMLA                      VIMALAA           VIMLA
पदमा           PADMA                      PADAMAA           PADMA
बुशरा          BUSHRA                     BUSHARAA          BUSHRA
बलदेव          BALDEV                     BALADEVA          BALDEV

4) Rule 4: The observation of Table VI gives the rule that if two consonants occur in succession in a proper noun, the latter consonant is followed by a vowel sign, and the index of the first consonant is greater than 1, then the ‘A’ is removed from the first consonant during transliteration.

TABLE VII
SAMPLE 5 (TOTAL 45 STRINGS TESTED)
Hindi String   Expected Transliteration   Through Mapping   Rule Based Transliteration
आसिफ           AASIF                      AASIPHA           AASIF
आमना           AAMNA                      AAMANAA           AAMNA
हिमानी         HIMANI                     HIMAANII          HIMANI
गरिमा          GARIMA                     GARIMAA           GARIMA

5) Rule 5: The observation of Table VII gives the rule that if the proper noun begins with ‘आ’ it is replaced with ‘AA’, and if its
diacritical form ‘ा’ occurs in the middle or at the end, then it is replaced with ‘A’ during transliteration.

TABLE VIII
SAMPLE 6 (TOTAL 45 STRINGS TESTED)
Hindi String   Expected Transliteration   Through Mapping   Rule Based Transliteration
ईश्वर          ISHWAR                     IISHWARA          ISHWAR
करीना          KAREENA                    KARIINAA          KAREENA
दीपक           DEEPAK                     DIIPAKA           DEEPAK
मानसी          MANSI                      MAANASII          MANSI

6) Rule 6: The observation of Table VIII gives the rule that if the proper noun begins with ‘ई’ or ends with its diacritical form ‘ी’, then it is replaced with ‘I’, and if the diacritical form occurs in the middle, then it is replaced with ‘EE’ instead of ‘II’ during transliteration.

TABLE IX
SAMPLE 7 (TOTAL 45 STRINGS TESTED)
Hindi String   Expected Transliteration   Through Mapping   Rule Based Transliteration
पूनम           POONAM                     PUUNAMA           POONAM
अपूर्वा        APOORVA                    APUURVAA          APOORVA
चारू           CHARU                      CHAARUU           CHARU
कुहू           KUHU                       KUHUU             KUHU

7) Rule 7: The observation of Table IX gives the rule that if the diacritical form of ऊ (ू) occurs in the middle of the proper noun, then it is replaced with ‘OO’, and if it occurs at the end of the name, it is replaced with ‘U’ instead of ‘UU’ during transliteration.

TABLE X
SAMPLE 8 (TOTAL 45 STRINGS TESTED)
Hindi String   Expected Transliteration   Through Mapping   Rule Based Transliteration
आफताब          AAFTAB                     AAPHATAAB         AAFTAB
फाल्गुनी       FALGUNI / PHALGUNI         PHAALGUNII        FALGUNI
आफरीन          AAFREEN                    AAPHARIINA        AAFREEN
सैफ            SAIF                       SAIPHA            SAIF

8) Rule 8: The observation of Table X gives the rule that whenever a name contains ‘फ’, the mapping of ‘फ’ becomes ‘F’ instead of ‘PH’.

TABLE XI
SAMPLE 9 (TOTAL 45 STRINGS TESTED)
Hindi String   Expected Transliteration   Through Mapping   Rule Based Transliteration
ओजस्वी         OJASWI                     OJASVII           OJASWI
स्वपनिल        SWAPNIL                    SVAPANILA         SWAPNIL
श्वेता         SHWETA                     SHVETAA           SHWETA
विश्वा         VISHWA                     VISHVAA           VISHWA

9) Rule 9: The observation of Table XI gives the rule that whenever a name contains ‘स्’ or ‘श्’ followed by ‘व’, the mapping of ‘व’ becomes ‘W’ instead of ‘V’.

V. CONCLUSION
Many authors have remarked that a rule based transliteration system is very complex to develop (see [4], [5] and [7]). However, in this paper we have presented a rule based transliteration system from Hindi to English for proper names. We have used the standard ITRANS mapping (shown in Tables I and II) for our transliteration system. We experimented with 45 strings of each type and produced nine rules for correcting the mapped output so that it matches the expected (commonly used) spellings. We observed that several names, for example city names, train names, etc., have specialized spellings. We have created a database containing these specialized spellings. This has increased the performance of the system considerably. The transliteration system is under extensive testing and some additional rules will be reported soon.

REFERENCES
[1] Knight K. and Graehl J., “Machine Transliteration”, Computational Linguistics, 24(4), pp 599-612, 1998.
[2] Al-Onaizan Y. and Knight K., “Translating Named Entities Using Monolingual and Bilingual Resources”, Proceedings of ACL 2002, pp 400-408, July 2002.
[3] Ekbal Asif, Sudip Kumar Naskar and Sivaji Bandyopadhyay, “A Modified Joint Source-Channel Model for Transliteration”, Proceedings of ACL 2006, pp 191-198, 2006.
[4] Lehal G.S. and Saini T.S., “A Hindi to Urdu Transliteration System”, Proceedings of ICON, pp 235-240, 2010.
[5] Bushra and Tafseer, “Hindi to Urdu Conversion: Beyond Simple Transliteration”, Proceedings of the Conference on Language and Technology, pp 24-31, 2009.
[6] Huang Fei, Stephan Vogel, and Alex Waibel, “Extracting Named Entity Translingual Equivalence with Limited Resources”, ACM Transactions on Asian Language Information Processing (TALIP), 2(2), pp 124-129, 2003.
[7] R. Mahesh K. Sinha, “Automated Mining Of Names Using Parallel Hindi-English Corpus”, 7th Workshop on Asian Language Resources, ACL-IJCNLP 2009, pp 48-54, 2009.
[8] http://hindilanguage.info/devanagari/
[9] http://en.wikipedia.org/wiki/ITRANS
[10] http://www.aczoom.com/itrans
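
The program flow of Fig. 1 and several of the rules in Section IV can be sketched in Python as below. This is a minimal sketch under stated assumptions rather than the authors' Java and MySQL implementation: apply_rules, transliterate, SPECIAL_SPELLINGS and THROUGH_MAPPING are hypothetical names, the single database entry is illustrative only, and the mapper is a stand-in dictionary holding "Through Mapping" values copied from the sample tables (the Devanagari spellings are reconstructed from those Roman forms). Only Rules 3 and 5 to 9 are covered; Rules 2 and 4 need consonant positions and are left out.

```python
def apply_rules(hindi, mapped):
    """Post-process a raw ITRANS-style string using Rules 3 and 5 to 9."""
    s = mapped
    # Rule 8: 'PH' (from फ) becomes 'F'.
    s = s.replace("PH", "F")
    # Rule 9: व after a half स or half श becomes 'W' instead of 'V'.
    s = s.replace("SHV", "SHW").replace("SV", "SW")
    # Rule 5: a leading आ stays 'AA'; the matraa elsewhere becomes 'A'.
    keep = 2 if hindi.startswith("आ") else 0
    s = s[:keep] + s[keep:].replace("AA", "A")
    # Rule 6: 'II' at the beginning or end becomes 'I', in the middle 'EE'.
    if s.startswith("II"):
        s = "I" + s[2:]
    if s.endswith("II"):
        s = s[:-2] + "I"
    s = s.replace("II", "EE")
    # Rule 7: 'UU' at the end becomes 'U', in the middle 'OO'.
    if s.endswith("UU"):
        s = s[:-2] + "U"
    keep = 2 if s.startswith("UU") else 0  # a leading ऊ is left untouched
    s = s[:keep] + s[keep:].replace("UU", "OO")
    # Rule 3: if the Hindi word ends with a consonant, drop the final 'A'.
    if hindi and "\u0915" <= hindi[-1] <= "\u0939" and s.endswith("A"):
        s = s[:-1]
    return s


def transliterate(hindi, mapper, special_spellings):
    """Fig. 1 flow: database look up first, then mapping followed by rules."""
    if hindi in special_spellings:
        return special_spellings[hindi]
    return apply_rules(hindi, mapper(hindi))


# Illustrative entry only; the contents of the authors' database are not given.
SPECIAL_SPELLINGS = {"दिल्ली": "DELHI"}

# Stand-in for the mapping stage, using values from Tables V, VIII, IX, X and XI.
THROUGH_MAPPING = {
    "कपिल": "KAPILA",
    "करीना": "KARIINAA",
    "पूनम": "PUUNAMA",
    "सैफ": "SAIPHA",
    "श्वेता": "SHVETAA",
}

for name in THROUGH_MAPPING:
    print(name, transliterate(name, THROUGH_MAPPING.__getitem__, SPECIAL_SPELLINGS))
```

Operating on the mapped Roman string keeps each rule a one-line substitution, at the cost of not knowing where each consonant sits in the original word. That is why Rule 4 cases such as MAANASII to MANSI are not reproduced by this sketch.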