Structural Analysis of Bangla Sentences of Different Tenses for Automatic Bangla Machine Translator

					                                              (IJCSIS) International Journal of Computer Science and Information Security,
                                                                                             Vol. 8, No. 9, December 2010

Structural Analysis of Bangla Sentences of Different Tenses for Automatic
                       Bangla Machine Translator
                Md. Musfique Anwar, Nasrin Sultana Shume and Md. Al-Amin Bhuiyan
          Dept. of Computer Science & Engineering, Jahangirnagar University, Dhaka, Bangladesh
        Email: musfique.anwar@g m,,

Abstract
This paper addresses structural mappings of Bangla sentences of different tenses for machine translation (MT). Machine translation requires analysis, transfer and generation steps to produce target language output from a source language input. A structural representation encodes the information of Bangla sentences, and a transfer module has been designed that can generate English sentences using Context Free Grammar (CFG). The MT system generates a parse tree according to the parse rules, and a lexicon provides the properties of each word and its meaning in the target language. The MT system can be extended to paragraph translation.

Keywords: Machine Translation, Structural representation, Context Free Grammar, Parse tree, Lexicon.

1. Introduction

Machine translator refers to a computerized system responsible for producing translations from one natural language to another, with or without human assistance. It excludes computer-based translation tools, which support translators by providing access to on-line dictionaries, remote terminology databanks, transmission and reception of texts, etc. The core of MT itself is the automation of the full translation process. Machine translation (MT) means translation using computers.

To interpret any language, we first need to determine the sentence structure using grammatical rules. Parsing or, more formally, syntactic analysis, is the process of analyzing a text, made of a sequence of tokens (for example, words), to determine its grammatical structure with respect to a given formal grammar. Parsing a sentence produces a structural representation (SR) or parse tree of the sentence [1].

Analysis and generation are two major phases of machine translation. The analysis phase involves two main techniques: morphological analysis and syntactic analysis.

Morphological parsing decomposes a word into morphemes given a lexicon list, proper lexicon order and different spelling change rules [2]. That is, it incorporates the rules by which the words are analyzed. For example, in the sentence “The young girl's behavior was unladylike”, the word “unladylike” can be divided into the morphemes un (prefix: not), lady (root word: well-behaved female) and like (suffix: having the characteristics of). Morphological information of words is stored together with their syntactic and semantic information.

The purpose of syntactic analysis is to determine the structure of the input text. This structure consists of a hierarchy of phrases, the smallest of which are the basic symbols and the largest of which is the sentence. It can be described by a tree, known as a parse tree or syntax tree, with one node for each phrase. Basic symbols are represented by leaf nodes and other phrases by interior nodes. The root of the tree represents the sentence.

Syntactic analysis aims to identify the sequence of grammatical elements (e.g. article, verb, preposition) or of functional elements (e.g. subject, predicate), the grouping of grammatical elements (e.g. nominal phrases consisting of nouns, articles, adjectives and other modifiers) and the recognition of dependency relations, i.e. hierarchical relations. If we can identify the syntactic constituents of sentences, it will be easier to obtain the structural representation of the sentence [3].

Most grammar rule formalisms are based on the idea of phrase structure: strings are composed of substrings called phrases, which come in different categories. There are three types of phrases in Bangla: Noun Phrase, Adjective Phrase and Verb Phrase. Simple sentences are composed of these phrases. Complex and compound sentences are composed of simple sentences [4].

Within the early standard transformational models it is assumed that basic phrase markers are generated by phrase structure rules (PS rules) of the following sort [5]:

S → NP AUX VP
NP → ART N
VP → V NP

The PS rules given above tell us that an S (sentence) can consist of, or can be expanded as, the sequence NP (noun phrase) AUX (auxiliary verb) VP (verb phrase). The rules also indicate that NP can be expanded as ART N and that VP can be expanded as V NP.
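As an illustration of how the three PS rules above expand a sentence symbol, the following minimal Python sketch applies them recursively until only lexical categories remain. The names PS_RULES and expand are hypothetical and chosen for this example; the sketch assumes each non-terminal has exactly one expansion, which holds for these three rules but not for a full grammar.

```python
# The three phrase structure rules quoted in the text.
# Assumption for this sketch: one expansion per non-terminal.
PS_RULES = {
    "S": ["NP", "AUX", "VP"],
    "NP": ["ART", "N"],
    "VP": ["V", "NP"],
}


def expand(symbol):
    """Recursively expand a symbol until only lexical categories remain."""
    if symbol not in PS_RULES:
        return [symbol]  # lexical category such as ART, N, AUX or V
    result = []
    for child in PS_RULES[symbol]:
        result.extend(expand(child))
    return result


print(" ".join(expand("S")))  # ART N AUX V ART N
```

The printed sequence ART N AUX V ART N is the English word-category pattern of a sentence such as "The boy is reading a book".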

                                                                                     ISSN 1947-5500

This paper implements a technique to perform structural analysis of Bangla sentences of different tenses using Context Free Grammar rules.

2. Bangla Sentences Structure

In Bangla, a simple sentence is formed by an independent clause or principal clause. Example:

A complex sentence consists of one or more subordinate clauses within a principal clause [2]. For example,

A Bangla compound sentence is formed by two or more principal clauses joined by an indeclinable/conjunctive. Example:

The types of Bangla tense are given in Fig. 1:

                                             Fig. 1 Types of Bangla tense

2.1 Basic Structural Difference between Bangla and English Language
The following are the structural differences between Bangla and English:

• The basic sentence pattern in English is subject + verb + object (SVO), whereas in Bangla it is subject + object + verb (SOV). Example:
English: I (S) eat (V) rice (O)
Bangla:         (S)     (O)     (V)

• The auxiliary verb is absent in Bangla. Example:
English: I (Pronoun) am (Auxiliary verb) reading (Main verb) a (Article) book (Noun)
Bangla:          (Pronoun)         (Article)    (Noun)          (Main verb)

• A preposition is a word placed before a noun, pronoun or noun-equivalent to show its relation to any other word of the sentence [6]. In Bangla, a bivakti is placed after the noun, pronoun or noun-equivalent. Example:
English: The man sat on the chair
Bangla:                       , here ‘ ’ is the bivakti

2.2 Structural Transfer from Bangla to English
Parsing is the process of building a parse tree for an input string. We can extract the syntactic structure of a Bangla sentence using either of two approaches: i) top-down parsing, ii) bottom-up parsing.

2.2.1 Top-Down Parsing
Top-down parsing starts at the most abstract level (the level of sentences) and works down to the most concrete level (the level of words). An input sentence is derived using the context-free grammar rules by matching the terminals of the sentence. So, given an input string, we start out by assuming that it is a sentence, and then try to prove that it really is one by using the grammar rules left-to-right. That works as follows: if we want to prove that the input is of category S and we have the rule S → NP VP, then we next try to prove that the input string consists of a noun phrase followed by a verb phrase.

2.2.2 Bottom-Up Parsing
The basic idea of bottom-up parsing is to begin with the concrete data provided by the input string (that is, the words we have to parse or recognize) and try to build more abstract, higher-level information.

Example: Consider the Bangla sentence “                    ”. To perform bottom-up parsing of the sentence, the following rules of the context-free grammar are used:

<SENTENCE> → <NOUN-PHRASE> <VERB-PHRASE*>
<NOUN-PHRASE> → <CMPLX-NOUN> | <CMPLX-NOUN> <PREP-PHRASE*> | <ART> <ADJ> <NOUN> <PREP-PHRASE*>


<VERB-PHRASE> → <CMPLX-VERB> | <CMPLX-VERB> <PREP-PHRASE*> | <CMPLX-VERB> <PREP-PHRASE*> <PREP> <PRONOUN>
<CMPLX-NOUN> → <ARTICLE> <NOUN> | <NOUN> | <PRONOUN> | <NOUN> <PRONOUN> <NOUN>
<CMPLX-VERB> → <MAIN-VERB> | <MAIN-VERB> <NOUN-PHRASE*>

During the bottom-up parsing of the Bangla sentence “                  ”, we obtain the grammatical structure NOUN ARTICLE NOUN MAIN-VERB.

The syntactic categories in the resulting grammatical structure are then replaced by constituents of the same or smaller unit until a SENTENCE is obtained, as shown below:

Input sentence → NOUN ARTICLE NOUN MAIN-VERB
               → NOUN-PHRASE NOUN-PHRASE MAIN-VERB
               → NOUN-PHRASE CMPLX-VERB
               → NOUN-PHRASE VERB-PHRASE
               → SENTENCE

3. Proposed MT Model

The proposed model for structural analysis of Bangla sentences is shown in Fig. 2.

[Fig. 2: block diagram. Source Language Sentence (Bangla) → Parser (using the Context Free Grammar Rules) → Parse Tree of Bangla Sentence → Conversion → Parse Tree of English Sentence → Output Target Language Sentence (English)]

                                         Fig. 2 Block diagram of proposed MT model
3.1 Description of the Proposed Model
The proposed MT system takes a Bangla natural-language sentence as input for parsing. The stream of characters is sequentially scanned and grouped into tokens according to the lexicon. Words having a collective meaning are grouped together in the lexicon. The output of the tokenizer for the input sentence “Cheleti Boi Porche” is as follows [1] [4]:
TOKEN = (“Chele”, “Ti”, “Boi”, “Por”, “Che”).

The parser groups tokens into grammatical phrases that are used to synthesize the output. Usually, the phrases are represented by a parse tree that depicts the syntactic structure of the input.

A lexicon can be defined as a dictionary of words where each word carries some syntactic, semantic, and possibly some pragmatic information. The entries in a lexicon can be grouped by word category (nouns, verbs, prepositions and so on), with all words listed within the categories to which they belong [1] [4] [5] [7]. In our project, the lexicon contains the English meaning and part of speech of each Bangla word.

A context-free grammar (CFG) is a set of recursive rewriting rules (or productions) used to generate patterns of strings. It provides a simple and precise mechanism for describing the methods by which phrases in some natural language are built from smaller blocks, capturing the "block structure" of sentences in a natural way. Categories such as noun, verb and preposition and their respective phrases lead to a natural recursion, because a noun phrase may appear inside a verb phrase and vice versa. The most common way to represent a grammar is as a set of production rules which say how the parts of speech can be put together to make grammatical, or “well-formed”, sentences [8].

In the conversion unit, an input sentence is analyzed and a source language (SL) parse tree is produced using the bottom-up parsing methodology. Then the corresponding parse tree of the target language (TL) is produced. Each Bangla word of the input sentence is replaced with the corresponding English word from the lexicon in the target (English) parse tree to produce the target (English) language sentence.

Structural Representation (SR) is the process of finding a parse tree for a given input string. For example, the parse tree of the input sentence “                     ” and the corresponding parse tree of the English sentence “The boy drinks tea” are shown in Fig. 3.
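The tokenizing and bottom-up parsing steps described in this section can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation. The romanized lexicon entries, the VERB-SUFFIX label, the decision to skip that suffix during structural parsing, and the Bangla-order rule set (NOUN ARTICLE, and NOUN-PHRASE before MAIN-VERB, following the reduction sequence in Section 2.2.2) are all assumptions made for the example. The parser is a simple backtracking shift-reduce recognizer.

```python
# Hypothetical romanized lexicon: surface form -> syntactic category.
LEXICON = {
    "chele": "NOUN",
    "ti": "ARTICLE",
    "boi": "NOUN",
    "por": "MAIN-VERB",
    "che": "VERB-SUFFIX",  # assumed label for the tense ending
}

# Assumed Bangla-order rules (left-hand side, right-hand side).
RULES = [
    ("NOUN-PHRASE", ("NOUN", "ARTICLE")),
    ("NOUN-PHRASE", ("NOUN",)),
    ("CMPLX-VERB", ("NOUN-PHRASE", "MAIN-VERB")),
    ("VERB-PHRASE", ("CMPLX-VERB",)),
    ("SENTENCE", ("NOUN-PHRASE", "VERB-PHRASE")),
]


def tokenize(sentence):
    """Split each word into the longest lexicon entries, left to right."""
    tokens = []
    for word in sentence.lower().split():
        start = 0
        while start < len(word):
            for end in range(len(word), start, -1):
                if word[start:end] in LEXICON:
                    tokens.append(word[start:end])
                    start = end
                    break
            else:
                raise ValueError(f"no lexicon entry matches {word[start:]!r}")
    return tokens


def parse(stack, remaining):
    """Backtracking shift-reduce: return a SENTENCE tree or None."""
    if not remaining and len(stack) == 1 and stack[0][0] == "SENTENCE":
        return stack[0]
    # Try every reduction whose right-hand side matches the top of the stack.
    for lhs, rhs in RULES:
        n = len(rhs)
        if len(stack) >= n and tuple(node[0] for node in stack[-n:]) == rhs:
            tree = parse(stack[:-n] + [(lhs, stack[-n:])], remaining)
            if tree is not None:
                return tree
    # Otherwise shift the next token onto the stack.
    if remaining:
        return parse(stack + [remaining[0]], remaining[1:])
    return None


tokens = tokenize("Cheleti Boi Porche")
# Assumption: the verb suffix only marks tense, so it is skipped here.
leaves = [(LEXICON[t], t) for t in tokens if LEXICON[t] != "VERB-SUFFIX"]
tree = parse([], leaves)
print(tokens)
print(tree)
```

A greedy reducer would turn the first NOUN into a NOUN-PHRASE before seeing the ARTICLE; backtracking avoids that at the cost of exponential worst-case time, which is acceptable only for a sketch on short sentences.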
    rewrit ing rules (or productions) used to generate


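[Fig. 3: two parse trees.
Bangla tree:  S → NP VP;  NP → N ART;  VP → CV;  CV → NP MV;  NP → N
English tree: S → NP VP;  NP → ART (The) N (boy);  VP → CV;  CV → MV NP;  NP → N]

                   Fig. 3 Bangla and English parse trees of the sentence “                          ”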
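The conversion from the Bangla parse tree to the English parse tree in Fig. 3 can be sketched as a recursive tree transfer: children are reordered where the two languages differ, and each leaf word is replaced by its lexicon entry. The sketch below is an assumption-laden illustration, not the paper's code. The romanized words "chele", "ti", "cha" and "khay" are hypothetical placeholders that are not taken from the paper, and the REORDER table is inferred from the two trees in Fig. 3.

```python
# Hypothetical romanized Bangla -> English lexicon (placeholder words).
LEXICON = {"chele": "boy", "ti": "the", "cha": "tea", "khay": "drinks"}

# (parent label, Bangla child order) -> English child order, read off Fig. 3.
REORDER = {
    ("NP", ("N", "ART")): ("ART", "N"),
    ("CV", ("NP", "MV")): ("MV", "NP"),
}


def transfer(node):
    """Return the English tree for a Bangla tree of (label, content) nodes."""
    label, content = node
    if isinstance(content, str):  # leaf: translate the word
        return (label, LEXICON[content])
    children = [transfer(child) for child in content]
    order = tuple(child[0] for child in children)
    target = REORDER.get((label, order))
    if target is not None:
        # Assumes child labels are distinct within a reordered rule.
        by_label = {child[0]: child for child in children}
        children = [by_label[name] for name in target]
    return (label, children)


def leaves(node):
    """Collect the words at the leaves, left to right."""
    label, content = node
    if isinstance(content, str):
        return [content]
    return [word for child in content for word in leaves(child)]


bangla_tree = ("S", [
    ("NP", [("N", "chele"), ("ART", "ti")]),
    ("VP", [("CV", [("NP", [("N", "cha")]), ("MV", "khay")])]),
])
english_tree = transfer(bangla_tree)
print(" ".join(leaves(english_tree)).capitalize())  # The boy drinks tea
```

Keeping the reordering rules in a table separate from the lexicon means new structural differences (for example, bivakti placement) could be added without touching the traversal code.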

4. Implementation of the Proposed Model

The flow-chart of the proposed MT model is given below:

   Fig. 4 Flow-chart of the proposed MT Model

After executing the above procedure according to the flow-chart, it is possible to translate a Bangla sentence into the corresponding English sentence.

5. Experimental Results

Several experiments were conducted to justify the effectiveness of the proposed MT model. Success rate for different types of sentences is shown in Fig. 5. Fig. 6 illustrates the snapshot of the implemented

Table 1: Success rate for different types of sentences

Type of      Total no. of    Correctly performed     Success
sentences    sentences       machine translation     rate (%)
Simple       770             745                     96.75
Complex      540             517                     95.74
Compound     360             338                     93.89

   Fig. 5 Success rate for different types of sentences
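The success rates in Table 1 follow directly from the sentence counts, as the short Python check below shows. The overall figure is not reported in the paper; it is derived here from the table's own totals (1600 correct out of 1670) and is included only as an illustration. The names RESULTS and success_rate are hypothetical.

```python
# Counts taken from Table 1: type -> (total sentences, correct translations).
RESULTS = {
    "Simple": (770, 745),
    "Complex": (540, 517),
    "Compound": (360, 338),
}


def success_rate(total, correct):
    """Percentage of correctly translated sentences, to two decimals."""
    return round(100 * correct / total, 2)


rates = {kind: success_rate(t, c) for kind, (t, c) in RESULTS.items()}
total = sum(t for t, _ in RESULTS.values())
correct = sum(c for _, c in RESULTS.values())
overall = success_rate(total, correct)  # derived here, not stated in the paper

for kind, rate in rates.items():
    print(f"{kind}: {rate}%")
print(f"Overall: {overall}% ({correct}/{total})")
```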


                            Fig. 6: Sample output of the program for the complex sentence

6. Conclusion

This paper mainly focuses on the structural analysis phase, that is, how to build the parse tree of a given Bangla sentence according to a CFG. The translation process is then applied to the Source Language (SL) tree to obtain a tree with target language words (TL tree). Finally, the output sentence in the target language is extracted from this tree, and the system also indicates the type of tense. Sentences composed of idioms and phrases are beyond the scope of this project.

References

[1]    M. M. Hoque and M. M. Ali, “A Parsing Methodology for Bangla Natural Language Sentences”, Proceedings of International Conference on Computer and Information Technology (ICCIT), Dhaka, Bangladesh, pp. 277-282 (2003).
[2]    S. Dasgupta and M. Khan, “Feature Unification for Morphological Parsing in Bangla”, Proceedings of International Conference on Computer and Information Technology (ICCIT), Dhaka, Bangladesh, pp. 642-647 (2004).
[3]    K. D. Islam, M. Billah, R. Hasan and M. M. Asaduzzaman, “Syntactic Transfer and Generation of Complex-Compound Sentences for Bangla-English Machine Translation”, Proceedings of International Conference on Computer and Information Technology (ICCIT), Dhaka, Bangladesh, pp. 321-326 (2003).
[4]    L. Mehedy, N. Arifin and M. Kaykobad, “Bangla Syntax Analysis: A Comprehensive Approach”, Proceedings of International Conference on Computer and Information Technology (ICCIT), Dhaka, Bangladesh, pp. 287-293 (2003).
[5]    S. K. Chakravarty, K. Hasan and A. Alim, “A Machine Translation (MT) Approach to Translate Bangla Complex Sentences into English”, Proceedings of International Conference on Computer and Information Technology (ICCIT), Dhaka, Bangladesh, pp. 342-346 (2003).
[6]    S. A. Rahman, K. S. Mahmud, B. Roy and K. M. A. Hasan, “English to Bengali Translation Using A New Natural Language Processing Algorithm”, Proceedings of International Conference on Computer and Information Technology (ICCIT), Dhaka, Bangladesh, pp. 294-298 (2003).
[7]    S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 2nd Edition, Pearson Education publisher, New York, 2003.
[8]    M. M. Anwar, M. Z. Anwar and M. A. Bhuiyan, “Syntax Analysis and Machine Translation of Bangla Sentences”, IJCSNS International Journal of Computer Science and


       Network Security, Vol. 9, No. 8, August 2009, pp. 317–326 (2009).

Md. Musfique Anwar completed his B.Sc (Engg.) in Computer Science and Engineering from the Dept. of CSE, Jahangirnagar University, Bangladesh in 2006. He is now a Lecturer in the Dept. of CSE, Jahangirnagar University, Savar, Dhaka, Bangladesh. His research interests include Natural Language Processing, Artificial Intelligence, Image Processing, Pattern Recognition, Software Engineering and so on.

Nasrin Sultana Shume completed her B.Sc (Engg.) in Computer Science and Engineering from the Dept. of CSE, Jahangirnagar University, Bangladesh in 2006. She is now a Lecturer in the Dept. of CSE, Green University of Bangladesh, Mirpur, Dhaka, Bangladesh. Her research interests include Artificial Intelligence, Neural Networks, Image Processing, Pattern Recognition, Database and so on.

Md. Al-Amin Bhuiyan received his B.Sc (Hons) and M.Sc. in Applied Physics and Electronics from the University of Dhaka, Dhaka, Bangladesh in 1987 and 1988, respectively. He received the Dr. Eng. degree in Electrical Engineering from Osaka City University, Japan, in 2001. He completed postdoctoral research in Intelligent Systems at the National Informatics Institute, Japan. He is now a Professor in the Dept. of CSE, Jahangirnagar University, Savar, Dhaka, Bangladesh. His main research interests include Image Face Recognition, Cognitive Science, Image Processing, Computer Graphics, Pattern Recognition, Neural Networks, Human-machine Interface, Artificial Intelligence, Robotics and so on.

