"UML Model of Deeper Meaning Natural Language Translation System using Conceptual Dependency Based Internal Representation"
(IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 10, October 2011
http://sites.google.com/site/ijcsis/ ISSN 1947-5500

Sandhia Valsala, College of Computer Studies, AMA International University, Salmabad, Kingdom of Bahrain, email@example.com
Dr Minerva Bunagan, College of Computer Studies, AMA International University, Salmabad, Kingdom of Bahrain, firstname.lastname@example.org
Roger Reyes, College of Computer Studies, AMA International University, Salmabad, Kingdom of Bahrain, email@example.com

Abstract: Translation from one language to another involves many mechanical rules or statistical inferences. Translations based on statistical inference lack any depth or logical basis. For a deeper-meaning translation, mechanical rules alone are not sufficient. There is a need to extract suggestions from common world knowledge and cultural knowledge. These suggestions can be used to fine-tune, or possibly even reject, the candidate sentences. This research presents a software design for a translation system that will examine sentences based on the syntax rules of the natural language. It will then construct an internal representation to store this knowledge. It can then annotate and fine-tune the translation process by using the previously stored world knowledge.

Keywords: Natural language, Translation, Conceptual Dependency, Unified Modeling Language (UML)

I. Introduction

Living in an electronic age has increased international interaction among individuals and communities. Rapid and accurate translation from one natural language to another is required for communicating directly with native speakers of a foreign language.

Automated translation is desired by anyone wishing to study international subjects. There are a large number of naturally spoken languages. Some automated software systems are available that allow translation from one natural language to another. By using these systems one can translate a sentence from one natural language to another without any human translator. But these systems often fail to convey the deeper meaning of the original text in the target language.

The objective of this paper is to present a design of an automated natural language translation system from English to Urdu or Arabic. This system will use a system-internal representation for storing the deeper meaning of input sentences. This paper will also identify natural language grammar rules that can be used to construct this system.

II. Definition of Terms

a. Natural Language

Natural language is any language used by people to communicate with other people. In this paper the two natural languages selected for translation are English and Urdu. The methods described here are generally extendable to most natural languages.

b. Grammar of a Natural Language

The grammar of a language is a set of production rules (Aho et al., 2006) using meta-symbols or non-terminals and tokens (classes of words of the language). These rules can be used to determine whether a sentence is valid or invalid. Extended Backus-Naur Form (EBNF) is used to describe such grammars formally (Rizvi, 2007) (Wang, 2009).

c. Conceptual Dependency

The theory of Conceptual Dependency (CD) was developed by Schank and his fellow researchers for representing the higher-level interpretation of natural language sentences and constructs (Schank and Tesler, 1969). It is a slot-and-filler data structure that can be modeled in an object oriented programming language (Luger and Stubblefield, 1997). CD structures have been used as a means of internal representation of the meaning of sentences in several language understanding systems (Schank and Riesbeck, 1981).

III. Review of Relevant Literature

Automated translation systems from companies like Google and Microsoft use probability and statistics to predict a translation based upon previous training (Anthes, 2010). Usually they train on huge sample data sets of documents in two or more natural languages. When a sentence uses less common words, so that no previous translation exists for that group of words, such a translation system may not give accurate results.

Conceptual Dependency (CD) theory has been developed to extract underlying knowledge from natural language input (Schank and Tesler, 1969). The extracted knowledge is stored and processed in the system using strong slot-and-filler type data abstractions. The significance of CD to this research is that it describes a natural language independent semantic network that can be used to disambiguate the meaning by comparing it with internally stored common world knowledge.

Conceptual dependency theory is based on a limited number of primitive act concepts (Schank and Riesbeck, 1981). These primitive act concepts represent the essence of the meaning of an input sentence and are independent of the syntax-related peculiarities of any one natural language. The important primitive acts are summarized in Table 1.

Table 1 - Schank's Primitive Act Concepts.

Primitive Act | Description | Example
ATRANS | Transfer of an abstract relationship such as possession, ownership or control. ATRANS requires an actor, object and recipient. | give, take, buy
PTRANS | Transfer of the physical location of an object. PTRANS requires an actor, object and direction. | go, fly
PROPEL | Application of a physical force to an object. Direction, object and actor are required. | push, pull
MTRANS | Transfer of mental information between or within an animal. | tell, remember, forget
MBUILD | Construction of new information from old information. | describe, answer, imagine
ATTEND | Focus a sense on a stimulus. | listen, watch
SPEAK | Utter a sound. | say
GRASP | To hold an object. | clutch
MOVE | Movement of a body part by owner. | kick, shake
INGEST | Ingest an object. It requires an actor and object. | eat
EXPEL | To expel something from body. |

Valid combinations of the primitive acts are governed by 4 governing categories and 2 assisting categories (Schank and Tesler, 1969). These conceptual categories are like meta-rules about the primitive acts, and they dictate how the primitive acts can be connected to form networks. In Schank and Tesler's work there is an implicit English-dependent interpretation of Producer Attribute (PA) and Action Attribute (AA). In this research the interpretation of PA and AA is natural language independent. The conceptual categories are summarized in Table 2.

Table 2 - Schank's Conceptual Categories.

Governing Categories
Name | Description
PP | Picture Producer. Represents physical objects.
ACT | Action. Physical actions.
LOC | Location. A location of a conceptualization.
T | Time. Time of conceptualization.

Assisting Categories
Name | Description
PA | Producer Attribute. Attribute of a PP.
AA | Action Attribute. Attribute of an ACT.

Traditionally EBNF grammar rules are used to express a language grammar (Aho et al., 2006). Natural languages in general, and English in particular, have been a focus of grammar research in many countries (Wang, 2009). A study of the Urdu language grammar for computer based software processing has been done previously (Rizvi, 2007). The Urdu language shares many traits with Arabic and other South-Asian languages. A common script and some common vocabulary are the best known of these traits.

IV. Implementation Materials and Methods

For the purpose of designing the software, this research utilizes English as the first or source natural language and Urdu as the second or target natural language. This choice is based primarily upon the familiarity of the researchers with the languages. Another reason is that an EBNF grammar is available for these languages (Wang, 2009) (Rizvi, 2007). However, the design presented here can be equally appropriate for most natural languages. The design primarily uses UML diagram notation and can be drawn in Microsoft Visual Studio 2010 (Loton, 2010) or Oracle JDeveloper software (Miles and Hamilton, 2006).

The design is broken into two main use-case scenarios. The first use case is for the source natural language user (English). The system components identified in this use case include a tokenizer, parser, CD annotator and CD world-knowledge integrator. In this use case the working system will take an input sentence and then construct an internal representation of that sentence. The user will be returned a Reference ID (REFID) number, which is a mechanism to identify the internal representation (concept) inside the system's memory.

The second use case is for the target language user (Urdu). The user identifies an internal concept through a REFID. The system will then generate the corresponding Urdu sentence. The system components identified in this use case include the CD world-knowledge integrator, tokenizer and sentence formulator. Two sequence diagrams corresponding to the two use cases are shown in Figure 1 and Figure 2.

[Figure 2 - Sequence diagram for target natural language conversion use case.]

A discussion of the functions of the major components identified in these figures is given below.

Tokenizer

The Tokenizer component will have two functions. The first function will take a source natural language sentence as input and create a stream of tokens from it, provided the words are found in the dictionary of the language.
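The dictionary lookup performed by the first tokenizer function can be sketched in Prolog, the language used for the parser listing in this paper. This is a minimal illustration only, not the design's actual tokenizer: the predicate names `dict/2` and `tokenize/2` and the small dictionary are assumptions introduced here, and the token names simply reuse part-of-speech names from the paper's vocabulary.

```prolog
% Hypothetical dictionary for illustration: dict(Word, Token).
% Token names reuse part-of-speech names from the paper's vocabulary.
dict(he, pro_noun_subject).
dict(student, imp_noun).
dict(book, imp_noun).
dict(teaches, verb).
dict(writes, verb).
dict(the, article).

% tokenize(+Words, -Tokens): pairs every word with its token as Token-Word.
% The whole call fails if any word is missing from the dictionary.
tokenize([], []).
tokenize([Word|Words], [Token-Word|Tokens]) :-
    dict(Word, Token),
    tokenize(Words, Tokens).
```

Under these assumptions, the query `tokenize([he, writes, the, book], Tokens)` binds `Tokens` to `[pro_noun_subject-he, verb-writes, article-the, imp_noun-book]`, while `tokenize([he, sings], Tokens)` fails because `sings` is not in this small dictionary. A word listed under several parts of speech would yield several token streams on backtracking.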
Tokens can be an extension of the parts of speech of the natural language (English) or taken from the terminal symbols in the EBNF grammar. These tokens will be used in specifying the EBNF grammar rules. This function will also generate an Accepted or Rejected signal for the user. If the token stream is valid it will be passed to the Parser component. This function is shown in Figure 1.

[Figure 1 - Sequence diagram for User Input Language Processing use case.]

The second function of the tokenizer component belongs to the target natural language conversion use case. This function will take a CD primitives graph as input and return all corresponding words found in the dictionary of the target natural language. The tokenizer component can be implemented in an object oriented programming language. This function is shown in Figure 2.

Parser

The Parser component will take as input a token stream consisting of tokens from the source natural language parts of speech or grammar terminal symbols. The parser will match the token stream against all syntax rules of the source natural language. If the sentence is valid and unambiguous, one parse tree will be generated as output. If the sentence is not valid, an error message will be given as output. If the sentence is ambiguous, all parse trees will be returned to the calling component for a possible selection. The selected parse tree will be given as input to the CD Annotator component for further processing. This component is shown in context in Figure 1.

For most natural languages the parser component can be prototyped or implemented in the Prolog programming language, or it may be generated with an LR parser generator tool like YACC or Bison.

CD Annotator

The CD annotator component will take as input the parse tree generated by the parser component and create and annotate a CD graph data structure. The CD graph structure will be based upon the CD primitives listed in Table 1 and Table 2. The CD graph data structure can be implemented in an object oriented programming language. This component is shown in Figure 1.

CD World Knowledge Integrator

This component will have two main functions. First, it will add the new sentence Concept Graph to a bigger common world knowledge graph. The common world knowledge will consist of facts like “Gravity pulls matter down”, “Air is lighter than water”, etc. This knowledge will be relevant to the closed world assumption of a Faculty Room in the University. Internally this knowledge will itself be represented in CD form. Upon receiving new input this component will create links with common world knowledge already stored in the system. After integration of the new Concept Graph a Reference Identification number (REFID) will be returned to the user for later retrieval of the newly stored concept. This function is shown in Figure 1.

The second function of this component will be to receive a REFID number as input and to locate the corresponding integrated concept graph. By scanning the integrated concept graph it will generate a list of the CD primitives in use in the graph referenced by the REFID. This list will be passed to the tokenizer component, which will return target natural language word sets matching the list of CD primitives. These word sets will be used by this component to annotate the integrated concept graph with target natural language words. The annotated CD graph will be given as input to the sentence formulator component for sentence generation. This function is shown in Figure 2.

Sentence Formulator

The Sentence Formulator component will take as input the CD graph annotated with target natural language words, and it will apply the syntax rules of the target language to produce valid sentences of the target language. This component is shown in Figure 2.

Design of the Parser

This research presents a simple English parser written in the Prolog programming language (listed under "English Grammar in Prolog Programming Language"). It is based on the English grammar rules described in (Wang, 2009) and as taught in university English courses.

Conceptual Dependency Graph

In this research a CD based object oriented (OO) architecture is proposed for the internal representation of the meaning of the natural language. Each primitive concept has to be implemented as a class in an OO programming language. Most of these classes will have predefined attributes, and some implementation specific attributes will be added to them. The work done by (Schank and Tesler, 1969) provides general rules concerning the structure and meaning of such a network.

Language Dictionaries

A dictionary will have to be created for the source natural language and for the target natural language. It can be implemented as a file or a database. The dictionary will contain words from the closed world scenario (Faculty Room). For each word a part-of-speech attribute (or the corresponding EBNF non-terminal symbol name) will have to be identified. For some words there will also be mappings to primitive concepts (Table 1).

English Grammar in Prolog Programming Language

The following computer program is a source-code listing in the Prolog programming language. It describes a simple English sentence parser. It can validate or invalidate a sentence made of words in the vocabulary. For testing purposes, this parser can be used to generate sentences of a given word length according to the words in the vocabulary and the Prolog unification order. It has been tested on the SWI Prolog programming environment (http://www.swi-prolog.org).

/* **** English Sentence Grammar in Prolog */
/* Assumes a closed world assumption */
/* Faculty room in a university */
/* ****************************** */

/* In absence of a Tokenizer, hard coding of words (vocabulary) and Tokens */
p_noun('pname1').

imp_noun('student').
imp_noun('book').

pro_noun_subject('i').
pro_noun_subject('he').
pro_noun_subject('she').
pro_noun_subject('we').
pro_noun_subject('they').
pro_noun_subject('it').

pro_noun_object('me').
pro_noun_object('him').
pro_noun_object('her').
pro_noun_object('us').
pro_noun_object('them').
pro_noun_object('it').

pro_noun_possesive('his').
pro_noun_possesive('her').
pro_noun_possesive('their').
pro_noun_possesive('our').
pro_noun_possesive('your').
pro_noun_possesive('whose').

pro_noun_nominative_possesive('mine').
pro_noun_nominative_possesive('yours').
pro_noun_nominative_possesive('ours').
pro_noun_nominative_possesive('theirs').

pro_noun_indefinite('few').
pro_noun_indefinite('more').
pro_noun_indefinite('each').
pro_noun_indefinite('every').
pro_noun_indefinite('either').
pro_noun_indefinite('all').
pro_noun_indefinite('both').
pro_noun_indefinite('some').
pro_noun_indefinite('any').

pro_noun_demonstrative('this').
pro_noun_demonstrative('that').
pro_noun_demonstrative('these').
pro_noun_demonstrative('those').
pro_noun_demonstrative('such').

/* For ease of testing, only a limited number of items is defined, to reduce the number of unifications */
person('pname1').
person('student').

thing('book').

verb('sings').
verb('teaches').
verb('writes').

adjective('thick').
adjective('brilliant').

preposition('in').
preposition('on').
preposition('between').
preposition('after').

article('a').
article('an').
article('the').

/* Actual Rules */
noun(X) :- p_noun(X).
noun(X) :- imp_noun(X).

/* A subject is a subject pronoun or a noun that denotes a person */
sub_noun(X) :- pro_noun_subject(X).
sub_noun(X) :- noun(X), person(X).

obj_noun(X) :- pro_noun_object(X).
obj_noun(X) :- pro_noun_nominative_possesive(X).
obj_noun(X) :- noun(X).

subject(X) :- sub_noun(X).

object(X) :- obj_noun(X).

indirect_object(X) :- pro_noun_object(X).
indirect_object(X) :- noun(X), person(X).

determiner(X) :- article(X).
determiner(X) :- pro_noun_possesive(X).
determiner(X) :- pro_noun_indefinite(X).
determiner(X) :- pro_noun_demonstrative(X).

noun_phrase(X) :- noun(X).
/* Added clause: a single noun given as a one-element list, so that
   phrases such as [in, the, book] are accepted by preposition_phrase */
noun_phrase([X]) :- noun(X).
noun_phrase([X|Y]) :- adjective(X), listsplit(Y, H, T), T=[], noun(H).

preposition_phrase([X|Y]) :- preposition(X), listsplit(Y, H1, T1), determiner(H1), noun_phrase(T1).

object_complement(X) :- noun_phrase(X).
object_complement(X) :- preposition_phrase(X).
%% object_complement(X) :- adjective_phrase(X).

/* Breaking the head off a list */
listsplit([Head|Tail], Head, Tail).

/* Determining length of list */
listlength([], 0).
listlength([_|Y], N) :- listlength(Y, N1), N is N1 + 1.

/* Pattern1: Subject-Verb */
sentence([X|Y]) :- subject(X), listsplit(Y, Head, Tail), Tail=[], verb(Head).

/* Pattern2: Subject-Verb-Object */
sentence([X|Y]) :- subject(X), listsplit(Y, H, T), verb(H), listsplit(T, H2, T2),
    object(H2), T2=[].
sentence([X|Y]) :- subject(X), listsplit(Y, H, T), verb(H), listsplit(T, H2, T2),
    pro_noun_possesive(H2), listsplit(T2, H3, T3), object(H3), T3=[].

/* Pattern3: Subject-Verb-Indirect Object-Object */
sentence([X|Y]) :- subject(X), listsplit(Y, H, T), verb(H),
    listsplit(T, H2, T2), indirect_object(H2), listsplit(T2, H3, T3),
    object(H3), T3=[].
sentence([X|Y]) :- subject(X), listsplit(Y, H, T), verb(H),
    listsplit(T, H2, T2), indirect_object(H2), listsplit(T2, H3, T3),
    pro_noun_possesive(H3), listsplit(T3, H4, T4),
    object(H4), T4=[].

/* Pattern4: Subject-Verb-Object-Object Complement */
sentence([X|Y]) :- subject(X), listsplit(Y, H, T), verb(H), listsplit(T, H2, T2),
    object(H2), object_complement(T2).

V. Conclusion and Recommendations

A system level modular design of a software system for translation from a source natural language to a target natural language was presented. The functional behaviour of each of the major software components was also discussed.

For extending this system to other languages the following 3 additions will need to be made. First, an EBNF grammar should be made available for the new language to be integrated. Second, a system dictionary should be created for the new language as mentioned above. Third, the tokenizer, parser and sentence formulator components need to be enhanced to handle the new language. These components form the front end (user facing part) of the system. The back end remains unchanged.

For extending the scope of the system from the closed-world scenario of a faculty room to a more general translator, a universal common knowledge base can be integrated into this system design. One such universal common knowledge base is the CYC project as described in (Lenat et al., 1990).

VI. References

1. Aho, Alfred V., Lam, Monica S., Sethi, Ravi, and Ullman, Jeffrey D. (2006) Compilers: Principles, Techniques, and Tools. Addison Wesley Publishing Company, 1000pp.
2. Anthes, Gary (2010) Automated Translation of Indian Languages. Communications of the ACM, Vol 53, No. 1: 24-26.
3. Lenat, Douglas B., Guha, R. V., Pittman, Karen, Pratt, Dexter, and Shepherd, Mary (1990) Cyc: toward programs with common sense. Communications of the ACM, Volume 33, Issue 8.
4. Loton, Tony (2010) UML Software Design with Visual Studio 2010: What you need to know, and no more! CreateSpace Press, 136pp.
5. Luger, G. and Stubblefield, W. (1997) Artificial Intelligence: Structures and Strategies for Complex Problem Solving. Addison Wesley Publishing Company, 868pp.
6. Miles, Russ and Hamilton, Kim (2006) Learning UML 2.0. O’Reilly Media, 288pp.
7. Rizvi, Syed M. J. (2007) Development of Algorithms and Computational Grammar for Urdu. Doctoral Thesis, Pakistan Institute of Engineering and Applied Sciences, Nilore, Islamabad, 242pp.
8. Schank, Roger C. and Tesler, Larry (1969) A Conceptual Dependency Parser for Natural Language. Proceedings of the 1969 Conference on Computational Linguistics, Association for Computational Linguistics, Stroudsburg, PA, USA. doi:10.3115/990403.990405
9. Schank, Roger C. and Riesbeck, Christopher K., eds. (1981) Inside Computer Understanding: Five Programs plus Miniatures. Psychology Press, 400pp, http://www.questia.com Web.
10. Wang, Yingxu (2009) A Formal Syntax of Natural Languages and the Deductive Grammar. Fundamenta Informaticae - Cognitive Informatics, Cognitive Computing, and Their Denotational Mathematical Foundations (II), Vol 90, Issue 4.

AUTHOR’S PROFILE:

Ms. Sandhia Valsala is presently associated with AMA International University, Bahrain as Asst Professor in the Computer Science Department. She holds a Master’s degree in Computer Applications from Bharatiyar University, Coimbatore and is currently pursuing her PhD at Karpagam University, Coimbatore.

Dr Minerva Bunagan is presently associated with AMA International University, Bahrain as the Dean of the College of Computer Studies. She holds a PhD in Education from Cagayan State University, Tuguegarao City, Philippines. She also holds a Master of Science in Information Technology from Saint Paul University Philippines.

Roger Reyes is presently associated with AMA International University, Bahrain as Asst Professor in the Computer Science Department. He holds a master’s degree from AMA Computer University, Quezon City, Philippines.