Towards a Combined Attempt at Simultaneous Synset Linkage and Expansion

Aadil Amin Kak, Nazima Mehdi, Aadil A. Lawaye, Farooq A. Shiekh, Muneera Hakim
Dept. of Linguistics
University of Kashmir
{aadilaminkak,nazimamehdi,aadillawaye}@yahoo.com
{farooqahmad84,muneera.hakim}@gmail.com


Abstract

This paper proposes a method for expanding the IndoWordnet while synset
linkage is still in progress, so that expansion and linkage proceed
simultaneously. It also proposes running a parallel ‘storage tool’
alongside the linkage tool to store the extra, language-specific senses
of the synset members during linkage. Finally, it gives an overview of
the inclusion of larger units (proverbs, compound words, adjectival
phrases).

1     Introduction
The expansion of the wordnet for the ongoing project, Indradhanush, is a
step proposed for future implementation. The present paper is a proposal
to work not only on synset linkage but also, simultaneously, on the
expansion of the wordnet.
The proposals are presented in three parts.
Section 2.1 argues why the two processes, i.e. synset linkage and
wordnet expansion, should go on simultaneously.
Section 2.2 proposes a tentative idea of how this can be done.
Section 2.3 proposes the inclusion of larger units of language, which
have not been discussed so far.

2     Proposals

2.1    Simultaneous Expansion
Why should the two processes, i.e. wordnet expansion and synset linkage,
not go on together? We propose that they should, on the basis of the
following:

   1. While making common synset entries, lexicographers may notice that
      their language has associated or related concepts that are language
      specific. At that very moment they can easily note down the extra
      (language-specific) senses of the synset members. If expansion is
      undertaken later, it will need more time and effort, and some
      senses will probably be missed.
   2. The method can also reduce the confusion and problems faced by
      lexicographers, by giving them a better and unambiguous view of the
      concept at hand.
   3. The method should also act as a strong facilitator for the future
      inclusion of sister languages and dialects of the languages being
      worked on.

2.2     Method of Expansion
As to how the process would work: while linking synsets, the
lexicographers shall look for any extra sense that may be specific to
their language. While looking for ‘equality in concepts’, the
lexicographer will naturally come across (in the mind at least)
different senses and other culture-specific concepts. For this purpose a
separate interface shall be provided to the users, linked to the main
tool provided to the lexicographers. In it the lexicographers shall
enter the extra senses of the synset members (those which are language
specific and not presently available in the wordnet). The interface
shall have a facility for storing a temporary Synset ID, the Concept, a
Componential Analysis of the concept, the Synset Members, the Category
and an Example. It shall have a save option and an edit option. The
storage tool shall look as follows:

     Synset ID                Lexical Category

     Concept                  Componential Analysis

     Synset Members

     Example

         Save             Edit                 Exit

Different concepts/senses should be categorized as follows and entered
in the tool at the same time.

2.2.1    Language/Culture Specific Concept
2.2.1.1 Language Specific Senses
This category handles extra senses of the synset members. These members,
though already present in the pivot language, carry one or more extra
senses which are specific to the target language. For example:

Entry in Main Tool                              Entry in Storage Tool
Concept                  Hindi    Kashmiri      Concept                          Kashmiri
-----------------------  -------  --------      -------------------------------  ---------------
A colourless liquid      paani    aab           A colourless liquid              treesh
present in the rivers,                          specifically for drinking.
streams, oceans and
seas

The burning of the       jalnaa   dazun         Partial overcooking (burning)    vItsun
food while cooking                              of rice while cooking which
                                                doesn’t burn the rice but
                                                leaves a burning smell.

                                                Partial overcooking (burning)    vItsun, kraaDun
                                                of milk while heating it which
                                                doesn’t burn the milk but
                                                leaves a burning smell.

The example shows how readily extra senses come to mind during linkage
and how these language-specific concepts can be added.

2.2.1.2 Culture Specific Terms
The storage tool shall also be used to store culture-specific terms,
which have to be included in the wordnet at a later stage. These terms
should be tagged as culture specific, which will help in identifying
them. A common tag shall be provided for the culture-specific terms of
all the languages.

2.2.2    Names of Places in a Language
The place names of a language shall also be entered in the storage tool,
with their conceptual as well as componential analysis. The recurring
occurrence and the validity of componential analysis are dealt with
later, in a separate section (Section 2.4). A tag should be provided to
the lexicographers for the place names of a language, e.g. Srinagar,
Shopian, Pulwama, etc.

2.2.3    Flora and Fauna
The lexicographers shall use the flora and fauna sources of the area to
include in the storage tool the names of specific plant and animal
species which are not covered by the concepts of the ongoing wordnet.
It has been proposed earlier that all the culture-specific terms, the
place names and the names of the specific species have to be
transliterated (and, in some languages such as Sanskrit, coined) into
all other member languages of the IndoWordnet. It is proposed here that
a transliterator tool be incorporated in the system for transliterating
those terms into the other languages. The above-mentioned specific terms
will be stored with their phonetic sounds. The transliterator will take
the phonetics of the language-specific terms of all the languages as its
input and transliterate them into the target language. It can identify
what to transliterate either by the terms' specific tags or because the
terms are kept as a subset of the conceptual pivot of the wordnet.

2.3     Inclusion of Larger Units
In this section we also propose that the larger units of language, such
as proverbs, compound words, adjectival phrases, etc., be considered as
separate units. This is illustrated below.

2.3.1    Proverbs
Proverbs shall be considered as units different from their individual
members. These units shall be entered as separate concepts. If the
concept is already present, the proverb shall be entered as a synset
member against that concept. Proverbs which are present only in a
specific language shall be entered in the storage tool. For example:

Conceptual Pivot    Hindi (L1)           Kashmiri (L2)                   (L3)   (L4)
------------------  -------------------  ------------------------------  -----  -----
Giving two          sanp bhi maray aur   siikh ti rachun tI kabab ti.
benefits by an act  lathi bhi na TuuTe   (saving seekh (iron rod on
                                         which the kebab is roasted)
                                         as well as the kebab)

However, a proverb-to-proverb linkage may not always be available.
Consider the following example:

Conceptual Pivot   Hindi (L1)        Kashmiri (L2)             (L3)              (L4)
-----------------  ----------------  ------------------------  ----------------  ------
Moving idly here   --conceptual      bang'an dәr' natsun       --conceptual      --do--
and there          equivalent--      (lit. dancing in          equivalent--
                                     Cannabis fields)

Here we have a proverb in Kashmiri which does not appear to exist in
Hindi. In this case, instead of a proverb, Hindi will have a conceptual
equivalent.
This can be a useful means of incorporating proverbs in the framework
without considering the individual concepts of the words of which the
proverb is made.

2.3.2    Compound Words
We propose that all compound words be taken as separate conceptual
units, irrespective of the concepts of their members. If the concept of
a compound word is present in the main tool, the compound shall be
entered as a synset member; if it is language specific, it shall be
entered in the separate (storage) tool.

   - Language-specific compound words in Kashmiri, such as yar-bal (body
     of water-bank), bon-i-bagh (chinar-garden), bati-paji (lit. cooked
     rice-large basket), labi-thamb (lit. wall-wooden pillar), etc.
   - Compound words conceptually present in more than one language or in
     all other languages, such as utaar-chadau (ups and downs),
     duup-chaavun (sunshine-shade), idhar-udhar (here-there), etc.

The two groups of compound words shall be included in the wordnet as
illustrated in the tables below.

Entries in the Current Working Tool
Conceptual Pivot        Hindi            Kashmiri     L3                      L4      L5
----------------------  ---------------  -----------  ----------------------  ------  ------
Ups & downs in life     utaar-chadau,    heri-bon     Conceptual equivalent   --do--  --do--
                        duup-chaavun                  if not present
The younger maternal    chota-mamu       lokut-maam   --do--                  --do--  --do--
uncle
This place or that      udhar-idhar      hokun-yokun  --do--                  --do--  --do--
place

Language Specific Entries in the Storage Tool
Concept description               Hindi                      Kashmiri
--------------------------------  -------------------------  -----------
A wooden block used to support    --conceptual equivalent--  labi-thamb
a carpet weaver from both sides
A specific basket used to carry   --do--                     batI-pәj'
meals to the field by peasants
A pot in which food is eaten      --do--                     khenI-baanI

So, concepts denoted by compound words should be equated with
corresponding compound words in the other languages wherever possible;
where that is not possible, conceptual equivalents should be used.

2.3.3     Adjectival Phrases
The main question is how to include adjectival phrases, such as
participles and infinitives, in Wordnet Indradhanush. Should such
phrases be taken as separate units or not, given that the sense of these
words in adjectival phrases differs from their sense in other contexts?
E.g.
         udta parinda, chalti gaaDi, gaate jarney
The first word of each phrase (udta, chalti, gaate) is a verb in
ordinary contexts but acts as an adjective here. Thus the change of
category changes the concept itself. So for such units we need to give
separate concepts.
Adjectival phrases such as participles can also be language specific.
For example, wadwun bacchi (lit. crying baby / a baby who cries at the
smallest excuse), wadwun insaan (lit. crying person / a person who cries
at the smallest excuse), wadwun nab (lit. weeping sky / when it is very
cloudy).
The language-specific adjectival phrases shall be entered in the storage
tool with their concepts.
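
The routing rule that recurs through Sections 2.2 and 2.3 (enter a unit
as a synset member when its concept already exists in the main tool,
otherwise keep it in the storage tool) can be sketched roughly as below.
This is only an illustrative sketch, not the project's tool: the names
`StorageEntry`, `StorageTool` and `route_unit`, the `TEMP-` ID format,
the tag name `language-specific`, the category value `noun` and the use
of an in-memory dictionary for the main tool are all assumptions made
for the example. Only the record fields follow the list given for the
storage-tool interface.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class StorageEntry:
    """One storage-tool record; the fields mirror the interface in Section 2.2."""
    temp_synset_id: str
    concept: str
    synset_members: List[str]
    category: str
    componential_analysis: Dict[str, bool] = field(default_factory=dict)
    example: str = ""
    tags: List[str] = field(default_factory=list)  # hypothetical tag names


class StorageTool:
    """In-memory stand-in for the parallel storage tool (save and edit only)."""

    def __init__(self) -> None:
        self.entries: Dict[str, StorageEntry] = {}

    def save(self, concept: str, members: List[str], category: str,
             tags: Optional[List[str]] = None) -> StorageEntry:
        # Assumption: temporary IDs are simple running numbers.
        temp_id = f"TEMP-{len(self.entries) + 1:04d}"
        entry = StorageEntry(temp_id, concept, list(members), category,
                             tags=list(tags or []))
        self.entries[temp_id] = entry
        return entry

    def edit(self, temp_id: str, **changes) -> StorageEntry:
        entry = self.entries[temp_id]
        for name, value in changes.items():
            if not hasattr(entry, name):
                raise AttributeError(f"unknown field: {name}")
            setattr(entry, name, value)
        return entry


def route_unit(main_tool: Dict[str, Dict[str, List[str]]],
               storage: StorageTool, language: str, unit: str,
               concept: str, category: str) -> str:
    """Enter `unit` in the main tool if its concept exists there, else store it."""
    if concept in main_tool:
        main_tool[concept].setdefault(language, []).append(unit)
        return "main"
    storage.save(concept, [unit], category, tags=["language-specific"])
    return "storage"


# Toy data: concepts and words from the paper's compound-word examples;
# the category "noun" is assumed.
main_tool = {"Ups & downs in life": {"Hindi": ["utaar-chadau"]}}
storage = StorageTool()
first = route_unit(main_tool, storage, "Kashmiri", "heri-bon",
                   "Ups & downs in life", "noun")
second = route_unit(main_tool, storage, "Kashmiri", "khenI-baanI",
                    "A pot in which food is eaten", "noun")
print(first, second)  # main storage
```

Keeping the language-specific records separate from the main tool, as in
this sketch, means they can be reviewed, edited and tagged before any
later merge into the wordnet.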


2.4    Componential Analysis
It is proposed that each concept be accompanied by a componential
analysis, as this will convey the concept more clearly and more
rigorously. Whether or not to adopt a strict form of componential
analysis needs to be thought through in detail. It appears, however,
that if (a) a formal componential analysis frame is agreed upon and
(b) lexicographers are trained in it, the task will become much easier.
A formalized componential analysis should be an important tool for
removing ambiguity. Furthermore, it can make the concept itself more
formal (a set of yes-es and no-s), which is probably more machine
friendly.
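
To illustrate what "a set of yes-es and no-s" could look like in
practice, the following minimal sketch represents a componential
analysis as a mapping from feature names to yes/no values and compares
two analyses. The feature names (`liquid`, `colourless`, `for_drinking`)
and the helper functions are hypothetical, chosen only to mirror the
aab/treesh example of Section 2.2.1.1; no particular feature inventory
is proposed here.

```python
from typing import Dict, Set

# A componential analysis as feature name -> yes (True) / no (False).
Analysis = Dict[str, bool]


def distinguishing_features(a: Analysis, b: Analysis) -> Set[str]:
    """Features on which two analyses differ, or which only one specifies."""
    return {f for f in a.keys() | b.keys() if a.get(f) != b.get(f)}


def is_more_specific(specific: Analysis, general: Analysis) -> bool:
    """True if `specific` agrees with all of `general` and adds a feature."""
    agrees = all(specific.get(f) == v for f, v in general.items())
    return agrees and len(specific) > len(general)


# Hypothetical feature values for the two Kashmiri words for water.
aab = {"liquid": True, "colourless": True}
treesh = {"liquid": True, "colourless": True, "for_drinking": True}

print(distinguishing_features(aab, treesh))  # {'for_drinking'}
print(is_more_specific(treesh, aab))         # True
```

Under such a representation, an extra sense noticed during linkage shows
up as an analysis that agrees with the pivot concept but carries
additional features, which is one way the ambiguity mentioned above
could be checked mechanically.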

3     Conclusion
To conclude, the paper is basically an effort to
make the expansion of IndoWordnet less labori-
ous. It seems more logical to implement both
linkage and expansion simultaneously. The paper
is a tentative step in that direction and, with
modifications and strict formalization, can be
used as a guide for the same.

