

									CRLP (Center for Research on Language Processing)                     NUM




                 MONGOLIAN LANGUAGE
                     RESOURCES
                                       Altangerel Chagnaa
                                       PhD in Computer Science

                       Center for Research on Language Processing
                                         [CRLP]
                             National University of Mongolia
                                          (NUM)

                                    altangerel@num.edu.mn



                       ADD-6 2010, December 6 / 9, Phuket, Thailand




                                     OUTLINE
         Brief introduction to Mongolian Language
         About NLP and CL in Mongolia
         Available Mongolian Language Resources
           –   Written corpus
           –   Speech corpus
         Ongoing research and projects
           –   Text To Speech (TTS)
           –   Machine Translation (MT)
         WordNet for Mongolian
         Conclusion




                 MONGOLIAN LANGUAGE
         Traditionally classified in the Altaic language family
          (a disputed grouping)
           –   Highly agglutinative
           –   Typologically similar to Turkish, Korean, Japanese and so on

         About 8 million speakers in several Asian countries:
           –   Mongolia: 2.7 million
           –   Inner Mongolia (in China): 3.38 million
           –   Afghanistan: ?
           –   Russia: 0.5 million








                 MONGOLIAN LANGUAGE
         Two writing systems
           –   Cyrillic: used daily and supported on computers and
               the internet
                   Borrowed from Russian in 1942
                   Most research work, including ours, targets
                    this script

           –   Classical (old) script: taught in the school curriculum
               and used in a few newspapers








                        SOME CYRILLIC
                       CHARACTERISTICS

         The Cyrillic writing system works much like
          Russian and English
           –   Words are separated by spaces
           –   Sentences end with punctuation marks such as
                .      ;       :       ?       !    etc.
           –   Writing direction is left to right and top to
               bottom
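
Because words are separated by spaces and sentences end with the listed punctuation marks, a first-pass sentence splitter and tokenizer can be sketched with regular expressions. This is only a minimal illustration, not the CRLP tools: the function names and the sample text are assumptions, and abbreviations or decimal numbers would need extra rules.

```python
import re

def split_sentences(text):
    """Split text after each sentence-final mark (. ; : ? !) followed by whitespace."""
    parts = re.split(r"(?<=[.;:?!])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Return words and punctuation marks as separate tokens.

    In Python 3, \\w matches Unicode letters, so Cyrillic words are kept whole.
    """
    return re.findall(r"\w+|[^\w\s]", sentence)

# Illustrative sample text (assumed, not taken from the corpus).
text = "Энэ морь. Тэр юу вэ? Сайн байна!"
for sentence in split_sentences(text):
    print(tokenize(sentence))
```

Keeping punctuation as separate tokens makes later steps such as POS tagging simpler, since each mark can receive its own tag.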








                                   NLP and CL
         Mongolian language study is still conducted in the
          traditional way

         Linguists do not use computers for their research work

         The few existing research projects are run by computer scientists
          who have limited knowledge of linguistics

         Shortage of human resources (HR) in Mongolian language processing
          and computational Mongolian studies








                                            CRLP
         Center for Research on Language
          Processing
           –   Established in 2007
                   At National University of Mongolia

                   Staff of 8: one professor (computer scientist),
                    one PhD in NLP (2008, Korea), two researchers
                    (computer scientists), two linguists, and
                    two assistants

         First NLP research center in Mongolia







       AVAILABLE RESOURCES
       FOR MONGOLIAN





                        WRITTEN CORPUS
         Center for Research on Language Processing
          (CRLP), National University of Mongolia
          (NUM)
           –   Joined the PAN Localization project in 2007


         Written corpus for Mongolian
           –   Project duration: 2007 - 2009
           –   5 million words
           –   POS tagged
           –   Lexicon development based on that corpus
           –   Development of related tools for building and
               analyzing the corpus





                   CORPUS TEXT STYLES
         We analyzed around 100 text styles of Mongolian

         Text selection criteria
           –   Common and public usage
           –   Well-written styles and formats

         The following styles were chosen for the corpus
           –   Press
           –   Literature
           –   Law




National University of Mongolia                        School of Information Technology




              TEXT COLLECTION [2]

        Corpus words by domain

        Domain                          Total words      Distinct words
        Literature                        1,012,779              78,972
        Law                                 577,708              15,235
        Publish                           2,460,225             118,601
        Newspaper “Unen Sonin”              949,558              61,125
        Total                             5,000,270             192,061

        [Pie chart: Mongolian Corpus. Literature 20%, Law 12%, Publish 49%, Unen Sonin 19%]
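
The percentages in the pie chart follow directly from the total word counts in the table. The short sketch below recomputes them; the variable and function names are illustrative only.

```python
# Total word counts per domain, copied from the TEXT COLLECTION table.
total_words = {
    "Literature": 1_012_779,
    "Law": 577_708,
    "Publish": 2_460_225,
    "Unen Sonin": 949_558,
}

corpus_size = sum(total_words.values())

def share(domain):
    """Return the domain's share of the corpus as a whole-number percentage."""
    return round(100 * total_words[domain] / corpus_size)

for domain in total_words:
    print(f"{domain}: {share(domain)}%")
```

Note that the distinct-word total in the table (192,061) is smaller than the sum of the per-domain distinct counts, presumably because many words occur in more than one domain.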
 ALRN (Asian Language Resource Network) Workshop 2007, March 1 / 2, Akihabara Daibiru, Tokyo




                                   POS Tagset

         We designed a POS Tagset for the
          corpus

         Two tagsets:
           –   High level (noun, verb, etc)
           –   Low level (inflectional suffixes)








                          High-Level Tagset
         The high-level tagset is similar to English tags such
          as noun, verb, adverb, etc.








                           Low-Level Tagset
         The low-level tagset consists of tags for inflectional suffixes
          such as cases, verb tenses, comparatives, etc.








                     USING HIGH and LOW
         Currently, around 160 combination tags have been created
           –    Most of them are tags for noun and verb inflections
           –    Tag length is 1 to 5 characters
         Some examples of the tags created while tagging the
          corpus

           Tag       Meaning                                     Mongolian    English
           N         Noun                                        морь         horse
           NB        Noun Ablative                               мориноос     from a horse
           NGHB      Noun Genitive Special-possessive Ablative   орныхоос     from someone’s country

         In the tag N, only the high-level tag for Noun is used
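
The example tags suggest that a combination tag is read letter by letter: one high-level letter followed by zero or more low-level suffix letters. The sketch below decodes tags under that assumption. Only the four letters visible in the example table are included, so the dictionaries are far from the full tagset, and `decode_tag` is a hypothetical helper, not part of the CRLP tools.

```python
# Letters taken from the example table; the real tagset has about 160
# combination tags, so these dictionaries are deliberately incomplete.
HIGH_LEVEL = {"N": "Noun"}
LOW_LEVEL = {"B": "Ablative", "G": "Genitive", "H": "Special-possessive"}

def decode_tag(tag):
    """Expand a combination tag into its high- and low-level parts.

    Assumes the first letter is the high-level tag and every following
    letter is one low-level (inflectional suffix) tag.
    """
    head, suffixes = tag[0], tag[1:]
    parts = [HIGH_LEVEL[head]]
    parts.extend(LOW_LEVEL[letter] for letter in suffixes)
    return " ".join(parts)

for tag in ("N", "NB", "NGHB"):
    print(tag, "->", decode_tag(tag))
```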








         In the tag NB, the high-level tag N (Noun) is combined with the
          low-level tag B (aBlative case)








                                      LEXICON

         10,000-word lexicon
           –   High-frequency words from our corpus
           –   Covers 80% of our corpus (around 4 million words)
           –   Manually tagged
           –   Used to tag the whole corpus
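
The coverage figure (a small set of high-frequency words accounting for most running words) can be measured as sketched below. The token list is a toy stand-in, not corpus data, and `coverage` is an illustrative helper.

```python
from collections import Counter

def coverage(tokens, lexicon_size):
    """Fraction of all tokens covered by the `lexicon_size` most frequent words."""
    counts = Counter(tokens)
    covered = sum(freq for _, freq in counts.most_common(lexicon_size))
    return covered / len(tokens)

# Toy token list standing in for the corpus (not real corpus data).
tokens = "a b a c a b d a e b".split()
print(coverage(tokens, 2))  # the two most frequent words cover 7 of 10 tokens
```

On the real corpus the same calculation would be run with `lexicon_size=10_000` over all 5 million tokens.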








                      Computer Terminology

         Government projects
           –   Online dictionary (free)
                   2005 – 2006
                   7800 words of ICT
                   Mongolian and English description
                   URL: www.itdic.edu.mn
           –   Computer Terminology Standard
                   2009 - 2010
                   8000 words of ICT
                   Mongolian and English description





                    TOOLS FOR WRITTEN
                         CORPUS
         We have developed several tools for building the
          corpus
           –   Text segmentation
                   Syllabifier
                   Tokenizer
                   Sentence splitter/segmenter
           –   Cleaning tools
                   Dictionary-based spell checker
                   Document file and character encoding checker and converter
                   Hyphenated-word merging tool
           –   Text annotator
                   Annotator for text structure such as (TEI) header and body
                    parts, paragraphs, title, publisher, etc.
                   Manual POS tagger (user-friendly GUI) and bigram POS
                    tagger
                   Trigram HMM POS tagger (trained on 5 million tagged words)
                   Mongolian XML-based concordancer
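
To illustrate what a concordancer does, here is a minimal keyword-in-context (KWIC) sketch. It works on a plain token list with English toy data; the actual CRLP concordancer is XML based and has a GUI, so this shows only the core idea, and the function name and output format are assumptions.

```python
def concordance(tokens, keyword, width=2):
    """Return keyword-in-context lines: `width` tokens on each side of every hit."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines

# Toy token list (not corpus data).
tokens = "the horse ran and the other horse slept".split()
for line in concordance(tokens, "horse"):
    print(line)
```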





                    Manual POS Tagger (1)







                    Manual POS Tagger (2)






                Mongolian Concordancer(1)








                Mongolian Concordancer(2)








                SPEECH CORPUS and SRS
         PAN Localization project
           –   2007 – 2009
           –   Developed by MUST (Mongolian University of
               Science & Technology)
         Currently
           –   Selected from our 5-million-word corpus
                   2,500 isolated words
                   Recorded by 5 male and 5 female speakers
           –   Built with the HTK toolkit
           –   Isolated-word recognition accuracy is 95%




                    TTS for MONGOLIAN
         Funded by PAN Localization


         2005 – 2006 (13-month project)
         Developed by Infocon Co Ltd
         Objective:
           –   To develop a Mongolian text-to-speech (TTS) converter and a
               simple human-computer interface suitable for visually
               impaired people
         Outputs
           –   Mongolian TTS converter
           –   Mongolian character recognition tool
           –   TTS converter software package for visually impaired people
           –   A user manual in Mongolian





      ONGOING RESEARCH AND
      PROJECTS




                    TTS for MONGOLIAN
        Funded by ITU and DBCDE (Australian Government)

        Schedule: June – December 2010 (Ongoing)
        Developers: NECTEC and CRLP

        Objectives:
          –   To build a Mongolian HMM-based TTS engine for 2 platforms: MS Windows
              and Linux
          –   To make Mongolian TTS compatible with screen readers for the blind
          –   To conduct usability testing of the TTS and screen readers on the 2
              platforms with blind users
          –   To organize ICT literacy training for the blind using a screen reader with
              the TTS engine






               Machine Translation(MT) for
                       Mongolian
         English to Mongolian Online Translation
          System (EMOTS) project

         Funded by the Mongolian Government
         Schedule: April – Dec 2010, 8 months

         Objective:
           –   Start machine translation for Mongolian
           –   Beginner- and elementary-level translation

         Our approach
           –   Rule based
           –   We do not have enough parallel corpora for
               statistical MT




                         EMOTS Outputs(1)
         English-to-Mongolian rule-based MT engine
           –   English analyzer/parser
           –   Transfer module
           –   Generation module
                   Syntactic generator
                   Dictionary-based Mongolian morphological analyzer
                    and generator
           –   English-to-Mongolian word sense disambiguator
                   Lesk algorithm
                   Princeton WordNet 3.0
           –   Document parsers
                   e.g. MS Word, MS PowerPoint, web pages, PDF
           –   Dictionary creation and cleaning tools
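
The disambiguator is based on the Lesk algorithm, which picks the sense whose dictionary gloss overlaps most with the words around the ambiguous word. The sketch below shows the simplified Lesk variant on a hand-written two-sense inventory. The glosses, sense labels, and stop-word list are assumptions for illustration; they are not taken from WordNet 3.0, and how EMOTS implements the algorithm is not stated in the slides.

```python
def simplified_lesk(word, context, sense_glosses, stop_words=frozenset()):
    """Pick the sense whose gloss shares the most words with the context.

    `sense_glosses` maps a sense label to its gloss text. On a tie the
    sense listed first wins.
    """
    context_words = {w.lower() for w in context} - stop_words - {word}
    best_sense, best_overlap = None, -1
    for sense, gloss in sense_glosses.items():
        gloss_words = set(gloss.lower().split()) - stop_words
        overlap = len(context_words & gloss_words)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

# Hypothetical two-sense inventory for "bank" (glosses written for illustration).
BANK = {
    "bank#finance": "an institution that accepts deposits and lends money",
    "bank#river": "sloping land beside a body of water such as a river",
}
STOP = frozenset({"a", "an", "the", "of", "and", "that", "as", "such", "to", "in", "on"})

sentence = "he sat on the bank of the river and watched the water".split()
print(simplified_lesk("bank", sentence, BANK, STOP))
```

In a translation setting, the chosen sense would then select which Mongolian dictionary translation to use for the English word.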




                         EMOTS Outputs(2)
         ~4,000 English-to-Mongolian syntactic transfer rules
         ~300 Mongolian syntactic generation rules
         English-to-Mongolian dictionary
           –   ~50,000 headwords
           –   ~80,000 translations
           –   Typed by hand
         ~42,000 Mongolian headwords for morphological analysis
         Mongolian thesaurus consisting of ~40,000 words
         Other tools
           –   Transfer rule creator
           –   Syntactic rule creator
           –   Spell checker
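
A syntactic transfer rule can be thought of as a reordering of the children of a parse-tree node, for example turning English verb-object order into the object-verb order of Mongolian. The sketch below applies two such rules to a toy English tree. The rule format, node labels, and function names are hypothetical; the format of the actual EMOTS rules is not given in the slides, and lexical transfer and morphological generation are left out.

```python
# A tree node is (label, children); a leaf is (label, word).
# Hypothetical transfer rules: for a node label and the labels of its
# children, give the new order of the children as indices.
TRANSFER_RULES = {
    ("VP", ("V", "NP")): (1, 0),   # verb-object -> object-verb
    ("PP", ("P", "NP")): (1, 0),   # preposition -> postposition
}

def transfer(node):
    """Recursively reorder children according to TRANSFER_RULES."""
    label, body = node
    if isinstance(body, str):          # leaf: nothing to reorder
        return node
    children = [transfer(child) for child in body]
    key = (label, tuple(child[0] for child in children))
    order = TRANSFER_RULES.get(key, range(len(children)))
    return (label, [children[i] for i in order])

def leaves(node):
    """Collect the words of a tree from left to right."""
    label, body = node
    if isinstance(body, str):
        return [body]
    return [word for child in body for word in leaves(child)]

english = ("S", [("NP", "I"),
                 ("VP", [("V", "see"), ("NP", "horse")])])
print(leaves(transfer(english)))
```

The output keeps the English words but in subject-object-verb order; a generation module would then replace the words and add the Mongolian suffixes.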





                    Automatic translation (1)








                     Automatic translation (2)
         Users can suggest better sentence and word
          translations.
           –   Helps to improve translation quality
           –   Helps to create a parallel corpus








                 User Assisted Translation

         [Screenshot callouts: “Check the user assisted”, “Suffixes”,
          “Choose translation”, “Morphology generated sentence”]







                                     Thesaurus







               Transfer rule creation tool


         [Screenshot callouts: “Tree bank”, “English Tree”, “English Context
          Free Rule”, “Order to Mongolian”, “Transferred Tree”]








      WORDNET FOR MONGOLIAN






                    WordNet for Mongolian
         There are two existing efforts

         Asian WordNet, TCLLab, Thailand
           –   3,384 words in total
                   Noun              868
                   Verb              917
                   Adjective         1,326
                   Adverb            273

         Asian research centre
           –   Mongolian Lexical Semantic Network
                   Pattern
                   10k words
                   Semi-automatic creation
                   Browser tool








               Mongolian Lexical Semantic
                       Network(1)
         A model of the Mongolian lexical semantic
          network
           –   A network of ~10k Mongolian nouns

         A methodology for effectively creating
          a Mongolian lexical semantic network

         Browser and editor tools for creating the Mongolian
          lexical semantic network
           –   Manual editor
           –   Semiautomatic creation tool
           –   Visualizer




               Mongolian Lexical Semantic
                       Network(2)

         Applications of Mongolian lexical
          semantic network in some systems
           –   Information retrieval
           –   Document clustering, Classification


         Publication of research papers
           –   1 international conference paper (6 pages),
               ISBN: 978-89-88678-18-3






              Manual editor for Mongolian
                         LSN



         Network in a tree structure:
          Shows the hierarchical structure of the LSN as a tree. Active
          field in which the user can move nodes between nodes, add a new
          node directly or from the vocabulary base, and remove a node
          from the network.

         Detail fields, editing section:
          The head word, ID, sense numbers, and uigur fields of the
          currently selected node are shown in these fields. In adding or
          editing mode the user can change them.

         Filter:
          Used to filter the vocabulary base on a certain pattern. For
          example, the figure shows the filter set to „хали“, which shows
          all the entries starting with „хали“.

         Vocabulary or word sense base:
          Lists word senses in the entry table; the user can view the
          whole list or set a filter on it.








                    Semiautomatic tool for
                    Mongolian LSN building
         [Screenshot callout: lists all the clusters, each via its most
          characteristic feature or genus term]








            Clustering result of Mongolian
                      10k words
         We extracted each entry’s features from the first sentence
          of its definition in the Tsevel dictionary (Mongolian
          Thesaurus)
           –   Parsing and lexical analysis
           –   Removal of some stop words

         Clustering algorithm
           –   CBC (Clustering By Committee), implemented in MATLAB
         Result of clustering
           –   11,468 Mongolian nouns
           –   1,202 clusters/committees
           –   Of these, 220 clusters contain 2 words; the largest
               cluster contains 548 words
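
The feature extraction step can be sketched as follows: take the first sentence of a definition, drop stop words, and compare entries by the words that remain. This is not the clustering algorithm itself, only the input it works on. The English definitions, the stop-word list, and the choice of cosine similarity over raw counts are assumptions for illustration; the slides do not say how features were weighted.

```python
import math
from collections import Counter

STOP_WORDS = {"a", "an", "the", "of", "that", "is", "for", "with", "and"}

def features(definition):
    """Bag-of-words features from the first sentence of a definition."""
    first_sentence = definition.split(".")[0]
    words = first_sentence.lower().split()
    return Counter(w for w in words if w not in STOP_WORDS)

def cosine(a, b):
    """Cosine similarity between two sparse feature vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical English definitions standing in for dictionary entries.
entries = {
    "horse": "A large animal that is used for riding. It eats grass.",
    "camel": "A large animal with a hump that is used for riding and carrying.",
    "river": "A natural stream of water. It flows to a lake or sea.",
}
vectors = {word: features(text) for word, text in entries.items()}
print(cosine(vectors["horse"], vectors["camel"]))
print(cosine(vectors["horse"], vectors["river"]))
```

Entries whose definitions share a genus term such as "animal" end up with similar feature vectors, which is what lets a clustering algorithm group them.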





                   Clustering result excerpt

         [Bar chart: number of clusters (vertical axis, 0 to 250) for each
          cluster size in words (horizontal axis, 2 to 39)]





                              Visualization(1)
         Viewing the network as a plain tree is not appropriate
          because it takes too much computational power and
          screen space.

         We use the Prefuse visualization library, which is written
          in Java

         The TreeMap in the Prefuse library can visualize a large
          amount of hierarchical data as rectangles, with the
          hierarchy shown as nested blocks.
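
The idea behind a treemap can be shown with the simple slice-and-dice layout: each node gets a rectangle, and its children split that rectangle in proportion to their sizes, alternating between horizontal and vertical cuts at each level. The sketch below is written in Python for brevity and is not Prefuse code; Prefuse's own layout algorithm may differ, and the tree and names here are made up for illustration.

```python
def weight(node):
    """Size of a leaf, or total size of all leaves under a branch."""
    name, body = node
    if isinstance(body, list):
        return sum(weight(child) for child in body)
    return body

def treemap(node, x, y, w, h, depth=0, out=None):
    """Slice-and-dice treemap layout.

    `node` is (name, size) for a leaf or (name, [children]) for a branch.
    Returns a list of (name, x, y, width, height) rectangles.
    """
    if out is None:
        out = []
    name, body = node
    out.append((name, x, y, w, h))
    if isinstance(body, list):
        total = weight(node)
        offset = 0.0
        for child in body:
            frac = weight(child) / total
            if depth % 2 == 0:     # even depth: split the width
                treemap(child, x + offset * w, y, frac * w, h, depth + 1, out)
            else:                  # odd depth: split the height
                treemap(child, x, y + offset * h, w, frac * h, depth + 1, out)
            offset += frac
    return out

# Made-up hierarchy: leaf sizes could be, for example, word counts.
tree = ("entity", [("animal", [("horse", 1), ("camel", 1)]), ("plant", 2)])
for rect in treemap(tree, 0, 0, 100, 100):
    print(rect)
```

Because every node occupies space proportional to its size, thousands of nodes fit on one screen, which a conventional tree view cannot do.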




                              Visualization(2)




                   Higher level nodes in semiautomatically created Mongolian LSN






                              Visualization(3)




                    Higher level nodes in manually created Mongolian LSN






                                               HR
         HR is crucial for us because there is a shortage of NLP
          and CL professionals
         NLP schools are very helpful for our research
          work and for HR development in Mongolia
           –   Summer school of Asian language processing
                   2006, in Lahore, Pakistan
           –   ADD School
                   2006-2010, Thailand
           –   These are the only schools our staff have attended








              CONCLUSION and FUTURE
                      PLAN
        Mongolian is still underdeveloped in the computer and
         internet environment
        Near-term objectives of CRLP:
          –   Create a national corpus for Mongolian
          –   Improve the quality of EMOTS components such as WSD, transfer and generation
          –   Collect a parallel corpus for statistical machine translation
          –   Use WordNet for cross-lingual word sense disambiguation and other tasks

        Work on Mongolian language processing and NLP needs to be
         extended
        Researchers and staff need continuous training
        Increase HR (graduate study at higher education institutions)
        Actively participate in activities such as workshops and conferences
         to share problems and achievements





                               Thank you

                                        Q&A



