

									     Extracting loanwords from Mongolian corpora and producing a

                       Japanese-Mongolian bilingual dictionary

     Badam-Osor Khaltar                          Atsushi Fujii                      Tetsuya Ishikawa
  Graduate School of Library,            Graduate School of Library,         The Historiographical Institute
Information and Media Studies          Information and Media Studies            The University of Tokyo
    University of Tsukuba                   University of Tsukuba           3-1 Hongo 7-chome, Bunkyo-ku
1-2 Kasuga Tsukuba, 305-8550           1-2 Kasuga Tsukuba, 305-8550                 Tokyo, 133-0033
            Japan                                   Japan                                Japan
  khab23@slis.tsukuba.ac.jp                fujii@slis.tsukuba.ac.jp            ishikawa@hi.u-tokyo.ac.jp

                     Abstract                            targeting various languages.
    This paper proposes methods for extracting              In this paper, we focus on extracting loanwords in
    loanwords from Cyrillic Mongolian corpora            Mongolian. The Mongolian language is divided into
    and producing a Japanese–Mongolian                   Traditional Mongolian, written using the Mongolian
    bilingual dictionary. We extract loanwords           alphabet, and Modern Mongolian, written using the
    from Mongolian corpora using our own                 Cyrillic alphabet. We focused solely on Modern
    handcrafted rules. To complement the                 Mongolian, and use the word “Mongolian” to refer
    rule-based extraction, we also extract words         to Modern Mongolian in this paper.
    in Mongolian corpora that are phonetically              There are two major problems in extracting
    similar to Japanese Katakana words as                loanwords from Mongolian corpora.
    loanwords. In addition, we map the                      The first problem is that Mongolian uses the
    extracted loanwords to Japanese words and            Cyrillic alphabet to represent both conventional
    produce a bilingual dictionary. We propose a         words and loanwords, and so the automatic
    stemming method for Mongolian to extract             extraction of loanwords is difficult. This feature
    loanwords correctly. We verify the                   provides a salient contrast to Japanese, where the
    effectiveness of our methods experimentally.         Katakana alphabet is mainly used for loanwords and
                                                         proper nouns, but not used for conventional words.
1   Introduction                                            The second problem is that content words, such as
Reflecting the rapid growth in science and               nouns and verbs, are inflected in sentences in
technology, new words and technical terms are being      Mongolian. Each sentence in Mongolian is
progressively created, and these words and terms are     segmented on a phrase-by-phrase basis. A phrase
often transliterated when imported as loanwords in       consists of a content word and one or more suffixes,
another language.                                        such as postpositional particles. Because loanwords
   Loanwords are often not included in dictionaries,     are content words, to extract loanwords
and decrease the quality of natural language             correctly, we have to identify the original form using
processing,      information    retrieval,    machine    stemming.
translation, and speech recognition. At the same time,      In this paper, we propose methods for extracting
compiling dictionaries is expensive, because it relies   loanwords from Cyrillic Mongolian and producing a
on human introspection and supervision. Thus, a          Japanese–Mongolian bilingual dictionary. We also
number of automatic methods have been proposed to        propose a stemming method to identify the original
extract loanwords and their translations from corpora,   forms of content words in Mongolian phrases.
2   Related work                                          of the extracted loanwords also corresponded to a
To the best of our knowledge, no attempt has been         Japanese word during the extraction process, a
made to extract loanwords and their translations          Japanese–Korean bilingual dictionary was produced
targeting Mongolian. Thus, we will discuss existing       in a single framework.
methods targeting other languages.                           However, a number of open questions remain
   In Korean, both loanwords and conventional             from Fujii et al.’s research. First, their stemming
words are spelled out using the Korean alphabet,          method can only be used for Korean. Second, their
called Hangul. Thus, the automatic extraction of          accuracy in extracting loanwords was low, and thus,
loanwords in Korean is difficult, as it is in             an additional extraction method was required. Third,
Mongolian. Existing methods that are used to extract      they did not report on the accuracy of extracting
loanwords from Korean corpora (Myaeng and Jeong,          translations, and finally, because they used Dynamic
1999; Oh and Choi, 2001) use the phonetic                 Programming (DP) matching for computing the
differences between conventional Korean words and         phonetic similarities between Korean and Japanese
loanwords. However, these methods require                 words, the computational cost was prohibitive.
manually tagged training corpora, and are expensive.         In an attempt to extract Chinese–English
   A number of corpus-based methods are used to           translations from corpora, Lam et al. (2004)
extract bilingual lexicons (Fung and McKeown,             proposed a similar method to Fujii et al. (2004).
1996; Smadja, 1996). These methods use statistics         However, they searched the Web for
obtained from a parallel or comparable bilingual          Chinese–English bilingual comparable corpora, and
corpus, and extract word or phrase pairs that are         matched named entities in each language corpus if
strongly associated with each other. However, these       they were similar to each other. Thus, Lam et al.’s
methods cannot be applied to a language pair where        method cannot be used for a language pair where
a large parallel or comparable corpus is not available,   comparable corpora do not exist. In contrast, using
such as Mongolian and Japanese.                           Fujii et al.’s (2004) method, the Katakana dictionary
   Fujii et al. (2004) proposed a method that does not    and a Korean corpus can be independent.
require tagged corpora or parallel corpora to extract        In addition, Lam et al.’s method requires
loanwords and their translations. They used a             Chinese–English named entity pairs to train the
monolingual corpus in Korean and a dictionary             similarity computation. Because the accuracy of
consisting of Japanese Katakana words. They               extracting named entities was not reported, it is not
assumed that loanwords in multiple countries              clear to what extent this method is effective in
corresponding to the same source word are                 extracting loanwords from corpora.
phonetically similar. For example, the English word
“system” has been imported into Korean, Mongolian,        3   Methodology
and Japanese. In these languages, the romanized           3.1 Overview
words are “siseutem”, “sistem”, and “shisutemu”,          In view of the discussion outlined in Section 2, we
respectively.                                             enhanced the method proposed by Fujii et al. (2004)
   It is often the case that new terms have been          for our purpose. Figure 1 shows the method that we
imported into multiple languages simultaneously,          used to extract loanwords from a Mongolian corpus
because the source words are usually influential          and to produce a Japanese–Mongolian bilingual
across cultures. It is feasible that a large number of    dictionary. Although the basis of our method is
loanwords in Korean can also be loanwords in              similar to that used by Fujii et al. (2004),
Japanese. Additionally, Katakana words can be             “Stemming”, “Extracting loanwords based on rules”,
extracted from Japanese corpora with a high               and “N-gram retrieval” are introduced in this paper.
accuracy. Thus, Fujii et al. (2004) extracted the            First, we perform stemming on a Mongolian
loanwords in Korean corpora that were phonetically        corpus to segment phrases into a content word and
similar to Japanese Katakana words. Because each          one or more suffixes.
   [Figure 1 is a flow diagram. Boxes on the Mongolian side: Mongolian corpus; Extracting candidate loanwords; Extracting loanwords based on rules; Romanization; Computing phonetic similarity. Boxes on the Japanese side: Katakana dictionary; Romanization; N-gram retrieval. Edge label: High Similarity. Outputs: Mongolian loanword dictionary; Japanese-Mongolian bilingual dictionary.]

                                 Figure 1: Overview of our extraction method.
   Second, we discard segmented content words if                             Type                            Example
they are in an existing dictionary, and extract the                (a) No inflection.            ном + ын → номын
remaining words as candidate loanwords.                                                          Book + Genitive Case
   Third, we use our own handcrafted rules to extract              (b) Vowel elimination.        ажил +аас+ аа→ ажлаасаа
loanwords from the candidate loanwords. While the                                                Work + Ablative Case +Reflexive
rule-based method can extract loanwords with a high                (c) Vowel insertion.          ах + д → ахад
accuracy, a number of loanwords cannot be extracted                                              Brother + Dative Case
using predefined rules.                                            (d) Consonant insertion.      байшин + ийн→ байшингийн
   Fourth, as performed by Fujii et al. (2004), we use                                           Building + Genitive Case
a Japanese Katakana dictionary and extract a                       (e) The letter “ь” is         сургууль+ аас→ сургуулиас
candidate loanword that is phonetically similar to a               converted to “и”, and         School + Ablative Case
Katakana word as a loanword. We romanize the                       the vowel is eliminated.
candidate loanwords that were not extracted using
                                                                    Figure 2: Inflection types of nouns in Mongolian.
the rules. We also romanize all words in the
Katakana dictionary.                                              the inflection types of content words in phrases. In
   However, unlike Fujii et al. (2004), we use                    phrase (a), there is no inflection in the content word
N-gram retrieval to limit the number of Katakana                  “ном (book)” concatenated with the suffix “ын
words that are similar to the candidate loanwords.                (genitive case)”.
Then, we compute the phonetic similarities between                   However, in phrases (b)–(e) in Figure 2, the
each candidate loanword and each retrieved                        content words are inflected. Loanwords are also
Katakana word using DP matching, and select a pair                inflected in all of these types, except for phrase (b).
whose score is above a predefined threshold. As a                 Thus, we have to identify the original form of a
result, we can extract loanwords in Mongolian and                 content word using stemming. While most
their translations in Japanese simultaneously.                    loanwords are nouns, a number of loanwords can
   Finally, to identify Japanese translations for the             also be verbs. In this paper, we propose a stemming
loanwords extracted using the rules defined in the                method for nouns. Figure 3 shows our stemming
third step above, we perform N-gram retrieval and                 method. We will explain our stemming method
DP matching.                                                      further, based on Figure 3.
   We will elaborate further on each step in Sections                First, we consult a “Suffix dictionary” and
3.2–3.7.                                                          perform backward partial matching to determine
3.2 Stemming                                                      whether or not one or more suffixes are concatenated
A phrase in Mongolian consists of a content word                  at the end of a target phrase.
and one or more suffixes. A content word can                         Second, if a suffix is detected, we use a “Suffix
potentially be inflected in a phrase. Figure 2 shows              segmentation rule” to segment the suffix and extract
                                                                    Table 1: Entries of the suffix dictionary.
              Suffix dictionary    Suffix segmentation rule
                                                                          Case                   Suffix
                                                                   Genitive         н, ы, ын, ны, ий, ийн, ний
   phrase     detect a suffix in     segment a suffix              Accusative       ыг, ийг, г
              the phrase             and extract a noun            Dative           д, т
                                                                   Ablative         аас (иас), оос (иос), ээс, өөс
   noun     No   check if the last two characters of the           Instrumental     аар (иар), оор (иор), ээр, өөр
                 noun are both consonants                          Cooperative      тай, той, тэй
                    Yes                                            Reflexive        аа (иа), оо (ио), ээ, өө
                 insert a vowel        Vowel insertion rule        Plural           ууд (иуд), үүд (иүд)
 Figure 3: Overview of our noun stemming method.
                                                                       Suffix         Noun phrase              Noun
the noun. The inflection type in phrases (c)–(e) in                                (a) Ээжийн              ээж
Figure 2 is also determined.                                        ийн            mother’s                mother
   Third, we investigate whether or not the vowel                   Genitive       (b) Хараагийн           Хараа
elimination in phrase (b) in Figure 2 occurred in the                              Haraa’s (river name)    Haraa
extracted noun. Because the vowel elimination                  Figure 4: Examples of the suffix segmentation rule.
occurs only in the last vowel of a noun, we check the
last two characters of the extracted noun. If both of         deferent rule independently. The underlined suffixes
the characters are consonants, the eliminated vowel           are segmented in each phrase, respectively. In phrase
is inserted using a “Vowel insertion rule” and the            (a), there is no inflection, and the suffix is easily
noun is converted into its original form.                     segmented. However, in phrase (b), a consonant
   Existing Mongolian stemming methods (Ehara et              insertion has occurred. Thus, both the inserted
al., 2004; Sanduijav et al., 2005) use noun                   consonant, “г”, and the suffix have to be removed.
dictionaries. Because we intend to extract loanwords             The vowel insertion rule consists of 12 rules. To
that are not in existing dictionaries, the above              insert an eliminated vowel and extract the original
methods cannot be used. Noun dictionaries have to             form of the noun, we check the last two characters of
be updated as new words are created.                          a target noun. If both of these are consonants, we
   Our stemming method does not require a noun                determine that a vowel was eliminated.
dictionary. Instead, we manually produced a suffix               However, a number of nouns end with two
dictionary, suffix segmentation rule, and vowel               consonants inherently, and therefore, we referred to a
insertion rule. However, once these resources are             textbook on Mongolian grammar (Bayarmaa, 2002)
produced, almost no further compilation is required.          to produce 12 rules to determine when to insert a
   The suffix dictionary consists of 37 suffixes that         vowel between two consecutive consonants.
can concatenate with nouns. These suffixes are                   For example, if any of “м”, “г”, “л”, “б”, “в”, or
postpositional particles. Table 1 shows the dictionary        “р” are at the end of a noun, a vowel is inserted.
entries, in which the inflection forms of the                 However, if any of “ц”, “ж”, “з”, “с”, “д”, “т”, “ш”,
postpositional particles are shown in parentheses.            “ч”, or “х” are the second to last consonant in a noun,
   The suffix segmentation rule consists of 173 rules.        a vowel is not inserted.
We show examples of these rules in Figure 4. Even                The Mongolian vowel harmony rule is a
if suffixes are identical in their phrases, the               phonological rule in which female vowels and male
segmentation rules can be different, depending on             vowels are prohibited from occurring in a single
the counterpart noun.                                         word together (with the exception of proper nouns).
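The suffix detection and the consonant check of Section 3.2 can be sketched as follows. This is only an illustration under stated assumptions: `SUFFIXES` is a hypothetical four-entry subset of Table 1, `CONSONANTS` is an assumed consonant inventory, and the function names are ours. The 173 segmentation rules and the 12 vowel insertion rules are not listed in this paper, so the sketch only flags a stem that may need a vowel and does not insert one.

```python
# Hypothetical subset of the suffix dictionary in Table 1 (not all 37 entries).
SUFFIXES = ["ийн", "ын", "аас", "аа"]

# Assumed consonant inventory for Cyrillic Mongolian, used for the two-consonant check.
CONSONANTS = set("бвгджзклмнпрстфхцчшщ")


def strip_suffixes(phrase):
    """Backward partial matching: repeatedly remove the longest suffix found at the end."""
    stem, found = phrase, []
    while True:
        for suffix in sorted(SUFFIXES, key=len, reverse=True):
            if stem.endswith(suffix) and len(stem) > len(suffix):
                stem = stem[:-len(suffix)]
                found.insert(0, suffix)  # keep suffixes in surface order
                break
        else:
            # No suffix matched in this pass, so the remainder is the extracted noun.
            return stem, found


def may_need_vowel(stem):
    """True if the last two characters are both consonants (vowel elimination suspected)."""
    return len(stem) >= 2 and all(ch in CONSONANTS for ch in stem[-2:])


if __name__ == "__main__":
    for phrase in ["номын", "ажлаасаа"]:
        stem, suffixes = strip_suffixes(phrase)
        print(phrase, stem, suffixes, may_need_vowel(stem))
```

For the phrases of types (a) and (b) in Figure 2, the sketch returns the nouns "ном" and "ажл"; the second is flagged for vowel insertion, which would restore "ажил" once the insertion rules are applied.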
   In Figure 4, the suffix “ийн” matches both the             We used this rule to determine which vowel should
noun phrases (a) and (b) by backward partial                  be inserted. The appropriate vowel is determined by
matching. However, each phrase is segmented by a              the first vowel of the first syllable in the target noun.
For example, if there are “а” and “у” in the first        (f) A word beginning with the consonant “р”.
syllable, the vowel “а” is inserted between the last               In a modern Mongolian dictionary (Ozawa,
two consonants.                                                 2000), there are 49 words beginning with “р”,
3.3 Extracting candidate loanwords                              of which only four words are conventional
After collecting nouns using our stemming method,               Mongolian words. Therefore, a word beginning
we discard the conventional Mongolian nouns. We                 with “р” is probably a loanword.
discard nouns defined in a noun dictionary                (g) A word ending with “<consonant> + и”.
(Sanduijav et al., 2005), which includes 1,926 nouns.              We discovered this rule empirically.
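Rules (a)–(g) of Section 3.4 can be expressed as simple string tests. The following is a minimal sketch, not the implementation used in our experiments. The consonant inventory and the male and female vowel sets are assumptions (the paper does not enumerate them), and the function name `matched_rules` is illustrative.

```python
# Assumed character classes; the paper does not list them explicitly.
CONSONANTS = set("бвгджзклмнпрстфхцчшщ")
MALE_VOWELS = set("аоу")     # assumption: basic male vowels
FEMALE_VOWELS = set("эөү")   # assumption: basic female vowels


def matched_rules(word):
    """Return the labels of the Section 3.4 rules that the word satisfies."""
    w = word.lower()
    chars = set(w)
    rules = []
    if chars & set("кпфщ"):                                   # (a) foreign consonants
        rules.append("a")
    if chars & MALE_VOWELS and chars & FEMALE_VOWELS:         # (b) vowel harmony violated
        rules.append("b")
    if len(w) >= 2 and w[0] in CONSONANTS and w[1] in CONSONANTS:
        rules.append("c")                                     # (c) begins with two consonants
    if len(w) >= 2 and w[-2] in "пбтцчзш" and w[-1] in CONSONANTS:
        rules.append("d")                                     # (d) particular final cluster
    if w.startswith("в"):                                     # (e) begins with "в"
        rules.append("e")
    if w.startswith("р"):                                     # (f) begins with "р"
        rules.append("f")
    if len(w) >= 2 and w[-2] in CONSONANTS and w[-1] == "и":  # (g) ends with consonant + "и"
        rules.append("g")
    return rules


if __name__ == "__main__":
    for word in ["механизм", "лаборатор", "ном"]:
        print(word, matched_rules(word))
```

A word is extracted as a loanword when the returned list is not empty. Under these assumptions "механизм" matches rule (d), whereas "лаборатор" matches no rule and would have to be recovered by the similarity-based step.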
We also discard proper nouns and abbreviations. The       3.5 Romanization
first characters of proper nouns, such as “Эрдэнэбат      We manually aligned each Mongolian Cyrillic
(Erdenebat)”, and all the characters of abbreviations,    alphabet to its Roman representation1.
such as “ЦШНИ (Nuclear research centre)”, are                In Japanese, the Hepburn and Kunrei systems are
written using capital letters in Mongolian. Thus, we      commonly used for romanization purposes. We used
discard words that are written using capital              the Hepburn system, because its representation is
characters, except those occurring at the beginning of    similar to that used in Mongolian, compared to the
sentences. In addition, because “ө” and “ү” are not       Kunrei system.
used to spell out Western languages, words including         However, we adapted 11 Mongolian romanization
those characters are also discarded.                      expressions to the Japanese Hepburn romanization.
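The filtering in Section 3.3 can be sketched as a single predicate. This is an illustration under assumptions: `known_nouns` stands in for the noun dictionary, the flag `sentence_initial` is a hypothetical way to pass sentence position, and the function name is ours.

```python
def is_candidate(noun, known_nouns, sentence_initial=False):
    """Return True if a stemmed noun survives the filters of Section 3.3."""
    lowered = noun.lower()
    # Discard conventional nouns that are defined in the noun dictionary.
    if lowered in known_nouns:
        return False
    # Discard proper nouns and abbreviations (capital letters), except at sentence start.
    if not sentence_initial and any(ch.isupper() for ch in noun):
        return False
    # Discard words containing letters that are not used to spell Western languages.
    if "ө" in lowered or "ү" in lowered:
        return False
    return True


if __name__ == "__main__":
    dictionary = {"ном"}  # toy stand-in for the 1,926-noun dictionary
    for noun in ["ном", "Эрдэнэбат", "ЦШНИ", "механизм"]:
        print(noun, is_candidate(noun, dictionary))
```

Only "механизм" passes in this toy run; the other three are removed by the dictionary check and the capital-letter check.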
3.4 Extracting loanwords based on rules                   For example, the sound of the letter “L” does not
We manually produced seven rules to identify              exist in Japanese, and thus, we converted “L” to “R”
loanwords in Mongolian. Words that match with one         in Mongolian.
of the following rules are extracted as loanwords.        3.6 N-gram retrieval
(a) A word including the consonants “к”, “п”, “ф”,        By using a document retrieval method, we efficiently
      or “щ”.                                             identify Katakana words that are phonetically similar
         These consonants are usually used to spell out   to a candidate loanword. In other words, we use a
      foreign words.                                      candidate loanword, and each Katakana word as a
(b) A word that violates the Mongolian vowel              query and a document, respectively. We call this
      harmony rule.                                       method “N-gram retrieval”.
          Because of the vowel harmony rule, a word          Because the N-gram retrieval method does not
       that includes female and male vowels, which is     consider the order of the characters in a target word,
       not based on the Mongolian phonetic system, is     the accuracy of matching two words is low, but the
       probably a loanword.                               computation time is fast. On the other hand, because
(c) A word beginning with two consonants.                 DP matching considers the order of the characters in
          A conventional Mongolian word does not          a target word, the accuracy of matching two words is
       begin with two consonants.                         high, but the computation time is slow. We combined
(d) A word ending with two particular consonants.         these two methods to achieve a high matching
          A word whose penultimate character is any       accuracy with a reasonable computation time.
       of: “п”, ”б”, “т”, ”ц”, “ч”, ”з”, or “ш” and          First, we extract Katakana words that are
       whose last character is a consonant violates       phonetically similar to a candidate loanword using
       Mongolian grammar, and is probably a               N-gram retrieval. Second, we compute the similarity
       loanword.                                          between the candidate loanword and each of the
(e) A word beginning with the consonant “в”.              retrieved Katakana words using DP matching to
          In a modern Mongolian dictionary (Ozawa,        improve the accuracy.
       2000), there are 54 words beginning with “в”,         We romanize all the Katakana words in the
       of which 31 are loanwords. Therefore, a word       dictionary and index them using consecutive N
       beginning with “в” is probably a loanword.
                                                              1 http://badaa.mngl.net/docs.php?p=trans_table (May, 2006)
characters. We also romanize each candidate                reports from our corpus, and used them to evaluate
loanword when it is used as a query. We experimentally     the accuracy of our stemming method. These
set N = 2, and use the Okapi BM25 (Robertson et al.,       technical reports were related to: medical
1995) for the retrieval model.                             science (17), geology (10), light industry (14),
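The N-gram retrieval of Section 3.6 can be sketched as follows, with N = 2 and a BM25 score. This is a simplified illustration, not the system used in our experiments: the parameter values `k1` and `b` and the non-negative idf variant are assumptions, the function names are ours, and the romanized Katakana words other than "shisutemu" are illustrative inputs.

```python
import math
from collections import Counter


def ngrams(word, n=2):
    """Return the consecutive character n-grams of a romanized word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def retrieve(query, katakana_words, top_k=500, k1=1.2, b=0.75):
    """Rank romanized Katakana words (documents) against a candidate loanword (query)."""
    if not katakana_words:
        return []
    docs = [Counter(ngrams(w)) for w in katakana_words]
    lengths = [sum(d.values()) for d in docs]
    avg_len = sum(lengths) / len(docs)
    # Document frequency: number of Katakana words that contain each bigram.
    doc_freq = Counter(g for d in docs for g in d)
    scored = []
    for word, grams, length in zip(katakana_words, docs, lengths):
        score = 0.0
        for g in set(ngrams(query)):
            tf = grams[g]
            if tf == 0:
                continue
            # Assumed idf variant that is never negative.
            idf = math.log(1 + (len(docs) - doc_freq[g] + 0.5) / (doc_freq[g] + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * length / avg_len))
        if score > 0:
            scored.append((score, word))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [word for _, word in scored[:top_k]]


if __name__ == "__main__":
    print(retrieve("sistem", ["shisutemu", "arubumin", "mekanizumu"]))
```

Only words that share at least one bigram with the query are returned, so DP matching is applied to a short list rather than to the whole Katakana dictionary.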
3.7 Computing phonetic similarity                          agriculture (6), and sociology (3). In these 50 reports,
Given the romanized Katakana words and the                 the number of phrase types including conventional
romanized candidate loanwords, we compute the              Mongolian nouns and loanword nouns was 961 and
similarity between the two strings, and select the         206, respectively. We also found six phrases
pairs associated with a score above a predefined           including loanword verbs, which were not used in
threshold as translations. We use DP matching to           the evaluation.
identify the number of differences (i.e., insertion,          Table 2 shows the results of our stemming
deletion, and substitution) between two strings on an      experiment, in which the accuracy for conventional
alphabet-by-alphabet basis.                                Mongolian nouns was 98.7% and the accuracy for
   While consonants in transliteration are usually the     loanwords was 94.6%. Our stemming method is
same across languages, vowels can vary depending           practical, and can also be used for morphological
on the language. The difference in consonants              analysis of Mongolian corpora.
between two strings should be penalized more than             We analyzed the reasons for any failures, and
the difference in vowels. We compute the similarity        found that for 12 conventional nouns and 11
between two romanized words using Equation (1).            loanwords, the suffixes were incorrectly segmented.
      \frac{2 \times (\alpha \times d_c + d_v)}{\alpha \times c + v}      (1)    4.3 Evaluating loanword extraction
                                                           We used our stemming method on our corpus and
Here, dc and dv denote the number of differences in        selected the most frequently used 1,300 words. We
consonants and vowels, respectively, and α is a            used these words to evaluate the accuracy of our
parametric constant used to control the importance         loanword extraction method. Of these 1,300 words,
of the consonants. We experimentally set α = 2.            165 were loanwords. We varied the threshold for the
Additionally, c and v denote the number of all the         similarity, and investigated the relationship between
consonants and vowels in the two strings,                  precision and recall. Recall is the ratio of the number
respectively. The similarity ranges from 0 to 1.           of correct loanwords extracted by our method to the
                                                           total number of correct loanwords. Precision is the
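The computation in Section 3.7 can be sketched with a weighted DP matching. Several points are assumptions and not statements of the paper: the fraction in Equation (1) counts differences, so this sketch takes the score to be one minus that fraction (which lets identical strings score 1 and fits a threshold applied from below); a substitution is charged as a consonant difference if either character is a consonant; the result is clamped at 0; and the letters "a", "e", "i", "o", "u" are treated as the only vowels.

```python
VOWELS = set("aeiou")  # assumption: every other romanized letter is a consonant


def weight(ch, alpha):
    """Cost of one differing character: 1 for a vowel, alpha for a consonant."""
    return 1.0 if ch in VOWELS else alpha


def similarity(s, t, alpha=2.0):
    """Score two romanized words; assumes 1 minus the fraction in Equation (1)."""
    # prev[j] holds the minimal weighted difference between a prefix of s and t[:j].
    prev = [0.0]
    for ch in t:
        prev.append(prev[-1] + weight(ch, alpha))
    for a in s:
        cur = [prev[0] + weight(a, alpha)]
        for j, b in enumerate(t, 1):
            sub = 0.0 if a == b else max(weight(a, alpha), weight(b, alpha))
            cur.append(min(prev[j] + weight(a, alpha),     # deletion of a
                           cur[j - 1] + weight(b, alpha),  # insertion of b
                           prev[j - 1] + sub))             # match or substitution
        prev = cur
    total = sum(weight(ch, alpha) for ch in s + t)  # alpha * c + v over both strings
    if total == 0:
        return 1.0
    return max(0.0, 1.0 - 2.0 * prev[-1] / total)   # clamped so the score stays in [0, 1]


if __name__ == "__main__":
    print(similarity("sistem", "shisutemu"))
```

With α = 2, a pair is kept as a translation when the returned score exceeds the threshold used in the experiments (0.6).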
4      Experiments                                         ratio of the number of correct loanwords extracted
4.1 Method                                                 by our method to the total number of words
We collected 1,118 technical reports published in          extracted by our method. We extracted loanwords
Mongolian from the “Mongolian IT Park”2 and used           using rules (a)–(g) defined in Section 3.4. As a result,
them as a Mongolian corpus. The number of phrase           139 loanwords were extracted correctly.
types and phrase tokens in our corpus were 110,458            Table 3 shows the precision and recall of each rule.
and 263,512, respectively.                                 The precision and recall showed high values using
   We collected 111,116 Katakana words from                “All rules”, which combined the words extracted by
multiple Japanese dictionaries, most of which were         rules (a)–(g) independently.
technical term dictionaries.                                  We also extracted loanwords using the phonetic
   We evaluated our method from four perspectives:         similarity, as discussed in Sections 3.6 and 3.7.
“stemming”, “loanword extraction”, “translation
extraction”, and “computational cost.” We will               Table 2: Results of our noun stemming method.
discuss these further in Sections 4.2-4.5, respectively.                    No. of each phrase type   Accuracy (%)
4.2 Evaluating stemming                                     Conventional                       961             98.7
We randomly selected 50 Mongolian technical                 nouns
                                                            Loanwords                          206             94.6
    2 http://www.itpark.mn/ (May, 2006)
                            Table 3: Precision and recall for rule-based loanword extraction.
      Rules                            (a)     (b)     (c)     (d)     (e)     (f)     (g)   All rules
      Words extracted automatically    102      63      21       6       4       5      24         150
      Extracted correct loanwords      101      60      20       5       4       5      19         139
      Precision (%)                   99.0    95.2    95.2    83.3   100.0   100.0    79.2        92.7
      Recall (%)                      61.2    36.4    12.1     3.0     2.4     3.0    11.5        84.2
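
As a check on the arithmetic, the "All rules" column of Table 3 follows directly from the definitions of precision and recall in Section 4.3, given that 165 of the 1,300 evaluated words were loanwords. The function name below is illustrative.

```python
def precision_recall(extracted, correct, total_loanwords):
    """Return (precision, recall) in percent, as defined in Section 4.3."""
    precision = 100.0 * correct / extracted
    recall = 100.0 * correct / total_loanwords
    return precision, recall


if __name__ == "__main__":
    # "All rules": 150 words extracted, 139 correct, 165 loanwords in the test set.
    p, r = precision_recall(150, 139, 165)
    print(round(p, 1), round(r, 1))
```

The same function applied to each rule column (for example 102 extracted and 101 correct for rule (a)) gives the per-rule figures.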

We used the N-gram retrieval method to obtain up to                       Table 5: Precision and recall of different loanword
the top 500 Katakana words that were similar to each                      extraction methods.
candidate loanword. Then, we selected up to the top                                       No. of           No. that            Precision       Recall
five pairs of a loanword and a Katakana word whose                                        words         were correct             (%)            (%)
similarity computed using Equation (1) was greater                        Rule                  150                 139             92.7           84.2
than 0.6. Table 4 shows the results of our                                Similarity             60                  12             20.0           46.2
similarity-based extraction.                                              Both                  210                 151             71.2           91.5
   Both the precision and the recall for the
similarity-based loanword extraction were lower                                                 Mongolian          English gloss
than those for the “All rules” data listed in Table 3.

Table 4: Precision and recall for our similarity-based loanword extraction.

 Words extracted    Extracted correct    Precision    Recall
 automatically      loanwords            (%)          (%)
 3,479              109                  3.1          66.1

   We also evaluated the effectiveness of a combination of the N-gram and
DP matching methods. We performed similarity-based extraction after
rule-based extraction. Table 5 shows the results, in which the data for
“Rule” are identical to the “All rules” data listed in Table 3. However,
the “Similarity” data are not identical to those listed in Table 4,
because we performed similarity-based extraction using only the words
that were not extracted by rule-based extraction.
   When we combined the rule-based and similarity-based methods, the
recall improved from 84.2% to 91.5%. Recall should be high when a human
expert is to modify or verify the resulting dictionary.
   Figure 5 shows examples of extracted loanwords in Mongolian and their
English glosses.

   альбумин       albumin
   лаборатор      laboratory
   механизм       mechanism
   митохондр      mitochondria

Figure 5: Examples of extracted loanwords.

4.4 Evaluating translation extraction
In the row “Both” shown in Table 5, 151 loanwords were extracted. For
each of these loanwords, we selected as translations up to the top five
Katakana words whose similarity, computed using Equation (1), was greater
than 0.6. As a result, Japanese translations were extracted for 109
loanwords. Table 6 shows the results, in which the precision and recall
of extracting Japanese–Mongolian translations were 56.2% and 72.2%,
respectively.
   We analyzed the data and identified the reasons for the failures. For
five loanwords, the N-gram retrieval failed to retrieve the similar
Katakana words. For three loanwords, the phonetic similarity computed
using Equation (1) was not high enough for a correct translation. For 27
loanwords, no corresponding Japanese translation exists. For seven
loanwords, the Japanese translations exist, but were not included in our
Katakana dictionary.
   Figure 6 shows the Japanese translations extracted for the loanwords
shown in Figure 5.

Table 6: Precision and recall for translation extraction.

 No. of translations    No. of extracted    Precision    Recall
 extracted              correct             (%)          (%)
 automatically          translations
 194                    109                 56.2         72.2
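
The percentages reported for translation extraction can be checked from the raw counts. The following Python sketch assumes that precision is the number of correct translations divided by the number of extracted translations (109/194), and that recall is the number of correct translations divided by the 151 loanwords in the “Both” row of Table 5. The recall denominator is an assumption inferred from the reported figures, and the function and variable names are hypothetical.

```python
from typing import Tuple


def precision_recall(extracted: int, correct: int, relevant: int) -> Tuple[float, float]:
    """Return (precision, recall) as percentages rounded to one decimal place."""
    if extracted <= 0 or relevant <= 0:
        raise ValueError("extracted and relevant must be positive")
    precision = 100.0 * correct / extracted
    recall = 100.0 * correct / relevant
    return round(precision, 1), round(recall, 1)


# Counts for translation extraction; 151 is the assumed recall denominator.
translation_scores = precision_recall(extracted=194, correct=109, relevant=151)

# Failure counts reported in the error analysis (key names are illustrative).
failure_counts = {
    "ngram_retrieval_miss": 5,
    "similarity_too_low": 3,
    "no_japanese_translation": 27,
    "missing_from_katakana_dictionary": 7,
}

# The failures should account for every loanword without a correct translation.
assert sum(failure_counts.values()) == 151 - 109

print(translation_scores)  # (56.2, 72.2)
```

Under these assumptions the four failure categories sum to 42, which matches the 151 - 109 loanwords for which no correct translation was extracted.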
 Japanese            Mongolian      English gloss
 アルブミン          альбумин       albumin
 ラボラトリー        лаборатор      laboratory
 メカニズム          механизм       mechanism
 ミトコンドリア      митохондр      mitochondria

Figure 6: Japanese translations extracted for the loanwords shown in
Figure 5.

4.5 Evaluating computational cost
We randomly selected 100 loanwords from our corpus, and used them to
evaluate the computational cost of the different extraction methods. We
compared the computation time and the accuracy of the “N-gram”, “DP
matching”, and “N-gram + DP matching” methods. All experiments were
performed on the same PC (CPU = Pentium III 1 GHz dual, Memory = 2 GB).
   Table 7 shows that “N-gram + DP matching” reduced the computation time
of “DP matching” from 136,815 to 293 seconds, and kept the average rank
of the correct translations at 2.7, compared with 44.8 for “N-gram”
alone. We thus improved the efficiency while maintaining the ranking
accuracy of the translations.

Table 7: Evaluation of the computational cost.

 Method                                  N-gram    DP         N-gram + DP
 Loanwords                               100       100        100
 Computation time (sec.)                 95        136,815    293
 Extracted correct translations          66        66         66
 Average rank of correct translations    44.8      2.7        2.7

5 Conclusion
We proposed methods for extracting loanwords from Cyrillic Mongolian
corpora and producing a Japanese–Mongolian bilingual dictionary. Our
research is the first serious effort to produce dictionaries of loanwords
and their translations targeting Mongolian. We devised our own rules to
extract loanwords from Mongolian corpora. We also extracted, as
loanwords, words in Mongolian corpora that are phonetically similar to
Japanese Katakana words. We then aligned the extracted loanwords with
Japanese words, and produced a Japanese–Mongolian bilingual dictionary. A
noun stemming method that does not require noun dictionaries was also
proposed. Finally, we evaluated the effectiveness of the components
experimentally.

References
Terumasa Ehara, Suzushi Hayata, and Nobuyuki Kimura. 2004. Mongolian
  morphological analysis using ChaSen. Proceedings of the 10th Annual
  Meeting of the Association for Natural Language Processing,
  pp. 709-712. (In Japanese).
Atsushi Fujii, Tetsuya Ishikawa, and Jong-Hyeok Lee. 2004. Term
  extraction from Korean corpora via Japanese. Proceedings of the 3rd
  International Workshop on Computational Terminology, pp. 71-74.
Pascal Fung and Kathleen McKeown. 1996. Finding terminology translations
  from non-parallel corpora. Proceedings of the 5th Annual Workshop on
  Very Large Corpora, pp. 53-87.
Wai Lam, Ruizhang Huang, and Pik-Shan Cheung. 2004. Learning phonetic
  similarity for matching named entity translations and mining new
  translations. Proceedings of the 27th Annual International ACM SIGIR
  Conference on Research and Development in Information Retrieval,
  pp. 289-296.
Sung Hyun Myaeng and Kil-Soon Jeong. 1999. Back-Transliteration of
  foreign words for information retrieval. Information Processing and
  Management, Vol. 35, No. 4, pp. 523-540.
Jong-Hooh Oh and Key-Sun Choi. 2001. Automatic extraction of
  transliterated foreign words using hidden Markov model. Proceedings of
  the International Conference on Computer Processing of Oriental
  Languages, 2001, pp. 433-438.
Shigeo Ozawa. Modern Mongolian Dictionary. Daigakushorin.
Stephen E. Robertson, Steve Walker, Susan Jones, Micheline
  Hancock-Beaulieu, and Mike Gatford. 1995. Okapi at TREC-3. Proceedings
  of the Third Text REtrieval Conference (TREC-3), NIST Special
  Publication 500-226, pp. 109-126.
Enkhbayar Sanduijav, Takehito Utsuro, and Satoshi Sato. 2005. Mongolian
  phrase generation and morphological analysis based on phonological and
  morphological constraints. Journal of Natural Language Processing,
  Vol. 12, No. 5, pp. 185-205. (In Japanese).
Frank Smadja, Vasileios Hatzivassiloglou, and Kathleen R. McKeown. 1996.
  Translating collocations for bilingual lexicons: A statistical
  approach. Computational Linguistics, Vol. 22, No. 1, pp. 1-38.
Bayarmaa Ts. 2002. Mongolian grammar in I-IV grades. (In
