P98-2220

Automatic English-Chinese name transliteration for development of
multilingual resources

Stephen Wan and Cornelia Maria Verspoor
Microsoft Research Institute
Macquarie University
Sydney NSW 2109, Australia
{swan, kversp}@mri.mq.edu.au

Abstract

    In this paper, we describe issues in the translation of proper names
    from English to Chinese which we have faced in constructing a system
    for multilingual text generation supporting both languages. We
    introduce an algorithm for mapping from English names to Chinese
    characters based on (1) heuristics about relationships between English
    spelling and pronunciation, and (2) consistent relationships between
    English phonemes and Chinese characters.

1 Introduction

In the context of multilingual natural language processing systems which
aim for coverage of both languages using a roman alphabet and languages
using other alphabets, the development of lexical resources must include
mechanisms for handling words which do not have standard translations.
Words falling into this category are words which do not have any obvious
semantic content, e.g. most Indo-European personal and place names, and
which can therefore not simply be mapped to translation equivalents.

In this paper, we examine the problem of generating Chinese characters
which correspond to English personal and place names. Section 2 introduces
the basic principles of English-Chinese transliteration, Section 3
identifies issues specific to the domain of name transliteration, and
Section 4 introduces a rule-based algorithm for automatically performing
the name transliteration. In Section 5 we present an example of the
application of the algorithm, and in Section 6 we discuss extensions to
improve the robustness of the algorithm.

Our need for automatic transliteration mechanisms stems from a
multilingual text generation system which we are currently constructing,
on the basis of an English-language database containing descriptive
information about museum objects (the POWER system; Verspoor et al. 1998).
That database includes fields such as manufacturer, with values of
personal and place names. Place names and personal names do not fall into
a well-defined set, nor do they have semantic content which can be
expressed in other languages through words equivalent in meaning. As more
objects are added to our database (as will happen as a museum acquires new
objects), new names will be introduced, and these must also be added to
the lexica for each language in the system. We require an automatic
procedure for achieving this, and concentrate here on techniques for the
creation of a Chinese lexicon.

2 English-Chinese Transliteration

We use the term transliteration to refer generally to the problem of the
identification of a specific textual form in an output language (in our
case Chinese characters) which corresponds to a specific textual form in
an input language (an English word or phrase). For words with semantic
content, this process is essentially equivalent to the translation of
individual words. So, the English word "black" is associated with a
concept which is expressed as "黑" ([hēi]) in Chinese. In this case, a
dictionary search establishes the input-output correspondence.

For words with little or no semantic content, such as personal and place
names, dictionary lookup may suffice where standard translations exist,
but in general it cannot be assumed that names will be included in the
bilingual dictionary. In multilingual systems designed only for languages
sharing the roman alphabet, such names pose no problem as they can simply
be included unaltered in output texts in any of the languages. They
cannot, however, be included in a Chinese text, as the roman characters
cannot standardly be realized in the Han character set.

3 Name Transliteration

English-Chinese name transliteration occurs on the basis of pronunciation.
That is, the written English word is mapped to the written Chinese
character(s) via the spoken form associated with the word. The idealized
process consists of:


1. mapping an English word (grapheme) to a phonemic representation
2. mapping each phoneme composing the word to a corresponding Chinese
   character

In practice, this process is not entirely straightforward. We outline
several issues complicating the automation of this process below.

The written form of English is less than normalized. A particular English
grapheme (letter or letter group) does not always correspond to a single
phoneme (e.g. ea is pronounced differently in eat, threat, heart, etc.),
and many English multi-letter combinations are realised as a single
phoneme in pronunciation (so f, ff, ph, and gh can all map to /f/) (van
den Bosch 1997). An important step in grapheme-phoneme conversion is the
segmentation of words into syllables. However, this process is dependent
on factors such as morphology. The syllabification of "hothead" divides
the letter combination th, while the same combination corresponds to a
single phoneme in "bother". Automatic identification of the phonemes in a
word is therefore a difficult problem.

Many approaches exist in the literature to solving the grapheme-phoneme
conversion problem. Divay and Vitale (1997) review several of these, and
introduce a rule-based approach (with 1,500 rules for English) which
achieved 94.9% accuracy on one corpus and 64.37% on another. Van den Bosch
(1997) evaluates instance-based learning algorithms and a decision tree
algorithm, finding that the best of these algorithms can achieve 96.9%
accuracy.

Even when a reliable grapheme-to-phoneme conversion module can be
constructed, the English-Chinese transliteration process is faced with the
task of mapping phonemes in the source language to counterparts in the
target language, which is difficult due to phonemic divergence between the
two languages. English permits initial and final consonant clusters in
syllables. Mandarin Chinese, in contrast, primarily has a consonant-vowel
or consonant-vowel-[nasal consonant (/n/ or /ŋ/)] syllable structure.
English consonant clusters, when pronounced within the Chinese phonemic
system, must either be reduced to a single phoneme or converted to a
consonant-vowel-consonant-vowel structure by inserting a vowel between the
consonants in the cluster. In addition to these phonotactic constraints,
the range of Chinese phonemes is not fully compatible with those of
English. For instance, Mandarin does not use the phoneme /v/ and so that
phoneme in English words is realized as either /w/ or /f/ in the Chinese
counterpart.

We focus on the specific problem of country name transliteration from
English into Chinese. The algorithm does not aim to specify general
grapheme-phoneme conversion for English, but only for the subset of
English words relevant to place name transliteration. This limited domain
rarely exhibits complex morphology and thus a robust morphological module
is not included. In addition, foreign language morphemes are treated
superficially. Thus, the algorithm transliterates the "-istan" (a morpheme
having meaning in Persian) of "Afghanistan" in spite of a standard
transliteration which omits this morpheme.

The transliteration process is intended to be based purely on phonetic
equivalency. On occasion, country names will have some additional meaning
in English apart from the referential function, as in "The United States".
Such names are often translated semantically rather than phonetically in
Chinese. However, this is not uniformly true: for example, "Virgin" in
"British Virgin Islands" is transliterated. We therefore introduce a
dictionary lookup step prior to commencing transliteration, to identify
cases which have a standard translation.

The transliteration algorithm results in a string of Han characters, the
ideographic script used for Chinese. While the dialects of Chinese share
the same orthography, they do not share the same pronunciation. This
algorithm is based on the Mandarin dialect.

Because automation of this algorithm is our primary goal, the
transliteration starts with a written source and it is assumed that the
orthography represents an assimilated pronunciation, even though English
has borrowed many country names. This is permitted only because the
mapping from English phonemes to Chinese phonemes loses a large degree of
variance: English vowel monophthongs are flattened into a smaller number
of Chinese monophthongs. However, Chinese has a larger set of diphthongs
and triphthongs. This results in approximating a prototypical vowel by the
closest match within the set of Chinese vowels.

4 An Algorithm for Auto Transliteration

The algorithm begins with a proper noun phrase (PNP) and returns a
transliteration in Chinese characters. The process involves five main
stages: Semantic Abstraction, Syllabification, Sub-syllable Divisions,
Mapping to Pinyin, and Mapping to Han Characters.

4.1 Semantic Abstraction

The PNP may consist of one or more words. If it is longer than a single
word, it is likely that some part of it may have an existing semantic
translation. "The" and "of" are omitted by


convention. To ensure that such words as "United" are translated and not
transliterated [1], we pass the entire PNP into a dictionary in search of
a standard translation. If a match is not immediately successful, we break
the PNP into words and pass each word into the dictionary to check for a
semantic translation [2]. This portion of the algorithm controls which
words in the PNP are translated and which are transliterated.

Search for PNP in dictionary
    If exact match exists then
        return corresponding characters
    else
        remove article 'The' and preposition 'of'
        For each (remaining) word in PNP
            search for word in dictionary
            If exact match exists
                add matching characters to output string [3]
            else if the word is not already a Chinese word
                transliterate the word and add to output string

[1] The historical interactions of some European and Asian nations have
led to names that include some special meaning. Interaction with the
dialects of the South may have produced transliterations based on regional
pronunciations which are accepted as standard.

[2] There is some discrepancy among speakers about the balance between
translation and transliteration. For instance, the word 'New' is
translated by some and transliterated by others.

[3] Identification of syntactic constraints is work-in-progress. Known
nouns such as 'island' are moved to the end of the phrase while modifiers
(remaining words) maintain their relative order.

4.2 Transliteration 1: Syllabification

Because Chinese characters are monosyllabic, each word to be
transliterated must first be divided into syllables. The outcome is a list
of syllables, each with at least one vowel part.

We distinguish between a consonant group and a consonant cluster, where a
group is an arbitrary collection of consonant phonemes and a cluster is a
known collection of consonants. Like Divay and Vitale (1997), we identify
syllable boundaries on the basis of consonant clusters and vowels
(ignoring morphological considerations). Any consonant group is divided
into two parts, by identifying the final consonant cluster or lone
consonant in that group and grouping that consonant (cluster) with the
following vowel. The sub-syllabification algorithm then further divides
each identified syllable. While this procedure may not always strictly
divide a word into standard syllables, it produces syllables of the form
consonant-vowel, the common pronunciation of most Chinese characters.

4.2.1 Normalization

Prior to the syllabification process, the input string must be normalized,
so that consonant clusters are reduced to a single phoneme represented by
a single ASCII character (e.g. ff and ph are both reduced to f). Instances
of 'y' as a vowel are also replaced by the vowel 'i'.

For each pair of identical consonants in the input string
    Reduce the pair to a single instance of the consonant
For each substring in the input string listed in Appendix A
    Replace substring with the corresponding phoneme (App. A)
For all instances where 'y' is not followed by a vowel or 'y' follows a
consonant
    Replace this instance of 'y' with the vowel 'i'
When 'e' is followed by a consonant and an 'ia#'
    ;; (where # is the end of string marker)
    Replace the preceding 'e' with 'i'

4.2.2 Syllabification

If string begins with a consonant
    Then read/store consonants until next vowel and call this
    substring initial_consonant_group (or icg)
Read/store vowels until next consonant and call this substring
    vowels (or v)
If more characters, read/store consonants until next vowel and call
    this final_consonant_cluster (or fcc)
If length of fcc = 1 and fcc followed by substring 'e#'
    final_vowel (or fv) = 'e'
    syllable = icg + v + fcc + fv
else
    if the last two letters of fcc form a substring in Appendix B
        then this string has a double consonant cluster
        next_syllable (or ns) = the last two letters of fcc
        reset fcc to be fcc with ns removed
    else
        next_syllable (or ns) = the last letter of fcc
        reset fcc to be fcc with ns removed
    syllable = icg + v + fcc
Store syllable in a list
Call syllabification procedure on substring [ns .. #]

4.3 Transliteration 2: Sub-syllable Divisions

The algorithm then proceeds to find patterns within each syllable of the
list. The pattern matching consists of splitting those consonant clusters
that cannot be pronounced within the Chinese phonemic set. These separated
consonants are generally pronounced by inserting a context-dependent
vowel. The Pinyin romanization consists of elements that can be described
as consonants (including three consonant clusters "zh", "ch" and "sh") and
vowels which consist of monophthongs, diphthongs and vowels followed by a
nasal /n/ or /ŋ/. Consonants that follow a set of vowels are examined to
determine if they "modify" the vowel. Such consonants include the alveolar
approximant /r/, the pharyngeal fricative /h/ or the above mentioned nasal
consonants. These are then joined to the vowel to form the "vowel part".
The "vowel part" may be divided so as to map onto a Pinyin syllable. Any
remaining consonants are then split by inserting a vowel.
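
The normalization and syllabification procedures of Sections 4.2.1 and
4.2.2 can be illustrated with a minimal Python sketch. It is not the
authors' implementation: the function names `normalize` and `syllabify`
are hypothetical, the Appendix A and B tables hold only the entries that
are legible in this copy, the "r + cons." rule is left out, and keeping
trailing consonants in the last syllable is an assumption made so that
every syllable contains a vowel.

```python
import re

VOWELS = "aeiou"
CONS = "bcdfghjklmnpqrstvwxyz"

# Assumption: subset of Appendix A (grapheme -> single-letter phoneme).
APPENDIX_A = {"cqu": "k", "bh": "b", "ph": "f", "th": "t", "ck": "k",
              "sc": "c", "dj": "j", "ts": "c", "lk": "k"}
# Assumption: consonant pairs as read from Appendix B.
APPENDIX_B = {"tr", "sh", "ch", "cz", "sp", "st", "sw",
              "bl", "cl", "fl", "kl", "pl", "sl"}


def normalize(word):
    """Apply the four normalization rules of Section 4.2.1 in order."""
    s = word.lower()
    # Rule 1: a pair of identical consonants becomes a single consonant.
    s = re.sub(rf"([{CONS}])\1", r"\1", s)
    # Rule 2: Appendix A substitutions, longest grapheme first.
    for key in sorted(APPENDIX_A, key=len, reverse=True):
        s = s.replace(key, APPENDIX_A[key])
    # Rule 3: 'y' not followed by a vowel, or following a consonant -> 'i'.
    s = re.sub(rf"y(?![{VOWELS}])|(?<=[{CONS}])y", "i", s)
    # Rule 4: 'e' + consonant + word-final 'ia' -> 'i' + consonant + 'ia'.
    s = re.sub(rf"e([{CONS}])ia$", r"i\1ia", s)
    return s


def _run_end(s, start, want_vowel):
    """Index just past the run of vowels (or consonants) beginning at start."""
    end = start
    while end < len(s) and (s[end] in VOWELS) == want_vowel:
        end += 1
    return end


def syllabify(word):
    """Split a normalized word into syllables following Section 4.2.2."""
    syllables, pos = [], 0
    while pos < len(word):
        icg_end = _run_end(word, pos, False)    # initial consonant group
        v_end = _run_end(word, icg_end, True)   # vowels
        fcc_end = _run_end(word, v_end, False)  # final consonant cluster
        fcc, rest = word[v_end:fcc_end], word[fcc_end:]
        if fcc_end == len(word):
            cut = fcc_end                # assumption: keep trailing consonants
        elif len(fcc) == 1 and rest == "e":
            cut = len(word)              # single consonant followed by 'e#'
        elif fcc[-2:] in APPENDIX_B:
            cut = fcc_end - 2            # known pair opens the next syllable
        else:
            cut = fcc_end - 1            # lone consonant opens the next syllable
        syllables.append(word[pos:cut])
        pos = cut
    return syllables


print(syllabify(normalize("Faeroe")))
```

Under these assumptions, "faeroe" is split before the "r" into "fae" and
"roe", which is the division used in the worked example of Section 5.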

For each syllable s identified above
    Initialize subsyllable_list (or sl) to the empty string
    Identify initial_consonant_group s_icg
    While s_icg is non-null
        If the first two letters of s_icg appear in Appendix C
            then consonant_pair (or cp) = those two letters
                 append cp to sl
                 reset s_icg to be the remainder of s_icg
            else add the first letter of s_icg to sl
                 reset s_icg to be the remainder of s_icg
    Identify vowels (v) in s
    append v to last element of sl
    identify final_consonant_cluster (fcc) of s
    if s_fcc is non-null
        if s_fcc is equal to 'n', 'm', 'ng', 'h' or 'r'
            identify final vowels of s (s_fv)
            If s_fv exists and s_fcc = 'n' or 'm'
                append s_fcc to last element of sl
            else if s_fv exists and s_fcc not = 'n' or 'm'
                append s_fcc + s_fv to last element of sl
            else if s_fv exists and s_fcc = 'h' or 'r'
                discard s_fcc + s_fv
        else
            while s_fcc is non-null
                If the first two letters of s_fcc appear in Appendix C
                    then
                        cp = those two letters
                        append cp to sl
                        reset s_fcc to be the remainder of s_fcc
                    else
                        add the first letter of s_fcc to sl
                        reset s_fcc to be the remainder of s_fcc
For each element of sl
    If element does not include a vowel
        Insert context-dependent vowel

This procedure will subdivide the syllable into pronounceable sections for
mapping to the Chinese phoneme set. Thus each subsection should be of the
form <cv>, <v> or <vc_n>, where "c" is a single consonant, "v" is a
monophthong or diphthong and "c_n" is a nasal consonant.

4.4 Transliteration 3: Mapping to Pinyin

The subsyllables are then mapped to the Pinyin romanization standard
equivalents by means of a table (Appendix D). The columns of this table
are indexed by the consonant of the subsyllable, and the rows by its vowel
part. When an exact match cannot be found we prioritize aspects of the
subsyllable. Often the highest priority is the initial consonant. Of next
priority are nasal consonants. This may demand an alternate vowel choice
if no such combination of phonemes exists in the table.

4.5 Transliteration 4: Mapping to Han

Once the Pinyin of a word is established, the Han characters are simply
extracted from a table specifying the Pinyin <cv> to Han character
correspondence (Appendix E). In some cases, multiple characters might be
possible but the table includes only the most common.

5 An Example

The transliteration of the place name "Faeroe Islands" according to the
algorithm will proceed as follows:

1. No match for "Faeroe" is found in the dictionary, so it must be
   transliterated.
2. Divide Faeroe into two syllables by recognizing that the syllabic break
   falls before the "r" in the middle consonant group.
3. Map /fae/ and /roe/ onto their Chinese equivalents. Since no vowel form
   /ae/ exists in Chinese, this is mapped to /ei/. The /r/ of the second
   syllable is mapped to /l/ and /oe/ is correspondingly mapped to /uo/.
4. Since each syllable is of the form <cv>, no subsyllabic processing is
   required.
5. The transliterated phrase "fei luo" is then mapped to the Han
   characters: "[illegible]"
6. "Islands" is searched for and found in the dictionary: "群岛"
   (qún dǎo)
7. The characters of the translated "Islands" are placed after the
   transliteration of "Faeroe": "[illegible] 群岛" (fei luo qún dǎo)

6 Conclusions and Future Extensions

The algorithm we have outlined is being implemented as a tool for the
creation of Chinese lexical resources within a multilingual text
generation project from an English-language source database. We focused on
the requirements of the domain of English place names. The algorithm is
currently being extended to include personal name transliteration as well,
which requires a different set of characters. A personal name
transliteration standard has been developed and is in use in China
(Canzhong Wu, p.c.). By mapping the Pinyin transliterations arrived at by
our algorithm to this different set of characters, we can extend the
domain to include personal names.

In its present form, the algorithm will not always generate
transliterations matching those which might be produced by a human
transliterator due to the influence of historical factors or individual
differences. However, the aim of the algorithm is to produce a
transliteration understandable by readers of a Chinese text. While the
algorithm mimics the intuitive superimposition of phonemic and phonotactic
systems, the ultimate goals of the algorithm are generality and
reliability. Indeed, the result from the example above corresponds to a
standard transliteration. Thus the algorithm produces results which are
recognisable. The degree to which the transliteration is recognised by the
human speaker is dependent in part on the length of the original name.
Longer names with many syllables are less recognisable than shorter
names. The phonemic conversion rules introduced here are merely the most
common ones, and further work will strengthen the generality of the tool.
Further research will include a more formal analysis of the
correspondences between English and Chinese phonemes. Furthermore, the
algorithm is far from robust due to its current limited focus, and errors
made in earlier stages are propagated and possibly magnified as the
algorithm continues. Since place names and people's names originate from
many cultures, this algorithm will not produce desirable results unless
the written form exhibits some assimilation to English spelling. We are
currently investigating the application of lazy learning techniques (as
described by van den Bosch 1997) to learning the English naming
word-phoneme correspondences from a corpus of names. Such a module could
eventually replace our simplistic rule-based procedure, and could feed
into the phoneme-Pinyin mapping module, ultimately resulting in greater
accuracy.

Such an algorithm has many applications. Currently, finding the Chinese
form of a less common country, city, or county name is an arduous
procedure. Because transliteration uses no semantic content, it is an
obvious task for automation. This algorithm could also be applied in
character entry on a Chinese word processor or to index Chinese electronic
atlases. When attached to a robust grapheme-to-phoneme module, the
transliteration into Chinese characters is ultimately a mapping to
Chinese-specific IPA phonetics, raising the possibility of speech
synthesis of English names in Chinese, given that Pinyin is a phonemically
normalized orthography.

Acknowledgements

Our thanks go to Canzhong Wu for help with identifying Chinese mappings,
and the members of the Dynamic Document Delivery project at the Microsoft
Research Institute (the POWER team).

References

Divay M. and Vitale A.J. (1997) Algorithms for Grapheme-Phoneme
Translation for English and French: Applications. Computational
Linguistics, 23/4, pp. 495-524.

Verspoor, C., Dale, R., Green, S., Milosavljevic, M., Paris, C., and
Williams, S. (1998) Intelligent Agents for Information Presentation:
Dynamic Description of Knowledge Base Objects. In the proceedings of the
International Workshop on Intelligent Agents on the Internet and Web,
Mexico City, Mexico, 16-20 March 1998, pp. 75-86.

van den Bosch A. (1997) Learning to pronounce written words: A study in
inductive language learning. PhD thesis, University of Maastricht,
Uitgeverij Phidippides, Cadier en Keer, the Netherlands, 229p.

Appendices A, B, and C. English-Chinese unitary consonant correspondences,
consonant pairs, and double consonant correspondences

A. Unitary consonant correspondences
   bh => b       ngh => ngh    gh => gh     ph => f
   th => t       ck => k       r + cons. => cons.
   cqu => k      sc => c       dj => j      ts => c
   lk => k       we => w

B. Consonant pairs
   tr   sh   ch   cz   sp   st   sw
   bl   cl   fl   kl   pl   sl

C. Double consonant correspondences
   cz => ch      st => shi d-     ch => ch
   sp => xi b-   sw => ru-        sh => sh

Appendix D. Portion of English phoneme - Chinese Pinyin Mapping Table

        f-      n-      p-      r-      v
a       fa      na      ba      la      wa
ae      fei     nei     bei     lei     wei
ai      fei     nei     bei     lei     wei
ai      fai     nai     bai     lai     wai
ai      fa yi   na yi   ba yi   la yi   wa yi
ao              nao     bao     lao
ar#             nuo             luo     wuo
au              nuo             luo     wuo
ay      fei     nei     bei     lei     wei
o       fo              bo              wo
o#              nuo#            luo#    wuo#
oa                      bo ya           wo ya
oe              nuo             luo     wuo
oi                                      #
on                              lun
or#             nuo#            luo#    wuo#
ou              nuo             luo     wuo

Appendix E. Pinyin-Han table (portion)

[The Han characters of this table are illegible. The Pinyin keys, with the
number of alternative characters listed for each in parentheses, are:]

a, ai (2), an (2), ang, ao, ba, bai, ban, bao (2), bei (2), ben, bi,
bing (2), bo (5), bu (3), chao, di, dian (2), du (2), dun, duo, e (2),
er (2), fa, fei (3), fen, fo, fu (3), gan, gang (3), ge (3), hong, ji (5),
jia, jian, jie, jin, jing, ju, ka (2), kai, ke (2), ken, la (2), lai, lan,
lang, lao, le, li (2), lun, luo (2), luu, ma, mai (2), man, mao, mei, men,
meng (3), mi (3), mian, mo (3), mu, na (3), nan, nao, qi, qiu, ri,
rui (2), sa, sai, sang, se, sen, sha, shao, she, shi (4), si, song, su,
suo (2), ta (2), tai (3), wang (2), wei (5), wen, wu, wuo, xi (2), xian,
xiang (2), xin, xiong, xu, ya (2), ye, yi (3), yin, yue (2)
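
The table lookup of Section 4.4 can be sketched in Python using three rows
of Appendix D. This is an illustrative sketch only: the names
`APPENDIX_D`, `split_cv` and `to_pinyin` are hypothetical, only exact
matches are handled, and the priority-based fallback described in Section
4.4 is not modelled (a missing table cell simply yields `None`).

```python
# Fragment of Appendix D: outer key = vowel part (row),
# inner key = English consonant (column), value = Pinyin syllable.
APPENDIX_D = {
    "a":  {"f": "fa",  "n": "na",  "p": "ba",  "r": "la",  "v": "wa"},
    "ae": {"f": "fei", "n": "nei", "p": "bei", "r": "lei", "v": "wei"},
    "oe": {"n": "nuo", "r": "luo", "v": "wuo"},
}
VOWELS = "aeiou"


def split_cv(subsyllable):
    """Split a <cv> subsyllable into (consonant part, vowel part)."""
    i = 0
    while i < len(subsyllable) and subsyllable[i] not in VOWELS:
        i += 1
    return subsyllable[:i], subsyllable[i:]


def to_pinyin(subsyllables):
    """Look up each subsyllable in the table; None marks a missing cell."""
    result = []
    for sub in subsyllables:
        consonant, vowel = split_cv(sub)
        result.append(APPENDIX_D.get(vowel, {}).get(consonant))
    return result


print(to_pinyin(["fae", "roe"]))
```

For the subsyllables "fae" and "roe" of the Faeroe example, the lookup
gives "fei" and "luo", matching steps 3 and 5 of Section 5. A full
implementation would need the complete table and a fallback strategy for
empty cells.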

