Machine Transliteration


                  Joshua Waxman

11/17/2011                             1
• Words written in a language with alphabet
  A → rewritten in a language with alphabet B
• שלום → “shalom”
• Importance for MT, for cross-language IR
• Forward transliteration, Romanization

      Is there a convergence towards a standard spelling?
Perhaps for really famous names. But even for such standard names, there are multiple acceptable
    spellings. Whether anyone regulates such spellings is probably culturally dependent. In the
    meantime, we have a lot of variance, especially on the Web. E.g. the holiday of Succot, סוכות

Variance in pronunciation across cultural groups (soo-kot, suh-kes) = dialect; variance in
    how one chooses to transliterate different Hebrew letters (kk, cc, gemination).
•   Sukkot: 7.1 million
•   Succot: 173 thousand
•   Succos: 153 thousand
•   Sukkoth: 113 thousand
•   Succoth: 199 thousand
•   Sukos: 112 thousand
•   Sucos: 927 thousand, but probably almost none related to holiday
•   Sucot: 101 thousand. Spanish transliteration of holiday
•   Sukkes: 1.4 thousand. Yiddish rendition
•   Succes: 68 million. Misspelling of “success”
•   Sukket: 45 thousand. But not Yiddish, because wouldn’t have “t” ending

Recently in the news: AP: Emad Borat; Arutz Sheva: Imad Muhammad Intisar Boghnat

      Can we enforce standards?
• Would make task easier.
• News articles, perhaps
• However:
     – Would they listen to us?
     – Does the standard make sense across the
       board? Once again, dialectal differences, e.g.
       ת, ה, vowels. Also, fold-over of the alphabet:
       א-ע, כ-ק, כ-ח, ט-ת, ס-ת
     – 2N mappings for N languages

                       Four Papers
• “Cross Linguistic Name Matching in English and Arabic”
     – For IR – search. Fuzzy string matching. Modification of Soundex
       to use cross-language mapping, using character equivalence
• “Machine Transliteration”
     – For Machine translation. Back transliteration. 5 steps in
       transliteration. Use Bayes’ rule
• “Transliteration of Proper Names in Cross-Lingual Information Retrieval”
     – Forward transliteration, purely statistical based
• “Statistical Transliteration for English-Arabic Cross
  Language Information Retrieval”
     – Forward transliteration. For IR, generating every possible
       transliteration, then evaluate. Using selected n-gram model

      Cross Linguistic Name
  Matching in English and Arabic
             A “One to Many Mapping” Extension of the
                Levenshtein Edit Distance Algorithm

               Dr. Andrew T. Freeman, Dr. Sherri L. Condon and
                          Christopher M. Ackerman
                             The MITRE Corporation

 Cross Linguistic Name Matching
• What?
     – Match personal names in English to the same names in Arabic script.
• Why is this not a trivial problem?
     – There are multiple transcription schemes, so it is not one-to-one
     – e.g. ‫ معمر القذافي‬can be Muammar Gaddafi, Muammar Qaddafi, Moammar
       Gadhafi, Muammar Qadhafi, Muammar al Qadhafi
     – because certain consonants and vowels can be represented multiple
       ways in English
     – note: Arabic is just an example of this phenomenon
     – so standard string comparison insufficient
• For What purpose?
     – For search on, say, news articles. How do you match all occurrences of
       a name across its variant spellings?
• Their solution
     – Enter the search term in Arabic, use Character Equivalence Classes
       (CEQ) to generate possible transliterations, supplement the Levenshtein
       Edit Distance Algorithm
              Elaboration on Multiple
             Transliteration Schemes
• Why?
     – No standard English letter
       corresponding to
       Arabic /q/
     – Different dialects – in
       Libya, this is
       pronounced [g]
     – note: Similar for
       Hebrew dialects

             Fuzzy string matching
• def: matching strings based on similarity
  rather than identity
• Examples:
     – edit-distance
     – n-gram matching
     – normalization procedures like Soundex.

     Survey of Fuzzy Matching Methods - Soundex

•   Soundex
     – Odell and Russell, 1918
•   Some obvious pluses:
     – (not mentioned explicitly by paper)
     – we eliminate vowels, so Moammar/Muammar
       not a problem
     – Groups of letters will take care of different
       English letters corresponding to Arabic
     – Elimination of repetition and of h will remove
       doubling/dh variants (e.g. Gaddafi/Gadhafi)
•   Some minuses
     – Perhaps dialects will transgress Soundex
       phonetic code boundaries. E.g. ת in Hebrew can
       be t, th, s. ח can be ch or h. Is a ו to be w or v?
       But could modify algorithm to match.
     – note the al in al-Qadhafi
     – Perhaps would match too many inappropriate names

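The grouping the slide describes can be sketched as follows. This is a minimal sketch of classic Soundex, not the papers' modified variants; details such as the h/w rule differ between implementations:

```python
# Minimal sketch of classic Soundex: keep the first letter, map the rest
# to digit groups, drop vowels/y, collapse adjacent duplicates, pad to 4.
CODES = {}
for group, digit in [('bfpv', '1'), ('cgjkqsxz', '2'), ('dt', '3'),
                     ('l', '4'), ('mn', '5'), ('r', '6')]:
    for ch in group:
        CODES[ch] = digit

def soundex(name):
    name = name.lower()
    result = name[0].upper()
    prev = CODES.get(name[0], '')
    for ch in name[1:]:
        if ch in 'hw':          # h and w are transparent: they do not
            continue            # break a run of same-coded letters
        code = CODES.get(ch, '')
        if code and code != prev:
            result += code
        prev = code             # vowels reset prev to '' (empty code)
    return (result + '000')[:4]

print(soundex('Robert'), soundex('Rupert'))    # both R163
print(soundex('Gaddafi'), soundex('Gadhafi'))  # dd vs dh collapse: G310, G310
print(soundex('Qaddafi'))                      # but the kept first letter: Q310
```

This illustrates both a plus the slide notes (Gaddafi/Gadhafi merge) and a minus: the retained first letter still keeps G- and Q-spellings apart.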
             Noisy Channel Model

           Levenshtein Edit Distance
• AKA Minimum Edit Distance
  – Minimum number of operations of insertion, deletion, substitution.
    Cost per operation = 1
  – Via dynamic programming
  – Example taken from Jurafsky and Martin, but with corrections
  – Minimum of diagonal + subst, or down/left + insertion/deletion cost

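The dynamic program just described can be sketched as follows (a minimal sketch; the cell layout matches the tables on the next slides, with a configurable substitution cost):

```python
def min_edit_distance(s, t, sub_cost=1):
    # D[i][j] = cost of editing s[:i] into t[:j]; insert/delete cost 1.
    m, n = len(s), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i
    for j in range(1, n + 1):
        D[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if s[i - 1] == t[j - 1] else sub_cost
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + sub)  # substitution (or match)
    return D[m][n]

print(min_edit_distance('intention', 'execution', sub_cost=2))  # 8
print(min_edit_distance('intention', 'execution', sub_cost=1))  # 5
```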
    Minimum Edit Distance Example
         (substitution cost = 2)
   N         9   8   9   10   11   12   11   10   9    8
   O         8   7   8   9    10   11   10   9    8    9
   I         7   6   7   8    9    10   9    8    9    10
   T         6   5   6   7    8    9    8    9    10   11
   N         5   4   5   6    7    8    9    10   11   10
   E         4   3   4   5    6    7    8    9    10   9
   T         3   4   5   6    7    8    7    8    9    8
   N         2   3   4   5    6    7    8    7    8    7
   I         1   2   3   4    5    6    7    6    7    8
   #         0   1   2   3    4    5    6    7    8    9
             #   E   X   E    C    U    T    I    O    N
    Minimum Edit Distance Example
         (substitution cost = 1)
N       9    8   8   8   8   8   8   7   6   5
O       8    7   7   7   7   7   7   6   5   6
I       7    6   6   6   6   6   6   5   6   7
T       6    5   5   5   5   5   5   6   7   8
N       5    4   4   4   4   5   6   7   7   7
E       4    3   4   3   4   5   6   6   7   8
T       3    3   3   3   4   5   5   6   7   8
N       2    2   2   3   4   5   6   7   7   7
I       1    1   2   3   4   5   6   6   7   8
#       0    1   2   3   4   5   6   7   8   9
        #    E   X   E   C   U   T   I   O   N
             Minimum Edit Distance
•   Score of 0 = perfect match, since no edit ops

•   s of len m, t of len n

•   Fuzzy match: divide the edit score by the length of the shortest (or longest) string,
    take 1 – this number, and set a threshold for strings to count as a match. Then longer
    pairs of strings are more likely to be matched than shorter pairs with the same number
    of edits: we get the percentage of chars that need ops. Otherwise, “A”
    vs. “I” has the same edit distance as “tuning” vs. “turning.”

•   Good algorithm for fuzzy string comparison – can see that Muammar
    Gaddafi, Muammar Qaddafi, Moammar Gadhafi, Muammar Qadhafi,
    Muammar al Qadhafi are relatively close.

•   But, don’t really want substitution cost of G/Q, O/U, DD/DH, certain
    insertion/deletion costs. That is why they supplement it with these Character
    Equivalence Classes (CEQ), which we’ll get to a bit later.

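The length-normalized score described above can be sketched as follows (the 0.7 threshold is an arbitrary illustration, not a value from the paper):

```python
def edit_distance(s, t):
    # Plain unit-cost Levenshtein, computed with a rolling row.
    m, n = len(s), len(t)
    D = list(range(n + 1))
    for i in range(1, m + 1):
        prev, D[0] = D[0], i
        for j in range(1, n + 1):
            cur = D[j]
            D[j] = min(D[j] + 1, D[j - 1] + 1,
                       prev + (s[i - 1] != t[j - 1]))
            prev = cur
    return D[n]

def similarity(s, t):
    # 1 - (edits / longest length): fraction of characters needing no op.
    return 1 - edit_distance(s, t) / max(len(s), len(t))

def is_match(s, t, threshold=0.7):   # threshold: illustrative only
    return similarity(s, t) >= threshold

print(similarity('A', 'I'))            # 0.0   - one edit, short strings
print(similarity('tuning', 'turning')) # ~0.86 - one edit, longer strings
```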
•   Zobel and Dart (1996) – Soundex + Levenshtein Edit Distance

•   replace e(si, tj) which was basically 1 if unequal, 0 if equal (that is, cost of an
    op), with r(si, tj), which makes use of Soundex equivalences. 0 if identical, 1
    if in same group, 2 if different

•   Also neutralizes h and w in general. Show example based on chart from
    before. In terms of initializing or calculating cost of insertion/deletion, do not
    count, otherwise have cost of 1.

•   Other enhancements to standard Soundex and Edit distance for the
    purpose of comparison. e.g. tapering – (counts less later in the word);
    phonometric methods – input strings mapped to phonemic representations.
    E.g. rough.

•   They report it performed better than Soundex, minimum edit distance, counting n-gram
    sequences, ~10 permutations of tapering, and phonometric enhancements to
    standard algorithms

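A sketch of the Editex-style scoring just described. The group sets below are an illustrative subset, simplified from Zobel and Dart's actual tables:

```python
# Letter groups (illustrative subset of Editex-style phonetic groups).
GROUPS = ['aeiouy', 'bp', 'ckq', 'dt', 'lr', 'mn', 'gj', 'fv', 'sxz']

def r(a, b):
    # 0 if identical, 1 if in the same group, 2 if different.
    if a == b:
        return 0
    if any(a in g and b in g for g in GROUPS):
        return 1
    return 2

def d(ch):
    # Insertion/deletion cost: h and w are "neutralized" (cost 0).
    return 0 if ch in 'hw' else 1

def editex(s, t):
    m, n = len(s), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + d(s[i - 1])
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + d(t[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(D[i - 1][j] + d(s[i - 1]),
                          D[i][j - 1] + d(t[j - 1]),
                          D[i - 1][j - 1] + r(s[i - 1], t[j - 1]))
    return D[m][n]

print(editex('martha', 'marta'))  # 0 - the h is free to delete
print(editex('soda', 'sodi'))     # 1 - a/i are in the same vowel group
```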
             SecondString (Tool)
• Java based implementation of many of
  these string matching algorithms. They
  use this for comparison purposes. Also,
  SecondString allows hybrid algorithms by
  mixing and matching, tools for string
  matching metrics, tools for matching
  tokens within strings.

                     Baseline Task (??)
•   Took 106 Arabic, 105 English texts from newswire articles

•   Took names from these articles, 408 names from English, 255 names from Arabic.

•   manual cross-script matching, got 29 common names (rather than manually coming up with all
    possible transliterations)

•   But to get baseline, tried matching all names in Arabic (transliterated using Atrans by Basis –
    2004) to all names in English, using algorithms from SecondString. Thus, have one standard
    transliteration, and try to match it to all other English transliterations

•   Empirically set threshold to something that yielded good result.

•   R = recall = # correctly matched English names / # available correct English matches in set; what
    percentage of total correct did they get?

•   P = Precision = total # correct names / total # of names returned; what percentage of their
    guesses were accurate?

•   Defined F-score as 2 × (P·R) / (P + R)

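With these definitions, scoring a hypothetical run looks like this (the counts below are made up for illustration, not the paper's results):

```python
def f_score(correct, returned, total_correct):
    # Precision: fraction of returned names that were right.
    p = correct / returned
    # Recall: fraction of the available correct matches that were found.
    r = correct / total_correct
    return 2 * (p * r) / (p + r)

# Hypothetical run: 10 names returned, 8 right, 29 correct matches exist.
print(round(f_score(8, 10, 29), 3))  # ≈ 0.41
```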
             Other Algorithms Used For Comparison
•   Smith–Waterman = Levenshtein
    edit distance, with some
    parameterization of the gap score

•   SLIM = iterative statistical learning
    algorithm based on a variety of
    estimation-maximization in which
    a Levenshtein edit-distance matrix
    is iteratively processed to find the
    statistical probabilities of the
    overlap between two strings.

•   Jaro = n-gram

•   Last one is Edit distance

             Their Enhancements
• Motivation: Arabic letter has more than
  one possible English letter equivalent.
  Also, Arabic transliterations of English
  names not predictable. 6 different ways to
  represent Milosevic in Arabic.

             Some Real World Examples

 Character Equivalence Classes
• Same idea as Editex, except use Ar(si, tj) where s
  is an Arabic word, so si is an Arabic letter, and t is
  an English word, and tj is an English letter.
• So, comparing Arabic to English directly, rather
  than a standard transliteration
• The sets within Ar handle (modified) Buckwalter
  transliteration, the default transliteration of Basis’ Atrans
• Basis’ Atrans uses English digraphs for certain letters

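A sketch of the one-to-many CEQ idea. Latin placeholders stand in for Buckwalter Arabic symbols ('*' is Buckwalter's dhal), and the class contents below are illustrative, not the paper's actual tables:

```python
# Character equivalence classes: one source symbol may match an English
# unigram or digraph at zero cost. '*' = dhal in Buckwalter notation.
CEQ = {
    'q': {'q', 'g', 'k'},
    '*': {'d', 'dd', 'dh', 'z', 'th'},
    'a': {'a', 'e', 'o', 'u'},
    'f': {'f', 'ph'},
    'y': {'y', 'i', 'e', 'ee'},
}

def ceq_distance(src, en):
    INF = float('inf')
    D = [[INF] * (len(en) + 1) for _ in range(len(src) + 1)]
    D[0][0] = 0
    for i in range(len(src) + 1):
        for j in range(len(en) + 1):
            if D[i][j] == INF:
                continue
            if i < len(src):  # drop a source symbol, cost 1
                D[i + 1][j] = min(D[i + 1][j], D[i][j] + 1)
            if j < len(en):   # drop an English letter, cost 1
                D[i][j + 1] = min(D[i][j + 1], D[i][j] + 1)
            if i < len(src):  # match source symbol to a 1- or 2-letter gram
                for k in (1, 2):
                    gram = en[j:j + k]
                    if len(gram) < k:
                        continue
                    ok = gram in CEQ.get(src[i], {src[i]})
                    cost = 0 if ok else (1 if k == 1 else INF)
                    if cost < INF:
                        D[i + 1][j + k] = min(D[i + 1][j + k], D[i][j] + cost)
    return D[len(src)][len(en)]

# Buckwalter-ish source q*afy vs. two English spellings:
print(ceq_distance('q*afy', 'gaddafi'))    # 1 - only the extra 'a' costs
print(ceq_distance('q*afy', 'milosevic'))  # much larger
```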
 Buckwalter Transliteration Scheme
A “scholarly” transliteration scheme, unlikely to be found in newspaper articles:
Wikipedia:The Buckwalter Arabic transliteration was developed at Xerox by Tim Buckwalter in the 1990s. It is an
     ASCII only transliteration scheme, representing Arabic orthography strictly one-to-one, unlike the more common
     romanization schemes that add morphological information not expressed in Arabic script. Thus, for example, a waw
     will be transliterated as w regardless of whether it is realized as a vowel [u:] or a consonant [w]. Only when the waw
     is modified by a hamza ( ‫ )ؤ‬does the transliteration change to &. The unmodified letters are straightforward to read
     (except for maybe *=dhaal and E=ayin, v=thaa), but the transliteration of letters with diacritica and the harakat take
     some time to get used to, for example the nunated i`rab -un, -an, -in appear as N, F, K, and the sukun ("no vowel")
     as o. Ta marbouta ‫ ة‬is p.

      –    hamza
                     –   lone hamza: '
                     –   hamza on alif: >
                     –   hamza on wa: &
                     –   hamza on ya: }
      –    alif
                     –   madda on alif: |
                     –   alif al-wasla: {
                     –   dagger alif: `
                     –   alif maqsura: Y
      –    harakat
                     –   fatha: a
                     –   damma: u
                     –   kasra: i
                     –   fathatayn: F
                     –   dammatayn: N
                     –   kasratayn K
                     –   shadda: ~
                     –   sukun: o
      –    ta marbouta: p
     –   tatwil: _
             The Equivalence Classes

• They normalize Buckwalter and the
  English in the newspaper articles.
• Thus, $ → sh from Buckwalter,
• ph → f in English, eliminate dupes, etc.
• Move vowels from each language closer to
  one another by only retaining matching
  vowels (that is, where they exist in both)

    Why different from Soundex and Editex
• “What we do here is the opposite of the
  approach taken by the Soundex and Editex
  algorithms. They try to reduce the complexity by
  collapsing groups of characters into a single
  super-class of characters. The algorithm here
  does some of that with the steps that normalize
  the strings. However, the largest boost in
  performance is with CEQ, which expands the
  number of allowable cross-language matches for
  many characters.”
                  Machine (Back-)Transliteration
             Kevin Knight and Jonathan Graehl
              University of Southern California

             Machine Transliteration
•   For Translation purposes
•   Foreign words commonly transliterated, using approximate phonemic equivalents
     – “computer” → konpyuuta
•   Problem: Usually, translate by looking up in dictionaries, but these often
    don’t show up in dictionaries
•   Usually not a problem for language pairs like Spanish/English, since they have
    similar alphabets. But languages that are non-alphabetic or use different alphabets
    are more problematic (e.g. Japanese, Arabic)
•   Popular on the Internet: “The Coca-Cola name in China was first read as
    "Ke-kou-ke-la," meaning "Bite the wax tadpole" or "female horse stuffed with
    wax," depending on the dialect. Coke then researched 40,000 characters to
    find a phonetic equivalent to "ko-kou-ko-le," translating into "happiness in the
    mouth." “
•   Solution: Backwards transliteration to get the original word, using a
    generative model

             Machine Transliteration
• Japanese transliterates e.g. English in katakana.
  Foreign names and loan-words.
• Compromises: e.g. golfbag
     – L/R map to same character
     – Japanese has alternating consonant vowel pattern, so
       cannot have consonant cluster LFB
     – Syllabary instead of alphabet.
     – Goruhubaggu
     – Dot separator, but inconsistent, so
       aisukuriimu can be “I scream”
       or “ice cream”
             Back Transliteration
• Going from katakana back to
  original English word
• for translation – katakana not
  found in bilingual dictionaries,
  so just generate original
  English (assuming it is English)
• Yamrom 1994 – pattern
  matching
• Arbabi 1994 – neural net/expert
  system
• Information loss, so not easy to
  invert

     More Difficult Than Forward Transliteration or Romanization
• Forward transliteration
   – several ways to transliterate into katakana, all valid, so you
     might encounter any of them
   – But only one English spelling; can’t say “arture” for “archer”
• Romanization
   – we have seen examples of this;
     the katakana examples above
   – more difficult because of spelling variations
• Certain things cannot be handled by back-transliteration
   – Onomatopoeia
   – Shorthand: e.g. waapuro = word processing

               Desired Features
• Accuracy
• Portability to other languages
• Robust against OCR errors
• Relevant to ASR where speaker
  has heavy accent
• Ability to take context
  (topical/syntactic) into account, or
  at least return a ranked list of candidates
• Really requires 100% knowledge

Learning Approach – Initial Attempt
• Can learn what letters transliterate for what by
  training on corpus of katakana phrases in
  bilingual dictionaries
• Drawbacks:
     – with the naïve approach, how can we make sure we get a
       normal transliteration?
     – E.g. we can get iskrym as a back transliteration for aisukuriimu
     – Take letter frequency into account! So can get isclim
     – Restrict to real words! Is crime.
     – We want ice cream!
      Modular Learning Approach
Build a generative model of the transliteration process:

1.     English phrase is written
2.     Translator pronounces it in English
3.     Pronunciation modified to fit Japanese sound inventory
4.     Sounds are converted into katakana
5.     Katakana is written

Solve and coordinate solutions to these subproblems, use
    generative models in reverse direction
Use probabilities and Bayes Rule
                     Bayes’ Rule Example
Example #1: Conditional probabilities – from Wikipedia
Suppose there are two bowls full of cookies. Bowl #1 has 10 chocolate chip cookies and 30 plain cookies, while bowl #2
      has 20 of each. Fred picks a bowl at random, and then picks a cookie at random. We may assume there is no
      reason to believe Fred treats one bowl differently from another, likewise for the cookies. The cookie turns out to be
      a plain one. How probable is it that Fred picked it out of bowl #1?
Intuitively, it seems clear that the answer should be more than a half, since there are more plain cookies in bowl #1.
      The precise answer is given by Bayes's theorem. But first, we can clarify the situation by rephrasing the question
      to "what’s the probability that Fred picked bowl #1, given that he has a plain cookie?” Thus, to relate to our
      previous explanation, the event A is that Fred picked bowl #1, and the event B is that Fred picked a plain cookie.
      To compute Pr(A|B), we first need to know:
Pr(A), or the probability that Fred picked bowl #1 regardless of any other information. Since Fred is treating both bowls
      equally, it is 0.5.
Pr(B), or the probability of getting a plain cookie regardless of any information on the bowls. In other words, this is the
      probability of getting a plain cookie from each of the bowls. It is computed as the sum of the probability of getting a
      plain cookie from a bowl multiplied by the probability of selecting this bowl. We know from the problem statement
      that the probability of getting a plain cookie from bowl #1 is 0.75, and the probability of getting one from bowl #2 is
      0.5, and since Fred is treating both bowls equally the probability of selecting any one of them is 0.5. Thus, the
      probability of getting a plain cookie overall is 0.75×0.5 + 0.5×0.5 = 0.625.
Pr(B|A), or the probability of getting a plain cookie given that Fred has selected bowl #1. From the problem statement,
      we know this is 0.75, since 30 out of 40 cookies in bowl #1 are plain.
Given all this information, we can compute the probability of Fred having selected bowl #1 given that he got a plain
      cookie, as such:
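The computation itself (reconstructing the formula that appears as an image in the original slide):

```latex
\Pr(A \mid B) = \frac{\Pr(B \mid A)\,\Pr(A)}{\Pr(B)}
             = \frac{0.75 \times 0.5}{0.625} = 0.6
```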

As we expected, it is more than half.

     Application To Task At Hand
English Phrase Generator produces word sequences
  according to probability distribution P(w)
English Pronouncer probabilistically assigns a set of
  pronunciations to word sequence, according to P(p|w)
Given pronunciation p, find the word sequence w that maximizes P(w|p)
Based on Bayes’ Rule: P(w|p) = P(p|w) * P(w) / P(p)
But P(p) will be the same regardless of the specific word
  sequence, so can just search for word sequence that
  maximizes P(p|w) * P(w), which are the two distributions
  we just modeled

       Five Probability Distributions
Extending this notion, built 5 probability distributions

1.       P(w) – generates written English word sequences
2.       P(e|w) – pronounces English word sequences
3.       P(j|e) – converts English sounds into Japanese sounds
4.       P(k|j) – converts Japanese sounds into katakana writing
5.       P(o|k) – introduces misspellings caused by OCR

Parallels 5 steps above
1.        English phrase is written
2.        Translator pronounces it in English
3.        Pronunciation modified to fit Japanese sound inventory
4.        Sounds are converted into katakana
5.        Katakana is written

Given katakana string o observed by OCR, we wish to maximize:
P(w) * P(e|w) * P(j|e) * P(k|j) * P(o | k)                         over all e, j, k

Why? Let’s say we have e and want the most probable w given e – that is, P(w|e); we would maximize P(w) * P(e|w) / P(e).
Let us say we had j and want the most probable e given j – that is, P(e|j); we would maximize P(e) * P(j|e) / P(j).
Note that while usually we ignore the divisor, here the intermediate terms cancel when the models are chained: P(e) / P(e) = 1.
And so on for each in turn.

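The chained maximization can be sketched with toy tables. All strings and probabilities below are invented for illustration (katakana shown in romaji); the real models are the WFSTs described next:

```python
# Toy conditional-probability tables, keyed (output, input).
P_w = {'ice cream': 0.10, 'I scream': 0.06, 'is crime': 0.02}
P_e_w = {('AYS KRIYM', 'ice cream'): 0.9,   # word seq -> English sounds
         ('AY SKRIYM', 'I scream'): 0.9,
         ('IHZ KRAYM', 'is crime'): 0.9}
P_j_e = {('a i s u k u r i i m u', 'AYS KRIYM'): 0.6,  # -> Japanese sounds
         ('a i s u k u r i i m u', 'AY SKRIYM'): 0.6,
         ('i s u k u r a i m u', 'IHZ KRAYM'): 0.6}
P_k_j = {('aisukuriimu', 'a i s u k u r i i m u'): 1.0,  # -> katakana (romaji)
         ('isukuraimu', 'i s u k u r a i m u'): 1.0}
P_o_k = {('aisukuriimu', 'aisukuriimu'): 0.95,           # -> OCR output
         ('aisukuriimu', 'isukuraimu'): 0.05}

o = 'aisukuriimu'  # the observed (OCR'd) string

# Maximize P(w)P(e|w)P(j|e)P(k|j)P(o|k) over all w, e, j, k.
best_w, best_p = None, 0.0
for (e, w), pe in P_e_w.items():
    for (j, e2), pj in P_j_e.items():
        if e2 != e:
            continue
        for (k, j2), pk in P_k_j.items():
            if j2 != j:
                continue
            p = P_w[w] * pe * pj * pk * P_o_k.get((o, k), 0.0)
            if p > best_p:
                best_w, best_p = w, p

print(best_w)  # 'ice cream' wins: high P(w) and a clean OCR match
```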
    Implementation of the probability
P(w) as WFSA (weighted finite state acceptor), others as
  WFST (transducers)
WFSA = state transition diagram with both symbols and
  weights on the transitions, such that some transitions
  more likely than others
WFST = the same, but with both input and output symbols
Implemented composition algorithm to yield P(x|z) from
  models P(x|y) and P(y|z), treating WFSAs simply as
  WFST with identical input and output
Yields one large WFSA; use Dijkstra’s shortest-path
  algorithm to extract the most probable path
No pruning; use the Viterbi approximation, searching for the
  best path through the WFSA rather than the best sequence

  First Model – Word Sequences
• “ice cream” > “ice crème” > “aice kreme”
• Unigram scoring mechanism which multiplies
  scores of known words and phrases in a
  candidate sequence
• Corpus: WSJ corpus + online English name
  list + online gazetteer of place names
• Should really e.g. ignore auxiliaries and favor
  surnames. Approximate by removing high
  frequency words

Model 2 – Eng Word Sequences →
     Eng Sound Sequences
•   Use English phoneme inventory
    from CMU Pronunciation
    Dictionary, minus stress marks

•   40 sounds: 14 vowel sounds, 25
    consonant sounds (e.g. K, HH, R),
    additional symbol PAUSE

•   Dictionary has 100,000 (125,000)
    word pronunciation

•   Used top 50,000 words because
    of memory limitations

•   Capital letters – Eng sounds;
    lowercase words – Eng words

             Example Second WFST

Note: Why not letters instead of phonemes? Doesn’t match Japanese
transliteration mispronunciation, and that is modeled in next step.
             Model 3: English Sounds →
                Japanese Sounds
• Information-losing process: R, L → r; 14 vowels → 5 Japanese vowels
• Identify Japanese sound inventory
• Build WFST to perform the sequence mapping
• Japanese sound inventory has 39 symbols: 5 vowels, 33 consonants
  (including doubled kk), special symbol pause.
• (P R OW PAUSE S AA K ER) (pro-soccer) maps to (p u r o pause s
  a kk a a)
• Use machine learning to train WFST from 8000 pairs of
  English/Japanese sound sequences (for example, soccer). Created
  this corpus by modifying an English/katakana dictionary, converting
    into these sounds; used the EM (expectation-maximization) algorithm to
  generate symbol matching probabilities. See table on next page

             The EM Algorithm

                      Note: pays no heed to context
    Model 4: Japanese sounds → Katakana

• Manually construct two WFSTs.
• #1 just merges
  sequential doubled
  sounds into single
  sound. o o  oo
• #2 just does mapping,
  accounting for
  different spelling
  variation. e.g.

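Step #1 of this model, as a minimal sketch:

```python
def merge_doubled(sounds):
    # Merge sequential doubled sounds into one symbol: ['o', 'o'] -> ['oo'].
    out = []
    for s in sounds:
        if out and out[-1] == s:
            out[-1] = s + s
        else:
            out.append(s)
    return out

print(merge_doubled('p u r o o s a k k a a'.split()))
# ['p', 'u', 'r', 'oo', 's', 'a', 'kk', 'aa']
```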
       Model 5: katakana → OCR

          Transliteration of Proper
         Names in Cross-Lingual
         Information Retrieval
             Paola Virga, Sanjeev Khudanpur
                Johns Hopkins University

•      For MT, for IR, specifically cross-language IR
•      Names important, particularly for short queries
•      Transliteration: writing a name in a foreign language, preserving the way it sounds

1.     Render English name in phonemic form
2.     Convert phonemic string into foreign orthography, e.g. Mandarin Chinese

•      Mentions back transliteration for Japanese, and application to Arabic, by
       Knight et al.
•      For Korean, strongly phonetic orthography allows good transliteration
       using simple HMMs
•      Hand-crafted rules to change English spelling to accord to Mandarin
       syllabification, then learns to convert English phoneme sequence to
       Mandarin syllable sequence.
•      They extend the previous, making it fully data-driven rather than relying
       on hand-crafted rules, to accomplish English → Mandarin transliteration

               Four steps in transliteration
1.         English → Phonetic English (using Festival)
      1.       Festival – free, source available, multilingual, interfaces to shell, Scheme, Java, C++, Emacs (see next slide)
2.         English phoneme → initials and finals
3.         Initial + final sequence → pin-yin symbols
               Wikipedia: Pinyin is a system of romanization (phonemic notation and transcription to Roman script) for
               Standard Mandarin, where pin means "spell" and yin means "sound". …
      –        Pinyin is a romanization and not an anglicization; that is, it uses Roman letters to
               represent sounds in Standard Mandarin. The way these letters represent sounds in Standard Mandarin
               will differ from how other languages that use the Roman alphabet represent sound. For example, the
               sounds indicated in pinyin by b and g are not as heavily voiced as in the Western use of the Latin script.
               Other letters, like j, q, x or zh indicate sounds that do not correspond to any exact sound in English.
               Some of the transcriptions in pinyin, such as the ang ending, do not correspond to English
               pronunciations, either.
      –        By letting Roman characters refer to specific Chinese sounds, pinyin produces a compact and accurate
               romanization, which is convenient for native Chinese speakers and scholars. However, it also means
               that a person who has not studied Chinese or the pinyin system is likely to severely mispronounce
               words, which is a less serious problem with some earlier romanization systems such as Wade-Giles.
      –        Different from katakana
4.         Pin-yin → Chinese character sequence

1, 3: deterministic; 2, 4: statistical

             Noisy Channel Model
• We had this concept before

• Think of e as an i-word English sentence output
  from the noisy channel, c as a j-word Chinese input
  into the noisy channel. Except the “words” here are
  phonemes and syllables
• Find most likely Chinese sentence to have
  generated English output. Use Bayes’ rule.

         How to train and use the transliteration
           system – see next slide

• Got from the authors of [3] {+ [4]} their training data
• 3875 English names, Chinese
  transliterations, pin-yin counterparts, +
  used Festival to generate phonemic
  English, + pronunciation of pinyin
  based on Initial/Final inventory from
  Mandarin phonology text
• First corpus: lines 2, 3
• Second corpus: lines 4, 5

• Compare to [4], Do more general test

      Spoken Document Retrieval
• Infrastructure developed at Johns Hopkins
  Summer Workshop – Mandarin audio to
  be searched using English text queries
• English proper names unavailable in the
  translation lexicon, thus ignored during retrieval
• Improved mean average precision by
  adding name transliteration (from 0.501 to …)
      Statistical Transliteration for
         English-Arabic Cross
          Language Information Retrieval
             Nasreen AbdulJaleel, Leah Larkey

• For IR
• Motivation – not proper nouns but rather OOV (out of
  vocabulary) words – when have no corresponding word
  in dictionary, simply transliterate it
• Though they train the English-to-Arabic transliteration model from
  pairs of names
• Selected n-gram model
     – Two stage training model
     – Learn which n-gram segments should be added to unigram
       inventory for source language
     – Then learn translation model over this inventory
     – + No need for heuristics
     – + No need for knowledge of either language

                       The Problem
• OOV words – a problem in cross-language information retrieval
     –   Named entities
     –   Numbers
     –   Technical terms
     –   Acronyms
• These compose significant portion of OOV, and when
  named entity translation not available, reduction in
  average precision of 50%
• Variability of spelling foreign words, e.g. Qaddafi from before
• OK to use own spelling in foreign language when share
  same alphabet (e.g. Italian, Spanish, German), but not
  when has different alphabet. Then transliteration.
       Multiple Spellings In Arabic

• Thus, useful to have way to generate multiple spellings in
  Arabic from single source
• Use statistical transliteration to generate – no heuristics, no
  linguistic knowledge
• “Statistical transliteration is special case of statistical
  translation, in which the words are letters.”

      Selected N-gram transliteration
• Generative statistical model, producing string of Arabic
  chars from string of English chars
• Model: set of conditional probability distributions over
  Arabic chars and NULL
• Each English char n-gram ei can be mapped to Arabic
  char or sequence of chars ai with probability P(ai|ei)
• Most probabilities are 0, in practice.
• Probabilities of s, z, tz

• Also, English source symbol inventory has, besides
  unigrams (such as single letters), some end symbols and
  n-grams such as sh, bb, eE
                    Training of Model
•   From lists of English/Arabic name pairs
•   2 alignment stages
     – 1: to select n-grams for the model
     – 2: Determine translation probabilities for the n-grams
•   Used GIZA++ for letter alignment rather than word alignment, treating
    letters as words
•   Corpus: 125,000 English proper nouns and Arabic translations, retaining
    only those existing in AP news article corpus
•   Some normalization – made lowercase, prefixed with B and ended with E
•   Alignment 1: Align using GIZA++, count instances in which English char
    sequence aligned to single Arabic character. Take top 50 of these n-grams
    and add to English symbol inventory
•   Resegment based on new inventory, using greedy-ish method
     – Ashcroft → a sh c r o f t
•   Alignment #2, using GIZA++
•   Count up alignments, use them as conditional probabilities, removing
    alignments with probability threshold of 0.01

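The greedy-ish resegmentation can be sketched as follows (the inventory contents below are illustrative; the paper's top-50 n-grams come out of the first alignment):

```python
def segment(word, inventory, max_n=2):
    # Greedy left-to-right longest match against the symbol inventory;
    # single letters are always available as a fallback.
    out, i = [], 0
    while i < len(word):
        for n in range(max_n, 0, -1):
            gram = word[i:i + n]
            if n == 1 or gram in inventory:
                out.append(gram)
                i += n
                break
    return out

INVENTORY = {'sh', 'ch', 'bb', 'dd', 'ck'}   # illustrative n-grams
print(segment('ashcroft', INVENTORY))  # ['a', 'sh', 'c', 'r', 'o', 'f', 't']
```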
             Generation of Arabic
• Take English word ew.
• Segment, greedily (?), from the n-gram inventory
• All possible transliterations wa are generated
• Rank according to probabilities, by their product
• Ran experiments, improvement over
  unigram only. Etc.
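The generate-and-rank step can be sketched as follows (Latin placeholders for Arabic symbols; the segmentation and probability table are invented for illustration):

```python
from itertools import product

# P(arabic_gram | english_gram); '' = NULL (English gram dropped).
PROBS = {'sh': {'$': 0.9, 's': 0.1},
         'a': {'A': 0.6, '': 0.4}}

def transliterations(segments):
    # Cartesian product over per-segment options, scored by the product
    # of the conditional probabilities, best first.
    options = [PROBS[seg].items() for seg in segments]
    cands = []
    for combo in product(*options):
        word = ''.join(ar for ar, _ in combo)
        p = 1.0
        for _, prob in combo:
            p *= prob
        cands.append((word, p))
    return sorted(cands, key=lambda c: -c[1])

for word, p in transliterations(['sh', 'a']):
    print(word, p)   # best candidate '$A' first (0.9 * 0.6)
```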
