Machine Transliteration
Joshua Waxman
11/17/2011 1
Overview
• Words from a language written in alphabet A,
rendered in a language written in alphabet B
• שלום → “shalom”
• Importance for MT, for cross-language IR
• Forward transliteration, Romanization,
back-transliteration
Is there a convergence towards
standards?
Perhaps for really famous names. But even such standard names have multiple acceptable spellings.
Whether anyone regulates such spellings probably depends on the culture. In the meantime, there is
a lot of variance, especially on the Web. E.g. the holiday of Succot: סוכות, סֻכּוֹת
Causes: pronunciation varies culturally across groups (soo-kot, suh-kes), i.e. dialect, and writers differ in
how they transliterate individual Hebrew letters (kk, cc, gemination).
• Sukkot: 7.1 million
• Succot: 173 thousand
• Succos: 153 thousand
• Sukkoth: 113 thousand
• Succoth: 199 thousand
• Sukos: 112 thousand
• Sucos: 927 thousand, but probably almost none related to holiday
• Sucot: 101 thousand. Spanish transliteration of holiday
• Sukkes: 1.4 thousand. Yiddish rendition
• Succes: 68 million. Misspelling of “success”
• Sukket: 45 thousand. But not Yiddish, because wouldn’t have “t” ending
Recently in the news: AP: Emad Borat; Arutz Sheva: Imad Muhammad Intisar Boghnat
Can we enforce standards?
• Would make task easier.
• News articles, perhaps
• However:
– Would they listen to us?
– Does the standard make sense across the
board? Once again, dialectal differences. E.g.
ת, ה, vowels. Also, fold-over of the alphabet: ע-א,
ק-כ, ח-כ, ת-ט, ת-ס
– 2N for N languages
Four Papers
• “Cross Linguistic Name Matching in English and Arabic”
– For IR – search. Fuzzy string matching. Modification of Soundex
to use cross-language mapping, using character equivalence
classes
• “Machine Transliteration”
– For Machine translation. Back transliteration. 5 steps in
transliteration. Use Bayes’ rule
• “Transliteration of Proper Names in Cross-Language
Applications”
– Forward transliteration, purely statistical based
• “Statistical Transliteration for English-Arabic Cross
Language Information Retrieval”
– Forward transliteration. For IR, generating every possible
transliteration, then evaluate. Using selected n-gram model
Cross Linguistic Name
Matching in English and Arabic
A “One to Many Mapping” Extension of the
Levenshtein Edit Distance Algorithm
Dr. Andrew T. Freeman, Dr. Sherri L. Condon and
Christopher M. Ackerman
The Mitre Corporation
Cross Linguistic Name Matching
• What?
– Match personal names in English to the same names in Arabic script.
• Why is this not a trivial problem?
– There are multiple transcription schemes, so it is not one-to-one
– e.g. معمر القذافي can be Muammar Gaddafi, Muammar Qaddafi, Moammar
Gadhafi, Muammar Qadhafi, Muammar al Qadhafi
– because certain consonants and vowels can be represented multiple
ways in English
– note: Arabic is just an example of this phenomenon
– so standard string comparison insufficient
• For What purpose?
– For search on, say, news articles. How do you match all occurrences of
“Qadhafi”
• Their solution
– Enter the search term in Arabic, use Character Equivalence Classes
(CEQ) to generate possible transliterations, supplement the Levenshtein
Edit Distance Algorithm
Elaboration on Multiple
Transliteration Schemes
• Why?
– No standard English
phoneme
corresponding to
Arabic /q/
– Different dialects – in
Libya, this is
pronounced [g]
– note: Similar for
Hebrew dialects
Fuzzy string matching
• def: matching strings based on similarity
rather than identity
• Examples:
– edit-distance
– n-gram matching
– normalization procedures like Soundex.
Survey of Fuzzy Matching Methods - Soundex
• Soundex
– Odell and Russell, 1918
• Some obvious pluses:
– (not mentioned explicitly by paper)
– we eliminate vowels, so Moammar/Muammar
not a problem
– Groups of letters will take care of different
English letters corresponding to Arabic
– Elimination of repetition and of h will remove
gemination/fricatives
• Some minuses
– Perhaps dialects will transgress Soundex
phonetic code boundaries. E.g. ת in Hebrew can
be t, th, s. ח can be ch or h. Is a ו to be w or v?
But could modify algorithm to match.
– note al in al-Qadafi
– Perhaps would match too many inappropriate
results
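The pluses and minuses above can be checked with a small sketch of classic American Soundex. This is a generic textbook version, not the code used in the paper; the function name `soundex` is an assumption made for illustration.

```python
# Minimal sketch of classic American Soundex (illustrative, not the paper's code).
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the 4-character Soundex code of a name written in Latin letters."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    out = [first.upper()]          # the first letter is kept as-is
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue               # h and w do not break a run of same-coded letters
        code = CODES.get(ch, "")   # vowels get no code
        if code and code != prev:
            out.append(code)
        prev = code                # a vowel resets the run, so a repeat is coded again
    return ("".join(out) + "000")[:4]


print(soundex("Muammar"), soundex("Moammar"))   # vowels are ignored
print(soundex("Gaddafi"), soundex("Gadhafi"))   # gemination and h are ignored
print(soundex("Gaddafi"), soundex("Qaddafi"))   # first letter is kept, so G/Q differ
```

Because the first letter is kept verbatim, Gaddafi and Qaddafi still receive different codes, which is one concrete case of spellings crossing Soundex boundaries.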
Noisy Channel Model
Levenshtein Edit Distance
• AKA Minimum Edit Distance
– Minimum number of operations of insertion, deletion, substitution.
Cost per operation = 1
– Via dynamic programming
– Example taken from Jurafsky and Martin, but with corrections
– Minimum of diagonal + subst, or down/left + insertion/deletion cost
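The recurrence above can be sketched as a short dynamic program. The function name and the `sub_cost` / `indel_cost` parameters are assumptions made for illustration; the intention/execution pair is the Jurafsky and Martin example used on the following slides.

```python
def min_edit_distance(source, target, sub_cost=1, indel_cost=1):
    """Minimum edit distance via dynamic programming.

    d[i][j] is the cheapest way to turn source[:i] into target[:j].
    """
    m, n = len(source), len(target)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * indel_cost           # delete every source char
    for j in range(1, n + 1):
        d[0][j] = j * indel_cost           # insert every target char
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = source[i - 1] == target[j - 1]
            diagonal = d[i - 1][j - 1] + (0 if same else sub_cost)
            d[i][j] = min(diagonal,
                          d[i - 1][j] + indel_cost,   # deletion
                          d[i][j - 1] + indel_cost)   # insertion
    return d[m][n]


print(min_edit_distance("intention", "execution", sub_cost=2))  # 8
print(min_edit_distance("intention", "execution", sub_cost=1))  # 5
```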
Minimum Edit Distance Example
(substitution cost = 2)
N   9  8  9 10 11 12 11 10  9  8
O   8  7  8  9 10 11 10  9  8  9
I   7  6  7  8  9 10  9  8  9 10
T   6  5  6  7  8  9  8  9 10 11
N   5  4  5  6  7  8  9 10 11 10
E   4  3  4  5  6  7  8  9 10  9
T   3  4  5  6  7  8  7  8  9  8
N   2  3  4  5  6  7  8  7  8  7
I   1  2  3  4  5  6  7  6  7  8
#   0  1  2  3  4  5  6  7  8  9
    #  E  X  E  C  U  T  I  O  N
Minimum Edit Distance Example
(substitution cost = 1)
N   9  8  8  8  8  8  8  7  6  5
O   8  7  7  7  7  7  7  6  5  6
I   7  6  6  6  6  6  6  5  6  7
T   6  5  5  5  5  5  5  6  7  8
N   5  4  4  4  4  5  6  7  7  7
E   4  3  4  3  4  5  6  6  7  8
T   3  3  3  3  4  5  5  6  7  8
N   2  2  2  3  4  5  6  7  7  7
I   1  1  2  3  4  5  6  6  7  8
#   0  1  2  3  4  5  6  7  8  9
    #  E  X  E  C  U  T  I  O  N
Minimum Edit Distance
• Score of 0 = perfect match, since no edit ops
• s of len m, t of len n
• Fuzzy match: divide the edit score by the length of the shorter (or longer) string, and take 1 –
this number as the similarity. Set a threshold above which strings count as a match. Then longer pairs of
strings are more likely to be matched than shorter pairs with the same
number of edits, because the score reflects the percentage of chars that need ops. Otherwise, “A”
vs. “I” has the same edit distance as “tuning” vs. “turning.”
• Good algorithm for fuzzy string comparison – can see that Muammar
Gaddafi, Muammar Qaddafi, Moammar Gadhafi, Muammar Qadhafi,
Muammar al Qadhafi are relatively close.
• But, don’t really want substitution cost of G/Q, O/U, DD/DH, certain
insertion/deletion costs. That is why they supplement it with these Character
Equivalence Classes (CEQ), which we’ll get to a bit later.
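A minimal sketch of the length-normalized fuzzy match described above, assuming unit costs, normalization by the longer string, and an illustrative threshold of 0.8 (the slides do not state the threshold actually used). The function names are hypothetical.

```python
def edit_distance(s, t):
    """Levenshtein distance with unit costs, keeping only two rows."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j - 1] + (a != b),  # substitution or match
                           prev[j] + 1,             # deletion
                           cur[j - 1] + 1))         # insertion
        prev = cur
    return prev[-1]


def similarity(s, t):
    """1 minus (edits / length of the longer string); 1.0 means identical."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(s, t) / longest


def is_match(s, t, threshold=0.8):
    # threshold is an assumed value, chosen only for this illustration
    return similarity(s, t) >= threshold


print(similarity("A", "I"))             # one edit on a one-char string
print(similarity("tuning", "turning"))  # one edit on a seven-char string
```

Both pairs are one edit apart, but only the longer pair clears the threshold, which is the point of normalizing.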
Editex
• Zobel and Dart (1996) – Soundex + Levenshtein Edit Distance
• Replace e(si, tj), which was basically 1 if unequal, 0 if equal (that is, the cost of an
op), with r(si, tj), which makes use of Soundex equivalences: 0 if identical, 1
if in the same group, 2 if different.
• Also neutralizes h and w in general. Show example based on chart from
before. When initializing or calculating the cost of insertion/deletion, h and w do not
count; otherwise the cost is 1.
• Other enhancements to standard Soundex and Edit distance for the
purpose of comparison. e.g. tapering – (counts less later in the word);
phonometric methods – input strings mapped to phonemic representations.
E.g. rough.
• The authors say Editex performed better than Soundex, Min Edit Distance, counting n-gram
sequences, and ~10 permutations of tapering and phonometric enhancements to
standard algorithms
SecondString (Tool)
• Java based implementation of many of
these string matching algorithms. They
use this for comparison purposes. Also,
SecondString allows hybrid algorithms by
mixing and matching, tools for string
matching metrics, tools for matching
tokens within strings.
Baseline Task (??)
• Took 106 Arabic, 105 English texts from newswire articles
• Took names from these articles, 408 names from English, 255 names from Arabic.
• manual cross-script matching, got 29 common names (rather than manually coming up with all
possible transliterations)
• But to get baseline, tried matching all names in Arabic (transliterated using Atrans by Basis –
2004) to all names in English, using algorithms from SecondString. Thus, have one standard
transliteration, and try to match it to all other English transliterations
• Empirically set threshold to something that yielded good result.
• R = recall = # correctly matched English names / # available correct English matches in set; what
percentage of total correct did they get?
• P = Precision = total # correct names / total # of names returned; what percentage of their
guesses were accurate?
• Defined F-score as 2 × (P × R) / (P + R)
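The three measures above can be sketched as follows. The function name and the toy name sets are assumptions for illustration and are not data from the paper.

```python
def precision_recall_f(returned, correct):
    """Precision, recall and F-score for a set of returned matches.

    returned: names the matcher proposed; correct: names that truly match.
    """
    returned, correct = set(returned), set(correct)
    hits = len(returned & correct)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(correct) if correct else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f_score


# Toy example: 4 names returned, 2 of them right, 3 right answers available.
p, r, f = precision_recall_f({"Qaddafi", "Gadhafi", "Qatar", "Gandhi"},
                             {"Qaddafi", "Gadhafi", "Qadhafi"})
print(p, r, f)
```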
Other Algorithms Used For
Comparison
• Smith–Waterman = Levenshtein
Edit, with some parameterization
of gap score
• SLIM = iterative statistical learning
algorithm based on a variety of
expectation-maximization in which
a Levenshtein edit-distance matrix
is iteratively processed to find the
statistical probabilities of the
overlap between two strings.
• Jaro = n-gram
• Last one is Edit distance
Their Enhancements
• Motivation: Arabic letter has more than
one possible English letter equivalent.
Also, Arabic transliterations of English
names not predictable. 6 different ways to
represent Milosevic in Arabic.
Some Real World
Knowledge
Character Equivalence Classes
• Same idea as Editex, except use Ar(si, tj) where s
is an Arabic word, so si is an Arabic letter, and t is
an English word, and tj is an English letter.
• So, comparing Arabic to English directly, rather
than a standard transliteration
• The sets within Ar handle the (modified) Buckwalter
transliteration, the default transliteration of Basis’
software
• Basis’ software uses English digraphs for certain letters
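A rough sketch of how an equivalence-class cost could replace the plain equal/unequal test inside edit distance. The classes in `CEQ` are hypothetical stand-ins chosen for illustration (the paper's actual table is not reproduced in these notes), and digraphs are left out to keep the sketch to single characters.

```python
# HYPOTHETICAL equivalence classes: transliterated Arabic symbol -> English
# letters that are accepted as a free match. Not the paper's actual table.
CEQ = {
    "q": {"q", "g", "k"},
    "w": {"w", "u", "o"},
    "y": {"y", "i", "e"},
}


def ceq_cost(arabic_sym, english_letter):
    """0 if the English letter is identical or in the symbol's class, else 1."""
    if arabic_sym == english_letter or english_letter in CEQ.get(arabic_sym, ()):
        return 0
    return 1


def ceq_edit_distance(arabic, english):
    """Edit distance where substitution cost comes from ceq_cost."""
    prev = list(range(len(english) + 1))
    for i, a in enumerate(arabic, 1):
        cur = [i]
        for j, e in enumerate(english, 1):
            cur.append(min(prev[j - 1] + ceq_cost(a, e),  # class-aware substitution
                           prev[j] + 1,                   # deletion
                           cur[j - 1] + 1))               # insertion
        prev = cur
    return prev[-1]


print(ceq_edit_distance("qadafi", "gadafi"))  # q/g costs nothing under the toy classes
print(ceq_edit_distance("qadafi", "badafi"))  # q/b is still a real substitution
```

The match is deliberately one-directional (Arabic symbol to a set of English letters), mirroring the Ar(si, tj) idea of comparing across scripts rather than within one.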
Buckwalter Transliteration Scheme
A “scholarly” transliteration scheme, unlikely to be found in newspaper articles:
Wikipedia: The Buckwalter Arabic transliteration was developed at Xerox by Tim Buckwalter in the 1990s. It is an
ASCII only transliteration scheme, representing Arabic orthography strictly one-to-one, unlike the more common
romanization schemes that add morphological information not expressed in Arabic script. Thus, for example, a waw
will be transliterated as w regardless of whether it is realized as a vowel [u:] or a consonant [w]. Only when the waw
is modified by a hamza (ؤ) does the transliteration change to &. The unmodified letters are straightforward to read
(except for maybe *=dhaal and E=ayin, v=thaa), but the transliteration of letters with diacritics and the harakat take
some time to get used to, for example the nunated i`rab -un, -an, -in appear as N, F, K, and the sukun ("no vowel")
as o. Ta marbouta ة is p.
• hamza
– lone hamza: '
– hamza on alif: >
– hamza on wa: &
– hamza on ya: }
• alif
– madda on alif: |
– alif al-wasla: {
– dagger alif: `
– alif maqsura: Y
• harakat
– fatha: a
– damma: u
– kasra: i
– fathatayn: F
– dammatayn: N
– kasratayn: K
– shadda: ~
– sukun: o
• ta marbouta: p
• tatwil: _
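The symbols listed above can be written as a lookup table. This sketch covers only the characters named on this slide, not the full Buckwalter scheme, and the function name `to_buckwalter_listed` is an assumption. Unicode escapes are used so the table does not depend on right-to-left rendering.

```python
# Partial Buckwalter table: only the characters listed on the slide.
BUCKWALTER_LISTED = {
    "\u0621": "'",   # lone hamza
    "\u0623": ">",   # hamza on alif
    "\u0624": "&",   # hamza on waw
    "\u0626": "}",   # hamza on ya
    "\u0622": "|",   # madda on alif
    "\u0671": "{",   # alif al-wasla
    "\u0670": "`",   # dagger alif
    "\u0649": "Y",   # alif maqsura
    "\u064E": "a",   # fatha
    "\u064F": "u",   # damma
    "\u0650": "i",   # kasra
    "\u064B": "F",   # fathatayn
    "\u064C": "N",   # dammatayn
    "\u064D": "K",   # kasratayn
    "\u0651": "~",   # shadda
    "\u0652": "o",   # sukun
    "\u0629": "p",   # ta marbouta
    "\u0640": "_",   # tatwil
}


def to_buckwalter_listed(text):
    """Map the listed characters to Buckwalter symbols; leave all others unchanged."""
    return "".join(BUCKWALTER_LISTED.get(ch, ch) for ch in text)


print(to_buckwalter_listed("\u0629"))  # ta marbouta becomes p
```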
The Equivalence
Classes
Normalization
• They normalize Buckwalter and the
English in the newspaper articles.
• Thus, $ → sh from Buckwalter,
• ph → f in English, eliminate dupes, etc.
• Move vowels from each language closer to
one another by only retaining matching
vowels (that is, where exist in both)
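A minimal sketch of the first normalization steps above: the $ → sh and ph → f rewrites plus duplicate removal. The vowel-matching step is left out because these notes do not spell out its rule, and the function names are assumptions.

```python
import re


def collapse_duplicates(s):
    """Collapse runs of a repeated character, e.g. 'dd' becomes 'd'."""
    return re.sub(r"(.)\1+", r"\1", s)


def normalize_buckwalter(s):
    """Rewrite Buckwalter $ as the digraph sh, then drop duplicates.

    Case is preserved because Buckwalter symbols are case-sensitive.
    """
    return collapse_duplicates(s.replace("$", "sh"))


def normalize_english(s):
    """Lowercase, rewrite ph as f, then drop duplicates."""
    return collapse_duplicates(s.lower().replace("ph", "f"))


print(normalize_english("Qaddafi"))   # qadafi
print(normalize_english("Philip"))    # filip
print(normalize_buckwalter("$"))      # sh
```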
Why different from Soundex and
Editex
• “What we do here is the opposite of the
approach taken by the Soundex and Editex
algorithms. They try to reduce the complexity by
collapsing groups of characters into a single
super-class of characters. The algorithm here
does some of that with the steps that normalize
the strings. However, the largest boost in
performance is with CEQ, which expands the
number of allowable cross-language matches for
many characters.”
Machine (Back-)
Transliteration
Kevin Knight and Jonathan Graehl
University of Southern California
Machine Transliteration
• For Translation purposes
• Foreign Words commonly transliterated, using approximate phonemic
equivalents
– “computer” → konpyuuta
• Problem: Usually, translate by looking up in dictionaries, but these often
don’t show up in dictionaries
• Usually not a problem for some languages, like Spanish/English, since have
similar alphabets. But non-alphabetic languages or with different alphabets,
more problematic. (e.g. Japanese, Arabic)
• Popular on the Internet: “The Coca-Cola name in China was first read as
"Ke-kou-ke-la," meaning "Bite the wax tadpole" or "female horse stuffed with
wax," depending on the dialect. Coke then researched 40,000 characters to
find a phonetic equivalent to "ko-kou-ko-le," translating into "happiness in the
mouth."”
• Solution: Backwards transliteration to get the original word, using a
generative model
Machine Transliteration
• Japanese transliterates e.g. English in katakana.
Foreign names and loan-words.
• Compromises: e.g. golfbag
– L/R map to same character
– Japanese has alternating consonant vowel pattern, so
cannot have consonant cluster LFB
– Syllabary instead of alphabet.
– Goruhubaggu
– Dot separator, but inconsistent, so
aisukuriimu can be “I scream”
or “ice cream”
Back Transliteration
• Going from katakana back to
original English word
• for translation – katakana not
found in bilingual dictionaries,
so just generate original
English (assuming it is English)
• Yamrom 1994 – pattern
matching – ***
• Arbabi 1994 – neural net/expert
system ***
• Information loss, so not easy to
invert
More Difficult Than
• Forward transliteration
– several ways to transliterate into katakana, all valid, so you
might encounter any of them
– But only one English spelling; can’t say “arture” for “archer”
• Romanization
– we have seen examples of this;
the katakana examples above
– more difficult because of spelling variations
• Certain things cannot be handled by back-transliteration
– Onomatopoeia
– Shorthand: e.g. waapuro = word processing
Desired Features
• Accuracy
• Portability to other languages
• Robust against OCR errors
• Relevant to ASR where speaker
has heavy accent
• Ability to take context
(topical/syntactic) into account, or
at least return ranked list of
possibilities
• Really requires 100% knowledge
Learning Approach – Initial Attempt
• Can learn what letters transliterate for what by
training on corpus of katakana phrases in
bilingual dictionaries
• Drawbacks:
– with naïve approach, how can we make sure we get a
normal transliteration?
– E.g. we can get iskrym as back transliteration for
aisukuriimu.
– Take letter frequency into account! So can get isclim
– Restrict to real words! Is crime.
– We want ice cream!
Modular Learning Approach
Build generative model of transliteration process,
1. English phrase is written
2. Translator pronounces it in English
3. Pronunciation modified to fit Japanese sound inventory
4. Sounds are converted into katakana
5. Katakana is written
Solve and coordinate solutions to these subproblems, use
generative models in reverse direction
Use probabilities and Bayes’ Rule
Bayes’ Rule Example
Example #1: Conditional probabilities – from Wikipedia
Suppose there are two bowls full of cookies. Bowl #1 has 10 chocolate chip cookies and 30 plain cookies, while bowl #2
has 20 of each. Fred picks a bowl at random, and then picks a cookie at random. We may assume there is no
reason to believe Fred treats one bowl differently from another, likewise for the cookies. The cookie turns out to be
a plain one. How probable is it that Fred picked it out of bowl #1?
Intuitively, it seems clear that the answer should be more than a half, since there are more plain cookies in bowl #1.
The precise answer is given by Bayes's theorem. But first, we can clarify the situation by rephrasing the question
to "what’s the probability that Fred picked bowl #1, given that he has a plain cookie?” Thus, to relate to our
previous explanation, the event A is that Fred picked bowl #1, and the event B is that Fred picked a plain cookie.
To compute Pr(A|B), we first need to know:
Pr(A), or the probability that Fred picked bowl #1 regardless of any other information. Since Fred is treating both bowls
equally, it is 0.5.
Pr(B), or the probability of getting a plain cookie regardless of any information on the bowls. In other words, this is the
probability of getting a plain cookie from each of the bowls. It is computed as the sum of the probability of getting a
plain cookie from a bowl multiplied by the probability of selecting this bowl. We know from the problem statement
that the probability of getting a plain cookie from bowl #1 is 0.75, and the probability of getting one from bowl #2 is
0.5, and since Fred is treating both bowls equally the probability of selecting any one of them is 0.5. Thus, the
probability of getting a plain cookie overall is 0.75×0.5 + 0.5×0.5 = 0.625.
Pr(B|A), or the probability of getting a plain cookie given that Fred has selected bowl #1. From the problem statement,
we know this is 0.75, since 30 out of 40 cookies in bowl #1 are plain.
Given all this information, we can compute the probability of Fred having selected bowl #1 given that he got a plain
cookie, as such:
Pr(A|B) = Pr(B|A) × Pr(A) / Pr(B) = 0.75 × 0.5 / 0.625 = 0.6
As we expected, it is more than half.
Application To Task At Hand
English Phrase Generator produces word sequences
according to probability distribution P(w)
English Pronouncer probabilistically assigns a set of
pronunciations to word sequence, according to P(p|w)
Given pronunciation p, find word sequence that maximizes
P(w|p)
Based on Bayes’ Rule: P(w|p) = P(p|w) * P(w) / P(p)
But P(p) will be the same regardless of the specific word
sequence, so can just search for word sequence that
maximizes P(p|w) * P(w), which are the two distributions
we just modeled
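The argmax above can be sketched with two small lookup tables. All probabilities below are made-up numbers for illustration, and the table and function names are assumptions; a real system would estimate P(w) from a corpus and P(p|w) from a pronunciation dictionary.

```python
# HYPOTHETICAL toy distributions, for illustration only.
P_W = {                     # P(w): prior over English word sequences
    "ice cream": 0.0006,
    "i scream": 0.0001,
}
P_P_GIVEN_W = {             # P(p|w): pronunciation given word sequence
    ("AY S K R IY M", "ice cream"): 0.9,
    ("AY S K R IY M", "i scream"): 0.8,
}


def best_word_sequence(pronunciation):
    """Return the w maximizing P(p|w) * P(w); P(p) is constant and dropped."""
    def score(w):
        return P_P_GIVEN_W.get((pronunciation, w), 0.0) * P_W[w]
    return max(P_W, key=score)


print(best_word_sequence("AY S K R IY M"))
```

Both candidates explain the sounds almost equally well, so the prior P(w) is what separates them.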
Five Probability Distributions
Extending this notion, built 5 probability distributions
1. P(w) – generates written English word sequences
2. P(e|w) – pronounces English word sequences
3. P(j|e) – converts English sounds into Japanese sounds
4. P(k|j) – converts Japanese sounds into katakana writing
5. P(o|k) – introduces misspellings caused by OCR
Parallels 5 steps above
1. English phrase is written
2. Translator pronounces it in English
3. Pronunciation modified to fit Japanese sound inventory
4. Sounds are converted into katakana
5. Katakana is written
Given katakana string o observed by OCR, we wish to find the w that maximizes:
P(w) * P(e|w) * P(j|e) * P(k|j) * P(o|k), summed over all e, j, k
Why? Say we have e and want the most probable w given e, that is, P(w|e): by Bayes’ rule we maximize P(w) * P(e|w) /
P(e).
Say we have j and want the most probable e given j, that is, P(e|j): we maximize P(e) * P(j|e).
Note that while we usually ignore the divisor, here we keep it: chaining the two steps, the P(e) from the second cancels the divisor of the first, P(e) / P(e) = 1, leaving P(w) * P(e|w) * P(j|e).
And so on for each distribution in turn.
Implementation of the probability
distributions
P(w) as WFSA (weighted finite state acceptor), others as
WFST (transducers)
WFSA = state transition diagram with both symbols and
weights on the transitions, such that some transitions
more likely than others
WFST = the same, but with both input and output symbols
Implemented composition algorithm to yield P(x|z) from
models P(x|y) and P(y|z), treating WFSAs simply as
WFST with identical input and output
Yields one large WFSA; use Dijkstra’s shortest path
algorithm to extract the most probable path
No pruning; use Viterbi approximation, searching for the best path
through the WFSA rather than the best sequence
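A minimal sketch of extracting the most probable path from a weighted acceptor with Dijkstra's algorithm, using negative log probabilities as arc costs so that the shortest path is the most probable one. The toy acceptor, its probabilities, and the `best_path` name are assumptions; this is not the authors' implementation.

```python
import heapq
import math


def best_path(arcs, start, finals):
    """Most probable path from start to any final state.

    arcs maps a state to a list of (next_state, symbol, probability).
    Returns (symbols, probability), or None if no final state is reachable.
    """
    heap = [(0.0, start, ())]        # (cost so far, state, symbols emitted)
    done = set()
    while heap:
        cost, state, symbols = heapq.heappop(heap)
        if state in done:
            continue
        done.add(state)
        if state in finals:
            return symbols, math.exp(-cost)
        for nxt, sym, prob in arcs.get(state, []):
            if prob > 0 and nxt not in done:
                heapq.heappush(heap, (cost - math.log(prob), nxt, symbols + (sym,)))
    return None


# HYPOTHETICAL toy acceptor with two competing readings.
ARCS = {
    0: [(1, "ice", 0.6), (2, "i", 0.4)],
    1: [(3, "cream", 0.5)],
    2: [(3, "scream", 0.5)],
}
print(best_path(ARCS, start=0, finals={3}))
```

This returns the single best path, which is the Viterbi approximation named above: probabilities of different paths that spell the same word sequence are not summed.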
First Model – Word Sequences
• “ice cream” > “ice crème” > “aice kreme”
• Unigram scoring mechanism which multiplies
scores of known words and phrases in a
sequence
• Corpus: WSJ corpus + online English name
list + online gazetteer of place names
• Should really e.g. ignore auxiliaries and favor
surnames. Approximate by removing high
frequency words
Model 2 – Eng Word Sequences →
Eng Sound Sequences
• Use English phoneme inventory
from CMU Pronunciation
Dictionary, minus stress marks
• 40 sounds: 14 vowel sounds, 25
consonant sounds (e.g. K, HH, R),
additional symbol PAUSE
• Dictionary has 100,000 (125,000)
word pronunciations
• Used top 50,000 words because
of memory limitations
• Capital letters – Eng sounds;
lowercase words – Eng words
Example Second WFST
Note: Why not letters instead of phonemes? Doesn’t match Japanese
transliteration mispronunciation, and that is modeled in next step.
Model 3: English Sounds →
Japanese Sounds
• Information-losing process: R, L → r; 14 vowels → 5 Japanese
vowels
• Identify Japanese sound inventory
• Build WFST to perform the sequence mapping
• Japanese sound inventory has 39 symbols: 5 vowels, 33 consonants
(including doubled kk), special symbol pause.
• (P R OW PAUSE S AA K ER) (pro-soccer) maps to (p u r o pause s
a kk a a)
• Use machine learning to train WFST from 8000 pairs of
English/Japanese sound sequences (for example, soccer). Created
this corpus by modifying an English/katakana dictionary, converting
into these sounds; used the EM (expectation-maximization) algorithm to
generate symbol matching probabilities. See table on next page
The EM Algorithm
Note: pays no heed to context
Model 4: Japanese sounds → Katakana
• Manually construct 2 WFSTs.
• #1 just merges
sequential doubled
sounds into single
sound. o o → oo
• #2 just does mapping,
accounting for
different spelling
variation. e.g.
Model 5: katakana → OCR
Example
Transliteration of Proper
Names in Cross-Language
Applications
Paola Virga, Sanjeev Khudanpur
Johns Hopkins University
Abstract
• For MT, for IR, specifically cross-language IR
• Names important, particularly for short queries
• Transliteration: writing name in foreign language, preserving the way it
sounds
1. Render English name in phonemic form
2. Convert phonemic string into foreign orthography, e.g. Mandarin Chinese
• Mentions back transliteration for Japanese, and application to Arabic, by
Knight etc.
• For Korean, strongly phonetic orthography allows good transliteration
using simple HMMs
• Prior work: hand-crafted rules to change English spelling to accord with Mandarin
syllabification, then learning to convert English phoneme sequence to
Mandarin syllable sequence.
• They extend the previous, making it fully data-driven rather than relying
on hand-crafted rules, to accomplish English → Mandarin transliteration
Four steps in transliteration
process
1. English → phonemic English (using Festival)
– Festival: free, source available, multilingual, interfaces to shell, Scheme, Java, C++, emacs (see next
page)
2. English phonemes → initials and finals
3. Initial + final sequence → pin-yin symbols
Wikipedia: Pinyin is a system of romanization (phonemic notation and transcription to Roman script) for
Standard Mandarin, where pin means "spell" and yin means "sound". …
– Pinyin is a romanization and not an anglicization; that is, it uses Roman letters to
represent sounds in Standard Mandarin. The way these letters represent sounds in Standard Mandarin
will differ from how other languages that use the Roman alphabet represent sound. For example, the
sounds indicated in pinyin by b and g are not as heavily voiced as in the Western use of the Latin script.
Other letters, like j, q, x or zh indicate sounds that do not correspond to any exact sound in English.
Some of the transcriptions in pinyin, such as the ang ending, do not correspond to English
pronunciations, either.
– By letting Roman characters refer to specific Chinese sounds, pinyin produces a compact and accurate
romanization, which is convenient for native Chinese speakers and scholars. However, it also means
that a person who has not studied Chinese or the pinyin system is likely to severely mispronounce
words, which is a less serious problem with some earlier romanization systems such as Wade-Giles.
– Different from katakana
4. Pin-yin → Chinese character sequence
Steps 1, 3: deterministic; steps 2, 4: statistical
Noisy Channel Model
• We had concept before
• Think of e as an i-word English sentence output
from the noisy channel, and c as a j-word Chinese input
into the noisy channel. Except words =
phonemes
• Find most likely Chinese sentence to have
generated English output. Use Bayes’ rule.
How to train and use the transliteration
system: see next slide
Training
• Got from authors of [3] {+ [4]}, their
corpus.
• 3875 English names, Chinese
transliterations, pin-yin counterparts, +
used Festival to generate phonemic
English, + pronunciation of pinyin
based on Initial/Final inventory from
Mandarin phonology text
• First corpus: lines 2, 3
• Second corpus: lines 4, 5
• Compare to [4], Do more general test
Spoken Document Retrieval
• Infrastructure developed at Johns Hopkins
Summer Workshop – Mandarin audio to
be searched using English text queries
• English proper names unavailable in
translation lexicon, thus ignored during
retrieval
• Improved mean average precision by
adding name transliteration (from 0.501 to
0.515)
Statistical Transliteration for
English-Arabic Cross
Language Information
Retrieval
Nasreen AbdulJaleel, Leah Larkey
Overview
• For IR
• Motivation – not proper nouns but rather OOV (out of
vocabulary) words – when have no corresponding word
in dictionary, simply transliterate it
• Though train English to Arabic transliteration model from
pairs of names
• Selected n-gram model
– Two stage training model
– Learn which n-gram segments should be added to unigram
inventory for source language
– Then learn translation model over this inventory
– + No need for heuristics
– + No need for knowledge of either language
The Problem
• OOV words – problem in cross language information
retrieval
– Named entities
– Numbers
– Technical terms
– Acronyms
• These compose significant portion of OOV, and when
named entity translation not available, reduction in
average precision of 50%
• Variability of spelling foreign words. E.g. Qaddafi from
before
• OK to use own spelling in foreign language when share
same alphabet (e.g. Italian, Spanish, German), but not
when has different alphabet. Then transliteration.
Multiple Spellings In Arabic
• Thus, useful to have way to generate multiple spellings in
Arabic from single source
• Use statistical transliteration to generate – no heuristics, no
linguistic knowledge
• “Statistical transliteration is special case of statistical
translation, in which the words are letters.”
Selected N-gram transliteration
model
• Generative statistical model, producing string of Arabic
chars from string of English chars
• Model: set of conditional probability distributions over
Arabic chars and NULL
• Each English char n-gram ei can be mapped to Arabic
char or sequence of chars ai with probability P(ai|ei)
• Most probabilities are 0, in practice.
• Probabilities of s, z, tz
• Also, English source symbol inventory has, besides
unigrams (such as single letters), some end symbols and
n-grams such as sh, bb, eE
Training of Model
• From lists of English/Arabic name pairs
• 2 alignment stages
– 1: to select n-grams for the model
– 2: Determine translation probabilities for the n-grams
• Used GIZA++ for letter alignment rather than word alignment, treating
letters as words
• Corpus: 125,000 English proper nouns and Arabic translations, retaining
only those existing in AP news article corpus
• Some normalization – made lowercase, prefixed with B and ended with E
• Alignment 1: Align using GIZA++, count instances in which English char
sequence aligned to single Arabic character. Take top 50 of these n-grams
and add to English symbol inventory
• Resegment based on new inventory, using greedy-ish method
– Ashcroft → a sh c r o f t
• Alignment #2, using GIZA++
• Count up alignments, use them as conditional probabilities, removing
alignments with probability below a threshold of 0.01
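The resegmentation and counting steps above can be sketched as follows. This assumes a greedy longest-match segmenter (the slides only say "greedy-ish"), leaves out the B/E boundary symbols, and takes already aligned (English segment, Arabic string) pairs as input rather than running GIZA++. All names and the toy data are illustrative.

```python
from collections import Counter, defaultdict


def greedy_segment(word, ngrams):
    """Split a word left to right, preferring the longest n-gram in the inventory."""
    word = word.lower()
    max_len = max((len(g) for g in ngrams), default=1)
    segments, i = [], 0
    while i < len(word):
        for size in range(min(max_len, len(word) - i), 0, -1):
            piece = word[i:i + size]
            if size == 1 or piece in ngrams:   # single letters are always allowed
                segments.append(piece)
                i += size
                break
    return segments


def translation_probs(aligned_pairs, threshold=0.01):
    """Relative-frequency estimate of P(a|e), dropping entries below threshold."""
    counts = defaultdict(Counter)
    for e, a in aligned_pairs:
        counts[e][a] += 1
    probs = {}
    for e, counter in counts.items():
        total = sum(counter.values())
        probs[e] = {a: n / total for a, n in counter.items() if n / total >= threshold}
    return probs


print(greedy_segment("Ashcroft", {"sh", "ch", "bb"}))
# Toy alignments written in Buckwalter symbols ($ = sheen); counts are made up.
print(translation_probs([("sh", "$")] * 3 + [("sh", "s")]))
```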
Generation of Arabic
Transliterations
• Take English word ew.
• Segment, greedily (?) from n-gram
inventory
• All possible transliterations, wa generated
• Rank according to probabilities, by
multiplying
• Ran experiments, improvement over
unigram only. Etc.
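A sketch of the generate-and-rank step above: enumerate every combination of per-segment outputs and score each by the product of its probabilities. The probability table is hypothetical and uses Buckwalter-style ASCII symbols in place of Arabic characters; the function name is an assumption.

```python
from itertools import product
from math import prod


def transliterate(segments, probs, top_n=5):
    """Return up to top_n (candidate, probability) pairs, best first.

    probs[e] maps each possible output string for segment e to P(a|e);
    the empty string stands for NULL (the segment produces nothing).
    """
    options = [list(probs.get(seg, {}).items()) for seg in segments]
    if any(not opts for opts in options):
        return []                       # some segment has no known output
    candidates = []
    for combo in product(*options):
        text = "".join(a for a, _ in combo)
        score = prod(p for _, p in combo)
        candidates.append((text, score))
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:top_n]


# HYPOTHETICAL probabilities, for illustration only.
PROBS = {"s": {"s": 0.7, "S": 0.3}, "a": {"A": 0.6, "": 0.4}}
for text, score in transliterate(["s", "a"], PROBS):
    print(text, round(score, 2))
```

The number of candidates grows multiplicatively with word length, and distinct combinations that yield the same string are not merged here.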