# Machine Translation: Phrase-Based Statistical MT

Jörg Tiedemann
jorg.tiedemann@lingfil.uu.se
Department of Linguistics and Philology, Uppsala University

## Statistical Machine Translation

Probabilistic view on MT (E = target language, F = source language):

    Ê = argmax_E P(E|F)
      = argmax_E P(F|E) P(E)

## Word-based Translation Models

Example: translate "a house". Translation candidate 1: "ett hus".
Question: what is P("ett hus"|"a house")? With the noisy-channel model above we evaluate candidates through P("a house"|"ett hus") · P("ett hus"); here we look at the translation-model part.

Simplest model: context-independent lexical probabilities t(word_english|word_swedish) (no NULL alignments).

P("a house"|"ett hus") = sum of all possible ways to generate "a house" from "ett hus" given our model (a table of lexical probabilities):

    P("a house"|"ett hus") = 1/2² · t("a"|"ett") · t("house"|"hus") +
                             1/2² · t("a"|"ett") · t("house"|"ett") +
                             1/2² · t("a"|"hus") · t("house"|"hus") +
                             1/2² · t("a"|"hus") · t("house"|"ett")

According to this model: what is P("a house"|"hus ett")?

## Word-based Alignment Models (IBM 1)

Where do we get the lexical probabilities from?
→ Automatic word alignment!

Example corpus and EM (see chapter 4.2!):

    ett hus   ↔ a house
    ett barn  ↔ a child
    mitt barn ↔ my child

Basic question: how often do we link certain words together in all possible alignments (relative to other possible links)?

We don't have fixed links; we only know the likelihood of an alignment! → count link likelihoods instead!

Initially, all links have the same probability (t(e|f) = 0.25).
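The sum over alignments can be made concrete with a short script. This is a minimal sketch (not from the lecture): it enumerates every way the English words can pick a Swedish word, using the uniform initial table t(e|f) = 0.25 and the uniform alignment prior 1/2² per alignment.

```python
from itertools import product

# Lexical translation probabilities t(e|f); the uniform 0.25 values are
# the initialization from the lecture, not trained probabilities.
t = {("a", "ett"): 0.25, ("a", "hus"): 0.25,
     ("house", "ett"): 0.25, ("house", "hus"): 0.25}

def ibm1_prob(e_words, f_words, t):
    """Sum over all alignments: each English word picks one Swedish word."""
    total = 0.0
    for alignment in product(range(len(f_words)), repeat=len(e_words)):
        p = 1.0 / len(f_words) ** len(e_words)  # uniform alignment prior (here 1/2^2)
        for j, i in enumerate(alignment):
            p *= t[(e_words[j], f_words[i])]
        total += p
    return total

print(ibm1_prob(["a", "house"], ["ett", "hus"], t))  # 0.0625
```

With a uniform table the four alignment terms are identical, so the order of the Swedish words does not matter yet; only trained, non-uniform probabilities make "ett hus" and "hus ett" differ.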
## IBM 1: Initialization

Corpus: ett hus ↔ a house, ett barn ↔ a child, mitt barn ↔ my child.

| e     | f    | total(f) | count | t     |
|-------|------|----------|-------|-------|
| house | mitt | 0.000    | 0.000 | 0.250 |
| house | ett  | 0.000    | 0.000 | 0.250 |
| house | barn | 0.000    | 0.000 | 0.250 |
| house | hus  | 0.000    | 0.000 | 0.250 |
| a     | mitt | 0.000    | 0.000 | 0.250 |
| a     | ett  | 0.000    | 0.000 | 0.250 |
| ...   | ...  | ...      | ...   | ...   |

Example count: in sentence pair 1, "ett" and "a" are linked once with likelihood 0.25, out of two possible links for "a"
→ relative count = 0.25/(0.25+0.25) = 0.25/0.5 = 0.5
The same in sentence pair 2 → total count = 0.5 + 0.5 = 1.0!

## IBM 1: Iteration 1

| e     | f    | total(f) | count | t     |
|-------|------|----------|-------|-------|
| house | ett  | 2.000    | 0.500 | 0.250 |
| house | hus  | 1.000    | 0.500 | 0.500 |
| a     | ett  | 2.000    | 1.000 | 0.500 |
| a     | barn | 2.000    | 0.500 | 0.250 |
| a     | hus  | 1.000    | 0.500 | 0.500 |
| my    | mitt | 1.000    | 0.500 | 0.500 |
| my    | barn | 2.000    | 0.500 | 0.250 |
| child | mitt | 1.000    | 0.500 | 0.500 |
| child | ett  | 2.000    | 0.500 | 0.250 |
| child | barn | 2.000    | 1.000 | 0.500 |

## IBM 1: Iteration 2

| e     | f    | total(f) | count | t     |
|-------|------|----------|-------|-------|
| house | ett  | 1.833    | 0.333 | 0.182 |
| house | hus  | 1.167    | 0.667 | 0.571 |
| a     | ett  | 1.833    | 1.167 | 0.636 |
| a     | barn | 1.833    | 0.333 | 0.182 |
| a     | hus  | 1.167    | 0.500 | 0.429 |
| my    | mitt | 1.167    | 0.667 | 0.571 |
| my    | barn | 1.833    | 0.333 | 0.182 |
| child | mitt | 1.167    | 0.500 | 0.429 |
| child | ett  | 1.833    | 0.333 | 0.182 |
| child | barn | 1.833    | 1.167 | 0.636 |

## IBM 1: Iteration 3

| e     | f    | total(f) | count | t     |
|-------|------|----------|-------|-------|
| house | ett  | 1.839    | 0.241 | 0.131 |
| house | hus  | 1.161    | 0.759 | 0.653 |
| a     | ett  | 1.839    | 1.375 | 0.748 |
| a     | barn | 1.839    | 0.222 | 0.121 |
| a     | hus  | 1.161    | 0.402 | 0.347 |
| my    | mitt | 1.161    | 0.759 | 0.653 |
| my    | barn | 1.839    | 0.241 | 0.131 |
| child | mitt | 1.161    | 0.402 | 0.347 |
| child | ett  | 1.839    | 0.222 | 0.121 |
| child | barn | 1.839    | 1.375 | 0.748 |
## IBM 1: Iteration 4

| e     | f    | total(f) | count | t     |
|-------|------|----------|-------|-------|
| house | ett  | 1.851    | 0.167 | 0.090 |
| house | hus  | 1.149    | 0.833 | 0.724 |
| a     | ett  | 1.851    | 1.544 | 0.834 |
| a     | barn | 1.851    | 0.139 | 0.075 |
| a     | hus  | 1.149    | 0.317 | 0.276 |
| my    | mitt | 1.149    | 0.833 | 0.724 |
| my    | barn | 1.851    | 0.167 | 0.090 |
| child | mitt | 1.149    | 0.317 | 0.276 |
| child | ett  | 1.851    | 0.139 | 0.075 |
| child | barn | 1.851    | 1.544 | 0.834 |

## IBM 1: Iteration 5

| e     | f    | total(f) | count | t     |
|-------|------|----------|-------|-------|
| house | ett  | 1.863    | 0.111 | 0.060 |
| house | hus  | 1.137    | 0.889 | 0.782 |
| a     | ett  | 1.863    | 1.669 | 0.896 |
| a     | barn | 1.863    | 0.083 | 0.044 |
| a     | hus  | 1.137    | 0.248 | 0.218 |
| my    | mitt | 1.137    | 0.889 | 0.782 |
| my    | barn | 1.863    | 0.111 | 0.060 |
| child | mitt | 1.137    | 0.248 | 0.218 |
| child | ett  | 1.863    | 0.083 | 0.044 |
| child | barn | 1.863    | 1.669 | 0.896 |

## IBM 1: Iteration 13

| e     | f    | total(f) | count | t     |
|-------|------|----------|-------|-------|
| house | ett  | 1.942    | 0.002 | 0.001 |
| house | hus  | 1.058    | 0.998 | 0.944 |
| a     | ett  | 1.942    | 1.940 | 0.999 |
| a     | barn | 1.942    | 0.001 | 0.000 |
| a     | hus  | 1.058    | 0.059 | 0.056 |
| my    | mitt | 1.058    | 0.998 | 0.944 |
| my    | barn | 1.942    | 0.002 | 0.001 |
| child | mitt | 1.058    | 0.059 | 0.056 |
| child | ett  | 1.942    | 0.001 | 0.000 |
| child | barn | 1.942    | 1.940 | 0.999 |

## Word-based Translation Models

What happens if we introduce position parameters?
a(1|2) = probability that the word at position 1 is generated by the word at position 2.

What is P("a house"|"ett hus") now?

    P("a house"|"ett hus") = t("a"|"ett") · a(1|1) · t("house"|"hus") · a(2|2) +
                             t("a"|"hus") · a(2|1) · t("house"|"ett") · a(1|2) +
                             t("a"|"ett") · a(1|1) · t("house"|"ett") · a(1|2) +
                             t("a"|"hus") · a(2|1) · t("house"|"hus") · a(2|2)

According to this model: what is P("a house"|"hus ett")?

Probably: P("a house"|"ett hus") > P("a house"|"hus ett")
## Word-based Translation Models

Increasing complexity:
- IBM 1: lexical translation probabilities
- IBM 2: add absolute reordering
- IBM 3: add fertility
- IBM 4: relative reordering & word classes

Why do we need the lower models with less information?
→ start with simple models to initialize more complex ones

## Statistical word alignment

What are the issues with EM?
- expensive procedure, especially with many open variables (especially the E-step)
- no guarantee to find the global optimum (if local optima exist) → good initialization is necessary!
- IBM 1 has only one (global) optimum → good!
- from IBM 3 & 4 on: the E-step needs to be approximated

## Summary on statistical word alignment

In word-based SMT, the translation model is based on probabilistic parameters:
- lexical translation model
- reordering model (distortion)
- fertility model

Training:
- statistical word alignment with EM
- cascaded training procedure
- by-product of parameter estimation: word alignment

(for mathematical details: see chapter 4)

Something is still missing in our SMT system ...

## Statistical Machine Translation: Language Modeling

Remember:

    Ê = argmax_E P(F|E) P(E)

Now we have the translation model P(F|E); we still need the language model P(E).
Easy! → Use standard N-gram language models.
## Statistical Machine Translation: Language Modeling

Language modeling: a (probabilistic) LM predicts the likelihood of any given string.
What is the likelihood P(E) of observing sentence E?

    P_LM(the house is small) > P_LM(small the is house)
    P_LM(ett hus) > P_LM(en hus)

Estimate probabilities from corpora, using the chain rule:

    P(E) = P(e1, e2, e3, ..., en)
         = P(e1) · P(e2|e1) · P(e3|e1, e2) · ... · P(en|e1, ..., en−1)

Remember: MLE for conditional probabilities

    P(ej|e1, ..., ej−1) = count(e1, e2, ..., ej) / count(e1, e2, ..., ej−1)

Again: what is the problem?
→ sparse counts for large N-grams!
→ Markov assumption! (bigram model: P(e3|e1, e2) ≈ P(e3|e2))

- unigram model: P(E) = P(e1) · P(e2) · ... · P(en)
- bigram model:  P(E) = P(e1) · P(e2|e1) · P(e3|e2) · ... · P(en|en−1)
- trigram model: P(E) = P(e1) · P(e2|e1) · P(e3|e1, e2) · ... · P(en|en−2, en−1)
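The bigram MLE above can be sketched in a few lines. The tiny corpus and the `<s>`/`</s>` boundary markers are my own assumptions; the slide does not specify how sentence boundaries are handled.

```python
from collections import Counter

# Tiny made-up training corpus for illustration.
sentences = [["the", "house", "is", "small"], ["the", "house", "is", "big"]]

unigrams, bigrams = Counter(), Counter()
for s in sentences:
    tokens = ["<s>"] + s + ["</s>"]
    unigrams.update(tokens[:-1])             # history counts count(w1)
    bigrams.update(zip(tokens, tokens[1:]))  # bigram counts count(w1, w2)

def p_bigram(sentence):
    # Bigram model: P(E) = prod over bigrams of count(w1 w2) / count(w1)  (MLE)
    tokens = ["<s>"] + sentence + ["</s>"]
    p = 1.0
    for w1, w2 in zip(tokens, tokens[1:]):
        p *= bigrams[(w1, w2)] / unigrams[w1]
    return p

print(p_bigram(["the", "house", "is", "small"]))  # 0.5
print(p_bigram(["house", "the"]))                 # 0.0: one unseen bigram zeroes everything
```

The second call already shows the zero-count problem discussed on the next slide.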

## Statistical Machine Translation: Language Modeling

Another problem: zero counts!
- some N-grams are never observed (→ count(e1, e2) = 0)
- ... but they appear in real data (e.g. as translation candidates)
- → multiplying with one zero factor makes everything zero → BAD!

→ Smoothing! (reserve probability mass for unseen events)
... there would be so much more to say about LMs (see chapter 7)

## Statistical Machine Translation: Decoding

Decoding = search for a solution Ê given F using:

    Ê = argmax_E P(F|E) P(E)

Far too many possible E's to search globally!
→ Approximate search using good partial candidates!
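The smoothing idea mentioned above can be illustrated with add-one (Laplace) smoothing, the simplest of many schemes; the counts and vocabulary size below are made up for the example.

```python
from collections import Counter

# Toy counts; a vocabulary of 4 word types is assumed for illustration.
bigram_counts = Counter({("the", "house"): 2, ("house", "is"): 2, ("is", "small"): 1})
unigram_counts = Counter({"the": 2, "house": 2, "is": 2, "small": 1})
V = 4  # vocabulary size

def p_add_one(w1, w2):
    # Add 1 to every bigram count: unseen events keep some probability mass.
    return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + V)

print(round(p_add_one("is", "small"), 3))   # seen bigram:   (1+1)/(2+4) = 0.333
print(round(p_add_one("small", "the"), 3))  # unseen bigram: (0+1)/(1+4) = 0.2, not zero
```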
## Motivation for Phrase-based SMT

Word-based SMT:
- statistical word alignment → P(F|E)
- language modeling → P(E)
- global decoding: argmax_E P(F|E) P(E)

Motivation: word-by-word translation is too weak!
- contextual dependencies, local reordering
- non-compositional constructions
- n:m relations

→ look at larger chunks!

## Phrase-based SMT

- phrases = word N-grams
- less ambiguity, more context in the translation table
- handles non-compositional expressions
- local reorderings covered by phrase translations
- "distortion": reordering on the phrase level

→ Moses toolkit: http://www.statmt.org/moses/

## Phrase-based SMT

Translation model in PSMT:

    P(F|E) = ∏_{i=1}^{I} φ(fi|ei) · d(starti, endi−1)

- phrases are extracted from word-aligned parallel corpora
- phrase translation probabilities (MLE)
- distance-based reordering (d)

Phrase translation probabilities:
- need phrase alignments in the parallel corpus
- induce them from word alignments (IBM models)
- score extracted phrases (MLE):

    φ(f|e) = count(f, e) / Σ_f count(f, e)
## Statistical word alignment

Standard models: IBM models 1-5 (cascaded), EM training.
Final parameters:
- word translation probabilities (lexical model)
- special NULL word (NULL → la)
- fertility probabilities
- distortion probabilities (reordering)

## Viterbi Word Alignment

Viterbi alignment → assign the most likely links between words according to the statistical word alignment model from above.
- EMPTY alignment possible (did)
- only 1:many (slap), not many:1 → depending on alignment direction

→ Alignment tool: GIZA++

## Viterbi Word Alignment from GIZA++

From the German-English Europarl corpus:

```
# Sentence pair (5) source length 12 target length 11 alignment score : 2.14036e-24
ich bitte sie , sich zu einer schweigeminute zu erheben .
NULL ({ }) please ({ 1 2 3 }) rise ({ }) , ({ 4 }) then ({ 5 }) , ({ }) for ({ 6 }) this ({ 7 }) minute ({ 8 }) ' ({ }) s ({ }) silence ({ 9 10 }) . ({ 11 })

# Sentence pair (6) source length 12 target length 10 alignment score : 3.38628e-15
( das parlament erhebt sich zu einer schweigeminute . )
NULL ({ }) ( ({ 1 }) the ({ 2 }) house ({ 3 }) rose ({ 4 5 }) and ({ }) observed ({ 6 }) a ({ 7 }) minute ({ 8 }) ' ({ }) s ({ }) silence ({ 9 }) ) ({ 10 })
```

## Viterbi Word Alignment

Asymmetric alignment!
- no n:1 alignments
- can run IBM models in both directions!
- different links in source-to-target and target-to-source
- best alignment = merge both directions (?!)

How? → Symmetrization heuristics!
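The GIZA++ output above can be parsed with a short script. A sketch assuming the `word ({ positions })` pattern holds for every target token, as in the examples.

```python
import re

# One target-side line from the GIZA++ output above: each target token is
# followed by the source positions it is aligned to.
line = ("NULL ({ }) ( ({ 1 }) the ({ 2 }) house ({ 3 }) rose ({ 4 5 }) "
        "and ({ }) observed ({ 6 }) a ({ 7 }) minute ({ 8 }) ' ({ }) "
        "s ({ }) silence ({ 9 }) ) ({ 10 })")

def parse_giza(line):
    """Return a list of (target_word, [source_positions]) pairs."""
    links = []
    for word, positions in re.findall(r"(\S+) \(\{([\d ]*)\}\)", line):
        links.append((word, [int(i) for i in positions.split()]))
    return links

links = parse_giza(line)
print(links[4])  # ('rose', [4, 5]): one English word linked to two source words
```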

## Word Alignment Symmetrization

Many symmetrization heuristics exist!
- intersection (→ high precision, low recall)
- union (→ low precision, high recall)
- grow, grow-diag, grow-diag-final, ... (→ different balances between precision & recall)

→ better would be: symmetric alignment models!

## Phrase extraction

How do we get phrase alignments from word-aligned data?
- phrases = contiguous word sequences
- short phrases & long phrases are important (general vs. specific translation units)
- phrases should conform to the word alignment

→ phrase extraction algorithms
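If the two directed alignments are represented as sets of (source position, target position) links, intersection and union are one-liners; the link sets here are invented for illustration.

```python
# Two directed word alignments as sets of (source_pos, target_pos) links
# (made-up link sets; in practice these come from two GIZA++ runs).
src2trg = {(0, 0), (1, 1), (1, 2), (2, 3)}
trg2src = {(0, 0), (1, 1), (2, 3), (3, 3)}

intersection = src2trg & trg2src   # high precision, low recall
union = src2trg | trg2src          # high recall, low precision

print(sorted(intersection))  # [(0, 0), (1, 1), (2, 3)]
print(sorted(union))         # [(0, 0), (1, 1), (1, 2), (2, 3), (3, 3)]
```

The grow-diag family starts from the intersection and selectively adds union links adjacent to existing ones, trading precision against recall.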
## Phrase extraction

Get ALL phrase pairs that are consistent with the word alignment.

What is a phrase pair that is consistent with the word alignment?
- all alignment points of the words in the source and target phrase are within the phrase pair
- no word is aligned to any word outside of the phrase pair
- words may be unaligned

(algorithm: see Figure 5.5 in chapter 5)
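The consistency check can be turned into a brute-force extractor. This is a sketch in the spirit of, but not identical to, the algorithm in Figure 5.5; the word alignment is read off the Maria/Mary example matrix, and the phrase-length limit is my own assumption.

```python
def extract_phrases(f_words, e_words, alignment, max_len=7):
    """All phrase pairs consistent with the alignment (brute force)."""
    pairs = set()
    for f1 in range(len(f_words)):
        for f2 in range(f1, min(f1 + max_len, len(f_words))):
            for e1 in range(len(e_words)):
                for e2 in range(e1, min(e1 + max_len, len(e_words))):
                    inside = [(i, j) for i, j in alignment
                              if f1 <= i <= f2 and e1 <= j <= e2]
                    # consistent: at least one link inside, and no link
                    # crosses the phrase-pair boundary in either direction
                    if inside and all((f1 <= i <= f2) == (e1 <= j <= e2)
                                      for i, j in alignment):
                        pairs.add((" ".join(f_words[f1:f2 + 1]),
                                   " ".join(e_words[e1:e2 + 1])))
    return pairs

f = "Maria no daba una bofetada a la bruja verde".split()
e = "Mary did not slap the green witch".split()
# Word alignment of the example, as (source_pos, target_pos) links
a = {(0, 0), (1, 1), (1, 2), (2, 3), (3, 3), (4, 3), (5, 4), (6, 4), (7, 6), (8, 5)}

pairs = extract_phrases(f, e, a)
print(("daba una bofetada", "slap") in pairs)  # True
print(("daba", "slap") in pairs)               # False: links (3,3), (4,3) leave the box
```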

## Phrase extraction

Example sentence pair (the word alignment is shown as a matrix on the slides):

    Maria no daba una bofetada a la bruja verde
    Mary did not slap the green witch

Extracted phrase pairs (smallest consistent blocks):

(Maria, Mary), (no, did not), (slap, daba una bofetada), (a la, the), (bruja, witch), (verde, green)

## Phrase extraction

Adding two-phrase blocks:

(Maria, Mary), (no, did not), (slap, daba una bofetada), (a la, the), (bruja, witch), (verde, green), (Maria no, Mary did not), (no daba una bofetada, did not slap), (daba una bofetada a la, slap the), (bruja verde, green witch)
## Phrase extraction

Adding still larger blocks:

(Maria, Mary), (no, did not), (slap, daba una bofetada), (a la, the), (bruja, witch), (verde, green), (Maria no, Mary did not), (no daba una bofetada, did not slap), (daba una bofetada a la, slap the), (bruja verde, green witch), (Maria no daba una bofetada, Mary did not slap), (no daba una bofetada a la, did not slap the), (a la bruja verde, the green witch)

## Phrase extraction

And larger still:

(Maria, Mary), (no, did not), (slap, daba una bofetada), (a la, the), (bruja, witch), (verde, green), (Maria no, Mary did not), (no daba una bofetada, did not slap), (daba una bofetada a la, slap the), (bruja verde, green witch), (Maria no daba una bofetada, Mary did not slap), (no daba una bofetada a la, did not slap the), (a la bruja verde, the green witch), (Maria no daba una bofetada a la, Mary did not slap the), (daba una bofetada a la bruja verde, slap the green witch)

## Phrase extraction

The complete set of consistent phrase pairs:

(Maria, Mary), (no, did not), (slap, daba una bofetada), (a la, the), (bruja, witch), (verde, green), (Maria no, Mary did not), (no daba una bofetada, did not slap), (daba una bofetada a la, slap the), (bruja verde, green witch), (Maria no daba una bofetada, Mary did not slap), (no daba una bofetada a la, did not slap the), (a la bruja verde, the green witch), (Maria no daba una bofetada a la, Mary did not slap the), (daba una bofetada a la bruja verde, slap the green witch), (no daba una bofetada a la bruja verde, did not slap the green witch), (Maria no daba una bofetada a la bruja verde, Mary did not slap the green witch)

## Scoring phrases

Simple maximum likelihood estimation:

    φ(f|e) = count(f, e) / Σ_f count(f, e)

→ A huge phrase table! (with a lot of garbage?)
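The MLE scoring of extracted phrase pairs is a simple relative frequency; a sketch with a made-up multiset of extracted pairs.

```python
from collections import Counter

# Extracted phrase pair instances (an invented multiset for illustration).
extracted = [("bara", "just"), ("bara", "just"), ("bara", "just"),
             ("bara", "only"), ("bara vi", "just")]

pair_counts = Counter(extracted)
e_totals = Counter(e for _, e in extracted)   # Σ_f count(f, e) per target phrase

def phi(f, e):
    # phi(f|e) = count(f, e) / Σ_f count(f, e)
    return pair_counts[(f, e)] / e_totals[e]

print(phi("bara", "just"))     # 3/4 = 0.75
print(phi("bara vi", "just"))  # 1/4 = 0.25
```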
## Phrase tables

Examples from a phrase table (Pirates of the Caribbean):

| Swedish | English | Score |
|---------|---------|-------|
| , det är | , it' s | 0.666667 |
| , det är | , that' s | 1 |
| att bli besvikna | be disappointed | 1 |
| att bli en själv | to becoming one | 1 |
| bara vi | just | 0.1 |
| bara | just | 0.6 |
| bara | only | 0.375 |
| barbossa och hans besättning | barbossa and his crew | 1 |
| barbossa och hans | barbossa and his | 1 |
| barbossa tänker göra . allt | barbossa is up to ... ... all | 1 |

(The training set was too small to get reasonable counts!)

## The final model for Phrase-Based SMT

    Ê = argmax_E P(E|F)
      = argmax_E ∏_i φ(fi|ei) · d(starti, endi−1) · P(E) · ω^length(E)

Distortion d: the chance to move phrases to other positions
- fixed distortion limit (e.g. 6)
- simple penalty for moving: α^|starti − endi−1 − 1|, OR
- lexicalized distortion (learned from the alignment)

Word cost ω^length(E): a bias for longer output
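The simple distance-based penalty above is easy to compute; the value of α below is an assumption for illustration.

```python
# Distance-based reordering penalty d = alpha^|start_i - end_{i-1} - 1|
alpha = 0.5  # assumed value; tuned in a real system

def distortion(start_i, end_prev):
    # |start_i - end_prev - 1| = 0 for a monotone continuation (no penalty)
    return alpha ** abs(start_i - end_prev - 1)

print(distortion(2, 1))  # monotone continuation: |2-1-1| = 0 -> 1.0
print(distortion(5, 1))  # skipping 3 source words: |5-1-1| = 3 -> 0.125
```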

## PB-SMT extension: Log-linear Models

Instead of the noisy-channel model Ê = argmax_E P(F|E) P(E):
- model the posterior directly: Ê = argmax_E P(E|F)
- many feature functions h_m(E, F) may influence P(E|F):
  - phrase translation model E → F
  - phrase translation model F → E
  - lexical weights from the underlying word alignment
  - a language model P(E)
  - lexicalized reordering model
  - length features (word/phrase costs/penalties)

→ P(E|F) = weighted (λ_m) combination of feature functions (h_m):

    P(E|F) = 1/Z · exp( Σ_{m=1}^{M} λ_m h_m(E, F) )

    Ê = argmax_E P(E|F) = argmax_E log P(E|F)
      = argmax_E Σ_{m=1}^{M} λ_m h_m(E, F)

How to learn the weights λ_m?
- Minimum error rate training (MERT) on a development set!
- measure the error in terms of BLEU scores (n-best list)
- iterative adjustment of model parameters (slow but effective!)
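Scoring one candidate under the log-linear model is just a weighted sum. All feature values and weights below are made up; in practice the weights come from MERT.

```python
# Feature values h_m(E, F) for one candidate translation (invented numbers;
# the log-probability features are negative, the word count is positive).
features = {"log_phi_fe": -2.3, "log_phi_ef": -1.9,
            "log_lm": -4.1, "word_count": 6.0}
weights = {"log_phi_fe": 0.2, "log_phi_ef": 0.2,
           "log_lm": 0.5, "word_count": -0.1}

def score(features, weights):
    # log-linear model: sum_m lambda_m * h_m(E, F); decoding takes the argmax
    # of this score over candidate translations E
    return sum(weights[m] * features[m] for m in features)

print(round(score(features, weights), 3))  # -3.49
```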
## Phrase table with multiple scores

That's what you will get from Moses:

| Swedish | English | Scores |
|---------|---------|--------|
| , det är | , it' s | 0.6667 0.0959975 0.6667 0.0263227 2.718 |
| att bli besvikna | be disappointed | 1 0.0221815 1 0.105472 2.718 |
| att bli en själv | to becoming one | 1 0.00896375 1 0.00157689 2.718 |
| bara vi | just | 0.1 0.0102041 1 0.128968 2.718 |
| bara | just | 0.6 0.285714 0.6 0.25 2.718 |
| bara | naught but | 1 0.268518 0.1 0.00195312 2.718 |
| bara | only | 0.375 0.222222 0.3 0.125 2.718 |

The five scores are:
- phrase translation probability φ(f|e)
- lexical weighting lex(f|e)
- phrase translation probability φ(e|f)
- lexical weighting lex(e|f)
- phrase penalty (always exp(1) ≈ 2.718)

## Translation = "decoding"

Global search: Ê = argmax_E P(E|F)
- many translation alternatives (huge phrase table)
- many ways to segment sentences into phrases
- re-ordering makes it even more complex

Very expensive! → need search heuristics:
- pruning (discard weak hypotheses early)
- stack decoding (histograms & thresholds)
- reordering limits

## Decoding Process

Source sentence: "Maria no dio una bofetada a la bruja verde"

Build the translation left-to-right:
- select a foreign word (or phrase) to be translated
- select a translation from the phrase table
- add the translation to the partial translation (hypothesis)

Step by step (as shown on the slides):
- "Maria" → "Mary": mark the first (foreign) word as translated
- "no" → "did not": one-to-many translation
- "dio una bofetada" → "slap": many-to-one translation
- "a la" → "the": many-to-one translation
- "verde" → "green": example of re-ordering
- "bruja" → "witch": translation finished, "Mary did not slap the green witch"
## Decoding Process: Lattice of translation options

Each source word or phrase has many translation options, shown as a lattice on the slide (only partially recoverable from the text), e.g.:

    Maria → Mary
    no → not, no, did not
    daba → give, did not give
    bofetada → slap, a slap
    a → to, by
    a la → to the, the
    bruja → witch, the witch
    verde → green, green witch

## Hypothesis expansion

(figure-only slides: partial hypotheses are expanded with further translation options)

... and continue adding more hypotheses
→ exponential explosion of the search space!
## Hypothesis Stacks

- here: stacks based on the number of foreign words translated
- expand all hypotheses from one stack during translation
- place expanded hypotheses into the appropriate stacks
- → get an n-best list of translations

## Phrase-based SMT

→ Homepage of the Moses toolkit: http://www.statmt.org/moses/
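The stack organization above can be sketched with a heavily simplified, monotone decoder: hypotheses are grouped by the number of source words covered, and each expansion appends one phrase translation. The phrase table and log-probabilities are invented; a real decoder (e.g. Moses) adds reordering, LM scoring, and pruning.

```python
# Toy phrase table: source phrase -> list of (translation, log-probability)
phrase_table = {("ett",): [("a", -0.2)],
                ("hus",): [("house", -0.1)],
                ("ett", "hus"): [("a house", -0.15)]}
source = ["ett", "hus"]

# stacks[k] holds (score, output) hypotheses covering the first k source words
stacks = [[] for _ in range(len(source) + 1)]
stacks[0].append((0.0, ""))

for k in range(len(source)):
    for score, out in stacks[k]:
        # expand with every phrase starting at position k
        for length in range(1, len(source) - k + 1):
            phrase = tuple(source[k:k + length])
            for translation, logp in phrase_table.get(phrase, []):
                hyp = (score + logp, (out + " " + translation).strip())
                stacks[k + length].append(hyp)

best = max(stacks[-1])   # highest-scoring complete hypothesis
print(best)  # (-0.15, 'a house'): the phrase pair beats the word-by-word path
```

Both derivations end in the same string here, but the single-phrase hypothesis scores higher; real decoders additionally prune each stack to keep the search tractable.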

## Summary PB-SMT

- phrase-based SMT = state-of-the-art in data-driven MT (?!)
- based on standard word alignment models
- phrase extraction heuristics & simple scoring
- simplistic re-ordering model
- huge phrase table = big memory of fragment translations
- heuristics for efficient decoding

→ Active research area! New developments all the time!

## What's next?

Next lab session:
- build your own SMT models
- run different setups and evaluate

Lecture:
- a quick look at other topics
