The MITLL/AFRL MT System
Wade Shen, Brian Delaney, and Tim Anderson
23 October 2005
This work is sponsored by the United States Air Force Research Laboratory under Air Force Contract FA8721-05-C-0002.
Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by
the United States Government.
MIT Lincoln Laboratory
999999-1
XYZ 11/26/2011
Statistical Translation System
Experimental Architecture
• Standard Statistical Architecture
• Developed in-house to support SMT experiments
– Framework for experiments with low-resource languages
– Test-bed for S2S MT system
• Most components are home-grown
– Phrase Training/Minimum Error Rate Training
– Pharaoh used for decoding in IWSLT, comparable performance with new Viterbi Decoder
• Participated in Chinese-English Supplied Data track
[Figure: system architecture. Model Training: Ch-En Training Bitext, GIZA++ Word Alignment, Alignment Expansion, Phrase Extraction, Minimum Error Rate Training (with Ch-En Dev Set). Translation: Ch Test Set, Decode, Rescore, En Translated Output.]
The MITLL/AFRL MT System
Overview
• Translation Model
• Minimum Error Rate Training
• Decoder
• Development Experiments
– Segmentation
– Distortion
• Evaluation Results
– Manual Transcription
– ASR Transcription
• Next Steps
Translation Model
Phrase Extraction
• Basic Alignment Template Model
Proposed by Och & Ney 2000
– Expand word alignments interpolating
between the intersection and union of
bidirectional GIZA++ alignments
– Extract consistent phrase pairs from
expanded alignments
• Modifications
1. Add points to intersection that are
unaligned in both source and target
language sentences before iterative
expansion
2. Allow target phrases to be longer/shorter
than source phrases by a fixed factor
(target phrase factor)
3. Modifications 1 and 2 together yield a gain of 2 BLEU points
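The extraction of consistent phrase pairs, together with a length-ratio limit in the spirit of the target phrase factor, can be sketched as follows. This is a minimal illustration and not the system's actual trainer: the function name `extract_phrases`, the `max_len` limit, and the exact form of the ratio test are assumptions, and the sketch keeps only the tightest target span for each source span (it does not extend phrases over unaligned boundary words).

```python
from typing import List, Set, Tuple

Alignment = Set[Tuple[int, int]]  # (source index, target index) links


def extract_phrases(src: List[str], tgt: List[str], align: Alignment,
                    max_len: int = 4,
                    target_phrase_factor: float = 2.0) -> Set[Tuple[str, str]]:
    """Return phrase pairs that are consistent with the word alignment."""
    pairs = set()
    for s1 in range(len(src)):
        for s2 in range(s1, min(s1 + max_len, len(src))):
            # Target positions linked to any word in the source span [s1, s2].
            linked = [t for (s, t) in align if s1 <= s <= s2]
            if not linked:
                continue
            t1, t2 = min(linked), max(linked)
            # Consistency: no word inside the target span may be linked
            # to a source word outside the source span.
            if any(t1 <= t <= t2 and not (s1 <= s <= s2) for (s, t) in align):
                continue
            src_len, tgt_len = s2 - s1 + 1, t2 - t1 + 1
            # Assumed form of the length limit: neither side may exceed
            # the other by more than target_phrase_factor.
            if (tgt_len > target_phrase_factor * src_len
                    or src_len > target_phrase_factor * tgt_len):
                continue
            pairs.add((" ".join(src[s1:s2 + 1]), " ".join(tgt[t1:t2 + 1])))
    return pairs
```

With the toy alignment `{(0, 0), (1, 2), (2, 1)}` over three-word sentences, the span covering only the first two source words is rejected because the middle target word links outside it, while the full three-word pair is kept.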
Translation Model
Distortion, Lexical and Language Models
• Distortion
– We used Pharaoh's simple distortion model with no distortion limit (unlimited)
• Lexical Weighting
– Both model 4 and expanded alignment lexical translation
models tried
– Expanded alignments gave a 1.5 BLEU point gain
• Language Model
– Trained with SRILM
– Interpolated trigram model with Kneser-Ney discounting
used for decoding
– 4-gram LM and 5-gram class-based LM used during rescoring
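For reference, Pharaoh's simple distortion model is usually written as below. The notation follows the Pharaoh documentation rather than this deck, so the symbols are assumptions here: a_i is the start position of the source phrase translated into the i-th target phrase, b_{i-1} is the end position of the source phrase translated into the previous target phrase, and α is a tunable parameter between 0 and 1.

```latex
d(a_i - b_{i-1}) = \alpha^{\,\lvert a_i - b_{i-1} - 1 \rvert}
```

Under this form a monotone step (a_i = b_{i-1} + 1) costs nothing, and the penalty grows with the size of the jump.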
Minimum Error Rate Training
• Log-linear Model Combination
• Additional Language models applied during rescoring
• N-best lists of 2k and 8k used
– Minor gain with 8k n-best
• 5-7% relative improvement over hand optimized parameters
• Insignificant differences from beam-width relaxation
Model Weight Parameters
1   P(f|e) – Forward Translation Model
2   P(e|f) – Backward Translation Model
3   LexW(f|e) – Forward Lexical Weight
4   LexW(e|f) – Backward Lexical Weight
5   PPen – Constant, per-phrase Penalty
6   WPen – Constant, per-word Penalty
7   Dist – Distortion Model
8   Tri-LM – Trigram Language Model
9   4-LM – Four-gram Language Model
10  ClassLM – Five-gram class-based LM
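The log-linear combination and its use for picking a hypothesis from an n-best list can be sketched as below. This is an illustrative sketch only: the function names and the toy feature values are hypothetical, and the weight search that minimum error rate training performs over such lists is not shown.

```python
from typing import Dict, List, Tuple

Features = Dict[str, float]  # model name -> log-domain feature value


def loglinear_score(features: Features, weights: Dict[str, float]) -> float:
    """Weighted sum of feature values (the log of the log-linear model)."""
    return sum(weights[name] * value for name, value in features.items())


def best_hypothesis(nbest: List[Tuple[str, Features]],
                    weights: Dict[str, float]) -> str:
    """Return the n-best entry with the highest combined score."""
    return max(nbest, key=lambda hyp: loglinear_score(hyp[1], weights))[0]


# Toy 2-best list; the feature values are made up for illustration.
NBEST = [
    ("hyp a", {"P(f|e)": -2.0, "Tri-LM": -5.0}),
    ("hyp b", {"P(f|e)": -3.0, "Tri-LM": -3.0}),
]
```

Changing the weights changes the winner: with a language model weight of 1.0 the second hypothesis scores higher, while a weight of 0.2 favors the first. Minimum error rate training searches for the weights whose winners score best on the dev set.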
Decoder Development
• A phrase-based Viterbi beam search
decoder has been implemented
• Decoder can account for word
movement between source and
target languages (distortion)
– With distortion, search complexity
approaches O(2^n)
• Decoding speed:
– Monotone search (without
distortion) can exceed 500 words
per second
– With distortion, search slows to 10
words per second but can be
improved with limits on distortion
• Decoder can produce word lattice
output for optional second pass
rescoring with higher order
language models
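The monotone case (no distortion) can be illustrated with a small Viterbi search over source positions. This is a simplified sketch, not the decoder described above: it has no language model, no beam pruning, and no lattice output, and the phrase table format and the name `monotone_decode` are assumptions.

```python
import math
from typing import Dict, List, Optional, Tuple

# Source phrase (as a tuple of words) -> list of (target phrase, log score).
PhraseTable = Dict[Tuple[str, ...], List[Tuple[str, float]]]


def monotone_decode(src: List[str], table: PhraseTable,
                    max_len: int = 3) -> Optional[Tuple[str, float]]:
    """Best left-to-right segmentation and translation of src, or None."""
    n = len(src)
    # best[i] = (score, back pointer, target phrase) for the first i words.
    best = [(-math.inf, None, None)] * (n + 1)
    best[0] = (0.0, None, None)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            prev_score = best[start][0]
            if prev_score == -math.inf:
                continue  # the first `start` words cannot be covered
            for tgt, logp in table.get(tuple(src[start:end]), []):
                score = prev_score + logp
                if score > best[end][0]:
                    best[end] = (score, start, tgt)
    if best[n][0] == -math.inf:
        return None
    # Follow back pointers to recover the target phrases in order.
    out, pos = [], n
    while pos > 0:
        _, prev, tgt = best[pos]
        out.append(tgt)
        pos = prev
    return " ".join(reversed(out)), best[n][0]
```

Because each source position is visited once and only the last `max_len` start points are considered, the monotone search is linear in sentence length. Allowing reordering requires tracking which source words are already covered, which is where the exponential growth comes from.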
Development Experiments I
Dev Sets and Results
• Code development experiments summary (on IWSLT04 devset)
Implementation Summary                     BLEU
Basic Phrase Extraction                    ~36
+ Enhancements to Phrase Extraction        37.7
+ Lexical Weights from expanded ali.       39.1
• Dev Set Design
– Dev1: CSTAR 2003 (supplied)
– Dev2: IWSLT 2004 (supplied)
– Dev3: ½ Dev1 + ½ Dev2 (first half)
– Dev4: ½ Dev1 + ½ Dev2 (second half)
– Dev5: Dev1 + Dev2
• Manual Transcription Results (BLEU)
– Full Evaluation System
Manual Transcription Dev Results
Dev sets    Test set    BLEU
1, 2        1           36.64
1, 2        2           42.00
3, 4        3           42.44
3, 4        4           33.84
Development Experiments II
Phrase Extraction/MER Experiments
Parameters Varied
• Segmentation (word or character)
• Additional Language Models (4-gram and 5-gram)
• Lexical Back-off
• Minimum Error Rate Training
Configurations BLEU
Base: CharSeg, UTF-8, 4x TPF, hand-tuned weights 39.12
+ lexbackoff 40.32
+ lexbackoff + 2x TPF 40.76
+ lexbackoff + 2x TPF + WordSeg 34.12
+ lexbackoff + 2x TPF + MER 40.99
+ lexbackoff + 2x TPF + extra LMs 41.45
+ lexbackoff + 2x TPF + extra LMs + MER 42.00
Development Experiments III
Details and ASR
• ASR
– Compared 1-best vs. N-best
– Using N-best input gave a 7-10% relative improvement
– Scored N-best without weighting acoustic model or ASR
language model parameters
– Used system trained/optimized with manual transcription
ASR    N-best    N-best Correct %    BLEU
1      1         68.7                26.15
2      1         80.9                32.30
3      1         87.3                35.08
1      20        80.1                28.37
2      20        91.8                36.90
3      20        94.5                37.68
IWSLT 2005 Results
MT Evaluation Metrics
• Metrics Used for IWSLT-2005
– WER: word error rate – the edit distance between output and closest
reference translation
– PER: position-independent WER – same as WER but disregards word
ordering
– BLEU: geometric mean of n-gram precision between output and all
references
– NIST: a variant of BLEU - arithmetic mean of weighted n-gram
precision
– GTM: general text matcher – measures similarity between output and
reference in terms of precision and recall using a unigram-based
F-measure
– METEOR: uses natural language processing tools including word
stemmer and synonym matching to find unigram matches
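The two error-rate metrics above can be sketched in a few lines. This is an illustrative sketch, not the official IWSLT scoring code: the normalization by reference length, the choice of the lowest rate across references, and the PER formula (one common formulation) are assumptions.

```python
from collections import Counter
from typing import List


def edit_distance(hyp: List[str], ref: List[str]) -> int:
    """Word-level Levenshtein distance between hyp and ref."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            curr.append(min(prev[j] + 1,               # drop hyp word
                            curr[j - 1] + 1,           # add ref word
                            prev[j - 1] + (h != r)))   # substitute or match
        prev = curr
    return prev[-1]


def wer(hyp: List[str], refs: List[List[str]]) -> float:
    """Edit distance over reference length, taking the closest reference."""
    return min(edit_distance(hyp, ref) / len(ref) for ref in refs)


def per(hyp: List[str], refs: List[List[str]]) -> float:
    """Like WER, but compares bags of words and ignores word order."""
    def rate(ref: List[str]) -> float:
        matches = sum((Counter(hyp) & Counter(ref)).values())
        return (max(len(hyp), len(ref)) - matches) / len(ref)
    return min(rate(ref) for ref in refs)
```

Applied to the ASR example for sentence #1 in this deck ("to take a tour group" against "i'd like to take a sightseeing tour"), the word-order error counts toward WER but not PER, so PER comes out lower.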
IWSLT 2005 Results
Manual Transcription
• Participated in supplied data track, Chinese-English Translation Task
– Manual and ASR transcription
– 20,000 sentence pair training
– Used in-house trainer and freely available Pharaoh decoder from ISI
(in-house decoder was not ready at submission time)
System BLEU4 NIST METEOR WER PER GTM
ITC 0.528 9.060 0.689 0.414 0.346 0.620
RWTH 0.511 9.567 0.665 0.428 0.358 0.601
EDINBURGH 0.465 6.492 0.632 0.453 0.398 0.599
TALP 0.452 7.974 0.663 0.459 0.380 0.609
MIT 0.450 9.311 0.709 0.464 0.355 0.619
CMU 0.444 6.188 0.564 0.513 0.459 0.524
IBM 0.440 8.436 0.642 0.469 0.391 0.588
ATR-C3 0.394 8.000 0.629 0.523 0.428 0.553
USC 0.332 5.566 0.567 0.544 0.469 0.526
NTT 0.278 7.519 0.593 0.653 0.521 0.492
MIT New Decoder
IWSLT 2005 Results
ASR Transcription
• Used ASR n-best lists as input to MT
• Decode and merge resulting MT output
• Rescore combined output and select best output
• Results
System BLEU4 NIST METEOR WER PER GTM
RWTH 0.383 7.389 0.540 0.565 0.472 0.488
CMU 0.363 6.533 0.520 0.581 0.499 0.483
MIT 0.360 7.556 0.593 0.560 0.455 0.000
IBM 0.336 7.083 0.533 0.598 0.504 0.481
NTT 0.274 6.519 0.522 0.643 0.535 0.458
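The three processing steps listed above the results table (decode each ASR hypothesis, merge the MT output, rescore and select) can be sketched as follows. This is a hedged illustration only: `decode` and `rescore` are hypothetical callables standing in for the decoder and the rescoring models, and the merge rule (keep the best decoder score per distinct translation) is an assumption. ASR scores are ignored in this sketch.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def translate_nbest(asr_hyps: Iterable[str],
                    decode: Callable[[str], List[Tuple[str, float]]],
                    rescore: Callable[[str, float], float]) -> str:
    """Translate every ASR hypothesis, pool the outputs, return the best."""
    merged: Dict[str, float] = {}
    for asr_hyp in asr_hyps:
        for translation, score in decode(asr_hyp):
            # Merge: keep the best decoder score for each distinct translation.
            if score > merged.get(translation, float("-inf")):
                merged[translation] = score
    if not merged:
        raise ValueError("no translations produced")
    # Rescore the pooled candidates and select the top one.
    return max(merged, key=lambda t: rescore(t, merged[t]))
```

Pooling before rescoring lets a translation of a lower-ranked ASR hypothesis win when the rescoring models prefer it.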
IWSLT 2005 Results
Example Output
Translation of MT output vs. reference transcription
Chinese Input                        System Output                             Reference Translation
Sentence #1, Manual Transcription    i'd like to take a group tour             i'd like to take a sightseeing tour
Sentence #1, ASR output              to take a tour group                      i'd like to take a sightseeing tour
Sentence #2, Manual Transcription    request wear formal dress night ?         is formal dress required
Sentence #2, ASR output              there's been feel and formal dresses ?    is formal dress required
Summary
• The MIT/AFRL MT system is capable of state-of-the-art
performance on a Chinese-English task with a limited
training set
• Many in-house components were built, but we also rely on
the existence of freely available components such as
Pharaoh and GIZA++ to accelerate development
• Further research into error mitigation techniques for
speech to speech machine translation is needed
Next Steps
• ASR Lattice rescoring and joint optimization
• Decoder development and evaluation
• Scale to large vocabulary tasks
• Hybrid Interlingual efforts with MIT/CSAIL