Confidence Estimation for
Machine Translation
Lucia Specia
Xerox Research Centre Europe – Grenoble
(collaboration with Marco Turchi & Nello Cristianini – U. Bristol)
Outline
The task of Confidence Estimation for Machine
Translation
Our approach
Method
Features
Algorithms
Experiments
On-going and future work
The task of CE for MT
Goal: given the output of a Machine Translation (MT)
system for a given input, provide an estimate of its quality.
Motivation: assessing the quality of translations is
Time consuming – reading the translation takes time:
Los piratas se han incautado un gigante petrolero saudita llevando su
carga completa de 2 m de barriles - más de una cuarta parte de la Arabia
Saudita de la producción diaria - el sábado frente a la costa de Kenya
(unas 450 millas náuticas al sudeste de Mombasa) y la dirección hacia la
puerto somalí de AEL, la Marina de los EE.UU. dice.
Not possible – if user does not know the source language:
海賊や巨大なサウジの石油タンカー1日、サウジアラビアの生産量の4分の1 -土曜
日には、ケニア沖以上(一部450海里南東モンバサ)バレル分の全負荷-帳簿を押
収しているに向けて運営しているソマリアEylのポートは、米国海軍という。....
The task of CE for MT
Uses:
Is it worth providing this translation as a
suggestion to the professional translator?
– Filter out “bad” translations to avoid professional
translators wasting time reading / post-editing them.
Should this translation be highlighted as
“suspect” to the reader?
– Make end-users aware of the translation quality.
General approach
Different from MT evaluation (BLEU/NIST): reference
translations are NOT available
Unit: word, phrase or sentence
Embedded in the SMT system (word or phrase probabilities)
or a dedicated layer (machine learning problem)
Binary problem: distinguish between good and bad
translations
Related work (sentence-level)
Workshop at JHU (Blatz et al, Coling-04) & MSR (Quirk, LREC-04)
Automatic MT metrics or few manually assessed cases
Poor analysis on the contribution of different features
Predictions did not have a positive effect on practical tasks
MSR (Gamon et al, EAMT-05)
Human likeness classification
Resource dependent features
Poor performance compared to BLEU: little correlation
with human judgments
One MT system, one domain, one language pair
Only good / bad estimates: binary task
Our approach
Sentence-level: natural scenario for MT
Many resource & language independent features
Take contribution of features into account
MT system dependent x independent features
Machine learning problem: regression
Any continuous score
Human annotation as training
Several MT systems, text domains, language pairs and
quality scores
Results: useful in practical applications
Method
1. Identify and extract information sources.
2. Refine the set of information sources to keep only the
relevant ones
Increase performance.
Decrease extraction cost (time).
3. Learn a model to produce quality scores for new
translations.
4. Use the CE score in some application.
Features
Most of those identified in previous work + new ones
Black-box (77): from the input and translation
sentences, monolingual or parallel corpora, e.g.:
Source and target sentence lengths and their ratios
Language model and other statistics in the corpus
Shallow syntax checking (target and target against source)
Average number of possible translations per source word (SMT)
Practical scenario:
Useful when it is not possible to have access to internal
features of the MT systems (commercial systems, e.g.).
Provides a way to perform the task of CE across different
MT systems, which may use different frameworks.
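A minimal sketch of a few black-box features of the kind listed above (sentence lengths, their ratio, and mismatches in punctuation and digits). It assumes whitespace tokenization; the function name `black_box_features` and the feature names are illustrative and not taken from the original feature set.

```python
import string


def black_box_features(source: str, target: str) -> dict:
    """Compute a few illustrative black-box features from a sentence pair."""
    src_tokens = source.split()  # assumption: whitespace tokenization
    tgt_tokens = target.split()

    def count(chars: str, text: str) -> int:
        # Number of characters in `text` that belong to `chars`.
        return sum(1 for ch in text if ch in chars)

    return {
        "source_length": len(src_tokens),
        "target_length": len(tgt_tokens),
        "length_ratio": len(src_tokens) / len(tgt_tokens) if tgt_tokens else 0.0,
        "punctuation_mismatch": abs(
            count(string.punctuation, source) - count(string.punctuation, target)
        ),
        "digit_mismatch": abs(
            count(string.digits, source) - count(string.digits, target)
        ),
    }


features = black_box_features(
    "Pirates have seized a giant Saudi oil tanker .",
    "Los piratas se han incautado un gigante petrolero saudita .",
)
```

None of these features needs access to the MT system itself, which is what makes them usable across different systems.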
Features
Glass-box (54): depend on some aspect of the translation
process, e.g.:
Language model (target) using n-best list – word/phrase-based
Proximity with other hypotheses in the n-best list
MT base model features
Distortion count, gap count, (compositional) bi-phrase probability
Search nodes in the graph (aborted, pruned)
Proportion of unknown words in the source
Richer scenario:
When it is possible to have access to internal features of
the MT systems.
Learning methods
Feature selection: Partial Least Squares (PLS)
Regression: PLS, SVM
Partial Least Squares Regression
Given two matrices X (input variables) and Y (response
variables): predict Y from X and describe their common
structure.
Projects the original data onto a different space of latent
variables (or “components”):
Provides, as a by-product, an ordering of the original features
according to their importance.
Particularly indicated when the features in X are strongly
correlated (multicollinearity), as is the case with CE datasets.
Widely applied in other fields – not yet for NLP.
Partial Least Squares Regression
Ordinary multiple regression problem:
$Y = X B_w + F$
Where:
Bw is the regression matrix computed directly using an optimal
number of components.
F is the residual matrix.
When X is standardized, an element of Bw with large absolute
value indicates an important X-variable.
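A minimal sketch of how the regression coefficients could be computed and used for ranking, assuming a single response variable (PLS1) and a NIPALS-style deflation. This is an illustration in pure Python, not the implementation used in the experiments; `pls1_coefficients` and `rank_features` are hypothetical names and the toy data are made up.

```python
from math import sqrt


def pls1_coefficients(X, y, n_components):
    """Return PLS1 coefficients for standardized X and centered y."""
    n, p = len(X), len(X[0])
    # Standardize each column of X and center y (assumes no constant column).
    means = [sum(row[j] for row in X) / n for j in range(p)]
    sds = [sqrt(sum((row[j] - means[j]) ** 2 for row in X) / (n - 1)) for j in range(p)]
    E = [[(row[j] - means[j]) / sds[j] for j in range(p)] for row in X]
    y_mean = sum(y) / n
    f = [v - y_mean for v in y]

    loadings, rotations = [], []
    coefs = [0.0] * p
    for _ in range(n_components):
        # Weight vector: direction in X with maximal covariance with the y residual.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(p)]
        norm = sqrt(sum(v * v for v in w))
        if norm < 1e-12:
            break  # nothing left to explain
        w = [v / norm for v in w]
        # Scores (latent variable), X loadings and y loading of this component.
        t = [sum(E[i][j] * w[j] for j in range(p)) for i in range(n)]
        tt = sum(v * v for v in t)
        p_load = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = sum(f[i] * t[i] for i in range(n)) / tt
        # Express the component in terms of the original (undeflated) X columns.
        r = list(w)
        for p_k, r_k in zip(loadings, rotations):
            c = sum(a * b for a, b in zip(p_k, w))
            r = [r_i - c * rk_i for r_i, rk_i in zip(r, r_k)]
        # Deflate X and y, then accumulate the regression coefficients.
        E = [[E[i][j] - t[i] * p_load[j] for j in range(p)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        loadings.append(p_load)
        rotations.append(r)
        coefs = [b + q * r_j for b, r_j in zip(coefs, r)]
    return coefs


def rank_features(coefs):
    """Feature indices sorted from largest to smallest absolute coefficient."""
    return sorted(range(len(coefs)), key=lambda j: abs(coefs[j]), reverse=True)


# Toy data: y depends strongly on feature 0 and weakly on feature 1.
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 7.0]]
y = [3.0 * a + 0.5 * b for a, b in X]
coefs = pls1_coefficients(X, y, n_components=2)
ranking = rank_features(coefs)
```

With as many components as features, PLS reduces to ordinary least squares on the standardized data, which gives a convenient sanity check for the sketch.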
Feature Selection with PLS
Method:
1. Compute the Bw matrix on some training data for different
numbers of components (all possible)
2. Sort the features by the absolute value of their bw-coefficients. This
produces a list of features from the most important to the
least important (Lb)
3. Select the top n features by training and testing on a validation
set according to some objective criterion
4. Train a model on these n features and evaluate it on a test set
5. Evaluate predictions using appropriate metrics
Feature Selection with PLS
Done for each i-th training subsample: obtain several Lb(i)
66   7    …   35   10
44   56   …   9    10
…    …    …   …    …
66   56   …   35   9
The final list L is obtained by picking the most “voted” feature for each
column (mode): L = {66, 56, …, 35, 10}
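The column-wise vote can be sketched as follows, using only the feature indices shown in the example above (the elided columns are left out). The function name `vote_ranking` is illustrative.

```python
from collections import Counter


def vote_ranking(rankings):
    """For each rank position, pick the feature index that occurs most often."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*rankings)]


# One ranked list Lb(i) per training subsample (visible columns only).
rankings = [
    [66, 7, 35, 10],
    [44, 56, 9, 10],
    [66, 56, 35, 9],
]
final_list = vote_ranking(rankings)
```

Note that a plain per-column mode can in principle select the same feature at two positions; how such cases were handled is not stated in the slides.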
Feature Selection with PLS
3. Select the top n features by training and testing on a validation
set according to some objective criterion
• Objective criterion: RMSPE
• Analyze learning curves to select top n features:
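One possible way to read n off a learning curve is to take the smallest n whose validation RMSPE is within a small tolerance of the best value. This is only a sketch: the tolerance rule is an assumption, `select_top_n` is a hypothetical helper, and the curve values below are made up for illustration.

```python
def select_top_n(curve, tolerance=0.01):
    """Return the smallest n whose validation error is within `tolerance` of the best.

    `curve` maps a number of top-ranked features n to the RMSPE obtained on
    the validation set when training with those n features.
    """
    best = min(curve.values())
    return min(n for n, err in curve.items() if err <= best + tolerance)


# Made-up learning curve: n features -> validation RMSPE.
curve = {10: 0.82, 20: 0.74, 30: 0.705, 40: 0.70, 50: 0.71}
n_selected = select_top_n(curve)
```

Preferring the smallest acceptable n reflects the two goals stated earlier: better performance and lower feature extraction cost.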
Feature Selection with PLS
4. Train a model on these n features and evaluate it on a test set
SVM or PLS with optimal number of components
Feature Selection with PLS
5. Evaluate predictions
Root Mean Squared Prediction Error (RMSPE)
Feature Selection with PLS
Root Mean Squared Prediction Error (RMSPE)
$$\mathrm{RMSPE} = \sqrt{\frac{1}{N} \sum_{j=1}^{N} (y_j - \hat{y}_j)^2}$$
N = number of test cases
TP = True positives
FP = False positives
y = expected value
ŷ = prediction obtained
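The metric is simple to compute directly from its definition. A minimal sketch, with made-up scores on the 1-4 scale:

```python
from math import sqrt


def rmspe(expected, predicted):
    """Root mean squared prediction error over N test cases."""
    if not expected or len(expected) != len(predicted):
        raise ValueError("expected and predicted must be non-empty and of equal length")
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(expected, predicted)]
    return sqrt(sum(squared_errors) / len(expected))


# Made-up example: human scores vs. CE predictions for four sentences.
error = rmspe([4, 3, 2, 1], [3, 3, 2, 3])
```

Because errors are squared before averaging, a few large deviations weigh more than many small ones.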
Experiments
Datasets:
Europarl data with quality scores from automatic metrics.
News data with manually assigned quality scores (1-5).
Europarl with manually assigned quality scores (1-4).
Technical documents with manually assigned quality
scores (1-4).
Technical documents with post-edition time annotation.
GKLS 1-4 en-es dataset
WMT-2008 Europarl English-Spanish dev & test data
4K Translations: SMT systems: Matrax, Portage, Sinuhe
and MMR (P-ES-1, P-ES-2, P-ES-3 and P-ES-4)
Quality score: 1-4
1: requires complete retranslation
2: editing quicker than retranslation
3: a little post editing needed
4: fit for purpose
Features: P-ES-1 with glass-box = 131 (77 black-box + 54 glass-box), others = 77 black-box
• Little gain with glass-box features
GKLS 1-4 en-dk dataset
Automotive documents English-Danish
En-Es is a reasonably close language pair, so we also try En-Dk
2K Translations: SMT system trained on 170K parallel
sentences: Matrax
Quality score: 1-4
Features: black-box + glass-box
Results (RMSPE):
Matrax BB + GB   Matrax BB
0.67             0.96
Considerable gain in using glass-box features
• Expected with more distant language pairs
GKLS post-edition time dataset
Automotive documents English-Russian
3K translations: SMT systems trained on 250K sentences:
Matrax, Portage and MMR (P-ER-1, P-ER-2 and P-ER-3)
Quality score: post-edition time in seconds (normalized by
sentence length)
• Given a source sentence in English and its translation into Russian, a
professional translator post-edited the translation to turn it into a
good quality sentence, while the time was recorded.
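A small sketch of the normalization, assuming that sentence length is the number of whitespace-separated words and that the source sentence is the one being measured (the slides do not say which side is used). The function name and the example sentence are illustrative.

```python
def normalized_post_editing_time(seconds: float, sentence: str) -> float:
    """Post-editing time in seconds per word of `sentence`."""
    n_words = len(sentence.split())  # assumption: whitespace tokenization
    if n_words == 0:
        raise ValueError("sentence must contain at least one word")
    return seconds / n_words


# Made-up example: 24 seconds spent post-editing an 8-word sentence.
score = normalized_post_editing_time(24.0, "The oil filter must be replaced every year")
```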
Features: 77 black-box
Discussion
Results for a subset of features outperform all features.
GKLS 1-4: CE models deviate by ~0.6-0.7. E.g.: a sentence
that should be classified as “fit for purpose” would never
be classified as “requires complete retranslation”.
Post-edition time: varies considerably from system to
system. E.g.: P-ES-1 (~2), CE models deviate up to 2
seconds/word.
Manually annotated datasets
Best features (BB):
source & target sentence 3-gram language
model probability
source & target sentence lengths
percentage and mismatch in the numbers and punctuation
symbols in the source and target
Manually annotated datasets
Best features (GB):
size of the n-best list
sentence n-gram log-probabilities using the n-best for
training a LM (using words or phrases)
bi-phrase count
distortion count
bi-phrase probability
translation model
average size of hypotheses in the n-best list
number of search graph nodes in final decoder’s graph
More results: GKLS 1-4 en-es
System combination:
1. Produce CE predictions for a test set decoded by 4
systems
2. Sort the four CE predictions for each test instance
3. For each test instance, select the translation with the highest score
4. Evaluate whether this was the best sentence according
to human annotation:
– matches for Top = 659 / 802 = 82.1%
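The four steps can be sketched as below. The system names come from the dataset description; the scores are made up, and counting a tie with the best human score as a match is an assumption, since the slides do not say how ties were treated.

```python
def pick_best_system(ce_scores):
    """Return the system whose translation has the highest CE prediction."""
    return max(ce_scores, key=ce_scores.get)


def match_rate(test_set):
    """Fraction of instances where the CE-selected translation is also the
    best one according to the human scores (ties count as matches)."""
    matches = 0
    for ce_scores, human_scores in test_set:
        chosen = pick_best_system(ce_scores)
        if human_scores[chosen] == max(human_scores.values()):
            matches += 1
    return matches / len(test_set)


# Made-up CE predictions and human scores (1-4) for two test instances.
test_set = [
    (
        {"P-ES-1": 3.1, "P-ES-2": 2.4, "P-ES-3": 2.9, "P-ES-4": 1.8},
        {"P-ES-1": 4, "P-ES-2": 2, "P-ES-3": 3, "P-ES-4": 1},
    ),
    (
        {"P-ES-1": 2.0, "P-ES-2": 2.6, "P-ES-3": 2.2, "P-ES-4": 2.5},
        {"P-ES-1": 2, "P-ES-2": 2, "P-ES-3": 3, "P-ES-4": 2},
    ),
]
rate = match_rate(test_set)
```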
More results: GKLS 1-4 en-es
Predictive power of features: Pearson’s correlation:
[Figure: Pearson's correlation (axis from -0.6 to 0.8) for CE, Aborted nodes, SMT score, Ratio scores, LM target, LM source, Bi-phrase prob, TM, Sent length, BAD 117, BAD 76]
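Pearson's correlation between a single feature (or the CE score) and a reference score can be computed as in this sketch. The values are made up and the function name is illustrative.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up example: human scores vs. two hypothetical feature values.
human = [1, 2, 3, 4]
r_positive = pearson(human, [2.0, 4.0, 6.0, 8.0])
r_negative = pearson(human, [8.0, 6.0, 4.0, 2.0])
```

A feature with a strong negative correlation is as informative as one with a strong positive correlation; only the sign of its relation to quality differs.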
More results: GKLS 1-4 en-es
CE score x MT metrics - Pearson’s correlation:
[Figure: Pearson's correlation (axis from -0.4 to 0.7) for BLEU-4, BLEU-2, NIST, TER, Meteor exact, Meteor porter, CE]
More results: GKLS 1-4 en-es
Filter out bad translations for Lang. Service Providers
Average scores x TOP N
[Figure: average score (axis from 2.5 to 3.7) for the top 100, top 200, top 300 and top 500; series: Human, CE, Aborted nodes, SMT score, Ratio scores, LM target, LM source, Bi-phrase prob, TM, Sent length, BAD 117, BAD 76]
More results: GKLS 1-4 en-es
Filter out bad translations for Lang. Service Providers
[Figure: how many 3-4 in the top 100, top 200 and top 500 (axis from 80 to 360); series: Human, CE, Aborted N, Length]
[Figure: how many 1-2 in the bottom 100, bottom 200 and bottom 500 (axis from 50 to 350)]
More results: GKLS 1-4 en-dk
Predictive power of features: Pearson’s correlation
[Figure: Pearson's correlation (axis from -0.6 to 0.8); labels: Source LM log probability, Length, Best BB, Best GB, CE, Target sentence 1-gram perplexity using n-best for training a LM]
More results: GKLS 1-4 en-dk
Filter out bad translations for Lang. Service Providers
Aim is to do better than sentence length
Rank (300) test sentences according to true score, CE score or
sent length and take average true score for top-n:
                  True   CE score   Sent length
average top 100   3.53   3.26       3.02
average top 200   3.06   2.93       2.825
CE >= 2.3 (~ 3 true score) x Sentence Length = 2.3 2.983516 182
CE >= 2.11* 2.868778 221
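The ranking experiment can be sketched as follows. The data are made up, the field names are illustrative, and ranking by ascending sentence length (shorter first) is an assumption about how the length baseline was applied.

```python
def average_top_n(sentences, key, n, descending=True):
    """Rank sentences by `key` and return the mean true score of the top n."""
    ranked = sorted(sentences, key=lambda s: s[key], reverse=descending)
    top = ranked[:n]
    return sum(s["true"] for s in top) / len(top)


# Made-up test sentences: true (human) score, CE prediction, sentence length.
sentences = [
    {"true": 4, "ce": 3.4, "length": 8},
    {"true": 2, "ce": 2.1, "length": 25},
    {"true": 3, "ce": 3.0, "length": 5},
    {"true": 1, "ce": 2.6, "length": 6},
]
by_true = average_top_n(sentences, "true", 2)
by_ce = average_top_n(sentences, "ce", 2)
# Assumption: shorter sentences are ranked first for the length baseline.
by_length = average_top_n(sentences, "length", 2, descending=False)
```

Ranking by the true score gives the upper bound; a useful CE score should land closer to it than the sentence length baseline does.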
More results: GKLS 1-4 en-dk
Using thresholds on CE score or Sent Length we
select:
86% of the good sentences (human scores = 3 or 4)
‘bad’ sentences (human scores = 1 or 2):
Length > 12 44%
CE score 12 AND CE score >= 2.3 6% are 3-4 67%
Length <=12 AND CE score < 2.3 19% are 1-2 77%
Discussion
Results considered to be satisfactory (except for post-
edition time).
Prediction errors are similar across different language pairs.
The quality of the MT system has some influence.
Predictions correlate better with human scores than
metrics using reference translations.
Prediction error would yield uncertainty in the boundaries
between two adjacent categories only.
Estimating continuous scores is more appropriate than
binary classification, even for a binary application
Use of Inductive Confidence Machines to threshold
predicted score
Discussion
Most relevant features include many that have not
been used before
average size of the phrases in the target,
several mismatches between the source and target,
proportion of aborted search nodes, etc.
Future work: further investigate uses for these most
relevant features:
1. In SMT models to improve translation quality
• To complement existing features in SMT models.
• To rerank n-best lists produced by SMT systems.
Discussion
2. In MT evaluation:
• To provide additional features to reference-based
metrics based on ML algorithms.
• To provide a score to be combined with other MT
evaluation metrics.
• To provide a new evaluation metric in itself, with some
function to optimize the correlation with human
annotations, without the need of reference translations.
Discussion
Uses of CE score for other applications:
Cross-language information retrieval
Finding parallel data from comparable corpus
…
Whenever it is important to identify whether a target
sentence is a GOOD translation of a source sentence.
Thanks!
Lucia Specia
lucia.specia@xrce.xerox.com
The source…
Pirates have seized a giant Saudi oil tanker carrying its full
load of 2m barrels - more than one-quarter of Saudi Arabia's
daily output - on Saturday off the Kenyan coast (some 450
nautical miles south-east of Mombasa) and are steering it
towards the Somali port of Eyl, the US Navy says. – BBC
News, 17/11/08