Shared by: niusheng11
posted:
12/2/2011
language:
English
pages:
42
Confidence Estimation for Machine Translation

Lucia Specia

Xerox Research Centre Europe – Grenoble

(collaboration with Marco Turchi & Nello Cristianini – U. Bristol)

Outline

• The task of Confidence Estimation for Machine Translation
• Our approach
• Method
• Features
• Algorithms
• Experiments
• On-going and future work

The task of CE for MT

• Goal: given the output of a Machine Translation (MT) system for a given input, provide an estimate of its quality.
• Motivation: assessing the quality of translations is
  – Time consuming – reading the translation takes time:

    Los piratas se han incautado un gigante petrolero saudita llevando su carga completa de 2 m de barriles - más de una cuarta parte de la Arabia Saudita de la producción diaria - el sábado frente a la costa de Kenya (unas 450 millas náuticas al sudeste de Mombasa) y la dirección hacia la puerto somalí de AEL, la Marina de los EE.UU. dice.

  – Not possible – if the user does not know the source language:

    海賊や巨大なサウジの石油タンカー1日、サウジアラビアの生産量の4分の1 -土曜日には、ケニア沖以上(一部450海里南東モンバサ)バレル分の全負荷-帳簿を押収しているに向けて運営しているソマリアEylのポートは、米国海軍という。....

The task of CE for MT

Uses:

• Is it worth providing this translation as a suggestion to the professional translator?
  – Filter out “bad” translations to avoid professional translators wasting time reading / post-editing them.
• Should this translation be highlighted as “suspect” to the reader?
  – Make end-users aware of the translation quality.

General approach

• Different from MT evaluation (BLEU/NIST): reference translations are NOT available
• Unit: word, phrase or sentence
• Embedded in the SMT system (word or phrase probabilities) or a dedicated layer (machine learning problem)
• Binary problem: distinguish between good and bad translations


Related work (sentence-level)

• Workshop at JHU (Blatz et al, Coling-04) & MSR (Quirk, LREC-04)
  – Automatic MT metrics or few manually assessed cases
  – Poor analysis of the contribution of different features
  – Predictions did not have a positive effect on practical tasks
• MSR (Gamon et al, EAMT-05)
  – Human-likeness classification
  – Resource-dependent features
  – Poor performance compared to BLEU: little correlation with human judgments
• One MT system, one domain, one language pair
• Only good / bad estimates: binary task

Our approach

• Sentence-level: natural scenario for MT
• Many resource- & language-independent features
  – Take the contribution of features into account
  – MT-system-dependent vs. independent features
• Machine learning problem: regression
  – Any continuous score
  – Human annotations as training data
• Several MT systems, text domains, language pairs and quality scores
• Results: useful in practical applications

Method

1. Identify and extract information sources.
2. Refine the set of information sources to keep only the relevant ones:
   – Increase performance.
   – Decrease extraction cost (time).
3. Learn a model to produce quality scores for new translations.
4. Use the CE score in some application.

Features

• Most of those identified in previous work + new ones
• Black-box (77): from the input and translation sentences, monolingual or parallel corpora, e.g.:
  – Source and target sentence lengths and their ratios
  – Language model and other statistics in the corpus
  – Shallow syntax checking (target and target against source)
  – Average number of possible translations per source word (SMT)
• Practical scenario:
  – Useful when it is not possible to have access to internal features of the MT systems (e.g., commercial systems).
  – Provides a way to perform the task of CE across different MT systems, which may use different frameworks.
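
A few of the black-box features listed above can be sketched directly from the sentence pair. This is a minimal illustration, not the feature extractor used in the experiments: the function name `black_box_features`, the whitespace tokenization, the punctuation set and the example sentences are all assumptions made for the sketch.

```python
import re

NUMBER = re.compile(r"\d+(?:[.,]\d+)?")
PUNCTUATION = ".,;:!?"

def black_box_features(source: str, target: str) -> dict:
    """Compute a handful of illustrative black-box features for one sentence pair."""
    src_tokens = source.split()  # assumption: text is already whitespace-tokenized
    tgt_tokens = target.split()
    # Numbers normally survive translation, so numbers present on one side only are suspicious.
    src_numbers = set(NUMBER.findall(source))
    tgt_numbers = set(NUMBER.findall(target))
    src_punct = sum(source.count(ch) for ch in PUNCTUATION)
    tgt_punct = sum(target.count(ch) for ch in PUNCTUATION)
    return {
        "source_length": len(src_tokens),
        "target_length": len(tgt_tokens),
        "length_ratio": len(src_tokens) / max(len(tgt_tokens), 1),
        "number_mismatch": len(src_numbers ^ tgt_numbers),
        "punctuation_diff": abs(src_punct - tgt_punct),
    }

# Hypothetical sentence pair, for illustration only.
features = black_box_features(
    "Pirates seized a tanker carrying 2 m barrels .",
    "Los piratas se han incautado un petrolero con 2 m de barriles .",
)
```

None of these values needs access to the MT system itself, which is what makes such features usable with commercial systems.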

Features

• Glass-box (54): depend on some aspect of the translation process, e.g.:
  – Language model (target) using the n-best list – word/phrase-based
  – Proximity to other hypotheses in the n-best list
  – MT base model features
  – Distortion count, gap count, (compositional) bi-phrase probability
  – Search nodes in the graph (aborted, pruned)
  – Proportion of unknown words in the source
• Richer scenario:
  – When it is possible to have access to internal features of the MT systems.

Learning methods

• Feature selection: Partial Least Squares (PLS)
• Regression: PLS, SVM

Partial Least Squares Regression

• Given two matrices X (input variables) and Y (response variables): predict Y from X and describe their common structure.
• Projects the original data onto a different space of latent variables (or “components”).
• Provides, as a by-product, an ordering of the original features according to their importance.
• Particularly indicated when the features in X are strongly correlated (multicollinearity) → the case of CE datasets.
• Widely applied in other fields – not yet for NLP.

Partial Least Squares Regression

• Ordinary multiple regression problem:

$$Y = X B_w + F$$

• Where:
  – $B_w$ is the regression matrix, computed directly using an optimal number of components.
  – $F$ is the residual matrix.
• When X is standardized, an element of $B_w$ with a large absolute value indicates an important X-variable.
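
To make the role of the regression coefficients concrete, here is a minimal pure-Python sketch of PLS for a single response variable (PLS1, NIPALS-style with deflation), followed by a ranking of features by absolute coefficient. It is an illustration under assumptions, not the implementation used in the experiments: the function names, the toy data and the choice of two components are invented for the example.

```python
import math
import random

def standardize(X):
    """Center each column and scale it to unit standard deviation."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    stds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / n) for j in range(p)]
    return [[(row[j] - means[j]) / stds[j] for j in range(p)] for row in X]

def pls1_coefficients(X, y, n_components):
    """Return the PLS1 regression coefficients for standardized X and centered y."""
    n, p = len(X), len(X[0])
    Xk = standardize(X)
    y_mean = sum(y) / n
    yc = [v - y_mean for v in y]
    b = [0.0] * p
    rotations, loadings = [], []
    for _ in range(n_components):
        # Weight vector: direction of maximal covariance with y in the deflated X.
        w = [sum(Xk[i][j] * yc[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm < 1e-12:
            break  # nothing left to explain
        w = [v / norm for v in w]
        t = [sum(Xk[i][j] * w[j] for j in range(p)) for i in range(n)]  # scores
        tt = sum(v * v for v in t)
        load = [sum(Xk[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = sum(yc[i] * t[i] for i in range(n)) / tt
        # Rotation: expresses this component in terms of the original standardized X.
        r = w[:]
        for r_prev, p_prev in zip(rotations, loadings):
            proj = sum(p_prev[j] * w[j] for j in range(p))
            r = [r[j] - proj * r_prev[j] for j in range(p)]
        rotations.append(r)
        loadings.append(load)
        b = [b[j] + q * r[j] for j in range(p)]
        # Deflate X by removing the part explained by this component.
        Xk = [[Xk[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
    return b

def rank_features(b):
    """Feature indices sorted from largest to smallest absolute coefficient."""
    return sorted(range(len(b)), key=lambda j: abs(b[j]), reverse=True)

# Toy data (made up): feature 1 drives the score, feature 0 helps a little, feature 2 is noise.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
y = [0.5 * row[0] + 3.0 * row[1] + rng.gauss(0, 0.1) for row in X]
b_w = pls1_coefficients(X, y, n_components=2)
ranking = rank_features(b_w)
```

In this toy setting the ranking should start with feature 1, the feature that dominates the score. In practice a library implementation of PLS would normally be preferred over hand-written loops.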

Feature Selection with PLS

• Method:
  1. Compute the $B_w$ matrix on some training data for different numbers of components (all possible).
  2. Sort the absolute values of the $b_w$-coefficients. This produces a list of features from the most important to the least important (Lb).
  3. Select the top n features by training and testing on a validation set according to some objective criterion.
  4. Train and test these n features on a test set.
  5. Evaluate predictions using appropriate metrics.

Feature Selection with PLS

• Step 2 is done for each i-th training subsample, which gives several lists Lb(i), one per row:

66 7 … 35 10
44 56 … 9 10
… … … … …
66 56 … 35 9

• The final list L is obtained by picking the most “voted” feature in each column (the mode): L = {66, 56, …, 35, 10}
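
The column-wise voting can be sketched as follows. The function name `vote_ranked_lists` is invented for the example, and the input reuses only the columns that are visible on the slide (the elided "…" column and row are left out).

```python
from collections import Counter

def vote_ranked_lists(ranked_lists):
    """Return the most frequent feature id at each rank position (column-wise mode)."""
    final = []
    for column in zip(*ranked_lists):
        feature, _count = Counter(column).most_common(1)[0]
        final.append(feature)
    return final

# One ranked list Lb(i) per training subsample; values taken from the visible columns above.
lists = [
    [66, 7, 35, 10],
    [44, 56, 9, 10],
    [66, 56, 35, 9],
]
final_list = vote_ranked_lists(lists)  # [66, 56, 35, 10]
```

On ties, `Counter.most_common` keeps the value seen first, which is an arbitrary choice for this sketch. The slides do not say how ties are resolved.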

Feature Selection with PLS

• Step 3 (select the top n features by training and testing on a validation set according to some objective criterion):
  – Objective criterion: RMSPE
  – Analyze learning curves to select the top n features.

Feature Selection with PLS

• Step 4 (train and test these n features on a test set):
  – SVM or PLS with the optimal number of components

Feature Selection with PLS

• Step 5 (evaluate predictions): Root Mean Squared Prediction Error (RMSPE)

$$\mathrm{RMSPE} = \sqrt{\frac{1}{N} \sum_{j=1}^{N} \left( y_j - \hat{y}_j \right)^2}$$

N = number of test cases
TP = true positives
FP = false positives
y = expected value
ŷ = prediction obtained
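
As a small worked example of the RMSPE definition, the sketch below computes it for four made-up test cases on the 1-4 quality scale. The function name `rmspe` and the numbers are illustrative only.

```python
import math

def rmspe(expected, predicted):
    """Root mean squared prediction error over paired expected and predicted scores."""
    if not expected or len(expected) != len(predicted):
        raise ValueError("expected and predicted must be non-empty and of equal length")
    squared_errors = ((y - y_hat) ** 2 for y, y_hat in zip(expected, predicted))
    return math.sqrt(sum(squared_errors) / len(expected))

# Made-up human scores and CE predictions.
score = rmspe([4, 3, 1, 2], [3.5, 3.0, 2.0, 2.5])
# Squared errors: 0.25, 0, 1, 0.25 -> mean 0.375 -> RMSPE ~ 0.61
```

Because the error is expressed in the same units as the score, an RMSPE of about 0.6 on a 1-4 scale means predictions are typically off by less than one category.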

Experiments

• Datasets:
  – Europarl data with quality scores from automatic metrics.
  – News data with manually assigned quality scores (1-5).
  – Europarl with manually assigned quality scores (1-4).
  – Technical documents with manually assigned quality scores (1-4).
  – Technical documents with post-edition time annotation.

GKLS 1-4 en-es dataset

• WMT-2008 Europarl English-Spanish dev & test data
• 4K translations; SMT systems: Matrax, Portage, Sinuhe and MMR (P-ES-1, P-ES-2, P-ES-3 and P-ES-4)
• Quality score: 1-4
  – 1: requires complete retranslation
  – 2: editing quicker than retranslation
  – 3: a little post-editing needed
  – 4: fit for purpose
• Features: P-ES-1gb = 131 (77 black-box + 54 glass-box), others = 77 black-box
• Little gain with glass-box features

GKLS 1-4 en-dk dataset

• Automotive documents English-Danish
• En-Es is a reasonably close language pair, so try En-Dk
• 2K translations; SMT system trained on 170K parallel sentences: Matrax
• Quality score: 1-4
• Features: black-box + glass-box
• Results (RMSPE):

Matrax BB + GB   0.67
Matrax BB        0.96

• Considerable gain in using glass-box features
  – Expected with more distant language pairs

GKLS post-edition time dataset

• Automotive documents English-Russian
• 3K translations; SMT systems trained on 250K sentences: Matrax, Portage and MMR (P-ER-1, P-ER-2 and P-ER-3)
• Quality score: post-edition time in seconds (normalized by sentence length)
  – Given a source sentence in English and its translation into Russian, a professional translator post-edited the translation into a good-quality sentence while the time was recorded.
• Features: 77 black-box

Discussion

• Results for a subset of features outperform those for all features.
• GKLS 1-4: CE models deviate by ~0.6-0.7. E.g., a sentence that should be classified as “fit for purpose” would never be classified as “requires complete retranslation”.
• Post-edition time: results vary considerably from system to system. E.g., P-ES-1 (~2): CE models deviate by up to 2 seconds/word.

Manually annotated datasets

Best features (BB):

• source & target sentence 3-gram language model probability
• source & target sentence lengths
• percentage and mismatch in the numbers and punctuation symbols in the source and target

Manually annotated datasets

Best features (GB):

• size of the n-best list
• sentence n-gram log-probabilities using the n-best list for training a LM (using words or phrases)
• bi-phrase count
• distortion count
• bi-phrase probability
• translation model
• average size of hypotheses in the n-best list
• number of search graph nodes in the final decoder’s graph

More results: GKLS 1-4 en-es

• System combination:
  1. Produce CE predictions for a test set decoded by 4 systems.
  2. Sort the four CE predictions for each test instance.
  3. Select the translation with the highest CE score.
  4. Evaluate whether this was the best sentence according to human annotation:
     – matches for Top = 659 / 802 ≈ 82.2%
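
The selection step above can be sketched as follows. The function name `top_match_rate`, the per-sentence dictionaries and all scores are made up for illustration. Counting a tie with the best human score as a match is also an assumption, since the slides do not say how ties were handled.

```python
def top_match_rate(ce_scores, human_scores):
    """Fraction of sentences where the system with the highest CE score
    also has the best (possibly tied) human score.

    Both arguments hold one dict {system_name: score} per test sentence.
    """
    matches = 0
    for ce, human in zip(ce_scores, human_scores):
        chosen = max(ce, key=ce.get)        # system picked by the CE model
        if human[chosen] == max(human.values()):
            matches += 1
    return matches / len(ce_scores)

# Two made-up test sentences, each translated by the four systems.
ce = [
    {"P-ES-1": 3.1, "P-ES-2": 2.4, "P-ES-3": 2.9, "P-ES-4": 1.8},
    {"P-ES-1": 2.0, "P-ES-2": 2.6, "P-ES-3": 2.2, "P-ES-4": 2.5},
]
human = [
    {"P-ES-1": 4, "P-ES-2": 2, "P-ES-3": 3, "P-ES-4": 1},
    {"P-ES-1": 2, "P-ES-2": 2, "P-ES-3": 3, "P-ES-4": 3},
]
rate = top_match_rate(ce, human)  # first sentence matches, second does not
```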

More results: GKLS 1-4 en-es

• Predictive power of features: Pearson’s correlation

[Figure: Pearson’s correlation (axis from -0.6 to 0.8) for CE, Aborted nodes, SMT score, Ratio scores, LM target, LM source, Bi-phrase prob, TM, Sent length, BAD 117 and BAD 76]
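
For reference, Pearson's correlation between a feature (or the CE score) and the quality scores can be computed as in this minimal sketch. The function name `pearson` and the toy values are assumptions for the example.

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up human scores and two made-up predictors.
human = [1, 2, 3, 4]
good_predictor = [1, 3, 2, 4]      # mostly agrees with the human ranking
inverse_predictor = [8, 6, 4, 2]   # e.g. a quantity that falls as quality rises

r_good = pearson(human, good_predictor)        # 0.8
r_inverse = pearson(human, inverse_predictor)  # -1.0
```

A strongly negative value is as informative as a strongly positive one, so the features to disregard are those close to zero.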

More results: GKLS 1-4 en-es

• CE score x MT metrics – Pearson’s correlation

[Figure: Pearson’s correlation (axis from -0.4 to 0.7) for BLEU-4, BLEU-2, NIST, TER, Meteor exact, Meteor porter and CE]

More results: GKLS 1-4 en-es

• Filter out bad translations for Language Service Providers

[Figure: Average scores x TOP N (axis from 2.5 to 3.7) for the top 100, top 200, top 300 and top 500 sentences. Series: Human, CE, Aborted nodes, SMT score, Ratio scores, LM target, LM source, Bi-phrase prob, TM, Sent length, BAD 117 and BAD 76]

More results: GKLS 1-4 en-es

• Filter out bad translations for Language Service Providers

[Figure: how many sentences scored 3-4 are in the top 100, top 200 and top 500 (axis from 80 to 360). Series: Human, CE, Aborted N and Length]

[Figure: how many sentences scored 1-2 are in the bottom 100, bottom 200 and bottom 500 (axis from 50 to 350)]

More results: GKLS 1-4 en-dk

• Predictive power of features: Pearson’s correlation

[Figure: Pearson’s correlation (axis from -0.6 to 0.8). Labels: Source LM log probability, Length, Best BB, Best GB, CE, Target sentence 1-gram perplexity using the n-best list for training a LM]

More results: GKLS 1-4 en-dk

• Filter out bad translations for Language Service Providers
• Aim is to do better than sentence length
• Rank the (300) test sentences according to true score, CE score or sentence length and take the average true score for the top n:

                  True   CE score   Sent length
average top 100   3.53   3.26       3.02
average top 200   3.06   2.93       2.825



 CE >= 2.3 (~ 3 true score) x Sentence Length = 2.3 2.983516 182

CE >= 2.11* 2.868778 221
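
The top-n comparison on this slide can be sketched as below: rank the test sentences by some key (true score, CE score, or another feature) and average the true scores of the first n. The function name `average_true_score_top_n` and the six toy sentences are made up; they are not the 300 sentences of the experiment.

```python
def average_true_score_top_n(true_scores, ranking_keys, n):
    """Average true score of the n sentences ranked highest by ranking_keys."""
    if n <= 0:
        raise ValueError("n must be positive")
    order = sorted(range(len(true_scores)), key=lambda i: ranking_keys[i], reverse=True)
    top = order[:n]
    return sum(true_scores[i] for i in top) / len(top)

# Made-up human (true) scores and CE predictions for six sentences.
true = [4, 3, 1, 2, 4, 2]
ce = [3.4, 3.0, 1.5, 2.6, 2.2, 2.8]

oracle_top3 = average_true_score_top_n(true, true, 3)  # ranking by the true score itself
ce_top3 = average_true_score_top_n(true, ce, 3)        # ranking by the CE score
```

Ranking by the true score gives the upper bound. The closer the CE-ranked average is to it, the better the CE score works as a filter.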

More results: GKLS 1-4 en-dk

• Using thresholds on CE score or sentence length we select:
  – 86% of the good sentences (human scores = 3 or 4)
  – ‘bad’ sentences (human scores = 1 or 2):



Length > 12 44%

CE score 12 AND CE score >= 2.3 6% are 3-4 67%

Length <=12 AND CE score < 2.3 19% are 1-2 77%

Discussion

• Results considered to be satisfactory (except for post-edition time).
  – Prediction errors are similar across different language pairs.
  – The quality of the MT system has some influence.
• Predictions correlate better with human scores than metrics using reference translations.
• Prediction error would yield uncertainty in the boundaries between two adjacent categories only.
• Estimating continuous scores is more appropriate than binary classification, even for a binary application.
  – Use of Inductive Confidence Machines to threshold the predicted score

Discussion

• Most relevant features include many that have not been used before:
  – average size of the phrases in the target,
  – several mismatches between the source and target,
  – proportion of aborted search nodes, etc.
• Future work: further investigate uses for these most relevant features:
  1. In SMT models to improve translation quality
     • To complement existing features in SMT models.
     • To rerank n-best lists produced by SMT systems.

Discussion

  2. In MT evaluation:
     • To provide additional features to reference-based metrics based on ML algorithms.
     • To provide a score to be combined with other MT evaluation metrics.
     • To provide a new evaluation metric in itself, with some function to optimize the correlation with human annotations, without the need for reference translations.

Discussion

• Uses of the CE score for other applications:
  – Cross-language information retrieval
  – Finding parallel data in comparable corpora
  – …

Whenever it is important to identify whether a target sentence is a GOOD translation of a source sentence.

Thanks!

Lucia Specia

lucia.specia@xrce.xerox.com

The source…

Pirates have seized a giant Saudi oil tanker carrying its full load of 2m barrels - more than one-quarter of Saudi Arabia's daily output - on Saturday off the Kenyan coast (some 450 nautical miles south-east of Mombasa) and are steering it towards the Somali port of Eyl, the US Navy says. – BBC News, 17/11/08


