Corpus-based Machine Translation
• Statistical Machine Translation (SMT)
• Example-based Machine Translation (EBMT)
• Multi-engine Machine Translation
• Speech-to-speech translation

11-682, LTI, Carnegie Mellon

Statistical Machine Translation
• Mapping L0 to L1 without explicit rules
• Language modeling to choose the best path
• Finding data


Translation: source-channel model
1. IBM: “5 models”
2. Statistically based, thus using little linguistic knowledge
3. Based on the Hansard Canadian Parliament bilingual corpus: French to English


Statistical Translation
• Estimate the probability of sentence pairs
• Uses a bilingual corpus (sentence aligned)
• ê = argmax_e Pr(e | f)
  – ê represents the target language (English)
  – f represents the source language (French)

• By Bayes’ rule: ê = argmax_e Pr(e) Pr(f | e)
  – the Fundamental Equation of Statistical Machine Translation
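The source-channel argmax above can be sketched in a few lines. This is a minimal, hypothetical decoder over two candidate English strings; the probability tables are made-up illustrative numbers, not trained values.

```python
import math

# Pr(e): language-model score of each English candidate (illustrative)
lm = {"the house": 0.05, "house the": 0.0001}

# Pr(f|e): channel-model score for a fixed French input (illustrative)
channel = {"the house": 0.4, "house the": 0.4}

def decode(candidates):
    """Return argmax_e Pr(e) * Pr(f|e), computed in log space."""
    return max(candidates,
               key=lambda e: math.log(lm[e]) + math.log(channel[e]))

best = decode(["the house", "house the"])
print(best)  # the language model prefers the grammatical word order
```

Note how the channel model alone cannot separate the two candidates; the language model Pr(e) breaks the tie, which is exactly why the decomposition helps.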


Statistical Translation
Why care about Pr(f | e) when Pr(e | f) is just as hard?
• Pr(e): probability of an English sentence
  – so we define a language model
  – to independently evaluate “grammaticality”
• Pr(f | e):
  – removes the constraint that “e” be well-formed
  – potentially easier to train
• Decomposes the problem


IBM translation model: 5 models
Brown et al. 1993
• Really, all models are used at once
• Trained at different stages
• The split makes training tractable
• All levels use some form of EM (Expectation Maximization)


IBM translation model: General Framework
Align the English and French sentences:
• each French word aligns to zero or one English word
• multiple French words may align to a single English word
• multiple English words may not align to a single French word


IBM model 1
Generative model:
• choose the length m of the French sentence
• choose which English word each French word will align with
• choose the identity of each French word

Pr(f, a | e) = Pr(m | e) ∏_{j=1..m} Pr(a_j | a_1^{j−1}, f_1^{j−1}, m, e) Pr(f_j | a_1^j, f_1^{j−1}, m, e)

  f – the entire French sentence
  e – the entire English sentence
  a – the set of alignments
  m – the length of the French sentence
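The EM training behind Model 1 can be sketched compactly. The toy sentence-aligned corpus and iteration count below are illustrative assumptions, and the special NULL English word is omitted for brevity.

```python
from collections import defaultdict

# Toy sentence-aligned corpus (illustrative, not real training data)
corpus = [
    (["the", "house"], ["la", "maison"]),
    (["the", "book"], ["le", "livre"]),
    (["a", "book"], ["un", "livre"]),
]

def train_model1(corpus, iterations=10):
    """EM for IBM Model 1: t[f][e] approximates Pr(f | e)."""
    e_vocab = {e for es, _ in corpus for e in es}
    # start from a uniform translation table
    t = defaultdict(lambda: defaultdict(lambda: 1.0 / len(e_vocab)))
    for _ in range(iterations):
        count = defaultdict(lambda: defaultdict(float))
        total = defaultdict(float)
        for es, fs in corpus:
            for f in fs:
                # E-step: expected alignment counts, normalized over e
                z = sum(t[f][e] for e in es)
                for e in es:
                    c = t[f][e] / z
                    count[f][e] += c
                    total[e] += c
        # M-step: re-normalize expected counts into new t(f|e)
        for f in count:
            for e in count[f]:
                t[f][e] = count[f][e] / total[e]
    return t

t = train_model1(corpus)
# Pr(livre | book) climbs well above its uniform 1/4 starting value,
# because "livre" co-occurs with "book" in two of the three pairs
print(t["livre"]["book"])
```

Because "book" appears with "livre" twice but with "le" and "un" only once each, EM concentrates probability mass on the correct pairing even without any linguistic annotation.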


IBM model 2-5
• Still generative models
• Model 2: cares about which alignments (position matters)
• Models 3–5: “fertility” of words, permutations


“no linguistic knowledge”
Sort of.
• But French to English may make this easier:
  – F–E doesn’t care about gender or morphology
  – multiple French verbs map to one English verb


Getting Data
• Want “buckets” of parallel corpora:
  – sentence to sentence
• Government organizations:
  – Canadian Parliament (Hansard)
  – Hong Kong Parliament
  – EU and UN documents
• Web spiders to find parallel corpora:
  – search for sites with parallel info
  – universities, companies
  – news reports (e.g. asahi.com)


Language Models
How can you find Pr(e)?
• What’s the probability that “...” is an English sentence?
• List all English sentences? ...
• Need to model them


N-gram Language Models
Local probabilities
• Likelihood of sequences:
  – unigrams, bigrams, trigrams, ...
• Markov assumption:
  – P(X_{t+1} | X_t, X_{t−1}, ..., X_{t−n})
• Need lots of data:
  – say 64K different words
  – there are 64K^3 possible trigrams (which is quite big)
• But the distribution isn’t even:
  – some trigrams never appear
  – “oyster oyster oyster” never occurs
  – “the blue table” never occurs
• Smoothing techniques:
  – how to deal with out-of-vocabulary/coverage
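A toy trigram model with add-one (Laplace) smoothing illustrates the counting and the smoothing idea above. The tiny corpus is an illustrative assumption; real systems use far larger corpora and better smoothing schemes.

```python
from collections import Counter

# Tiny illustrative corpus
corpus = "the blue table sat on the blue floor".split()

trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams = Counter(zip(corpus, corpus[1:]))
vocab = set(corpus)

def p_trigram(w1, w2, w3):
    """Pr(w3 | w1 w2) with add-one smoothing, so unseen trigrams
    still get a small nonzero probability."""
    return (trigrams[(w1, w2, w3)] + 1) / (bigrams[(w1, w2)] + len(vocab))

print(p_trigram("the", "blue", "table"))   # seen trigram: higher probability
print(p_trigram("the", "blue", "oyster"))  # unseen trigram: small but nonzero
```

Without the +1 terms, any sentence containing one unseen trigram would get probability zero, which is exactly the coverage problem smoothing addresses.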

Probability of a sentence
We can approximate Pr(ê):
• Pr(ê) ≈ ∏_{i=1..k} Pr(e_i | e_{i−1}, ..., e_{i−N+1})
• These probabilities get *very* small
• so we use log probabilities (and add):
  – log Pr(ê) ≈ Σ_{i=1..k} log Pr(e_i | e_{i−1}, ..., e_{i−N+1})
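The product-versus-log-sum point can be shown directly. The bigram table below is a made-up illustration; `<s>` and `</s>` are assumed sentence-boundary markers.

```python
import math

# Illustrative bigram probabilities (not from a real model)
bigram_prob = {("<s>", "the"): 0.2, ("the", "house"): 0.1, ("house", "</s>"): 0.3}

def sentence_logprob(words):
    """log Pr(sentence) = sum of log Pr(w_i | w_{i-1})."""
    padded = ["<s>"] + words + ["</s>"]
    return sum(math.log(bigram_prob[(a, b)])
               for a, b in zip(padded, padded[1:]))

lp = sentence_logprob(["the", "house"])
print(lp)  # a manageable negative number instead of a vanishing product
```

For long sentences the raw product underflows floating point, while the sum of logs stays in a comfortable numeric range.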


Example-based machine translation
• From a parallel corpus of translated sentences:
  – select the one you’ve seen before
• Generalization
• Word-level alignment
• Applications


EBMT paradigm
• Build parses of sentences
• Find the sentence that matches:
  – need a very large database
• Find longest substrings:
  – need to align words in sentences
• Need to generalize
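The “find longest substrings” step can be sketched as an n-gram lookup against a translation memory. The memory contents here are made up for illustration; a real EBMT system would index a far larger database and also retrieve the aligned translations.

```python
# Illustrative translation memory (source side only)
memory = [
    "can you tell me the way to the station",
    "the conference center is closed today",
]

def longest_matches(sentence):
    """Return the longest word n-grams of `sentence` that occur
    somewhere in the translation memory."""
    words = sentence.split()
    matches = []
    n = len(words)
    # shrink the window until at least one chunk is found
    while n > 0 and not matches:
        for i in range(len(words) - n + 1):
            chunk = " ".join(words[i:i + n])
            if any(chunk in entry for entry in memory):
                matches.append(chunk)
        n -= 1
    return matches

print(longest_matches("tell me the way to the conference center"))
```

The matched chunks would then be stitched together from the target sides of the memory entries, which is where the word-alignment problem discussed below comes in.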



Generalization
Given:
  – John Hancock was in Philadelphia on July 4th.
  – John Hancock war am 4. Juli in Philadelphia.
We would like to generalize to:
  – PERSON was in CITY on DATE.
  – PERSON war am DATE in CITY.
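The generalization step amounts to substituting class tokens for known entities on both sides of the pair. A minimal sketch, assuming the entity lists are given (in practice they come from gazetteers or named-entity recognizers):

```python
# Assumed entity lists mapping surface strings to class tokens
entities = {
    "John Hancock": "PERSON",
    "Philadelphia": "CITY",
    "July 4th": "DATE",
    "4. Juli": "DATE",
}

def generalize(sentence):
    """Replace each known entity with its class token."""
    for surface, tag in entities.items():
        sentence = sentence.replace(surface, tag)
    return sentence

src = generalize("John Hancock was in Philadelphia on July 4th.")
tgt = generalize("John Hancock war am 4. Juli in Philadelphia.")
print(src)  # PERSON was in CITY on DATE.
print(tgt)  # PERSON war am DATE in CITY.
```

One stored template now covers any person/city/date combination, which is what lets a modest parallel corpus generalize beyond the sentences it literally contains.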


Word alignment
• Different numbers of words in source and target
• Not necessarily in the same order
• Need to find the best alignment optimally
• Take word frequencies into account


EBMT use
• Sometimes the synthesis is not perfect
• Requires only parallel corpora:
  – doesn’t require skilled people to build
• Good for rapid development
• Typical applications:
  – speech-to-speech translation
  – cross-lingual IR


EBMT vs Statistical MT
• Different histories:
  – IBM vs. Japan
• EBMT is non-statistical:
  – not a generative model
• EBMT (probably) can work with less data
• Both are data-driven approaches
• Both require parallel corpora


Multi-engine MT
• Different systems have different strengths:
  – can we take advantage of that?
• Multi-engine MT:
  – use a number of different engines: EBMT, bilingual dictionary, KBMT
  – devise a weighting to select the best output
• Weighting measures:
  – in-domain for KBMT (weight KBMT more)
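The weighted selection can be sketched in a few lines. The engine weights and hypothesis scores below are illustrative assumptions (e.g. KBMT boosted as the in-domain engine), not values from a real system.

```python
# Per-engine weights, e.g. boosting KBMT when the input is in-domain
engine_weight = {"KBMT": 1.5, "EBMT": 1.0, "dictionary": 0.5}

# (engine, translation, engine-internal confidence) — illustrative
hypotheses = [
    ("KBMT", "where is the conference center", 0.6),
    ("EBMT", "where is conference center", 0.7),
    ("dictionary", "where be the conference middle", 0.9),
]

def select(hypotheses):
    """Pick the hypothesis with the highest weight * confidence."""
    return max(hypotheses, key=lambda h: engine_weight[h[0]] * h[2])

engine, text, score = select(hypotheses)
print(engine, text)
```

Note the dictionary engine’s high raw confidence is discounted by its low weight, so the in-domain KBMT output wins; tuning these weights is the hard part in practice.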


Why does multi-engine MT work?
• Large space of possible translations:
  – different engines cover different parts of the mapping space
• Statistical engines:
  – capture different parts of the space
• Automatic weighting:
  – can boost overall coverage
(This technique is also used in speech recognition.)


Speech-to-Speech Translation
• Puts together three difficult tasks:
  – recognition, translation, and synthesis
  – (and dialog?)
• Expects them to work better together than the parts do alone


Speech vs Text Translation
• Speech is:
  – ungrammatical
  – badly spoken
  – more immediate
• But speech translation:
  – is part of a dialog
  – doesn’t need to be perfect


CSTAR Consortium
Joint effort with 16 other sites worldwide
• Speech translation in the tourism information domain
• “Can you tell me the way to the conference center?”
  – “Kaigi sentaa no hou ga oshiete kudasaimasen ga”
• Includes English, German, Italian, Korean, Japanese, ...
• Interlingua based:
  – each side provides recognition/analysis and generation/synthesis in its own language
• Internet-based communication


Babylon (2002-2004)
Rapid deployment, small footprint
• DARPA program:
  – CMU, IBM, SRI, BBN, HRL
  – “competing systems”
• CMU, Cepstral, MTI and Mobile:
  – two-way
  – new language (Egyptian Arabic)
  – interlingua based
  – medical interviews, refugee processing
• Speechalator:
  – consumer iPAQ
  – 2x ASR, 2x TTS
  – analysis built into ASR
  – statistical generation

Other S2S MT systems
(non-CMU)
• Verbmobil (1992–2000):
  – German-government funded
  – large consortium (14 universities plus companies)
  – German/English (Japanese)
  – meeting scheduling
• ATR (1987–), Japan:
  – English/Japanese
  – transfer approach
  – conference registration


Handheld Translators
• Phraselator:
  – one-way
  – recognizes around 500 phrases
  – plays recorded translations
• Ectaco (commercial):
  – English phrases to French/German/Spanish


Summary
• Statistical Machine Translation:
  – can be good, but needs lots of data
• Example-based Machine Translation:
  – needs data, but not as much as SMT
• Multi-engine Machine Translation:
  – can get benefits from all types of system
• Speech-to-Speech Translation:
  – has to combine multiple erroneous processes
  – but has a human in the loop to aid understanding


