Stream Decoding for Simultaneous Translation

					Preprocessing in Statistical Machine Translation

                   May 19, 2009

•   Bilingual Corpora in SMT

     properties and types of corpora

     sentence alignment

•   Preprocessing training corpus and translation input

     basic preprocessing steps
       •   tokenization
       •   casing
       •   abbreviations, numbers, dates, ...

     named entities


     preprocessing for advanced SMT
       •   POS tagging
       •   compound splitting
       •   morphological analysis
SMT Architecture

[Figure: SMT architecture. Preprocessing is applied to the source text and to the text corpora (bilingual corpus, monolingual corpus).]
Text Corpora in SMT

•    Idea of SMT

      „learn“ how to translate by analyzing huge amounts of sample translations

      core of SMT system: training corpus of translated texts

•    Bilingual corpora

      collection of bilingual data: documents, texts, transcriptions of speech

      different types
         • different domains / topics (politics, economics, literature, …)

         • modality
                         – written (grammatical)
                         – spoken (less grammatical, incomplete sentences, filler words, stuttering)
         • styles
                         – formal (books, papers, law texts, business letters, lectures)
                         – informal (e-mail, chat, text messages)

     the type of training corpus determines the type of data the SMT system is able to translate best
Domains in MT systems

•   Most commercial MT systems

    developed for or adapted to a particular domain or task

    using customer data (translated manuals, business letters, …)

•   MT Research

    focus on open-domain systems or covering large domains, e.g. news

    special topics: domain detection, automatic domain adaptation
Parallel Data vs. Comparable Data

•   Parallel Data

     texts are human translations

     typically sentence-by-sentence

     amount of available data restricted
       • human translations expensive

       • proprietary, copyright-restricted -> translations of books owned by publishing companies

       • confidential -> business letters, …

•   Comparable Data

     bilingual texts tell the same story

     e.g. newspaper reports about the same event in different newspapers, different languages

     text elements not necessarily in the same order, missing parts
Parallel Data for SMT

•   Available Data for SMT

    Official data
      • EU documents are translated into all European languages
                     – European Parliamentary Speeches -> Europarl corpus
                     – European Laws -> Acquis Communautaire
      • UN data available in multiple languages

    Research Associations collect and provide data
      • Linguistic Data Consortium, European Language Resources Association

      • sentence aligned, word aligned, syntactically/semantically annotated

    size of available parallel corpora:
      • several hundred million words e.g. for Chinese, Arabic, French, English

      • Europarl corpus: 30-40 million words, depending on language
Acquiring Data for SMT

•   Web crawling

    BBC publishes news in 32 languages

    bilingual web sites in bilingual countries


-> typically comparable corpora that need extensive cleaning

-> need to identify corresponding texts
Aligning Bilingual Texts

•   depending on the origin of the bilingual corpus, certain preprocessing steps are required:

     Extract text from html or pdf documents

     Document alignment

     Sentence segmentation

     Sentence alignment

•   Document alignment

     identify matching documents
      • e.g. corresponding html pages in different languages

     identify matching paragraphs
      • e.g. corresponding news stories on bilingual websites
Sentence Segmentation

•   where to set sentence boundaries

    trigger sentence segmentation at punctuation marks

    full stop (.), exclamation (!) and question mark (?)

    possibly at semicolon (;) and colon (:) ?

    prevent segmentation after abbreviations
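
The segmentation rules above can be sketched in a few lines of Python. This is a minimal illustration, not a production segmenter: the abbreviation list is a hypothetical stand-in for a much larger language-specific list, and boundaries are only triggered at ".", "!" and "?" followed by whitespace.

```python
import re

# Hypothetical abbreviation list; a real system would use a much larger,
# language-specific one.
ABBREVIATIONS = {"e.g.", "i.e.", "dr.", "mr.", "mrs.", "prof.", "etc.", "z.b."}

def split_sentences(text):
    """Split at '.', '!' or '?' followed by whitespace, except after abbreviations."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]+(?=\s)", text):
        end = match.end()
        # Last whitespace-delimited token before the candidate boundary.
        last_token = text[start:end].split()[-1].lower()
        if last_token in ABBREVIATIONS:
            continue  # abbreviation: do not segment here
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)  # remaining text after the last boundary
    return sentences

if __name__ == "__main__":
    for s in split_sentences("Dr. Smith arrived late. Was the train delayed? Yes, e.g. by snow!"):
        print(s)
```

Semicolons and colons are deliberately not treated as boundaries here; whether they should be is the open question raised on the slide.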
Sentence Alignment

•   text rarely translated word by word

•   sometimes not even sentence by sentence

     long sentences might be split up, short ones merged

•   not straightforward to identify corresponding sentences in a parallel corpus
Sentence Alignment Task

•   given sentences $f_1, \dots, f_{n_f}$ in the foreign language and sentences $e_1, \dots, e_{n_e}$ in English

•   sentence alignment S = list of sentence pairs $s_1, \dots, s_n$

•   each sentence pair $s_i$ is a pair of sets:
     $s_i = (\{f_{\mathrm{start}_f(i)}, \dots, f_{\mathrm{end}_f(i)}\}, \{e_{\mathrm{start}_e(i)}, \dots, e_{\mathrm{end}_e(i)}\})$

•   restrictions
     sentences translated in sequence
        •   start_f(i) = end_f(i-1)+1
        •   start_e(i) = end_e(i-1)+1

     start_f(1) = 1

     start_e(1) = 1

     end_f(n) = n_f

     end_e(n) = n_e

     start_f(i) <= end_f(i)

     start_e(i) <= end_e(i)
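
As a small illustration, the restrictions listed above can be checked mechanically. The sketch below assumes a representation chosen only for this example: each sentence pair is given as ((start_f, end_f), (start_e, end_e)) with 1-based, inclusive sentence indices. The function name is hypothetical.

```python
def is_valid_alignment(pairs, n_f, n_e):
    """Check the restrictions listed above for a list of sentence pairs.

    Each pair is ((start_f, end_f), (start_e, end_e)) with 1-based,
    inclusive sentence indices (a representation chosen for this sketch).
    """
    if not pairs:
        return False
    prev_end_f, prev_end_e = 0, 0
    for (start_f, end_f), (start_e, end_e) in pairs:
        # Sentences are translated in sequence: each span starts right
        # after the previous one ends (and the first span starts at 1).
        if start_f != prev_end_f + 1 or start_e != prev_end_e + 1:
            return False
        # Spans must not run backwards.
        if start_f > end_f or start_e > end_e:
            return False
        prev_end_f, prev_end_e = end_f, end_e
    # The last pair must end at the last sentence on both sides.
    return prev_end_f == n_f and prev_end_e == n_e

if __name__ == "__main__":
    # Three foreign sentences, two English ones: a 1-1 pair followed by a 2-1 pair.
    print(is_valid_alignment([((1, 1), (1, 1)), ((2, 3), (2, 2))], n_f=3, n_e=2))
```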
Alignment Strategy

•   different alignment types

     number of sentences in each set within a sentence pair
      • 1-1 (substitution),

      • 1-0 (deletion), 0-1 (insertion),

      • 2-1 (contraction), 1-2 (expansion),

      • 2-2 (merger)

•   Requirements for a full sentence alignment of a corpus

     all sentences need to be covered

     each sentence can only be part of one sentence pair
Alignment Strategy

•   Search for sentence alignment S = {s1, …, sn}

        fulfilling the requirements of coverage and uniqueness

        optimize the applied measure of matching quality of all its sentence pairs


•   Search the possible space of sentence alignments for highest scoring one

        dynamic programming
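
A minimal sketch of such a dynamic programming search is shown below. The cost function is an assumption made for illustration (a per-type penalty plus the relative character-length mismatch); it is not the matching measure of any particular published aligner, and the penalty values are invented. Lower cost means a better match.

```python
# Alignment types as (number of foreign sentences, number of English sentences),
# with illustrative penalties (assumed values, not estimated from data).
TYPE_PENALTY = {(1, 1): 0.0, (1, 0): 5.0, (0, 1): 5.0,
                (2, 1): 2.0, (1, 2): 2.0, (2, 2): 4.0}

def pair_cost(f_group, e_group, kind):
    """Toy matching cost: type penalty plus relative character-length mismatch."""
    len_f = sum(len(s) for s in f_group)
    len_e = sum(len(s) for s in e_group)
    return TYPE_PENALTY[kind] + abs(len_f - len_e) / max(len_f, len_e, 1)

def align(f_sents, e_sents):
    """Return the lowest-cost list of (f_group, e_group) sentence pairs."""
    n_f, n_e = len(f_sents), len(e_sents)
    inf = float("inf")
    # best[i][j]: minimal cost of aligning the first i foreign and j English sentences.
    best = [[inf] * (n_e + 1) for _ in range(n_f + 1)]
    back = [[None] * (n_e + 1) for _ in range(n_f + 1)]
    best[0][0] = 0.0
    for i in range(n_f + 1):
        for j in range(n_e + 1):
            if best[i][j] == inf:
                continue
            # Extend the partial alignment by one sentence pair of each type.
            for (df, de) in TYPE_PENALTY:
                ni, nj = i + df, j + de
                if ni > n_f or nj > n_e:
                    continue
                cost = best[i][j] + pair_cost(f_sents[i:ni], e_sents[j:nj], (df, de))
                if cost < best[ni][nj]:
                    best[ni][nj] = cost
                    back[ni][nj] = (i, j)
    # Follow back-pointers from the end to recover the alignment.
    pairs, i, j = [], n_f, n_e
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((f_sents[pi:i], e_sents[pj:j]))
        i, j = pi, pj
    return pairs[::-1]

if __name__ == "__main__":
    f = ["Das ist ein Haus.", "Es ist alt.", "Es steht am See."]
    e = ["This is a house.", "It is old and stands by the lake."]
    for f_group, e_group in align(f, e):
        print(f_group, "<->", e_group)
```

Because every step consumes at least one sentence and the table covers all prefixes, the result satisfies coverage and uniqueness by construction.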

Popular Sentence Alignment Algorithm

•     Gale and Church (1993)

•     2 components for the match function

       probability distribution for alignment types: Prob( match )

       distance measure considering the number of letters in each of the sentences:

       $\delta = (l_2 - l_1 c) / \sqrt{l_1 s^2}$

       where $l_1$ and $l_2$ are the lengths of the two sentences in characters, c is the number of expected characters in L2 per character in L1 and $s^2$ the variance

     match function estimated by Prob( match | δ ) ∝ Prob( δ | match ) Prob( match )

       Prob( δ | match ) = 2 (1 − Prob( |δ| )), where Prob( |δ| ) is computed by integrating a standard normal distribution

•     Other algorithms consider cognates or other lexical clues, e.g. using bilingual dictionaries
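
The length-based distance measure can be computed directly. The sketch below is an illustration under stated assumptions: the default values c = 1 and s² = 6.8 are the ones commonly quoted from Gale and Church (1993) and should be verified or re-estimated for any other language pair, the function names are hypothetical, and $l_1 > 0$ is assumed.

```python
import math

# Defaults commonly quoted from Gale and Church (1993); treat them as
# assumptions and re-estimate them for other language pairs.
C = 1.0    # expected characters in L2 per character in L1
S2 = 6.8   # variance

def delta(l1, l2, c=C, s2=S2):
    """Normalized length difference; assumes l1 > 0."""
    return (l2 - l1 * c) / math.sqrt(l1 * s2)

def prob_delta_given_match(l1, l2, c=C, s2=S2):
    """2 * (1 - Phi(|delta|)), with Phi the standard normal CDF."""
    d = abs(delta(l1, l2, c, s2))
    phi = 0.5 * (1.0 + math.erf(d / math.sqrt(2.0)))
    return 2.0 * (1.0 - phi)

if __name__ == "__main__":
    # Similar lengths give a high probability, diverging lengths a low one.
    for l2 in (100, 110, 140):
        print(l2, round(prob_delta_given_match(100, l2), 4))
```

In a full aligner this value would be multiplied by the prior Prob( match ) of the alignment type and, typically as a negative logarithm, used as the cost inside the dynamic programming search.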
Sentence Alignment Output

•     sentence aligned parallel corpus:

       1 line = 1 "sentence"

English file:
line 1   The skyward zoom in food prices is the dominant force behind the speed up in eurozone inflation \n
line 2   Official forecasts predicted just 3 percent, Bloomberg said. \n
line x   His performance is delightfully tongue-in-cheek. Essentially, his war reporter Simon Hunt is who Gere could have ended up as, had fate and the film industry not been so kind to him: A man whose heyday is long past, but who has preserved considerable shreds of his former charm even as a ruin-like monument. \n

German file:
line 1   Hauptgrund für den in der Eurozone gemessenen Anstieg der Inflation seien die rasant steigenden Lebensmittelpreise. \n
line 2   Offizielle Prognosen sind von nur 3 Prozent ausgegangen, meldete Bloomberg. \n
line x   Er liefert eine wunderbar augenzwinkernde Darstellung: Im Grunde genommen ist sein Kriegsreporter Simon Hunt das, was aus Gere hätte werden können, wenn das Schicksal und die Filmbranche nicht so gnädig wären: Ein Mann, der seine allerbesten Zeiten lange hinter sich hat, der aber selbst als ruinengleiches Denkmal seines Niedergang noch beträchtliche Reste des einstigen Charmes bewahrt hat. \n

•   Text format

     plain text

     xml annotated text


•   Filter sentences

     remove sentence pairs containing empty or long sentences

     remove sentence pairs with unreasonable ratio of words

•   Cleaning the corpus

     normalize quotes, clean spaces

     change special characters
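
The filtering and cleaning steps above might look as follows in Python. This is a sketch under assumptions: the thresholds `max_len` and `max_ratio` are illustrative values, not taken from the slides, and the quote table covers only a handful of characters.

```python
# Map typographic quotes to plain ASCII quotes (a small, illustrative table).
QUOTE_MAP = str.maketrans({"„": '"', "“": '"', "”": '"', "‘": "'", "’": "'"})

def clean_line(line):
    """Normalize quotes and collapse runs of whitespace."""
    return " ".join(line.translate(QUOTE_MAP).split())

def keep_pair(src, tgt, max_len=100, max_ratio=3.0):
    """Return True if the sentence pair passes the length and ratio filters.

    max_len and max_ratio are illustrative thresholds, not values from the slides.
    """
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if n_src == 0 or n_tgt == 0:
        return False            # empty sentence
    if n_src > max_len or n_tgt > max_len:
        return False            # overly long sentence
    ratio = max(n_src, n_tgt) / min(n_src, n_tgt)
    return ratio <= max_ratio   # reject an unreasonable ratio of words

def filter_corpus(pairs, **kwargs):
    """Clean both sides of each pair, then drop pairs that fail the filters."""
    cleaned = [(clean_line(s), clean_line(t)) for s, t in pairs]
    return [(s, t) for s, t in cleaned if keep_pair(s, t, **kwargs)]

if __name__ == "__main__":
    corpus = [("Das   ist gut .", "That is good ."), ("", "Orphan line ."),
              ("Ja .", "Yes , that is absolutely and completely true .")]
    print(filter_corpus(corpus))
```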

•    Tokenization

      segment text (= sequence of characters) into words
        • especially challenging for Asian languages -> no spaces to indicate word boundaries

      separate punctuation from words

      The Social Democratic Party’s fraction at the Bürgerschaft, Hamburg’s
      parliament, accused the senate of having wasted precious time.

      -> The Social Democratic Party ’s fraction at the Bürgerschaft , Hamburg ’s
      parliament , accused the senate of having wasted precious time .
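
A regex-based sketch reproduces the tokenization of the example above. It is a simplification: it handles only the possessive clitic and a fixed set of punctuation marks, and it would wrongly split abbreviations and decimal numbers, which a real tokenizer protects.

```python
import re

def tokenize(text):
    """Separate punctuation and the possessive clitic ’s / 's from words."""
    # Split off possessive 's (typographic or straight apostrophe).
    text = re.sub(r"(\w)([’']s)\b", r"\1 \2", text)
    # Put spaces around punctuation marks so they become separate tokens.
    text = re.sub(r"([.,!?;:()\"])", r" \1 ", text)
    return text.split()

if __name__ == "__main__":
    sentence = ("The Social Democratic Party’s fraction at the Bürgerschaft, "
                "Hamburg’s parliament, accused the senate of having wasted "
                "precious time.")
    print(" ".join(tokenize(sentence)))
```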

•   Dealing with lower and upper case in SMT

    normal text obeys the rules of capitalization

    train translation model on normalized text
      • benefit from generalization of the vocabulary

      • but also introduces ambiguities: May (month) <-> may

    convert training corpus into lowercase / smartcase

    train recasing model on original and lowercased / smartcased text
    -> recreate true case for translated text in postprocessing step
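
To make the lowercase-then-recase idea concrete, here is a deliberately simple sketch. It is an assumption of this example that the recasing model is a unigram table (the most frequent surface form of each lowercased token); recasers used in practice typically also consider context. The function names are hypothetical.

```python
from collections import Counter, defaultdict

def train_recaser(original_sentences):
    """Learn the most frequent surface form for each lowercased token."""
    counts = defaultdict(Counter)
    for sentence in original_sentences:
        # Skip the sentence-initial token: its capital letter is positional.
        for token in sentence.split()[1:]:
            counts[token.lower()][token] += 1
    return {low: forms.most_common(1)[0][0] for low, forms in counts.items()}

def recase(lowercased_sentence, model):
    """Restore case token by token, then capitalize the first token."""
    tokens = [model.get(tok, tok) for tok in lowercased_sentence.split()]
    if tokens:
        tokens[0] = tokens[0][:1].upper() + tokens[0][1:]
    return " ".join(tokens)

if __name__ == "__main__":
    corpus = ["We met in May in Hamburg .", "You may visit Hamburg in May ."]
    model = train_recaser(corpus)
    print(recase("we met in hamburg .", model))
    # The May/may ambiguity shows up here: the unigram model always picks "May".
    print(recase("you may visit hamburg .", model))
```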

•   Preprocess patterns

    expand contractions and abbreviations
      • he’ll -> he will, it’s -> it is, z.B. -> zum Beispiel, …

    normalize dates and numbers
      • 4.12.01, 04.12.2001, 4. Dez. 2001, 4. Dez. 01 -> 4. Dezember 2001

-> normalization leads to generalization and therefore better coverage
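
The pattern normalization above can be sketched with regular expressions. Two assumptions are made loudly here: two-digit years are taken to belong to the 2000s, and contractions are expanded by plain string replacement from a tiny hypothetical table (which ignores ambiguities such as it's = "it has").

```python
import re

CONTRACTIONS = {"he'll": "he will", "it's": "it is", "z.B.": "zum Beispiel"}
MONTHS = ["Januar", "Februar", "März", "April", "Mai", "Juni", "Juli",
          "August", "September", "Oktober", "November", "Dezember"]
# Three-letter abbreviations such as "Dez." mapped to month numbers.
MONTH_ABBR = {m[:3] + ".": i + 1 for i, m in enumerate(MONTHS)}

NUMERIC = re.compile(r"\b(\d{1,2})\.(\d{1,2})\.(\d{2}|\d{4})\b")
ABBREV = re.compile(r"\b(\d{1,2})\. (\w{3}\.) (\d{2}|\d{4})\b")

def _year(y):
    # Assumption for this sketch: two-digit years belong to the 2000s.
    return y if len(y) == 4 else "20" + y

def normalize_dates(text):
    """Rewrite supported German date formats as 'D. Monthname YYYY'."""
    def numeric(m):
        day, month = int(m.group(1)), int(m.group(2))
        if not 1 <= month <= 12:
            return m.group(0)  # not a plausible date, leave unchanged
        return f"{day}. {MONTHS[month - 1]} {_year(m.group(3))}"

    def abbrev(m):
        month = MONTH_ABBR.get(m.group(2))
        if month is None:
            return m.group(0)  # unknown abbreviation, leave unchanged
        return f"{int(m.group(1))}. {MONTHS[month - 1]} {_year(m.group(3))}"

    return ABBREV.sub(abbrev, NUMERIC.sub(numeric, text))

def expand(text):
    """Expand contractions and abbreviations by simple string replacement."""
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    return text

if __name__ == "__main__":
    for raw in ("4.12.01", "04.12.2001", "4. Dez. 2001", "4. Dez. 01"):
        print(raw, "->", normalize_dates(raw))
    print(expand("he'll come, z.B. tomorrow"))
```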
Named Entities (NE)

•   specific instance of an object class which is referred to by its name

•   Types

     personal names, organizations, locations, temporal phrases, monetary expressions

     William Bell, Spice Girls, United Nations, München <-> Munich, the year 2001, …

•   Problems

     most NEs are out-of-vocabulary words

     should not be translated even if (partly) possible

     unlimited amount -> new named entities each day

-> Named Entities need to be recognized as such and treated separately
Named Entity Recognition (NER)

•   identify named entities in a running text

•   Indications


     patterns: am 3. Januar, 12.5.2009

•   Techniques

     grammar based:
       • define language-dependent rules, patterns

     statistical:
       • train a model from NE-annotated corpora
Named Entities in Translation

•   How to translate Named Entities?

     do not translate
       •   George Bush != George Strauch

     translate partially / change format
       •   e.g. rule-based pattern substitution of time and money expressions
                           – am 3. Januar <-> on January 3rd
                           – $ 300 <-> 300 Dollar
                           – 12.5.2009 <-> 5/12/2009
                           – 4pm <-> 16.00 Uhr

     fixed equivalence
       •   München – Munich
       •   United Nations – Vereinte Nationen

     transcribe (between letter-based scripts)
       •   Russian Антон Чехов – German Anton Tschechow – English Anton Chekhov – scholarly transcription Anton Čechov

     transliterate (between letter-based and character-based scripts)
       •   Microsoft –
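
The rule-based pattern substitution of time and money expressions listed above can be illustrated for the German-to-English direction. This is a sketch covering only the four example patterns; the function names are hypothetical and real systems need many more rules and disambiguation.

```python
import re

MONTHS_DE_EN = {
    "Januar": "January", "Februar": "February", "März": "March",
    "April": "April", "Mai": "May", "Juni": "June", "Juli": "July",
    "August": "August", "September": "September", "Oktober": "October",
    "November": "November", "Dezember": "December",
}

def ordinal(n):
    """English ordinal: 1 -> 1st, 2 -> 2nd, 3 -> 3rd, 11 -> 11th."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

def translate_patterns(text):
    """Apply German -> English substitutions for dates, times and money."""
    # 12.5.2009 -> 5/12/2009 (day.month.year -> month/day/year)
    text = re.sub(r"\b(\d{1,2})\.(\d{1,2})\.(\d{4})\b", r"\2/\1/\3", text)

    # am 3. Januar -> on January 3rd
    months = "|".join(MONTHS_DE_EN)
    def day_month(m):
        return f"on {MONTHS_DE_EN[m.group(2)]} {ordinal(int(m.group(1)))}"
    text = re.sub(rf"\bam (\d{{1,2}})\. ({months})\b", day_month, text)

    # 16.00 Uhr -> 4pm (24-hour clock -> 12-hour clock)
    def clock(m):
        hour, minute = int(m.group(1)), m.group(2)
        suffix = "am" if hour < 12 else "pm"
        hour12 = hour % 12 or 12
        return f"{hour12}{suffix}" if minute == "00" else f"{hour12}:{minute}{suffix}"
    text = re.sub(r"\b(\d{1,2})\.(\d{2}) Uhr\b", clock, text)

    # 300 Dollar -> $ 300
    text = re.sub(r"\b(\d+) Dollar\b", r"$ \1", text)
    return text

if __name__ == "__main__":
    for example in ("am 3. Januar", "300 Dollar", "12.5.2009", "16.00 Uhr"):
        print(example, "->", translate_patterns(example))
```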

•   Hindi:

     translation: hello

     transliteration: namaste

•   method

     "sounding out"

     transform a word in the source language script into a phonetically identical word in the target language script

     mapping of source script characters to target script characters that are pronounced similarly

     -> no 1:1 mapping due to different phoneme spaces of different languages

     source of images and Hindi text: Kellner, 2007

•   Forward Transliteration

     word does not exist in target language

     alternative phonetically identical transliterations possible and usually ok

     e.g. spelling of Indian name: Mina, Minaa, Meena, Meenaa, …

•   Backward Transliteration

     word exists in target language

     only one form correct

     e.g. city names: London, *Lundan, *Lundon
Machine Transliteration Techniques

•   3 basic strategies

     Manually compiled rules / tables
       • language dependent

       • must be exhaustive

     Phonetic-based Model
       • Source word -> source segments -> source phonemes -> target phonemes -> target segments -> target word
       • methods based on hand-crafted rules or machine learning

       • requires language-specific linguistic knowledge or data about segmentation and pronunciation

     Direct Orthographic Model
       • direct mapping: source segments -> target segments

•   New approach: Direct Orthographic Modeling with automated segments
       • consider all segmentation possibilities -> choose best one based on probabilities

       • determine segments and probabilities during iterative training based on word-transliteration pairs
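
The idea of considering all segmentations and choosing the most probable one can be sketched with a small Viterbi-style search. Everything in the segment table below is invented for illustration: in the approach described above, segments and probabilities would be learned during iterative training, and a source segment would usually have several candidate targets rather than one.

```python
import math

# Toy table: source segment -> (target segment, probability).
# Entries and probabilities are invented for illustration only.
SEGMENT_TABLE = {
    "tsch": ("ch", 0.9), "ch": ("kh", 0.7),
    "t": ("t", 0.8), "s": ("s", 0.8), "c": ("k", 0.5), "h": ("h", 0.8),
    "e": ("e", 0.9), "o": ("o", 0.9), "w": ("v", 0.8),
}

def transliterate(word, table=SEGMENT_TABLE, max_seg_len=4):
    """Search all segmentations of `word`; return the most probable output."""
    n = len(word)
    # best[i] = (log-probability, output) of the best analysis of word[:i]
    best = [(-math.inf, "")] * (n + 1)
    best[0] = (0.0, "")
    for i in range(n):
        score, output = best[i]
        if score == -math.inf:
            continue  # prefix word[:i] cannot be segmented
        for j in range(i + 1, min(n, i + max_seg_len) + 1):
            entry = table.get(word[i:j])
            if entry is None:
                continue
            target, prob = entry
            candidate = (score + math.log(prob), output + target)
            if candidate[0] > best[j][0]:
                best[j] = candidate
    return best[n][1] if best[n][0] > -math.inf else None

if __name__ == "__main__":
    # German spelling to English spelling of the name used earlier.
    print(transliterate("tschechow"))
```

With this table, the segmentation tsch|e|ch|o|w outscores the letter-by-letter alternative, because longer segments carry higher probabilities.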
Preprocessing for Advanced SMT Methods

•   Morphological preprocessing

    compound splitting
      • splitting compound words into their constituent words facilitates translation
      • universitätsgebäude -> universität gebäude -> university building

    stemming / lemmatization (of morphologically rich languages)
      • car, cars, car's, cars' -> car

      • strip functional morphemes at word endings -> stem
                      – geh-st -> geh
      • use dictionary and morphological inflection rules to derive base form -> lemma
                      – geh-st -> gehen
      • use lemmatized word form for training

      -> less variability, more generalization, increased lexical coverage
      • recreate morphological information after translation
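
A minimal vocabulary-based compound splitter illustrates the first step above. The assumptions are stated plainly: only binary splits are tried, a linking "s" after the first part is the only filler handled, and the vocabulary is a hypothetical set of known words. Practical splitters usually also rank candidate splits, for example by corpus frequency.

```python
def split_compound(word, vocabulary, min_len=4):
    """Split `word` into two known words if possible, else return it unchanged."""
    word = word.lower()
    if word in vocabulary:
        return [word]  # known word: no need to split
    for i in range(min_len, len(word) - min_len + 1):
        left, right = word[:i], word[i:]
        # Allow a linking 's' after the first part, as in universität-s-gebäude.
        candidates = [left]
        if left.endswith("s"):
            candidates.append(left[:-1])
        for cand in candidates:
            if cand in vocabulary and right in vocabulary:
                return [cand, right]
    return [word]  # no split found

if __name__ == "__main__":
    vocab = {"universität", "gebäude", "haus"}
    print(split_compound("universitätsgebäude", vocab))
```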
Preprocessing for Advanced SMT Methods

•   POS tagging

    assign POS tags to each word
      • for instance , the government introduced statewide comparative tests .

    make use of probability of certain POS sequences, e.g.
      • apply a POS language model

      • use POS as factors in factorized translation
