					Example-based Machine Translation
    The other corpus-based approach to MT



Example-based Machine Translation
The other corpus-based approach to MT
  Historically predates SMT (just about)
  At first seen as a rival approach
  Now almost marginalised …
  … despite (because of?) some convergence
In this talk I will
   Explain basic ideas and problems
   Point to differences and similarities between
   EBMT and SMT



Example-based MT
 Long-established approach to empirical MT
 First developed in contrast with rule-based MT
 Idea of translation by analogy (Nagao 1984)
   Translate by adapting previously seen examples
   rather than by linguistic rule
   “Existing translations contain more solutions to
   more translation problems than any other available
   resource.” (P. Isabelle et al., TMI, Kyoto, 1993)
   In computational terms, belongs to the family of
   case-based reasoning approaches



EBMT basic idea
 database of translation pairs
 match input against example database
  (like Translation Memory)
 identify corresponding translation
 fragments (align)
 recombine fragments into target text




   Example (Sato & Nagao 1990)
Input
  He buys a book on international politics
Matches
  He buys a notebook.
  Kare wa nōto o kau.
  I read a book on international politics.
  Watashi wa kokusai seiji nitsuite kakareta hon o yomu.
Result
  Kare wa kokusai seiji nitsuite kakareta hon o kau.
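
The substitution in this example can be sketched in a few lines of Python. This is only a toy illustration: the fragment alignment (nōto for "a notebook", kokusai seiji nitsuite kakareta hon for "a book on international politics") is supplied by hand here, whereas a real EBMT system has to discover it, and the function name `translate_by_substitution` is hypothetical.

```python
def translate_by_substitution(example_tgt, old_fragment, new_fragment):
    """Reuse an example translation, swapping one aligned fragment for another."""
    if old_fragment not in example_tgt:
        raise ValueError("fragment not found in example translation")
    return example_tgt.replace(old_fragment, new_fragment, 1)

# Match 1 supplies the sentence frame ("He buys a notebook.").
frame_tgt = "Kare wa nōto o kau."
# Match 2 supplies the fragment for "a book on international politics"
# (hand-aligned for this sketch).
fragment_tgt = "kokusai seiji nitsuite kakareta hon"

result = translate_by_substitution(frame_tgt, "nōto", fragment_tgt)
print(result)
```

The hard parts that the sketch skips are exactly the ones the following slides discuss: finding the matches, aligning the fragments, and recombining them cleanly.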



A bit less hand-waving
 Simple example hides some problems, but first
 notice already some differences with SMT
    If the input already appeared in the bitext, system is
    guaranteed to produce an exact (correct) translation
    (assuming no contradictory examples)
    If the input is only slightly different from the example,
    there’s a pretty good chance that the translation will
    be OK
       These are both properties of Translation Memories
    In its purest form, there is no preprocessing of the
    corpus in EBMT: everything is done at run time



Matching the input
 In principle, the simplest part of the process:
    Levenshtein distance for simple string match
    Can be enhanced by annotating the examples with
    linguistic knowledge (POS tags, semantic info,
    structural representations) to improve accuracy and
    flexibility
 Some approaches suggest generalizing
 example pairs
    You end up with something that looks like RBMT transfer
    rules
    Example generalization is done off-line
       Using “rules” that express linguistic knowledge
       Or more automatically by merging similar examples
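
A minimal sketch of the simple string-match case, assuming word-level Levenshtein distance and a plain list of source-side examples. The helper names `edit_distance` and `closest_example` are illustrative, not taken from any particular EBMT system.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (here: lists of words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def closest_example(input_sentence, examples):
    """Return the example whose word sequence is nearest to the input."""
    tokens = input_sentence.lower().split()
    return min(examples, key=lambda ex: edit_distance(tokens, ex.lower().split()))

examples = [
    "He buys a notebook .",
    "I read a book on international politics .",
]
best = closest_example("He buys a book on international politics .", examples)
print(best)
```

Note that a pure distance measure picks only one nearest example; the Sato & Nagao example needs fragments from both, which is why matching is usually combined with the alignment step.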



Generalization using knowledge
  John Miller flew to Frankfurt on December 3rd.
  John Miller ist am 3. Dezember nach Frankfurt geflogen.

  <1stname> <lastname> flew to <city> on <month> <ord>.
  <1stname> <lastname> ist am <num>. <month> nach <city> geflogen.

  <person-m> flew to <city> on <date> .
  <person-m> ist am <date> nach <city> geflogen.

 Dr Howard Johnson flew to Ithaca on 7 April 1997.
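
One way to picture how the generalized pair applies to the new input is to treat the source template as a pattern with named slots. The sketch below is an assumption-laden illustration: the slot recognisers in `SLOT_PATTERNS` are crude hypothetical regular expressions, the templates are written without the space before the full stop, and slot fillers are copied across unchanged, so the date is not localised (German would write "7. April 1997").

```python
import re

# Hypothetical slot recognisers; a real system would use proper
# named-entity and date recognition instead.
SLOT_PATTERNS = {
    "person-m": r"(?:Dr |Mr )?[A-Z][a-z]+ [A-Z][a-z]+",
    "city": r"[A-Z][a-z]+",
    "date": r"\d{1,2} [A-Z][a-z]+ \d{4}",
}

def template_to_regex(template):
    """Turn 'text <slot> text' into a regex with one named group per slot."""
    parts = re.split(r"<([\w-]+)>", template)
    pieces = []
    for i, part in enumerate(parts):
        if i % 2 == 0:                      # literal text between slots
            pieces.append(re.escape(part))
        else:                               # slot name
            group = part.replace("-", "_")
            pieces.append("(?P<%s>%s)" % (group, SLOT_PATTERNS[part]))
    return re.compile("".join(pieces))

def fill(template, slots):
    """Insert the captured slot fillers into the target template."""
    return re.sub(r"<([\w-]+)>",
                  lambda m: slots[m.group(1).replace("-", "_")],
                  template)

def translate(sentence, src_template, tgt_template):
    match = template_to_regex(src_template).fullmatch(sentence)
    if match is None:
        return None                         # template does not apply
    return fill(tgt_template, match.groupdict())

SRC = "<person-m> flew to <city> on <date>."
TGT = "<person-m> ist am <date> nach <city> geflogen."
translation = translate("Dr Howard Johnson flew to Ithaca on 7 April 1997.", SRC, TGT)
print(translation)
```

The slot fillers themselves still need translating or reformatting, which is one reason generalized examples start to resemble RBMT transfer rules.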



Generalization by analogy – an exercise
    The monkey ate a peach. ↔ saru wa momo o tabeta.
    The man ate a peach. ↔ hito wa momo o tabeta.

    monkey ↔ saru
    man ↔ hito
    The … ate a peach. ↔ … wa momo o tabeta.

    The dog ate a rabbit. ↔ inu wa usagi o tabeta.
    dog ↔ inu
    rabbit ↔ usagi
    The … ate a … . ↔ … wa … o tabeta.
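
The first step of the exercise can be sketched as follows, assuming whitespace tokenisation and two example pairs that differ in exactly one word on each side. The function `generalize` is a hypothetical helper for illustration only; it does not handle the second step, where two gaps have to be opened at once.

```python
def generalize(pair_a, pair_b, gap="…"):
    """Merge two example pairs that differ in one word on each side.

    Returns (source_template, target_template, word_pair_a, word_pair_b),
    or None if the two pairs do not fit this simple pattern.
    """
    def diff(x, y):
        xs, ys = x.split(), y.split()
        if len(xs) != len(ys):
            return None
        positions = [i for i, (p, q) in enumerate(zip(xs, ys)) if p != q]
        if len(positions) != 1:
            return None
        i = positions[0]
        template = " ".join(xs[:i] + [gap] + xs[i + 1:])
        return template, xs[i], ys[i]

    src = diff(pair_a[0], pair_b[0])
    tgt = diff(pair_a[1], pair_b[1])
    if src is None or tgt is None:
        return None
    # The differing words on the two sides are taken to translate each other.
    return src[0], tgt[0], (src[1], tgt[1]), (src[2], tgt[2])

merged = generalize(
    ("The monkey ate a peach .", "saru wa momo o tabeta ."),
    ("The man ate a peach .", "hito wa momo o tabeta ."),
)
print(merged)
```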



Alignment
 Taking the input and the closely-matching example and
 deciding which fragments of the translation can be
 reused or need to be changed
  Input:
  The operation was interrupted because the Listening key was pressed.

  Matches:

  The operation failed because the print key was pressed.

  L’opération a échoué car la touche d’impression a été enfoncée.
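
The source-side half of this step, finding which parts of the matched example differ from the input, can be sketched with Python's standard `difflib`. This is an illustrative sketch rather than how any given EBMT system does it; `changed_spans` is a hypothetical name, and deciding which French words correspond to the changed spans still needs one of the alignment methods on the next slides.

```python
from difflib import SequenceMatcher

def changed_spans(example_src, input_src):
    """List (example_words, input_words) for every span that differs."""
    ex, inp = example_src.split(), input_src.split()
    opcodes = SequenceMatcher(a=ex, b=inp, autojunk=False).get_opcodes()
    return [(" ".join(ex[i1:i2]), " ".join(inp[j1:j2]))
            for tag, i1, i2, j1, j2 in opcodes
            if tag != "equal"]

changes = changed_spans(
    "The operation failed because the print key was pressed.",
    "The operation was interrupted because the Listening key was pressed.",
)
print(changes)
```

Everything outside the reported spans can be reused from the example translation; the spans themselves mark what has to be replaced.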



Alignment – how is this done?
 Dictionary look-up
 Comparison of multiple examples



    Alignment – Comparison of multiple examples
      Comparison of multiple examples to distinguish
      alternatives, using semantic similarity (Nagao 1984)
    Input
      He eats potatoes .
    Matches
☺     A man eats vegetables . Hito wa yasai o taberu.
      Acid eats metal .       San wa kinzoku o okasu.
    Result
     Kare wa jagaimo o taberu.



    Alignment – Comparison of multiple examples
      Comparison of multiple examples to distinguish
      alternatives, using semantic similarity (Nagao 1984)
    Input
      He eats potatoes .
      Sulphuric acid eats iron .
    Matches
      A man eats vegetables . Hito wa yasai o taberu.
☺     Acid eats metal .       San wa kinzoku o okasu.
    Result
      Kare wa jagaimo o taberu.
      Ryūsan wa tetsu o okasu.
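
A toy sketch of choosing between the two examples by semantic similarity. The thesaurus below is entirely hypothetical and hand-made for these few words; the idea, following the slide, is only that the input's subject and object are compared with those of each example and the closer example supplies the verb.

```python
# Hypothetical toy thesaurus: each word maps to a path of classes,
# from most general to most specific.
THESAURUS = {
    "he": ("entity", "animate", "human"),
    "man": ("entity", "animate", "human"),
    "acid": ("entity", "inanimate", "chemical"),
    "sulphuric acid": ("entity", "inanimate", "chemical"),
    "potatoes": ("entity", "inanimate", "food"),
    "vegetables": ("entity", "inanimate", "food"),
    "metal": ("entity", "inanimate", "material"),
    "iron": ("entity", "inanimate", "material"),
}

def similarity(w1, w2):
    """Number of leading thesaurus classes the two words share."""
    shared = 0
    for a, b in zip(THESAURUS[w1], THESAURUS[w2]):
        if a != b:
            break
        shared += 1
    return shared

# (subject, object, Japanese verb) from the two matches on the slide.
EXAMPLES = [
    ("man", "vegetables", "taberu"),
    ("acid", "metal", "okasu"),
]

def choose_verb(subject, obj):
    """Pick the verb of the example whose arguments are most similar."""
    best = max(EXAMPLES,
               key=lambda ex: similarity(subject, ex[0]) + similarity(obj, ex[1]))
    return best[2]

print(choose_verb("he", "potatoes"))
print(choose_verb("sulphuric acid", "iron"))
```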



Alignment – how is this done?
 Dictionary look-up
 Comparison of multiple examples
 Precomputed as in SMT: using word-alignment model



Phrase alignment
 Granularity of fragments is a problem
   Too small = too general when it comes to recombination
     (You wouldn’t dream of translating by looking up each individual word
       in a dictionary and pasting it into position)
   Too big = sparse, and difficult to recombine
 Working at an intermediate level seems attractive:
   Phrase-based chunking
   Also found in SMT
 One fairly successful approach (at DCU) has been …



  Marker-based chunking
     Most languages have a set of “marker words” (Green
     1979) – roughly speaking, closed-class words
     Marker words can be used to distinguish chunks
        Start a new phrase every time you come across a marker word
        Except that each phrase must contain at least one non-marker
        word
these limits are designed to provide reasonable protection against
harmful interference when the equipment is operated in a residential
environment .

Starting a new phrase at every marker word:
<D> these limits are designed
<P> to provide reasonable protection
<P> against harmful interference
<WH> when
<D> the equipment is operated
<P> in
<D> a residential environment .

Requiring at least one non-marker word per phrase:
<D> these limits are designed
<P> to provide reasonable protection
<P> against harmful interference
<WH> when the equipment is operated
<P> in a residential environment .
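
The two chunking rules can be sketched directly. The marker list below is a small hypothetical subset covering just this sentence (a real marker set would list the closed-class words of the language), and the `NULL` tag for a chunk that starts without a marker follows the labelling used in the chunk-alignment slide.

```python
# Hypothetical, deliberately tiny marker lexicon: word -> marker category.
MARKERS = {
    "these": "D", "the": "D", "a": "D",
    "to": "P", "against": "P", "in": "P",
    "when": "WH",
}

def marker_chunks(sentence):
    """Split a sentence into chunks that each start at a marker word.

    A marker word only opens a new chunk if the current chunk already
    contains at least one non-marker word.
    """
    chunks = []  # each entry: [tag, words, has_non_marker_word]
    for word in sentence.split():
        tag = MARKERS.get(word.lower())
        if not chunks or (tag is not None and chunks[-1][2]):
            chunks.append([tag or "NULL", [word], tag is None])
        else:
            chunks[-1][1].append(word)
            chunks[-1][2] = chunks[-1][2] or tag is None
    return ["<%s> %s" % (tag, " ".join(words)) for tag, words, _ in chunks]

sentence = ("these limits are designed to provide reasonable protection "
            "against harmful interference when the equipment is operated "
            "in a residential environment .")
for chunk in marker_chunks(sentence):
    print(chunk)
```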



Chunk alignment
  Align by finding similar pairs of chunks in other examples
  No need for chunks to align 1:1, …
  … nor follow the same sequence
  Markers can help, but don’t have to

  1<D> these limits are designed
  2<P> to provide reasonable protection
  3<P> against harmful interference
  4<WH> when the equipment is operated
  5<P> in a residential environment .

  1<D> ces limites sont destinées
  2<CONJ> à assurer
  3<D> une protection raisonnable
  4<P> contre les interférences
  5<CONJ> lorsque le matériel est utilisé
  6<P> dans un environnement résidentiel .

  1<NULL> consult
  2<D> the dealer
  3<CONJ> or an experienced radio/TV technician
  4<P> for help .

  1<P> en cas
  2<D> de besoin ,
  3<PRON> se adresser
  4<CONJ> à un technicien radio
  5<CONJ> ou TV qualifié .



Recombination
 Having identified target-language fragments, how do we
 put them together?
 Depends how examples are stored
   Templates with labelled slots
   <person-m> flew to <city> on <date> .
   Tree structures
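    Kanojo wa kami ga nagai.
    SHE (topic) HAIR (subj) IS-LONG.
    She has long hair.

    Kare wa me ga aoi.
    He has blue eyes.

    [Figure: paired tree structures. Japanese: root nagai / aoi, with wa → kanojo / kare and ga → kami / me. English: root have, with subj → she / he, obj → hair / eyes, mod → long / blue.]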



Recombination – a problem
    Consider again:
    He buys a book on politics
    Matches
    He buys a notebook. Kare wa nōto o kau.
    He buys a pen. Kare wa pen o kau.
    I read a book on politics. Watashi wa seiji nitsuite kakareta hon o yomu.
    She wrote a book on politics. Kanojo wa seiji nitsuite kakareta hon o kaita.

    Result
    Kare wa wa seiji nitsuite kakareta hon o o kau.
    Kare wa seiji nitsuite kakareta hon o kau.



Recombination – another problem
  Boundary friction
 Input: The handsome boy entered the room.
 Matches:
 The handsome boy ate his breakfast. Der schöne Junge aß sein Frühstück.
 I saw the handsome boy. Ich sah den schönen Jungen.




  Solutions?
      Labelled fragments
       (remember where you got the fragment from – use its context)
      Target-language grammar
      Target language model (as in SMT)
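
The target language model option can be illustrated with a very crude stand-in: count how many bigrams of each candidate have been seen in target-side text. Everything here is a toy assumption, including the two-sentence "corpus" and the German rendering of "entered the room" (betrat das Zimmer), which does not come from the examples above.

```python
from collections import Counter

def bigrams(sentence):
    """Lower-cased word bigrams, with sentence-boundary markers."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    return list(zip(tokens, tokens[1:]))

# Toy target-side corpus: just the two example translations.
target_corpus = [
    "Der schöne Junge aß sein Frühstück .",
    "Ich sah den schönen Jungen .",
]
seen = Counter(bg for sent in target_corpus for bg in bigrams(sent))

def score(candidate):
    """Number of the candidate's bigrams attested in the corpus."""
    return sum(1 for bg in bigrams(candidate) if seen[bg] > 0)

# Two recombinations that differ only in which fragment was reused
# for "the handsome boy" (the verb phrase is an assumed translation).
candidates = [
    "Den schönen Jungen betrat das Zimmer .",
    "Der schöne Junge betrat das Zimmer .",
]
best = max(candidates, key=score)
print(best)
```

The sentence-initial bigram is what separates the two candidates here: the nominative fragment was seen at the start of a sentence, the accusative one was not. A real system would use a properly smoothed n-gram model trained on far more text.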



EBMT and SMT hybrids
 Recombination is like decoding
 Matching/alignment phases have produced a bag of
 fragments that now need to be recombined to form a
 grammatical target sentence
 Essentially the same task as is found in SMT decoding
 Doesn’t matter what the source of the fragments is

 Similarly, one could imagine an SMT translation model
 taking ideas from EBMT matching/alignment



So are EBMT and SMT the same?
Some things in common which distinguish them from Rule-based MT

 Use of a bitext as the fundamental data source
 Empirical rather than rationalist: the principle is machine learning
 rather than a human (linguist) writing rules
 From which it follows (in principle) that systems can be
 improved mainly by getting more data
 And it is hoped that new language-pairs can be developed
 “just” by finding suitable parallel corpus data



So are EBMT and SMT the same?
It may be merely dogmatic to insist, but there are some definitional
differences
SMT
 Essentially uses statistical data (parameters, probabilities)
 derived from the bitext
 Preprocessing the data is essential
 Even if the input is in the training data, you are not
 guaranteed to get the same translation
EBMT
 Uses the bitext as its primary data source
 Preprocessing the data is optional
 If the input is in the example set, you are guaranteed to get
 the same translation