          Moses: Open Source Toolkit for Statistical Machine Translation

    Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch
        University of Edinburgh
    Marcello Federico, Nicola Bertoldi
        ITC-irst
    Brooke Cowan, Wade Shen, Christine Moran
        MIT
    Richard Zens
        RWTH Aachen
    Chris Dyer
        University of Maryland
    Ondřej Bojar
        Charles University
    Alexandra Constantin
        Williams College
    Evan Herbst
        Cornell

                    Abstract

    We describe an open-source toolkit for statistical machine translation
whose novel contributions are (a) support for linguistically motivated
factors, (b) confusion network decoding, and (c) efficient data formats for
translation models and language models. In addition to the SMT decoder, the
toolkit also includes a wide variety of tools for training, tuning and
applying the system to many translation tasks.

1    Motivation
    Phrase-based statistical machine translation (Koehn et al. 2003) has
emerged as the dominant paradigm in machine translation research. However,
until now, most work in this field has been carried out on proprietary and
in-house research systems. This lack of openness has created a high barrier
to entry for researchers, as many of the components required have had to be
duplicated. This has also hindered effective comparisons of the different
elements of the systems.
    By providing a free and complete toolkit, we hope to stimulate the
development of the field. For this system to be adopted by the community, it
must demonstrate performance that is comparable to the best available
systems. Moses has shown that it achieves results comparable to the most
competitive and widely used statistical machine translation systems in
translation quality and run-time (Shen et al. 2006). It features all the
capabilities of the closed-source Pharaoh decoder (Koehn 2004).
    Apart from providing an open-source toolkit for SMT, a further motivation
for Moses is to extend phrase-based translation with factors and confusion
network decoding.
    The current phrase-based approach to statistical machine translation is
limited to the mapping of small text chunks without any explicit use of
linguistic information, be it morphological, syntactic, or semantic. These
additional sources of information have been shown to be valuable when
integrated into pre-processing or post-processing steps.
    Moses also integrates confusion network decoding, which allows the
translation of ambiguous input. This enables, for instance, the tighter
integration of speech recognition and machine translation. Instead of
passing along the one-best output of the recognizer, a network of different
word choices may be examined by the machine translation system.
    Efficient data structures in Moses for the memory-intensive translation
model and language model allow the exploitation of much larger data
resources with limited hardware.
Proceedings of the ACL 2007 Demo and Poster Sessions, pages 177–180,
Prague, June 2007. © 2007 Association for Computational Linguistics
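
    The confusion network input introduced in Section 1 can be pictured as a
sequence of columns, each holding alternative words with a confidence. The
following is a minimal sketch under stated assumptions, not the Moses input
format or implementation: the words, the probabilities, and the names
`network`, `one_best` and `all_paths` are hypothetical and chosen only for
illustration. It contrasts the single one-best string with the full set of
word choices that a decoder could examine.

```python
from itertools import product
from math import prod
from typing import List, Tuple

# One column per input position: alternative words with their confidence.
# All words and probabilities here are invented for illustration.
Column = List[Tuple[str, float]]

network: List[Column] = [
    [("i", 0.9), ("eye", 0.1)],
    [("am", 0.7), ("an", 0.3)],
    [("buying", 1.0)],
]


def one_best(net: List[Column]) -> List[str]:
    """Keep only the most confident word of every column."""
    return [max(column, key=lambda alt: alt[1])[0] for column in net]


def all_paths(net: List[Column]) -> List[Tuple[List[str], float]]:
    """Enumerate every path through the network with its probability.

    The probability of a path is the product of the confidences of the
    words chosen in each column. The result is sorted, best path first.
    """
    paths = []
    for choice in product(*net):
        words = [word for word, _ in choice]
        probability = prod(p for _, p in choice)
        paths.append((words, probability))
    return sorted(paths, key=lambda path: path[1], reverse=True)


best = one_best(network)
paths = all_paths(network)
```

    Explicit enumeration is only feasible for toy inputs, since the number
of paths grows exponentially with the number of columns; the sketch is
meant to show what information is lost when only `best` is passed on.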
2    Toolkit
    The toolkit is a complete out-of-the-box translation system for academic
research. It consists of all the components needed to preprocess data and
to train the language models and the translation models. It also contains
tools for tuning these models using minimum error rate training (Och 2003)
and evaluating the resulting translations using the BLEU score (Papineni et
al. 2002).
    Moses uses standard external tools for some of the tasks to avoid
duplication, such as GIZA++ (Och and Ney 2003) for word alignments and SRILM
for language modeling. Also, since these tasks are often CPU intensive, the
toolkit has been designed to work with the Sun Grid Engine parallel
environment to increase throughput.
    In order to unify the experimental stages, a utility has been developed
to run repeatable experiments. This uses the tools contained in Moses and
requires minimal changes to set up and customize.
    The toolkit has been hosted and developed under since inception. Moses
has an active research community and has reached over 1000 downloads as of
1st March 2007.
    The main online presence is at where many sources of information about
the project can be found. Moses was the subject of this year’s Johns Hopkins
University Workshop on Machine Translation (Koehn et al. 2006).
    The decoder is the core component of Moses. To minimize the learning
curve for many researchers, the decoder was developed as a drop-in
replacement for Pharaoh, the popular phrase-based decoder.
    In order for the toolkit to be adopted by the community, and to make it
easy for others to contribute to the project, we kept to the following
principles when developing the decoder:
   • Accessibility
   • Easy to Maintain
   • Flexibility
   • Easy for distributed team development
   • Portability
    It was developed in C++ for efficiency and followed modular,
object-oriented design.

3    Factored Translation Model
    Non-factored SMT typically deals only with the surface form of words and
has one phrase table, as shown in Figure 1.

    Translate:
        i    am buying    you    a    green    cat

        je    vous    achète    un    chat    vert

    using phrase dictionary:
        i            je
        am buying    achète
        you          vous
        a            un
        a            une
        green        vert
        cat          chat

    Figure 1. Non-factored translation

    In factored translation models, the surface forms may be augmented with
different factors, such as POS tags or lemmas. This creates a factored
representation of each word, as shown in Figure 2.

        je     vous    achet            un      chat
        PRO    PRO     VB               ART     NN
        je     vous    acheter          un      chat
        1st    1st     1st / present    masc    sing / masc

        i      buy              you     a       cat
        PRO    VB               PRO     ART     NN
        i      tobuy            you     a       cat
        1st    1st / present    1st     sing    sing

    Figure 2. Factored translation

    Mapping of source phrases to target phrases may be decomposed into
several steps. Decomposition of the decoding process into various steps
means that different factors can be modeled separately. Modeling factors in
isolation allows for flexibility in their application. It can also increase
accuracy and reduce sparsity by minimizing the number of dependencies for
each step.
    For example, we can decompose translating from surface forms to surface
forms and lemma, as shown in Figure 3.

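    One way to picture a decomposed mapping is as an ordered list of small
steps, each of which fills in one more target factor. The sketch below is
illustrative only and makes several assumptions: the step order (a
translation step from source surface form to target surface form, followed
by a generation step from target surface form to target lemma), the toy
tables, and all names such as `translate_surface`, `generate_lemma` and
`expand` are hypothetical and are not taken from the Moses code base.

```python
from typing import Callable, Dict, List

# A word is modeled as a mapping from factor name to value,
# for example {"surface": "chat", "lemma": "chat"}.
Factors = Dict[str, str]
Step = Callable[[Factors, Factors], List[Factors]]

# Toy tables, invented for illustration only.
SURFACE_TABLE = {"a": ["un", "une"], "cat": ["chat"], "green": ["vert"]}
LEMMA_TABLE = {"un": ["un"], "une": ["un"], "chat": ["chat"], "vert": ["vert"]}


def translate_surface(source: Factors, partial: Factors) -> List[Factors]:
    """Translation step: source surface form -> target surface form."""
    options = SURFACE_TABLE.get(source["surface"], [])
    return [{**partial, "surface": target} for target in options]


def generate_lemma(source: Factors, partial: Factors) -> List[Factors]:
    """Generation step: target surface form -> target lemma."""
    options = LEMMA_TABLE.get(partial["surface"], [])
    return [{**partial, "lemma": lemma} for lemma in options]


def expand(source: Factors, steps: List[Step]) -> List[Factors]:
    """Apply the steps in order, extending every partial target word."""
    hypotheses: List[Factors] = [{}]
    for step in steps:
        hypotheses = [new for hyp in hypotheses for new in step(source, hyp)]
    return hypotheses


# The list of steps is plain data, so a different decomposition can be
# tried by changing this list rather than the expansion code.
STEPS: List[Step] = [translate_surface, generate_lemma]
options = expand({"surface": "a"}, STEPS)
```

    Because each step consults only the factors it needs, each table stays
small, which is the sparsity argument made above in miniature.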
    Figure 3. Example of graph of decoding steps

    By allowing the graph to be user definable, we can experiment to find
the optimum configuration for a given language pair and available data.
    The factors on the source sentence are considered fixed; therefore,
there is no decoding step which creates source factors from other source
factors. However, Moses can have ambiguous input in the form of confusion
networks. This input type has been used successfully for speech-to-text
translation (Shen et al. 2006).
    Every factor on the target language can have its own language model.
Since many factors, like lemmas and POS tags, are less sparse than surface
forms, it is possible to create higher order language models for these
factors. This may encourage more syntactically correct output. In Figure 3
we apply two language models, indicated by the shaded arrows, one over the
words and another over the lemmas. Moses is also able to integrate factored
language models, such as those described in (Bilmes and Kirchhoff 2003) and
(Axelrod 2006).

4    Confusion Network Decoding
    Machine translation input currently takes the form of simple sequences
of words. However, there are increasing demands to integrate machine
translation technology into larger information processing systems with
upstream NLP/speech processing tools (such as named entity recognizers,
speech recognizers, morphological analyzers, etc.). These upstream
processes tend to generate multiple, erroneous hypotheses with varying
confidence. Current MT systems are designed to process only one input
hypothesis, making them vulnerable to errors in the input.
    In experiments with confusion networks, we have focused so far on the
speech translation case, where the input is generated by a speech
recognizer. Namely, our goal is to improve performance of spoken language
translation by better integrating speech recognition and machine translation
models. Translation from speech input is considered more difficult than
translation from text for several reasons. Spoken language has many styles
and genres, such as formal read speech, unplanned speeches, interviews, and
spontaneous conversations; it produces less controlled language, presenting
more relaxed syntax and spontaneous speech phenomena. Finally, translation
of spoken language is prone to speech recognition errors, which can possibly
corrupt the syntax and the meaning of the input.
    There is also empirical evidence that better translations can be
obtained from transcriptions of the speech recognizer which resulted in
lower scores. This suggests that improvements can be achieved by applying
machine translation on a large set of transcription hypotheses generated by
the speech recognizers and by combining scores of acoustic models, language
models, and translation models.
    Recently, approaches have been proposed for improving translation
quality through the processing of multiple input hypotheses. We have
implemented in Moses confusion network decoding as discussed in (Bertoldi
and Federico 2005), and developed a simpler translation model and a more
efficient implementation of the search algorithm. Remarkably, the confusion
network decoder resulted in an extension of the standard text decoder.

5    Efficient Data Structures for Translation Model and Language Models
    With the availability of ever-increasing amounts of training data, it
has become a challenge for machine translation systems to cope with the
resulting strain on computational resources. Instead of simply buying larger
machines with, say, 12 GB of main memory, the implementation of more
efficient data structures in Moses makes it possible to exploit larger data
resources with limited hardware infrastructure.
    A phrase translation table easily takes up gigabytes of disk space, but
for the translation of a single sentence only a tiny fraction of this table
is needed. Moses implements an efficient representation of the phrase
translation table. Its key properties are a prefix tree structure for source
words and on-demand loading, i.e. only the fraction of the phrase table that
is needed to translate a sentence is loaded into the working memory of the
decoder.

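    The prefix tree idea can be sketched in a few lines. This is a minimal
in-memory illustration under stated assumptions, not the representation of
Zens and Ney or the Moses implementation: the class names `TrieNode` and
`PhraseTable` and the method `matches` are hypothetical, the entries are the
toy dictionary of Figure 1, and loading from disk is not modeled. What the
sketch does show is that a lookup for one sentence only ever walks the
branches whose source words occur in that sentence, which is the part of
the table an on-demand loader would need to fetch.

```python
from typing import Dict, List, Tuple


class TrieNode:
    """One node of the prefix tree; edges are labeled with source words."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.translations: List[str] = []


class PhraseTable:
    """Toy phrase table stored as a prefix tree over source words."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def add(self, source: str, target: str) -> None:
        node = self.root
        for word in source.split():
            node = node.children.setdefault(word, TrieNode())
        node.translations.append(target)

    def matches(self, sentence: List[str]) -> Dict[Tuple[int, int], List[str]]:
        """Return the translations of every span [i, j) found in the table."""
        found: Dict[Tuple[int, int], List[str]] = {}
        for i in range(len(sentence)):
            node = self.root
            for j in range(i, len(sentence)):
                node = node.children.get(sentence[j])
                if node is None:
                    break  # no stored phrase continues with this word
                if node.translations:
                    found[(i, j + 1)] = list(node.translations)
        return found


table = PhraseTable()
for src, tgt in [("i", "je"), ("am buying", "achète"), ("you", "vous"),
                 ("a", "un"), ("a", "une"), ("green", "vert"), ("cat", "chat")]:
    table.add(src, tgt)

spans = table.matches("i am buying you a green cat".split())
```

    In a disk-backed variant, the `children.get` call is the natural place
to read a subtree lazily, so that entries for words absent from the
sentence are never brought into memory.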
    For the Chinese-English NIST task, the memory requirement of the phrase
table is reduced from 1.7 gigabytes to less than 20 megabytes, with no loss
in translation quality and speed (Zens and Ney 2007).
    The other large data resource for statistical machine translation is the
language model. Almost unlimited text resources can be collected from the
Internet and used as training data for language modeling. This results in
language models that are too large to easily fit into memory.
    The Moses system implements a data structure for language models that is
more efficient than the canonical SRILM (Stolcke 2002) implementation used
in most systems. The language model on disk is also converted into this
binary format, resulting in a minimal loading time during start-up of the
decoder.
    An even more compact representation of the language model is the result
of the quantization of the word prediction and back-off probabilities of the
language model. Instead of representing these probabilities with 4-byte or
8-byte floats, they are sorted into bins, resulting in (typically) 256 bins
which can be referenced with a single 1-byte index. This quantized language
model, albeit being less accurate, has only minimal impact on translation
performance (Federico and Bertoldi 2006).

6    Conclusion and Future Work
    This paper has presented a suite of open-source tools which we believe
will be of value to the MT research community.
    We have also described a new SMT decoder which can incorporate some
linguistic features in a consistent and flexible framework. This new
direction in research opens up many possibilities and issues that require
further research and experimentation. Initial results show the potential
benefit of factors for statistical machine translation (Koehn et al. 2006;
Koehn and Hoang 2007).

References
Axelrod, Amittai. "Factored Language Model for Statistical Machine
        Translation." MRes Thesis. Edinburgh University, 2006.
Bertoldi, Nicola, and Marcello Federico. "A New Decoder for Spoken Language
        Translation Based on Confusion Networks." Automatic Speech
        Recognition and Understanding Workshop (ASRU), 2005.
Bilmes, Jeff A, and Katrin Kirchhoff. "Factored Language Models and
        Generalized Parallel Backoff." HLT/NAACL, 2003.
Koehn, Philipp. "Pharaoh: A Beam Search Decoder for Phrase-Based Statistical
        Machine Translation Models." AMTA, 2004.
Koehn, Philipp, Marcello Federico, Wade Shen, Nicola Bertoldi, Ondřej Bojar,
        Chris Callison-Burch, Brooke Cowan, Chris Dyer, Hieu Hoang, Richard
        Zens, Alexandra Constantin, Christine Corbett Moran, and Evan
        Herbst. "Open Source Toolkit for Statistical Machine Translation."
        Report of the 2006 Summer Workshop at Johns Hopkins University,
        2006.
Koehn, Philipp, and Hieu Hoang. "Factored Translation Models." EMNLP, 2007.
Koehn, Philipp, Franz Josef Och, and Daniel Marcu. "Statistical Phrase-Based
        Translation." HLT/NAACL, 2003.
Och, Franz Josef. "Minimum Error Rate Training for Statistical Machine
        Translation." ACL, 2003.
Och, Franz Josef, and Hermann Ney. "A Systematic Comparison of Various
        Statistical Alignment Models." Computational Linguistics 29.1
        (2003): 19-51.
Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. "BLEU: A
        Method for Automatic Evaluation of Machine Translation." ACL, 2002.
Shen, Wade, Richard Zens, Nicola Bertoldi, and Marcello Federico. "The JHU
        Workshop 2006 IWSLT System." International Workshop on Spoken
        Language Translation, 2006.
Stolcke, Andreas. "SRILM - An Extensible Language Modeling Toolkit." Intl.
        Conf. on Spoken Language Processing, 2002.
Zens, Richard, and Hermann Ney. "Efficient Phrase-Table Representation for
        Machine Translation with Applications to Online MT and Speech
        Recognition." HLT/NAACL, 2007.
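
    As a supplementary illustration of the language model quantization
described in Section 5, the following sketch replaces each probability by a
1-byte index into a small codebook. It rests on assumptions that the text
does not settle: bins are formed here by splitting the sorted values into
equally populated groups, and each bin is represented by the mean of its
members. The function name `quantize` and the sample values are
hypothetical, and the method of Federico and Bertoldi may differ in both
respects.

```python
from typing import List, Tuple


def quantize(values: List[float], n_bins: int = 256) -> Tuple[List[float], bytes]:
    """Map each value to a 1-byte bin index and return (codebook, codes).

    Assumption: equally populated bins over the sorted values, with the
    bin mean as the representative value.
    """
    if not 1 <= n_bins <= 256:
        raise ValueError("n_bins must fit in one byte (1..256)")
    n_bins = min(n_bins, len(values))
    order = sorted(range(len(values)), key=lambda i: values[i])
    codebook: List[float] = []
    codes = bytearray(len(values))
    for b in range(n_bins):
        # Integer arithmetic keeps the bins contiguous and non-empty.
        start = b * len(values) // n_bins
        end = (b + 1) * len(values) // n_bins
        members = order[start:end]
        codebook.append(sum(values[i] for i in members) / len(members))
        for i in members:
            codes[i] = b
    return codebook, bytes(codes)


# Invented log-probabilities, quantized into 3 bins to keep the example small.
log_probs = [-0.1, -0.2, -2.0, -2.2, -5.0, -5.5]
codebook, codes = quantize(log_probs, n_bins=3)
restored = [codebook[c] for c in codes]
```

    Each stored value then costs one byte plus a shared codebook of at most
256 floats, at the price of the rounding error visible in `restored`.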