LetsMT!: Cloud-Based Platform for Building User Tailored Machine Translation Engines

Andrejs Vasiļjevs, Tilde, Vienibas gatve 75a, Riga, LV1004, LATVIA
Raivis Skadiņš, Tilde, Vienibas gatve 75a, Riga, LV1004, LATVIA
Jörg Tiedemann, Uppsala University, Box 635, Uppsala, SE-75126, SWEDEN, jorg.tiedemann@

Abstract

To fully exploit the huge potential of existing open SMT technologies and user-provided content, we have created an innovative online platform for data sharing and MT building. This platform is being developed in the EU collaboration project LetsMT!. This paper presents the motivation for developing this platform, its architecture and its main features.

1 Introduction

The goal of the LetsMT! project is to facilitate the use of open source SMT toolkits and to involve users in the collection of training data. This will enrich and enhance the currently most progressive MT technology and make it available and accessible to all categories of users, who can share MT training data and build tailored MT systems for different languages on the online LetsMT! platform. The LetsMT! project extends the use of existing state-of-the-art SMT methods by enabling users to participate in data collection and MT customization in order to increase the quality, scope and language coverage of MT. Currently LetsMT! is creating a cloud-based platform that gathers public and user-provided MT training data and generates multiple MT systems by combining and prioritizing this data.

The LetsMT! Consortium includes the project coordinator Tilde, the Universities of Edinburgh, Zagreb, Copenhagen and Uppsala, the localization company Moravia and the semantic technology company SemLab. The project started in March 2010 and should achieve its goals by September 2012.

2 Applying user-provided data for SMT training

The number of open source parallel resources is limited, and this is an essential problem for SMT, since translation systems trained on data from a particular domain, e.g. parliamentary proceedings, will perform poorly when used to translate texts from a different domain, e.g. news articles. At the same time, a huge amount of parallel texts and translated documents is at the users’ disposal and can be used for SMT system training. Therefore, the LetsMT! online platform gives all categories of users (public organizations, private companies, individuals) an opportunity to upload their proprietary resources to the repository and to receive a tailored SMT system trained on these resources. These systems can be shared with other users, who can exploit them further. Data and SMT model sharing can be managed by the users. In LetsMT! we emphasize data integrity and security, which make it possible to work with proprietary collections as well as public sources.

The motivation of users to get involved in sharing their resources is based on the following factors:

• participate and contribute, in a reciprocal manner, with a community of professionals and its goals;
• achieve better MT quality for user specific texts;
• build tailored and domain specific translation services;
• enhance reputation for individuals and businesses;
• ensure compliance with the requirement set forth by EU Directive to provide usability of public information in a convenient way for public institutions;
• deliver a ready resource for study and teaching purposes for academic institutions.

The LetsMT! project is advancing the concept of data sharing, which implies the practice of making data used in one activity available to other users.

The LetsMT! platform provides the following key features:

• Uploading of parallel texts for users that will contribute their content;
• Directory of web and offline resources gathered by LetsMT! users;
• Automated training of SMT systems from specified collections of training data;
• Custom building of MT engines from selected pools of training data;
• Custom building of MT engines from proprietary non-public data;
• MT evaluation facilities.

3 Architecture overview

Figure 1 illustrates the general architecture of the LetsMT! platform. Its components for SMT training, parallel data collection and data processing are described further down in this paper. The development of the system was particularly facilitated by the open-source alignment tool GIZA++ (Och and Ney 2003) and the MT training and decoding tool Moses (Koehn et al. 2007).

Figure 1. General architecture of the LetsMT! platform.

LetsMT! translation services can be used in several ways: through the web portal, through a widget provided for free inclusion in a web page, through browser plug-ins, and through integration in computer-assisted translation (CAT) tools and different online and offline applications. Localisation and translation businesses as well as other professional translators can use the LetsMT! platform for uploading their parallel corpora to the LetsMT! website, building custom SMT solutions from the specified collections of training data, and accessing these solutions in their productivity environments (typically, various CAT tools).

The LetsMT! system has a multitier architecture. It has (i) an interface layer implementing the user interface and APIs with external systems; (ii) an application logic layer for the system logic; and (iii) a data storage layer consisting of file and database storage. The LetsMT! system performs various time and resource consuming tasks; these tasks are defined by the application logic and the data storage and are sent to a High Performance Computing (HPC) Cluster for execution.

The interface layer provides interfaces between the LetsMT! system and external users. The system has both human and machine users. Human users can access the system through web browsers by using the LetsMT! web page interface. External systems such as CAT tools and browser plug-ins can access the LetsMT! system through a public API. The public API is available through both REST/JSON and SOAP protocol web services. Some CAT tools or other external systems may require different interfaces; these may be introduced if necessary. The HTTPS protocol is used to ensure secure user authentication and secure data transfer.

The application logic layer contains a set of modules responsible for the main functionality or logic of the system. It receives queries and commands from the interface layer and prepares answers or performs tasks using the data storage and the HPC cluster. This layer contains several modules such as the Resource Repository Manager, the User Manager and the SMT Training Manager. The interface layer accesses the application logic layer through both REST/JSON and SOAP protocol web services. The same protocols are used for communication between modules in the application logic layer.

The LetsMT! system as a data sharing and MT platform is able to store large amounts of SMT training data (parallel and monolingual corpora) as well as trained models of SMT systems. The data is stored in one central Resource Repository (RR). The RR is also used to store various tools necessary for data processing and SMT training. As training data may change (for example, grow), the resource repository is based on a version-controlled file system (currently we use SVN as the backend system). A key-value store is used to keep metadata and statistics about training data and trained SMT systems. Modules from the application logic layer and the HPC cluster access the RR through a REST-based web service interface.

A High Performance Computing Cluster is used to execute many different data processing tasks and to train and run SMT systems. Modules from the application logic and data storage layers create jobs and send them to the HPC cluster for execution. The HPC cluster is responsible for accepting, scheduling, dispatching, and managing the remote and distributed execution of large numbers of standalone, parallel or interactive jobs. It also manages and schedules the allocation of distributed resources such as processors, memory and disk space. The LetsMT! HPC cluster is based on the Oracle Grid Engine (SGE). The HPC cluster accesses data stored in the data storage layer using the RR API.

The hardware infrastructure of the LetsMT! platform is heterogeneous. The majority of services run on Linux platforms (GIZA++, Moses, Resource Repository, data processing tools). The Web server and application logic services run on a Microsoft Windows platform.

The system hardware architecture is designed to be highly scalable. The LetsMT! platform contains several machines with both continuous and on-demand availability:

• Continuous availability: the core frontend and backend services that guarantee LetsMT! webpage and external API availability;
• On-demand availability: training, translation and data import services (HPC cluster nodes), and additional frontend and backend server instances to increase availability.

4 Application of the Moses SMT toolkit

A significant breakthrough in SMT was achieved by the EuroMatrix project. The project objectives included the creation of translation systems for all pairs of EU languages and the development of open source MT technology including research tools, software and data collections. Its result is the improved open source SMT toolkit Moses developed by the University of Edinburgh. The Moses SMT toolkit is a complete statistical translation system distributed under the Lesser General Public License (LGPL). Moses includes all the components needed to pre-process data and to train language and translation models (Koehn et al. 2007). Moses is widely used in the research community and has also reached the commercial sector. While the use of the software is not closely monitored (there is no need to sign a license agreement), Moses is known to be in commercial use by companies such as Systran, Asia Online, Autodesk and Matrixware. The LetsMT! project coordinator Tilde bases its free online Latvian MT system on the Moses platform.
LetsMT! uses Moses as a language independent SMT solution and integrates it as a cloud-based service into the LetsMT! online platform. One of the important achievements of the LetsMT! project will be the adaptation of the Moses toolkit to fit into the rapid training, updating, and interactive access environment of the LetsMT! platform. The SMT training pipeline implemented in Moses currently involves a number of steps, each of which requires a separate program to run. In the framework of LetsMT! this process will be streamlined and made automatically configurable given a set of user-specified variables (training corpora, language model data, dictionaries, tuning sets).

Additional important improvements of Moses that are being implemented by the University of Edinburgh as part of LetsMT! are the incremental training of MT models, randomised language models (Levenberg and Osborne 2009), and separate language and translation model servers. We expect some users to add relatively small amounts of additional training data at frequent intervals. With incremental training, the models will benefit from the addition of these data without re-running the entire training pipeline from scratch.

5 LetsMT! Resource repository

Figure 2 illustrates the general architecture of the resource repository and its integration into the LetsMT! platform. The LetsMT! resource repository has a web API that is implemented as a REST service with HTTP requests. The Web API gives access to the LetsMT! resource repository, which consists mainly of a revision control system (Subversion), a database (TokyoCabinet) and a batch-queuing system (SGE, Oracle Grid Engine). The purpose of the Web API is to enable interaction with the repository system for uploading and downloading data, requesting and searching information and triggering batch processes. The LetsMT! resource repository system is implemented in Perl and uses the Apache server and mod_perl to handle the requests and responses to and from the client system.

Figure 2. Resource repository overview

All data sets of the LetsMT! platform are stored in a revision control system. In the current implementation, we use Subversion (SVN). However, the software is modular and another version control system may replace SVN or even work side-by-side with other storage backends.

Revision control systems are designed for dynamic repositories of textual data in multi-user environments. They typically store all repository modifications and provide tools for tracking the file history for any item in the repository. Furthermore, they naturally support data sharing and make it possible to revert to specific versions. Modifications are stored efficiently by keeping track of changes only. All of this makes them well suited for our needs, in which growing resources may be accessed by multiple users.

An important design goal for developing the repository software was to allow arbitrary metadata in terms of key-value pairs stored together with resources in the repository. The focus was set on flexibility, so that new fields and data sets in various formats can easily be added to the database during development. It has to be possible to attach appropriate metadata to any resource at any location in the repository. Another important requirement is that this database should still be powerful enough to allow complex search queries over the entire repository, which reflects a hierarchical file structure. At the same time, the system has to respect permissions set on individual resources so that restricted material cannot be found. Standard relational database management systems do not support this degree of flexibility as they rely on pre-defined relations (tables) with fixed data types and operations over them. A recent trend is, therefore, to move from relational databases with SQL-like queries to schema-less key-value stores that do not require a fixed data model. The general idea of such a store for metadata is presented below, followed by some details about implementation choices in our repository package.

A key-value store basically stores arbitrary data (values) by use of a single key. This conceptually simple strategy allows a lot of flexibility in terms of data storage without pre-defined schemas and data models. In the resource repository we want to attach arbitrary key-value pairs to any resource. Various kinds of information are stored in this way, ranging from descriptive data (textual domain, ownership, language, size etc.) to status information (import/conversion status, etc.) and internal information used by the LetsMT! frontend or repository backend. Furthermore, we also need support for repeated keys, or rather, keys that may contain several values.

Important here is that the database is not restricted to a list of pre-defined keys. In our system, arbitrary keys can be added containing arbitrary values. Furthermore, values can also be interpreted as unordered lists, for example, in the case of language. Using these data sets we are able to ask complex queries such as:

• Give me all public parallel data with English as either source or target language.
• Give me all monolingual data sets from the news domain that are larger than 500 sentences.

Our key-value store is able to process such queries and to return matching resources and their associated metadata entries. Furthermore, we store permission information together with all data records to filter the data according to access restrictions. The backend system we use is based on TokyoCabinet (https://fallabs/tokyocabinet), a freely available software package that implements an efficient database management system with all the flexibility required by our platform.

Another important feature of the Resource Repository software is the support of data import, validation and conversion. Users may upload their data sources in a variety of formats that will automatically be processed by our validation and conversion tools. The software also includes a sentence alignment module that makes it possible to create new parallel resources for SMT training from scratch. In the current implementation we support the following data formats with dedicated import handlers: aligned parallel data in TMX, XLIFF and Moses formats; monolingual text documents in PDF, Text and DOC formats; compressed data and archives in gzip, zip and tar formats. Support for additional formats may be added in future releases.

6 Conclusions

The current development of SMT tools and techniques has reached the level where they can be implemented in practical applications addressing the needs of large user groups in a variety of application scenarios. The work in progress that is described in this paper promises important advances in the application of SMT by integrating available tools and technologies into an easy-to-use cloud-based platform for data sharing and generation of customized MT.

The successful implementation of the project will enable wider use and greater impact of available open-source SMT technologies and facilitate the diversification of free MT by tailoring it to specific domains and user requirements.
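
As an illustration of the metadata queries described in Section 5, the following minimal Python sketch models a schema-less store in which every resource carries arbitrary keys, each key holds an unordered set of values, and permissions are checked before any matching is done. It is not the LetsMT! implementation (which is written in Perl on top of TokyoCabinet); the class name MetadataStore, the key names (type, langs, domain, size, access, owner), the user names and the sample paths are all assumptions made for this example.

```python
from collections import defaultdict


class MetadataStore:
    """Toy schema-less store: each resource path maps to arbitrary keys,
    and every key holds an unordered set of values (hypothetical design)."""

    def __init__(self):
        self._records = defaultdict(lambda: defaultdict(set))

    def put(self, path, key, *values):
        # Repeated calls with the same key accumulate values.
        self._records[path][key].update(values)

    def find(self, user, **conditions):
        """Return sorted paths visible to `user` that satisfy every condition.

        A condition is either a plain value (it must be among the key's
        values) or a predicate that is called with the set of values.
        """
        hits = []
        for path, meta in self._records.items():
            if not self._visible(meta, user):
                continue  # restricted material must not be found
            if all(self._matches(meta.get(key, set()), cond)
                   for key, cond in conditions.items()):
                hits.append(path)
        return sorted(hits)

    @staticmethod
    def _visible(meta, user):
        # Assumed permission model: public resources, or resources owned by the user.
        return "public" in meta.get("access", set()) or user in meta.get("owner", set())

    @staticmethod
    def _matches(values, cond):
        return cond(values) if callable(cond) else cond in values


store = MetadataStore()
store.put("corpora/europarl/en-lv", "type", "parallel")
store.put("corpora/europarl/en-lv", "langs", "en", "lv")
store.put("corpora/europarl/en-lv", "access", "public")
store.put("corpora/news/lv", "type", "monolingual")
store.put("corpora/news/lv", "domain", "news")
store.put("corpora/news/lv", "size", 1200)
store.put("corpora/news/lv", "access", "public")
store.put("corpora/private/en-sv", "type", "parallel")
store.put("corpora/private/en-sv", "langs", "en", "sv")
store.put("corpora/private/en-sv", "owner", "alice")

# Query 1: all public parallel data with English as source or target language.
english_parallel = store.find("guest", type="parallel", langs="en", access="public")

# Query 2: all monolingual news data sets larger than 500 sentences.
big_news = store.find("guest", type="monolingual", domain="news",
                      size=lambda sizes: any(s > 500 for s in sizes))
```

Storing the language as a set is what lets the first query ignore translation direction, and because the visibility check runs before the conditions are evaluated, the private English-Swedish corpus is returned only when its owner issues the query.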

References

F.J. Och, H. Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1): 19-51.
P. Koehn, M. Federico, B. Cowan, R. Zens, C. Dyer, O. Bojar, A. Constantin, E. Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation, in Proceedings of the ACL 2007 Demo and Poster Sessions, pages 177-180, Prague.
A. Levenberg, M. Osborne. 2009. Stream-based Randomised Language Models for SMT, in Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing.
