LetsMT!: Cloud-Based Platform for Building User Tailored Machine
Andrejs Vasiļjevs, Tilde, Vienibas gatve 75a, Riga, LV1004, LATVIA, email@example.com
Raivis Skadiņš, Tilde, Vienibas gatve 75a, Riga, LV1004, LATVIA, firstname.lastname@example.org
Jörg Tiedemann, Uppsala University, Box 635, Uppsala, SE-75126, SWEDEN, jorg.tiedemann@

Abstract

To fully exploit the huge potential of existing open SMT technologies and user-provided training content, we have created an innovative online platform for data sharing and MT building. This platform is being developed in the EU collaboration project LetsMT!. This paper presents the motivation for developing this platform, its architecture and main features.

1 Introduction

The goal of the LetsMT! project is to facilitate the use of open source SMT toolkits and to involve users in the collection of training data. This will result in populating and enhancing the currently most progressive MT technology and making it available and accessible for all categories of users in the form of sharing MT training data and building tailored MT systems for different languages on the basis of the online LetsMT! platform. The LetsMT! project extends the use of existing state-of-the-art SMT methods, enabling users to participate in data collection and MT customization to increase the quality, scope and language coverage of MT. Currently LetsMT! is creating a cloud-based platform that gathers public and user-provided MT training data and generates multiple MT systems by combining and prioritizing this data.

The LetsMT! Consortium includes the project coordinator Tilde, the Universities of Edinburgh, Zagreb, Copenhagen and Uppsala, the localization company Moravia and the semantic technology company SemLab. The project started in March 2010 and should achieve its goals till September

2 Applying user-provided data for SMT

The number of open source parallel resources is limited, and this is an essential problem for SMT, since translation systems trained on data from a particular domain, e.g. parliamentary proceedings, will perform poorly when used to translate texts from a different domain, e.g. news articles. At the same time, a huge amount of parallel texts and translated documents is at the users' disposal and can be used for SMT system training. Therefore, the LetsMT! online platform provides all categories of users (public organizations, private companies, individuals) with an opportunity to upload their proprietary resources to the repository and to receive a tailored SMT system trained on these resources. The latter can be shared with other users who can exploit them further on. Data and SMT model sharing can be managed by the users. In LetsMT! we emphasize data integrity and security, which makes it possible to work with proprietary collections as well as public sources.

The motivation of users to get involved in sharing their resources is based on the following factors:
• participate and contribute, in a reciprocal manner, with a community of professionals and its goals;
• achieve better MT quality for user specific texts;
• build tailored and domain specific MT;
• enhance reputation for individuals and businesses;
• ensure compliance with the requirement set forth by EU Directive to provide usability of public information in a convenient way for public institutions;
• deliver a ready resource for study and teaching purposes for academic institutions.

The LetsMT! project is advancing the concept of data sharing, which implies the practice of making data used in one activity available to other users. The LetsMT! platform provides the following key features:
• Uploading of parallel texts for users that will contribute their content;
• Directory of web and offline resources gathered by LetsMT! users;
• Automated training of SMT systems from specified collections of training data;
• Custom building of MT engines from selected pools of training data;
• Custom building of MT engines from proprietary non-public data;
• evaluation facilities.

3 Architecture overview

Figure 1 illustrates the general architecture of the LetsMT! platform. Its components for SMT training, parallel data collection and data processing are described further down in this paper. The development of the system was particularly facilitated by the open-source alignment tool GIZA++ (Och and Ney 2003) and the MT training and decoding tool Moses (Koehn et al. 2007).

Figure 1 General architecture of the LetsMT! platform.

LetsMT! translation services can be used in several ways: through the web portal, through a widget provided for free inclusion in a web page, through browser plug-ins, and through integration in computer-assisted translation (CAT) tools and different online and offline applications. Localisation and translation businesses as well as other professional translators can use the LetsMT! platform for uploading their parallel corpora to the LetsMT! website, building custom SMT solutions from the specified collections of training data, and accessing these solutions in their productivity environments (typically, various CAT tools).

The LetsMT! system has a multitier architecture. It has (i) an interface layer implementing the user interface and APIs with external systems; (ii) an application logic layer for the system logic and (iii) a data storage layer consisting of file and database storage. The LetsMT! system performs various time and resource consuming tasks; these tasks are defined by the application logic and the data storage and are sent to a High Performance Computing (HPC) Cluster for execution.

The interface layer provides interfaces between the LetsMT! system and external users. The system has both human and machine users. Human users can access the system through web browsers by using the LetsMT! web page interface. External systems such as CAT tools and browser plug-ins can access the LetsMT! system through a public API. The public API is available through both REST/JSON and SOAP protocol web services. Some CAT tools or other external systems may require different interfaces; they might be introduced if necessary. The HTTPS protocol is used to ensure secure user authentication and secure data transfer.

The application logic layer contains a set of modules responsible for the main functionality or logic of the system. It receives queries and commands from the interface layer and prepares answers or performs tasks using the data storage and the HPC cluster. This layer contains several modules such as the Resource Repository Manager, the User Manager, the SMT Training Manager etc. The interface layer accesses the application logic layer through both REST/JSON and SOAP protocol web services. The same protocols are used for communication between modules in the application logic layer.

The LetsMT! system as a data sharing and MT platform is able to store large amounts of SMT training data (parallel and monolingual corpora) as well as trained models of SMT systems. The data is stored in one central Resource Repository (RR). The RR is also used to store various tools necessary for data processing and SMT training. As training data may change (for example, grow), the resource repository is based on a version-controlled file system (currently we use SVN as the backend system). A key-value store is used to keep metadata and statistics about training data and trained SMT systems. Modules from the application logic layer and the HPC cluster access the RR through a REST-based web service interface.

A High Performance Computing Cluster is used to execute many different data processing tasks and to train and run SMT systems. Modules from the application logic and data storage layers create jobs and send them to the HPC cluster for execution. The HPC cluster is responsible for accepting, scheduling, dispatching, and managing the remote and distributed execution of large numbers of standalone, parallel or interactive jobs. It also manages and schedules the allocation of distributed resources such as processors, memory and disk space. The LetsMT! HPC cluster is based on the Oracle Grid Engine (formerly Sun Grid Engine, SGE). The HPC cluster accesses data stored in the data storage layer using the RR API.

The hardware infrastructure of the LetsMT! platform is heterogeneous. The majority of services run on Linux platforms (GIZA++, Moses, Resource Repository, data processing tools). The Web server and application logic services run on a Microsoft Windows platform.

The system hardware architecture is designed to be highly scalable. The LetsMT! platform contains several machines with both continuous and on-demand availability:
• Continuous availability – the core frontend and backend services that guarantee LetsMT! webpage and external API availability;
• On-demand availability – training, translation and data import services (HPC cluster nodes); additional frontend and backend server instances to increase availability.

4 Application of the Moses SMT toolkit

A significant breakthrough in SMT was achieved by the EuroMatrix project. The project objectives included the creation of translation systems for all pairs of EU languages and the development of open source MT technology including research tools, software and data collections. Its result is the improved open source SMT toolkit Moses developed by the University of Edinburgh. The Moses SMT toolkit is a complete statistical translation system distributed under the Lesser General Public License (LGPL). Moses includes all the components needed to pre-process data and to train language and translation models (Koehn et al. 2007). Moses is widely used in the research community and has also reached the commercial sector. While the use of the software is not closely monitored (there is no need to sign a license agreement), Moses is known to be in commercial use by companies such as Systran, Asia Online, Autodesk, Matrixware and Translated.net. The LetsMT! project coordinator Tilde bases its free online Latvian MT system on the Moses platform.
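
As a rough illustration of what training a translation model with Moses involves, the sketch below assembles one call to the `train-model.perl` script that ships with Moses. The option names are those of the Moses training script; the `TrainingRequest` class, its fields and all paths are hypothetical names introduced only for this example and are not part of the LetsMT! code base.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrainingRequest:
    """Hypothetical container for the variables a user would specify."""
    corpus_stem: str          # e.g. "corpus/news.clean" for news.clean.lv / news.clean.en
    source_lang: str          # file extension of the source side
    target_lang: str          # file extension of the target side
    lm_path: str              # path to the target-side language model
    lm_order: int = 3         # n-gram order of the language model
    work_dir: str = "train"   # directory where Moses writes its output
    tools_dir: str = "tools"  # directory holding the word alignment binaries


def build_train_command(req: TrainingRequest,
                        script: str = "train-model.perl") -> List[str]:
    """Build the argument list for one Moses training run (nothing is executed)."""
    return [
        script,
        "--root-dir", req.work_dir,
        "--corpus", req.corpus_stem,
        "--f", req.source_lang,
        "--e", req.target_lang,
        "--alignment", "grow-diag-final-and",
        "--reordering", "msd-bidirectional-fe",
        # Language model given as <factor>:<order>:<filename>
        "--lm", f"0:{req.lm_order}:{req.lm_path}",
        "--external-bin-dir", req.tools_dir,
    ]
```

The returned list could be handed to `subprocess.run` or written into a batch job script; how LetsMT! actually wraps the call is not described in this paper.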

LetsMT! uses Moses as a language independent SMT solution and integrates it as a cloud-based service into the LetsMT! online platform. One of the important achievements of the LetsMT! project will be the adaptation of the Moses toolkit to fit into the rapid training, updating, and interactive access environment of the LetsMT! platform. The SMT training pipeline implemented in Moses currently involves a number of steps that each require a separate program to run. In the framework of LetsMT! this process will be streamlined and made automatically configurable given a set of user-specified variables (training corpora, language model data, dictionaries, tuning sets).

Additional important improvements of Moses that are being implemented by the University of Edinburgh as part of LetsMT! are the incremental training of MT models, randomised language models (Levenberg and Osborne 2009), and separate language and translation model servers. We expect some users to add relatively small amounts of additional training data at frequent intervals. The incremental training will benefit from the addition of these data without re-running the entire training pipeline from scratch.

5 LetsMT! Resource repository

Figure 2 illustrates the general architecture of the resource repository and its integration into the LetsMT! platform. The LetsMT! resource repository has a web API that is implemented as a REST service with HTTP requests. The Web API gives access to the LetsMT! resource repository, which consists mainly of a revision control system (Subversion), a database (TokyoCabinet) and a batch-queuing system (SGE, Oracle Grid Engine). The purpose of the Web API is to enable interaction with the repository system for uploading and downloading data, requesting and searching information and triggering batch processes. The LetsMT! resource repository system is implemented in Perl and uses the Apache server and mod_perl to handle the requests and responses to and from the client system.

Figure 2. Resource repository overview

All data sets of the LetsMT! platform are stored in a revision control system. In the current implementation, we use Subversion (SVN). However, the software is modular and another version control system may replace SVN or even work side-by-side with other storage backends.

Revision control systems are designed for dynamic repositories of textual data in multi-user environments. They typically store all repository modifications and provide tools for tracking the file history for any item in the repository. Furthermore, they naturally support data sharing and make it possible to revert to specific versions. Modifications are stored efficiently by keeping track of changes only. All of this makes them well suited for our needs, in which growing resources may be accessed by multiple users.

An important design goal for developing the repository software was to allow arbitrary metadata in terms of key-value pairs stored together with resources in the repository. The focus was set on flexibility, so that new fields and data sets in various formats can easily be added to the database during development. It has to be possible to attach appropriate metadata to any resource at any location in the repository. Another important feature is that this database should still be powerful enough to allow complex search queries over the entire repository, which reflects a hierarchical file structure. At the same time, the system has to respect permissions set on individual resources so that restricted material cannot be found. Standard relational database management systems do not support this degree of flexibility as they rely on pre-defined relations (tables) with fixed data types and operations over them. A recent trend is, therefore, to move from relational databases with SQL-like queries to schema-less key-value stores that do not require a fixed data model. The general idea of such a store for metadata is presented in the next section followed by some details about implementation choices in our repository package.

A key-value store basically stores arbitrary data (values) by use of a single key. This conceptually simple strategy allows a lot of flexibility in terms of data storage without pre-defined schemas and data models. In relation to the resource repository we would like to attach arbitrary key-value pairs to any resource in the repository. Various kinds of information shall be stored in this way, ranging from descriptive data (textual domain, ownership, language, size etc.) to status information (import/conversion status, etc.) and internal information used by the LetsMT! frontend or repository backend. Furthermore, we also need support for repeated keys, or rather, keys that may contain several values.

Important here is that the database is not restricted to a list of pre-defined keys. In our system, arbitrary keys can be added containing arbitrary values. Furthermore, values can also be interpreted as unordered lists, for example, in the case of language. Using these data sets we are able to ask complex queries such as:
• Give me all public parallel data with English as either source or target language,
• Give me all monolingual data sets from the news domain that are larger than 500 sentences.

Our key-value store is able to process such queries and to return matching resources and their associated metadata entries. Furthermore, we store permission information together with all data records to filter the data according to access restrictions. The backend system we use is based on TokyoCabinet (https://fallabs/tokyocabinet), a freely available software package that implements an efficient database management system with all the flexibility required by our platform.

Another important feature of the Resource Repository software is the support of data import, validation and conversion. Users may upload their data sources in a variety of formats that will automatically be processed by our validation and conversion tools. The software also includes a sentence alignment module that makes it possible to create new parallel resources for SMT training from scratch. In the current implementation we support the following data formats with dedicated import handlers: aligned parallel data in TMX, XLIFF and Moses formats, monolingual text documents in PDF, Text and DOC formats, compressed data and archives in gzip, zip and tar formats. Support for additional formats may be added in future releases.

6 Conclusions

Current development of SMT tools and techniques has reached the level where they can be implemented in practical applications addressing the needs of large user groups in a variety of application scenarios. The work in progress that is described in this paper promises important advances in the application of SMT by integrating available tools and technologies into an easy-to-use cloud-based platform (https://demo.letsmt.eu) for data sharing and generation of customized MT.

The successful implementation of the project will enable wider use and greater impact of available open-source SMT technologies and facilitate diversification of free MT by tailoring it to specific domains and user requirements.
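
The metadata queries discussed in Section 5 can be made concrete with a small in-memory sketch. It is not the TokyoCabinet-based implementation used by the repository; it only illustrates schema-less keys, multi-valued keys stored as unordered sets, and permission filtering. The key names (`type`, `language`, `domain`, `size`, `readers`), the class and the resource paths are all assumptions made for this example.

```python
from typing import Callable, Dict, List, Set

Metadata = Dict[str, Set[str]]  # every key maps to an unordered set of values


class MetadataStore:
    """Toy schema-less store: arbitrary keys, several values per key."""

    def __init__(self) -> None:
        self._records: Dict[str, Metadata] = {}

    def put(self, path: str, key: str, *values: str) -> None:
        # Any key may be added at any time; repeated keys accumulate values.
        self._records.setdefault(path, {}).setdefault(key, set()).update(values)

    def find(self, user: str, predicate: Callable[[Metadata], bool]) -> List[str]:
        hits = []
        for path, meta in self._records.items():
            readers = meta.get("readers", set())
            # Permission filter: restricted resources are never returned.
            if "public" not in readers and user not in readers:
                continue
            if predicate(meta):
                hits.append(path)
        return sorted(hits)


def public_parallel_with(lang: str) -> Callable[[Metadata], bool]:
    """All public parallel data with `lang` as either source or target."""
    return lambda m: ("parallel" in m.get("type", set())
                      and lang in m.get("language", set())
                      and "public" in m.get("readers", set()))


def monolingual_news_larger_than(n: int) -> Callable[[Metadata], bool]:
    """All monolingual news data sets with more than `n` sentences."""
    return lambda m: ("monolingual" in m.get("type", set())
                      and "news" in m.get("domain", set())
                      and any(int(v) > n for v in m.get("size", set())))


def build_demo_store() -> MetadataStore:
    store = MetadataStore()
    store.put("corpora/europarl-en-lv", "type", "parallel")
    store.put("corpora/europarl-en-lv", "language", "en", "lv")
    store.put("corpora/europarl-en-lv", "readers", "public")
    store.put("corpora/acme-de-en", "type", "parallel")
    store.put("corpora/acme-de-en", "language", "de", "en")
    store.put("corpora/acme-de-en", "readers", "acme")  # proprietary data
    store.put("corpora/news-lv", "type", "monolingual")
    store.put("corpora/news-lv", "domain", "news")
    store.put("corpora/news-lv", "size", "1200")
    store.put("corpora/news-lv", "readers", "public")
    store.put("corpora/news-small", "type", "monolingual")
    store.put("corpora/news-small", "domain", "news")
    store.put("corpora/news-small", "size", "300")
    store.put("corpora/news-small", "readers", "public")
    return store
```

With this demo data, `build_demo_store().find("guest", public_parallel_with("en"))` selects only the public Europarl resource, while the proprietary `acme` corpus stays invisible to users who are not listed among its readers.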

References

F.J. Och, H. Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1): 19-51.

P. Koehn, M. Federico, B. Cowan, R. Zens, C. Dyer, O. Bojar, A. Constantin, E. Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation, in Proceedings of the ACL 2007 Demo and Poster Sessions, pages 177-180, Prague.

A. Levenberg, M. Osborne. 2009. Stream-based Randomised Language Models for SMT, in Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing.