									                     GNU Mailman, Internationalized

                                       Barry A. Warsaw
                                       Zope Corporation

Abstract

   GNU Mailman is a mailing list management system that has been in
production use since 1998. In December 2002, version 2.1 was released
containing many new features. This paper describes one of the most
important: Mailman 2.1’s internationalization support. Presented here are
the tools that were built and the approaches Mailman took to marking and
translating text, as well as a review of some of the benefits and pitfalls
of Mailman’s solution. Also presented are some future directions for
internationalized Mailman, as well as for other complex Python applications
such as Zope.


1   Introduction

   GNU Mailman was invented by John Viega sometime before 1997, or so the
earliest known archived message on the subject indicates. The earliest hit
for “viega mailman” in Google groups is about the Dave Matthews band
mailing list that John was running [Viega97].

   At the time, [Python.Org] was running a hacked version of Majordomo for
all its special interest group (SIG) mailing lists, but this had two
problems: first, the site was becoming unmaintainable as the administrators
tried to customize new features into Majordomo, and second, it just
wouldn’t do to run Python’s mailing lists on a Perl-based list server.

   When Mailman was first released, [Python.Org] quickly adopted it and has
been using it ever since. Mailman 2.0 marked a milestone in its
development, and version 2.0.13 is quite stable and deployed at thousands
of sites. It runs everything from small special interest group lists to
huge announcement lists, at sites ranging from the commercial (RedHat,
SourceForge, Apple, Dell, SAP, and Zope Corporation), to the hacker
community (XEmacs, Samba, Gnome, KDE, Exim, and of course Python), to
numerous educational organizations and non-profits. There are lots of
hosting facilities providing Mailman services, and increasingly, quite a
few international organizations.

   One of the reasons for the increased interest from the non-English
speaking world is that Mailman 2.1, which had been in development for
about two years, is fully internationalized. Internationalization is the
process of preparing an application for use in multiple locales.
Localization is the process of specializing the application for a specific
locale. For example, during internationalization, all end-user displayable
text in the Mailman 2.1 source code was specially marked as requiring
translation. Mailman 2.1.1 (the latest available patch release at the time
of writing) has been localized to almost 20 natural languages.

   While Mailman 2.1 may appear to be only a minor revision over 2.0.13, it
really represents quite an extensive rewrite. It could easily have been
argued that this version should be called Mailman 3.0. Before describing
the details of the internationalization work, a brief overview of Mailman
is provided, including
a quick tour of some of the other important features in the 2.1 release.


2   What is GNU Mailman?

   “GNU Mailman” (informally referred to as just “Mailman”) is a system
for managing electronic mailing lists. It is implemented primarily in
Python, an object-oriented, very high-level, open source programming
language. Mailing lists are administered by a list owner, and users can
interact with the list, including subscribing and unsubscribing, through
the web and through email. Site administrators can also interact with
Mailman via a suite of command line scripts, or even via the interactive
Python prompt. Mailman is the official mailing list manager of the GNU
project and is available under terms of the GNU General Public License
[GPL].

   Mailman strives for standards compliance, and as such is interoperable
with a wide range of web servers and browsers, and mail servers and
clients. Of the web servers, it requires the ability to execute CGI
scripts, and of mail servers it requires the ability to filter messages
through programs. Apache is probably the most widely used web server for
Mailman, and any of the Big 4 mail servers (Sendmail, Postfix, Qmail, and
Exim) will work just fine. The HTML that Mailman outputs is extremely
pedestrian, so just about any web browser should work with it, as long as
it supports cookies. Mailman should work with any MIME-compliant mail
reader. Mailman works on any Unix-like operating system, such as
GNU/Linux.

   Mailman supports a wide range of features, such as:

  • User selectable delivery modes. Members can elect to receive messages
    immediately, or in batches called digests. Two forms of digests are
    supported: RFC 1153 style plain text digests [RFC1153], and MIME
    multipart/digest style digests. Non-digest deliveries can be
    personalized specifically for the recipient of the message. This means
    that various aspects of the message, such as the header, the footer, or
    the To field, can contain information specific to the member receiving
    the message.

  • Extensive privacy options which allow a list administrator to select
    policies for subscribing and unsubscribing (open, confirmation
    required, or approval required), policies for posting to the list
    (open, moderated, members only, approved posters only), and some
    limited spam defenses.

  • Automatic bounce processing. Bouncing addresses are the bane of any
    mailing list, and Mailman provides two mechanisms for automatic bounce
    detection: regular expression based bounce matching and Variable
    Envelope Return Paths [VERP].
    RFC 3464 [RFC3464] and the older RFC it replaces [RFC1894] describe a
    standard format for bounce notifications. However, many mail systems
    ignore or incorrectly implement this standard. For recognizing bounce
    messages, Mailman has an extensive set of regular expression based
    matches used to dig the bouncing address out of the notice. For
    fool-proof bounce detection, Mailman also supports VERP, a technique
    where the intended recipient’s address as it appears on the mailing
    list is encoded into the envelope sender of the message. Because
    remote mail servers are required to send bounces to the envelope
    sender, Mailman can unambiguously decode the intended recipient’s
    address and register an accurate bounce. Note that technically, VERP
    must be implemented in the mail server, but Mailman’s use of the
    technique is close enough to warrant the label.

  • Archiving. Mailman comes bundled with an archiver called Pipermail.
    Pipermail’s chief advantages are that it comes bundled, that it is
    implemented in Python, and that in Mailman 2.1 it is
    internationalized, allowing the display of messages in alternative
    languages and character encodings. Its primary disadvantages are that
    it doesn’t support searching and isn’t very customizable. Mailman is
    easily integrated with external archivers.

  • A mail to news gateway. Mailman can be configured to gateway lists to
    and from Usenet newsgroups. For example, the comp.lang.python
    newsgroup is gatewayed to a mailing list. Even moderated lists, such
    as comp.lang.python.announce, can be gatewayed, with Mailman serving as
    the moderation tool.

  • Auto-responder, content filtering, and topics. The auto-responder can
    be set up to send a canned message whenever someone posts to the list,
    or emails the list owner or -request robot. Content filtering allows
    the list owner to explicitly filter or pass specific MIME
    (Multipurpose Internet Mail Extensions [RFC2045]) content types.
    Topics allow the list owner to assign incoming messages to any of a
    configurable number of groups, and members can “subscribe” to a
    specific topic, receiving only the subset of list traffic that matches
    the desired topics.

  • Virtual domains. Mailman can be used on a mail server that supports
    multiple virtual domains. For example, two mail domains can be run on
    the same machine, from the same Mailman installation. The one
    limitation in Mailman 2.1 is that a mailing list with the same name
    may not appear in more than one domain. This restriction will be
    lifted in future versions.

   Mailman also provides each list with its own home page (called a
“listinfo” page) which can be customized through the web. Mailing lists
can be automatically created and deleted through the web (with proper
support from the mail server). Mailman also provides web-based approval of
moderated messages and subscriptions. There are a host of other smaller
new features in Mailman 2.1 which won’t be described in this paper.


3   Internationalization Issues

   The new features in Mailman 2.1 are extensive, but the most visible
addition is the support for multiple natural languages. This means that
all the administrative and publicly visible web pages, all the email
notifications, and even the built-in archiver can be configured to produce
text in any of nearly 20 natural languages out of the box. A large part of
the re-architecting of Mailman for 2.1 has been to provide a framework for
easily adding new natural languages as they become available from
volunteer translation teams.


3.1   Message IDs

   Not every string in an application needs to be translated. For example,
some strings are used as keys in dictionaries, or represent mail headers,
or contain HTML tags. To make the proper distinction we refer to strings
that are intended for human readability as “text” or “messages”. One of
the most labor intensive parts of internationalizing an existing code base
such as Mailman’s is to go through every string in the software and
distinguish messages from ordinary strings. In addition to the
non-translatable strings described above, the decision was made not to
translate log messages, since these are not intended for the end-user, and
translating them would make debugging in a global community more
difficult.

   Each message that is to be translated needs to have four pieces of
information at runtime in order to calculate the translated text: the
application domain, the message id, the default text, and the target
locale. Because Mailman is a fairly self-contained application, there is
only one static domain, the “mailman” domain, which never changes during
the life of the program’s execution.

   The message id and default text are two related, but distinct concepts.
The message id uniquely identifies the textual message to be displayed to
the user. The message id names the message but it may not necessarily be
the message. It is the message id which is the primary key into a
translation catalog dictionary.

   The default text is the text to use as the translation of the message
id when the id is not found in the translation catalog. Because
coordinating 20 different language teams is a project management
challenge, it is common for some language catalogs to lag behind the
source code development. Mailman releases are rarely delayed so that
language teams can catch up (although advance notice of impending releases
is usually given). It is often the case, therefore, that a particular
message id won’t be found in a specific language catalog. The default text
is the fallback to use in this case.

   As an example, suppose a web form had a Delete button. The message id
for the button might be something like “form27-delete-button”, while the
default text might be “Delete”.

   Message ids may be explicit or implicit. In the above example
“form27-delete-button” is an explicit message id. While it uniquely
identifies the message to be used, it does not contain any text that will
be displayed to the user. The advantage of explicit message ids is that
they are immune to minor typos or formatting changes (e.g. whitespace or
punctuation additions or deletions). The disadvantages of explicit message
ids are two-fold: they require an extra catalog mapping message ids to the
default language (e.g. English in Mailman’s case), and they make the source
code less readable. The latter is the more serious consequence; since
nearly all human readable text in Mailman exists in Python source code,
using explicit message ids would make the code nearly unreadable. A
developer would have to consult the English catalog several times for some
lines of code.

   The alternative approach is to use implicit message ids, where the
message id serves a dual purpose as the default text. Thus the human
readable text that appears in the Python source code is first used as the
message id, and if that fails to find a translation, it is used as the
default text. While this has the advantage of making the source code more
readable and easier to develop, it has several disadvantages. First, a
message such as “Delete”, which has one spelling in English, may be
translated to one of several different words in another language,
depending on the context. This poses a problem for the translator because
the message id “Delete” may appear a dozen times in the application, but
may require several different words in the target language. Also, minor
changes in formatting or punctuation change the message id, which requires
a re-translation (this may be considered an advantage because changes in
punctuation can cause semantic differences, requiring a re-translation
anyway).

   There is no perfect solution, but Mailman has decided to use implicit
message ids because of the source code readability advantages. This
occasionally requires negotiation between the application developers and
the translation teams to choose appropriate and distinguishable message
ids, and imposes a sort of inertia against changing existing text in the
source code. One way to alleviate these problems in future releases would
be to use a mix of implicit and explicit message ids, where implicit ids
are used predominantly, but in rare cases explicit ids (along with a
partial English catalog) are used to resolve ambiguities.


3.2   The Locale

   Internationalizing a web-based application is much more complicated
than internationalizing a command line program such as ‘ls’ because the
natural language context (i.e. the “locale”) is determined by the web
request, or the email message being processed, instead of by the user’s
shell. In Mailman, the locale is dynamic and fluid; there may in fact be
several locales needed to process any particular email message. Most of
the existing techniques for internationalizing programs assume a static
locale and a single domain. Mailman inherits the single domain tradition
of these tools, but it uses dynamic techniques to calculate the
translation locale.

   We use the terms “locale” and “language” interchangeably below,
although this is not completely accurate. A locale describes much more
than the language used; it also defines
the character encoding, as well as the formatting of dates, numbers,
currency, etc. However, since Mailman does not currently support the
localization of data such as dates, the language selection is the most
important aspect of the active locale. Typically (although not
exclusively) a single character encoding is used for a single language.

   At any given point in the processing, the following locales may exist.

  • The source code language. Mailman is developed in English, so by
    default, the English text is always available. Translations can be,
    and often are, incomplete, and the English message is the global
    fallback.

  • The site default language. Of the nearly 20 languages that Mailman
    supports out of the box, the site administrator can choose one as the
    “site default language”. When no other locale is known, the site
    default will be used.

  • The list default language. Every mailing list has a default language
    as well as a set of alternatively supported languages. The list
    default language is used for all the administrative pages. It is also
    the language used when the list context is known and there is no
    overriding context.

  • The page default language. The list administrator can also choose to
    let the list support any other language allowed by the site
    administrator. A user browsing the list overview page can choose to
    view that page in any of those languages, by selecting that language
    in a pop up menu on the list’s public web pages.

  • The user’s preferred language. The user can also choose one of the
    list supported languages to be their preferred language, by making
    this choice in their preferences page. All email notices that Mailman
    sends out to the user, and any web pages that the user views when
    logged in, are displayed in the user’s preferred language.

   To support these multiple language contexts, Mailman uses an
object-oriented approach where the locale is represented by an instance of
a class. While this may seem natural, it is actually an elaboration of the
global translation contexts in classic internationalized programs. This
will be described in more detail later.


3.3   Character Encodings

   Above and beyond the natural language issues, character encoding issues
are probably the most vexing for the Mailman developers. “Character
encoding” is usually referred to as the character set or charset, after
the email header parameter described in RFC 2045.

   A naive view would create a one-to-one correspondence between language
and charset. For example, you might say that all Spanish text should be
rendered in the iso-8859-1 (Latin-1) character set [ISOSoup]. However,
even this simple example isn’t accurate, because the Euro sign is missing
from iso-8859-1 and is available in iso-8859-15 instead.

   The problem is exacerbated by some Asian languages. Japanese, for
example, may appear in any of euc-jp, iso-2022-jp, or shift-jis, and the
charset may differ depending on whether the text appears in a web browser
or in an email message. In fact, Mailman 2.1’s naive approach causes some
problems for Japanese users, especially when an email message is displayed
as a web page in the archiver. This will be fixed in a future release.

   Usually, English text uses the us-ascii character set, but for maximum
interoperability, a list conducted in English may still want to be aware
of Latin-1 characters. Mailman has to be careful when combining characters
in different charsets, especially those for which us-ascii is not a
subset.

   For example, say a Spanish list received a message in Turkish, which
uses Latin-5 (a.k.a. iso-8859-9). When that message is archived, some
parts of the HTML page for the message will be in iso-8859-1 and other
parts will be in iso-8859-9. But since HTML is inadequate at allowing
multiple charsets in a single web page, the characters in one or the other
of those charsets must be converted to HTML entities, using their Unicode
equivalent.

   Multiple character set issues can also arise in the processing of email
messages. Say for example that a message to a German list arrives in
Japanese. Mailman has a feature called “headers and footers” which allows
the list administrator to add some canned text to the start and end of a
message (e.g. “To unsubscribe, click here”). Previous versions of Mailman
would simply paste the header and/or footer around the original message
body. This was broken for several reasons. The most obvious one is that if
the message is really a Base64 encoded image, adding some spurious ASCII
text around the original body would break the decoding. In addition, if
the message contained text in a different character set than the header or
footer text, concatenation may render the original body unreadable. The
solution requires careful examination of the original message, and in the
extreme, ripping apart and reconstituting the structure of the original
message, so that the headers and footers will always be added in a
MIME-safe way.

   Internationalization standards for email and HTML are defined in a
series of RFCs, and these must be adhered to. For example, the most
fundamental email RFC is 2822 [RFC2822] (which recently superseded RFC
822). This RFC describes the structure of an email message, but it is
naive in its ASCII bias. RFCs 2045 through 2047 were added to address the
use of multilingual character sets in email messages. RFC 2047 [RFC2047]
was added to describe how non-ASCII characters are to be encoded in
Subject fields and in other email headers. Mailman must be able both to
interpret email messages with RFC 2047 encoded headers, and to produce
properly formatted ones when necessary. The challenge is to parse well
intentioned, but erroneously encoded headers (to give the benefit of the
doubt). These types of errors are all too common in email messages found
in the wild, and Mailman must be made robust against these types of poorly
formed messages.

   Prodded by these various issues, a comprehensive email package [Email]
was developed and added to Python 2.2. The email package is compliant with
all the relevant MIME RFCs, as well as other mail related standards.


3.4   Message Catalogs

   GNU gettext [Gettext] is a widespread formal model for supporting
multiple languages in traditional C applications. Gettext encourages the
use of implicit message ids. This leads to a rhythm whereby the C
programmer marks translatable text in the source code by wrapping it in a
function call. The function is usually _(), called “the underscore
function”, and it has both a run-time behavior and an off-line purpose. At
run-time, the underscore function performs the lookup of the message id in
a global language catalog. There is also an off-line tool which searches
all the source code for marked strings, extracting them and placing them
in a message catalog template, called a .pot file.

   GNU gettext contains both a C library and a suite of tools provided by
The Translation Project [TranslationProject] to manage internationalized
programs. The message extraction tool is called xgettext. While newer
versions of xgettext understand Python source code to some degree, a
pure-Python version of the program called pygettext was developed and is
distributed with Python. pygettext has some additional benefits, including
the ability to extract Python docstrings which may not be marked with the
underscore function.

   Mailman has adopted the gettext model of marking and translating source
strings, and to that end, a GNU gettext-like standard module was
implemented for Python [GettextModule]. While the gettext module
implements the same global translation model as the C library, two
elaborations were necessary for a more Pythonic interface.

   First, for long running daemon processes such as Mailman 2.1’s mail
processor, multiple language contexts are required, so the global state
implied by gettext isn’t always appropriate. Here’s an example to
illustrate why.
   When a new member subscribes to a mailing list, two notification
messages can be sent. One is a welcome message sent to the member, and the
other a new member notification sent to the list administrator. If the
list’s preferred language is Spanish, but the user prefers German, these
two notifications will be sent out in two different languages. Since a
single process crafts and sends both notifications, simply using _()
wrapping doesn’t give enough information. Which language should the
underscore function translate its message id to?

   Python solves this problem by providing an object-oriented API in
addition to gettext’s traditional functional API. Using the object
interface, a program can create instances which represent the translation
context; in other words, a single target language catalog is fully
encapsulated in an object. For convenience, this object can be stored in
some global context, and in the Mailman source, this global object can be
saved and restored as necessary. Here is a simplified Python example:

    # The list's preferred language is in effect right now
    saved = i18n.get_translation()
    try:
        # Switch to the user's language for the welcome message
        i18n.set_language(users_preferred_language)
        send_user_notification()
    finally:
        # Restore the list's language context; set_translation() is
        # assumed here as the counterpart of get_translation()
        i18n.set_translation(saved)

   The second problem might be termed syntactic sugar or simple
convenience, but it turns out to be extremely important in a Python
program filled with translatable text. Python strings support variable
substitution (also called “interpolation”), whereby a dictionary can be
used to supply the substitutions. For example:

    listname = get_listname()
    member = get_username()
    d = {'listname': listname,
         'member': member,
         }
    print(_('%(member)s has been '
            'subscribed to %(listname)s') % d)

   This is a critically important feature for internationalized programs
because some languages may require a different order of the substitutions
to be grammatically correct. While stock Python supports this requirement,
its implementation leads to overly verbose code. In the above example,
we’ve written the words “listname” and “member” four times each. Now
imagine that level of verbosity duplicated a hundred times per source
file. “Tedious” comes to mind!

   Mailman solves this by providing its own underscore function, which
wraps the gettext standard function, but provides a little bit of useful
magic by looking up substitution variables in the local and global
namespace of the caller. Using Mailman’s special underscore function, the
above code can then be rewritten as:

    listname = get_listname()
    member = get_username()
    print(_('%(member)s has been '
            'subscribed to %(listname)s'))

   While the average Perl programmer might ask what all the fuss is about,
the Python programmer will notice something interesting: there’s no
interpolation dictionary and no modulus operator. The dictionary is
created from the namespaces of the caller of the underscore function,
which contain the “listname” and “member” local variables. The trick is
that the underscore function uses a little known Python function called
sys._getframe() to capture the global and local namespaces of the caller
of underscore. It then puts these in an interpolation dictionary, with
local variables overriding global variables, and then applies the modulo
operator to the translated string, using this dictionary.

   Marked translatable texts are used all over Mailman, and we run
pygettext over all the source code to produce a gettext compatible
mailman.pot catalog file. To translate this to a new language, the
translation team would start by copying mailman.pot to
messages/xx/LC_MESSAGES/mailman.po where “xx” is the language code for the
new language. From here, standard tools such
as po-mode for Emacs or KDE’s kbabel can be used to provide
translations for all the source message ids. Then, standard gettext
tools can be used to generate a binary file, which Python’s gettext
module can read. In this way, internationalized Python programs can
leverage most of the tools translation teams normally use for C
programs. Translators don’t have to learn new tools just to translate
Python programs.


3.5   Templates

While gettext-style message texts in Python source code are essential
for an internationalized Mailman, they aren’t always appropriate. For
example, Mailman has always used templates as a way of conveniently
representing full web pages or parts of email messages. These templates
provide an easy way for site administrators to customize the look of
the Mailman web pages, or the text sent out under various
circumstances.

In an internationalized Mailman, the templates serve another purpose:
they are the mechanism for providing language-specific versions of
that text. Mailman uses almost 50 templates for various purposes, and
of course provides the English versions of the templates as a default.
Each supported language provides its own version of the templates, and
Mailman has a defined search order for template lookup. For example, if
Mailman were to display the public list overview page for a mailing
list, it would search for the listinfo.html page in the following
locations (relative to the installation directory):

  • The list-specific language directory

  • The virtual domain-specific language directory
    templates/ name/language/template

  • The site-wide language directory
    templates/site/language/template

  • The global default language directory
    templates/language/template

The first location to yield the desired template wins. Thus, as with
the gettext catalogs, English is always available as a fallback.

Templates, like marked translatable source code text, support variable
substitutions, using the same syntax. With templates, an explicit
substitution dictionary is always provided, and the interpolation is
performed after the template is located.

While the template system works well enough, its coarseness is a
serious drawback. For example, say a new feature required the addition
of an HTML button on one of the templates. While this is trivial to do
for the English template, changing the English template means all the
other templates are out of date. The translation teams must follow up
with new versions of the modified templates, or other languages will
lag behind the English version.

One solution for templates might be something like Zope Page Templates
(ZPT) [Pelletier], and specifically, internationalized ZPT [I18NZPT].
Internationalized ZPT combines the best of gettext and templates by
allowing the template author to design the template, marking sections
of the template as translatable text. Another extraction tool can then
run over the ZPT file and add the translatable messages to the overall
catalog. This has the huge advantage that structural changes to a
template don’t require the translation teams to do any work. Changes to
content messages in the template simply mean that one phrase may be out
of date, but the whole template won’t be invalidated.


4   Unicode

Python has two types of string objects, traditional 8-bit byte data
strings and Unicode character strings. Python also has literal forms
for each string type; quoted text is defined to be an 8-bit string
unless the leading quote is prefixed with a “u”, in which case it is a
Unicode string. Because strings can come into Mailman in a variety of
ways (e.g. through the web, an email message, or a
message catalog), the code must be prepared to handle encoded 8-bit
strings and Unicode strings. Encoded 8-bit strings must be converted to
Unicode via the unicode() built-in function in order to properly
combine strings using concatenation or interpolation. In addition,
Unicode strings must be re-encoded when printing them to certain
streams, such as the log files or standard output, but these encoding
operations must watch out for unsupported characters. For example, if a
Unicode string containing Latin-1 characters is printed to an
ASCII-only terminal, an exception can be raised due to the non-ASCII
characters in the string.

There is no doubt that character conversion issues have been the
thorniest and most common bugs reported on Mailman 2.1 to date. While
many issues have been fixed, the most important lesson learned is that
Mailman should convert all text (not necessarily all strings!) to
Unicode at the earliest possible time, ideally when the text enters the
system. Mailman should use Unicode strings everywhere internally,
converting to encoded 8-bit strings only where needed, and only at the
last possible moment. Analysis will still be needed to decide how to
handle conversion errors, such as those described above. In Python, the
conversion function can be given an additional argument which specifies
how strict the conversion should be, e.g. raise an exception if illegal
characters are found, throw the illegal characters away, or substitute
a question mark for any illegal characters. The exact choice of the
strictness flag will depend on the context in which the conversion is
occurring.


5   Other Issues

There are some operational issues that need to be addressed for an
internationalized application such as Mailman. Care must be taken when
marking the source code for translation so that the text is split in a
grammatically clear way. For example, whenever possible full sentences
should be used, since translating sentence fragments may not be
possible in all languages. Also, plural forms and genders pose
particularly thorny problems. Python 2.3’s gettext module supports
plural forms, but only alpha releases of Python 2.3 have been made
available as of this writing. English doesn’t have gendered nouns, and
sometimes English source strings need to be rewritten to accommodate
translators.

Python supports a number of specific character encoding “codecs” in the
standard distribution. While Python has built-in support for most
Western codecs, Asian codecs in particular are not supported.
Fortunately, Japanese, Korean, and Chinese codecs are available as
third-party distributions.

Internationalization is a lot more than simply translating strings;
many other values, from currencies to dates, must also be localized if
they are to be displayed correctly for a particular language or
country. Long-term goals include wrapping IBM’s ICU library [ICU] in
Python.

While internationalization imposes some performance overhead, the
effect is negligible. In an application such as Mailman, the
performance of the mail server that Mailman feeds messages to, the
network bandwidth, and the performance of the operating system and file
system have a far greater influence on the performance of the system
than does the Mailman software. Internationalization has imposed no
perceived performance penalty.

Internationalization has increased the size of the software
distribution, since by default the download contains the message
catalogs for all supported languages. The current catalog contains over
1200 message ids and is approximately 228 KB in size. The translated
and compiled catalog files are from 80 to 300 KB in size depending on
the completeness of the translation. In all, the message catalogs
themselves add approximately 16 MB to the uncompressed program source
code. The templates add approximately another 3 MB. For this reason,
future releases of Mailman may provide an English-only distribution,
with separately downloadable language packs.

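The plural-forms support mentioned in this section can be illustrated
with a minimal sketch. It is written for a current Python 3 interpreter
rather than the Python 2 releases discussed in this paper, and it is
not taken from the Mailman source. The helper name subscription_notice
and the message texts are hypothetical. The sketch uses the standard
library's gettext.NullTranslations class, which needs no catalog on
disk and simply falls back to the English source strings; a real
translated catalog would instead apply the plural rule declared by its
own language.

```python
import gettext

# NullTranslations is the standard-library fallback catalog. With no
# real catalog loaded it returns the English source strings, choosing
# the singular form only when n == 1.
catalog = gettext.NullTranslations()


def subscription_notice(n):
    """Hypothetical helper: select a plural form, then interpolate."""
    # Both variants are complete sentences, so a translator never has
    # to assemble a message from fragments.
    template = catalog.ngettext(
        '%(n)d member has been subscribed',
        '%(n)d members have been subscribed',
        n)
    return template % {'n': n}


print(subscription_notice(1))
print(subscription_notice(3))
```

Passing the count to ngettext, instead of testing n == 1 in the
program, leaves the choice of plural form to the catalog, which matters
for languages whose plural rules differ from those of English.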
6   Examples

Here is some sample Python code (reformatted for this paper) taken from
Mailman 2.1 which shows marked messages:

    label = _(categories[category])
    realname = mlist.real_name
    doc.SetTitle(
        _('%(realname)s Administration '
          '(%(label)s)'))
    doc.AddItem(Center(Header(2, _(
        '%(realname)s mailing list '
        'administration<br>%(label)s '
        'Section'))))

Notice first that the local variables “label” and “realname” are
referenced in the default text, and that their values come from the
magic interpolation described above. In a translated message the order
of the substitutions may change, so this ensures that the substitutions
will occur in the grammatically correct location in the translated
message. Also notice that there are actually three uses of the
underscore function. In the second and third, the only function
arguments are strings (in Python, adjacent strings are concatenated by
the lexer). The pygettext rule for message extraction is a single
string inside the underscore function, so both these texts will be
extracted into the message catalog.

The first use of the underscore function is interesting in that it is
getting a value out of a dictionary lookup. This is an example of a
deferred translation, where the underscore function is only used for
its run-time behavior. The text returned by the dictionary lookup is
translated in a normal fashion, but the program source code
categories[category] isn’t extracted into the catalog because it isn’t
a string.

At the place where the categories dictionary is defined, the strings
are also wrapped in an underscore function for pygettext extraction,
but they aren’t translated at that place in the program. We do this in
a number of situations, such as when dictionaries are defined in module
global scope. In that case, you would see something like:

    def _(s): return s

    categories = {
        'cat1': _('Privacy'),
        'cat2': _('Autoreplies'),
        'cat3': _('Topics'),
        }

    _ = i18n._

Here, we’re marking three strings for extraction, but we aren’t
translating them at the point of definition, because the target locale
isn’t known at this time.

Due to space limitations, sample web pages can’t be included; however,
a public internationalized list can be viewed at
You can view this listinfo page in any of the languages supported by
Mailman.


7   Future and Related Work

Internationalized Mailman servers are deployed around the world, and
many of the earliest related bugs have been satisfactorily fixed.
However, the basic architecture used by Mailman may undergo additional
refinement in future releases. In particular, Mailman will be rewritten
to use Unicode internally for all human-readable text. The templates
used in Mailman will likely be redesigned to use something like Zope’s
ZPT, which allows finer-grained control over the evolution of the
templates.

The lessons learned during the Mailman internationalization effort have
been carried forward to the Zope 3 internationalization effort [Zope3].
Zope is a web application server and framework, also written in Python.
Zope’s internationalization efforts are made more complicated by the
fact that it is a framework supporting multiple applications rather
than a single application. This means that while Mailman needs only a
single application domain, Zope may have multiple simultaneous domains,
even in a single translation context. Zope therefore needs a way to
record the domain that a particular message
has come from so that it can be looked up in the proper catalog at
output time. The solution has been to create a MessageID object, which
is a subclass of string that contains the domain as an instance
variable.

Also, in Zope few translatable messages are found in Python source
code. The predominant carrier of human-readable text is the ZPT. Thus
the mechanisms described above for simple interpolation and global
translation contexts aren’t appropriate for Zope. While Zope’s
internationalization efforts are built on the Python tools developed
during Mailman’s internationalization, they will ultimately improve or
expand on these tools.

As part of the Zope 3 internationalization effort, a Translation Web
Service (TWS) has been proposed [ZopeTWS]. While largely science
fiction at the time of this writing, the TWS is a vision of how to
coordinate the project management issues related to
internationalization. One of the most difficult ongoing problems for an
internationalized project is coordinating the output of the software
developers with the translation efforts of the language teams. While
the Translation Project attempts to automate and coordinate much of
this process, the TWS plans to take this aspect a step further by
providing a global web service for truly collaborative translations.
With the TWS, a project manager could upload message templates for
particular domains, and language teams would translate messages at
their own pace. When a new version of the software is prepared for
release, the project manager could then download a snapshot of the
current state of the various translations for the project. The key
advance with the TWS is that once the infrastructure is in place, the
software developers are no longer bottlenecks in the translation
effort, and coordination among translation team members is automatic.


8   Acknowledgments

The author would like to thank Zope Corporation for their support of
this work.

Mailman was originally invented by John Viega, and at various times has
been shepherded, maintained, and developed by Thomas Wouters, Ken
Manheimer, Harald Meland, and Scott Cotton. The author is the current
project leader.

Juan Carlos Rey Anaya and Victoriano Giralt produced the first working
prototypes of an internationalized Mailman, and worked with the author
to design the architecture for supporting internationalization. Others
who provided invaluable contributions to the internationalization
effort include Ben Gertzfield, Martin von Loewis, Simone Piunno, Daniel
Buchmann, Tokio Kikuchi, and Ousmane Wilane. The ACKNOWLEDGEMENTS file
that comes with the Mailman source distribution contains a detailed
list of contributors.


9   Availability

GNU Mailman is free software, covered by the GNU General Public
License. It is available for download from

More information on Mailman can be found at its home page

Mirrors of the Mailman site are at


References

[Viega97]

[Python.Org] Python Home Page,

[GPL] Free Software Foundation, GNU General Public License,

[RFC1153] F. Wancho, Digest Message Format,

[VERP] D. J. Bernstein, Variable Envelope Return Paths,

[RFC3464] K. Moore and G. Vaudreuil, An Extensible Message Format for
    Delivery Status Notifications,

[RFC1894] K. Moore and G. Vaudreuil, An Extensible Message Format for
    Delivery Status Notifications,

[RFC2045] N. Freed and N. Borenstein, Multipurpose Internet Mail
    Extensions (MIME) Part One: Format of Internet Message Bodies,

[ISOSoup] The ISO 8859 Alphabet Soup,

[Gettext] GNU gettext,

[TranslationProject] The Translation Project Site,

[GettextModule] Multilingual internationalization services,

[RFC2822] P. Resnick, Internet Message Format,

[RFC2047] K. Moore, MIME (Multipurpose Internet Mail Extensions) Part
    Three: Message Header Extensions for Non-ASCII Text,

[Email] email – an email and MIME handling package,

[Pelletier] M. Pelletier and A. Latteier, The Zope book, Chapter 5,
    Using Zope Page Templates, ISBN 0735711372.

[I18NZPT] ComponentArchitecture/ZPTInternationalizationSupport

[ICU] International Components for Unicode,

[Zope3] Welcome to the Zope 3 project, ComponentArchitecture/

[ZopeTWS] Translation Web Service,
