                 Machine Translation with a Stochastic Grammatical Channel
                                Dekai Wu and Hongsing Wong
                            Human Language Technology Center
                              Department of Computer Science
                           University of Science and Technology
                               Clear Water Bay, Hong Kong

Abstract

We introduce a stochastic grammatical channel model for machine translation that synthesizes several desirable characteristics of both statistical and grammatical machine translation. As with the pure statistical translation model described by Wu (1996) (in which a bracketing transduction grammar models the channel), alternative hypotheses compete probabilistically, exhaustive search of the translation hypothesis space can be performed in polynomial time, and robustness heuristics arise naturally from a language-independent inversion-transduction model. However, unlike pure statistical translation models, the generated output string is guaranteed to conform to a given target grammar. The model employs only (1) a translation lexicon, (2) a context-free grammar for the target language, and (3) a bigram language model. The fact that no explicit bilingual translation rules are used makes the model easily portable to a variety of source languages. Initial experiments show that it also achieves significant speed gains over our earlier model.

1 Motivation

Speed of statistical machine translation methods has long been an issue. A step was taken by Wu (1996), who introduced a polynomial-time algorithm for the runtime search for an optimal translation. To achieve this, Wu's method substituted a language-independent stochastic bracketing transduction grammar (SBTG) in place of the simpler word-alignment channel models reviewed in Section 2. The SBTG channel made exhaustive search possible through dynamic programming, instead of previous "stack search" heuristics. Translation accuracy was not compromised, because the SBTG is apparently flexible enough to model word-order variation (between English and Chinese) even though it eliminates large portions of the space of word alignments. The SBTG can be regarded as a model of the language-universal hypothesis that closely related arguments tend to stay together (Wu, 1995a; Wu, 1995b).

In this paper we introduce a generalization of Wu's method with the objectives of
  1. increasing translation speed further,
  2. improving meaning-preservation accuracy,
  3. improving grammaticality of the output, and
  4. seeding a natural transition toward transduction rule models,
under the constraint of
  • employing no additional knowledge resources except a grammar for the target language.
To achieve these objectives, we:
  • replace Wu's SBTG channel with a full stochastic inversion transduction grammar or SITG channel, discussed in Section 3, and
  • (mis-)use the target language grammar as a SITG, discussed in Section 4.

In Wu's SBTG method, the burden of generating grammatical output rests mostly on the bigram language model; explicit grammatical knowledge cannot be used. As a result, output grammaticality cannot be guaranteed. The advantage is that language-dependent syntactic knowledge resources are not needed.

We relax those constraints here by assuming a good (monolingual) context-free grammar for the target language. Compared to other knowledge resources (such as transfer rules or semantic ontologies), monolingual syntactic grammars are relatively easy to acquire or construct. We use the grammar in the SITG channel, while retaining the bigram language model. The new model facilitates explicit coding of grammatical knowledge and finer control over channel probabilities. As in Wu's SBTG model, the translation hypothesis space can be exhaustively searched in polynomial time, as shown in
Section 5. The experiments discussed in Section 6                      3     A SITG Channel Model
show promising results for these directions.                           The translation channel we propose is based on
                                                                       the recently introduced bilingual language model-
2    Review: Noisy Channel Model
                                                                       ing approach. The model employs a stochastic ver-
The statistical translation model introduced by IBM                    sion of an inversion transduction grammar or ITG
(Brown et al., 1990) views translation as a noisy                      (Wu, 1995c; Wu, 1995d; Wu, 1997). This formal-
channel process. The underlying generative model                       ism was originally developed for the purpose of par-
contains a stochastic Chinese (input) sentence gen-                    allel corpus annotation, with applications for brack-
erator whose output is "corrupted" by the transla-                     eting, alignment, and segmentation. Subsequently,
tion channel to produce English (output) sentences.                    a method was developed to use a special case of the
Assume, as we do throughout this paper, that the                       ITG (the aforementioned BTG) for the translation
input language is English and the task is to trans-                    task itself (Wu, 1996). The next few paragraphs
late into Chinese. In the IBM system, the language                     briefly review the main properties of ITGs, before
model employs simple n-grams, while the transla-                       we describe the SITG channel.
tion model employs several sets of parameters as                           An ITG consists of context-free productions
discussed below. Estimation of the parameters has                      where terminal symbols come in couples, for ex-
been described elsewhere (Brown et al., 1993).                         ample x/y, where x is an English word and y is a
   Translation is performed in the reverse direction                   Chinese translation of x, with singletons of the form
from generation, as usual for recognition under gen-                   x/ε or ε/y representing function words that are used
erative models. For each English sentence e to be                      in only one of the languages. Any parse tree thus
translated, the system attempts to find the Chinese                    generates both English and Chinese strings simulta-
sentence c* such that:                                                 neously. Thus, the tree:
   \[ c^* = \arg\max_c \Pr(c \mid e) = \arg\max_c \Pr(e \mid c)\,\Pr(c) \]   (1)
                                                                       (1)     [I/~-~ [[took/$-~ [a/-- ε/:~s: book/:~]NP ]VP
                                                                               [for/.~ you/~J~]PP ]VP ]S
   In the IBM model, the search for the optimal c* is                  produces, for example, the mutual translations:
performed using a best-first heuristic "stack search"                  (2)    a. [ ~ [ [ ~ ~   [--:~]NP   ]VP [,,~'{~]PP ]VP ]S
similar to A* methods.                                                        b. [I [[took [a book]NP ]VP [for you]PP ]VP ]S
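The decision rule of Equation (1) can be illustrated with a minimal sketch. This is not the IBM stack search: it simply scores an explicitly enumerated candidate list, and the table names `CHANNEL_PROB` and `LM_PROB`, the candidate labels, and all probabilities are hypothetical values chosen for illustration.

```python
import math

def best_translation(e, candidates, channel_prob, lm_prob):
    """Return the candidate c maximizing Pr(e|c) * Pr(c), scored in log space."""
    def score(c):
        return math.log(channel_prob[(e, c)]) + math.log(lm_prob[c])
    return max(candidates, key=score)

# Hypothetical toy numbers, for illustration only.
CHANNEL_PROB = {("e1", "c_a"): 0.6, ("e1", "c_b"): 0.3}   # Pr(e | c)
LM_PROB = {"c_a": 0.1, "c_b": 0.4}                        # Pr(c)

best = best_translation("e1", ["c_a", "c_b"], CHANNEL_PROB, LM_PROB)
print(best)
```

In this toy setting the language model outweighs the channel model: the candidate with the lower channel probability wins because its product with Pr(c) is larger.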
   One of the primary obstacles to making the statis-                     An additional mechanism accommodates a con-
tical translation approach practical is slow speed of                  servative degree of word-order variation between
translation, as performed in A* fashion. This price                    the two languages. With each production of the
is paid for the robustness that is obtained by using                   grammar is associated either a straight orientation
very flexible language and translation models. The                     or an inverted orientation, respectively denoted as
language model allows sentences of arbitrary or-                       follows:           VP → [VP PP]
der and the translation model allows arbitrary word-                                      VP → ⟨VP PP⟩
order permutation. No structural constraints or
                                                                          In the case of a production with straight orien-
explicit linguistic grammars are imposed by this model.
                                                                       tation, the right-hand-side symbols are visited left-
                                                                       to-right for both the English and Chinese streams.
   The translation channel is characterized by two
                                                                       But for a production with inverted orientation, the
sets of parameters: translation and alignment prob-
                                                                       right-hand-side symbols are visited left-to-right for
abilities. 1 The translation probabilities describe lex-
                                                                       English and right-to-left for Chinese. Thus, the tree:
ical substitution, while alignment probabilities de-
                                                                       (3)    [I/~ ⟨[took/~T [a/-- ε/:~ book]--~]NP ]VP
scribe word-order permutation. The key problem
                                                                              [for/,,~ you/~J~]PP ⟩VP ]S
is that the formulation of alignment probabilities
                                                                       produces translations with different word order:
a(i | j, V, T) permits the English word in position j
of a length-T sentence to map to any position i of a                   (4) a. [I [[took [a book]NP ]VP [for you]PP ]VP ]S
length-V Chinese sentence. So V^T alignments are                              b. [ ~ [[.~/~]PP [ ~ 7 [ - - 2 ~ ]NP ]VP ]VP ]S
possible, yielding an exponential space with corre-                       The surprising ability of ITGs to accommodate
spondingly slow search times.                                          nearly all word-order variation between fixed-word-
                                                                       order languages 2 (English and Chinese in particu-
   1 Various models have been constructed by the IBM team              lar) has been analyzed mathematically,
(Brown et al., 1993). This description corresponds to one of the
simplest ones, "Model 2"; search costs for the more complex               2 With the exception of higher-order phenomena such as
models are correspondingly higher.                                     neg-raising and wh-movement.

linguistically, and experimentally (Wu, 1995b; Wu, 1997).             S      →    NP VP Punc
Any ITG can be transformed to an equivalent                           VP     →    V NP
binary-branching normal form.                                         NP     →    N Mod N | Prn
   A stochastic ITG associates a probability with                     S      →    [NP VP Punc] | ⟨Punc VP NP⟩
each production. It follows that a SITG assigns                       VP     →    [V NP] | ⟨NP V⟩
a probability Pr(e, c, q) to all generable trees q                    NP     →    [N Mod N] | ⟨N Mod N⟩ | [Prn]
and sentence-pairs. In principle it can be used as                 Figure 1: An input CFG and its mirrored ITG.
the translation channel model by normalizing with
Pr(c) and integrating out Pr(q) to give Pr(e|c) in
Equation (1). In practice, a strong language model              4.1 Production Mirroring
makes this unnecessary, so we can instead optimize              The first step is to convert the monolingual Chi-
the simpler Viterbi approximation                               nese CFG to a bilingual ITG. The production mir-
   \[ c^* = \arg\max_c \Pr(e, c, q)\,\Pr(c) \]   (2)            roring tactic simply doubles the number of pro-
                                                                ductions, transforming every monolingual produc-
To complete the picture we add a bigram model                   tion into two bilingual productions, 4 one straight
g_{c_{j-1} c_j} = g(c_j | c_{j-1}) for the Chinese language     and one inverted, as for example in Figure 1 where
model Pr(c).                                                    the upper Chinese CFG becomes the lower ITG.
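The production mirroring tactic can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `ItgProduction`, `mirror`, and `language2_side` are hypothetical, probabilities are omitted, and the toy grammar is the CFG of Figure 1. Each inverted production stores its right-hand side in reversed order, so that reading it right-to-left on the language 2 side recovers the original monolingual production.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

@dataclass(frozen=True)
class ItgProduction:
    lhs: str
    rhs: Tuple[str, ...]   # symbols in language 1 (source) order
    orientation: str       # "straight" for [...], "inverted" for <...>

def language2_side(p: ItgProduction) -> Tuple[str, ...]:
    """Order in which the right-hand side is read on the language 2 side."""
    if p.orientation == "straight":
        return p.rhs
    return tuple(reversed(p.rhs))

def mirror(cfg: Sequence[Tuple[str, Sequence[str]]]) -> List[ItgProduction]:
    """Turn each monolingual production into a straight and an inverted one."""
    itg = []
    for lhs, rhs in cfg:
        itg.append(ItgProduction(lhs, tuple(rhs), "straight"))
        if len(rhs) > 1:
            # Unary productions yield only one bilingual production.
            itg.append(ItgProduction(lhs, tuple(reversed(rhs)), "inverted"))
    return itg

# The input CFG of Figure 1.
TARGET_CFG = [
    ("S", ["NP", "VP", "Punc"]),
    ("VP", ["V", "NP"]),
    ("NP", ["N", "Mod", "N"]),
    ("NP", ["Prn"]),
]
MIRRORED = mirror(TARGET_CFG)
for production in MIRRORED:
    print(production)
```

Checking `language2_side` on every mirrored production against the original CFG is a quick way to confirm that mirroring never changes what the target side can generate.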
   This approach was used for the SBTG chan-                    The intent of the mirroring is to add enough flex-
nel (Wu, 1996), using the language-independent                  ibility to allow parsing of English sentences using
bracketing degenerate case of the SITG: 3                       the language 1 side of the ITG. The extra produc-
  A \xrightarrow{a_{[]}} [A\,A]                                 tions accommodate reversed subconstituent order in
  A \xrightarrow{a_{\langle\rangle}} \langle A\,A \rangle       the source language's constituents, while at the same
  A \xrightarrow{b(x,y)} x/y \quad \forall x, y lexical translations            time restricting the language 2 output sentence to con-
  A \xrightarrow{b(x,\epsilon)} x/\epsilon \quad \forall x language 1 vocabulary  form to the given target grammar whether straight or
  A \xrightarrow{b(\epsilon,y)} \epsilon/y \quad \forall y language 2 vocabulary  inverted productions are used.
                                                                   The following example illustrates how produc-
                                                                tion mirroring works. Consider the input sentence
                                                                He is the son of Stephen, which can be parsed by
In the proposed model, a structured language-                   the ITG of Figure 1 to yield the corresponding out-
dependent ITG is used instead.                                  put sentence ~ ~ 1 ~ : ~ ,            with the following
4 A Grammatical Channel Model                                   parse tree:
                                                                (5)    [[[He/{~ ]Prn ]NP [[is/~ ]V [the/ε]NOISE
Stated radically, our novel modeling thesis is that
                                                                       ⟨[son/~ ]N [of/~ ]Mod [Stephen/~ff ]N
a mirrored version of the target language grammar
                                                                       ⟩NP ]VP [.]o ]Punc ]S
can parse sentences of the source language.
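How a single ITG parse tree yields both strings, including the inverted NP of Example (5), can be sketched as follows. This is only an illustration: the `Leaf` and `Node` classes and the `yields` function are hypothetical names, and the target-side tokens (`ZH_he`, `ZH_son`, and so on) are placeholders standing in for the Chinese words rather than actual lexicon entries.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union

@dataclass
class Leaf:
    source: Optional[str]   # None plays the role of the empty string ε
    target: Optional[str]

@dataclass
class Node:
    label: str
    inverted: bool
    children: List[Union["Node", Leaf]]

def yields(tree: Union[Node, Leaf]) -> Tuple[List[str], List[str]]:
    """Return the (language 1, language 2) strings generated by the tree."""
    if isinstance(tree, Leaf):
        src = [tree.source] if tree.source is not None else []
        tgt = [tree.target] if tree.target is not None else []
        return src, tgt
    parts = [yields(child) for child in tree.children]
    # Language 1 always reads the children left to right.
    src = [w for s, _ in parts for w in s]
    # Language 2 reads them right to left under an inverted production.
    ordered = list(reversed(parts)) if tree.inverted else parts
    tgt = [w for _, t in ordered for w in t]
    return src, tgt

# Shape of Example (5); target tokens are placeholders.
TREE = Node("S", False, [
    Node("NP", False, [Leaf("He", "ZH_he")]),
    Node("VP", False, [
        Leaf("is", "ZH_is"),
        Leaf("the", None),            # singleton the/ε
        Node("NP", True, [            # inverted NP
            Leaf("son", "ZH_son"),
            Leaf("of", "ZH_of"),
            Leaf("Stephen", "ZH_Stephen"),
        ]),
    ]),
    Leaf(".", "ZH_period"),
])
SOURCE, TARGET = yields(TREE)
print(" ".join(SOURCE))
print(" ".join(TARGET))
```

Only the inverted NP reorders its children on the target side; the singleton contributes a word to the source string and nothing to the target string.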
                                                                Production mirroring produced the inverted NP
   Ideally, an ITG would be tailored for the desired
                                                                constituent which was necessary to parse son of
source and target languages, enumerating the trans-
duction patterns specific to that language pair. Con-
                                                                Stephen, i.e., ⟨son/.~ of/flcJ Stephen/~⟩NP.
structing such an ITG, however, requires massive                   If the target CFG is purely binary branching,
                                                                then the previous theoretical and linguistic analy-
manual labor effort for each language pair. Instead,
                                                                ses (Wu, 1997) suggest that much of the requisite
our approach is to take a more readily acquired
monolingual context-free grammar for the target                 constituent and word order transposition may be ac-
                                                                commodated without change to the mirrored ITG.
language, and use (or perhaps misuse) it in the SITG
                                                                On the other hand, if the target CFG contains pro-
channel, by employing the three tactics described
                                                                ductions with long right-hand-sides, then merely in-
below: production mirroring, part-of-speech map-
                                                                verting the subconstituent order will probably be in-
ping, and word skipping.
                                                                sufficient. In such cases, a more complex transfor-
   In the following, keep in mind our convention
                                                                mation heuristic would be needed.
that language 1 is the source (English), while lan-
guage 2 is the target (Chinese).                                   Objective 3 (improving grammaticality of the
                                                                output) can be directly tackled by using a tight tar-
    3 Wu (1996) experimented with Chinese-English trans-
lation, while this paper experiments with English-Chinese          4 Except for unary productions, which yield only one bilin-
translation.                                                    gual production.

get grammar. To see this, consider using a mir-            markers. This is the rationale for the singletons
rored Chinese CFG to parse English sentences with          mentioned in Section 3.
the language 1 side of the ITG. Any resulting parse           If we create an explicit singleton hypothesis for
tree must be consistent with the original Chinese          every possible input word, the resulting search
grammar. This follows from the fact that both the          space will be too large. To recognize singletons, we
straight and inverted versions of a production have        instead borrow the word-skipping technique from
language 2 (Chinese) sides identical to the original       speech recognition and robust parsing. As formal-
monolingual production: inverting production ori-          ized in the next section, we can do this by modifying
entation cancels out the mirroring of the right-hand-      the item extension step in our chart-parser-like algo-
side symbols. Thus, the output grammaticality de-          rithm. When the dot of an item is at the rightmost
pends directly on the tightness of the original Chi-       position, the item is a complete constituent (a sub-
nese grammar.                                              tree) and can be used to extend other items. In ordi-
   In principle, with this approach a single tar-          nary chart parsing, the subtrees that can extend an
get grammar could be used for translation from             item are those that start immediately to the right of
any number of other (fixed word-order) source lan-         the item's dot position and whose category equals
guages, so long as a translation lexicon is available      the category the item anticipates. With word-skip-
for each source language.                                  ping, a valid subtree may instead start a few posi-
   Probabilities on the mirrored ITG cannot be re-         tions to the right of the dot (or to the left, for an item
liably estimated from bilingual data without a very        corresponding to an inverted production). In other
large parallel corpus. A straightforward approxima-        words, the words between the dot position and the
tion is to employ EM or Viterbi training on just a         start of the subtree are skipped and treated as
monolingual target language (Chinese) corpus.              singletons.
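The modified item extension step can be sketched as below. This is a simplified illustration under stated assumptions rather than the system's actual code: the `Subtree` and `Anticipation` classes and the `extend` function are hypothetical names, probabilities are ignored, and only items for straight productions are handled (inverted items would look to the left of the dot instead). Word i of the input is taken to span positions i to i+1.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class Subtree:
    category: str
    start: int
    end: int

@dataclass(frozen=True)
class Anticipation:
    lhs: str
    rhs: Tuple[str, ...]
    dot: int      # index in rhs of the next anticipated symbol
    start: int
    end: int      # input position of the dot

def extend(item: Anticipation, chart: List[Subtree],
           max_skip: int) -> List[Tuple[Anticipation, List[int]]]:
    """Extend item with every chart subtree of the anticipated category that
    starts at most max_skip positions right of the dot. Returns the extended
    items together with the indices of the skipped words (the singletons)."""
    wanted = item.rhs[item.dot]
    extended = []
    for sub in chart:
        gap = sub.start - item.end
        if sub.category == wanted and 0 <= gap <= max_skip:
            new_item = Anticipation(item.lhs, item.rhs, item.dot + 1,
                                    item.start, sub.end)
            extended.append((new_item, list(range(item.end, sub.start))))
    return extended

WORDS = ["He", "is", "the", "son", "of", "Stephen"]
ITEM = Anticipation("VP", ("V", "NP"), dot=1, start=1, end=2)  # VP -> V . NP
CHART = [Subtree("NP", 3, 6)]                                  # "son of Stephen"

WITH_SKIP = extend(ITEM, CHART, max_skip=4)
WITHOUT_SKIP = extend(ITEM, CHART, max_skip=0)
for new_item, skipped in WITH_SKIP:
    print(new_item, [WORDS[i] for i in skipped])
```

With `max_skip=0` the function reduces to ordinary adjacent extension, so the NP starting one word later cannot be attached; allowing skips lets the intervening word be treated as a singleton.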
                                                              Consider Example (5) again. Word-skipping han-
4.2   Part-of-Speech Mapping                               dled the word "the", which has no Chinese counter-
                                                           part. At a certain point during translation, we have
The second problem is that the part-of-speech (PoS)        the item VP → [is/x~]V • NP. With word-skipping,
categories used by the target (Chinese) grammar do         it can be extended to VP → [is/x~]V NP • by the sub-
not correspond to the source (English) words when          tree ⟨son/~ of/~ Stephen/~⟩NP, even though the
the source sentence is parsed. It is unlikely that any     subtree is not adjacent to the dot position of the item
English lexicon will list Chinese parts-of-speech.         (it lies within a certain distance; see Section 5). The
   We employ a simple part-of-speech mapping               word "the", which is adjacent to the dot position, is
technique that allows the PoS tag of any corre-            skipped.
sponding word in the target language (as found in             Word-skipping gives us the flexibility to parse
the translation lexicon) to serve as a proxy for the       the source input by skipping possible singleton(s)
source word's PoS. The word view, for example,             whenever doing so lets the source input be parsed
may be tagged with the Chinese tags nc and vn,             with the highest likelihood and grammatical output
since the translation lexicon holds both view_NN/~_nc      be produced.
and view_VB/~_vn.
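The proxy-tag lookup for words that appear in the translation lexicon can be sketched as follows. The lexicon layout, the name `proxy_pos_tags`, and the placeholder target words are assumptions made for illustration; only the tags nc and vn for "view" come from the example above.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical lexicon layout: source word -> list of (target word, target PoS).
Lexicon = Dict[str, List[Tuple[str, str]]]

LEXICON: Lexicon = {
    "view": [("ZH_view_noun", "nc"), ("ZH_view_verb", "vn")],
}

def proxy_pos_tags(word: str, lexicon: Lexicon) -> Set[str]:
    """Collect the target-language PoS tags of all translations of word.
    An empty set signals a word missing from the lexicon."""
    return {tag for _, tag in lexicon.get(word, [])}

VIEW_TAGS = proxy_pos_tags("view", LEXICON)
print(sorted(VIEW_TAGS))
```

A word absent from the lexicon returns an empty set here and needs separate treatment.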
   Unknown English words must be handled differently since they cannot be looked up in the translation lexicon. The English PoS tag is first found by tagging the English sentence. A set of possible corresponding Chinese PoS tags is then found by table lookup (using a small hand-constructed mapping table). For example, NN may map to nc, loc and pref, while VB may map to vi, vn, vp, vv, vs, etc. This method generates many hypotheses and should only be used as a last resort.

4.3 Word Skipping

Regardless of how constituent-order transposition is handled, some function words simply do not occur in both languages, for example Chinese aspect

5   Translation Algorithm

The translation search algorithm differs from that of Wu's SBTG model in that it handles arbitrary grammars rather than binary bracketing grammars. As such it is more similar to active chart parsing (Earley, 1970) than to CYK parsing (Kasami, 1965; Younger, 1967). We take the standard notion of items (Aho and Ullman, 1972), and use the term anticipation to mean an item which still has symbols right of its dot. Items that do not have any symbols right of the dot are called subtrees.

   As with Wu's SBTG model, the algorithm maximizes a probabilistic objective function, Equation (2), using dynamic programming similar to that for HMM recognition (Viterbi, 1967). The presence of the bigram model in the objective function necessitates indexes in the recurrence not only on subtrees over the source English string, but also on the delimiting words of the target Chinese substrings.

   The dynamic programming exploits a recursive formulation of the objective function as follows. Some notation remarks: e_{s..t} denotes the subsequence of English tokens e_{s+1}, e_{s+2}, ..., e_t. We use C(s..t) to denote the set of Chinese words that are translations of the English word created by taking all tokens in e_{s..t} together. C(s, t) denotes the set of Chinese words that are translations of any of the English words anywhere within e_{s..t}. K is the maximum number of consecutive English words that can be skipped. 5 Finally, the argmax operator is generalized to vector notation to accommodate multiple indices.

1. Initialization

   \[ \delta^{0}_{rstYY} = b_i(e_{s..t}/Y), \qquad 0 \le s < t \le T, \quad Y \in C(s..t), \quad r \text{ is } Y\text{'s PoS} \]

2. Recursion

For all r, s, t, u, v such that
   r is the category of a constituent spanning s to t
   0 ≤ s < t ≤ T
   u, v are the leftmost/rightmost words of the constituent

   \[ \delta_{rstuv} = \max\left[ \delta^{[]}_{rstuv},\; \delta^{\langle\rangle}_{rstuv},\; \delta^{0}_{rstuv} \right] \]

   \[ \gamma_{rstuv} = \begin{cases}
        []               & \text{if } \delta^{[]}_{rstuv} > \max\left[\delta^{\langle\rangle}_{rstuv}, \delta^{0}_{rstuv}\right] \\
        \langle\rangle   & \text{if } \delta^{\langle\rangle}_{rstuv} > \max\left[\delta^{[]}_{rstuv}, \delta^{0}_{rstuv}\right] \\
        0                & \text{otherwise}
      \end{cases} \]

where

   \[ \delta^{[]}_{rstuv} = \max_{\substack{r \to [r_0 \ldots r_n] \\ s_i < t_i \le s_{i+1} \\ 0 \le s_{i+1} - t_i \le K}} a_i(r) \prod_{i=0}^{n} \delta_{r_i s_i t_i u_i v_i}\, g_{v_i u_{i+1}} \]

   \[ \tau^{[]}_{rstuv} = \operatorname*{argmax}_{\substack{r \to [r_0 \ldots r_n] \\ s_i < t_i \le s_{i+1} \\ 0 \le s_{i+1} - t_i \le K}} a_i(r) \prod_{i=0}^{n} \delta_{r_i s_i t_i u_i v_i}\, g_{v_i u_{i+1}} \]

   \[ \delta^{\langle\rangle}_{rstuv} = \max_{\substack{r \to \langle r_0 \ldots r_n \rangle \\ s_i < t_i \le s_{i+1} \\ 0 \le s_{i+1} - t_i \le K}} a_i(r) \prod_{i=0}^{n} \delta_{r_i s_i t_i u_i v_i}\, g_{v_{i+1} u_i} \]

   \[ \tau^{\langle\rangle}_{rstuv} = \operatorname*{argmax}_{\substack{r \to \langle r_0 \ldots r_n \rangle \\ s_i < t_i \le s_{i+1} \\ 0 \le s_{i+1} - t_i \le K}} a_i(r) \prod_{i=0}^{n} \delta_{r_i s_i t_i u_i v_i}\, g_{v_{i+1} u_i} \]

    5 In our experiments, K was set to 4.
    6 s_0 = s, t_n = t, u_0 = u, v_n = v, g_{v_n u_{n+1}} = g_{v_{n+1} u_n} = 1, q_i = (r_i s_i t_i u_i v_i).

3. Reconstruction

Let q_0 = (S, 0, T, u, v) be the optimal root, where (u, v) = \arg\max_{u,v \in C(0,T)} \delta_{S0Tuv}. Any child of q = (r, s, t, u, v) is given by:

   \[ \mathrm{CHILD}(q, i) = \begin{cases}
        (r_i, s_i, t_i, u_i, v_i) \text{ from } \tau^{[]}_{rstuv}               & \text{if } \gamma_q = [] \\
        (r_i, s_i, t_i, u_i, v_i) \text{ from } \tau^{\langle\rangle}_{rstuv}   & \text{if } \gamma_q = \langle\rangle \\
        \text{NIL}                                                              & \text{otherwise}
      \end{cases} \]

   Assuming the number of translations per word is bounded by some constant, the maximum size of C(s, t) is proportional to t - s. The asymptotic time complexity for our algorithm is thus bounded by O(T^7). However, note that in theory the complexity upper bound rises exponentially rather than polynomially with the size of the grammar, just as for context-free parsing (Barton et al., 1987), whereas this is not a problem for Wu's SBTG algorithm. In practice, natural language grammars are usually sufficiently constrained so that speed is actually improved over the SBTG algorithm, as discussed later.

   The dynamic programming is efficiently implemented by an active-chart-parser-style agenda-based algorithm, sketched as follows:

    1. Initialization For each word in the input sentence, put a subtree with category equal to the PoS of its translation into the agenda.
    2. Recursion Loop while agenda is not empty:
        (a) If the current item is a subtree of category X, extend existing anticipations by calling ANTICIPATIONEXTENSION. For each rule in the grammar of the form Z → X W ... Y, add an initial anticipation of the form Z → X • W ... Y and put it into the agenda. Add subtree X to the chart.
        (b) If the current item is an anticipation of the form Z → W ... • X ... Y from s to t_0, find all subtrees in the chart with category X that start at position t_0 and use each subtree to extend this anticipation by calling ANTICIPATIONEXTENSION.
       ANTICIPATIONEXTENSION: Assuming the subtree we found is of category X from position s_1 to t, for any anticipation of the form Z → W ... • X ... Y from s_0 to [s_1 - K, s_1], extend it to Z → W ... X • ... Y with span from s_0 to t and add it to the agenda.

    3. Reconstruction The output string is recursively reconstructed from the highest-likelihood subtree with category S that spans the whole input sentence.

6     Results

        Time (x)                  SBTG Channel    Grammatical Channel
        x < 30 secs.                 15.6%              83.3%
        30 secs. < x < 1 min.        34.9%               7.6%
        x > 1 min.                   49.5%               9.1%

                    Table 1: Translation speed.

        Sentence meaning
        preservation              SBTG Channel    Grammatical Channel
        Correct                      25.9%              32.3%
        Incorrect                    74.1%              67.7%

                    Table 2: Translation accuracy.

The grammatical channel was tested in the SILC translation system. The translation lexicon was partly constructed by training on government transcripts from the HKUST English-Chinese Parallel Bilingual Corpus, and partly entered by hand. The corpus was sentence-aligned statistically (Wu, 1994); Chinese words and collocations were extracted (Fung and Wu, 1994; Wu and Fung, 1994); then translation pairs were learned via an EM procedure (Wu and Xia, 1995). Together with hand-constructed entries, the resulting English vocabulary is approximately 9,500 words and the Chinese
vocabulary is approximately 14,500 words, with a                     constraints on the search space given by the SITG.
many-to-many translation mapping averaging 2.56                         The natural trade-off is that constraining the
Chinese translations per English word. Since the                     structure of the input decreases robustness some-
lexicon's content is mixed, we approximate transla-                  what. Approximately 13% of the test corpus could
tion probabilities by using the unigram distribution                 not be parsed in the grammatical channel model.
of the target vocabulary from a small monolingual                    As mentioned earlier, this figure is likely to vary
corpus. Noise still exists in the lexicon.                           widely depending on the characteristics of the tar-
   The Chinese grammar we used is not tight--                        get grammar. Of course, one can simply back off
it was written for robust parsing purposes, and as                   to the SBTG model when the grammatical channel
such it over-generates. Because of this we have not                  rejects an input sentence.
yet been able to conduct a fair quantitative assess-                    With respect to objective 2 (improving meaning-
ment of objective 3. Our productions were con-                       preservation accuracy), the new model is also
structed with reference to a standard grammar (Bei-                  promising. Table 2 shows that the percentage of
jing Language and Culture Univ., 1996) and totalled                  meaningfully translated sentences rises from 26% to
316 productions. Not all the original productions                    32% (ignoring the rejected cases). 7 We have judged
are mirrored, since some (128) are unary produc-                     only whether the correct meaning is conveyed by the
tions, and others are Chinese-specific lexical con-                  translation, paying particular attention to word order
structions like S ~ ~ - ~ S NP ~ S, which are                        and grammaticality, but otherwise ignoring morpho-
obviously unnecessary to handle English. About                       logical and function word choices.
27.7% of the non-unary Chinese productions were
mirrored and the total number of productions in the                  7     Conclusion
final ITG is 368.
                                                                     Currently we are designing a tight generation-
   For the experiment, 222 English sentences with
                                                                     oriented Chinese grammar to replace our robust
a maximum length of 20 words from the parallel
                                                                     parsing-oriented grammar. We will use the new
corpus were randomly selected. Some examples of
                                                                     grammar to quantitatively evaluate objective 3. We
the output are shown in Figure 2. No morphological
                                                                     are also studying complementary approaches to
processing has been used to correct the output, and
                                                                     the English word deletion performed by word-
up to now we have only been testing with a bigram
                                                                     skipping--i.e., extensions that insert Chinese words
model trained on extremely small corpus.
                                                                     suggested by the target grammar into the output.
   With respect to objective 1 (increasing translation
                                                                        The framework seeds a natural transition toward
speed), the new model is very encouraging. Ta-
                                                                     pattern-based translation models (objective 4). One
ble 1 shows that over 90% of the samples can be
processed within one minute by the grammatical                          7These accuracy rates are relatively low because these ex-
channel model, whereas that for the SBTG channel                     periments are being conducted with new lexicons and grammar
model is about 50%. This demonstrates the stronger                   on a new translation direction (English-Chinese).

can post-edit the productions of a mirrored SITG more carefully and
extensively than we have done in our cursory pruning, gradually transforming
the original monolingual productions into a set of true transduction rule
patterns. This provides a smooth evolution from a purely statistical model
toward a hybrid model, as more linguistic resources become available.
   We have described a new stochastic grammatical channel model for
statistical machine translation that exhibits several nice properties in
comparison with Wu's SBTG model and IBM's word alignment model. The SITG-based
channel increases translation speed, improves meaning-preservation accuracy,
permits tight target CFGs to be incorporated for improving output
grammaticality, and suggests a natural evolution toward transduction rule
models. The input CFG is adapted for use via production mirroring,
part-of-speech mapping, and word-skipping. We gave a polynomial-time
translation algorithm that requires only a translation lexicon, plus a CFG and
bigram language model for the target language. More linguistic knowledge about
the target language is employed than in pure statistical translation models,
but Wu's SBTG polynomial-time bound on search cost is retained and in fact the
search space can be significantly reduced by using a good grammar. Output
always conforms to the given target grammar.

   Input : I entirely agree with this point of view.
   Output: [Chinese text illegible]
   Corpus: [Chinese text illegible]
   Input : This would create a tremendous financial burden to taxpayers in
           Hong Kong.
   Output: [Chinese text illegible]
   Corpus: [Chinese text illegible]
   Input : The Government wants, and will work for, the best education for
           all the children of Hong Kong.
   Output: [Chinese text illegible]
   Corpus: [Chinese text illegible]
   Input : Let me repeat one simple point yet again.
   Output: [Chinese text illegible]
   Input : We are very disappointed.
   Output: [Chinese text illegible]
   Corpus: [Chinese text illegible]

   Figure 2: Example translation outputs from the grammatical channel model.

Acknowledgments
Thanks to the SILC group members: Xuanyin Xia, Daniel Chan, Aboy Wong, Vincent
Chow & James Pang.

References
Alfred V. Aho and Jeffrey D. Ullman. 1972. The Theory of Parsing, Translation,
   and Compiling. Prentice Hall, Englewood Cliffs, NJ.
G. Edward Barton, Robert C. Berwick, and Eric S. Ristad. 1987. Computational
   Complexity and Natural Language. MIT Press, Cambridge, MA.
Beijing Language and Culture Univ. 1996. Sucheng Hanyu Chuji Jiaocheng (A
   Short Intensive Elementary Chinese Course), volume 1-4. Beijing Language
   and Culture Univ. Press.
Peter F. Brown, John Cocke, Stephen A. DellaPietra, Vincent J. DellaPietra,
   Frederick Jelinek, John D. Lafferty, Robert L. Mercer, and Paul S. Roossin.
   1990. A statistical approach to machine translation. Computational
   Linguistics, 16(2):29-85.
Peter F. Brown, Stephen A. DellaPietra, Vincent J. DellaPietra, and Robert L.
   Mercer. 1993. The mathematics of statistical machine translation: Parameter
   estimation. Computational Linguistics, 19(2):263-311.
Jay Earley. 1970. An efficient context-free parsing algorithm. Communications
   of the Assoc. for Computing Machinery, 13(2):94-102.
Pascale Fung and Dekai Wu. 1994. Statistical augmentation of a Chinese
   machine-readable dictionary. In Proc. of the 2nd Annual Workshop on Very
   Large Corpora, pg 69-85, Kyoto, Aug.
T. Kasami. 1965. An efficient recognition and syntax analysis algorithm for
   context-free languages. Technical Report AFCRL-65-758, Air Force Cambridge
   Research Lab., Bedford, MA.
Andrew J. Viterbi. 1967. Error bounds for convolutional codes and an
   asymptotically optimal decoding algorithm. IEEE Transactions on Information
   Theory, 13:260-269.
Dekai Wu and Pascale Fung. 1994. Improving Chinese tokenization with
   linguistic filters on statistical lexical acquisition. In Proc. of 4th
   Conf. on ANLP, pg 180-181, Stuttgart, Oct.
Dekai Wu and Xuanyin Xia. 1995. Large-scale automatic extraction of an
   English-Chinese lexicon. Machine Translation, 9(3-4):285-
Dekai Wu. 1994. Aligning a parallel English-Chinese corpus statistically with
   lexical criteria. In Proc. of 32nd Annual Conf. of Assoc. for Computational
   Linguistics, pg 80-87, Las Cruces, Jun.
Dekai Wu. 1995a. An algorithm for simultaneously bracketing parallel texts by
   aligning words. In Proc. of 33rd Annual Conf. of Assoc. for Computational
   Linguistics, pg 244-251, Cambridge, MA, Jun.
Dekai Wu. 1995b. Grammarless extraction of phrasal translation examples from
   parallel texts. In TMI-95, Proc. of the 6th Intl Conf. on Theoretical and
   Methodological Issues in Machine Translation, volume 2, pg 354-372, Leuven,
   Belgium, Jul.
Dekai Wu. 1995c. Stochastic inversion transduction grammars, with application
   to segmentation, bracketing, and alignment of parallel corpora. In Proc. of
   IJCAI-95, 14th Intl Joint Conf. on Artificial Intelligence, pg 1328-1334,
   Montreal, Aug.
Dekai Wu. 1995d. Trainable coarse bilingual grammars for parallel text
   bracketing. In Proc. of the 3rd Annual Workshop on Very Large Corpora,
   pg 69-81, Cambridge, MA, Jun.
Dekai Wu. 1996. A polynomial-time algorithm for statistical machine
   translation. In Proc. of the 34th Annual Conf. of the Assoc. for
   Computational Linguistics, pg 152-158, Santa Cruz, CA, Jun.
Dekai Wu. 1997. Stochastic inversion transduction grammars and bilingual
   parsing of parallel corpora. Computational Linguistics, 23(3):377-404,
   Sept.
David H. Younger. 1967. Recognition and parsing of context-free languages in
   time n^3. Information and Control, 10(2):189-208.

           Machine Translation with a Stochastic Grammatical Channel
                       Dekai WU and Hongsing WONG

                         {dekai,wong}@cs.ust.hk


