                Integrating Ngram Model and Case-based Learning
                        for Chinese Word Segmentation

                       Chunyu Kit Zhiming Xu Jonathan J. Webster
                        Department of Chinese, Translation and Linguistics
                                  City University of Hong Kong
                              Tat Chee Ave., Kowloon, Hong Kong
                       {ctckit, ctxuzm, ctjjw}

Abstract

This paper presents our recent work for participation in the First International Chinese Word Segmentation Bakeoff (ICWSB-1). It is based on a general-purpose ngram model for word segmentation and a case-based learning approach to disambiguation. This system excels in identifying in-vocabulary (IV) words, achieving a recall of around 96-98%. Here we present our strategies for language model training and disambiguation rule learning, analyze the system's performance, and discuss areas for further improvement, e.g., out-of-vocabulary (OOV) word discovery.

1 Introduction

After about two decades of studies of Chinese word segmentation, ICWSB-1 (henceforth, the bakeoff) is the first effort to test and compare different approaches and systems on common datasets. We participated in the bakeoff with a segmentation system that is designed to integrate a general-purpose ngram model for probabilistic segmentation and a case- or example-based learning approach (Kit et al., 2002) for disambiguation.

The ngram model, with words extracted from training corpora, is trained with the EM algorithm (Dempster et al., 1977) using unsegmented training corpora. Originally it was developed to enhance word segmentation accuracy so as to facilitate Chinese-English word alignment for our ongoing EBMT project, where only unsegmented texts are available for training. It is expected to be robust enough to handle novel texts, independent of any segmented texts for training. To simplify the EM training, we used the uni-gram model for the bakeoff and relied on the Viterbi algorithm (Viterbi, 1967) for the most probable segmentation, instead of attempting to exhaust all possible segmentations of each sentence for a complicated full version of EM training.

The case-based learning works in a straightforward way. It first extracts case-based knowledge, as a set of context-dependent transformation rules, from the segmented training corpus, and then applies them to ambiguous strings in a test corpus according to the similarity of their contexts. The similarity is empirically computed in terms of the length of relevant common affixes of context strings.

The effectiveness of this integrated approach is verified by its outstanding performance on IV word identification. Its IV recall rate, ranging from 96% to 98%, ranks first or second in all closed tests in which we have participated. Unfortunately, its overall performance is not sustained at the same level, due to the lack of a module for OOV word detection.

This paper presents the implementation of the system and analyzes its performance and problems, with the aim of exploring directions for further improvement. The remaining sections are organized as follows. Section 2 presents the ngram model and its training with the EM algorithm, and Section 3 presents the case-based learning for disambiguation. The overall architecture of our system is given in Section 4, and its performance and problems are analyzed in Section 5. Section 6 concludes the paper and previews future work.

2 Ngram model and training

An ngram model can be utilized to find the most probable segmentation of a sentence. Given a Chinese sentence $s = c_1 c_2 \cdots c_m$ (also denoted as $c_1^m$), its probabilistic segmentation into a word sequence $w_1 w_2 \cdots w_k$ (also denoted as $w_1^k$) with the aid of an ngram model can be formulated as

    seg(s) = \arg\max_{s = w_1 \circ w_2 \circ \cdots \circ w_k} \prod_{i=1}^{k} p(w_i \mid w_{i-n+1}^{i-1})    (1)

where $\circ$ denotes string concatenation, $w_{i-n+1}^{i-1}$ the context (or history) of $w_i$, and $n$ is the order of the ngram model in use. We have opted for uni-gram for the sake of simplicity. Accordingly, $p(w_i \mid w_{i-n+1}^{i-1})$ in (1) becomes $p(w_i)$, which is commonly estimated as follows, given a corpus $C$ for training.

    p(w_i) \doteq \frac{f(w_i)}{\sum_{w \in C} f(w)}    (2)

In order to estimate a reliable $p(w_i)$, the ngram model needs to be trained with the EM algorithm using the available training corpus. Each EM iteration aims at approaching a more reliable $f(w)$ for estimating $p(w)$, as follows:

    f^{k+1}(w) = \sum_{s \in C} \sum_{s' \in S(s)} p^k(s') \, f^k(w \in s')    (3)

where $k$ denotes the current iteration, $S(s)$ the set of all possible segmentations for $s$, and $f^k(w \in s')$ the occurrences of $w$ in a particular segmentation $s'$.

However, assuming that every sentence always has a segmentation, the following equation holds:

    \sum_{s' \in S(s)} p^k(s') = 1    (4)

Accordingly, we can adjust (3) as (5) with a normalization factor $\alpha = \sum_{s' \in S(s)} p^k(s')$, to avoid favoring words in shorter sentences too much. In general, shorter sentences have higher probabilities.

    f^{k+1}(w) = \sum_{s \in C} \sum_{s' \in S(s)} \frac{p^k(s')}{\alpha} \, f^k(w \in s')    (5)

Following the conventional idea to speed up the EM training, we turned to the Viterbi algorithm. The underlying philosophy is to distribute more probability to more probable events. The Viterbi segmentation, by utilizing dynamic programming techniques to go through the word trellis of a sentence efficiently, finds the most probable segmentation under the current parameter estimation of the language model, fulfilling (1). Accordingly, (3) becomes

    f^{k+1}(w) = \sum_{s \in C} p^k(seg(s)) \, f^k(w \in seg(s))    (6)

and (5) becomes

    f^{k+1}(w) = \sum_{s \in C} f^k(w \in seg(s))    (7)

where the normalization factor is skipped, for only the Viterbi segmentation is used for EM re-estimation. Equation (7) makes the EM training with the Viterbi algorithm very simple for the uni-gram model: iterate word segmentation, as in (1), and word count updating, via (7), sentence by sentence through the training corpus until convergence.

Since the EM algorithm converges to a local maximum only, it is critical to start the training with an initial $f^0(w)$ for each word not too far away from its "true" value. Our strategy for initializing $f^0(w)$ is to assume all possible words in the training corpus as equiprobable and count each of them as 1; then $p^0(w)$ is derived using (2). This strategy is supposed to have a weaker bias in favor of longer words than maximal matching segmentation.

For the bakeoff, the ngram model is trained with the unsegmented training corpora together with the test sets. It is a kind of unsupervised training. Adding the test set to the training data is reasonable, as it allows the model to adapt to the test sets. Experiments show that the training converges very fast, and the segmentation performance improves significantly from iteration to iteration. For the bakeoff experiments, we carried out the training in 6 iterations, because more iterations than this have not been observed to bring any significant improvement in segmentation accuracy on the training sets.
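The training loop of Section 2 can be illustrated with a minimal sketch of uni-gram Viterbi segmentation, as in (1), and count re-estimation, as in (7). It is not the authors' implementation. The maximum word length `MAX_WORD_LEN`, the floor count `UNSEEN_CHAR_COUNT` for unseen single characters, and the batch (whole-corpus) update per iteration are all assumptions made here for illustration; all function names are hypothetical.

```python
import math
from collections import Counter

MAX_WORD_LEN = 4          # assumed cap on candidate word length (not from the paper)
UNSEEN_CHAR_COUNT = 0.5   # assumed floor so unseen single characters stay segmentable


def initial_counts(corpus, max_len=MAX_WORD_LEN):
    """f^0(w): every candidate word (substring up to max_len) is counted as 1."""
    counts = Counter()
    for s in corpus:
        for i in range(len(s)):
            for j in range(i + 1, min(len(s), i + max_len) + 1):
                counts[s[i:j]] = 1
    return counts


def viterbi_segment(s, counts, max_len=MAX_WORD_LEN):
    """Most probable segmentation of s under p(w) = f(w) / sum of all f, as in (1)-(2)."""
    total = sum(counts.values())
    best = [0.0] + [-math.inf] * len(s)   # best[j]: best log-probability of s[:j]
    back = [0] * (len(s) + 1)             # back[j]: start index of the last word in s[:j]
    for j in range(1, len(s) + 1):
        for i in range(max(0, j - max_len), j):
            f = counts.get(s[i:j], 0)
            if f == 0 and j - i == 1:
                f = UNSEEN_CHAR_COUNT     # crude, unnormalized fallback (assumption)
            if f > 0 and best[i] > -math.inf:
                score = best[i] + math.log(f / total)
                if score > best[j]:
                    best[j], back[j] = score, i
    words, j = [], len(s)
    while j > 0:                          # follow back-pointers from the sentence end
        words.append(s[back[j]:j])
        j = back[j]
    return words[::-1]


def viterbi_em(corpus, iterations=6, max_len=MAX_WORD_LEN):
    """Alternate Viterbi segmentation and word counting, as in (7)."""
    counts = initial_counts(corpus, max_len)
    for _ in range(iterations):
        new_counts = Counter()
        for s in corpus:
            new_counts.update(viterbi_segment(s, counts, max_len))
        if new_counts == counts:          # counts no longer change: converged
            break
        counts = new_counts
    return counts
```

With hand-set counts such as `Counter({"ab": 5, "a": 1, "b": 1, "c": 3, "abc": 1})`, `viterbi_segment("abc", counts)` prefers the two frequent words "ab" and "c" over the rare single word "abc". The sketch updates counts once per pass over the corpus; a sentence-by-sentence update is an equally plausible reading of the text.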
3 Case-based learning for disambiguation

No matter how well the language model is trained, probabilistic segmentation cannot avoid mistakes on ambiguous strings, although it resolves most ambiguities by virtue of probability. For the remaining unresolved ambiguities, however, we have to resort to other strategies and/or resources. Our recent study (Kit et al., 2002) shows that case-based learning is an effective approach to disambiguation.

The basic idea behind the case-based learning is to utilize existing resolutions for known ambiguous strings to do disambiguation if similar ambiguities occur again. This learning strategy can be implemented in two straightforward steps:

  1. Collection of correct answers from the training corpus for ambiguous strings together with their contexts, resulting in a set of context-dependent transformation rules;

  2. Application of appropriate rules to ambiguous strings.

A transformation rule of this type is actually an example of segmentation, indicating how an ambiguous string is segmented within a particular context. It has the following general form:

    C^l \, \alpha \, C^r : \alpha \rightarrow w_1 w_2 \cdots w_k

where $\alpha$ is the ambiguous string, $C^l$ and $C^r$ its left and right contexts, respectively, and $w_1 w_2 \cdots w_k$ the correct segmentation of $\alpha$ given the contexts. In our implementation, we set the context length on each side to two words.

For a particular ambiguity, the example with the most similar context in the example (or rule) base is applied. The similarity is measured by the sum of the lengths of the common suffix of the left contexts and the common prefix of the right contexts. The details of computing this similarity can be found in (Kit et al., 2002). If no rule is applicable, the probabilistic segmentation is retained.

For the bakeoff, we have based our approach to ambiguity detection and disambiguation rule extraction on the assumption that only ambiguous strings cause mistakes: we detect the discrepancies between our probabilistic segmentation and the standard segmentation of the training corpus, and turn them into transformation rules. An advantage of this approach is that the rules so derived carry out not only disambiguation but also error correction. This links our disambiguation strategy to the application of Brill's (1993) transformation-based error-driven learning to Chinese word segmentation (Palmer, 1997; Hockenmaier and Brew, 1998).

4 System architecture

The overall architecture of our word segmentation system is presented in Figure 1.

    Figure 1: Overall architecture of the system

5 Performance and analysis

The performance of our system in the bakeoff is presented in Table 1 in terms of precision (P), recall (R) and F score in percentages, where "c" denotes closed tests. Its IV word identification performance is remarkable.

However, its overall performance is not in balance with this, due to the lack of a module for OOV word discovery. It only gets a small number of OOV words correct by chance. The higher the OOV proportion in the test set, the worse its F score. The relatively high Roov for the PKc track is mostly the result of number recognition with regular expressions.

    Test    P      R      F      OOV    Roov   Riv
    SAc     95.2   93.1   94.2    2.2    4.3   97.2
    CTBc    80.0   67.4   73.2   18.1    7.6   95.9
    PKc     92.3   86.7   89.4    6.9   15.9   98.0

    Table 1: System performance, in percentages (%)
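The rule selection described in Section 3 can be illustrated with a minimal sketch, offered under stated assumptions rather than as the system's actual code. Contexts are represented as plain strings and compared character by character, and a rule counts as "applicable" whenever its ambiguous string matches exactly; both choices are assumptions, since the exact similarity computation is given in (Kit et al., 2002) and not here. The names `Rule`, `similarity` and `disambiguate` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    left: str            # C^l: left context string
    ambiguous: str       # alpha: the ambiguous string
    right: str           # C^r: right context string
    segmentation: tuple  # w_1 ... w_k: the correct segmentation of alpha


def common_suffix_len(a, b):
    """Length of the longest common suffix of a and b."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n


def common_prefix_len(a, b):
    """Length of the longest common prefix of a and b."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n


def similarity(rule, left, right):
    """Common suffix of the left contexts plus common prefix of the right contexts."""
    return common_suffix_len(rule.left, left) + common_prefix_len(rule.right, right)


def disambiguate(left, ambiguous, right, rules, fallback):
    """Apply the most similar applicable rule, else keep the probabilistic segmentation."""
    candidates = [r for r in rules if r.ambiguous == ambiguous]  # assumed applicability test
    if not candidates:
        return list(fallback)
    best = max(candidates, key=lambda r: similarity(r, left, right))
    return list(best.segmentation)
```

Choosing the candidate with `max` over the similarity score, rather than taking the first candidate in the rule base, is the point of the sketch: two rules for the same ambiguous string can prescribe different segmentations, and only the context decides between them.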
5.1 Error analysis

Most errors on IV words are due to the side-effect of the context-dependent transformation rules. The rules resolve most remaining ambiguities and correct many errors, but at the same time they also corrupt some proper segmentations. This side-effect is most likely to occur when there is inadequate context information to decide which rules to apply.

There are two strategies to remedy, or at least alleviate, this side-effect: (1) retrain the probabilistic segmentation, a conservative strategy; or (2) incorporate Brill's error-driven learning with several rounds of transformation rule extraction and application, allowing mistakes caused by some rules in previous rounds to be corrected by other rules in later rounds.

However, even worse than the above side-effect is a bug in our disambiguation module: it always applies the first available rule, leading to many unexpected errors, each of which may result in more than one erroneous word. For instance, among 430 errors made by the system in the SA closed test, some 70 are due to this bug. A number of representative examples of these errors are presented in Table 2, together with some false errors resulting from the inconsistency in the standard segmentation.

    Errors     Standard   False errors   Standard
    2í (8)     2 í        ùŸ×D           ù Ÿ ×D
    É u (7)    Éu         tjs—Ä          tj s—Ä
    . ? (7)    .?         ÙÍ’Ä           ÙÍ ’Ä
    _ A (5)    _A         P_,ñ           P_ ,ñ
    É    (4)   É          1w±Ñ           1w± Ñ
    n? (4)     n ?        .™¤Þ           . ™ ¤ Þ

    Table 2: Errors and false errors

6 Conclusion and future work

We have presented our recent work for participation in ICWSB-1 based on a general-purpose ngram model for probabilistic word segmentation and a case-based learning strategy for disambiguation. The ngram model is trained on available unsegmented texts using the EM algorithm with the aid of Viterbi segmentation. The learning strategy acquires a set of context-dependent transformation rules to correct mistakes in the probabilistic segmentation of ambiguous substrings. This integrated approach demonstrates its effectiveness by its outstanding performance on IV word identification. With elimination of the bug and false errors, its performance could be significantly better.

6.1 Future work

The above problem analysis points to two main directions for improvement in our future work: (1) OOV word detection; (2) a better strategy for learning and applying transformation rules to reduce the side-effect. In addition, we are also interested in studying the effectiveness of higher-order ngram models and variants of EM training for Chinese word segmentation.

Acknowledgements

The work is part of the CERG project "EBMT for HK Legal Texts" funded by HK UGC under the grant #9040482, with Jonathan J. Webster as the principal investigator and Chunyu Kit, Caesar S. Lun, Haihua Pan, King Kuai Sin and Vincent Wong as investigators. The authors wish to thank all team members for their contribution to this paper.

References

E. Brill. 1993. A Corpus-Based Approach to Language Learning. Ph.D. thesis, University of Pennsylvania, Philadelphia, PA.

A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1–38.

J. Hockenmaier and C. Brew. 1998. Error-driven learning of Chinese word segmentation. In PACLIC-12, pages 218–229, Singapore. Chinese and Oriental Languages Processing Society.

C. Kit, H. Pan, and H. Chen. 2002. Learning case-based knowledge for disambiguating Chinese word segmentation: A preliminary study. In COLING2002 workshop: SIGHAN-1, pages 33–39, Taipei.

D. Palmer. 1997. A trainable rule-based algorithm for word segmentation. In ACL-97, pages 321–328,

A. J. Viterbi. 1967. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, IT-13:260–267.
