

             Unsupervised Chinese Word Segmentation and Unknown Word Identification

FU Guohong
Dept. of Computer Science and Engineering, Harbin Institute of Technology, Harbin, 150001, P.R. China
fgh@insun.hit.edu.cn

WANG Xiaolong
Dept. of Computer Science and Engineering, Harbin Institute of Technology, Harbin, 150001, P.R. China
Dept. of Computing, The Hong Kong Polytechnic University, Hong Kong

                        Abstract

In this paper, we present an unsupervised model for Chinese word
segmentation based on the word formation power of character strings
(the word form model, WFM) and the affinity of character junctures
(the character juncture model, CJM). We also propose a formula to
measure the size of the segmentation space and adopt a two-way
segmentation algorithm in our system. Finally, we devise a modified
version of the Chinese word-formation patterns to identify unknown
words. Since all the parameters can be estimated directly from
unsegmented texts, the proposed approaches have strong adaptability
and have proved efficient in our preliminary experiments.

1   Introduction

The word is the smallest meaningful unit of any natural language and
plays a primary role in almost all NLP systems. However, there are no
explicit delimiters to indicate word boundaries in the Chinese writing
system except for some punctuation marks, so word segmentation is the
first step of Chinese NLP. Chinese word segmentation is by no means a
trivial process, due to ambiguities and unknown words, which are
mainly caused by the vagueness of Chinese word definition and the
incompleteness of Chinese dictionaries respectively. In the past, two
main approaches have been proposed for Chinese word segmentation,
namely the rule-based or heuristic approach (Wang, X.-L., et al.,
1989; Liang, N.-Y., Y.-B. Zheng, 1991; Yeh, C.-L., H.-J. Lee, 1991)
and the statistical approach (Chiang, T.-H., et al., 1992; Shiho
Nobesawa, et al., 1994; Law, H.H.C., C. Chan, 1996; Gan, K.-W., et
al., 1996; Fu, G.-H., et al., 1998). Each approach has its
inconveniences while offering its advantages (Nie, J.-Y., et al.,
1995; Gan, K.-W., et al., 1996).

Compared with rule-based approaches, statistical approaches have
stronger adaptability and are more practical in large applications
like machine translation. However, most statistical methods need a
large number of segmented texts to make an accurate estimate of the
corresponding parameters, and a large-scale segmented corpus is not
available for Chinese at present. To resolve this problem, an
unsupervised approach to Chinese word segmentation and unknown word
identification is tried in this paper.

For an input sequence of Chinese characters, there is usually more
than one probable sequence of words. In this paper, we first combine
the word formation power of Chinese character strings (the word form
model, WFM) with the affinity of character junctures (the character
juncture model, CJM) and present an unsupervised stochastic word
segmentation model to score the candidate word sequences of a given
sentence. Then we discuss the candidate segmentation space and adopt a
two-way optimizing algorithm to retrieve the most probable word
sequence. Finally, we modify the definitions of the four
word-formation patterns given by [Yao Yuan, 1997] and devise a
self-organizing method to detect unknown words.

2   Word Segmentation Models

As discussed above, the key problem of statistical methods such as
word-frequency-based segmentation is how to estimate the parameters
inexpensively. Recently, several unsupervised methods have been
proposed to address this problem. [Sproat, et al., 1996] proposed a
string frequency method (SF) to estimate word frequencies. It derives
the initial estimates from the corpus frequencies of the character
strings making up each word in the dictionary, whether or not each
string is actually an instance of the word in question, which
inevitably introduces double counts. Consequently, Sproat's method
tends to inflate the number of short words in the segmentation result.
[Nagata, 1997] devised a longest-match string frequency method (LSF).
It counts the instances of a string w1 in the text, unless the
instance is also a sub-string of another string w2 in the dictionary.
Though LSF avoids producing a large number of short words, it also has
its limits: if a word w1 is a sub-string of another word w2, and w1
always appears as a sub-string of w2 in the training text, the
frequency estimate of w1 becomes 0.
To resolve the above problems, in this section we present an improved
method to estimate word frequencies based on the word formation power
(WFP) of Chinese character strings. At the same time, the affinity
relation between characters (the character juncture model) is also
introduced into word segmentation to enhance the disambiguation power
of the model.

2.1   Word Form Model (WFM)

The word formation power (WFP) of a single character is studied in
(Nie, J.-Y., et al., 1995). However, the WFP of a Chinese character
string has not been studied so far.

Definition 1: Let D denote the segmentation dictionary. For a Chinese
character string C_i^j = c_i c_{i+1} ... c_j, if C_i^j can form a word
that is included in D, then C_i^j is a word form (represented as f
hereafter). If f is a sub-string of S, f is also a candidate word of S
to be segmented.

For a given word form f, its WFP is determined not only by its own
frequency but also by the summed frequencies of the word forms that
include f. Let f^+ denote a word form including f (f^+ \in D). Then
the WFP of f, denoted PF(f), is

    PF(f) = \frac{N(f)}{\sum_{f^+ \in D,\, f^+ \supseteq f} N(f^+)}    (1)

where N(\cdot) denotes the frequency of its argument, which can be
counted directly from a raw text corpus. A word form in S with a
higher PF(f) is more likely to be segmented as a word.

Let W = W_1^m = w_1 w_2 ... w_m denote a candidate word sequence of
the input sentence S, and let F = F_1^m = f_1 f_2 ... f_m denote its
corresponding word form sequence. Assume that every word is
independent of the other contextual information; then the total WFP of
W, i.e. the word form model (WFM), can be represented as

    PF(W) = PF(F) = \prod_{k=1}^{m} PF(f_k)    (2)

Equation (2) gives the total score of the candidate word sequence from
the point of view of the WFP of word forms. The higher PF(W) is, the
more likely the input sentence S is to be segmented as the word
sequence W.

2.2   Character Juncture Model (CJM)

Definition 2: Between each pair of adjacent characters in S is a
character juncture. There are two types of character juncture: word
boundary, denoted t_b, and non-word boundary, denoted t_f. Let
t(c_i c_{i+1}) denote the juncture type of a character pair
c_i c_{i+1}.

According to Definition 2, for a character string C_i^j of S
(1 \le i \le j \le n), if C_i^j is segmented as a word, then each
juncture inside C_i^j must be assigned the type t_f and each juncture
on the boundary of the string must be assigned the type t_b; i.e.,
t(c_k c_{k+1}) = t_f where i \le k < j, and t(c_k c_{k+1}) = t_b where
k = j or k = i - 1. From this juncture-type assignment point of view,
the probability that a character string C_i^j forms a word w, PJ(w),
can be given by

    PJ(w) = \Pr(t(c_j c_{j+1}) = t_b \mid c_j c_{j+1})
            \prod_{l=i}^{j-1} \Pr(t(c_l c_{l+1}) = t_f \mid c_l c_{l+1})    (3)

where \Pr(\cdot) is a probabilistic function. Assume that the context
words depend only on the word boundaries; then the total probability
of the juncture type assignment of a candidate word sequence W, viz.
the character juncture model (CJM), can be represented as

    PJ(W) = \prod_{k=1}^{m} PJ(w_k)    (4)

With the character juncture model, Chinese word segmentation can be
described as a process of assigning the most appropriate type to each
juncture within the input sentence, with which the input sentence is
segmented into a legitimate and meaningful sequence of words. Namely,
the goal of Chinese word segmentation is to find the word sequence W
with the largest probability PJ(W) in \Omega.

In a way, equations (3) and (4) give the affinity of a candidate word
in the input sentence and of a candidate word sequence respectively.
However, the parameters of these two formulas need to be trained on a
segmented text corpus, and a large-scale segmented corpus is commonly
expensive to acquire. The mutual information of characters is a good
measure of the relationship between a pair of characters in a string,
and it can be acquired inexpensively. To avoid the expensive training,
we therefore use the mutual information of characters to simplify the
character juncture model.
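Before turning to the mutual-information simplification, the word form
model of equations (1) and (2) can be sketched in a few lines. This is
a toy sketch over illustrative data; the helper names are ours, and a
real system would pre-index the counts rather than rescan the text.

```python
def raw_count(form, text):
    """N(.): occurrences of a character string in the raw text."""
    return sum(1 for i in range(len(text) - len(form) + 1)
               if text[i:i + len(form)] == form)

def pf(form, text, dictionary):
    """Equation (1): PF(f) = N(f) divided by the summed counts of every
    dictionary word form f+ that contains f (including f itself)."""
    containing = [w for w in dictionary if form in w]
    denom = sum(raw_count(w, text) for w in containing)
    return raw_count(form, text) / denom if denom else 0.0

def pf_sequence(forms, text, dictionary):
    """Equation (2): PF(W) = product of PF(f_k) over the candidate
    word form sequence."""
    score = 1.0
    for f in forms:
        score *= pf(f, text, dictionary)
    return score
```

For example, with the toy dictionary {"ab", "abc", "c"} and text
"abcab", "ab" occurs twice but the forms containing it ("ab" and
"abc") occur three times in total, so PF("ab") = 2/3: the estimate is
discounted for occurrences that may belong to the longer form, rather
than double-counted as in SF or zeroed as in LSF.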
Definition 3: For a pair of characters c_i c_{i+1} in S, the mutual
information is

    MI(c_i c_{i+1}) = \log \frac{\Pr(c_i c_{i+1})}{\Pr(c_i) \Pr(c_{i+1})}    (5)

Generally, for a pair of characters c_i and c_{i+1}, the larger
MI(c_i c_{i+1}) is, the larger \Pr(t(c_i c_{i+1}) = t_f) is;
conversely, the smaller MI(c_i c_{i+1}) is, the smaller
\Pr(t(c_i c_{i+1}) = t_f) is. This implies that character pairs within
a word have large mutual information, while character pairs across a
word boundary have small mutual information. Thus equation (3) can be
simplified as

    PJ(w) = \sum_{l=i}^{j-1} MI(c_l c_{l+1}) - MI(c_j c_{j+1})    (6)

With equation (6), all parameters of the character juncture model can
be derived through equation (5) from a raw unsegmented corpus.

2.3   Combining the Methods

The word form model represents the total WFP of a sequence of word
forms under the context-independence hypothesis, while the character
juncture model represents the combining relations of character pairs
within a word and the contextual influences on word boundaries.
However, for a word form f in the sentence being segmented, whether f
can be identified as a word depends not only on the WFP feature of f
itself but also on the character juncture features. Consequently, to
enhance segmentation performance, the two groups of information should
be introduced into word segmentation simultaneously. With linear
interpolation, the word form model and the character juncture model
can be merged into a hybrid model (HM), namely

    PH(W) = \sum_{k=1}^{m} [\lambda \log PF(f_k) + (1 - \lambda) PJ(w_k)]    (7)

where PH(W) represents the combined score of a candidate word sequence
W, \log(\cdot) is a logarithmic function, and \lambda
(0 \le \lambda \le 1) is the interpolation coefficient, which can be
determined experimentally. According to our experiments, the best
segmentation performance is achieved when \lambda = 0.9.

3   Segmentation Space and Algorithm

3.1   Candidate Segmentation Space

Definition 4: An input sentence S = C_1^n = c_1 c_2 ... c_n with n
Chinese characters may have more than one way of segmentation. The
candidate segmentation space, denoted \Omega, is the set of all
probable word sequences.

From the above definition, to give the size of \Omega is to determine
how many different possible word sequences there are for the input
sentence S. Let l_max denote the maximum word length over the system
dictionary D. For a character c_i (1 \le i \le n) of the input
sentence S, there is possibly a set of word forms beginning with c_i:
{f_k} = {C_i^j | C_i^j = c_i ... c_j \in D, 1 \le i \le j \le n,
1 \le j - i + 1 \le l_max}. Let n_i denote the size of this set, which
can be obtained by a forward dictionary-based matching process, and
let l_k denote the character count of the corresponding word form.
Let \Omega(i) denote the size of \Omega for the character string
C_i^n = c_i ... c_n of the sentence S = C_1^n = c_1 c_2 ... c_n; then

    \Omega(i) = \begin{cases} 1, & i = n + 1 \\
                \sum_{k=1}^{n_i} \Omega(i + l_k), & 1 \le i \le n \end{cases}    (8)

For the whole input sentence S = C_1^n, the size of \Omega is given by
\Omega_n = \Omega(1). Although equation (8) is recursive, we can
traverse the whole input sentence S backward and compute \Omega_n
incrementally. For example, the sentence 中国人民生活水平进入小康 (in
English: the living standards of the Chinese people have entered a
phase of moderate prosperity) has 168 potential word sequences; the
calculation is shown in Table 1.

           Table 1  Size of the segmentation space \Omega_n

    i          1    2    3    4    5    6    7    8    9    10   11   12
    S          中   国   人   民   生   活   水   平   进   入   小   康
    n_i        3    2    2    2    2    2    2    1    2    1    2    1
    \Omega(i)  168  84   52   32   20   12   8    4    4    2    2    1

3.2   Word Segmentation Algorithm

From the point of view of AI, to segment a sentence S is to search for
an optimal or correct word sequence W in the corresponding candidate
segmentation space \Omega_n.
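As a check, the backward traversal of equation (8) reproduces the
count of 168 from Table 1. The dictionary below is our reconstruction,
restricted to the word forms implied by the n_i row of the table; it
is not the system dictionary.

```python
def segmentation_space(sentence, dictionary, l_max=4):
    """Equation (8): Omega(i) = 1 for i = n+1, otherwise the sum of
    Omega(i + l_k) over the dictionary word forms starting at i."""
    n = len(sentence)
    omega = [0] * (n + 2)
    omega[n + 1] = 1                       # base case: empty suffix
    for i in range(n, 0, -1):              # backward traversal, as in the paper
        total = 0
        for length in range(1, min(l_max, n - i + 1) + 1):
            if sentence[i - 1:i - 1 + length] in dictionary:
                total += omega[i + length]
        omega[i] = total
    return omega[1]                        # Omega_n = Omega(1)

# Word forms reconstructed so that the per-position counts match n_i in
# Table 1 (assumed for illustration; the real dictionary is far larger).
dictionary = {"中", "中国", "中国人", "国", "国人", "人", "人民", "民", "民生",
              "生", "生活", "活", "活水", "水", "水平", "平", "进", "进入",
              "入", "小", "小康", "康"}
print(segmentation_space("中国人民生活水平进入小康", dictionary))   # 168
```

Because each \Omega(i) is computed once and reused, the size of the
space is obtained in O(n · l_max) time even though the number of word
sequences grows roughly exponentially in n.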
An effective approach to this kind of problem is to keep all candidate
segmentations in a lattice, score the lattice with the above
segmentation models, and finally retrieve the best candidate word
sequence with an optimal search algorithm. Based on the word form
lattice and the exhaustive segmentation principle, a two-way search
algorithm incorporating backward dynamic programming with forward A*
stack decoding is adopted here. The algorithm consists of three main
stages. In the first stage, all possible candidate words of the input
sentence are exhaustively selected from the segmentation dictionary,
and the corresponding word form lattice is constructed at the same
time. In the second stage, the dynamic programming module traverses
the whole lattice backward to compute the heuristic function for the
forward stack decoding. Finally, the forward A* stack decoding
algorithm scores the whole lattice with equation (2), (4), or (7) and
extracts the most promising word sequence from the word form lattice.
A detailed description of the algorithm is presented in (Fu, G.-H., et
al., 1998).

4   Unknown Word Identification

Most unknown Chinese words are proper nouns, such as Chinese personal
names (e.g., 朱镕基 总理, Premier ZHU Rongji), place names (e.g.,
綦江 县, Qijiang County), and transliterated names (e.g., 克林顿 总统,
President Clinton), which are built on rather arbitrary word-formation
rules and lack explicit marks for identification, such as the capital
letters of English. This makes the detection of unknown words quite
difficult. Another group of new words are derived words (e.g., 一次性,
one-off) and reduplicated expressions such as 高高兴兴 and 研究研究.
Fortunately, almost all unknown words in segmented sentences are
segmented as single-character words, so the detection of unknown words
becomes a process of combining these successive single-character words
within a focused scope. [Yao Yuan, 1997] proposed a statistical
unknown-word detection method based on four word-formation patterns
and head-middle-tail structures. The problem with Yao's method is that
the parameters of these word-formation patterns must be derived from
segmented text. To reduce the effort of human supervision, in this
section we introduce the word form frequency defined in section 2,
modify the definitions of the word-formation patterns given by [Yao
Yuan, 1997], and devise an unsupervised method for unknown word
identification.

4.1   Function Character Elimination

Though unknown words in segmented sentences are usually segmented as
single-character words, we cannot freely merge these single-character
words into multi-character words, because such single-character-word
strings frequently contain function characters, such as 的 (of) and 了
(already). Therefore, we should first identify the function characters
in the candidate character string of a new word. According to the
hypothesis of (Nie, J.-Y., et al., 1995), function characters have
relatively low word formation power. The word formation power (WFP) of
a given character c may be defined as follows:

    WFP(c) = count(c is in a multi-character word form) / count(c)    (9)

Characters with a WFP lower than a threshold are recognized as
function characters.

4.2   Word-formation Patterns of Chinese Characters

A Chinese character c may take any of the following word-formation
patterns in forming multi-character words: (1) pttn(H): head (used as
a prefix of other words); (2) pttn(T): tail (used as a suffix of other
words); (3) pttn(M): middle (used as a middle character of other
words). From the statistical point of view, we can easily give a
stochastic version of the word-formation patterns. Unlike the
definition given in [Yao Yuan, 1997], here we use word form
frequencies to approximate the related word frequencies, which enables
us to build the word-formation patterns of each Chinese character from
unsegmented texts. We may then redefine the Chinese word-formation
patterns as follows.

Definition 5: Let \Pr(pttn(i) \mid c) denote the conditional
probability of a certain word-formation pattern for character c, where
pttn is the state that c is in and i is a word-formation pattern. Then
\Pr(pttn(i) \mid c) is defined as follows:

    \Pr(pttn(H) \mid c) = count(c is a prefix) / count(c is in a multi-character word form)    (10)

    \Pr(pttn(T) \mid c) = count(c is a suffix) / count(c is in a multi-character word form)    (11)

    \Pr(pttn(M) \mid c) = count(c is a middle character) / count(c is in a multi-character word form)    (12)

where \Pr(pttn(H) \mid c) + \Pr(pttn(T) \mid c) + \Pr(pttn(M) \mid c) = 1.
Based on these patterns, a multi-character word may be formed by
characters in one of two obvious head-middle-tail structures:
head + tail, and head + middle + ... + middle + tail. Whether a
single-character-word string s = c_1 c_2 ... c_n contains new words or
not is determined by the probability of its head-middle-tail
structures. Let \Pr(s_k) denote the probability of a structure
s_k = c_i c_{i+1} ... c_j (1 \le i \le j \le n) in s; then

    \Pr(s_k) = \Pr(pttn(H) \mid c_i) \Pr(pttn(T) \mid c_j)
               \prod_{l=i+1}^{j-1} \Pr(pttn(M) \mid c_l)    (13)

Among all possible structures in the single-character-word string s,
the structure s_k with the largest probability \Pr(s_k), provided this
probability is larger than some threshold, is identified as a new
word. A detailed description of the algorithm can be found in [Yao
Yuan, 1997].

5   Experiments

5.1   Experimental Data

Based on the above principles, a Chinese word segmentation system has
been developed. The system uses a segmentation dictionary containing
about 53,890 entries and the GB2312-80 standard Chinese character set
with 6,367 characters.

As shown in Table 2, a Chinese text corpus of about 18,740,000
characters was built from the People's Daily (人民日报) for training.
Test set 1 is constructed for segmentation accuracy evaluation, and
all its words are included in the system dictionary. Test sets 2 and 3
are used to evaluate disambiguation power and unknown word
identification performance respectively. The effect of the
inconsistency of word definition between the system dictionary and the
test corpus is not considered in our tests.

                 Table 2  The training and test data

                          Training set  Test set 1  Test set 2  Test set 3
    Number of characters  18,740,000    212,374     7,794       3,695
    Number of words       -             134,454     5,079       2,337
    Number of sentences   -             4,923       998         249

Two types of performance measures are adopted to evaluate our system:
(1) segmentation accuracy \alpha, expressed as the percentage of the
number of words segmented correctly (Ncw) over the number of words in
the test text (Ntw); (2) disambiguation rate or unknown word
identification rate \beta, defined as the percentage of the number of
fragments disambiguated or identified correctly (Nca) over the number
of ambiguous or unknown word fragments in the test set (Nta).

5.2   Experimental Results

          Table 3  Accuracy \alpha of the proposed models

    Models        CJM      WFM      HM
    Test set 1    98.81    99.06    99.25

       Table 4  Disambiguation rate of the proposed models

    Test          Nta      WFM            CJM            HM
    sample                 Nca    \beta   Nca    \beta   Nca    \beta
    Test set 2    1004     701    69.82   828    82.47   852    84.86

Tables 3 and 4 show the word accuracy and the disambiguation rate of
the proposed segmentation models on test sets 1 and 2 respectively.
They show that the overall performance of the hybrid model (HM) is the
best among the proposed models. Without considering the effects of
unknown words, about 99.25% of words are segmented correctly and about
84.86% of ambiguous fragments are disambiguated by the hybrid model.
This result shows that combining the word form model and the character
juncture model is beneficial to segmentation performance. In Table 4,
the disambiguation rate of CJM is better than that of WFM by about
15%; this is possibly because CJM introduces the contextual effects on
word boundaries into segmentation.

            Table 5  Unknown word identification rate

    Models        Nta      Nca      \beta
    Test set 3    259      211      81.47

Table 5 gives the results of unknown word identification on test set
3. As the table shows, over 80% of the unknown words in the test set
are correctly detected by the proposed unknown-word recognizer.

      Table 6  Comparison of different methods on accuracy \alpha

    Methods       FMM      SF       LSF      WFM
    Test set 1    98.51    99.02    99.15    99.06

In addition, other segmentation approaches, such as forward maximum
matching (FMM), the string frequency method (SF), and the longest-
match string frequency method (LSF), are also included in our
segmentation accuracy evaluation for comparison. As shown in Table 6,
the accuracy of the proposed WFM is higher than that of FMM and SF,
and only slightly lower than that of LSF, which implies that word
formation power based word segmentation is effective.
6   Conclusion

In this paper, we have presented a self-organizing model for Chinese
word segmentation based on the word formation power of Chinese
character strings and the affinity of character junctures. We have
also proposed a formula to measure the size of the segmentation space
and adopted a two-way optimizing algorithm for segmentation. In
addition, an unsupervised unknown word identification method based on
Chinese word-formation patterns has been tried in our research.
Without considering the effects of the inconsistency of word
definition between the system dictionary and the test corpus, we have
achieved 99.25% word segmentation accuracy, an approximately 85%
disambiguation rate, and an 81.47% unknown word identification rate in
the primary tests. This shows that our approach is reliable and
efficient. Furthermore, our approaches have strong adaptability
because their parameters can be trained inexpensively. We believe that
our approaches can be applied extensively in Chinese information
processing.

                       Acknowledgements

This work was supported in part by the National 863 Plan
(863-ZT-03-02-3) and the 1999 Outstanding Youth Fund of Heilongjiang
Province.

                          References

Wang, X.-L., et al. 1989. The Problem of Separating Characters into
  Fewest Words and Its Algorithms. Chinese Science Bulletin (English
  edition), 34, pp. 1924-1928

Liang, N.-Y., Y.-B. Zheng. 1991. A Chinese Word Segmentation Model and
  a Chinese Word Segmentation System PC-CWSS (in Chinese).
  Communications of COLIPS, 1, pp. 51-55

Yeh, C.-L., H.-J. Lee. 1991. Rule-Based Word Identification for
  Mandarin Chinese Sentences – A Unification Approach. Computer
  Processing of Chinese & Oriental Languages, 5, pp. 97-117

Chiang, T.-H., J.-S. Chang, M.-Y. Lin, and K.-Y. Su. 1992. Statistical
  Models for Word Segmentation and Unknown Word Resolution. In
  Proceedings of ROCLING-V, ROC Computational Linguistics Conferences,
  Taiwan, pp. 123-146

Shiho Nobesawa, et al. 1994. Segmenting a Sentence into Morphemes
  Using Statistic Information Between Words. In Proceedings of
  COLING'94, Tokyo, Japan, pp. 227-233

Wong, P.-K. and C.-K. Chan. 1996. Chinese Word Segmentation Based on
  Maximum Matching and Word Binding Force. In Proceedings of
  COLING'96, Copenhagen, Denmark, pp. 200-203

Nie, J.-Y., M.-L. Hannan, W.-Y. Jin. 1995. Unknown Word Detection and
  Segmentation of Chinese Using Statistical and Heuristic Knowledge.
  Communications of COLIPS, 5, pp. 47-57

Law, H.H.C., C. Chan. 1996. N-th Order Ergodic Multigram HMM for
  Modeling of Language without Marked Word Boundaries. In Proceedings
  of COLING'96, Copenhagen, Denmark, pp. 204-209

Gan, K.-W., M. Palmer and K.-T. Lua. 1996. A Statistically Emergent
  Approach for Language Processing: Application to Modeling Context
  Effects in Ambiguous Chinese Word Boundary Perception. Computational
  Linguistics, 22, pp. 531-553

Sproat, R., C. Shih, W. Gale, and N. Chang. 1996. A Stochastic
  Finite-State Word Segmentation Algorithm for Chinese. Computational
  Linguistics, 22(3), pp. 377-404

Masaaki Nagata. 1997. A Self-Organizing Japanese Word Segmenter Using
  Heuristic Word Identification and Re-estimation. In Proceedings of
  the 5th Workshop on Very Large Corpora, Beijing, China, pp. 203-215

Fu, G.-H., X.-L. Wang, Y.-H. Gong. 1998. Word Form Based Chinese Word
  Segmentation (in Chinese). In Fifth National Conference on
  Man-Machine Speech and Communication (NCMMSC-98), Harbin, P.R.
  China, pp. 328-332

Yao Yuan. 1997. Statistics Based Approaches Towards Chinese Language
  Processing. Ph.D. thesis, National University of Singapore, pp. 37-41