Fuzzy K-means Clustering with Missing Values
Manish Sarkar and Tze-Yun Leong
Department of Computer Science, School of Computing
National University of Singapore
Lower Kent Ridge Road, Singapore: 119260
{manish, leongty}

Fuzzy K-means clustering algorithm is a popular approach for exploring the structure of a set of patterns, especially when the clusters are overlapping or fuzzy. However, the fuzzy K-means clustering algorithm cannot be applied when the data contain missing values. In many cases, the number of patterns with missing values is so large that if these patterns are removed, then the number of patterns to characterize the data set is insufficient. This paper proposes a technique to exploit the information provided by the patterns with the missing values so that the clustering results are enhanced. There are various preprocessing methods to substitute the missing values before clustering the data. However, instead of repairing the data set at the beginning, the repairing can be carried out incrementally in each iteration based on the context. It is thus more likely that less uncertainty is added while incorporating the repair work. Fine-tuning the missing values using the information from other attributes further consolidates this scheme. Applications of the proposed method in medical domain have produced good performance.

Keywords: Fuzzy K-means clustering and missing values.

1. Introduction

Motivation: In medicine and biology, we often need exploratory analysis like grouping the patterns such that the patterns within the same cluster have a high degree of similarity, and the patterns from different clusters have a high degree of dissimilarity. Clustering can be formally defined as follows [1]: Given a set of data $X = \{x_1, x_2, \dots, x_n\} \subseteq \mathbb{R}^N$, find an integer $K$ ($2 \le K \le n$) and K number of partitions of $X$ that exhibit categorically homogeneous subsets.

Importance of clustering: Some tasks for which the clustering algorithms can be employed are as follows:
1. Clustering can abstract or compress certain properties of the data set.
2. A classifier can be constructed through clustering. To build a classifier, we group a data set, and subsequently assign a class label (crisp or fuzzy) to each cluster. The class label of a new pattern is determined based on the cluster in which the pattern falls.
3. Clustering can be applied to decide whether the representation of a problem on computers is appropriate for processing. If the representation is not appropriate, then the data set behaves like a set of random numbers without any underlying regularity. In that case, the bad clustering results indicate that the user needs to modify the representation of the problem.

Basics of clustering: Three types of clustering approaches are commonly used [1]. They are (1) hierarchical approach, (2) graph theoretic approach, and (3) objective function-based approach. The objective function-based approach is very popular. One extensively used objective function-type clustering algorithm is hard K-means clustering algorithm [1]. It assigns each pattern exactly to one of the clusters assuming well-defined boundaries between the clusters. However, there may be some patterns that belong to more than one cluster. In order to overcome this problem, the idea of fuzzy K-means (FKM) algorithm has been introduced. Unlike the hard K-means, in the FKM each input pattern belongs to all the clusters with different degrees or membership values. Incorporation of the fuzzy theory in the FKM algorithm makes it a generalized version of the hard K-means algorithm. From the psycho-physiological point of view, the problem of pattern clustering is unsuitable for approaches with precise mathematical formulations.

However, the FKM algorithm cannot be applied to the real-life clustering problems when the data contain missing values. The missing values in a pattern imply that the values of some of the attributes of the pattern are unknown. Missing values can occur due to various reasons like (a) patient entries for some attributes are irrelevant or unknown, (b) in the questioning session, the patient did not want to provide the values, (c) errors have led to incomplete attributes, (d) random noises have led to some impossible values, and they have been removed intentionally, (e) patients have died before an experiment was finished.

Problem definition: This paper addresses how to apply the FKM algorithm efficiently in the presence of missing values. It is assumed that the values are missing at random, i.e., the probability of missing a value does not depend on the quantity of the value [6].
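
A minimal sketch of the obstacle described above, assuming that a missing attribute is encoded as NaN in a NumPy array (an encoding chosen here for illustration; the paper does not prescribe one) and using made-up three-attribute patterns. The distance from an incomplete pattern to a cluster center is undefined, so the standard FKM updates cannot use that pattern directly.

```python
import numpy as np

# Hypothetical 3-attribute patterns; np.nan marks a missing attribute value.
complete = np.array([2.0, 5.0, 1.0])
incomplete = np.array([2.0, np.nan, 1.0])
center = np.array([1.0, 4.0, 1.0])

def sq_euclidean(m, x):
    """Squared Euclidean distance between a center m and a pattern x."""
    return float(np.sum((m - x) ** 2))

d_complete = sq_euclidean(center, complete)      # 1 + 1 + 0 = 2.0
d_incomplete = sq_euclidean(center, incomplete)  # NaN propagates through the sum

print(d_complete, d_incomplete)
```

Dropping every such pattern avoids the NaN but discards the information carried by its observed attributes, which is the loss the paper sets out to avoid.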
Related work: The approaches to deal with missing values can be categorized into the following groups [3]:
Deductive imputation: Missing values are deduced with certainty, or with high probability, from the other information of the pattern.
Hot-deck imputation: Missing values are replaced with values from the closest matching patterns.
Mean-value imputation: The mean of the observed values is used to replace the missing values.
Regression-based imputation: Missing values are replaced by the predicted values from a regression analysis.
Imputation using Expectation-Maximization: Missing values are repaired in two steps. In the E-step, the expected value of the log-likelihood is calculated, and in the M-step, the missing values are substituted by the expected values. Then the likelihood function is maximized as if no data were missing.

Overview of the proposed method: Most of the current methods repair or impute the missing values before the clustering starts. This paper attempts to repair the missing data while performing clustering. Exploiting this trick is difficult because while updating a cluster center, the distance between the pattern with missing values and the cluster center cannot be measured. Using the law of large numbers, if we assume that the distances between the cluster center and the patterns form a Gaussian distribution, then the distance between a pattern with missing values and the cluster center can be replaced by the weighted mean of the distances between the cluster center and the complete patterns. The missing values are further fine-tuned by exploiting the information from the other attributes.

2. Background

2.1 Fuzzy Sets

In traditional two-state classifiers, where a class $C$ is defined as a subset of the universal set $X$, any input pattern $x \in X$ can either be a member or not be a member of the given class $C$. This property of whether or not a pattern $x$ of the universal set belongs to the class $C$ can be defined by a characteristic function $\mu_C : X \to \{0,1\}$ as follows:

$$\mu_C(x) = \begin{cases} 1 & \text{iff } x \in C \\ 0 & \text{otherwise} \end{cases} \tag{1}$$

In real-life situations, boundaries between the classes may be overlapping. Hence, it is uncertain whether an input pattern belongs totally to the class $C$. To consider such situations, in fuzzy sets [1] the concept of the characteristic function has been modified to the fuzzy membership function $\mu_C : X \to [0,1]$. This function is called a membership function because a larger value of the function denotes more membership of the element to the set under consideration.

2.2. Fuzzy K-Means Clustering

Clustering a data set $X \subseteq \mathbb{R}^N$ implies that the data set is partitioned into K clusters such that each cluster is compact and far from other clusters. One way to achieve this goal is through the minimization of the distances between the cluster center and the patterns that belong to the cluster. Using this principle, the hard K-means algorithm minimizes the following objective function [8]:

$$J = \sum_{k=1}^{K} \sum_{x_i \in F_k} d(m_k, x_i) \tag{2}$$

where $d(m_k, x_i)$ is a distance measure between the center $m_k$ of the cluster $F_k$ and the pattern $x_i \in X$. Eqn. (2) can be rewritten as

$$J = \sum_{k=1}^{K} \sum_{i=1}^{n} \mu_k(x_i)\, d(m_k, x_i) \tag{3}$$

where $\mu_k(x_i) \in \{0,1\}$ is the characteristic function, i.e., $\mu_k(x_i) = 0$ if $x_i \notin F_k$, else $\mu_k(x_i) = 1$. When the clusters are overlapping, each pattern may belong to more than one cluster, i.e., $\mu_k(x_i) \in [0,1]$. Hence, $\mu_k(x_i)$ should be interpreted as a membership function rather than the characteristic function. Therefore, the objective function (3) can be modified to the following:

$$J = \sum_{k=1}^{K} \sum_{i=1}^{n} \mu_k^q(x_i)\, d(m_k, x_i) \tag{4}$$

where $\mu_k(x_i) \in [0,1]$ is now a fuzzy membership function, and $q$ is a constant known as the index of fuzziness that controls the amount of fuzziness. Since the minimization of the objective function (4) may lead to a trivial solution, the following two constraints are satisfied while minimizing the objective function:

$$\sum_{i=1}^{n} \mu_k(x_i) > 0 \quad \forall k \in \{1, 2, \dots, K\} \tag{5}$$

$$\sum_{k=1}^{K} \mu_k(x_i) = 1 \quad \forall i \in \{1, 2, \dots, n\} \tag{6}$$

The first constraint guarantees that there is no empty cluster, and the second constraint imposes the condition that each pattern needs to share its membership with all the clusters such that the sum of memberships is equal to one. Differentiating the objective function (4) with the constraints (5) and (6), we obtain

$$\mu_k(x_i) = \frac{1}{\sum_{h=1}^{K} \left( \dfrac{d(m_k, x_i)}{d(m_h, x_i)} \right)^{2/(q-1)}} \quad \forall i \in \{1, \dots, n\},\ k \in \{1, \dots, K\} \tag{7}$$
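
The membership update of Eqn. (7) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the function name `fkm_memberships` is hypothetical, d is assumed to be the plain Euclidean distance, and no pattern is assumed to coincide with a cluster center (the zero-distance case needs separate handling).

```python
import numpy as np

def fkm_memberships(X, centers, q=2.0):
    """Membership update of Eqn. (7) for complete patterns.

    X: (n, N) patterns; centers: (K, N) cluster centers; q > 1 is the
    index of fuzziness. Returns an (n, K) matrix U with U[i, k] = mu_k(x_i).
    Assumes Euclidean d and that every distance is strictly positive.
    """
    # dist[i, k] = d(m_k, x_i)
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    # ratio[i, k, h] = d(m_k, x_i) / d(m_h, x_i)
    ratio = dist[:, :, None] / dist[:, None, :]
    return 1.0 / np.sum(ratio ** (2.0 / (q - 1.0)), axis=2)

# Toy data: three patterns on a line, two centers.
X = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 0.0]])
centers = np.array([[0.5, 0.0], [4.5, 0.0]])
U = fkm_memberships(X, centers, q=2.0)
print(U)
print(U.sum(axis=1))  # each row sums to 1 (up to rounding), as in constraint (6)
```

With q = 2 the exponent is 2, so the first pattern, at distances 0.5 and 4.5 from the two centers, receives memberships 81/82 and 1/82.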
$$m_k = \frac{\sum_{i=1}^{n} \mu_k^q(x_i)\, x_i}{\sum_{i=1}^{n} \mu_k^q(x_i)}, \quad k = 1, 2, \dots, K \tag{8}$$

Eqn. (7) and (8) are used in an iterative fashion to update the memberships and the cluster centers. The updating continues until the changes in the membership values of all the patterns become negligible or the required number of iterations is over (Fig. 1).

The worst-case time complexity of the algorithm is as follows: To find the distance between the cluster center and all the patterns, we need $O(nN)$ computations. For all the clusters, the number of computations needed is $O(nNK)$. If the clustering needs T iterations, then the worst-case complexity is $O(nNKT)$.

INPUT:
(1) A set of input data $X$.
(2) The value of the fuzziness index $q \in (1, \infty)$.
(3) Number of clusters $K$.
(4) A distance measure $d(m_k, x_i) = (m_k - x_i)' A^{-1} (m_k - x_i)$ between $m_k$ and $x_i$, where $A$ is a positive definite matrix.
(5) A small, positive constant $\varepsilon$, and an appropriate matrix norm $\|\cdot\|$.
(6) Maximum number of iterations T.
(7) An $n \times K$ matrix $U$, where the element of the ith row and the kth column indicates $\mu_k(x_i)$.

Assign $t = 0$.
Randomly initiate the fuzzy K-partition of $U$.
DO
    Set $t = t + 1$.
    FOR $k = 1, 2, \dots, K$
        Calculate the cluster center $m_k$ using
        $m_k = \sum_{i=1}^{n} \mu_k^q(x_i)\, x_i \,/\, \sum_{i=1}^{n} \mu_k^q(x_i)$
    ENDFOR
    Compute $U^{t+1}$ from $U^{t}$ as follows:
    Determine the content of the following set:
    $I_k = \{k \mid 1 \le k \le K;\ d(m_k, x_i) = 0\}$
    IF $I_k = \emptyset$,
        $\mu_k(x_i) = 1 \,/\, \sum_{h=1}^{K} \left( d(m_k, x_i) / d(m_h, x_i) \right)^{2/(q-1)}$
    ELSE
        $\mu_k(x_i) = 0 \ \forall k \in \{1, 2, \dots, K\} - I_k$ and $\sum_{k \in I_k} \mu_k(x_i) = 1$
    ENDIF
ENDDO UNTIL $\|U^{t} - U^{t+1}\| < \varepsilon$ OR $t \ge T$

OUTPUT:
(1) $\mu_k(x_i)\ \forall i, k$, i.e., the belongingness of the patterns in the clusters.
(2) $u = \arg\max_k \mu_k(x_i)$. $u$ denotes the cluster to which $x_i$ belongs when the membership is considered crisp.

Fig. 1: Fuzzy K-means algorithm.

3. Proposed Method

Algorithm: Let all the missing values in the data set $X$ occur in the dth attribute. We shall relax this constraint later. Let us call the set of all the patterns with missing values $Z$, and the set of all complete patterns $Y$ (i.e., $X = Y \cup Z$). Each pattern $z_j = [z_{j1}, z_{j2}, \dots, z_{j(d-1)}, ?, z_{j(d+1)}, \dots, z_{jN}]' \in Z$ can be made complete by substituting $z_{jd}$ by $\frac{1}{|Y|} \sum_{y \in Y} y_d$, where $|Y|$ indicates the cardinality of the set $Y$ and $[u]'$ indicates the transpose of $[u]$. Subsequently, the standard FKM can be applied to the data set since there is no missing value in the data set.

However, we can modify the clustering algorithm so that the substitution operation is more context dependent. In the clustering, we need the substitution operation while finding the distance between a cluster center (say kth) and an incomplete pattern. We can fill the pattern at that point of time, and thus, we fill the pattern differently and incrementally for each cluster center. Therefore, instead of filling $z_{jd}$ by $\frac{1}{|Y|} \sum_{y \in Y} y_d$, we fill $(z_{jd} - m_{kd})^2$ by the mean of $\{(y_{id} - m_{kd})^2 \mid i = 1, 2, \dots, |Y|\}$, i.e., $\frac{1}{|Y|} \sum_{i=1}^{|Y|} (y_{id} - m_{kd})^2$. Here the assumption is that the members of $\{(y_{id} - m_{kd})^2 \mid i = 1, 2, \dots, |Z|\}$ are i.i.d. (independent and identically distributed), and hence, from the law of large numbers, they form a Gaussian distribution. In the above procedure, we treat each complete pattern $y_i$ equally. However, the complete patterns that are close to $m_{kd}$ should influence the update of the cluster center more. In other words, we can use the concept of weighted mean instead of a simple mean. Hence, we choose the weights as the membership
values. Thus, $(z_{jd} - m_{kd})^2$ is substituted by

$$\frac{\sum_{i=1}^{|Y|} \mu_k(y_i)\,(y_{id} - m_{kd})^2}{\sum_{i=1}^{|Y|} \mu_k(y_i)}.$$

The substituted value above is the same for all patterns with missing values, although some of the patterns with missing values are very close to the cluster center $m_{kd}$ and some are far away from $m_{kd}$. If we assume that the weighted distance $(z_{jd} - m_{kd})^2 / \sigma_{kd}^2$ linearly depends on the weighted distance between $z_{id}, \forall i \ne j$, and $m_{kd}$, then we can estimate $(z_{jd} - m_{kd})^2 / \sigma_{kd}^2$ using the following linear regression or weighted mean:

$$
\begin{aligned}
\frac{(z_{jd} - m_{kd})^2}{\sigma_{kd}^2} ={}& w_1 \frac{(z_{j1} - m_{k1})^2}{\sigma_{k1}^2} + \dots + w_{d-1} \frac{(z_{j(d-1)} - m_{k(d-1)})^2}{\sigma_{k(d-1)}^2} \\
&+ w_d \, \frac{1}{\sigma_{kd}^2} \, \frac{\sum_{i=1}^{|Y|} \mu_k(y_i)\,(y_{id} - m_{kd})^2}{\sum_{i=1}^{|Y|} \mu_k(y_i)} \\
&+ w_{d+1} \frac{(z_{j(d+1)} - m_{k(d+1)})^2}{\sigma_{k(d+1)}^2} + \dots + w_N \frac{(z_{jN} - m_{kN})^2}{\sigma_{kN}^2}
\end{aligned}
\tag{9}
$$

where $w_h$ indicates the importance of the hth attribute,

$$m_k = \frac{\sum_{i=1}^{|Y|} \mu_k(y_i)\, y_i}{\sum_{i=1}^{|Y|} \mu_k(y_i)} \quad \text{and} \quad \sigma_{kh}^2 = \frac{\sum_{i=1}^{|Y|} \mu_k(y_i)\,(y_{ih} - m_{kh})^2}{\sum_{i=1}^{|Y|} \mu_k(y_i)}.$$

The importance $w_h$ can be determined by using some a priori knowledge or by using some feature extraction algorithms (when the data are labeled). In this paper, we are not assuming that we know the importance of the attributes, and hence we are distributing the importance equally among all the attributes by making $w_h = 1/N\ \forall h \in \{1, 2, \dots, N\}$.

So far we have shown all the derivations when the values are missing only in the dth attribute. A similar procedure can be adopted when we have patterns with missing values in more than one attribute. Thus, the modified FKM needs some extra steps to consider the incomplete patterns.

Particular case: The mean-value imputation, in which the missing value $z_{jd}$ of the pattern $z_j$ is replaced by $\frac{1}{|Y|} \sum_{i=1}^{|Y|} y_{id}$, can be derived from the proposed method when (a) the cluster centers are assumed to be at the origin, (b) all the patterns receive equal importance, (c) $w_h = 0\ \forall h \ne d$ and $w_d = 1$, and (d) the repairing is done only in the first iteration. Moreover, if $w_d = 0$ and all $\sigma_h^2\ \forall h \in \{1, 2, \dots, N\}$ are equal, then the proposed algorithm reduces to that of [8].

Convergence: When the missing value occurs only in the dth attribute, we partition the data set into the two sets $Y$ and $Z$. If we use the proposed algorithm for this type of data set, we actually minimize the following objective function:

$$J = \sum_{i=1}^{|Y|} \sum_{k=1}^{K} \mu_k^q(y_i)\,[d(m_k, y_i)]^2 + \sum_{i=1}^{|Z|} \sum_{k=1}^{K} \mu_k^q(z_i)\,[d(m_k, z_i)]^2 \tag{10}$$

It is straightforward to show that the objective function (10) under the constraints (5) and (6) is monotonically decreasing, and hence, the iterative minimization guarantees the convergence. The same result holds if the values are missing in more than one attribute.

Time complexity: Let us first look at the time complexity when the values are missing only in the dth attribute. For finding the mean and variance of all complete patterns, we need $O(|Y|)$ computations in each iteration. For each iteration and cluster center, we require $O(|Z|N)$ computations due to the regression. Since $n = |X| = |Y \cup Z|$, the time complexity for all the cluster centers and iterations is bounded by $O(nNTK)$.

When the missing values occur in more than one attribute, the worst-case time complexity becomes $O(nN^2TK)$. Since in practical cases $N \ll n$, the repair work does not significantly change the order of the time complexity of the original FKM algorithm.

Quality of clustering: The quality of the clustering can be measured in two ways: directly or indirectly. In the direct method, we can apply some cluster validity measures to check whether the quality of the clustering is improving. In the indirect method, we cluster the data using the proposed method, and then the clusters are utilized to build the classifiers. The classifier performance is used as an indirect way to quantify the quality of the clustering. Note that this is possible only when the data are labeled.

4. Results and Discussion
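
The indirect evaluation used in this section labels each cluster fuzzily and then classifies test patterns by the clusters they fall in. The paper does not spell out the labeling rule, so the following is only one plausible reading, written as a sketch: a cluster's fuzzy label is taken to be the membership-weighted class proportions of the training patterns, and a test pattern is assigned the class with the largest membership-weighted score. The function names and the toy membership values are hypothetical.

```python
import numpy as np

def fuzzy_cluster_labels(U_train, y_train, n_classes):
    """Fuzzy label of each cluster: membership-weighted class proportions.

    U_train: (n, K) memberships of the training patterns.
    y_train: (n,) integer class indices.
    Returns a (K, n_classes) matrix whose rows sum to 1.
    This labeling rule is an assumption, not taken from the paper.
    """
    one_hot = np.eye(n_classes)[y_train]   # (n, n_classes)
    weights = U_train.T @ one_hot          # (K, n_classes)
    return weights / weights.sum(axis=1, keepdims=True)

def classify(U_test, cluster_labels):
    """Pick the class with the largest membership-weighted label score."""
    return np.argmax(U_test @ cluster_labels, axis=1)

# Toy memberships of 4 training patterns in K = 2 clusters.
U_train = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
y_train = np.array([0, 0, 1, 1])  # e.g. 0 = benign, 1 = malignant
labels = fuzzy_cluster_labels(U_train, y_train, n_classes=2)

# Memberships of 2 complete test patterns, e.g. obtained from Eqn. (7).
U_test = np.array([[0.7, 0.3], [0.25, 0.75]])
pred = classify(U_test, labels)
print(labels)
print(pred)
```

The classification rate is then the fraction of test patterns whose predicted class matches the true one.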
We have conducted the experiments on the Wisconsin-Madison Breast Cancer data from the UCI machine learning repository [2]. We have compared the result of the proposed algorithm with that of mean substitution, hot deck, regression, EM and C4.5 algorithms. The presence of a breast mass may indicate (but not always) malignant cancer. The University of Wisconsin Hospital has collected 699 samples using the fine needle aspiration test. Each sample consists of the following ten attributes: (1) patient's i.d., (2) clump thickness, (3) uniformity of cell size, (4) uniformity of cell shape, (5) marginal adhesion, (6) single epithelial cell size, (7) bare nuclei, (8) bland chromatin, (9) normal nucleoli and (10) mitosis. Except for the patient's i.d., every measurement is assigned an integer value between 1 and 10, with 1 being closest to benign and 10 the most anaplastic. Each sample is either benign or malignant.

The data set contains 16 samples, each with one missing attribute. Since the number of missing values is small, we introduced more missing values with probability 0.25 to all attributes of each pattern. Using the t-test, we first ensured that the data are missing at random. We measure the quality of the clustering in the indirect way, i.e., through classification performance. We partition the data set into training and test sets. The training set contains some patterns with missing values, but the test set contains only complete patterns. Using the proposed technique, the training set is grouped into K clusters, and each cluster is fuzzily labeled. Next, each pattern of the test set is classified based on which fuzzy clusters it falls in. A similar scheme is also used with four other imputation techniques, and the classification performances of these techniques are shown in Table 1. The proposed method performs better than the other methods.

The advantages of the proposed method are: (a) the substitution of a particular missing value is carried out differently for different cluster centers, and (b) the substitution is carried out incrementally so that better clusters are formed. The limitations of the proposed method arise from the assumptions that it requires: (a) the members of $\{(z_{jd} - m_{kd}) \mid j = 1, 2, \dots, |Z|\}$ are i.i.d., and (b) the attribute with missing values linearly depends on the other attributes. In the future, we would attempt to relax these assumptions. In addition to medical problems, we intend to apply the proposed technique to cluster the microarray genomic data, where missing values are encountered quite often due to the limitations of the

Table 1: Comparative results of the proposed method with respect to other methods.
  Techniques                          Classification rates
  FKM with hot deck                   92.67%
  FKM with mean substitution          93.18%
  FKM with regression                 95.67%
  FKM with EM algorithm               96.34%
  FKM with the proposed method        98.43%
  C4.5 after pruning                  94.31%

Acknowledgments: A strategic research grant RP960351 from NSTB and the Ministry of Education, Singapore, has supported this work.

References

[1] Bezdek, J. C. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York, 1981.
[2] Blake, C. L. and C. J. Merz. UCI repository of machine learning databases, 1998.
[3] Ghahramani, Z. and M. I. Jordan. Learning from incomplete data. Technical report, AI memo no. 1509, MIT, 1994.
[4] Heitjan, D. F. Annotation: What can be done about missing data? Approaches to imputation. American Journal of Public Health, vol. 87, no. 4, pp. 548-550, 1997.
[5] Heitjan, D. F. and R. Thomas. Missing data, types of. In: Encyclopedia of Statistical Sciences Update, vol. 2, pp. 408-411, Wiley, New York, 1998.
[6] Little, R. J. A. and D. B. Rubin. Statistical Analysis with Missing Data. Wiley, New York, 1987.
[7] Schneider, T. Analysis of incomplete climate data: Estimation of mean values and covariance matrices and imputation of missing values. Journal of Climate, vol. 14, no. 5, pp. 853-871, 2001.
[8] Timm, H. and R. Kruse. Fuzzy cluster analysis with missing values. In: Proceedings of 17th International Conference of the North American Fuzzy Information Processing Society (NAFIPS98), pp. 242-246, Pensacola, FL, 1998.
