Fuzzy K-means Clustering with Missing Values

Manish Sarkar and Tze-Yun Leong
Department of Computer Science, School of Computing
National University of Singapore
Lower Kent Ridge Road, Singapore 119260
{manish, leongty}@comp.nus.edu.sg

The fuzzy K-means clustering algorithm is a popular approach for exploring the structure of a set of patterns, especially when the clusters are overlapping or fuzzy. However, the fuzzy K-means clustering algorithm cannot be applied when the data contain missing values. In many cases, the number of patterns with missing values is so large that if these patterns are removed, the number of patterns left to characterize the data set is insufficient. This paper proposes a technique to exploit the information provided by the patterns with missing values so that the clustering results are enhanced. There are various preprocessing methods to substitute the missing values before clustering the data. However, instead of repairing the data set at the beginning, the repair can be carried out incrementally in each iteration, based on the context. It is thus more likely that less uncertainty is added while incorporating the repair work. Fine-tuning the missing values using the information from the other attributes further consolidates this scheme. Applications of the proposed method in the medical domain have produced good performance.

Keywords: Fuzzy K-means clustering and missing values.

1. Introduction

Motivation: In medicine and biology, we often need exploratory analysis such as grouping the patterns so that the patterns within the same cluster have a high degree of similarity, and the patterns from different clusters have a high degree of dissimilarity. Clustering can be formally defined as follows [1]: Given a set of data X = {x_1, x_2, ..., x_n} ⊆ R^N, find an integer K (2 ≤ K ≤ n) and K partitions of X that exhibit categorically homogeneous subsets.

Importance of clustering: Some tasks for which clustering algorithms can be employed are as follows:
1. Clustering can abstract or compress certain properties of the data set.
2. A classifier can be constructed through clustering. To build a classifier, we group a data set and subsequently assign a class label (crisp or fuzzy) to each cluster. The class label of a new pattern is determined based on the cluster in which the pattern falls.
3. Clustering can be applied to decide whether the representation of a problem on computers is appropriate for processing. If the representation is not appropriate, then the data set behaves like a set of random numbers without any underlying regularity. In that case, the bad clustering results indicate that the user needs to modify the representation of the problem.

Basics of clustering: Three types of clustering approaches are commonly used [1]: (1) the hierarchical approach, (2) the graph-theoretic approach, and (3) the objective function-based approach. The objective function-based approach is very popular. One extensively used objective function-type clustering algorithm is the hard K-means clustering algorithm [1]. It assigns each pattern to exactly one of the clusters, assuming well-defined boundaries between the clusters. However, there may be some patterns that belong to more than one cluster. To overcome this problem, the idea of the fuzzy K-means (FKM) algorithm has been introduced. Unlike the hard K-means, in the FKM each input pattern belongs to all the clusters with different degrees or membership values. Incorporation of fuzzy set theory in the FKM algorithm makes it a generalized version of the hard K-means algorithm. From the psycho-physiological point of view, the problem of pattern clustering is unsuitable for approaches with precise mathematical formulations.

However, the FKM algorithm cannot be applied to real-life clustering problems when the data contain missing values. The missing values in a pattern imply that the values of some of the attributes of the pattern are unknown. Missing values can occur for various reasons: (a) patient entries for some attributes are irrelevant or unknown, (b) in the questioning session, the patient did not want to provide the values, (c) errors have led to incomplete attributes, (d) random noises have led to some impossible values, and they have been removed intentionally, (e) patients have died before an experiment was finished.

Problem definition: This paper addresses how to apply the FKM algorithm efficiently in the presence of missing values. It is assumed that the values are missing at random, i.e., the probability of a value being missing does not depend on the quantity of the value [6].

Related work: The approaches to deal with missing values can be categorized into the following groups [3][4][5][6][7]:
Deductive imputation: Missing values are deduced with certainty, or with high probability, from the other information of the pattern.
Hot-deck imputation: Missing values are replaced with values from the closest matching patterns.
Mean-value imputation: The mean of the observed values is used to replace the missing values.
Regression-based imputation: Missing values are replaced by the predicted values from a regression analysis.
Imputation using Expectation-Maximization: Missing values are repaired in two steps.
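For concreteness, the two simplest strategies above, mean-value and hot-deck imputation, can be sketched in a few lines. This is an illustrative sketch, not code from the paper; the function names are ours, and `None` marks a missing value in attribute `d`:

```python
def mean_impute(data, d):
    """Replace missing values (None) in attribute d by the mean
    of the observed values of that attribute."""
    observed = [row[d] for row in data if row[d] is not None]
    mean = sum(observed) / len(observed)
    return [[mean if (j == d and v is None) else v
             for j, v in enumerate(row)] for row in data]

def hot_deck_impute(data, d):
    """Replace a missing value by the value taken from the closest
    complete pattern (distance over the observed attributes)."""
    complete = [row for row in data if row[d] is not None]
    def dist(a, b):
        return sum((a[j] - b[j]) ** 2 for j in range(len(a)) if j != d)
    repaired = []
    for row in data:
        if row[d] is None:
            donor = min(complete, key=lambda c: dist(row, c))
            row = row[:d] + [donor[d]] + row[d + 1:]
        repaired.append(list(row))
    return repaired
```

Both repair the data once, before clustering starts; the method proposed in this paper instead performs the repair inside the clustering iterations.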
In the E-step, the expected value of the log-likelihood is calculated, and in the M-step, the missing values are substituted by the expected values. Then the likelihood function is maximized as if no data were missing.

Overview of the proposed method: Most of the current methods repair or impute the missing values before the clustering starts. This paper attempts to repair the missing data while performing the clustering. Exploiting this trick is difficult because, while updating a cluster center, the distance between a pattern with missing values and the cluster center cannot be measured. Using the law of large numbers, if we assume that the distances between the cluster center and the patterns form a Gaussian distribution, then the distance between a pattern with missing values and the cluster center can be replaced by the weighted mean of the distances between the cluster center and the complete patterns. The missing values are further fine-tuned by exploiting the information from the other attributes.

2. Background

2.1 Fuzzy Sets

In traditional two-state classifiers, where a class C is defined as a subset of the universal set X, any input pattern x ∈ X can either be a member or not be a member of the given class C. This property of whether or not a pattern x of the universal set belongs to the class C can be defined by a characteristic function μ_C : X → {0, 1} as follows:

μ_C(x) = 1 iff x ∈ C, and 0 otherwise.    (1)

In real-life situations, boundaries between the classes may be overlapping. Hence, it is uncertain whether an input pattern belongs totally to the class C. To consider such situations, in fuzzy sets [1] the concept of the characteristic function has been modified to the fuzzy membership function μ_C : X → [0, 1]. This function is called a membership function because a larger value of the function denotes a greater membership of the element to the set under consideration.

2.2 Fuzzy K-Means Clustering

Clustering a data set X ⊆ R^N implies that the data set is partitioned into K clusters such that each cluster is compact and far from the other clusters. One way to achieve this goal is through the minimization of the distances between the cluster center and the patterns that belong to the cluster. Using this principle, the hard K-means algorithm minimizes the following objective function [8]:

J = Σ_{k=1}^K Σ_{x_i ∈ F_k} d(m_k, x_i)    (2)

where d(m_k, x_i) is a distance measure between the center m_k of the cluster F_k and the pattern x_i ∈ X. Eqn. (2) can be rewritten as

J = Σ_{k=1}^K Σ_{i=1}^n μ_k(x_i) d(m_k, x_i)    (3)

where μ_k(x_i) ∈ {0, 1} is the characteristic function, i.e., μ_k(x_i) = 0 if x_i ∉ F_k, else μ_k(x_i) = 1. When the clusters are overlapping, each pattern may belong to more than one cluster, i.e., μ_k(x_i) ∈ [0, 1]. Hence, μ_k(x_i) should be interpreted as a membership function rather than the characteristic function. Therefore, the objective function (3) can be modified to the following:

J = Σ_{k=1}^K Σ_{i=1}^n μ_k^q(x_i) d(m_k, x_i)    (4)

where μ_k(x_i) ∈ [0, 1] is now a fuzzy membership function, and q is a constant known as the index of fuzziness that controls the amount of fuzziness. Since the minimization of the objective function (4) may lead to a trivial solution, the following two constraints are satisfied while minimizing the objective function:

Σ_{i=1}^n μ_k(x_i) > 0  ∀k ∈ {1, 2, ..., K}    (5)

Σ_{k=1}^K μ_k(x_i) = 1  ∀i ∈ {1, 2, ..., n}    (6)

The first constraint guarantees that there is no empty cluster, and the second constraint imposes the condition that each pattern needs to share its membership with all the clusters such that the sum of the memberships is equal to one. Differentiating the objective function (4) with the constraints (5) and (6), we obtain

μ_k(x_i) = 1 / Σ_{h=1}^K ( d(m_k, x_i) / d(m_h, x_i) )^{2/(q-1)}  ∀i ∈ {1, ..., n}, k ∈ {1, ..., K}    (7)

and

m_k = Σ_{i=1}^n μ_k^q(x_i) x_i / Σ_{i=1}^n μ_k^q(x_i),  k = 1, 2, ..., K    (8)

Eqns. (7) and (8) are used in an iterative fashion to update the memberships and the cluster centers. The updating continues until the changes in the membership values of all the patterns become negligible or the required number of iterations is over (Fig. 1).
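The update rules (7) and (8) translate directly into code. The following is an illustrative sketch, not the authors' implementation; it assumes A is the identity matrix, so d(m_k, x_i) reduces to the squared Euclidean distance:

```python
import random

def fkm(X, K, q=2.0, eps=1e-4, T=100, seed=0):
    """Fuzzy K-means: alternate the membership update, eq. (7), and the
    center update, eq. (8), until U stops changing or T iterations pass."""
    rng = random.Random(seed)
    n, N = len(X), len(X[0])
    # Random fuzzy K-partition: each row of U sums to 1 (constraint (6)).
    U = []
    for _ in range(n):
        r = [rng.random() for _ in range(K)]
        s = sum(r)
        U.append([v / s for v in r])
    def d(m, x):
        # Quadratic-form distance with A = identity: squared Euclidean.
        return sum((a - b) ** 2 for a, b in zip(m, x))
    M = []
    for _ in range(T):
        # Eq. (8): centers as membership-weighted means.
        M = []
        for k in range(K):
            w = [U[i][k] ** q for i in range(n)]
            s = sum(w)
            M.append([sum(w[i] * X[i][j] for i in range(n)) / s
                      for j in range(N)])
        # Eq. (7): memberships from relative distances.
        newU = []
        for i in range(n):
            dists = [d(M[k], X[i]) for k in range(K)]
            if min(dists) == 0.0:
                # Pattern coincides with one or more centers: those
                # centers share the whole membership.
                row = [1.0 if dk == 0.0 else 0.0 for dk in dists]
                total = sum(row)
                row = [v / total for v in row]
            else:
                row = [1.0 / sum((dists[k] / dists[h]) ** (2.0 / (q - 1.0))
                                 for h in range(K)) for k in range(K)]
            newU.append(row)
        delta = max(abs(newU[i][k] - U[i][k])
                    for i in range(n) for k in range(K))
        U = newU
        if delta <= eps:
            break
    return U, M
```

The stopping test mirrors the termination condition of Fig. 1: negligible change in U, or the iteration budget T exhausted.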
The worst-case time complexity of the algorithm is as follows: To find the distances between one cluster center and all the patterns, we need O(nN) computations. For all the clusters, the number of computations needed is O(nNK). If the clustering needs T iterations, then the worst-case complexity is O(nNKT).

INPUT:
(1) A set of input data X.
(2) The value of the fuzziness index q ∈ (1, ∞).
(3) The number of clusters K.
(4) A distance measure d(m_k, x_i) = (m_k − x_i)' A^{-1} (m_k − x_i) between m_k and x_i, where A is a positive definite matrix.
(5) A small, positive constant ε and an appropriate matrix norm ||·||.
(6) The maximum number of iterations T.
(7) An n × K matrix U, where the element in the ith row and the kth column indicates μ_k(x_i).

ALGORITHM:
Assign t = 0.
Randomly initialize the fuzzy K-partition U^t.
DO
    Set t = t + 1.
    FOR k = 1, 2, ..., K
        Calculate the cluster center m_k using
        m_k = Σ_{i=1}^n μ_k^q(x_i) x_i / Σ_{i=1}^n μ_k^q(x_i)
    ENDFOR
    Update U^t by calculating U^{t+1} as follows:
    FOR i = 1, 2, ..., n
        Determine the content of the following set:
        I_k = {k | 1 ≤ k ≤ K; d(m_k, x_i) = 0}
        IF I_k = ∅,
            μ_k(x_i) = 1 / Σ_{h=1}^K ( d(m_k, x_i) / d(m_h, x_i) )^{2/(q-1)}
        ELSE
            μ_k(x_i) = 0 ∀k ∈ {1, 2, ..., K} − I_k, and Σ_{k ∈ I_k} μ_k(x_i) = 1
        ENDIF
    ENDFOR
ENDDO UNTIL ||U^t − U^{t+1}|| ≤ ε OR t ≥ T

OUTPUT:
(1) μ_k(x_i) ∀i, k, i.e., the belongingness of the patterns to the clusters.
(2) u = argmax_k μ_k(x_i); u denotes the cluster to which x_i belongs when the membership is considered crisp.

Fig. 1: Fuzzy K-means algorithm.

3. Proposed Method

Algorithm: Let all the missing values in the data set X occur in the dth attribute. We shall relax this constraint later. Let us call the set of all the patterns with missing values Z, and the set of all complete patterns Y (i.e., X = Y ∪ Z). Each pattern z_j = [z_{j1}, z_{j2}, ..., z_{j(d-1)}, ?, z_{j(d+1)}, ..., z_{jN}]' ∈ Z can be made complete by substituting z_{jd} by (1/|Y|) Σ_{y ∈ Y} y_d, where |Y| indicates the cardinality of the set Y and [u]' indicates the transpose of [u]. Subsequently, the standard FKM can be applied, since there is no longer any missing value in the data set.

However, we can modify the clustering algorithm so that the substitution operation is more context dependent. In the clustering, we need the substitution operation while finding the distance between a cluster center (say the kth) and an incomplete pattern. We can fill the pattern at that point of time, and thus we fill the pattern differently and incrementally for each cluster center. Therefore, instead of filling z_{jd} by (1/|Y|) Σ_{y ∈ Y} y_d, we fill (z_{jd} − m_{kd})^2 by the mean of {(y_{id} − m_{kd})^2 | i = 1, 2, ..., |Y|}, i.e., by (1/|Y|) Σ_{i=1}^{|Y|} (y_{id} − m_{kd})^2. Here the assumption is that the members of {(y_{id} − m_{kd})^2 | i = 1, 2, ..., |Y|} are i.i.d. (independent and identically distributed), and hence, from the law of large numbers, they form a Gaussian distribution.

In the above procedure, we treat each complete pattern y_i equally. However, the complete patterns that are close to m_k should influence the update of the cluster center more. In other words, we can use the concept of a weighted mean instead of a simple mean. Hence, we choose the weights as the membership values. Thus, (z_{jd} − m_{kd})^2 is substituted by

Σ_{i=1}^{|Y|} μ_k(y_i)(y_{id} − m_{kd})^2 / Σ_{i=1}^{|Y|} μ_k(y_i).

This substituted value becomes the same for all patterns with missing values, although some of the patterns with missing values are very close to the cluster center m_k and some are far away from it.
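The context-dependent substitution described above can be sketched as follows. This is illustrative code, not the paper's; `None` marks the missing dth value, `Y` is the list of complete patterns, and `mu_k` holds their memberships in the kth cluster:

```python
def attr_dist_sq(z, m_k, d, Y, mu_k):
    """Per-attribute squared distances between an incomplete pattern z
    and the cluster center m_k. The missing dth term is replaced by the
    membership-weighted mean of (y_id - m_kd)^2 over the complete set Y."""
    terms = []
    for j, (zj, mj) in enumerate(zip(z, m_k)):
        if zj is not None:
            terms.append((zj - mj) ** 2)
        else:
            num = sum(mu * (y[d] - m_k[d]) ** 2 for y, mu in zip(Y, mu_k))
            den = sum(mu_k)  # assumed positive by constraint (5)
            terms.append(num / den)  # weighted-mean substitute
    return terms
```

Because the substitute depends on m_k and on the memberships μ_k(y_i), the same missing value is filled differently for each cluster center and at each iteration.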
If we assume that the weighted distance (z_{jd} − m_{kd})^2 / σ_{kd}^2 linearly depends on the weighted distances between the observed attribute values z_{jh}, h ≠ d, and the corresponding components m_{kh} of the cluster center, then we can estimate (z_{jd} − m_{kd})^2 / σ_{kd}^2 using the following linear regression or weighted mean:

(z_{jd} − m_{kd})^2 / σ_{kd}^2 = w_1 (z_{j1} − m_{k1})^2 / σ_{k1}^2 + ...
    + w_{d-1} (z_{j(d-1)} − m_{k(d-1)})^2 / σ_{k(d-1)}^2
    + w_d [ Σ_{i=1}^{|Y|} μ_k(y_i)(y_{id} − m_{kd})^2 / Σ_{i=1}^{|Y|} μ_k(y_i) ] / σ_{kd}^2    (9)
    + w_{d+1} (z_{j(d+1)} − m_{k(d+1)})^2 / σ_{k(d+1)}^2
    + ... + w_N (z_{jN} − m_{kN})^2 / σ_{kN}^2

where w_h indicates the importance of the hth attribute, m_k = Σ_{i=1}^{|Y|} μ_k(y_i) y_i / Σ_{i=1}^{|Y|} μ_k(y_i), and σ_{kh}^2 = Σ_{i=1}^{|Y|} μ_k(y_i)(y_{ih} − m_{kh})^2 / Σ_{i=1}^{|Y|} μ_k(y_i). The importance w_h can be determined by using some a priori knowledge or by using some feature extraction algorithms (when the data are labeled).
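The estimate (9) can be sketched as a small function. The names are ours for illustration: `sub_d` is the membership-weighted substitute for the missing dth term, `sigma2_k` holds the per-attribute variances σ_{kh}^2 (assumed precomputed and positive), and the default weights are uniform:

```python
def combined_dist_sq(z, m_k, d, sub_d, sigma2_k, w=None):
    """Estimate (z_jd - m_kd)^2 / sigma_kd^2 as in eq. (9): a weighted
    sum of the variance-normalized squared distances of the observed
    attributes, plus the weighted-mean substitute for attribute d."""
    N = len(m_k)
    if w is None:
        w = [1.0 / N] * N  # equal importance when nothing else is known
    total = 0.0
    for h in range(N):
        if h == d:
            total += w[h] * sub_d / sigma2_k[h]
        else:
            total += w[h] * (z[h] - m_k[h]) ** 2 / sigma2_k[h]
    return total
```

Normalizing each term by σ_{kh}^2 puts the attributes on a comparable scale before they are combined.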
In this paper, we are not assuming that we know the importance of the attributes, and hence we distribute the importance equally among all the attributes by making w_h = 1/N ∀h ∈ {1, 2, ..., N}.

Till now, we have shown all the derivations when the values are missing only in the dth attribute. A similar procedure can be adopted when we have patterns with missing values in more than one attribute. Thus, the modified FKM needs some extra steps to handle the incomplete patterns.

Particular case: The mean-value imputation, in which the missing value z_{jd} of the pattern z_j is replaced by (1/|Y|) Σ_{y ∈ Y} y_d, can be derived from the proposed method when (a) the cluster centers are assumed to be at the origin, (b) all the patterns receive equal importance, (c) w_h = 0 ∀h ≠ d and w_d = 1, and (d) the repairing is done only in the first iteration. Moreover, if w_d = 0 and all the variances σ_{kh}^2, h ∈ {1, 2, ..., N}, are equal, then the proposed algorithm reduces to that of [8].

Convergence: When the missing values occur only in the dth attribute, we partition the data set into the two sets Y and Z. If we use the proposed algorithm for this type of data set, we actually minimize the following objective function:

J = Σ_{i=1}^{|Y|} Σ_{k=1}^K μ_k^q(y_i) [d(m_k, y_i)]^2 + Σ_{i=1}^{|Z|} Σ_{k=1}^K μ_k^q(z_i) [d(m_k, z_i)]^2    (10)

It is straightforward to show that the objective function (10) under the constraints (5) and (6) is monotonically decreasing, and hence the iterative minimization guarantees convergence. The same result holds if the values are missing in more than one attribute.

Time complexity: Let us first look at the time complexity when the values are missing only in the dth attribute. For finding the mean and the variance of all the complete patterns, we need O(|Y|) computations in each iteration. For each iteration and cluster center, we require O(|Z|N) computations due to the regression. Since n = |X| = |Y ∪ Z|, the time complexity for all the cluster centers and iterations is bounded by O(nNTK). When the missing values occur in more than one attribute, the worst-case time complexity becomes O(nN^2 TK). Since in practical cases N << n, the repair work does not significantly change the order of the time complexity of the original FKM algorithm.

Quality of clustering: The quality of the clustering can be measured in two ways: directly or indirectly. In the direct method, we can apply some cluster validity measures to check whether the quality of the clustering is improving. In the indirect method, we cluster the data using the proposed method, and then the clusters are utilized to build classifiers. The classifier performance is used as an indirect way to quantify the quality of the clustering. Note that this is possible only when the data are labeled.
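The indirect check can be sketched as follows (an illustration of the idea rather than the paper's exact protocol): each cluster is labeled by the membership-weighted vote of the training labels, and a test pattern then inherits the label of its closest cluster center.

```python
def label_clusters(U, labels, K):
    """Fuzzily label each cluster by the membership-weighted vote of
    the training labels; return the winning label per cluster."""
    cluster_labels = []
    for k in range(K):
        votes = {}
        for row, y in zip(U, labels):
            votes[y] = votes.get(y, 0.0) + row[k]
        cluster_labels.append(max(votes, key=votes.get))
    return cluster_labels

def classify(x, centers, cluster_labels):
    """Assign a test pattern the label of its nearest cluster center
    (the crisp version of the highest membership)."""
    def d(m):
        return sum((a - b) ** 2 for a, b in zip(m, x))
    k = min(range(len(centers)), key=lambda k: d(centers[k]))
    return cluster_labels[k]
```

The fraction of test patterns classified correctly then serves as the indirect measure of clustering quality.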
4. Results and Discussion

We have conducted the experiments on the Wisconsin-Madison breast cancer data from the UCI machine learning repository [2]. We have compared the results of the proposed algorithm with those of the mean substitution, hot-deck, regression, EM and C4.5 algorithms. The presence of a breast mass may indicate (but not always) malignant cancer. The University of Wisconsin Hospital has collected 699 samples using the fine needle aspiration test. Each sample consists of the following ten attributes: (1) patient's i.d., (2) clump thickness, (3) uniformity of cell size, (4) uniformity of cell shape, (5) marginal adhesion, (6) single epithelial cell size, (7) bare nuclei, (8) bland chromatin, (9) normal nucleoli and (10) mitosis. Except for the patient's i.d., all the other measurements are assigned an integer value between 1 and 10, with 1 being closest to benign and 10 the most anaplastic. Each sample is either benign or malignant.

The data set contains 16 samples, each with one missing attribute. Since the number of missing values is small, we introduced more missing values with probability 0.25 in all the attributes of each pattern. Using the t-test, we first ensured that the data are missing at random. We find the quality of the clustering in the indirect way, i.e., through the classification performance. We partition the data set into training and test sets. The training set consists of some patterns with missing values, but the test set contains only complete patterns. Using the proposed technique, the training set is grouped into K clusters, and each cluster is fuzzily labeled. Next, each pattern of the test set is classified based on which fuzzy clusters it falls in. A similar scheme is also used with the four other imputation techniques, and the classification performances of these techniques are shown in Table 1. The proposed method performs better than the other methods.

Table 1: Comparative results of the proposed method with respect to the other methods.

Techniques                      Classification rates
FKM with hot deck               92.67%
FKM with mean substitution      93.18%
FKM with regression             95.67%
FKM with EM algorithm           96.34%
FKM with the proposed method    98.43%
C4.5 after pruning              94.31%

The advantages of the proposed method are: (a) the substitution of a particular missing value is carried out differently for different cluster centers, and (b) the substitution is carried out incrementally so that better clusters are formed. The limitations of the proposed method arise from the assumptions that it requires: (a) the members of {(z_{jd} − m_{kd})^2 | j = 1, 2, ..., |Z|} to be i.i.d., and (b) the attribute with missing values to depend linearly on the other attributes. In the future, we would attempt to relax these assumptions. In addition to medical problems, we intend to apply the proposed technique to cluster microarray genomic data, where missing values are encountered quite often due to the limitations of the experiments.

Acknowledgments: A strategic research grant RP960351 from NSTB and the Ministry of Education, Singapore, has supported this work.

References

[1] Bezdek, J. C. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York, 1981.
[2] Blake, C. L. and C. J. Merz. UCI repository of machine learning databases, http://www.ics.uci.edu/~mlearn, 1998.
[3] Ghahramani, Z. and M. I. Jordan. Learning from incomplete data. Technical report, AI memo no. 1509, MIT, 1994.
[4] Heitjan, D. F. Annotation: What can be done about missing data? Approaches to imputation. American Journal of Public Health, vol. 87, no. 4, pp. 548-550, 1997.
[5] Heitjan, D. F. and R. Thomas. Missing data, types of. In: Encyclopedia of Statistical Sciences Update, vol. 2, pp. 408-411, Wiley, New York, 1998.
[6] Little, R. J. A. and D. B. Rubin. Statistical Analysis with Missing Data, Wiley, New York, 1987.
[7] Schneider, T. Analysis of incomplete climate data: Estimation of mean values and covariance matrices and imputation of missing values. Journal of Climate, vol. 14, no. 5, pp. 853-871, 2001.
[8] Timm, H. and R. Kruse. Fuzzy cluster analysis with missing values. In: Proceedings of the 17th International Conference of the North American Fuzzy Information Processing Society (NAFIPS'98), pp. 242-246, Pensacola, FL, 1998.