(IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 4, April 2011
http://sites.google.com/site/ijcsis/ ISSN 1947-5500

An Efficient Constrained K-Means Clustering using Self Organizing Map

M. Sakthi (1) and Dr. Antony Selvadoss Thanamani (2)
(1) Research Scholar
(2) Associate Professor and Head, Department of Computer Science, NGM College, Pollachi, Tamilnadu.

Abstract--- The rapid worldwide increase in available data makes those data difficult to analyze. Organizing data into meaningful collections is one of the most basic forms of understanding and learning. Thus, a proper data mining approach is required to organize the data for better understanding. Clustering is one of the standard approaches in the field of data mining. The main aim of this approach is to organize a dataset into a set of clusters, each consisting of similar data items, as measured by some distance function. K-Means is the most widely used clustering algorithm because of its effectiveness and simplicity. When the dataset is large, however, K-Means tends to misclassify data points. To overcome this problem, some constraints must be included in the algorithm. The resulting algorithm is called Constrained K-Means Clustering. The constraints used in this paper are the Must-link constraint, the Cannot-link constraint, the δ-constraint and the ε-constraint. A Self Organizing Map (SOM) is used in this paper to generate the must-link and cannot-link constraints. The experimental results show that the proposed algorithm gives better classification than the standard K-Means clustering technique.

Keywords--- K-Means, Self Organizing Map (SOM), Constrained K-Means

I. INTRODUCTION

The growth of sensing and storage technology and the rapid development of applications such as internet search, digital imaging, and video surveillance have generated many high-volume, high-dimensional data sets. As the majority of the data are stored digitally in electronic media, they offer high potential for the development of automatic data analysis, classification, and retrieval approaches.

Clustering is one of the most popular approaches used for data analysis and classification. Cluster analysis is widely used in disciplines that involve analysis of multivariate data. A search through Google Scholar found 1,660 entries containing the words "data clustering" that appeared in 2007 alone. This large number indicates the significance of clustering in data analysis. It is very difficult to list all the scientific fields and applications that have used clustering methods, as well as the thousands of existing techniques.

The main aim of data clustering is to identify the natural classification of a set of patterns, points, or objects. Webster defines cluster analysis as “a statistical classification method for discovering whether the individuals of a population fall into various groups by making quantitative comparisons of multiple characteristics”. Another definition of clustering is: given a representation of n objects, determine K groups based on a measure of similarity such that similarities among objects in the same group are high whereas similarities between objects in different groups are low.

The main advantages of using clustering algorithms are:

• Compactness of representation.
• Fast, incremental processing of new data points.
• Clear and fast identification of outliers.

The most widely used clustering technique is K-Means clustering, because it is very simple to implement and effective in clustering. However, the performance of K-Means degrades when a large dataset is clustered. This can be addressed by including some constraints [8, 9] in the clustering algorithm; the resulting method is called Constrained K-Means Clustering [7, 10]. The constraints used in this paper are the Must-link constraint, the Cannot-link constraint [14, 16], the δ-constraint and the ε-constraint. A Self Organizing Map (SOM) is used in this paper to generate the must-link and cannot-link constraints.

II. RELATED WORKS

Zhang Zhe et al., [1] proposed an improved K-Means clustering algorithm. The K-means algorithm [8] is extensively used in spatial clustering. The mean value of each cluster centroid in this approach is taken as the heuristic information, so it has limitations such as sensitivity to the initial centroid and instability. The enhanced clustering algorithm refers to the best clustering centroid found during the optimization of the clustering centroids. This increases the probability of searching around the best centroid and improves the robustness of the approach. The experiment is performed on two groups of representative datasets, and the experimental observations show that the improved K-means algorithm performs better in global searching and is less sensitive to the initial centroid.

Hai-xiang Guo et al., [2] put forth an Improved Genetic k-means Algorithm for Optimal Clustering. In the traditional k-means approach the value of k must be known in advance, and it is very hard to determine the value of k accurately in advance. The authors proposed an improved genetic k-means clustering (IGKM) and built a fitness function defined as a product of three factors, maximization of which guarantees the formation of a small number of compact clusters with large separation between at least two clusters. The experiments are conducted on two artificial and three real-life data sets and compare IGKM with other traditional methods such as the k-means algorithm, a GA-based technique and the genetic k-means algorithm (GKM) by inter-cluster distance (ITD), inner-cluster distance (IND) and rate of separation exactness. The experimental observations show that IGKM reaches the optimal value of k with high accuracy.

Yanfeng Zhang et al., [3] proposed an Agglomerative Fuzzy K-means clustering method with automatic selection of cluster number (NSS-AKmeans) for learning the optimal number of clusters and for providing significant clustering results. High density areas can be detected by NSS-AKmeans, and from these areas the initial cluster centers can be determined with a neighbor sharing selection approach. An Agglomeration Energy (AE) factor is proposed in order to choose an initial cluster that represents the global density relationship of objects. Moreover, a Neighbors Sharing Factor (NSF) is used to calculate the local neighbor sharing relationship of objects. The Agglomerative Fuzzy k-means clustering algorithm is then used to further merge these initial centers to obtain the preferred number of clusters and produce better clustering results. Experimental observations on several data sets showed that the proposed clustering approach was effective in automatically identifying the true cluster number and in providing correct clustering results.

Xiaoyun Chen et al., [4] described GK-means, an efficient K-means clustering algorithm based on a grid. Clustering analysis is extensively used in several applications such as pattern recognition, data mining and statistics. The K-means approach, based on minimizing a formal objective function, is the most broadly used in research. However, the user must specify the number of clusters k, and it is difficult to choose effective initial centers. It is also very susceptible to noisy data points. In this paper, the authors mainly focus on choosing better initial centers to enhance the quality of k-means and to reduce its computational complexity. The proposed GK-means integrates a grid structure and a spatial index with the k-means clustering approach. Theoretical analysis and experimental observations show that the proposed approach achieves significantly higher efficiency.

Trujillo et al., [5] proposed an approach combining K-means and semivariogram-based grid clustering. Clustering is widely used in various applications which include data mining, information retrieval, image segmentation, and data classification. A clustering technique for grouping data sets that are indexed in space is proposed in this paper. This approach mainly depends on the k-means clustering technique and grid clustering. K-means clustering is the simplest and most widely used approach. Its main disadvantage is that it is sensitive to the selection of the initial partition. Grid clustering is extensively used for grouping data that are indexed in space. The main aim of the proposed clustering approach is to eliminate the high sensitivity of k-means clustering to the starting conditions by using the available spatial information. A semivariogram-based grid clustering technique is used in this approach. It uses the spatial correlation to obtain the bin size. The authors combine this approach with a conventional k-means clustering technique because the bins are constrained to regular blocks while the spatial distribution of objects is irregular. The semivariogram provides an effective initialization of k-means. The experimental results show that the final partition preserves the spatial distribution of the objects.

Huang et al., [6] put forth automated variable weighting in k-means type clustering, which can automatically estimate variable weights. A novel step is introduced into the k-means algorithm to iteratively update variable weights depending on the current partition of the data, and a formula for weight calculation is also proposed. The convergence theorem of the new clustering algorithm is given in the paper. The variable weights produced by the approach estimate the significance of variables in clustering and can be used for variable selection in various data mining applications where large and complex real data are often used. Experiments are conducted on both synthetic and real data, and the experimental observations show that the proposed approach provides higher performance than the traditional k-means type algorithms in recovering clusters in data.

III. METHODOLOGY

The methodology proposed for clustering the data is presented in this section. Initially, K-Means clustering is described. Then the constraint-based K-Means clustering is provided. Next, the constraints used in the Constrained K-Means algorithm are presented. A Self Organizing Map is used in this paper to generate the must-link and cannot-link constraints.

K-Means Clustering

Given a set of data samples, a preferred number of clusters, k, and a set of k initial starting points, the k-means clustering technique determines the desired number of distinct clusters and their centroids. A centroid is defined as the point whose coordinates are obtained by averaging each of the coordinates (i.e., feature values) of the points assigned to the cluster. Formally, the k-means clustering algorithm follows these steps.

Step 1: Choose a number of desired clusters, k.
Step 2: Choose k starting points to be used as initial estimates of the cluster centroids. These are the initial starting values.
Step 3: Examine each point in the data set and assign it to the cluster whose centroid is nearest to it.
Step 4: When each point is assigned to a cluster, recalculate the new k centroids.
Step 5: Repeat steps 3 and 4 until no point changes its cluster assignment, or until a maximum number of passes through the data set is performed.

Constrained K-Means Clustering

Constrained K-Means Clustering [15] is similar to the standard K-Means Clustering algorithm, with the exception that the constraints must be satisfied while assigning the data points to clusters. The algorithm for Constrained K-Means Clustering is described below.

Step 1: Choose a number of desired clusters, k.
Step 2: Choose k starting points to be used as initial estimates of the cluster centroids. These are the initial starting values.
Step 3: Examine each point in the data set and assign it to the cluster whose centroid is nearest to it only when violate-constraints ( ) returns false.
Step 4: When each point is assigned to a cluster, recalculate the new k centroids.
Step 5: Repeat steps 3 and 4 until no point changes its cluster assignment, or until a maximum number of passes through the data set is performed.

Function violate-constraints ( )
    if must_link constraint not satisfied
        return true
    elseif cannot_link constraint not satisfied
        return true
    elseif δ-constraint not satisfied
        return true
    elseif ε-constraint not satisfied
        return true
    else
        return false

Constraints used for Constrained K-Means Clustering

The constraints [11, 12, 13] used for Constrained K-Means Clustering are

• Must-link constraint
• Cannot-link constraint
• δ-constraint
• ε-constraint

Consider $S = \{s_1, s_2, \ldots, s_n\}$ as a set of n data points that are to be separated into clusters. For any pair of points $s_i$ and $s_j$ in $S$, the distance between them is represented by $d(s_i, s_j)$, which is symmetric, so that $d(s_i, s_j) = d(s_j, s_i)$. The constraints are:

• Must-link constraint: indicates that two points $s_i$ and $s_j$ ($i \neq j$) in $S$ have to be in the same cluster.
• Cannot-link constraint: indicates that two points $s_i$ and $s_j$ ($i \neq j$) in $S$ must not be placed in the same cluster.
• δ-Constraint: This constraint specifies a value $\delta > 0$. Formally, for any pair of clusters $S_i$ and $S_j$ ($i \neq j$), and any pair of points $s_p$ and $s_q$ such that $s_p \in S_i$ and $s_q \in S_j$, $d(s_p, s_q) \geq \delta$.
• ε-Constraint: This constraint specifies a value $\epsilon > 0$, and the feasibility requirement is the following: for any cluster $S_i$ containing two or more points and for any point $s_p \in S_i$, there must be another point $s_q \in S_i$ such that $d(s_p, s_q) \leq \epsilon$.

The Must-link and Cannot-link constraints are determined with the help of an appropriate neural network. For this purpose, this paper uses a Self Organizing Map.

Self Organizing Map

The Self-Organizing Map (SOM) is a general type of neural network technique and a nonlinear regression method that can be used to determine relationships among inputs and outputs or to categorize data so as to reveal previously unidentified patterns or structures. It is an outstanding technique in the exploratory phase of data mining. The results of such examinations indicate that self-organizing maps can be a feasible technique for categorizing large quantities of data. The SOM has established itself as an extensively used technique in data analysis and visualization of high-dimensional data. Among other statistical techniques the SOM has no close counterpart, and thus it offers a complementary view of the data. On the other hand, SOM is the most extensively used technique in this group as it offers some notable merits over the alternatives. These include ease of use, particularly for inexperienced users, and a highly intuitive display of the data projected onto a regular two-dimensional grid, as on a sheet of paper. The most important potential of the SOM is in exploratory data analysis, which differs from regular statistical data analysis in that there is no assumed set of hypotheses to be validated in the analysis. Instead, the hypotheses are created from the data in the data-driven exploratory stage and validated in the confirmatory stage. There are a few problems where the exploratory stage alone may be adequate, such as visualization of data with no additional quantitative statistical inference upon it. In practical data analysis problems the main task is often to identify dependencies among variables. In such a problem, SOM can be used for gaining insight into the data and for the initial search for potential dependencies. In general the findings need to be validated with more conventional techniques, in order to assess the confidence in the conclusions and to discard those that are not statistically significant.

Initially the chosen parameters are normalized and the SOM network is initialized. Then the SOM is trained to offer the maximum likelihood estimation, so that a particular input sample can be linked with a particular node in the categorization layer. Self-organizing networks assume a topological structure among the cluster units. There are m cluster units, arranged in a one- or two-dimensional array; the input signals are n-dimensional. Figure 1 shows the architecture of the self-organizing network (SOM), which consists of an input layer and a Kohonen or clustering layer.

Figure 1: Architecture of self-organizing map

Finally the categorized data is obtained from the SOM. From this categorized data, the must-link and cannot-link constraints are derived. Data points within the same cluster are treated as must-link constraints and data points outside the cluster are treated as cannot-link constraints. These constraints are used in the constraint checking module of the constrained K-Means algorithm.

IV. EXPERIMENTAL RESULTS

The proposed technique is evaluated using two benchmark datasets, Iris and Wine, from the UCI Machine Learning Repository [17]. All algorithms are run with the same initial values and stopping conditions. The experiments are all performed on a GENX computer with a 2.6 GHz Core (TM) 2 Duo processor using MATLAB version 7.5.

Experiment with Iris Dataset

The Iris flower data set (Fisher's Iris data set) is a multivariate data set. The dataset comprises 50 samples from each of three species of Iris flowers (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from every sample: the length and the width of the sepal and petal, in centimeters. Based on the combination of the four features, Fisher developed a linear discriminant model to distinguish the species from each other. It is used as a typical test for many classification techniques. The proposed method is tested first using this Iris dataset. This database has four continuous features and 150 instances: 50 for each class.

To evaluate the efficiency of the proposed approach, this technique is compared with the existing K-Means algorithm. The Mean Square Error (MSE) of the centers is $MSE = \|v_c - v_t\|^2$, where $v_c$ is the computed center and $v_t$ is the true center. The cluster centers found by the proposed K-Means are closer to the true centers than the centers found by the K-Means algorithm. The mean square errors of the three cluster centers for the two approaches are presented in Table I.
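The center error used in this comparison can be illustrated with a short sketch. It assumes that the error for one cluster is the squared Euclidean distance between the computed center and the matching true center; the function name center_mse and the sample coordinates are hypothetical and are not taken from the experiments.

```python
def center_mse(computed, true):
    """Squared Euclidean distance between a computed center and a true center.

    Assumption: the error is ||v_c - v_t||^2, with no division by the
    number of features.
    """
    if len(computed) != len(true):
        raise ValueError("centers must have the same number of features")
    return sum((c - t) ** 2 for c, t in zip(computed, true))


# Hypothetical two-feature centers, for illustration only.
v_c = [3.0, 4.0]   # center computed by a clustering run
v_t = [0.0, 0.0]   # known (true) center of the same class
error = center_mse(v_c, v_t)   # 3^2 + 4^2 = 25.0
```

A smaller value means the computed center lies closer to the true center, which is the sense in which the two algorithms are compared per cluster.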
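The constrained assignment procedure of Section III, which produces the centers being evaluated, can be sketched as follows. This is only an illustration under stated assumptions, not the implementation used in the experiments: pairwise constraints are derived from a prior grouping such as SOM node labels, a point whose nearest cluster violates a constraint is tried against the next nearest cluster (the behavior of COP-KMeans [15], which the steps in Section III leave unspecified), and the δ- and ε-constraints are omitted for brevity. All function names are hypothetical.

```python
from itertools import combinations


def constraints_from_labels(labels):
    """Same prior label -> must-link pair; different label -> cannot-link pair."""
    must, cannot = set(), set()
    for i, j in combinations(range(len(labels)), 2):
        (must if labels[i] == labels[j] else cannot).add((i, j))
    return must, cannot


def sq_dist(p, q):
    """Squared Euclidean distance (sufficient for nearest-centroid ranking)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def violates(i, cluster, assign, must, cannot):
    """True if placing point i in `cluster` breaks a pairwise constraint."""
    for a, b in must:
        if i in (a, b):
            other = b if i == a else a
            if assign[other] is not None and assign[other] != cluster:
                return True
    for a, b in cannot:
        if i in (a, b):
            other = b if i == a else a
            if assign[other] == cluster:
                return True
    return False


def constrained_kmeans(points, centers, must, cannot, max_iter=100):
    """Steps 3 to 5: feasible nearest-centroid assignment, then centroid update."""
    centers = [list(c) for c in centers]
    previous = None
    for _ in range(max_iter):
        assign = [None] * len(points)
        for i, p in enumerate(points):
            ranked = sorted(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            for c in ranked:
                if not violates(i, c, assign, must, cannot):
                    assign[i] = c
                    break
            else:
                raise ValueError(f"no feasible cluster for point {i}")
        if assign == previous:          # no point changed its cluster
            break
        previous = assign
        for c in range(len(centers)):   # recompute centroids of non-empty clusters
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centers


# Hypothetical toy data: point 4 is nearer to cluster 0 but must join point 2.
points = [(0.0, 0.0), (0.2, 0.0), (4.0, 4.0), (4.2, 4.0), (1.8, 1.8)]
assign, centers = constrained_kmeans(points, [(0.0, 0.0), (4.0, 4.0)], {(2, 4)}, {(0, 4)})
print(assign)  # [0, 0, 1, 1, 1]
```

With empty constraint sets the same function reduces to the standard K-Means steps, and the toy point at index 4 then falls into cluster 0. The result depends on the order in which points are visited, which is a known property of this style of constrained assignment.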
The resulting execution times for the proposed and standard K-Means algorithms are provided in Figure 2.

TABLE I
MEAN SQUARE ERROR VALUE OBTAINED FOR THE THREE CLUSTERS IN THE IRIS DATASET

             K-Means    Proposed K-Means
Cluster 1    0.3765     0.2007
Cluster 2    0.4342     0.2564
Cluster 3    0.3095     0.1943

Figure 2: Execution Time for Iris Dataset (vertical axis: Time (seconds); series: K-Means, Proposed K-Means)

Experiment with Wine Dataset

The wine dataset is the result of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines. Classes 1, 2 and 3 have 59, 71 and 48 instances respectively. There are 13 attributes in total.

The MSE value for the three clusters is presented in Table II. The resulting execution times for the proposed and standard K-Means algorithms are provided in Figure 3.

TABLE II
MEAN SQUARE ERROR VALUE OBTAINED FOR THE THREE CLUSTERS IN THE WINE DATASET

             K-Means    Proposed K-Means
Cluster 1    0.4364     0.3094
Cluster 2    0.5562     0.3572
Cluster 3    0.2142     0.1843

Figure 3: Execution Time for Wine Dataset (vertical axis: Time (seconds); series: K-Means, Proposed K-Means)

From the experimental observations it can be seen that the proposed approach produces better clusters than the existing approach. The MSE value is reduced for every cluster in both datasets, which indicates better accuracy for the proposed approach. The execution time is also reduced compared to the existing approach, again for both datasets.

V. CONCLUSION

The worldwide increase in the amount of data creates the need for better analysis techniques for understanding the data. One of the most essential modes of understanding and learning is categorizing data into reasonable groups. This can be achieved by a well-known data mining technique called clustering. Clustering separates the given data into groups according to the distances among the data points. This helps in better understanding and analysis of vast data. One of the most widely used clustering techniques is K-Means clustering, because it is simple and efficient. However, it loses classification accuracy when large data sets are clustered, so K-Means clustering needs to be improved to suit all kinds of data. Hence the clustering technique called Constrained K-Means Clustering is introduced. The constraints used in this paper are the Must-link constraint, the Cannot-link constraint, the δ-constraint and the ε-constraint. SOM is used in this paper for generating the Must-link and Cannot-link constraints. The experimental results show that the proposed technique gives better classification and also takes less time for classification. In future, this work can be extended by using more suitable constraints in the Constrained K-Means Clustering technique.

REFERENCES

[1] Zhang Zhe, Zhang Junxi and Xue Huifeng, "Improved K-Means Clustering Algorithm", Congress on Image and Signal Processing, Vol. 5, Pp. 169-172, 2008.
[2] Hai-xiang Guo, Ke-jun Zhu, Si-wei Gao and Ting Liu, "An Improved Genetic k-means Algorithm for Optimal Clustering", Sixth IEEE International Conference on Data Mining Workshops, Pp. 793-797, 2006.
[3] Yanfeng Zhang, Xiaofei Xu and Yunming Ye, "NSS-AKmeans: An Agglomerative Fuzzy K-means clustering method with automatic selection of cluster number", 2nd International Conference on Advanced Computer Control, Vol. 2, Pp. 32-38, 2010.
[4] Xiaoyun Chen, Youli Su, Yi Chen and Guohua Liu, "GK-means: an Efficient K-means Clustering Algorithm Based on Grid", International Symposium on Computer Network and Multimedia Technology, Pp. 1-4, 2009.
[5] M. Trujillo and E. Izquierdo, "Combining K-means and semivariogram-based grid clustering", 47th International Symposium, Pp. 9-12, 2005.
[6] J.Z. Huang, M.K. Ng, Hongqiang Rong and Zichen Li, "Automated variable weighting in k-means type clustering", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 27, No. 5, Pp. 657-668, 2005.
[7] Yi Hong and Sam Kwong, "Learning Assignment Order of Instances for the constrained k-means clustering algorithm", IEEE Transactions on Systems, Man, and Cybernetics, Vol. 39, No. 2, April 2009.
[8] I. Davidson, M. Ester and S.S. Ravi, "Agglomerative hierarchical clustering with constraints: Theoretical and empirical results", in Proc. of Principles of Knowledge Discovery from Databases, PKDD 2005.
[9] Kiri L. Wagstaff, Sugato Basu and Ian Davidson, "When is constrained clustering beneficial, and why?", National Conference on Artificial Intelligence, Boston, Massachusetts, 2006.
[10] Kiri Wagstaff, Claire Cardie, Seth Rogers and Stefan Schrodl, "Constrained K-means Clustering with Background Knowledge", ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning, 2001.
[11] I. Davidson, M. Ester and S.S. Ravi, "Efficient incremental constrained clustering", in Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 12-15, 2007, San Jose, California, USA.
[12] I. Davidson, M. Ester and S.S. Ravi, "Clustering with constraints: Feasibility issues and the K-means algorithm", in Proc. SIAM SDM 2005, Newport Beach, USA.
[13] D. Klein, S.D. Kamvar and C.D. Manning, "From Instance-Level constraints to space-level constraints: Making the most of Prior Knowledge in Data Clustering", in Proc. 19th Intl. Conf. on Machine Learning (ICML 2002), Sydney, Australia, July 2002, Pp. 307-314.
[14] N. Nguyen and R. Caruana, "Improving classification with pairwise constraints: A margin-based approach", in Proc. of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD'08).
[15] K. Wagstaff, C. Cardie, S. Rogers and S. Schroedl, "Constrained K-means clustering with background knowledge", in Proc. of 18th Int. Conf. on Machine Learning ICML'01, Pp. 577-584.
[16] Y. Hu, J. Wang, N. Yu and X.-S. Hua, "Maximum Margin Clustering with Pairwise Constraints", in Proc. of the Eighth IEEE International Conference on Data Mining (ICDM), Pp. 253-262, 2008.
[17] Merz C and Murphy P, UCI Repository of Machine Learning Databases, Available: ftp://ftp.ics.uci.edu/pub/machine-Learning-databases.