									                                                       (IJCSIS) International Journal of Computer Science and Information Security,
                                                       Vol. 9, No. 4, April 2011

  An Efficient Constrained K-Means Clustering using Self Organizing Map

  M. Sakthi (1) and Dr. Antony Selvadoss Thanamani (2)

  (1) Research Scholar, (2) Associate Professor and Head, Department of Computer Science, NGM College, Pollachi, Tamilnadu.

Abstract--- The rapid worldwide increase in available data makes that data difficult to analyze. Organizing data into meaningful collections is one of the most basic forms of understanding and learning, so a proper data mining approach is required to organize the data for better understanding. Clustering is one of the standard approaches in the field of data mining. The main aim of this approach is to organize a dataset into a set of clusters, each consisting of similar data items, as measured by some distance function. K-Means is a widely used clustering algorithm because of its effectiveness and simplicity. When the dataset is larger, however, K-Means tends to misclassify data points. To overcome this problem, some constraints must be included in the algorithm. The resulting algorithm is called Constrained K-Means Clustering. The constraints used in this paper are the must-link constraint, the cannot-link constraint, the δ-constraint and the ε-constraint. A Self Organizing Map (SOM) is used in this paper to generate the must-link and cannot-link constraints. The experimental results show that the proposed algorithm gives better classification than the standard K-Means clustering technique.

Keywords--- K-Means, Self Organizing Map (SOM), Constrained K-Means

I.    INTRODUCTION

The growth and development of sensing and storage technology, together with the rapid development of applications such as internet search, digital imaging, and video surveillance, have generated many high-volume, high-dimensional data sets. As the majority of the data are stored digitally in electronic media, they offer high potential for the development of automatic data analysis, classification, and retrieval approaches.

Clustering is one of the most popular approaches used for data analysis and classification. Cluster analysis is widely used in disciplines that involve analysis of multivariate data. A search through Google Scholar found 1,660 entries containing the words "data clustering" that appeared in 2007 alone. This large number of entries indicates the significance of clustering in data analysis. It is difficult to list exhaustively the scientific fields and applications that have used clustering methods, as well as the thousands of existing techniques.

The main aim of data clustering is to identify the natural classification of a set of patterns, points, or objects. Webster defines cluster analysis as “a statistical classification method for discovering whether the individuals of a population fall into various groups by making quantitative comparisons of multiple characteristics”. Another definition of clustering is: given a representation of n objects, determine K groups according to a measure of similarity such that the similarities among objects in the same group are high whereas the similarities between objects in different groups are low.

The main advantages of using clustering algorithms are:

    •    Compactness of representation.
    •    Fast, incremental processing of new data points.
    •    Clear and fast identification of outliers.

The most widely used clustering technique is K-Means clustering, because K-Means is very simple to implement and is effective in clustering. However, K-Means clustering loses performance when a large dataset is involved. This can be addressed by including some constraints [8, 9] in the clustering algorithm; the resulting method is called Constrained K-Means Clustering [7, 10]. The constraints used in this paper are the must-link constraint, the cannot-link constraint [14, 16], the δ-constraint and the ε-constraint. A Self Organizing Map (SOM) is used in this paper for generating the must-link and cannot-link constraints.

II.    RELATED WORKS

Zhang Zhe et al. [1] proposed an improved K-Means clustering algorithm. The K-means algorithm [8] is extensively used in spatial clustering. The mean value of each cluster centroid in this approach is taken as the heuristic information, so it has some limitations, such as sensitivity to the initial centroid and instability. The enhanced clustering algorithm refers to the best clustering centroid found during the optimization of the clustering centroid. This increases
the searching probability around the best centroid and improves the robustness of the approach. Experiments were performed on two groups of representative datasets, and the experimental observations show that the improved K-means algorithm performs better in global searching and is less sensitive to the initial centroid.

Hai-xiang Guo et al. [2] put forth an improved genetic k-means algorithm for optimal clustering. In the traditional k-means approach, the value of k must be known in advance, and it is very hard to determine k accurately beforehand. The authors proposed an improved genetic k-means clustering (IGKM) algorithm and built a fitness function defined as a product of three factors, maximization of which guarantees the formation of a small number of compact clusters with large separation between at least two clusters. Experiments were conducted on two artificial and three real-life data sets to compare IGKM with other traditional methods such as the k-means algorithm, a GA-based technique and the genetic k-means algorithm (GKM) in terms of inter-cluster distance (ITD), inner-cluster distance (IND) and rate of separation exactness. The experimental observations show that IGKM reaches the optimal value of k with high accuracy.

Yanfeng Zhang et al. [3] proposed an agglomerative fuzzy K-means clustering method with automatic selection of the cluster number (NSS-AKmeans) for learning the optimal number of clusters and providing significant clustering results. High-density areas can be detected by NSS-AKmeans, and from these the initial cluster centers can also be determined with a neighbor sharing selection approach. An Agglomeration Energy (AE) factor is proposed to choose an initial cluster that represents the global density relationship of objects. Moreover, a Neighbors Sharing Factor (NSF) is used to calculate the local neighbor sharing relationship of objects. The agglomerative fuzzy k-means clustering algorithm is then used to further merge these initial centers to obtain the preferred number of clusters and produce better clustering results. Experimental observations on several data sets showed that the proposed clustering approach was very effective in automatically identifying the true cluster number and also in providing correct clustering results.

Xiaoyun Chen et al. [4] described GK-means, an efficient K-means clustering algorithm based on a grid. Clustering analysis is extensively used in several applications such as pattern recognition, data mining and statistics. The K-means approach, based on minimizing a formal objective function, is the most broadly used in research. However, the user must specify the number of clusters k, and it is difficult to choose effective initial centers. It is also very susceptible to noisy data points. In that paper, the authors mainly focus on choosing better initial centers to enhance the quality of k-means and to reduce its computational complexity. The proposed GK-means integrates a grid structure and a spatial index with the k-means clustering approach. Theoretical analysis and experimental observation show that the proposed approach performs significantly better.

Trujillo et al. [5] proposed an approach combining K-means and semivariogram-based grid clustering. Clustering is widely used in various applications, including data mining, information retrieval, image segmentation, and data classification. That paper proposes a clustering technique for grouping data sets that are indexed in space. The approach mainly depends on the k-means clustering technique and grid clustering. K-means clustering is the simplest and most widely used approach; its main disadvantage is that it is sensitive to the selection of the initial partition. Grid clustering is extensively used for grouping data that are indexed in space. The main aim of the proposed clustering approach is to eliminate the high sensitivity of k-means clustering to the starting conditions by using the available spatial information. A semivariogram-based grid clustering technique is used, which exploits the spatial correlation to obtain the bin size. The authors combine this approach with a conventional k-means clustering technique because the bins are constrained to regular blocks while the spatial distribution of objects is irregular. The semivariogram provides an effective initialization of k-means. The experimental results show that the final partition preserves the spatial distribution of the objects.

Huang et al. [6] put forth automated variable weighting in k-means type clustering, which can automatically estimate variable weights. A new step is introduced into the k-means algorithm to iteratively update variable weights depending on the current partition of the data, and a formula for weight calculation is also proposed. A convergence theorem for the new clustering algorithm is given. The variable weights produced by the approach estimate the significance of variables in clustering and can be used for variable selection in various data mining applications where large and complex real data are often used. Experiments were conducted on both synthetic and real data, and the experimental observations show that the proposed approach provides higher performance than traditional k-means type algorithms in recovering clusters in data.

III.    METHODOLOGY

The methodology proposed for clustering the data is presented in this section. First, K-Means clustering is described. Then constraint-based K-Means clustering is presented, followed by the constraints used in the Constrained K-Means algorithm. A Self Organizing Map is used in this paper to generate the must-link and cannot-link constraints.

K-Means Clustering

Given a set of data samples, a desired number of clusters, k, and a set of k initial starting points, the k-means clustering technique determines the desired number of distinct clusters and their centroids. A centroid is defined as the point whose coordinates are obtained by averaging each of the coordinates (i.e., feature values) of the points assigned to the cluster. Formally, the k-means clustering algorithm follows these steps.

Step 1: Choose a number of desired clusters, k.

Step 2: Choose k starting points to be used as initial estimates of the cluster centroids. These are the initial starting values.

Step 3: Examine each point in the data set and assign it to the cluster whose centroid is nearest to it.

Step 4: Once every point has been assigned to a cluster, recalculate the k centroids.

Step 5: Repeat steps 3 and 4 until no point changes its cluster assignment, or until a maximum number of passes through the data set has been performed.

Constrained K-Means Clustering

Constrained K-Means Clustering [15] is similar to the standard K-Means Clustering algorithm, except that the constraints must be satisfied while assigning the data points to clusters. The algorithm for Constrained K-Means Clustering is described below.

Step 1: Choose a number of desired clusters, k.

Step 2: Choose k starting points to be used as initial estimates of the cluster centroids. These are the initial starting values.

Step 3: Examine each point in the data set and assign it to the cluster whose centroid is nearest to it only when violate-constraints ( ) returns false.

Step 4: Once every point has been assigned to a cluster, recalculate the k centroids.

Step 5: Repeat steps 3 and 4 until no point changes its cluster assignment, or until a maximum number of passes through the data set has been performed.

Function violate-constraints ( )

    if must_link constraint not satisfied
        return true
    elseif cannot_link constraint not satisfied
        return true
    elseif δ-constraint not satisfied
        return true
    elseif ε-constraint not satisfied
        return true
    else
        return false

Constraints used for Constrained K-Means Clustering

The constraints [11, 12, 13] used for Constrained K-Means Clustering are:

    •    Must-link constraint
    •    Cannot-link constraint
    •    δ-constraint
    •    ε-constraint

Consider S = {s1, s2, …, sn} as a set of n data points that are to be separated into clusters. For any pair of points si and sj in S, the distance between them is denoted by d(si, sj), which is symmetric, so that d(si, sj) = d(sj, si). The constraints are:

    •    Must-link constraint: two points si and sj (i ≠ j) in S have to be in the same cluster.
    •    Cannot-link constraint: two points si and sj (i ≠ j) in S must not be placed in the same cluster.
    •    δ-constraint: This constraint specifies a value δ > 0. Formally, for any pair of clusters Si and Sj (i ≠ j), and any pair of points sp and sq such that sp ∈ Si and sq ∈ Sj, d(sp, sq) ≥ δ.
    •    ε-constraint: This constraint specifies a value ε > 0, and the feasibility requirement is the following: for any cluster Si containing two or more points and for any point sp ∈ Si, there must be another point sq ∈ Si such that d(sp, sq) ≤ ε.
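
The constrained assignment step and the four constraints can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the implementation used in the experiments (which was done in MATLAB). All function names are hypothetical. The paper does not say what happens when the nearest cluster violates a constraint, so the sketch assumes the next-nearest feasible cluster is tried, as in the constrained K-means of Wagstaff et al. [10]. It also assumes that the δ- and ε-constraints are checked on a complete partition rather than during assignment, and that d is the Euclidean distance.

```python
import numpy as np


def violates_pairwise(i, c, labels, must_link, cannot_link):
    """Return True if placing point i in cluster c breaks a must-link or
    cannot-link pair. labels[j] == -1 means j is not yet assigned in this pass."""
    for a, b in must_link:
        if i in (a, b):
            other = b if a == i else a
            if labels[other] != -1 and labels[other] != c:
                return True
    for a, b in cannot_link:
        if i in (a, b):
            other = b if a == i else a
            if labels[other] == c:
                return True
    return False


def satisfies_delta(X, labels, delta):
    """delta-constraint: any two points in different clusters are >= delta apart."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    different = labels[:, None] != labels[None, :]
    return bool(np.all(d[different] >= delta))


def satisfies_epsilon(X, labels, eps):
    """epsilon-constraint: in every cluster with two or more points, each point
    has another point of the same cluster within distance eps."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbour
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        if len(members) < 2:
            continue
        if np.any(d[np.ix_(members, members)].min(axis=1) > eps):
            return False
    return True


def constrained_kmeans(X, init_centroids, must_link=(), cannot_link=(), max_passes=100):
    """Steps 2 to 5 with pairwise constraints checked during assignment."""
    centroids = np.array(init_centroids, dtype=float)
    prev = None
    for _ in range(max_passes):
        labels = np.full(len(X), -1)
        for i, x in enumerate(X):
            # Try clusters from nearest to farthest centroid (assumption).
            for c in np.argsort(np.linalg.norm(centroids - x, axis=1)):
                if not violates_pairwise(i, c, labels, must_link, cannot_link):
                    labels[i] = c
                    break
            else:
                raise ValueError(f"no feasible cluster for point {i}")
        if prev is not None and np.array_equal(labels, prev):
            break  # no point changed its cluster
        for c in range(len(centroids)):
            if np.any(labels == c):  # keep the old centroid for an empty cluster
                centroids[c] = X[labels == c].mean(axis=0)
        prev = labels
    return labels, centroids
```

Calling constrained_kmeans(X, init_centroids, must_link=[(2, 3)]) returns a labelling in which points 2 and 3 share a cluster; satisfies_delta and satisfies_epsilon can then be applied to the returned labels to test the two cluster-level constraints.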

The must-link and cannot-link constraints are determined with the help of an appropriate neural network. For this purpose, this paper uses a Self Organizing Map.

Self Organizing Map

The Self-Organizing Map (SOM) is a general type of neural network technique and a nonlinear regression method that can be used to determine relationships among inputs and outputs or to categorize data so as to reveal previously unidentified patterns or structures. It is an outstanding technique in the exploratory phase of data mining. Results indicate that self-organizing maps can be a feasible technique for categorizing large quantities of data. The SOM has established its place as an extensively used technique in data analysis and visualization of high-dimensional data. Among other statistical techniques the SOM has no close counterpart, and thus it offers a complementary view of the data. On the other hand, SOM is the most extensively used technique in this group, as it offers some notable merits over the alternatives. These include ease of use, particularly for inexperienced users, and a highly intuitive display of the data projected onto a regular two-dimensional grid, as on a sheet of paper. The most important potential of the SOM is in exploratory data analysis, which differs from regular statistical data analysis in that there is no assumed set of hypotheses that are validated in the analysis. Instead, the hypotheses are generated from the data in the data-driven exploratory stage and validated in the confirmatory stage. There are a few cases in which the exploratory stage alone may be adequate, such as visualization of data with no additional quantitative statistical inference upon it. In practical data analysis problems, the most common task is to identify dependencies among variables. In such problems, SOM can be used to gain insight into the data and for the initial search for potential dependencies. In general, the findings need to be validated with more conventional techniques in order to assess the confidence of the conclusions and to discard those that are not statistically significant.

Initially, the chosen parameters are normalized and the SOM network is initialized. The SOM is then trained to offer the maximum likelihood estimation, so that a particular data sample can be linked with a particular node in the categorization layer. Self-organizing networks assume a topological structure among the cluster units. There are m cluster units, arranged in a one- or two-dimensional array; the input signals are n-dimensional. Figure 1 shows the architecture of the self-organizing network (SOM), which consists of an input layer and a Kohonen (clustering) layer.

Finally, the categorized data is obtained from the SOM. From this categorized data, the must-link and cannot-link constraints are derived. Data points that fall in the same SOM cluster are treated as must-link pairs, and data points in different clusters are treated as cannot-link pairs. These constraints are used in the constraint checking module of the constrained K-Means algorithm.

Figure 1: Architecture of self-organizing map

IV.    EXPERIMENTAL RESULTS

The proposed technique is evaluated on two benchmark datasets, Iris and Wine, from the UCI Machine Learning Repository [17]. All algorithms are run with the same initial values and stopping conditions. The experiments are all performed on a GENX computer with a 2.6 GHz Core (TM) 2 Duo processor using MATLAB version 7.5.

Experiment with Iris Dataset

The Iris flower data set (Fisher's Iris data set) is a multivariate data set. The dataset consists of 50 samples from each of three species of Iris flowers (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from every sample: the length and the width of the sepal and the petal, in centimeters. Based on the combination of the four features, Fisher developed a linear discriminant model to distinguish the species from each other. It is used as a typical test for many classification techniques. The proposed method is tested first using this Iris dataset, which has 150 instances with four continuous features: 50 instances for each class.

To evaluate the efficiency of the proposed approach, it is compared with the existing K-Means algorithm. The Mean Square Error (MSE) of the centers is $\|v_c - v_t\|^2$, where $v_c$ is the computed center and $v_t$ is the true center. The cluster centers found by the proposed K-Means are closer to the true centers than the centers found by the K-Means algorithm. The mean square errors of the three cluster centers for the two approaches are presented in Table I. The execution times of the proposed and standard K-Means algorithms are shown in Figure 2.

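As a rough illustration of the error measure used for the cluster centers, the following minimal Python sketch computes the squared distance between each computed center and the corresponding true center. It assumes the squared Euclidean norm and that the computed and true centers are already matched row by row. The function name center_mse is hypothetical, and any values passed to it here are illustrative rather than taken from the experiments.

```python
import numpy as np


def center_mse(computed_centers, true_centers):
    """Squared Euclidean error ||vc - vt||^2 for each cluster center.

    Row k of computed_centers is assumed to correspond to row k of true_centers.
    """
    vc = np.asarray(computed_centers, dtype=float)
    vt = np.asarray(true_centers, dtype=float)
    return np.sum((vc - vt) ** 2, axis=1)
```

The function returns one error value per cluster, which is the shape of the per-cluster values reported in the tables.
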
                              TABLE I
  MEAN SQUARE ERROR VALUE OBTAINED FOR THE THREE CLUSTERS IN THE IRIS DATASET

                       K-Means      Proposed K-Means
        Cluster 1      0.3765       0.2007
        Cluster 2      0.4342       0.2564
        Cluster 3      0.3095       0.1943

Figure 2: Execution Time for Iris Dataset (vertical axis: Time (seconds))

Experiment with Wine Dataset

The Wine dataset is the result of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines. Classes 1, 2 and 3 have 59, 71 and 48 instances respectively. There are 13 attributes in total.

The MSE values for the three clusters are presented in Table II. The execution times of the proposed and standard K-Means algorithms are shown in Figure 3.

                              TABLE II
  MEAN SQUARE ERROR VALUE OBTAINED FOR THE THREE CLUSTERS IN THE WINE DATASET

                       K-Means      Proposed K-Means
        Cluster 1      0.4364       0.3094
        Cluster 2      0.5562       0.3572
        Cluster 3      0.2142       0.1843

Figure 3: Execution Time for Wine Dataset (vertical axis: Time (seconds))

From the experimental observations it can be seen that the proposed approach produces better clusters than the existing approach. The MSE value is considerably reduced for both datasets, which indicates better accuracy for the proposed approach. The execution time is also reduced compared with the existing approach on both datasets.

V.    CONCLUSION

The worldwide increase in the amount of data creates the need for better analysis techniques for understanding the data. One of the most essential modes of understanding and learning is categorizing data into reasonable groups. This can be achieved by a well-known data mining technique called clustering. Clustering separates the given data into groups according to the distance between the data points, which helps in better understanding and analysis of vast data. One of the most widely used clustering techniques is K-Means clustering, because it is simple and efficient. However, it lacks classification accuracy when large data sets are clustered, so K-Means clustering needs to be improved to suit all kinds of data. Hence, a new clustering technique called Constrained K-Means Clustering is introduced. The constraints used in this paper are the must-link constraint, the cannot-link constraint, the δ-constraint and the ε-constraint. SOM is used in this paper for generating the must-link and cannot-link constraints. The experimental results show that the proposed technique results in better classification and also takes less time for
classification. In future, this work can be extended by using more suitable constraints in the Constrained K-Means Clustering technique.

REFERENCES

[1]    Zhang Zhe, Zhang Junxi and Xue Huifeng, "Improved K-Means Clustering Algorithm", Congress on Image and Signal Processing, Vol. 5, Pp. 169-172, 2008.
[2]    Hai-xiang Guo, Ke-jun Zhu, Si-wei Gao and Ting Liu, "An Improved Genetic k-means Algorithm for Optimal Clustering", Sixth IEEE International Conference on Data Mining Workshops, Pp. 793-797,
[3]    Yanfeng Zhang, Xiaofei Xu and Yunming Ye, "NSS-AKmeans: An Agglomerative Fuzzy K-means clustering method with automatic selection of cluster number", 2nd International Conference on Advanced Computer Control, Vol. 2, Pp. 32-38, 2010.
[4]    Xiaoyun Chen, Youli Su, Yi Chen and Guohua Liu, "GK-means: an Efficient K-means Clustering Algorithm Based on Grid", International Symposium on Computer Network and Multimedia Technology, Pp. 1-4, 2009.
[5]    M. Trujillo and E. Izquierdo, "Combining K-means and semivariogram-based grid clustering", 47th International Symposium, Pp. 9-12, 2005.
[6]    J.Z. Huang, M.K. Ng, Hongqiang Rong and Zichen Li, "Automated variable weighting in k-means type clustering", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 27, No. 5, Pp. 657-668,
[7]    Yi Hong and Sam Kwong, "Learning Assignment Order of Instances for the constrained k-means clustering algorithm", IEEE Transactions on Systems, Man, and Cybernetics, Vol. 39, No. 2, April 2009.
[8]    I. Davidson, M. Ester and S.S. Ravi, "Agglomerative hierarchical clustering with constraints: Theoretical and empirical results", in Proc. of Principles of Knowledge Discovery from Databases, PKDD 2005.
[9]    Kiri L. Wagstaff, Sugato Basu and Ian Davidson, "When is constrained clustering beneficial, and why?", National Conference on Artificial Intelligence, Boston, Massachusetts, 2006.
[10]   Kiri Wagstaff, Claire Cardie, Seth Rogers and Stefan Schrodl, "Constrained K-means Clustering with Background Knowledge", ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning, 2001.
[11]   I. Davidson, M. Ester and S.S. Ravi, "Efficient incremental constrained clustering", in Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 12-15, 2007, San Jose, California, USA.
[12]   I. Davidson, M. Ester and S.S. Ravi, "Clustering with constraints: Feasibility issues and the K-means algorithm", in Proc. SIAM SDM 2005, Newport Beach, USA.
[13]   D. Klein, S.D. Kamvar and C.D. Manning, "From Instance-Level constraints to space-level constraints: Making the most of Prior Knowledge in Data Clustering", in Proc. 19th Intl. on Machine Learning (ICML 2002), Sydney, Australia, July 2002, Pp. 307-314.
[14]   N. Nguyen and R. Caruana, "Improving classification with pairwise constraints: A margin-based approach", in Proc. of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD'08).
[15]   K. Wagstaff, C. Cardie, S. Rogers and S. Schroedl, "Constrained K-means clustering with background knowledge", in Proc. of 18th Int. Conf. on Machine Learning (ICML'01), Pp. 577-584.
[16]   Y. Hu, J. Wang, N. Yu and X.-S. Hua, "Maximum Margin Clustering with Pairwise Constraints", in Proc. of the Eighth IEEE International Conference on Data Mining (ICDM), Pp. 253-262, 2008.
[17]   Merz C and Murphy P, UCI Repository of Machine Learning Databases,

ISSN 1947-5500