(IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 4, July 2010
http://sites.google.com/site/ijcsis/ ISSN 1947-5500

An Optimized Clustering Algorithm Using Genetic Algorithm and Rough Set Theory Based on Kohonen Self Organizing Map

Asgarali Bouyer (1), Abdolreza Hatamlou (2), Abdul Hanan Abdullah (3)

(1,2) Department of Computer Science
(1) Islamic Azad University – Miyandoab Branch, Miyandoab, Iran
(2) University Kebangsaan Malaysia, Selangor, Malaysia
(3) Department of Computer and Information Systems, Faculty of Computer Science and Information Systems, Universiti Teknologi Malaysia, 81310 Skudai, Johor Bahru, Malaysia
(1) basgarali2@live.utm.my, (2) hatamlou@iaukhoy.ac.ir, (3) hanan@utm.my

Abstract: The Kohonen self organizing map (SOM) is an efficient tool in the exploratory phase of data mining and pattern recognition. The SOM is a popular tool that maps a high-dimensional space into a small number of dimensions by placing similar elements close together, forming clusters. Recently, most researchers have found that, to account for the uncertainty involved in cluster analysis, crisp boundaries are not necessary in some clustering operations. In this paper, an optimized two-level clustering algorithm based on the SOM, which employs rough set theory and a genetic algorithm, is proposed to overcome the uncertainty problem. Evaluation of the proposed algorithm on our collected poultry disease data and on the Iris data shows that it is more accurate than crisp clustering methods and reduces the errors.

Index Terms: SOM, Clustering, Rough set theory, Genetic Algorithm.

I. INTRODUCTION

The self organizing map (SOM), proposed by Kohonen [1], has been widely used in industrial applications such as pattern recognition, biological modeling, data compression, signal processing and data mining [2]-[5]. It is an unsupervised and nonparametric neural network approach. The success of the SOM algorithm lies in its simplicity, which makes it easy to understand, simulate and use in many applications. The basic SOM consists of neurons usually arranged in a two-dimensional structure such that there are neighborhood relations among the neurons. After training is complete, each neuron is attached to a feature vector of the same dimension as the input space. By assigning each input vector to the neuron with the nearest feature vector, the SOM is able to divide the input space into regions (clusters) with common nearest feature vectors. This process can be considered as performing vector quantization (VQ) [6]. In addition, because of the neighborhood relation contributed by the inter-connections among neurons, the SOM exhibits another important property: topology preservation.

Clustering algorithms attempt to organize unlabeled input vectors into clusters such that points within a cluster are more similar to each other than to vectors belonging to different clusters [7]. Clustering methods are of five types: hierarchical clustering, partitioning clustering, density-based clustering, grid-based clustering and model-based clustering [8]. Rough set theory employs two thresholds, an upper and a lower one, in the clustering process, which results in clusters with a rough appearance. This technique can also be defined in an incremental manner, i.e. the number of clusters is not predefined by the user.

Our goal is an optimized clustering algorithm for use in poultry disease prediction. The clustering will assist further analysis of the poultry symptom data by detecting outliers. Analyzing outliers can reveal surprising facts hidden inside the data, such as ambiguous patterns that are still assumed to belong to one of the predefined or undefined classes. Clustering is important in detecting outliers in order to avoid the high cost of misclassification. To cater for the complex nature of the data in our problem domain, clustering techniques based on machine learning approaches such as the self organizing map (SOM), kernel machines and fuzzy methods are promising tools for clustering poultry symptoms (based on observation data: body, feathers, skin, head, muscle, lung, heart, intestines, ovary, etc.).

In this paper, a new two-level clustering algorithm is proposed. At the first level, the data are trained by the SOM neural network. The clustering at the second level is a rough set based incremental clustering approach [9], which is applied to the output of the SOM and requires only a single scan of the neurons. The optimal number of clusters can be found by rough set theory, which groups the given neurons into a set of overlapping clusters (and thereby clusters the mapped data). The overlapped neurons are then assigned to the true clusters they belong to by applying a genetic algorithm. The genetic algorithm has been adopted to minimize the uncertainty that comes from some clustering operations. In our previous work [3] a hybrid of SOM and rough sets was applied to capture the ambiguity of clusters, but the experimental results show that the proposed algorithm (Genetic Rough SOM) outperforms the previous one. The next important process is to collect poultry data on the common and important diseases that can affect the respiratory and non-respiratory systems of poultry. The first phase is to identify the format and values of the input parameters from the available information. The second phase is to investigate and develop data conversion and reduction algorithms for the input parameters.

This paper is organized as follows. Section II outlines the basics of the SOM algorithm. The basics of the rough set incremental clustering approach are described in Section III. Section IV describes the essence of the genetic algorithm. The proposed algorithm is presented in Section V. Section VI is dedicated to the experimental results, and Section VII provides a brief conclusion and future work.

II. SELF ORGANIZING MAP

Competitive learning is an adaptive process in which the neurons in a neural network gradually become sensitive to different input categories, i.e. sets of samples in a specific domain of the input space. After training, a division of neural nodes emerges in the network to represent different patterns of the inputs.

The division is enforced by competition among the neurons: when an input $x$ arrives, the neuron that is best able to represent it wins the competition and is allowed to learn it even better. If there exists an ordering between the neurons, i.e. the neurons are located on a discrete lattice, the competitive learning algorithm can be generalized. Not only the winning neuron but also its neighboring neurons on the lattice are allowed to learn; the overall effect is that the final map becomes an ordered map in the input space. This is the essence of the SOM algorithm. The SOM consists of $m$ neurons located on a regular low-dimensional grid, usually one- or two-dimensional. The lattice of the grid is either hexagonal or rectangular.

The basic SOM algorithm is iterative. Each neuron $i$ has a $d$-dimensional feature vector $w_i = [w_{i1}, \ldots, w_{id}]$. At each training step $t$, a sample data vector $x(t)$ is randomly chosen from the training set. The distances between $x(t)$ and all feature vectors are computed. The winning neuron, denoted by $c$, is the neuron with the feature vector closest to $x(t)$:

$$c = \arg\min_{i} \| x(t) - w_i \|, \quad i \in \{1, \ldots, m\} \tag{1}$$

The set of neighboring nodes of the winning node is denoted by $N_c$. We define $h_{ic}(t)$ as the neighborhood kernel function around the winning neuron $c$ at time $t$. The neighborhood kernel function is a non-increasing function of time and of the distance of neuron $i$ from the winning neuron $c$. The kernel can be taken as a Gaussian function:

$$h_{ic}(t) = \exp\left( -\frac{\| \mathrm{Pos}_i - \mathrm{Pos}_c \|^2}{2\sigma(t)^2} \right) \tag{2}$$

where $\mathrm{Pos}_i$ denotes the coordinates of neuron $i$ on the output grid and $\sigma(t)$ is the kernel width. The weight update rule in the sequential SOM algorithm can be written as:

$$w_i(t+1) = \begin{cases} w_i(t) + \varepsilon(t)\, h_{ic}(t)\, \big( x(t) - w_i(t) \big) & \forall i \in N_c \\ w_i(t) & \text{otherwise} \end{cases} \tag{3}$$

Both the learning rate $\varepsilon(t)$ and the neighborhood width $\sigma(t)$ decrease monotonically with time. During training, the SOM behaves like a flexible net that folds onto the cloud formed by the training data. Because of the neighborhood relations, neighboring neurons are pulled in the same direction, and thus the feature vectors of neighboring neurons resemble each other. There are many variants of the SOM [10, 11]. However, these variants are not considered in this paper, because the proposed algorithm is based on the SOM and is not a new variant of it.

The 2D map can be easily visualized and thus gives useful information about the input data. The usual way to display the cluster structure of the data is to use a distance matrix, such as the U-matrix [12]. The U-matrix method displays the SOM grid according to the distances between neighboring neurons. Clusters can be identified where inter-neuron distances are low, and borders where inter-neuron distances are high. Another method of visualizing the cluster structure is to assign the input data to their nearest neurons. Some neurons then have no input data assigned to them. These neurons can be used as the borders of clusters [13].

III. ROUGH SET INCREMENTAL CLUSTERING

This algorithm is a soft clustering method employing rough set theory [14]. It groups the given data set into a set of overlapping clusters. Each cluster $C \subseteq U$ is represented by a lower approximation and an upper approximation $(\underline{A}(C), \overline{A}(C))$. Here $U$ is the set of all objects under exploration. The lower and upper approximations of $C_i \subseteq U$ are required to follow some of the basic rough set properties:

(1) $\emptyset \subseteq \underline{A}(C_i) \subseteq \overline{A}(C_i) \subseteq U$
(2) $\underline{A}(C_i) \cap \underline{A}(C_j) = \emptyset, \; i \neq j$
(3) $\underline{A}(C_i) \cap \overline{A}(C_j) = \emptyset, \; i \neq j$
(4) If an object $u_k \in U$ is not part of any lower approximation, then it must belong to two or more upper approximations.

Note that (1)-(4) are not independent. However, enumerating them is helpful in understanding the basics of rough set theory.

The lower approximation $\underline{A}(C)$ contains all the patterns that definitely belong to the cluster $C$, and the upper approximation $\overline{A}(C)$ permits overlap. Since the upper approximation permits overlaps, each set of data points shared by a group of clusters defines an indiscernible set. Thus, the ambiguity in assigning a pattern to a cluster is captured using the upper approximation. Employing rough set theory, the proposed clustering scheme generates soft clusters (clusters with permitted overlap in the upper approximation).

For a rough set clustering scheme and two given objects $u_h, u_k \in U$ we have three distinct possibilities:

• Both $u_k$ and $u_h$ are in the same lower approximation $\underline{A}(C)$.
• Object $u_k$ is in the lower approximation $\underline{A}(C)$ and $u_h$ is in the corresponding upper approximation $\overline{A}(C)$, and case 1 is not applicable.
• Both $u_k$ and $u_h$ are in the same upper approximation $\overline{A}(C)$, and cases 1 and 2 are not applicable.

The quality of a conventional clustering scheme is determined using the within-group error [15] $\Delta$, given by:

$$\Delta = \sum_{i=1}^{m} \; \sum_{u_h, u_k \in C_i} \mathrm{distance}(u_h, u_k) \tag{4}$$

where $u_h, u_k$ are objects in the same cluster $C_i$.

For the rough set possibilities above, three variants of equation (4) can be defined as follows:

$$\begin{aligned}
\Delta_1 &= \sum_{i=1}^{m} \; \sum_{u_h, u_k \in \underline{A}(X_i)} \mathrm{distance}(u_h, u_k) \\
\Delta_2 &= \sum_{i=1}^{m} \; \sum_{u_h \in \underline{A}(X_i) \text{ and } u_k \in \overline{A}(X_i)} \mathrm{distance}(u_h, u_k) \\
\Delta_3 &= \sum_{i=1}^{m} \; \sum_{u_h, u_k \in \overline{A}(X_i)} \mathrm{distance}(u_h, u_k)
\end{aligned} \tag{5}$$

The total error of rough set clustering is then a weighted sum of these errors:

$$\Delta_{total} = w_1 \times \Delta_1 + w_2 \times \Delta_2 + w_3 \times \Delta_3, \quad \text{where } w_1 > w_2 > w_3. \tag{6}$$

Since $\Delta_1$ corresponds to situations where both objects definitely belong to the same cluster, the weight $w_1$ should have the highest value.

IV. GENETIC ALGORITHM

The genetic algorithm (GA) was proposed by John Holland in the early 1970s. It applies mechanisms of natural evolution, such as crossover, mutation and survival of the fittest, to optimization and machine learning. The GA provides a very efficient search method that works on a population, and it has been applied to many optimization and classification problems [16]-[17]. The general GA process is as follows:

(1) Initialize the population of genes.
(2) Calculate the fitness of each individual in the population.
(3) Reproduce the selected individuals to form a new population according to each individual's fitness.
(4) Perform crossover and mutation on the population.
(5) Repeat steps (2) through (4) until some condition is satisfied.

The crossover operation swaps part of the genetic bit string between the parents. It emulates the crossover of genes in the real world, in which descendants inherit characteristics from both parents. The mutation operation inverts some bits of the whole bit string at a very low rate, just as mutants appear only rarely in the real world. Fig. 1 shows how the crossover and mutation operations are applied in the genetic algorithm. Each individual in the population evolves towards higher fitness generation by generation.

[Figure 1 panels. Crossover: parents 010000001001 × 111001011101 produce offspring 010000011101 and 111001001001. Mutation: 011100011001 becomes 011101011001.]

Figure 1. Crossover and Mutation

V. GENETIC ROUGH SET CLUSTERING OF THE SELF ORGANIZING MAP

In this paper a rectangular grid is used for the SOM. Before the training process begins, the input data are normalized. This prevents one attribute from dominating the clustering criterion. The normalization of a new pattern $X_i = \{x_{i1}, \ldots, x_{id}\}$ for $i = 1, 2, \ldots, N$ is as follows:

$$X_i = \frac{X_i}{\| X_i \|} \tag{7}$$

Once the training phase of the SOM neural network is complete, the output grid of neurons, which is now stable under further network iterations, is clustered by applying the rough set algorithm described in the previous section. The similarity measure used for rough set clustering of the neurons is the Euclidean distance (the same measure used for training the SOM). In the proposed method (see Fig. 2), neurons to which no data were ever mapped are excluded from processing by the rough set algorithm.

[Figure 2: SOM grid with the lower approximation ("Lower approx") and upper approximation ("Upper approx") regions marked.]

Figure 2. Clustering of the Self Organizing Map. The overlapped neurons are highlighted for two clusters.

From the rough set algorithm it can be observed that if two neurons are defined as indiscernible (neurons in the upper approximation of two or more clusters), they have a certain level of similarity with respect to the clusters they belong to, and that similarity relation has to be symmetric. Thus, the similarity measure must be symmetric.

Rough set clustering of the SOM detects the overlapped neurons and, correspondingly, the overlapped data (the data in the upper approximation). In the experiments, to calculate errors and uncertainty, the previous equations are applied to the results of the SOM (clustered and overlapped data). Then, for each overlapped neuron, a gene is generated that represents the alternative distances from each cluster leader. Fig. 3 shows an example of the genes generated for $m$ overlapped neurons on $n$ existing cluster leaders.

    gene 1    d1  d2  d3  d4  ...  dn-1  dn
    gene 2    d1  d2  d3  d4  ...  dn-1  dn
    gene 3    d1  d2  d3  d4  ...  dn-1  dn
    ...
    gene m    d1  d2  d3  d4  ...  dn-1  dn

Figure 3. Generated genes. m is the number of overlapped neurons and n is the number of existing clusters. The highlighted di is the optimal one that minimizes the fitness function.

After the genes have been generated, the genetic algorithm is employed to minimize the following fitness function, which represents the total sum of each $d_j$ of the related gene:

$$F = \sum_{i=1}^{m} \sum_{j=1}^{n} g_i(d_j) \tag{8}$$

The aim of the proposed approach is to make the genetic rough set clustering of the SOM as precise as possible. Therefore, a precision measure is needed to evaluate the quality of the proposed approach. A possible precision measure can be defined by the following equation [14]:

$$\mathrm{certainty} = \frac{\text{Number of objects in lower approx}}{\text{Total number of objects}} \tag{9}$$

VI. EXPERIMENT RESULTS

To demonstrate the effectiveness of the proposed clustering algorithm GR-SOM (Genetic Rough set Incremental clustering of the SOM), two phases of experiments were carried out on the well-known Iris data set [18] and on our collected data. The Iris data set, which has been widely used in pattern classification, consists of 150 data points of four dimensions; our collected data set has 48 data points. The Iris data are divided into three classes with 50 points each. The first class of Iris plant is linearly separable from the other two. The other two classes overlap to some extent.

The first phase of the experiments presents the uncertainty that comes from the data set, and in the second phase the errors are generated. The results of GR-SOM and RI-SOM [3] (Rough set Incremental SOM) are compared to I-SOM [4] (Incremental clustering of SOM). The input data are normalized such that the value of each datum in each dimension lies in $[0,1]$.

For training, a 10 × 10 SOM with 100 epochs on the input data is used. The general parameters of the genetic algorithm are configured as in Table I. Fig. 4 shows the certainty generated by (9) from epoch 100 to 500 on the mentioned data set. From the certainty obtained, it is clear that GR-SOM can efficiently detect the overlapped data that have been mapped by overlapped neurons (Table II).

In the second phase, the same initialization of the SOM is used. The errors that come from the data sets, according to (5) and (6), are generated by our proposed algorithm (Table III). The weighted sum (6) is configured as in (10).

TABLE I. GENERAL PARAMETERS OF THE GENETIC ALGORITHM

    Population Size          50
    Number of Evaluation     10
    Crossover Rate           0.25
    Mutation Rate            0.001
    Number of Generation     100

TABLE II. THE CERTAINTY-LEVEL OF GR-SOM, RI-SOM AND I-SOM ON THE IRIS DATA SET FROM EPOCH 100 TO 500.

    Epoch      100      200      300      400      500
    I-SOM      33.33    65.23    76.01    89.47    92.01
    RI-SOM     67.07    73.02    81.98    91.23    97.33
    GR-SOM     69.45    74.34    83.67    94.49    98.01

[Figure 4: plot of certainty (0 to 100) against epoch (100 to 500) for I-SOM, RI-SOM and GR-SOM.]

Figure 4. Comparison of the certainty-level of GR-SOM, RI-SOM and I-SOM on the Iris data set.

$$\sum_{i=1}^{3} w_i = 1, \quad \text{and for each } w_i \text{ we have: } \quad w_i = \frac{1}{6} \times (4 - i). \tag{10}$$

TABLE III. COMPARATIVE GENERATED ERRORS OF GR-SOM AND I-SOM ON THE IRIS DATA SET ACCORDING TO EQUATIONS (5) AND (6).

    Data set     Method     Δ1      Δ2      Δ3      Δtotal
    Iris         GR-SOM     1.05    0.85    0.04    1.4
    Iris         I-SOM      -       -       -       2.8

Furthermore, to demonstrate the effectiveness of the proposed clustering algorithm (RI-SOM), two data sets, one artificial and one real-world, were used in our experiments. The results are compared to I-SOM (Incremental clustering of SOM). The input data are normalized such that the value of each datum in each dimension lies in $[0,1]$. For training, a 10 × 10 SOM with 100 epochs on the input data is used.

The artificial data set consists of 569 data points of 30 dimensions and is trained twice, once with I-SOM and once with RI-SOM. The errors of the generated results are calculated as the difference between the result of equation (9) and 1; see Fig. 5. From Fig. 5 it can be observed that the proposed RI-SOM algorithm generates less error in cluster prediction than I-SOM.

Figure 5. Comparison of the error between the I-SOM and the proposed RI-SOM algorithms on the artificial data set.

VII. CONCLUSION AND FUTURE WORK

In this paper a two-level clustering approach (GR-SOM) has been proposed to predict clusters of high-dimensional data and to detect the uncertainty that comes from overlapping data. The approach is based on rough set theory, which employs a soft clustering that can detect overlapped data in the data set and makes the clustering as precise as possible; the GA is then applied to find the true cluster for each overlapped datum. The results of both phases indicate that GR-SOM is more accurate and generates fewer errors than crisp clustering (I-SOM).

The proposed algorithm detects accurate overlapping clusters in clustering operations. As future work, the overlapped data could also be assigned to the true clusters they belong to by assigning a fuzzy membership value to the indiscernible set of data. In addition, a weight can be assigned to each dimension of the data to improve the overall accuracy.

REFERENCES

[1] T. Kohonen, "Self-organized formation of topologically correct feature maps", Biol. Cybern. 43, 1982, pp. 59–69.
[2] T. Kohonen, Self-Organizing Maps, Springer, Berlin, Germany, 1997.
[3] M.N.M. Sap and Ehsan Mohebi, "Hybrid Self Organizing Map for Overlapping Clusters". The Springer-Verlag Proceedings of the CCIS 2008. Hainan Island, China. Accepted.
[4] M.N.M. Sap and Ehsan Mohebi, "Rough set Based Clustering of the Self Organizing Map". The IEEE Computer Society Proceedings of the 1st Asian Conference on Intelligent Information and Database Systems 2009. Dong Hoi, Vietnam. Accepted.
[5] M.N.M. Sap and Ehsan Mohebi, "A Novel Clustering of the SOM using Rough set". The IEEE Proceedings of the 6th Student Conference on Research and Development 2008. Johor, Malaysia, 2008. Accepted.
[6] R.M. Gray, "Vector quantization". IEEE Acoust. Speech, Signal Process. Mag. 1 (2), 1984, pp. 4–29.
[7] N.R. Pal, J.C. Bezdek, and E.C.K. Tsao, "Generalized clustering networks and Kohonen's self-organizing scheme". IEEE Trans. Neural Networks (4), 1993, pp. 549–557.
[8] J. Han, M. Kamber, "Data mining: concepts and techniques", Morgan-Kaufman, San Francisco, 2000.
[9] S. Asharaf, M. Narasimha Murty, and S.K. Shevade, "Rough set based incremental clustering of interval data", Pattern Recognition Letters, Vol. 27, 2006, pp. 515–519.
[10] Yan and Yaoguang, "Research and application of SOM neural network which based on kernel function". Proceedings of ICNN&B'05. Vol. 1, 2005, pp. 509–511.
[11] M.N.M. Sap and Ehsan Mohebi, "Outlier Detection Methodologies: A Review". Journal of Information Technology, UTM, Vol. 20, Issue 1, 2008, pp. 87–105.
[12] A. Ultsch, H.P. Siemon, "Kohonen's self organizing feature maps for exploratory data analysis". Proceedings of the International Neural Network Conference, Dordrecht, Netherlands, 1990, pp. 305–308.
[13] X. Zhang, Y. Li, "Self-organizing map as a new method for clustering and data analysis". Proceedings of the International Joint Conference on Neural Networks, Nagoya, Japan, 1993, pp. 2448–2451.
[14] Z. Pawlak, "Rough sets". Internat. J. Computer Inf. Sci. vol. 11, 1982, pp. 341–356.
[15] S.C. Sharma and A. Werner, "Improved method of grouping provincewide permanent traffic counters". Transportation Research Record 815, Washington D.C., 1981, pp. 13–18.
[16] D.E. Goldberg, "Genetic Algorithms in Search, Optimization and Machine Learning". Addison-Wesley Publishing Co. Inc., 1989.
[17] R. Eberhart, P. Simpson, R. Dobbins, "Computational Intelligence PC Tools", Waite Group Press, 1996.
[18] UCI Machine Learning, www.ics.uci.edu/mlearn/MLRepository.html.
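
As an illustration of Section II, the sequential SOM training of Eqs. (1)–(3) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in the experiments: the function names `best_matching_unit` and `train_som`, the linear decay schedules for the learning rate and kernel width, the initial values `eps0` and `sigma0`, and the choice of taking the neighborhood N_c to be the whole grid (the Gaussian kernel makes distant updates negligible) are all illustrative choices that the paper does not specify.

```python
import math
import random


def best_matching_unit(weights, x):
    """Eq. (1): index of the neuron whose feature vector is closest to x."""
    return min(range(len(weights)), key=lambda i: math.dist(weights[i], x))


def train_som(data, rows, cols, epochs=100, eps0=0.5, sigma0=None, seed=0):
    """Sequential SOM on a rows x cols rectangular grid (illustrative sketch)."""
    rng = random.Random(seed)
    dim = len(data[0])
    if sigma0 is None:
        sigma0 = max(rows, cols) / 2.0  # assumed initial kernel width
    # Grid coordinates Pos_i and randomly initialised feature vectors w_i.
    pos = [(r, c) for r in range(rows) for c in range(cols)]
    weights = [[rng.random() for _ in range(dim)] for _ in pos]
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        # Present the samples in a fresh random order each epoch.
        for x in rng.sample(data, len(data)):
            frac = step / total_steps
            eps = eps0 * (1.0 - frac)             # learning rate, decreasing (assumed linear)
            sigma = sigma0 * (1.0 - frac) + 1e-3  # kernel width, decreasing (assumed linear)
            c = best_matching_unit(weights, x)    # Eq. (1)
            for i, w in enumerate(weights):
                d2 = (pos[i][0] - pos[c][0]) ** 2 + (pos[i][1] - pos[c][1]) ** 2
                h = math.exp(-d2 / (2.0 * sigma ** 2))  # Eq. (2)
                for k in range(dim):
                    w[k] += eps * h * (x[k] - w[k])     # Eq. (3)
            step += 1
    return weights, pos
```

Calling `train_som(data, 10, 10, epochs=100)` on data scaled to [0,1] mirrors the 10 × 10 map and 100 epochs reported in Section VI; each input can then be assigned to a neuron with `best_matching_unit`.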
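
The rough set error terms of Eq. (5), their weighted sum in Eq. (6) with the weights of Eq. (10), and the certainty measure of Eq. (9) can likewise be sketched in Python. The sketch rests on assumptions that go beyond the text: the function names `rough_errors` and `certainty` are hypothetical, clusters are given as Python sets of object identifiers, each unordered pair of objects is counted once, and, following the "case 1 is not applicable" wording of Section III, the second and third terms only use objects in the boundary (upper minus lower approximation).

```python
import math
from itertools import combinations


def rough_errors(points, lower, upper, weights=(3 / 6, 2 / 6, 1 / 6)):
    """Return (delta1, delta2, delta3, delta_total) for rough clusters.

    points: dict mapping object id -> coordinate tuple
    lower, upper: lists of sets of object ids, one pair per cluster
    weights: (w1, w2, w3); the default is w_i = (4 - i) / 6 as in Eq. (10)
    """
    d1 = d2 = d3 = 0.0
    for low, up in zip(lower, upper):
        boundary = up - low  # objects only in the upper approximation
        # Both objects in the same lower approximation.
        d1 += sum(math.dist(points[h], points[k])
                  for h, k in combinations(sorted(low), 2))
        # One object in the lower approximation, the other in the boundary.
        d2 += sum(math.dist(points[h], points[k])
                  for h in low for k in boundary)
        # Both objects in the boundary of the same cluster.
        d3 += sum(math.dist(points[h], points[k])
                  for h, k in combinations(sorted(boundary), 2))
    w1, w2, w3 = weights
    return d1, d2, d3, w1 * d1 + w2 * d2 + w3 * d3  # Eq. (6)


def certainty(lower, n_objects):
    """Eq. (9): fraction of objects that lie in some lower approximation."""
    return len(set().union(*lower)) / n_objects


# Toy example: object 4 lies between two clusters and is in both upper approximations.
points = {0: (0, 0), 1: (0, 1), 2: (4, 0), 3: (4, 1), 4: (2, 0)}
lower = [{0, 1}, {2, 3}]
upper = [{0, 1, 4}, {2, 3, 4}]
d1, d2, d3, total = rough_errors(points, lower, upper)
print(d1, d2, d3, total, certainty(lower, len(points)))
```

In the toy example four of the five objects lie in a lower approximation, so the certainty is 0.8, and the single shared object contributes only to the second error term.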