An Optimized Clustering Algorithm Using Genetic Algorithm and Rough set Theory based on Kohonen self organizing map

(IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 4, July 2010


1 Asgarali Bouyer, 2 Abdolreza Hatamlou, 3 Abdul Hanan Abdullah

1 Department of Computer Science, Islamic Azad University – Miyandoab Branch, Miyandoab, Iran
2 Universiti Kebangsaan Malaysia, Selangor, Malaysia
3 Department of Computer and Information Systems, Faculty of Computer Science and Information Systems, Universiti Teknologi Malaysia, 81310 Skudai, Johor Bahru, Malaysia

Abstract—The Kohonen self-organizing map (SOM) is an efficient tool in the exploratory phase of data mining and pattern recognition. The SOM is a popular tool that maps a high-dimensional space onto a small number of dimensions by placing similar elements close together, forming clusters. Recently, many researchers have found that capturing the uncertainty involved in cluster analysis does not require crisp boundaries in the clustering operations. In this paper, an optimized two-level clustering algorithm based on the SOM, employing rough set theory and a genetic algorithm, is proposed to overcome this uncertainty problem. The evaluation of the proposed algorithm on our gathered poultry disease data and on the Iris data shows that it is more accurate than crisp clustering methods and reduces the errors.

Index Terms—SOM, Clustering, Rough set theory, Genetic Algorithm.

I. INTRODUCTION

The self-organizing map (SOM) proposed by Kohonen [1] has been widely used in industrial applications such as pattern recognition, biological modeling, data compression, signal processing and data mining [2]-[5]. It is an unsupervised and nonparametric neural network approach. The success of the SOM algorithm lies in its simplicity, which makes it easy to understand, simulate and use in many applications. The basic SOM consists of neurons usually arranged in a two-dimensional structure with neighborhood relations among the neurons. After completion of training, each neuron is attached to a feature vector of the same dimension as the input space. By assigning each input vector to the neuron with the nearest feature vector, the SOM divides the input space into regions (clusters) with common nearest feature vectors. This process can be considered as performing vector quantization (VQ) [6]. Also, because of the neighborhood relations contributed by the interconnections among neurons, the SOM exhibits another important property: topology preservation.

Clustering algorithms attempt to organize unlabeled input vectors into clusters such that points within a cluster are more similar to each other than to vectors belonging to different clusters [7]. Clustering methods are of five types: hierarchical, partitioning, density-based, grid-based and model-based clustering [8]. Rough set theory employs upper and lower thresholds in the clustering process, which results in rough clusters. This technique can also be defined incrementally, i.e. the number of clusters is not predefined by the user.

Our goal is an optimized clustering algorithm for use in poultry disease prediction. The clustering will assist further analysis of the poultry symptom data by detecting outliers. Analyzing outliers can reveal surprising facts hidden inside the data, such as ambiguous patterns that are still assumed to belong to one of the predefined or undefined classes. Clustering is important in detecting outliers to avoid the high cost of misclassification. To cater for the complex nature of the data in our problem domain, clustering techniques based on machine learning approaches such as the self-organizing map (SOM), kernel machines and fuzzy methods, applied to poultry symptoms (based on observation data – body, feathers, skin, head, muscle, lung, heart, intestines, ovary, etc.), promise to be effective tools.

In this paper, a new two-level clustering algorithm is proposed. The idea is that the first level trains the data with the SOM neural network, and the clustering at the second level is a rough set based incremental clustering approach [9], which is applied to the output of the SOM and requires only a single scan of the neurons. The optimal number of clusters can be found by rough set theory, which groups the given neurons into a set of overlapping clusters (and thereby clusters the mapped data as well). Then the overlapped neurons are assigned to the true clusters they belong to by applying a


genetic algorithm. A genetic algorithm is adopted to minimize the uncertainty that comes from some clustering operations. In our previous work [3] a hybrid of the SOM and rough sets was applied to capture the ambiguity involved in the clusters, but the experimental results show that the algorithm proposed here (Genetic Rough SOM) outperforms the previous one. The next important process is to collect poultry data on the common and important diseases, which can affect the respiratory and non-respiratory systems of poultry. The first phase is to identify the format and values of the input parameters from the available information. The second phase is to investigate and develop data conversion and reduction algorithms for the input parameters.

This paper is organized as follows: in Section II the basics of the SOM algorithm are outlined. The basics of the rough set incremental clustering approach are described in Section III. In Section IV the essence of the genetic algorithm is described. The proposed algorithm is presented in Section V. Section VI is dedicated to experimental results and Section VII provides a brief conclusion and future work.

II. SELF ORGANIZING MAP

Competitive learning is an adaptive process in which the neurons in a neural network gradually become sensitive to different input categories, i.e. sets of samples in a specific domain of the input space. A division of neural nodes emerges in the network to represent different patterns of the inputs after training.

The division is enforced by competition among the neurons: when an input x arrives, the neuron that is best able to represent it wins the competition and is allowed to learn it even better. If there exists an ordering between the neurons, i.e. the neurons are located on a discrete lattice, the competitive learning algorithm can be generalized: not only the winning neuron but also its neighboring neurons on the lattice are allowed to learn. The overall effect is that the final map becomes an ordered map in the input space. This is the essence of the SOM algorithm. The SOM consists of m neurons located on a regular low-dimensional grid, usually one- or two-dimensional. The lattice of the grid is either hexagonal or rectangular.

The basic SOM algorithm is iterative. Each neuron i has a d-dimensional feature vector w_i = [w_i1, ..., w_id]. At each training step t, a sample data vector x(t) is randomly chosen from the training set. The distances between x(t) and all feature vectors are computed. The winning neuron, denoted by c, is the neuron with the feature vector closest to x(t):

    c = arg min_i ||x(t) - w_i||,   i ∈ {1, ..., m}        (1)

A set of neighboring nodes of the winning node is denoted as N_c. We define h_ic(t) as the neighborhood kernel function around the winning neuron c at time t. The neighborhood kernel function is a non-increasing function of time and of the distance of neuron i from the winning neuron c. The kernel can be taken as a Gaussian function:

    h_ic(t) = exp( -||Pos_i - Pos_c||^2 / (2 σ(t)^2) )        (2)

where Pos_i is the coordinate of neuron i on the output grid and σ(t) is the kernel width. The weight update rule in the sequential SOM algorithm can be written as:

    w_i(t+1) = w_i(t) + ε(t) h_ic(t) (x(t) - w_i(t))   for all i ∈ N_c,
    w_i(t+1) = w_i(t)                                   otherwise.        (3)

Both the learning rate ε(t) and the neighborhood width σ(t) decrease monotonically with time. During training, the SOM behaves like a flexible net that folds onto the cloud formed by the training data. Because of the neighborhood relations, neighboring neurons are pulled in the same direction, and thus the feature vectors of neighboring neurons resemble each other. There are many variants of the SOM [10, 11]. However, these variants are not considered in this paper because the proposed algorithm is based on the standard SOM, not on a new variant.

The 2D map can be easily visualized and thus gives useful information about the input data. The usual way to display the cluster structure of the data is to use a distance matrix, such as the U-matrix [12]. The U-matrix method displays the SOM grid according to the distances between neighboring neurons: clusters appear as regions of low inter-neuron distance, and borders as regions of high inter-neuron distance. Another method of visualizing cluster structure is to assign the input data to their nearest neurons. Some neurons then have no input data assigned to them, and these neurons can be used as the borders of clusters [13].

III. ROUGH SET INCREMENTAL CLUSTERING

This algorithm is a soft clustering method employing rough set theory [14]. It groups the given data set into a set of overlapping clusters. Each cluster C ⊆ U is represented by a lower approximation A(C) and an upper approximation Ā(C), where U is the set of all objects under exploration. The lower and upper approximations of the clusters C_i are required to follow some of the basic rough set properties, such as:

    (1) ∅ ⊆ A(C_i) ⊆ Ā(C_i) ⊆ U
    (2) A(C_i) ∩ A(C_j) = ∅, i ≠ j
    (3) A(C_i) ∩ Ā(C_j) = ∅, i ≠ j
    (4) If an object u_k ∈ U is not part of any lower approximation, then it must belong to two or more upper approximations.

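The lower/upper approximation bookkeeping described by properties (1)-(4) can be sketched in a few lines of Python. The data structures and helper names below are illustrative only, not from the paper:

```python
# Sketch of rough-cluster bookkeeping: each cluster keeps a lower
# approximation (objects that certainly belong) and an upper approximation
# (objects that possibly belong), checked against properties (1)-(4).

def check_rough_properties(universe, clusters):
    """clusters: list of (lower, upper) frozenset pairs over `universe`."""
    for lower, upper in clusters:
        # (1) lower ⊆ upper ⊆ U
        if not (lower <= upper <= universe):
            return False
    for i, (lo_i, up_i) in enumerate(clusters):
        for j, (lo_j, up_j) in enumerate(clusters):
            if i == j:
                continue
            # (2) lower approximations are pairwise disjoint
            if lo_i & lo_j:
                return False
            # (3) a lower approximation does not meet another upper one
            if lo_i & up_j:
                return False
    # (4) an object outside every lower approximation must lie in at
    # least two upper approximations
    all_lower = set().union(*(lo for lo, _ in clusters))
    for obj in universe - all_lower:
        if sum(obj in up for _, up in clusters) < 2:
            return False
    return True

U = {"u1", "u2", "u3"}
C1 = (frozenset({"u1"}), frozenset({"u1", "u3"}))  # u3 is ambiguous
C2 = (frozenset({"u2"}), frozenset({"u2", "u3"}))  # u3 shared with C1
valid = check_rough_properties(U, [C1, C2])        # True
```

Here the shared object u3 lies in no lower approximation but in two upper approximations, so the configuration satisfies all four properties.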

Note that properties (1)-(4) are not independent; however, enumerating them is helpful for understanding the basics of rough set theory.

The lower approximation A(C) contains all the patterns that definitely belong to the cluster C, while the upper approximation Ā(C) permits overlap. Since the upper approximation permits overlap, each set of data points shared by a group of clusters defines an indiscernible set. Thus, the ambiguity in assigning a pattern to a cluster is captured by the upper approximation. Employing rough set theory, the proposed clustering scheme generates soft clusters (clusters with permitted overlap in the upper approximation).

For a rough set clustering scheme and two given objects u_h, u_k ∈ U, there are three distinct possibilities:

    (case 1) Both u_h and u_k are in the same lower approximation A(C).
    (case 2) Object u_k is in the lower approximation A(C) and u_h is in the corresponding upper approximation Ā(C), and case 1 is not applicable.
    (case 3) Both u_h and u_k are in the same upper approximation Ā(C), and cases 1 and 2 are not applicable.

The quality of a conventional clustering scheme is determined using the within-group error [15] Δ, given by:

    Δ = Σ_i Σ_{u_h, u_k ∈ C_i} distance(u_h, u_k)        (4)

where u_h and u_k are objects in the same cluster C_i. For the above rough set possibilities, three variants of equation (4) can be defined:

    Δ1 = Σ_i Σ_{u_h, u_k ∈ A(X_i)} distance(u_h, u_k)
    Δ2 = Σ_i Σ_{u_h ∈ A(X_i), u_k ∈ Ā(X_i)} distance(u_h, u_k)        (5)
    Δ3 = Σ_i Σ_{u_h, u_k ∈ Ā(X_i)} distance(u_h, u_k)

The total error of rough set clustering is then a weighted sum of these errors:

    Δ_total = w1 × Δ1 + w2 × Δ2 + w3 × Δ3,  where w1 > w2 > w3.        (6)

Since Δ1 corresponds to situations where both objects definitely belong to the same cluster, the weight w1 should have the highest value.

IV. GENETIC ALGORITHM

The genetic algorithm (GA) was proposed by John Holland in the early 1970s. It applies natural evolution mechanisms such as crossover, mutation, and survival of the fittest to optimization and machine learning. The GA provides a very efficient search method working on a population, and has been applied to many problems of optimization and classification [16]-[17]. The general GA process is as follows:

    (1) Initialize the population of genes.
    (2) Calculate the fitness of each individual in the population.
    (3) Reproduce selected individuals to form a new population according to each individual's fitness.
    (4) Perform crossover and mutation on the population.
    (5) Repeat steps (2) through (4) until some condition is satisfied.

The crossover operation swaps parts of the genetic bit strings of two parents; just as with the crossover of genes in the real world, descendants inherit characteristics from both parents. The mutation operation inverts some bits of the whole bit string at a very low rate, mirroring the rare appearance of mutants in the real world. Fig. 1 shows how the crossover and mutation operations are applied in a genetic algorithm. Each individual in the population evolves to higher fitness generation by generation.

Figure 1. Crossover and Mutation.

V. GENETIC ROUGH SET CLUSTERING OF THE SELF ORGANIZING MAP

In this paper a rectangular grid is used for the SOM. Before the training process begins, the input data are normalized, which prevents one attribute from dominating the clustering criterion. The normalization of a new pattern X_i = {x_i1, ..., x_id} for i = 1, 2, ..., N is as follows:

    X_i = X_i / ||X_i||        (7)

Once the training phase of the SOM neural network is completed, the output grid of neurons, which is now stable under further network iterations, is clustered by applying the rough set algorithm described in the previous section. The similarity measure used for the rough set clustering of neurons is the Euclidean distance (the same measure used for training the SOM). In this proposed method


(see Fig. 2) some neurons, namely those onto which no data were ever mapped, are excluded from being processed by the rough set algorithm.

Figure 2. Clustering of the Self Organizing Map. The overlapped neurons are highlighted for two clusters.

From the rough set algorithm it can be observed that if two neurons are defined as indiscernible (i.e. neurons in the upper approximation of two or more clusters), they have a certain level of similarity with respect to the clusters they belong to, and that similarity relation has to be symmetric. Thus, the similarity measure must be symmetric.

According to the rough set clustering of the SOM, the overlapped neurons, and hence the overlapped data (the data in the upper approximations), are detected. In the experiments, to calculate the errors and the uncertainty, the previous equations are applied to the results of the SOM (clustered and overlapped data). Then for each overlapped neuron a gene is generated that represents the alternative distances from each cluster leader. Fig. 3 shows an example of the generated genes for m overlapped neurons on n existing cluster leaders.

    gene 1:  d1 d2 d3 d4 ... dn-1 dn
    gene 2:  d1 d2 d3 d4 ... dn-1 dn
    ...
    gene m:  d1 d2 d3 d4 ... dn-1 dn

Figure 3. Generated genes: m is the number of overlapped neurons and n is the number of existing clusters. The highlighted d_i is the optimized one that minimizes the fitness function.

After the genes have been generated, the genetic algorithm is employed to minimize the following fitness function, which represents the total sum of the distances d_j of the related genes:

    F = Σ_{i=1}^{m} Σ_{j=1}^{n} g_i(d_j)        (8)

The aim of the proposed approach is to make the genetic rough set clustering of the SOM as precise as possible. Therefore, a precision measure is needed for evaluating the quality of the proposed approach. A possible precision measure is the following equation [14]:

    certainty = (number of objects in lower approximations) / (total number of objects)        (9)

VI. EXPERIMENT RESULTS

To demonstrate the effectiveness of the proposed clustering algorithm GR-SOM (Genetic Rough set Incremental clustering of the SOM), two phases of experiments were carried out on the well-known Iris data set [18] and on our gathered data. The Iris data set, which has been widely used in pattern classification, consists of 150 data points of four dimensions; our collected data set has 48 data points. The Iris data are divided into three classes of 50 points each. The first class of Iris plant is linearly separable from the other two; the other two classes overlap to some extent.

The first phase of the experiments presents the uncertainty that comes from the data set, and in the second phase the errors are generated. The results of GR-SOM and RI-SOM [3] (Rough set Incremental SOM) are compared to I-SOM [4] (Incremental clustering of the SOM). The input data are normalized such that the value of each datum in each dimension lies in [0, 1].

For training, a 10 × 10 SOM with 100 epochs on the input data is used. The general parameters of the genetic algorithm are configured as in Table I. Fig. 4 shows the certainty generated from epoch 100 to 500 by (9) on the mentioned data sets. From the obtained certainty it is obvious that GR-SOM can efficiently detect the overlapped data mapped by overlapped neurons (Table II).

In the second phase, the same initialization of the SOM is used. The errors that come from the data sets, according to (5) and (6), are generated by our proposed algorithms (Table III). The weighted sum (6) is configured as in (10).

TABLE I. GENERAL PARAMETERS OF THE GENETIC ALGORITHM

    Population Size           50
    Number of Evaluations     10
    Crossover Rate            0.25
    Mutation Rate             0.001
    Number of Generations     100

TABLE II. THE CERTAINTY-LEVEL OF GR-SOM, RI-SOM AND I-SOM ON THE IRIS DATA SET FROM EPOCH 100 TO 500.

    Epoch      100      200      300      400      500
    RI-SOM     67.07    73.02    81.98    91.23    97.33
    GR-SOM     69.45    74.34    83.67    94.49    98.01
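The certainty values above are computed with equation (9). A minimal sketch of how such a measure might be computed from a rough clustering follows; the cluster representation and names are illustrative, not the authors' code:

```python
# Sketch of the certainty measure of equation (9): the fraction of objects
# that fall in some lower approximation, i.e. are assigned unambiguously.

def certainty(universe, clusters):
    """clusters: iterable of (lower, upper) sets covering `universe`."""
    in_lower = set()
    for lower, _upper in clusters:
        in_lower |= set(lower)
    return len(in_lower & universe) / len(universe)

U = set(range(10))                        # ten objects
c1 = ({0, 1, 2, 3}, {0, 1, 2, 3, 8, 9})   # objects 8, 9 are ambiguous
c2 = ({4, 5, 6, 7}, {4, 5, 6, 7, 8, 9})
print(certainty(U, [c1, c2]))             # 8 of 10 objects certain -> 0.8
```

A certainty of 1.0 would mean every object sits in exactly one lower approximation, i.e. the clustering is fully crisp.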


                                                                                            The artificial data set is a 569 30-dimensional data set
                                    I‐SOM    RI‐SOM    GR‐SOM
                                                                                         which is trained twice, once with I-SOM and once with
                   90                                                                    RI-SOM. The errors of generated results are calculated
                   80                                                                    from the difference between the results of equation (9)
                   70                                                                    and 1, see “Fig. 5”.
                                                                                            From the “Fig. 5” it could be observed that the

                   50                                                                    proposed RI-SOM algorithm generates less error in
                   40                                                                    cluster prediction compare to I-SOM.
Figure 4. Comparison of the certainty-level of GR-SOM, RI-SOM and I-SOM on the Iris data set.

For each weight wi we have:

    wi = (1/6) × (4 − i).                                        (10)

TABLE III.  COMPARATIVE GENERATED ERRORS OF GR-SOM AND I-SOM ON THE IRIS DATA SET

                  Method      Δ1       Δ2       Δ3       Δtotal
  Iris Data set   GR-SOM      1.05     0.85     0.04     1.4
                  I-SOM                                  2.8

         VII. CONCLUSION AND FUTURE WORK

   In this paper, a two-level clustering approach (GR-SOM) has been proposed to predict clusters of high-dimensional data and to detect the uncertainty that comes from overlapping data. The approach is based on rough set theory, which provides a soft clustering that can detect overlapped data in the data set and make the clustering as precise as possible; GA is then applied to find the true cluster for each overlapped datum. The results of both phases indicate that GR-SOM is more accurate and generates fewer errors compared to crisp clustering (I-SOM).

   The proposed algorithm detects accurate overlapping clusters in clustering operations. As future work, the overlapped data could also be assigned correctly to the true clusters they belong to by assigning a fuzzy membership value to the indiscernible set of data. A weight could also be assigned to each data dimension to improve the overall accuracy.
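The weighting scheme of equation (10) can be checked numerically. A minimal sketch, under the assumption (not stated explicitly in this excerpt) that the index i runs over 1, 2, 3, in which case the three weights are 1/2, 1/3 and 1/6 and sum exactly to one:

```python
# Weights from equation (10): wi = (1/6) * (4 - i).
# Assumption: i runs over 1, 2, 3, giving weights 1/2, 1/3, 1/6.
def weight(i: int) -> float:
    return (4 - i) / 6.0

weights = [weight(i) for i in (1, 2, 3)]
print(weights[0])                        # 0.5
print(abs(sum(weights) - 1.0) < 1e-12)   # True: the weights sum to 1
```

Decreasing weights of this form give the largest influence to the first component and still normalize to one, which keeps the resulting certainty measure in a comparable range.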

   Furthermore, to demonstrate the effectiveness of the proposed clustering algorithm (RI-SOM), two data sets, one artificial and one real-world, were used in our experiments. The results are compared to I-SOM (incremental clustering of SOM). The input data are normalized such that the value of each datum in each dimension lies in [0,1]. For training, a 10 × 10 SOM with 100 epochs on the input data is used.

   Figure 5. Comparison of the error between I-SOM and the proposed RI-SOM algorithm on the artificial data set.

                        REFERENCES

[1] T. Kohonen, "Self-organized formation of topologically correct feature maps", Biol. Cybern. 43, 1982, pp. 59–69.
[2] T. Kohonen, Self-Organizing Maps, Springer, Berlin, Germany, 1997.
[3] M.N.M. Sap and Ehsan Mohebi, "Hybrid Self Organizing Map for Overlapping Clusters". The Springer-Verlag Proceedings of the CCIS 2008, Hainan Island, China. Accepted.
[4] M.N.M. Sap and Ehsan Mohebi, "Rough set Based Clustering of the Self Organizing Map". The IEEE Computer Society Proceedings of the 1st Asian Conference on Intelligent Information and Database Systems 2009, Dong Hoi, Vietnam. Accepted.
[5] M.N.M. Sap and Ehsan Mohebi, "A Novel Clustering of the SOM using Rough set". The IEEE Proceedings of the 6th Student Conference on Research and Development 2008, Johor, Malaysia, 2008. Accepted.
[6] R.M. Gray, "Vector quantization". IEEE Acoust. Speech, Signal Process. Mag. 1 (2), 1984, pp. 4–29.
[7] N.R. Pal, J.C. Bezdek, and E.C.K. Tsao, "Generalized clustering networks and Kohonen's self-organizing scheme". IEEE Trans. Neural Networks (4), 1993, pp.
[8] J. Han, M. Kamber, "Data mining: concepts and techniques", Morgan-Kaufman, San Francisco, 2000.
[9] S. Asharaf, M. Narasimha Murty, and S.K. Shevade, "Rough set based incremental clustering of interval data", Pattern Recognition Letters, Vol. 27, 2006, pp. 515-519.

                                                                                                                       ISSN 1947-5500
[10] Yan and Yaoguang, "Research and application of SOM neural network which based on kernel function". Proceedings of ICNN&B'05, Vol. 1, 2005, pp. 509-511.
[11] M.N.M. Sap and Ehsan Mohebi, "Outlier Detection Methodologies: A Review". Journal of Information Technology, UTM, Vol. 20, Issue 1, 2008, pp. 87-105.
[12] A. Ultsch, H.P. Siemon, "Kohonen's self organizing feature maps for exploratory data analysis". Proceedings of the International Neural Network Conference, Dordrecht, Netherlands, 1990, pp. 305–308.
[13] X. Zhang, Y. Li, "Self-organizing map as a new method for clustering and data analysis". Proceedings of the International Joint Conference on Neural Networks, Nagoya, Japan, 1993, pp. 2448–2451.
[14] Z. Pawlak, "Rough sets". Internat. J. Computer Inf. Sci., Vol. 11, 1982, pp. 341–356.
[15] S.C. Sharma and A. Werner, "Improved method of grouping provincewide permanent traffic counters". Transportation Research Record 815, Washington D.C., 1981, pp. 13-18.
[16] D.E. Goldberg, "Genetic Algorithms in Search, Optimization and Machine Learning". Addison-Wesley Publishing, 1989.
[17] R. Eberhart, P. Simpson, R. Dobbins, "Computational Intelligence PC Tools", Waite Group Press, 1996.
[18] UCI Machine Learning, itory.html.
