An Optimized Clustering Algorithm Using Genetic Algorithm and Rough set Theory based on Kohonen self organizing map
(IJCSIS) International Journal of Computer Science and Information Security,
Vol. 8, No. 4, July 2010
1 Asgarali Bouyer, 2 Abdolreza Hatamlou, 3 Abdul Hanan Abdullah
1,2 Department of Computer Science
1 Islamic Azad University – Miyandoab Branch, Miyandoab, Iran
2 University Kebangsaan Malaysia, Selangor, Malaysia
3 Department of Computer and Information Systems, Faculty of Computer Science and Information Systems, Universiti Teknologi Malaysia, 81310 Skudai, Johor Bahru, Malaysia
1 basgarali2@live.utm.my, 2 hatamlou@iaukhoy.ac.ir, 3 hanan@utm.my

Abstract- The Kohonen self organizing map (SOM) is an efficient tool in the exploratory phase of data mining and pattern recognition. The SOM is a popular tool that maps a high dimensional space into a small number of dimensions by placing similar elements close together, forming clusters. Recently, many researchers have found that, to account for the uncertainty involved in cluster analysis, crisp boundaries are not necessary in some clustering operations. In this paper, an optimized two-level clustering algorithm based on the SOM, which employs rough set theory and a genetic algorithm, is proposed to overcome the uncertainty problem. Evaluation of the proposed algorithm on our gathered poultry disease data and on the Iris data shows that it is more accurate than crisp clustering methods and reduces the errors.

Index Terms- SOM, Clustering, Rough set theory, Genetic Algorithm.

I. INTRODUCTION

The self organizing map (SOM), proposed by Kohonen [1], has been widely used in industrial applications such as pattern recognition, biological modeling, data compression, signal processing and data mining [2]-[5]. It is an unsupervised and nonparametric neural network approach. The success of the SOM algorithm lies in its simplicity, which makes it easy to understand, simulate and use in many applications. The basic SOM consists of neurons usually arranged in a two-dimensional structure such that there are neighborhood relations among the neurons. After completion of training, each neuron is attached to a feature vector of the same dimension as the input space. By assigning each input vector to the neuron with the nearest feature vector, the SOM is able to divide the input space into regions (clusters) with common nearest feature vectors. This process can be considered as performing vector quantization (VQ) [6]. Also, because of the neighborhood relation contributed by the inter-connections among neurons, the SOM exhibits another important property: topology preservation.

Clustering algorithms attempt to organize unlabeled input vectors into clusters such that points within a cluster are more similar to each other than to vectors belonging to different clusters [7]. Clustering methods are of five types: hierarchical clustering, partitioning clustering, density-based clustering, grid-based clustering and model-based clustering [8]. Rough set theory employs upper and lower thresholds in the clustering process, which results in clusters with a rough appearance. This technique can also be defined incrementally, i.e. the number of clusters is not predefined by the user.

Our goal is an optimized clustering algorithm to be used in poultry disease prediction. The clustering will assist further analysis of the poultry symptoms data in detecting outliers. Analyzing outliers can reveal surprising facts hidden inside the data, such as ambiguous patterns that are still assumed to belong to one of the predefined or undefined classes. Clustering is important in detecting outliers to avoid the high cost of misclassification. To cater for the complex nature of the data in our problem domain, clustering techniques based on machine learning approaches such as the self organizing map (SOM), kernel machines and fuzzy methods will prove to be a promising tool for clustering poultry symptoms (based on observation data: body, feathers, skin, head, muscle, lung, heart, intestines, ovary, etc.).

In this paper, a new two-level clustering algorithm is proposed. The first level trains the data with the SOM neural network, and the second level is a rough set based incremental clustering approach [9], which is applied to the output of the SOM and requires only a single scan of the neurons. The optimal number of clusters can be found by rough set theory, which groups the given neurons into a set of overlapping clusters (and clusters the mapped data accordingly). Then the overlapped neurons will be assigned to the true clusters they belong to by applying a
39 http://sites.google.com/site/ijcsis/
ISSN 1947-5500
genetic algorithm. A genetic algorithm has been adopted to minimize the uncertainty that comes from some clustering operations. In our previous work [3], a hybrid of SOM and rough set was applied to capture the ambiguity involved in clusters, but the experimental results show that the proposed algorithm (Genetic Rough SOM) outperforms the previous one. The next important process is to collect poultry data on the common and important diseases that can affect the respiratory and non-respiratory systems of poultry. The first phase is to identify the format and values of the input parameters from the available information. The second phase is to investigate and develop data conversion and reduction algorithms for the input parameters.

This paper is organized as follows. Section II outlines the basics of the SOM algorithm. The basics of the rough set incremental clustering approach are described in section III. Section IV describes the essence of the genetic algorithm. The proposed algorithm is presented in section V. Section VI is dedicated to experiment results, and section VII provides a brief conclusion and future work.

II. SELF ORGANIZING MAP

Competitive learning is an adaptive process in which the neurons in a neural network gradually become sensitive to different input categories, i.e. sets of samples in a specific domain of the input space. A division of neural nodes emerges in the network to represent different patterns of the inputs after training.

The division is enforced by competition among the neurons: when an input $x$ arrives, the neuron that is best able to represent it wins the competition and is allowed to learn it even better. If there exists an ordering between the neurons, i.e. the neurons are located on a discrete lattice, the competitive learning algorithm can be generalized. Not only the winning neuron but also its neighboring neurons on the lattice are allowed to learn; the overall effect is that the final map becomes an ordered map in the input space. This is the essence of the SOM algorithm. The SOM consists of $m$ neurons located on a regular low-dimensional grid, usually one or two dimensional. The lattice of the grid is either hexagonal or rectangular.

The basic SOM algorithm is iterative. Each neuron $i$ has a $d$-dimensional feature vector $w_i = [w_{i1}, \dots, w_{id}]$. At each training step $t$, a sample data vector $x(t)$ is randomly chosen from the training set. The distances between $x(t)$ and all feature vectors are computed. The winning neuron, denoted by $c$, is the neuron with the feature vector closest to $x(t)$:

$$c = \arg\min_{i} \| x(t) - w_i \|, \quad i \in \{1, \dots, m\} \tag{1}$$

A set of neighboring nodes of the winning node is denoted as $N_c$. We define $h_{ic}(t)$ as the neighborhood kernel function around the winning neuron $c$ at time $t$. The neighborhood kernel function is a non-increasing function of time and of the distance of neuron $i$ from the winning neuron $c$. The kernel can be taken as a Gaussian function:

$$h_{ic}(t) = \exp\left( -\frac{\| Pos_i - Pos_c \|^2}{2\sigma(t)^2} \right) \tag{2}$$

where $Pos_i$ is the coordinates of neuron $i$ on the output grid and $\sigma(t)$ is the kernel width. The weight update rule in the sequential SOM algorithm can be written as:

$$w_i(t+1) = \begin{cases} w_i(t) + \varepsilon(t)\, h_{ic}(t)\, (x(t) - w_i(t)) & \forall i \in N_c \\ w_i(t) & \text{otherwise} \end{cases} \tag{3}$$

Both the learning rate $\varepsilon(t)$ and the neighborhood width $\sigma(t)$ decrease monotonically with time. During training, the SOM behaves like a flexible net that folds onto the cloud formed by the training data. Because of the neighborhood relations, neighboring neurons are pulled in the same direction, and thus the feature vectors of neighboring neurons resemble each other. There are many variants of the SOM [10, 11]. However, these variants are not considered in this paper because the proposed algorithm is based on the SOM and is not a new variant of it.

The 2D map can be easily visualized and thus gives useful information about the input data. The usual way to display the cluster structure of the data is to use a distance matrix, such as the U-matrix [12]. The U-matrix method displays the SOM grid according to the distances between neighboring neurons. Clusters can be identified by low inter-neuron distances and borders by high inter-neuron distances. Another method of visualizing the cluster structure is to assign the input data to their nearest neurons. Some neurons then have no input data assigned to them. These neurons can be used as the borders of clusters [13].

III. ROUGH SET INCREMENTAL CLUSTERING

This algorithm is a soft clustering method employing rough set theory [14]. It groups the given data set into a set of overlapping clusters. Each cluster $C \subseteq U$ is represented by a lower approximation and an upper approximation $(\underline{A}(C), \overline{A}(C))$. Here $U$ is the set of all objects under exploration. The lower and upper approximations of $C_i \subseteq U$ are required to follow some basic rough set properties:

(1) $\emptyset \subseteq \underline{A}(C_i) \subseteq \overline{A}(C_i) \subseteq U$
(2) $\underline{A}(C_i) \cap \underline{A}(C_j) = \emptyset, \; i \neq j$
(3) $\underline{A}(C_i) \cap \overline{A}(C_j) = \emptyset, \; i \neq j$
(4) If an object $u_k \in U$ is not part of any lower approximation, then it must belong to two or more upper approximations.
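
Properties (1)-(4) of the rough set approximations can be checked mechanically. The following Python sketch is only an illustration and not part of the original algorithm: the function name `satisfies_rough_properties`, the dictionary representation of the approximations and the toy data are all assumptions made here.

```python
from itertools import combinations


def satisfies_rough_properties(universe, lower, upper):
    """Check properties (1)-(4) for hypothetical cluster approximations.

    universe: set of all objects U.
    lower, upper: dicts mapping a cluster name to a set of objects
    (the lower and upper approximation of that cluster).
    """
    names = list(lower)
    # (1) lower approximation within upper approximation within U
    for c in names:
        if not (lower[c] <= upper[c] <= universe):
            return False
    # (2) and (3): a lower approximation shares no object with the
    # lower or upper approximation of any other cluster
    for a, b in combinations(names, 2):
        if lower[a] & lower[b] or lower[a] & upper[b] or lower[b] & upper[a]:
            return False
    # (4) an object outside every lower approximation must appear in
    # at least two upper approximations
    in_some_lower = set().union(*lower.values())
    for u in universe - in_some_lower:
        if sum(u in upper[c] for c in names) < 2:
            return False
    return True


# Toy example (not from the paper): object 3 is shared by both clusters.
universe = {1, 2, 3, 4, 5}
lower = {"C1": {1, 2}, "C2": {4, 5}}
upper = {"C1": {1, 2, 3}, "C2": {3, 4, 5}}
print(satisfies_rough_properties(universe, lower, upper))  # prints True
```

In this toy example object 3 belongs to no lower approximation, so property (4) holds only because it appears in both upper approximations.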
Note that (1)-(4) are not independent. However, enumerating them is helpful in understanding the basics of rough set theory.

The lower approximation $\underline{A}(C)$ contains all the patterns that definitely belong to the cluster $C$, and the upper approximation $\overline{A}(C)$ permits overlap. Since the upper approximation permits overlaps, each set of data points shared by a group of clusters defines an indiscernible set. Thus, the ambiguity in assigning a pattern to a cluster is captured using the upper approximation. Employing rough set theory, the proposed clustering scheme generates soft clusters (clusters with permitted overlap in the upper approximation).

For a rough set clustering scheme and two given objects $u_h, u_k \in U$, we have three distinct possibilities:
• Both $u_k$ and $u_h$ are in the same lower approximation $\underline{A}(C)$.
• Object $u_k$ is in the lower approximation $\underline{A}(C)$ and $u_h$ is in the corresponding upper approximation $\overline{A}(C)$, and case 1 is not applicable.
• Both $u_k$ and $u_h$ are in the same upper approximation $\overline{A}(C)$, and cases 1 and 2 are not applicable.

The quality of a conventional clustering scheme is determined using the within-group error [15] $\Delta$, given by:

$$\Delta = \sum_{i=1}^{m} \sum_{u_h, u_k \in C_i} \mathrm{distance}(u_h, u_k) \tag{4}$$

where $u_h, u_k$ are objects in the same cluster $C_i$.

For the above rough set possibilities, three types of equation (4) can be defined as follows:

$$\begin{aligned}
\Delta_1 &= \sum_{i=1}^{m} \sum_{u_h, u_k \in \underline{A}(X_i)} \mathrm{distance}(u_h, u_k) \\
\Delta_2 &= \sum_{i=1}^{m} \sum_{u_h \in \underline{A}(X_i) \text{ and } u_k \in \overline{A}(X_i)} \mathrm{distance}(u_h, u_k) \\
\Delta_3 &= \sum_{i=1}^{m} \sum_{u_h, u_k \in \overline{A}(X_i)} \mathrm{distance}(u_h, u_k)
\end{aligned} \tag{5}$$

The total error of rough set clustering will then be a weighted sum of these errors:

$$\Delta_{total} = w_1 \times \Delta_1 + w_2 \times \Delta_2 + w_3 \times \Delta_3, \quad \text{where } w_1 > w_2 > w_3. \tag{6}$$

Since $\Delta_1$ corresponds to situations where both objects definitely belong to the same cluster, the weight $w_1$ should have the highest value.

IV. GENETIC ALGORITHM

The genetic algorithm (GA) was proposed by John Holland in the early 1970s. It applies mechanisms of natural evolution such as crossover, mutation and survival of the fittest to optimization and machine learning. The GA provides a very efficient search method working on a population, and has been applied to many optimization and classification problems [16]-[17]. The general GA process is as follows:
(1) Initialize the population of genes.
(2) Calculate the fitness of each individual in the population.
(3) Reproduce the selected individuals to form a new population according to each individual's fitness.
(4) Perform crossover and mutation on the population.
(5) Repeat steps (2) through (4) until some condition is satisfied.

The crossover operation swaps some part of the genetic bit string between parents. It emulates the crossover of genes in the real world, in which descendants inherit characteristics from both parents. The mutation operation inverts some bits of the whole bit string at a very low rate, just as mutants appear rarely in the real world. Fig. 1 shows how the crossover and mutation operations are applied in the genetic algorithm. Each individual in the population evolves toward higher fitness generation by generation.

Crossover: parents 010000001001 × 111001011101, offspring 010000011101 and 111001001001
Mutation: 011100011001 becomes 011101011001
Figure 1. Crossover and Mutation

V. GENETIC ROUGH SET CLUSTERING OF THE SELF ORGANIZING MAP

In this paper a rectangular grid is used for the SOM. Before the training process begins, the input data are normalized. This prevents one attribute from dominating the clustering criterion. The normalization of a new pattern $X_i = \{x_{i1}, \dots, x_{id}\}$ for $i = 1, 2, \dots, N$ is as follows:

$$X_i = \frac{X_i}{\| X_i \|} \tag{7}$$

Once the training phase of the SOM neural network is completed, the output grid of neurons, which is now stable under further network iterations, is clustered by applying the rough set algorithm described in section III. The similarity measure used for rough set clustering of neurons is Euclidean distance (the same as used for training the SOM). In this proposed method
(see Fig. 2), neurons that never mapped any data are excluded from processing by the rough set algorithm.

[Figure 2 image; labels: Lower approx, Upper approx]
Figure 2. Clustering of the Self Organizing Map. The overlapped neurons are highlighted for two clusters.

From the rough set algorithm it can be observed that if two neurons are defined as indiscernible (neurons in the upper approximation of two or more clusters), they have a certain level of similarity with respect to the clusters they belong to, and that similarity relation has to be symmetric. Thus, the similarity measure must be symmetric.

According to the rough set clustering of the SOM, overlapped neurons and, correspondingly, overlapped data (data in the upper approximation) are detected. In the experiments, to calculate errors and uncertainty, the previous equations are applied to the results of the SOM (clustered and overlapped data). Then, for each overlapped neuron, a gene is generated that represents the alternative distances from each cluster leader. Fig. 3 shows an example of the generated genes for m overlapped neurons and n existing cluster leaders.

gene 1    d1   d2   d3   d4   ...   dn-1   dn
gene 2    d1   d2   d3   d4   ...   dn-1   dn
gene 3    d1   d2   d3   d4   ...   dn-1   dn
...
gene m    d1   d2   d3   d4   ...   dn-1   dn
Figure 3. Generated genes. m is the number of overlapped neurons and n is the number of existing clusters. The highlighted di is the optimal one that minimizes the fitness function.

After the genes have been generated, the genetic algorithm is employed to minimize the following fitness function, which represents the total sum of each $d_j$ of the related gene:

$$F = \sum_{i=1}^{m} \sum_{j=1}^{n} g_i(d_j) \tag{8}$$

The aim of the proposed approach is to make the genetic rough set clustering of the SOM as precise as possible. Therefore, a precision measure is needed for evaluating the quality of the proposed approach. A possible precision measure can be defined by the following equation [14]:

$$\text{certainty} = \frac{\text{Number of objects in lower approx}}{\text{Total number of objects}} \tag{9}$$

VI. EXPERIMENT RESULTS

To demonstrate the effectiveness of the proposed clustering algorithm GR-SOM (Genetic Rough set Incremental clustering of the SOM), two phases of experiments have been carried out on the well known Iris data set [18] and on our gathered data. The Iris data set, which has been widely used in pattern classification, consists of 150 data points of four dimensions, and our collected data has 48 data points. The Iris data are divided into three classes with 50 points each. The first class of Iris plant is linearly separable from the other two. The other two classes overlap to some extent.

The first phase of experiments presents the uncertainty that comes from the data set, and in the second phase the errors are generated. The results of GR-SOM and RI-SOM [3] (Rough set Incremental SOM) are compared to I-SOM [4] (Incremental clustering of SOM). The input data are normalized such that the value of each datum in each dimension lies in [0,1].

For training, a 10 × 10 SOM with 100 epochs on the input data is used. The general parameters of the genetic algorithm are configured as in Table I. Fig. 4 shows the certainty generated from epoch 100 to 500 by (9) on the mentioned data set. The certainty values show that GR-SOM efficiently detects the overlapped data that have been mapped by overlapped neurons (Table II).

In the second phase, the same initialization of the SOM has been used. The errors that come from the data sets, according to (5) and (6), have been generated by our proposed algorithms (Table III). The weighted sum (6) has been configured as in (10).

TABLE I. GENERAL PARAMETERS OF THE GENETIC ALGORITHM

Population Size         50
Number of Evaluation    10
Crossover Rate          0.25
Mutation Rate           0.001
Number of Generation    100

TABLE II. THE CERTAINTY-LEVEL OF GR-SOM, RI-SOM AND I-SOM ON THE IRIS DATA SET FROM EPOCH 100 TO 500.

Epoch     100     200     300     400     500
I-SOM     33.33   65.23   76.01   89.47   92.01
RI-SOM    67.07   73.02   81.98   91.23   97.33
GR-SOM    69.45   74.34   83.67   94.49   98.01
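
The certainty measure of equation (9) is a simple ratio, and a small worked example may make it concrete. The Python sketch below is an illustration under stated assumptions: the function name `certainty`, the dictionary of lower approximations and the toy data are hypothetical and are not taken from the paper's experiments.

```python
def certainty(lower, total_objects):
    """Equation (9): objects in a lower approximation / all objects.

    lower: dict mapping a cluster name to the set of objects in its
    lower approximation (assumed representation).
    total_objects: total number of objects in the data set.
    """
    if total_objects <= 0:
        raise ValueError("total_objects must be positive")
    # Lower approximations are disjoint by property (2), so the union
    # counts each definitely-assigned object exactly once.
    in_lower = set().union(*lower.values())
    return len(in_lower) / total_objects


# Toy example: 4 of 5 objects are definitely assigned, 1 is overlapped.
lower = {"C1": {1, 2}, "C2": {4, 5}}
print(certainty(lower, 5))  # prints 0.8
```

The values in Table II appear to be this ratio expressed as a percentage, so the toy result above would correspond to 80.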
[Figure 4 plot: certainty (0 to 100) versus epoch (100 to 500) for I-SOM, RI-SOM and GR-SOM]
Figure 4. Comparison of the certainty-level of GR-SOM, RI-SOM and I-SOM on the Iris data set.

$$\sum_{i=1}^{3} w_i = 1$$

and for each $w_i$ we have:

$$w_i = \frac{1}{6} \times (4 - i). \tag{10}$$

TABLE III. COMPARATIVE GENERATED ERRORS OF GR-SOM AND I-SOM ON THE IRIS DATA SET ACCORDING TO EQUATIONS (5) AND (6).

Data set         Method    Δ1     Δ2     Δ3     Δ total
Iris Data set    GR-SOM    1.05   0.85   0.04   1.4
                 I-SOM                          2.8

Furthermore, to demonstrate the effectiveness of the proposed clustering algorithm (RI-SOM), two data sets, one artificial and one real world data set, were used in our experiments. The results are compared to I-SOM (Incremental clustering of SOM). The input data are normalized such that the value of each datum in each dimension lies in [0,1]. For training, a 10 × 10 SOM with 100 epochs on the input data is used.

The artificial data set consists of 569 30-dimensional data points and is trained twice, once with I-SOM and once with RI-SOM. The errors of the generated results are calculated from the difference between the result of equation (9) and 1; see Fig. 5.

Fig. 5 shows that the proposed RI-SOM algorithm generates less error in cluster prediction than I-SOM.

Figure 5. Comparison of the error between the I-SOM and the proposed RI-SOM algorithms on the artificial data set.

VII. CONCLUSION AND FUTURE WORK

In this paper a two-level clustering approach (GR-SOM) has been proposed to predict clusters of high dimensional data and to detect the uncertainty that comes from overlapping data. The approach is based on rough set theory, which employs a soft clustering that can detect overlapped data in the data set and makes clustering as precise as possible; a GA is then applied to find the true cluster for each overlapped datum. The results of both phases indicate that GR-SOM is more accurate and generates fewer errors than crisp clustering (I-SOM).

The proposed algorithm accurately detects overlapping clusters in clustering operations. As future work, the overlapped data could also be assigned to the true clusters they belong to by assigning a fuzzy membership value to the indiscernible set of data. A weight can also be assigned to each data dimension to improve the overall accuracy.

REFERENCES

[1] T. Kohonen, "Self-organized formation of topologically correct feature maps", Biol. Cybern. 43, 1982, pp. 59-69.
[2] T. Kohonen, Self-Organizing Maps, Springer, Berlin, Germany, 1997.
[3] M.N.M. Sap and Ehsan Mohebi, "Hybrid Self Organizing Map for Overlapping Clusters", The Springer-Verlag Proceedings of the CCIS 2008, Hainan Island, China. Accepted.
[4] M.N.M. Sap and Ehsan Mohebi, "Rough set Based Clustering of the Self Organizing Map", The IEEE Computer Society Proceeding of the 1st Asian Conference on Intelligent Information and Database Systems 2009, Dong Hoi, Vietnam. Accepted.
[5] M.N.M. Sap and Ehsan Mohebi, "A Novel Clustering of the SOM using Rough set", The IEEE Proceeding of the 6th Student Conference on Research and Development 2008, Johor, Malaysia, 2008. Accepted.
[6] R.M. Gray, "Vector quantization", IEEE Acoust. Speech, Signal Process. Mag. 1 (2), 1984, pp. 4-29.
[7] N.R. Pal, J.C. Bezdek, and E.C.K. Tsao, "Generalized clustering networks and Kohonen's self-organizing scheme", IEEE Trans. Neural Networks (4), 1993, pp. 549-557.
[8] J. Han, M. Kamber, "Data mining: concepts and techniques", Morgan-Kaufman, San Francisco, 2000.
[9] S. Asharaf, M. Narasimha Murty, and S.K. Shevade, "Rough set based incremental clustering of interval data", Pattern Recognition Letters, Vol. 27, 2006, pp. 515-519.
[10] Yan and Yaoguang, "Research and application of SOM neural network which based on kernel function", Proceeding of ICNN&B'05, Vol. 1, 2005, pp. 509-511.
[11] M.N.M. Sap and Ehsan Mohebi, "Outlier Detection Methodologies: A Review", Journal of Information Technology, UTM, Vol. 20, Issue 1, 2008, pp. 87-105.
[12] A. Ultsch, H.P. Siemon, "Kohonen's self organizing feature maps for exploratory data analysis", Proceedings of the International Neural Network Conference, Dordrecht, Netherlands, 1990, pp. 305-308.
[13] X. Zhang, Y. Li, "Self-organizing map as a new method for clustering and data analysis", Proceedings of the International Joint Conference on Neural Networks, Nagoya, Japan, 1993, pp. 2448-2451.
[14] Z. Pawlak, "Rough sets", Internat. J. Computer Inf. Sci., vol. 11, 1982, pp. 341-356.
[15] S.C. Sharma and A. Werner, "Improved method of grouping provincewide permanent traffic counters", Transaction Research Report 815, Washington D.C., 1981, pp. 13-18.
[16] D.E. Goldberg, "Genetic Algorithm in Search Optimization and Machine Learning", Addison-Wesley Publishing Co. Inc., 1989.
[17] R. Eberhart, P. Simpson, R. Dobbins, "Computational Intelligence PC Tools", Waite Group Press, 1996.
[18] UCI Machine Learning, www.ics.uci.edu/mlearn/MLRepository.html.