Using Intelligent Techniques for Breast Cancer Classification

					    International Journal of Emerging Trends & Technology in Computer Science(IJETTCS)
       Web Site: Email:,
Volume I, Issue 3, September-October 2012                                      ISSN 2278-6856

       Using Intelligent Techniques for Breast Cancer Classification
                              Hesham Arafat 1, Sherif Barakat 2 and Amal F. Goweda3
                      Mansoura University, Faculty of Engineering, Department of Computer Engineering and Systems,
                      Mansoura University, Faculty of Computers and Information, Department of Information Systems,
                      Mansoura University, Faculty of Computers and Information, Department of Information Systems,
Abstract: Attribute reduction is an important issue in rough set theory. It is necessary to investigate fast and effective approximate algorithms to generate a set of discriminatory features. The main objective of this paper is to investigate a strategy that combines Rough Set Theory (RST) with Particle Swarm Optimization (PSO). Rough Set Theory has been recognized as one of the powerful tools for medical feature selection. The complementary part is Particle Swarm Optimization (PSO), a subfield of swarm intelligence that studies the emergent collective intelligence of groups of simple agents. It is based on social behavior that can be observed in nature, such as flocks of birds and schools of fish, where a number of individuals with limited capabilities are able to achieve intelligent solutions for complex problems. Particle Swarm Optimization is widely used and has developed rapidly because it is easy to implement and has few parameters to tune. This hybrid approach embodies an adaptive feature selection procedure which dynamically accounts for the relevance and dependence of the features. The selected relevant feature subsets are used to generate decision rules for the breast cancer classification task, which differentiates the benign cases from the malignant cases by assigning classes to objects. The proposed hybrid approach can help in improving classification accuracy and also in finding more robust features to improve the classifier.

Keywords: Breast cancer, Rough Sets, Feature Selection, Particle Swarm Optimization (PSO), Classification.

1. INTRODUCTION
Breast cancer occurs when cells become abnormal and divide without control or order, producing a cancerous growth that begins in the tissues of the breast. Breast cancer has become the most common cancer among women [1]. The most effective way to reduce breast cancer deaths is to detect it early. Early detection is the best form of cure, and accurate diagnosis of the tumor is extremely vital. Early detection allows physicians to differentiate benign breast tumors from malignant ones without resorting to surgical biopsy. It also offers accurate, timely analysis of the patient's particular type of cancer and the available treatment options. Extensive research has been carried out on automating the critical diagnosis procedure, and various machine learning algorithms have been developed to aid physicians in optimizing the decision task effectively.
Rough set theory offers a novel approach to managing uncertainty that has been used for the discovery of data dependencies, importance of features, patterns in sample data, feature space dimensionality reduction and the classification of objects. While rough sets on their own provide a powerful technique, they are often combined with other computational intelligence techniques such as neural networks, fuzzy sets, genetic algorithms, Bayesian approaches, swarm optimization and support vector machines. Particle Swarm Optimization is a new evolutionary computation technique in which each potential solution is seen as a particle with a certain velocity flying through the problem space. Particle swarms find optimal regions of the complex search space through the interaction of individuals in the population. PSO is attractive for feature selection in that particle swarms will discover the best feature combinations as they fly within the subset space. Compared with other evolutionary techniques, PSO requires only primitive and simple mathematical operators. In this research, rough set theory is applied to improve feature selection and data reduction. Particle Swarm Optimization (PSO) is used to optimize the rough set feature reduction in order to effectively classify breast cancer tumors as either malignant or benign.
This paper is organized as follows: Section 2 briefly reviews some of the recent related work published in the area of cancer classification using intelligent techniques; Section 3 gives the vision of the proposed hybrid technique; Section 4 discusses the basic concepts of rough set theory and the basic idea of Particle Swarm Optimization, and presents experimental results; Section 5 concludes the research.

2. RELATED WORK
Besides the introduction given above, a literature review is presented on Particle Swarm Optimization and rough sets, following their journey from inception to their application in various problems.
The use of machine learning and data mining techniques [2] has revolutionized the whole process of breast cancer diagnosis and prognosis. An overview of data mining methods including Decision Trees, Support Vector Machine (SVM), Genetic Algorithms (GAs) / Evolutionary Programming (EP), Fuzzy Sets, Neural Networks and


Rough Sets has been carried out to enhance breast cancer diagnosis and prognosis.
Data mining classification techniques for breast cancer diagnosis are discussed in [3]. A. Soltani Sarvestani, A. A. Safavi, N. M. Parandeh and M. Salehi provided a comparison among the capabilities of various neural networks such as Multilayer Perceptron (MLP), Self-Organizing Map (SOM), Radial Basis Function (RBF) and Probabilistic Neural Network (PNN), which are used to classify WBC and NHBCD data. The performance of these neural network structures was investigated for the breast cancer diagnosis problem. RBF and PNN proved to be the best classifiers on the training set, but PNN gave the best classification accuracy when the test set is considered. This work showed that statistical neural networks can be used effectively for breast cancer diagnosis: by applying several neural network structures, a diagnostic system was constructed that performed quite well.
In [4] Wei-Pin Chang and Der-Ming Liou found that the genetic algorithm model yielded better results than other data mining models for the analysis of breast cancer patient data, in terms of the overall accuracy of the patient classification and the expression and complexity of the classification rule. The artificial neural network, decision tree, logistic regression and genetic algorithm were used for the comparative studies, and the accuracy and positive predictive value of each algorithm were used as the evaluation indicators. The WBC database was used for the data analysis, followed by 10-fold cross-validation. The results showed that the genetic algorithm described in the study was able to produce accurate results in the classification of breast cancer data, and the classification rule identified was more acceptable and comprehensible.
In [5] K. Rajiv Gandhi, Marcus Karnan and S. Kannan constructed classification rules using the Particle Swarm Optimization algorithm for breast cancer datasets. In this study, to cope with the heavy computational effort, feature subset selection was used as a pre-processing step, which learns fuzzy rule bases using a GA implementing the Pittsburgh approach. It was used to produce a smaller fuzzy rule base system with higher accuracy. The datasets resulting from feature selection were then used for classification with the particle swarm optimization algorithm. The rules developed were with rate of accuracy defining the underlying attributes.
Das et al. in [6] hybridized rough set theory with the Particle Swarm Optimization (PSO) algorithm. The hybrid rough-PSO technique has been used for grouping the pixels of an image in its intensity space. The authors treated image segmentation as a clustering problem. Each cluster is modeled with a rough set. PSO is employed to tune the threshold and relative importance of upper and lower approximations of the rough sets.
Another approach that uses rough sets with PSO has been proposed by Wang et al. in [7]. The authors applied rough sets to predict the degree of malignancy in brain glioma. The feature subsets selected by rough sets with PSO are used to generate decision rules for the classification task. A rough set attribute reduction algorithm that employs a search method based on PSO is proposed and compared with other rough set reduction algorithms. Experimental results show that reducts found by the proposed algorithm are more efficient and can generate decision rules with better classification performance. Moreover, the decision rules induced by the rough set rule induction algorithm can reveal regular and interpretable patterns of the relations between glioma MRI features and the degree of malignancy, which are helpful for medical experts.
In this section, different data mining techniques, including rough sets and evolutionary algorithms, used for the feature selection problem, classification, data analysis and clustering have been presented. No attempt has been made to cover all approaches currently existing in the literature; our intention has been to highlight the role played by data mining techniques, and especially by rough set theory together with evolutionary algorithms.
Table 1 depicts a fair comparison between supervised learning approaches used for the classification task, which concentrates on predicting the value of the decision class for an object among a predefined set of class values, given the values of some attributes of the object. A comparison between learning algorithms and more details can be found in [8].

 Table 1: A fair comparison between supervised learning approaches

 Classification Technique   Strength Points                           Weakness Points
 Decision Tree Induction    • Decision trees generate rules which     • Decision trees are less appropriate
                              are effective and simple.                 to the continuous attribute values.
                            • Decision trees provide fields which     • It can be [...]e to [...]
                              are the important ones.


 Table 1 (continued)

 Classification Technique   Strength Points                           Weakness Points
 Bayesian Classification    • Easy to implement.                      • Loss of accuracy [...]g to [...]
                            • Good results obtained in most of          Dependen[...].
                              the cases.
 Neural Networks            • High tolerance to noisy data.           • Long training time.
                            • Well suited for continuous valued       • Require a number of [...]ers.
                              inputs and outputs.                     • Poor interpretability.
                            • Algorithms are parallel.
 Support Vector Machine     • Training is relatively easy.            • The need for a good kernel function.
                            • Non-traditional data can be used as
                              input to SVM instead of feature
                              vectors.
 Genetic Algorithms         • The parallelism that allows [...]       • The language used to specify candidate
                              many schema at once.                      solutions must be robust.
                            • Genetic algorithms perform well in      • The [...] of how to write [...] must be
                              problems for which the fitness            considered so higher fitness is
                              landscape is complex.                     attainable.

To evaluate which model is the best for the classification task, some dimensions for comparison are taken into account, which are as follows:
  • Error in numeric predictions
  • Cross Validation
  • Speed of model application
  • Classification Accuracy
  • Total cost/benefit
  • Noise Tolerance

3. ROUGH SET AND PSO BASED FEATURE SELECTION
  3.1 Problem Definition
Feature selection aims to determine a minimal feature subset from a problem domain while retaining a suitably high accuracy in representing the original features. The significance of feature selection can be viewed in two facets. The first facet is to filter out noise and remove redundant and irrelevant features. According to Jensen and Shen in [9], feature selection is compulsory due to the abundance of noisy, irrelevant or misleading features in a dataset. In the second facet, feature selection can be implemented as an optimization procedure that searches for an optimal subset of features that better satisfies a desired measure [10]. Thus, a proposed rough set feature selection algorithm based on a search method called Particle Swarm Optimization (PSO) is used to select feature subsets that describe the decisions as well as the original whole feature set does, while discarding the redundant features, leading to better prediction accuracy. After selecting those features that influence the decision concepts, they are employed within a decision rule generation process, creating descriptive rules for the classification task.
  3.2 Objective
The target is to find an optimal minimal feature subset from a problem domain in order to remove those features considered irrelevant, which increase the complexity of the learning process and decrease the accuracy of the induced knowledge. This not only improves the speed of data manipulation but also improves the classification rate by reducing the influence of noise. The aim is to build a concise model of the distribution of class labels in terms


of predictor features. The resulting classifier is then used to assign class labels to the testing instances, where the values of the predictor features are known but the value of the class label is unknown.
Figure 1 shows the feature selection procedure that was adopted in this study. This structure is composed of two important phases. The Feature Selection tier deploys Particle Swarm Optimization for further refinement and recommends only the significant features. The Classification tier exploits the optimized features to extract classification rules, which are used to differentiate benign cases from malignant cases.
In processing medical data, the advantages of choosing the optimal subset of features are as in [11]:
  • Reducing the dimensionality of the attributes reduces the complexity of the problem and allows researchers to focus more clearly on the relevant attributes.
  • Simplifying the data description may help physicians to make a prompt diagnosis.
  • Having fewer features means that less data needs to be collected; collecting data is never an easy job in medical applications because it is time-consuming and costly.

     Figure 1 Rough-PSO Feature Selection Process

The idea of PSO reducts for the optimal feature selection problem [12] is shown in Figure 2:
    (1) ∀i : Xi ← randomPosition();
             Vi ← randomVelocity()
    (2) fit ← bestFit(X);
        globalbest ← fit;
        pbest ← bestPos(X);
    (3) ∀i : Pi ← Xi
    (4) while (stopping criterion not met)
    (5)   for i = 1,…, S // for each particle
    (6)     if (fitness(i) > fit) // local best
    (7)       fit ← fitness(i)
    (8)       pbest ← Xi
    (9)     if (fitness(i) > globalbest) // global best
    (10)      globalbest ← fitness(i)
    (11)      gbest ← Xi;
              R ← getReduct(Xi) // convert to reduct
    (12)    updateVelocity(); updatePosition()

     Figure 2 An algorithm to compute reducts

Initially, a population of particles is constructed with random positions and velocities on S dimensions in the problem space. For each particle, the fitness function is evaluated. If the current particle's fitness evaluation is better than pbest, then this particle becomes the current best, and its position and fitness are stored. Next, the current particle's fitness is compared with the population's overall previous best fitness. If the current value is better than gbest, then gbest is set to the current particle's position, and the global best fitness is updated. This position represents the best feature subset encountered so far, and is thus converted and stored in R. The velocity and position of the particle are then updated according to Equation 12 and Equation 13. This process loops until a stopping criterion is met, usually a sufficiently good fitness or a maximum number of iterations (generations). The chosen subsets are then employed within a decision rule generation process, creating descriptive rules for the classification task.
The motivation behind this study is to provide a practical tool for optimizing the feature selection problem, the number of reducts found and the classification accuracy when applied to the classification of complex, real-world datasets. PSO has a strong search capability in the problem space and can efficiently find minimal reducts. Therefore, the combination of both with domain intelligence leads to better knowledge.

4. TECHNIQUES USED IN THE STUDY
The structure described in Fig. 1 involves two techniques: Rough Set and PSO.
  4.1. Rough Set Theory
Rough set theory [13] is a mathematical approach for handling vagueness and uncertainty in data analysis. Objects may be indiscernible due to the limited available information. A rough set is characterized by a pair of precise concepts, called lower and upper approximations, which are generated using object indiscernibility. The most important issues are the reduction of attributes and the generation of decision rules. The rough set approach seems to be of fundamental importance to AI and cognitive sciences, especially in the areas of machine


learning, knowledge acquisition, and decision analysis,          4.1.1.Basic Rough Set Concepts
knowledge discovery from databases, expert systems,            Let I  (U , A  {d }) be an information system [13]
inductive reasoning and pattern recognition.                   where U is the universe with a non-empty set of finite
 Irrelevant features, uncertainties and missing values         objects. A is a non-empty finite set of condition
often exist in medical data such as breast cancer data. So,    attributes, and d is the decision attribute (such a table is
the analysis of medical data often requires dealing with
                                                               also called decision table). a  A There is a
incompleteness and inconsistent of data that make it
differ from other intelligent techniques such as neural        corresponding function         f a : U  Va , where Va is the
networks, decision trees and fuzzy theory that are mainly      set of values of a.         If P  A , there is an associated
based hypothesis (e.g. knowledge about dependencies,           equivalence relation:
probability distributions and large number of
                                                               IND(P)  {(x, y) U U | a  P, f a ( x)  f a ( y)}(1)
Rough set theory [13] can deal with uncertainty and
incompleteness in data analysis. The attribute reduction       The partition of U, generated by IND (P) is denoted U/P.
algorithm removes redundant information or features and        If ( x, y )  IND ( P ) then x and y are indiscernible to P.
selects a feature subset that has the same discernibility as
the original set of features. The medical goal is to           The equivalence classes of the P-indiscernibility relation
identify subsets of the most important attributes              are     denoted [ x]P .    Let X  U ,      the    P-lower
influencing the treatment of patients. Rough set rule
                                                               approximation P X and P-upper approximation P X of
induction algorithms generate decision rules that can be
more useful for medical expert to analyze and gain             set X can be defined as:
understanding dimensions of the problem                         P X  {x  U | [ x]  X }        (2)
 The main advantage of rough set theory in data analysis
                                                                P X  {x  U | [ x ] P  X   }
is that it does not need any preliminary or additional                                            (3)
information about data like probability in statistics or       Let P, Q  A be equivalence relations over U, then the
grade of membership or the value of possibility in fuzzy       positive, negative and boundary regions can be defined
set theory. Rough set has many advantages to be used by        as:
many researchers:
                                                               POS P (Q)   P X
    Providing efficient algorithms for finding hidden                          X U / Q
     patterns in data and most of them are suited for
     parallel processing.                                      NEG P (Q)  U   P X
                                                                                       X U / Q
• Evaluating the significance of data
• Generating sets of decision rules from data which are concise and valuable
• Finding minimal sets of data (data reduction)
• Rough set methods do not need membership functions or prior parameter settings, owing to their simplicity

Rough sets have been a useful tool for medical applications. Hassanien [14] reported an application of rough sets to breast cancer data that generated rules with 98% accuracy. Tsumoto [15] proposed a rough set algorithm to generate diagnostic rules based on the hierarchical structure of differential medical diagnosis; experimental evaluation showed that the rules represent experts' decision processes. Komorowski and Ohrn [16] used a rough set approach to identify a patient group in need of a scintigraphic scan for subsequent modeling. In [15], a rough set classification algorithm exhibits higher classification accuracy than decision tree algorithms such as ID3 and C4.5, and the generated rules are more understandable than those produced by decision tree methods.

$$BND_P(Q) = \bigcup_{X \in U/Q} \overline{P}X \; - \bigcup_{X \in U/Q} \underline{P}X \tag{6}$$

The positive region of the partition U/Q with respect to P, abbreviated as $POS_P(Q)$, is the set of all objects of U that can be certainly classified to blocks of the partition U/Q by means of P.

Functional dependence: For a given $\mathcal{A} = (U, A)$ and $P, Q \subseteq A$, $P \rightarrow Q$ denotes the functional dependence of Q on P in $\mathcal{A}$, which holds if and only if $IND(P) \subseteq IND(Q)$. Dependencies to a degree are also considered in [13]: Q depends on P in a degree k ($0 \le k \le 1$), denoted $P \Rightarrow_k Q$, if

$$k = \gamma_P(Q) = \frac{|POS_P(Q)|}{|U|} \tag{7}$$

If k = 1, Q depends totally on P; if 0 < k < 1, Q depends partially on P; and if k = 0, Q does not depend on P. When P is a set of condition attributes and Q is the decision, $\gamma_P(Q)$ is the quality of classification [13], [17].
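The dependency degree of equation (7) can be illustrated with a minimal Python sketch. The toy decision table, the attribute names `a`, `b`, `d`, and the helper function names are assumptions made only for illustration; they are not taken from the Wisconsin data or from any rough set library.

```python
from collections import defaultdict

def partition(rows, attrs):
    """Group object indices into the equivalence classes of IND(attrs)."""
    blocks = defaultdict(set)
    for idx, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].add(idx)
    return list(blocks.values())

def positive_region(rows, p_attrs, q_attrs):
    """POS_P(Q): union of the P-blocks that lie entirely inside one Q-block."""
    q_blocks = partition(rows, q_attrs)
    pos = set()
    for p_block in partition(rows, p_attrs):
        if any(p_block <= q_block for q_block in q_blocks):
            pos |= p_block
    return pos

def dependency_degree(rows, p_attrs, q_attrs):
    """k = gamma_P(Q) = |POS_P(Q)| / |U|, as in equation (7)."""
    return len(positive_region(rows, p_attrs, q_attrs)) / len(rows)

# Hypothetical toy decision table (illustrative values only).
table = [
    {"a": 1, "b": 0, "d": "benign"},
    {"a": 1, "b": 0, "d": "malignant"},  # same conditions, different decision
    {"a": 0, "b": 1, "d": "benign"},
    {"a": 0, "b": 0, "d": "malignant"},
]

gamma_ab = dependency_degree(table, ["a", "b"], ["d"])  # 0.5: partial dependency
gamma_a = dependency_degree(table, ["a"], ["d"])        # 0.0: no dependency
print(gamma_ab, gamma_a)
```

In this sketch the first two objects are indiscernible on {a, b} but carry different decisions, so they fall into the boundary region and only two of the four objects belong to the positive region.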

The goal of attribute reduction is to remove redundant attributes so that the reduced set provides the same quality of classification as the original. The set of all reducts is defined as [18]:

$$Red = \{R \subseteq C \mid \gamma_R(D) = \gamma_C(D),\ \forall B \subset R,\ \gamma_B(D) \neq \gamma_C(D)\} \tag{8}$$

A dataset may have many attribute reducts. The set of all optimal reducts is:

$$Red_{min} = \{R \in Red \mid \forall R' \in Red,\ |R| \le |R'|\} \tag{9}$$

The intersection of all reducts is called the core, the elements of which are those attributes that cannot be eliminated. The core is defined as [13], [18]:

$$Core(C) = \bigcap Red \tag{10}$$

   4.1.2.Decision Rules
An expression c: (a = v), where $a \in A$ and $v \in V_a$, is an elementary condition (atomic formula) of the decision rule which can be checked for any $x \in X$. An elementary condition c can be interpreted as a mapping $c : U \rightarrow \{true, false\}$. A conjunction C of q elementary conditions is denoted by $C = c_1 \wedge c_2 \wedge ... \wedge c_q$. The cover of a conjunction C, denoted by $[C]$ or $\|C\|_A$, is the subset of examples that satisfy the conditions represented by C, as shown in [17]. The cover of the conjunction, $[C] = \{x \in U : C(x) = true\}$, is called the support descriptor. If K is a concept, the positive cover $[C]^+_K = [C] \cap K$ denotes the set of positive examples covered by C.

A decision rule r for A is any expression of the form $\varphi \rightarrow (d = v)$, where $\varphi = c_1 \wedge c_2 \wedge ... \wedge c_q$ is a conjunction satisfying $[\varphi]^+_K \neq \emptyset$ and $v \in V_d$, where $V_d$ is the set of values of d. The set of attribute-value pairs occurring in the left-hand side of the rule r is the condition part, Pred(r), and the right-hand side is the decision part, Succ(r). An object $u \in U$ is matched by a decision rule $\varphi \rightarrow (d = v)$ if and only if u supports both the condition part and the decision part of the rule. If u is matched by $\varphi \rightarrow (d = v)$, then we say that the rule classifies u to decision class v. The number of objects matched by a decision rule $\varphi \rightarrow (d = v)$, denoted by Match(r), is equal to $card(\|\varphi\|_A)$. The support of the rule, $card(\|\varphi \wedge (d = v)\|_A)$, is the number of objects supporting the decision rule.

As in [19], the accuracy and coverage of a decision rule $\varphi \rightarrow (d = v)$ are defined as:

$$\mathrm{accuracy}(\varphi \rightarrow (d = v)) = \frac{card(\|\varphi \wedge (d = v)\|_A)}{card(\|\varphi\|_A)} \tag{11}$$

$$\mathrm{coverage}(\varphi \rightarrow (d = v)) = \frac{card(\|\varphi \wedge (d = v)\|_A)}{card(\|d = v\|_A)} \tag{12}$$

   4.1.3.Rough set Feature Selection
Rough sets for feature selection [20] are valuable, as the selected feature subset can generate more general decision rules and better classification quality on new samples. Because finding a minimal reduct is NP-hard, heuristic or approximation algorithms have to be considered. K.Y. Hu [21] computes the significance of an attribute using heuristic ideas from discernibility matrices and proposes a heuristic reduction algorithm (DISMAR). X. Hu [22] gives a rough set reduction algorithm using a positive region-based attribute significance measure as a heuristic (POSAR). G.Y. Wang [23] develops a conditional information entropy reduction algorithm (CEAR).

  4.2.Particle Swarm Optimization (PSO)
Particle swarm optimization (PSO) is an evolutionary computation technique developed by Kennedy and Eberhart [24], [25]. The original idea was to graphically simulate the choreography of a bird flock. Y. Shi introduced the concept of inertia weight into the particle swarm optimizer to produce the standard PSO algorithm [26]. Particle swarms have since become very popular as an efficient search and optimization technique. PSO [27], [30] does not require any gradient information of the function to be optimized, uses only primitive mathematical operators, and is conceptually very simple. Since its advent in 1995, PSO has attracted the attention of many researchers all over the world, resulting in a huge number of variants of the basic algorithm and many parameter automation strategies.

The advantages of the basic particle swarm optimization algorithm are discussed in [28]:
• PSO is based on swarm intelligence. It can be applied in both scientific research and engineering.
• PSO has no crossover (overlapping) or mutation calculation. The search is carried out through the velocity of the particles. Over several generations, only the best particle transmits information to the other particles, so the search is very fast.
• The calculation in PSO is very simple. Compared with other evolutionary computations, it has strong optimization ability and is easy to implement.


• PSO uses real-number coding, which is determined directly by the solution. The number of dimensions is equal to the number of variables of the solution.

PSO is initialized with a population of particles. Each particle is treated as a point in an S-dimensional space. The ith particle is represented as $X_i = (x_{i1}, x_{i2}, ..., x_{iS})$. The best previous position (pbest, the position giving the best fitness value) of any particle is $P_i = (p_{i1}, p_{i2}, ..., p_{iS})$. The index of the global best particle is represented by 'gbest'. The velocity of particle i is $V_i = (v_{i1}, v_{i2}, ..., v_{iS})$. The particles are manipulated according to the following equations:

$$v_{id} = w \cdot v_{id} + c_1 \cdot rand() \cdot (p_{id} - x_{id}) + c_2 \cdot Rand() \cdot (p_{gd} - x_{id}) \tag{13}$$

$$x_{id} = x_{id} + v_{id} \tag{14}$$

Here w is the inertia weight. A suitable selection of the inertia weight provides a balance between global and local exploration and thus requires fewer iterations on average to find the optimum. If a time-varying inertia weight is employed, better performance can be expected [29]. The acceleration constants $c_1$ and $c_2$ in equation (13) represent the weighting of the stochastic acceleration terms that pull each particle toward the pbest and gbest positions. Low values allow particles to roam far from target regions before being tugged back, while high values result in abrupt movement toward, or past, target regions. rand() and Rand() are two random functions in the range [0,1]. Particle velocities on each dimension are limited to a maximum velocity Vmax. If Vmax is too small, particles may not explore sufficiently beyond locally good regions. If Vmax is too high, particles might fly past good solutions.

The first part of equation (13) gives the "flying particles" a memory capability and the ability to explore new areas of the search space. The second part is the "cognition" part, which represents the private thinking of the particle itself. The third part is the "social" part, which represents the collaboration among the particles. Equation (13) is used to update the particle's velocity. The particle then flies toward a new position according to equation (14). The performance of each particle is measured according to a pre-defined fitness function. The process for implementing the PSO algorithm is as follows [7]:

     procedure PSO
       repeat
         for i = 1 to number of individuals do
           if G(xi) > G(pi) then                // G() evaluates goodness
             for d = 1 to dimensions do
               pid = xid                        // pid is the best state found so far
             end for
           end if
           g = i                                // arbitrary
           for j = indexes of neighbors do
             if G(pj) > G(pg) then
               g = j                            // g is the index of the best performer in the neighborhood
             end if
           end for
           for d = 1 to number of dimensions do
             vid(t) = f(xid(t − 1), vid(t − 1), pid, pgd)   // update velocity
             vid ∈ (−Vmax, +Vmax)
             xid(t) = f(vid(t), xid(t − 1))                 // update position
           end for
         end for
       until stopping criteria
     end procedure

Figure 3: Standard Particle Swarm Optimization (PSO)

Definitions and variables used in Figure 3:
• t means the current time step, t − 1 means the previous time step.
• xid(t) is the current state (position) at site d of individual i.
• vid(t) is the current velocity at site d of individual i.
• ±Vmax is the upper/lower bound placed on vid.
• pid is individual i's best state (position) found so far at site d.
• pgd is the neighborhood best state found so far at site d.

  4.3.Problem Description and Basic Experimentation Setup
The breast cancer UCI dataset [31] was obtained from the University of Wisconsin Hospitals, Madison, from Dr. William H. Wolberg. We perform experimentation on the dataset summarized in Table 2.

Table 2: Data used in the experiments

Name                                 Instances   Attributes   Class Distribution             Validation
Wisconsin Breast Cancer Diagnostic   699         11           Benign cases: 458 (65.5%)      Training: 80%
                                                              Malignant cases: 241 (34.5%)   Testing: 140 cases

The data consist of nine conditional features and one decision feature, because the first attribute, which describes the sample code number, will be removed as shown later. The data were processed in the WEKA software; more information about it can be found in [32].
Steps to be implemented:


Step 1: Remove the sample code number from the data (it has no effect on the data) with the Remove filter.
Step 2: The dataset was converted from numeric to nominal data using the NumericToNominal filter, which is defined as an instance filter that discretizes a range of numeric attributes in the dataset into nominal attributes.
Step 3: Replace missing values for nominal and numeric attributes with the modes and means from the training data, using the ReplaceMissingValues filter.
Step 4: To find the reducts, we applied the supervised attribute selection filter RSARSubsetEval (Rough Set Attribute Reduction), which is the implementation of the QuickReduct algorithm of rough set attribute reduction, and we used PSOSearch as the search method, which explores the attribute space using the Particle Swarm Optimization (PSO) algorithm described in [33], with the parameters shown in Figure 4 and Table 3.

Table 3: PSO Parameters

PSO Parameters   Individual Weight   Inertia Weight   Social Weight   Iterations
                 0.34                0.33             0.33            20

Figure 4: Preprocess Implementation

This stage was important because rough set filtering was used to eliminate the unimportant and redundant features (first phase in Figure 1) and to reduce the number of iterations that PSO has to perform in finding an optimum feature subset.

Step 5: We used several classification techniques, as shown in Table 4, to classify the data. The number of decision rules and the classification accuracy are also shown.

From the results, we could conclude that increasing the number of particles/individuals above 20 does not bring any relevant improvement in the algorithm's performance. Increasing or decreasing the number of iterations also has no influence on the algorithm's performance, as its best result in the experiments is at 20 iterations. The best result in all of the classification algorithms is obtained with the minimum feature subset, which supports our aim of obtaining the best results with a minimum feature subset. Finally, the evaluation results show that Naïve Bayes with a population size of 5 and 20 iterations obtains the best result among the methods compared.

Table 4: Comparison of classification results by using various classification techniques

Naïve Bayes

Correctly Classified   Incorrectly Classified   TP Rate   FP Rate   Precision   Recall   Population   Feature
Instances              Instances                (AVG)     (AVG)     (AVG)                Size         Selection
135 (96.4286%)         5 (3.5714%)              0.964     0.038     0.965       0.964    20           10 Attributes
136 (97.1429%)         4 (2.8571%)              0.971     0.025     0.972       0.971    10           9 Attributes
136 (97.1429%)         4 (2.8571%)              0.971     0.025     0.972       0.971    5            8 Attributes

Confusion Matrix (population size 20)
 a  b   <-- classified as
87  3 | a = 2
 2 48 | b = 4

Confusion Matrix (population size 10)
 a  b   <-- classified as
87  3 | a = 2
 1 49 | b = 4

Confusion Matrix (population size 5)
 a  b   <-- classified as
87  3 | a = 2
 1 49 | b = 4

Decision Table

Correctly Classified   Incorrectly Classified   TP Rate   FP Rate   Precision   Recall   Population   Feature
Instances              Instances                (AVG)     (AVG)     (AVG)                Size         Selection
129 (92.1429%)         11 (7.8571%)             0.921     0.106     0.921       0.921    20           10 Attributes
129 (92.1429%)         11 (7.8571%)             0.921     0.106     0.921       0.921    10           9 Attributes, ... Rules, ... seconds
129 (92.1429%)         11 (7.8571%)             0.921     0.106     0.921       0.921    5            8 Attributes, 10 Rules, ... seconds

Confusion Matrix for the above three cases
 a  b   <-- classified as
86  4 | a = 2
 7 43 | b = 4

Volume I, Issue 3, September-October 2012                                                                                            Page 33
     International Journal of Emerging Trends & Technology in Computer Science(IJETTCS)
       Web Site: Email:,
Volume I, Issue 3, September-October 2012                                      ISSN 2278-6856

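Table 4 (continued)

Prism

Correctly Classified   Incorrectly Classified   TP Rate   FP Rate   Precision   Recall   Population   Feature
Instances              Instances                (AVG)     (AVG)     (AVG)                Size         Selection
122 (87.1429%)         9 (6.4286%)              0.931     0.104     0.932       0.931    20           10 Attributes; Unclassified: 9 (6.4286%)
119 (85%)              10 (7.1429%)             0.922     0.126     0.927       0.922    10           9 Attributes; Unclassified: 11 (7.8571%)
120 (85.7143%)         9 (6.4286%)              0.93      0.126     0.937       0.93     5            8 Attributes; Unclassified: 11 (7.8571%)

Confusion Matrix (population size 20)
 a  b   <-- classified as
82  2 | a = 2
 7 40 | b = 4

Confusion Matrix (population size 10)
 a  b   <-- classified as
81  1 | a = 2
 9 38 | b = 4

Confusion Matrix (population size 5)
 a  b   <-- classified as
83  0 | a = 2
 9 37 | b = 4

J48

Correctly Classified   Incorrectly Classified   TP Rate   FP Rate   Precision   Recall   Population   Feature
Instances              Instances                (AVG)     (AVG)     (AVG)                Size         Selection
133 (95%)              7 (5%)                   0.95      0.054     0.95        0.95     20           10 Attributes; Number of Leaves: 29; Tree Size: 32
133 (95%)              7 (5%)                   0.95      0.054     0.95        0.95     10           9 Attributes; Number of Leaves: 29; Tree Size: 32
130 (92.8571%)         10 (7.1429%)             0.929     0.093     0.928       0.929    5            8 Attributes; Number of Leaves: 37; Tree Size: 41

Confusion Matrix (population sizes 20 and 10)
 a  b   <-- classified as
86  4 | a = 2
 3 47 | b = 4

Confusion Matrix (population size 5)
 a  b   <-- classified as
86  4 | a = 2
 6 44 | b = 4

JRip with k=2 optimizations, Folds 3

Correctly Classified   Incorrectly Classified   TP Rate   FP Rate   Precision   Recall   Population   Feature
Instances              Instances                (AVG)     (AVG)     (AVG)                Size         Selection
130 (92.8571%)         10 (7.1429%)             0.929     0.084     0.929       0.929    20           10 Attributes; 14 Rules; 0.11 seconds
135 (96.4286%)         5 (3.5714%)              0.964     0.055     0.965       0.964    10           9 Attributes; 13 Rules
133 (95%)              7 (5%)                   0.95      0.072     0.95        0.95     5            8 Attributes; 15 Rules

Confusion Matrix (population size 20)
 a  b   <-- classified as
85  5 | a = 2
 5 45 | b = 4

Confusion Matrix (population size 10)
 a  b   <-- classified as
89  1 | a = 2
 4 46 | b = 4

Confusion Matrix (population size 5)
 a  b   <-- classified as
88  2 | a = 2
 5 45 | b = 4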
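The summary figures in Table 4 can be recomputed from the confusion matrices. The following Python sketch is only an illustration: the function name `summarize` is hypothetical, and it assumes that the averaged TP rate and precision are weighted by class size over the classified instances, which is how the table values appear to have been produced. Rows are taken as the actual class and columns as the predicted class.

```python
def summarize(confusion, total=None):
    """Summarize a confusion matrix (rows: actual class, columns: predicted).

    `total` is the number of test instances; it can exceed the matrix sum
    when a classifier leaves some instances unclassified.
    """
    classified = sum(sum(row) for row in confusion)
    if total is None:
        total = classified
    n = len(confusion)
    correct = sum(confusion[i][i] for i in range(n))

    # Class-size-weighted precision (assumed averaging scheme).
    precision = 0.0
    for j in range(n):
        predicted_j = sum(row[j] for row in confusion)
        support_j = sum(confusion[j])
        if predicted_j:
            precision += (support_j / classified) * (confusion[j][j] / predicted_j)

    return {
        "correct": correct,
        "incorrect": classified - correct,
        "unclassified": total - classified,
        "accuracy_pct": 100.0 * correct / total,
        "tp_rate": correct / classified,  # weighted TP rate reduces to this
        "precision": precision,
    }

# Naive Bayes, population size 10 (all 140 test cases classified).
naive_bayes = summarize([[87, 3], [1, 49]])
# Prism, population size 20 (9 of the 140 test cases left unclassified).
prism = summarize([[82, 2], [7, 40]], total=140)

print(naive_bayes["correct"], round(naive_bayes["accuracy_pct"], 4))  # 136 97.1429
print(prism["correct"], prism["unclassified"], round(prism["tp_rate"], 3))  # 122 9 0.931
```

Under these assumptions the sketch reproduces the Naïve Bayes row (136 correct, 97.1429%, TP rate 0.971, precision 0.972) and the Prism row (122 correct, 9 incorrect, 9 unclassified).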


The Naïve Bayes classifier shows that classification with the minimum feature subset of only 8 attributes achieves the highest result, with 136 correctly classified instances. The Naïve Bayes classifier achieved the same result when it used 9 attributes and a population size of 10. This shows that the best result is obtained with the minimum feature selection. The Decision Table classifier gives 129 correctly classified instances with the minimum feature subset; only 8 attributes give the same result as 10 or 9 attributes. The Prism classifier gives the worst results. It achieves 122 correctly classified instances with ten attributes, and the number of correctly classified instances drops to 120 when 8 attributes and a population size of 5 are used. So we can say that reducing the number of attributes with the Prism classifier is counterproductive and decreases the classification accuracy. The J48 classifier achieves its best result of 133 correctly classified instances with 9 attributes and a population size of 10. JRip achieves its best result of 135 correctly classified instances with 9 attributes and a population size of 10. The best results were obtained by the Naïve Bayes classifier, followed by JRip, in terms of classification accuracy and feature reduction.

5. Future Plans
Blending with other intelligent optimization algorithms [28]
The blending process combines the advantages of PSO with the advantages of other intelligent optimization algorithms to create a compound algorithm that has practical value. For example, the particle swarm optimization algorithm can be improved by the simulated annealing (SA) approach. It can also be combined with genetic algorithms, ant colony algorithms, fuzzy methods, and so on.
The application area of the algorithm
At present, most research on PSO is in coordinate systems. There is less research on the application of the PSO algorithm in non-coordinate systems, scattered systems

set based algorithms can be derived from data tables with no need for a priori estimates or preliminary assumptions. The combination of rough sets with other intelligent techniques is able to provide a more effective approach. We have illustrated that rough sets have been successfully combined with particle swarm optimization, which is described as a new heuristic optimization method based on swarm intelligence. It is very simple, easily implemented, and needs few parameters, which has led to it being fully developed and applied to the feature extraction task.

References
  [1] R. Roselin, K. Thangavel, and C. Velayutham, "Fuzzy-Rough Feature Selection for Mammogram Classification", Journal of Electronic Science and Technology, Vol. 9, No. 2, June 2011.
  [2] J.R. Quinlan, "Induction of Decision Trees", Machine Learning, Vol. 1, pp. 81-106, 1986.
  [3] Sarvestan Soltani A., Safavi A. A., Parandeh M. N. and Salehi M., "Predicting Breast Cancer Survivability using data mining techniques", Software Technology and Engineering (ICSTE), 2nd International Conference, Vol. 2, pp. 227-231, 2010.
  [4] Chang Pin Wei and Liou Ming Der, "Comparison of three Data Mining techniques with Genetic Algorithm in analysis of Breast Cancer data", Available: par_threedata.pdf.
  [5] Gandhi Rajiv K., Karnan Marcus and Kannan S., "Classification Rule Construction Using Particle Swarm Optimization Algorithm for Breast Cancer Datasets", Signal Acquisition and Processing, ICSAP, International Conference, pp. 233-237, 2010.
  [6] S. Das, A. Abraham, S.K. Sarkar, "A Hybrid Rough Set–Particle Swarm Algorithm for Image Pixel Classification", Proc. of the Sixth Int. Conf. on Hybrid Intelligent Systems, pp. 26-32, 2006.
and compound optimization system.                             [7]    Matthew Settles, “An Introduction to Particle
                                                                Swarm Optimization”, November 2007.
6. CONCLUSION                                                 [8] S. B. Kotsiantis, “Supervised Machine Learning:
                                                                A     Review     of    Classification   Techniques”,
Medical diagnosis is considered as an intricate task that       Informatica, Vol.31, 249-268, 2007.
needs to be carried out precisely and efficiently. The        [9]    Jensen, R. and Shen, Q., “Fuzzy-rough Data
automation of the same would be highly beneficial.              Reduction with Ant Colony Optimization”, Journal
Clinical decisions are often made based on doctor's             of Fussy Sets and Systems, pp.5-20, Vol. 149, 2005.
intuition and experience. Data mining techniques have         [10] Monteiro, S., Uto, TK., Kosugi, Y.,
the potential to generate a knowledge-rich environment          Kobayashi,N.,Watanabe, E. and Kameyama,
which can help to significantly improve the quality of          K,“Feature Extraction of Hyperspectral Data for
clinical decisions. Rough set theory supplies essential         Under Spilled Blood Visualization Using Particle
tools for knowledge analysis. It provides algorithms for        Swarm Optimization”,International Journal of
knowledge reduction, concept approximation, decision            Bioelectromagnetism, pp.232-235, Vol. 7,No.1 ,
rule induction and object classification. The methods of        2005.
rough set theory rest on indiscernibility and related         [11] Yan WANG, Lizhuang MA, “Feature Selection
notions, in particular on notions related to rough              for Medical Dataset Using Rough Set Theory”,
inclusions. All constructs needed in implementing rough
   Proceedings of the 3rd WSEAS International Conference on Computer Engineering and Applications.
 [12] Jensen, R., Shen, Q., & Tuson, A., “Finding Rough Set Reducts with SAT,” In Proceedings of the 10th International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing, LNAI 3641, pp. 194-203, 2005.
 [13] Z. Pawlak, “Rough Sets: Theoretical Aspects of Reasoning about Data,” Kluwer Academic Publishers, Dordrecht, 1991.
 [14] A.E. Hassanien, “Rough Set Approach for Attribute Reduction and Rule Generation: A Case of Patients with Suspected Breast Cancer”, Journal of the American Society for Information Science and Technology, pp. 954-962, Vol. 55, No. 11, 2004.
 [15] S. Tsumoto, “Mining Diagnostic Rules from Clinical Databases Using Rough Sets and Medical Diagnostic Model”, Information Sciences, pp. 65-80, Vol. 162, 2004.
 [16] J. Komorowski, A. Ohrn, “Modeling Prognostic Power of Cardiac Tests Using Rough Sets”, Artificial Intelligence in Medicine, pp. 167-191, Vol. 15, 1999.
 [17] Z. Pawlak, “Rough Set Approach to Knowledge-Based Decision Support”, European Journal of Operational Research, pp. 48-57, Vol. 99, 1997.
 [18] Xiangyang Wang, Jie Yang, Xiaolong Teng, Weijun Xia, Richard Jensen, “Feature Selection based on Rough Sets and Particle Swarm Optimization”.
 [19] A. Skowron, C. Rauszer, “The Discernibility Matrices and Functions in Information Systems”, In: R.W. Swiniarski (Eds.): Intelligent Decision Support: Handbook of Applications and Advances of the Rough Sets Theory, Kluwer Academic Publishers, Dordrecht, pp. 311-362, 1992.
 [20] R.W. Swiniarski, A. Skowron, “Rough set methods in feature selection and recognition”, Pattern Recognition Letters, pp. 833-849, Vol. 24, 2003.
 [21] K.Y. Hu, Y.C. Lu, C.Y. Shi, “Feature ranking in rough sets,” AI Communications, pp. 41-50, Vol. 16, No. 1, 2003.
 [22] X. Hu, “Knowledge Discovery in Databases: An Attribute-Oriented Rough Set Approach”, Ph.D. thesis, Regina University, 1995.
 [23] G.Y. Wang, J. Zhao, J.J. An, Y. Wu, “Theoretical Study on Attribute Reduction of Rough Set Theory: Comparison of Algebra and Information Views”, In: Proceedings of the Third IEEE International Conference on Cognitive Informatics (ICCI’04).
 [24] J. Kennedy, R. Eberhart, “Particle Swarm Optimization”, In: Proc. IEEE Int. Conf. on Neural Networks, Perth, pp. 1942-1948, 1995.
 [25] J. Kennedy, R.C. Eberhart, “A new optimizer using particle swarm theory”, In: Sixth International Symposium on Micro Machine and Human Science, Nagoya, pp. 39-43, 1995.
 [26] Y. Shi, R. Eberhart, “A Modified Particle Swarm Optimizer”, In: Proc. IEEE Int. Conf. on Evolutionary Computation, Anchorage, AK, USA, pp. 69-73, 1998.
 [27] J. Kennedy, “Small Worlds and Mega-Minds: Effects of Neighborhood Topology on Particle Swarm Performance”, Proceedings of the 1999 Congress of Evolutionary Computation, IEEE Press, Vol. 3, pp. 1931-1938, 1999.
 [28] Qinghai Bai, “Analysis of Particle Swarm Optimization Algorithm”, Computer and Information Science, Vol. 3, No. 1, 2010.
 [29] Y. Shi, R. C. Eberhart, “Parameter Selection in Particle Swarm Optimization”, In: Evolutionary Programming VII: Proc. EP98, New York: Springer-Verlag, pp. 591-600, 1998.
 [30] R.C. Eberhart, Y. Shi, “Particle Swarm Optimization: Developments, Applications and Resources”, In: Proc. IEEE Int. Conf. on Evolutionary Computation, Seoul, pp. 81-86, 2001.
 [31] databases/breast-cancer-wisconsin/breast-cancer-
 [32]
 [33] Moraglio, A., Di Chio, C., and Poli, R., “Geometric Particle Swarm Optimization,” EuroGP, LNCS 445, pp. 125-135, 2007.
