Comparative Study of Genetic Algorithms and Resampling Methods for Ensemble Constructing

R.I. Diaz, R.M. Valdovinos, J.H. Pacheco


Ricardo I. Diaz and Juan H. Pacheco are with the Pattern Recognition Group, Instituto Tecnológico of Toluca, Metepec, México (email: ricardo.diazg@hotmail.com, hpacheco@ittoluca.edu.mx). Rosa M. Valdovinos is with the Applied Computing Group, Centro Universitario Valle de Chalco, Universidad Autónoma del Estado de México, Valle de Chalco, México (e-mail: li rmvr@hotmail.com, tel: (+52)55-59714940).
This work has been partially supported by grants 51626 and 67407 from the Mexican CONACYT.

Abstract: Diversity and accuracy in the members of a classifier ensemble appear as two of the main issues to take into account for its construction and operation. Resampling has been the most widely used strategy for constructing ensembles; however, the subsamples obtained here consider both diversity and high accuracy. In this work, two different strategies for constructing ensembles with those characteristics are analyzed: resampling methods such as Bagging and Boosting, and an evolutionary strategy, Genetic Algorithms. Using a dynamic weighting scheme, the Genetic Algorithm strategy demonstrated its effectiveness in searching for the best solution to the problem. In addition, we introduce other modifications in order to reduce the processing time of the Genetic Algorithm. All of them are studied specifically in the framework of the Nearest Neighbour classification algorithm.

I. INTRODUCTION

An ensemble is a learning paradigm where several classifiers are combined in order to generate a single classification result. Let D = {D_1, ..., D_L} be a set of L classifiers, and Ω = {ω_1, ..., ω_c} be a set of c classes. Each classifier D_i (i = 1, ..., L) gets as input a feature vector x ∈ ℝ^d and assigns it to one of the c problem classes.

It is widely accepted that the major factor for a better accuracy is the diversity among the classifiers to be combined, that is, they must differ in their decisions to complement each other [10], [2], [11]. To obtain diversity, there are many distinct techniques for constructing classifier ensembles [6]. One consists of using different classifiers over a unique training set; in this case, the classifiers themselves must be different enough to produce diverse decisions. Another consists of manipulating (or resampling) the data set on which the classifiers are trained. Under this scenario, all classifiers should be based upon the same technique, e.g., a k-Nearest Neighbor (k-NN) classifier.

Among the resampling methods, the Bagging [3] and Boosting [5] algorithms are widely used. These methods use random selection with replacement; thus, the subsamples contain many redundant patterns.

On the other hand, recent investigations suggest improving the resampling techniques using genetic programming or Genetic Algorithms (GA) [16]. Iba [18] optimizes the Bagging and Boosting algorithms by dividing the training set into small subsamples, which are then improved using Bagging and Boosting. With these methods, called BagGP and BoostGP, the trees of BagGP and BoostGP can reduce the nodes generated after several generations. Other tendencies propose mixing GA with other techniques for the ensemble design. Zhou et al. [15] propose the GASEN method, which trains several neural networks and uses a GA to select an optimal subgroup of neural networks for constructing the ensemble. Sirlantzis et al. [17] used a GA to select both the individual classifiers for the ensemble and the rules of combination. Along these lines, part of our methodology consists in building subsamples whose chromosome encodes either inclusion (1) or exclusion (0) of each training sample. The subsample formed by the GA therefore has three aspects: size reduction, diversity and good fitness. The ensemble formed with these subsamples uses two strategies for fusing decisions: weighted and non-weighted.

This study is mainly focused on realizing not only the advantages of the ensembles but also those of the GA. Thus, with GA methods we find the best individual classifiers of the ensemble. Furthermore, we are interested in empirical knowledge of how this approach behaves compared with the process of constructing ensembles with resampling methods.

The rest of the paper is organized as follows. Section II provides a description of the resampling methods evaluated in our study. Section III reviews the main concepts of the genetic algorithm used. Section IV is about the nearest neighbor classifier. Section V introduces the majority voting schemes used here. Next, the experimental results are discussed in Section VI. Finally, Section VII gives the main conclusions and points out possible directions for future research.

II. RESAMPLING METHODS

Selection of patterns with replacement is the main characteristic of the resampling methods used for classifier ensembles. In this section, we briefly describe two of the most widely used methods.

A. Bagging

Bagging (Bootstrap Aggregating) [3] is the simplest and earliest resampling method. This algorithm employs bootstrap sampling to generate several subsamples, each obtained by randomly drawing, with replacement, m examples from the original training set (also of size m). The individual predictions are often combined by majority voting. Note that many of the original instances may be repeated in the resulting subsample while others may be left out.
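The bootstrap step and the majority vote described for Bagging can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the fixed random seed, the scalar features and the toy training set are all assumptions made for the example.

```python
import random
from collections import Counter


def bootstrap_subsamples(training_set, n_subsamples, seed=0):
    """Draw n_subsamples bootstrap samples, each of size m = len(training_set)."""
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    m = len(training_set)
    return [
        [training_set[rng.randrange(m)] for _ in range(m)]  # sampling with replacement
        for _ in range(n_subsamples)
    ]


def nn_predict(subsample, x):
    """1-NN rule on scalar features: label of the closest pattern in the subsample."""
    return min(subsample, key=lambda pattern: abs(pattern[0] - x))[1]


def majority_vote(labels):
    """Most frequent label; ties go to the label encountered first."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical toy training set T of (feature, label) pairs.
T = [(0.1, "a"), (0.2, "a"), (0.3, "a"), (0.9, "b"), (1.0, "b"), (1.1, "b")]
subsamples = bootstrap_subsamples(T, n_subsamples=5)
votes = [nn_predict(s, 0.25) for s in subsamples]
print(majority_vote(votes))
```

Because each subsample is drawn with replacement, some patterns of T typically appear several times in a subsample and others not at all, which is the source of diversity among the base classifiers.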





978-1-4244-1823-7/08/$25.00 © 2008 IEEE
Briefly, the bagging method generates L bootstrap subsamples {S_1, S_2, ..., S_L} of size m from the original training set T and creates the corresponding L base classifiers. The output produced by the ensemble is the class label with a majority of votes.

B. Boosting

Boosting and its main variant AdaBoost (Adaptive Boosting) [5] sequentially generate a series of individual classifiers, where the training instances wrongly predicted by previous base classifiers are picked more often than correctly classified examples. In general, every variant of boosting attempts to produce new classifiers that better predict the examples for which the current ensemble fails. This is done to minimize the expected error.

AdaBoost generates L subsamples S_1, S_2, ..., S_L using a weight for each one of the m instances, and thus L individual classifiers D_1, D_2, ..., D_L are built. At each stage l (l = 1, ..., L), the weight W_l(i) defines the probability of adding the instance x_i to subsample S_l and represents the "difficulty" in predicting that instance by the previously created base classifier D_{l-1}. Initially, the probability of picking each instance is set to W_1(i) = 1, and then the weights are modified at each step 1 < l ≤ L.

There are two ways of determining the subsamples employed in AdaBoost. The first one is picking a set of examples based on the probabilities of the instances (this probability depends on how often that example was misclassified by the previous classifiers). With this strategy, "difficult" instances are likely to appear more than once in the next subsample.

In the second one, the implementation simply consists of using all the instances and the weights corresponding to their probabilities. At each step l, the corresponding base classifier D_l is built and then we compute its error E_l using the weights W_l as

    E_l = \sum_{i=1}^{N} W_l(i) (1 - y_{i,l})                      (1)

where y_{i,l} = 1 if D_l produces the correct label for instance x_i, and y_{i,l} = 0 otherwise.

The algorithm stops when E_l ≥ 0.5. Otherwise it computes a coefficient β_l = E_l / (1 − E_l), to be used in the weighted voting of the ensemble and also to update the weights of the individual instances as follows:

    W_{l+1}(i) = W_l(i) \beta_l^{1 - y_{i,l}}                      (2)

The final classification produced for a new sample x is given by weighted voting among the L base classifiers, where the weight of each base classifier depends on its performance on the subsample used for building it. The final decision of the ensemble D for the sample x corresponds to the class label c with the maximal support according to:

    D(x) = \arg\max_{c \in \Omega} \sum_{l : D_l(x) = c} \log \frac{1}{\beta_l}      (3)

III. GENETIC ALGORITHM

The most basic structure of the GA, proposed by Holland [16], begins with a set of possible solutions (population) codified as chains of bits (called chromosomes); then, using a method to evaluate the behavior (fitness) of each chromosome, the parents of the next population are determined.

In our GA, an m-dimensional chromosome represents the whole training set of m samples. This was accomplished by a binary codification, where a specific training sample was marked either (1) or (0). This codification is assigned randomly. Thus, a solution (chromosome) is formed by all training samples marked with 1s, while the training samples marked with 0s are not part of the subsample.

On the other hand, to reduce the processing time of the GA, in addition to the 0s, some chromosomes are reduced by 20%; that is to say, during the evolutionary process, several genes, marked with a value different from 0 or 1, were ignored. The initial and the subsequent populations (up to 30 epochs) consisted of 15 chromosomes.

With respect to the fitness method, the leave-one-out method was employed. To this purpose, for each solution, the following function e_j is defined:

    e_j = \frac{1}{m} \sum_{x \in T} e(y, x)                       (4)

where m denotes the number of patterns in a training sample T, x represents a training instance, y is the nearest neighbor of x in T − {x}, and e(y, x) is defined as follows:

    e(y, x) = \begin{cases} 0, & \text{if } L(y) = L(x); \\ 1, & \text{otherwise.} \end{cases}      (5)

where L(x) is the class label of a pattern x, and L(y) indicates the class label of a pattern y. Each individual solution (chromosome) is weighted according to the function in Eq. 4, using the error function just introduced.

An elitist method selects the best solutions in each step and uses these chromosomes to apply the genetic operators: crossover and mutation. The former consists of uniform crossover, and the latter randomly changes 10% of the genes in each chromosome. An important aspect is that the best solutions are not included in the next epoch.

When the evolutionary process has finished, the best five solutions over all the epochs are chosen to build the ensemble.

IV. NEAREST NEIGHBOR ENSEMBLE

The Nearest Neighbor (NN) rule [9] is one of the most celebrated algorithms in machine learning. In recent years, interest in these methods has flourished again in several fields of science. Due to their conceptual simplicity and to an asymptotic error rate conveniently bounded in terms of the optimal Bayes error, they have proved to be powerful non-parametric classification systems in real-world problems.

In its classical manifestation, given a set of n previously labeled prototypes or training sample (TS), this classifier assigns a given sample to the class indicated by the label of the closest prototype in the TS.
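The leave-one-out fitness of Eqs. (4) and (5), evaluated with the 1-NN rule on the subsample selected by a binary chromosome, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name, the Euclidean distance, the reading of T as the subsample selected by the chromosome, and the worst-case error returned for subsamples with fewer than two patterns are all choices made for the example.

```python
import math


def loo_1nn_error(samples, chromosome):
    """Leave-one-out 1-NN error (Eqs. 4-5) of the subsample picked by a chromosome.

    samples:    list of (feature_vector, label) pairs, one per training sample.
    chromosome: list of 0/1 genes of the same length; 1 means "include".
    """
    subsample = [s for s, gene in zip(samples, chromosome) if gene == 1]
    m = len(subsample)
    if m < 2:
        return 1.0  # assumption: no neighbor exists, so treat as worst-case error
    errors = 0
    for i, (x, label_x) in enumerate(subsample):
        others = subsample[:i] + subsample[i + 1:]  # T - {x}
        # y is the nearest neighbor of x among the remaining patterns
        _, label_y = min(others, key=lambda p: math.dist(p[0], x))
        errors += 0 if label_y == label_x else 1  # e(y, x) from Eq. 5
    return errors / m  # e_j from Eq. 4


# Hypothetical two-class toy data; the last pattern is mislabeled noise.
samples = [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((5, 6), "b"), ((0.5, 0.5), "b")]
print(loo_1nn_error(samples, [1, 1, 1, 1, 1]))  # all patterns kept
print(loo_1nn_error(samples, [1, 1, 1, 1, 0]))  # noisy pattern excluded
```

A lower value means a fitter chromosome, so a GA that minimizes this quantity tends to drop patterns that mislead their neighbors, which matches the size reduction and good fitness attributed to the evolved subsamples.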




2008 IEEE Congress on Evolutionary Computation (CEC 2008)
V. FUSION METHOD

The most popular method for combining the decisions is majority voting [8]. Let w_j be the weight of the j-th classifier D_j; then the final output of the ensemble, where the majority voting takes a linear combination of the classifiers, is computed as:

    r = \sum_{j=1}^{L} w_j D_j(x)                                  (6)

where ∀j, w_j ≥ 0 and \sum_{j=1}^{L} w_j = 1.

If each classifier just provides the class of the input pattern x, then one can only have simple majority voting, where all classifiers have equal weight w_j = 1/L.

An important issue that has strongly attracted the attention of many researchers is the error rate associated with the simple voting method and with the individual components of an ensemble. Assuming that each one of the combined classifiers has an error rate below 50%, Hansen and Salamon [1] show that the accuracy of the ensemble improves when more components are added to the system; however, this assumption is not always fulfilled. In this context, Matan [18] asserts that in some cases simple voting might perform even worse than any of the members of the ensemble. Thus some weighting method can be employed in order to partially overcome these difficulties.

A weighted voting method has the potential to make the ensemble more robust to the choice of the number of individual classifiers. Two general approaches to weighting can be distinguished: dynamic weighting and static weighting of classifiers. In the dynamic strategy, the weights assigned to the individual classifiers can change for each test pattern. In static weighting, by contrast, the weights are computed for each classifier in the training phase and are kept constant during the classification of the test patterns.

If the classifiers can also supply additional information, then their votes can be weighted [14], [13], for example, by a function of their distance to the input pattern (d_j).

In this work, we use a weighting function for classifier ensembles. Dudani [4] proposes diverse methods for weighting the k-NN rule. The votes of the k nearest neighbors are weighted by a function of their distances to the test pattern. In his original proposal, a neighbor with a smaller distance is weighted more heavily than one with a greater distance. Based on that work, we use the inverse distance to weight the individual components of the ensemble. The inverse distance can be expressed as follows:

    w(D_j) = \frac{1}{d_j} \quad \text{if } d_j \neq 0             (7)

VI. EXPERIMENTAL RESULTS

In this section, we present the results of the experiments carried out over eight data sets taken from the UCI Machine Learning Database Repository (http://www.ics.uci.edu/~mlearn). We adopted a 5-fold cross-validation process: for each data set, 5-fold cross-validation was used to estimate the average predictive accuracy and processing time, with 80% of the patterns for training and 20% for the test set.

Each classifier ensemble consists of five individual classifiers (L = 5). The unique classifier used for training all subsets is a 1-NN decision rule. The ensembles have been constructed through a class-dependent (or stratified) resampling method [12] using Bagging (E1) and Boosting (E2), and two GA configurations: without reduction (E3) and with 20% reduction (E4).

The importance of analyzing different methods for obtaining diversity in ensembles comes from the fact that, with these, it is feasible to establish an appropriate policy for selecting the most suitable method for constructing classifier ensembles. Table I reports the average accuracy (and standard deviations in the second row) obtained with the different ensembles together with the simple majority voting and the dynamic weighting method described in Section V.

From the results in Table I, we can sketch some comments. First, all ensembles provide similar performances, showing a slight improvement over the average accuracy of the single classifier. Second, when focusing on the diversity methods, the GA strategies provide slightly better results than the resampling methods. Third, among the GA schemes, the GA without reduction seems to be the method with the highest performance, although in general the differences are not statistically significant. Comparing the different voting strategies, the best results correspond to the inverse distance.

Taking into account these preliminary results, it is possible to conclude that using either strategy, resampling or GA methods, has been a good decision, above all when these methods show little statistical dependence. To analyze the differences between the strategies, we apply the resampled paired t test [7]. Under the null hypothesis, this statistic has a t distribution with n − 1 degrees of freedom. For five trials, the null hypothesis can be rejected if |t| > t_{4,0.975} = 2.776.

One of the purposes of the t analysis is to identify the degree of statistical difference between two methods (Type I error). A Type I error occurs when the null hypothesis is true (i.e., there is no difference between the two methodologies) and the test rejects the null hypothesis.

Figure 1 shows the comparison between the different resampling methods tested in this work. This comparison was carried out using different combinations of pairs of methods. The combinations studied are shown at the top left of the figure. The observed t-value axis gives the values of the t statistic on which the null hypothesis can be rejected.

We can see that with all of these combinations the null hypothesis is rejected, because all classifiers have values outside the marked interval, which are significantly different from the reference value. Regarding the processing time, Figure 2 shows the time (in minutes) obtained with both types of GA ensembles.
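The paired t statistic used in this section, with the critical value 2.776 quoted for five trials, can be sketched as follows. This is only an illustration of the standard paired t statistic over per-trial accuracy differences: the function name and the per-fold accuracies are hypothetical and are not results from the paper.

```python
import math
import statistics

# Two-sided 95% critical value for 4 degrees of freedom, as quoted in the text.
T_CRIT_DF4 = 2.776


def paired_t_statistic(acc_a, acc_b):
    """Paired t statistic over per-trial accuracy differences of two methods."""
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd_d == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean_d * math.sqrt(n) / sd_d


# Hypothetical per-fold accuracies of two ensembles over five trials.
method_a = [0.90, 0.92, 0.91, 0.93, 0.94]
method_b = [0.89, 0.90, 0.88, 0.89, 0.89]
t = paired_t_statistic(method_a, method_b)
print(f"t = {t:.3f}, reject H0: {abs(t) > T_CRIT_DF4}")
```

With n = 5 trials the statistic has n − 1 = 4 degrees of freedom, so the null hypothesis of equal performance is rejected only when |t| exceeds 2.776.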




TABLE I
AVERAGE ACCURACIES (AND STANDARD DEVIATIONS) WITH DIFFERENT RESAMPLING AND FUSION METHODS. VALUES IN BOLD TYPE DENOTE THE HIGHEST ACCURACY FOR EACH DATABASE

Database    Single  |        Simple Voting        |       Weighted Voting
                    |   E1     E2     E3     E4   |   E1     E2     E3     E4
Cancer       95.6   |  95.3   91.7   96.5   96.8  |  95.3   92.1   96.6   96.8
            (2.5)   |  (3.2)  (3.6)  (2.1)  (2.0) |  (3.2)  (2.6)  (1.9)  (2.0)
German       65.2   |  67.3   68.8   66.3   69.3  |  67.3   68.6   66.6   68.9
            (2.6)   |  (3.8)  (4.1)  (1.8)  (1.6) |  (3.7)  (3.4)  (1.8)  (1.9)
Heart        58.2   |  64.4   57.4   66.3   64.4  |  64.8   58.5   65.6   63.3
            (6.2)   |  (6.6)  (6.8)  (6.9)  (4.2) |  (5.9)  (7.2)  (6.8)  (4.8)
Liver        65.2   |  65.8   62.6   65.5   66.1  |  64.4   60.6   65.2   67.0
            (4.8)   |  (6.6)  (4.5)  (4.0)  (6.1) |  (7.0)  (5.2)  (3.5)  (6.8)
Sonar        82.0   |  73.7   69.8   82.4   75.1  |  77.6   73.2   83.4   76.1
            (9.4)   | (11.0) (10.0) (11.2)  (8.5) |  (7.8) (11.3) (10.0)  (8.2)
Vehicle      64.2   |  62.2   61.9   64.4   61.4  |  63.4   62.9   64.7   62.1
            (1.8)   |  (3.0)  (4.4)  (2.6)  (3.8) |  (2.5)  (3.4)  (2.5)  (4.2)
Wine         72.4   |  65.9   70.0   72.9   72.9  |  65.9   65.9   72.9   71.8
            (3.4)   | (11.7)  (4.4)  (4.4)  (4.4) | (11.5)  (4.9)  (5.7)  (3.4)
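As a rough cross-check of the comparison drawn from Table I, the simple-voting accuracies can be averaged over the seven listed databases. The numbers below are copied from the table; taking an unweighted mean across databases is an assumption of this sketch, not an aggregation reported by the authors.

```python
# Simple-voting accuracies from Table I, in the order:
# Cancer, German, Heart, Liver, Sonar, Vehicle, Wine.
simple_voting = {
    "E1": [95.3, 67.3, 64.4, 65.8, 73.7, 62.2, 65.9],  # Bagging
    "E2": [91.7, 68.8, 57.4, 62.6, 69.8, 61.9, 70.0],  # Boosting
    "E3": [96.5, 66.3, 66.3, 65.5, 82.4, 64.4, 72.9],  # GA without reduction
    "E4": [96.8, 69.3, 64.4, 66.1, 75.1, 61.4, 72.9],  # GA with 20% reduction
}

# Unweighted mean accuracy per ensemble (an assumption of this sketch).
means = {name: sum(values) / len(values) for name, values in simple_voting.items()}
for name, value in means.items():
    print(f"{name}: {value:.1f}")

best = max(means, key=means.get)
print("highest mean accuracy:", best)
```

Under this simple aggregation the two GA configurations rank above Bagging and Boosting, in line with the observation that the GA strategies give slightly better results than the resampling methods.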



The benefit of using a GA with reduction is clearly observed when a bigger database is used. In Figure 2, important differences in processing time are observed with the Vehicle and German databases.

Fig. 1. Test comparisons of four resampling methods. [Plot: observed t value (0 to 2.5) per database (Cancer, German, Wine, Heart, Liver, Sonar, Vehicle) for the pairs E3 vs E4, E1 vs E2, E1 vs E3, E1 vs E4, E2 vs E3 and E2 vs E4.]

Fig. 2. Processing time of the ensembles based on Genetic Algorithms. [Plot: minutes (0 to 25) per database (Cancer, Wine, Heart, Liver, Sonar, Vehicle, German) for E3 and E4.]

VII. CONCLUSIONS

This paper analyzes the behavior of four diversity methods for building ensembles and the overall accuracy obtained with an ensemble of five individual classifiers using majority voting with two schemes: weighted and non-weighted voting. A t test was also used to validate the statistical dependence of the resampling methods.

From the experiments carried out, it seems that, in general, the GA provides better levels of accuracy than the resampling methods. We also showed that the method for reducing the computational cost of the GA, in which some genes are ignored, does not substantially affect the precision of the ensemble. With respect to the voting scheme, the weighting measure employed brings less benefit.

Future work will address other reduction methods for the GA and the validation of the proposal with other resampling methods and other weighting measures. More comparisons on various problems from the UCI repository, covering the treatment of imbalance among classes, the dimensionality and the noisy patterns contained in the database, will be developed as soon as possible.

REFERENCES
[1] L.K. Hansen, P. Salamon: Neural network ensembles. IEEE Trans. on Pattern Analysis and Machine Intelligence 12 (1990) 993–1001.
[2] B.E. Banfield, L.O. Hall, K.W. Bowyer, W.P. Kegelmeyer Jr.: A new ensemble diversity measure applied to thinning ensembles. In: Proc. 4th Intl. Workshop on Multiple Classifier Systems, Guildford, UK (2003) 306–316.
[3] L. Breiman: Bagging predictors. Machine Learning 24 (1996) 123–140.
[4] S.A. Dudani: The distance weighted k-nearest neighbor rule. IEEE Trans. on Systems, Man and Cybernetics 6 (1976) 325–327.




[5] Y. Freund, R.E. Schapire: Experiments with a new boosting algorithm. In: Proc. 13th Intl. Conference on Machine Learning, Morgan Kaufmann (1996) 148–156.
[6] T.G. Dietterich: Machine learning research: four current directions. AI Magazine 18 (1997) 97–136.
[7] J. Demšar: Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of Machine Learning Research 7 (2006) 1–30.
[8] L.I. Kuncheva, K.R. Kountchev: Generating classifier outputs of fixed accuracy and diversity. Pattern Recognition Letters 23 (2002) 593–600.
[9] B.V. Dasarathy: Nearest Neighbor Norms: NN Pattern Classification Techniques. IEEE Computer Society Press, Los Alamos, CA (1991).
[10] L.I. Kuncheva, C.J. Whitaker: Measures of diversity in classifier ensembles. Machine Learning 51 (2003) 181–207.
[11] A. Narasimhamurthy: Evaluation of diversity measures for binary classifier ensembles. In: Proc. 6th Intl. Workshop on Multiple Classifier Systems, Seaside, CA (2005) 13–15.
[12] R.M. Valdovinos, J.S. Sánchez: Class-dependant resampling for medical applications. In: Proc. 4th Intl. Conf. on Machine Learning and Applications, Los Angeles, CA (2005) 351–356.
[13] N. Wanas, M. Kamel: Weighted combining of neural network ensembles. In: Proc. Intl. Joint Conf. on Neural Networks, Vol. 2, Honolulu, HI (2002) 1748–1752.
[14] F. Xue, R. Subbu, P. Bonissone: Locally weighted fusion of multiple predictive models. In: Proc. IEEE World Congress on Computational Intelligence, Vancouver, BC, Canada (2006).
[15] Z. Zhou, J. Wu, W. Tang, Z. Chen: Selectively Ensembling Neural Classifiers. In: Proc. of the International Joint Conference on Neural Networks, Honolulu (2002) 1411–1415.
[16] J. Holland: Adaptation in Natural and Artificial Systems. The University of Michigan Press (1975).
[17] K. Sirlantzis, M.C. Fairhurst, R.M. Guest: An Evolutionary Algorithm for Classifier and Combination Rule selection in Multiple Classifier Systems. In: Proc. 16th Conference on Pattern Recognition, Quebec, Canada 2 (2002) 771–774.
[18] H. Iba: Bagging, Boosting and Bloating in Genetic Programming. In: Proceedings of the Genetic and Evolutionary Computation Conference GECCO-99 (1999).



