Comparative Study of Genetic Algorithms and Resampling Methods for Ensemble Constructing

R.I. Diaz, R.M. Valdovinos, J.H. Pacheco

Ricardo I. Diaz and Juan H. Pacheco are with the Pattern Recognition Group, Instituto Tecnológico of Toluca, Metepec, México (email: ricardo.diazg@hotmail.com, hpacheco@ittoluca.edu.mx). Rosa M. Valdovinos is with the Applied Computing group, Centro Universitario Valle de Chalco, Universidad Autónoma del Estado de México, Valle de Chalco, México (e-mail: li rmvr@hotmail.com, tel: (+52)55-59714940).
This work has been partially supported by grants 51626 and 67407 from the Mexican CONACYT.
978-1-4244-1823-7/08/$25.00 © 2008 IEEE

Abstract: Diversity and accuracy among the members of a classifier ensemble are two of the main issues to take into account in its construction and operation. Resampling has been the most widely used strategy for constructing ensembles; however, the subsamples obtained here take into account both diversity and high accuracy. In this work, two different strategies for constructing ensembles with those characteristics are analyzed: resampling methods such as Bagging and Boosting, and an evolutionary strategy, Genetic Algorithms. Using a dynamic weighting scheme, the Genetic Algorithm strategy demonstrated its effectiveness in searching for the best solution to the problem. In addition, we introduce further modifications in order to reduce the processing time of the Genetic Algorithm. All of them are studied specifically in the framework of the Nearest Neighbour classification algorithm.

I. INTRODUCTION

An ensemble is a learning paradigm in which several classifiers are combined in order to generate a single classification result. Let $D = \{D_1, \dots, D_L\}$ be a set of $L$ classifiers, and $\Omega = \{\omega_1, \dots, \omega_c\}$ be a set of $c$ classes. Each classifier $D_i$ ($i = 1, \dots, L$) gets as input a feature vector $x \in \mathbb{R}^d$ and assigns it to one of the $c$ problem classes.

It is widely accepted that the major factor for better accuracy is the diversity among the classifiers to be combined; that is, they must differ in their decisions so as to complement each other [10], [2], [11]. To obtain diversity, there are many distinct techniques for constructing classifier ensembles [6]. One consists of using different classifiers over a single training set; in this case, the classifiers themselves must be different enough to produce diverse decisions. Another consists of manipulating (or resampling) the data set on which the classifiers are trained. Under this scenario, all classifiers should be based upon the same technique, e.g., a k-Nearest Neighbor (k-NN) classifier.

Among the resampling methods, the Bagging [3] and Boosting [5] algorithms are widely used. These methods use random selection with replacement; thus, the subsamples contain many redundant patterns.

On the other hand, recent investigations suggest improving the resampling techniques using genetic programming or Genetic Algorithms (GA) [16]. Iba [18] optimizes the Bagging and Boosting algorithms by dividing the training set into small subsamples, which are then improved using Bagging and Boosting. With these methods, called BagGP and BoostGP, the trees can reduce the nodes generated after several generations. Other approaches propose mixing GA with other techniques for ensemble design. Zhou et al. [15] propose the GASEN method, which trains several neural networks and uses a GA to select an optimal subgroup of them for constructing the ensemble. Sirlantzis et al. [17] used a GA to select both the individual classifiers for the ensemble and the combination rules.

Along these lines, part of our methodology consists of building subsamples with a GA whose chromosome marks each training sample with 1 for inclusion or 0 for exclusion. The subsample formed by the GA thus has three properties: size reduction, diversity and good fitness. The ensemble formed with these subsamples uses two decision fusion strategies: weighted and non-weighted.

This study is mainly focused on realizing not only the advantages of ensembles but also those of the GA. Thus, with GA methods we find the best individual classifiers of the ensemble. Furthermore, we are interested in empirically comparing the behavior of this approach with that of constructing ensembles by resampling methods.

The rest of the paper is organized as follows. Section II describes the resampling methods evaluated in our study. Section III reviews the main concepts of the genetic algorithm used. Section IV covers the nearest neighbor classifier. Section V introduces the voting schemes used. The experimental results are discussed in Section VI. Finally, Section VII gives the main conclusions and points out possible directions for future research.

II. RESAMPLING METHODS

Selection of patterns with replacement is the main characteristic of the resampling methods used for classifier ensembles. In this section, we briefly describe two of the most widely used methods.

A. Bagging

Bagging (Bootstrap Aggregating) [3] is the simplest and earliest resampling method. This algorithm employs bootstrap sampling to generate several subsamples, each obtained by randomly drawing, with replacement, $m$ examples from the original training set (also of size $m$). The individual predictions are usually combined by majority voting. Note that many of the original instances may be repeated in the resulting subsample while others may be left out.
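The bootstrap sampling and majority voting described for Bagging can be illustrated with a minimal sketch. The function names `bootstrap_subsamples` and `majority_vote`, and the fixed random seed, are assumptions made for illustration only; they are not taken from the paper.

```python
import random
from collections import Counter


def bootstrap_subsamples(training_set, n_subsamples, seed=0):
    """Draw n_subsamples bootstrap subsamples, each of size m = len(training_set).

    Sampling is with replacement, so an instance may appear several times
    in one subsample while other instances are left out.
    The seed parameter is an illustrative assumption for reproducibility.
    """
    rng = random.Random(seed)
    m = len(training_set)
    return [
        [training_set[rng.randrange(m)] for _ in range(m)]
        for _ in range(n_subsamples)
    ]


def majority_vote(labels):
    """Return the class label that received the most votes."""
    return Counter(labels).most_common(1)[0][0]
```

For example, `bootstrap_subsamples(list(range(10)), 5)` yields five lists of ten elements each, and `majority_vote` would be applied to the labels predicted by the classifiers trained on them.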
Briefly, the bagging method generates $L$ bootstrap subsamples $\{S_1, S_2, \dots, S_L\}$ of size $m$ from the original training set $T$ and creates the corresponding $L$ base classifiers. The output produced by the ensemble is the class label with the majority of votes.

B. Boosting

Boosting and its main variant AdaBoost (Adaptive Boosting) [5] sequentially generate a series of individual classifiers, where the training instances wrongly predicted by previous base classifiers are picked more often than examples correctly classified. In general, every variant of boosting attempts to produce new classifiers that better predict the examples on which the current ensemble fails. This is done to minimize the expected error.

AdaBoost generates $L$ subsamples $S_1, S_2, \dots, S_L$ using a weight for each one of the $m$ instances, and thus $L$ individual classifiers $D_1, D_2, \dots, D_L$ are built. At each stage $l$ ($l = 1, \dots, L$), the weight $W_l(i)$ defines the probability of adding the instance $x_i$ to subsample $S_l$ and represents the "difficulty" of predicting that instance by the previously created base classifier $D_{l-1}$. Initially, the probability of picking each instance is set to $W_1(i) = 1$, and then the weights are modified at each step $1 < l \le L$.

There are two ways of determining the subsamples employed in AdaBoost. The first one picks a set of examples based on the probabilities of the instances (this probability depends on how often that example was misclassified by the previous classifiers). With this strategy, "difficult" instances are likely to appear more than once in the next subsample.

In the second one, the implementation simply uses all the instances together with the weights corresponding to their probabilities. At each step $l$, the corresponding base classifier $D_l$ is built and then we compute its error $E_l$ using the weights $W_l$ as

$$E_l = \sum_{i=1}^{N} W_l(i)\,(1 - y_{i,l}) \tag{1}$$

where $y_{i,l} = 1$ if $D_l$ produces the correct label for $x_i$, and $y_{i,l} = 0$ otherwise.

The algorithm stops when $E_l \ge 0.5$. Otherwise, it computes a coefficient $\beta_l = E_l / (1 - E_l)$, which is used in the weighted voting of the ensemble and also to update the weights of the individual instances as follows:

$$W_{l+1}(i) = W_l(i)\,\beta_l^{\,1 - y_{i,l}} \tag{2}$$

The final classification produced for a new sample $x$ is given by weighted voting among the $L$ base classifiers, where the weight of each base classifier depends on its performance on the subsample used for building it. The final decision of the ensemble $D$ for the sample $x$ corresponds to the class label $c$ with the maximal support according to:

$$D(x) = \arg\max_{c \in \Omega} \sum_{l:\,D_l(x) = c} \log \frac{1}{\beta_l} \tag{3}$$

III. GENETIC ALGORITHM

The most basic structure of the GA, proposed by Holland [16], begins with a set of possible solutions (the population), each encoded as a string of bits (called a chromosome). A method for evaluating the behavior (fitness) of each chromosome is then used to determine the parents of the next population.

In our GA, an $m$-dimensional chromosome represents the whole training set of $m$ samples. This is accomplished with a binary encoding in which each training sample is marked as either considered (1) or not considered (0). The encoding is initialized at random. A solution (chromosome) is thus formed by all the training samples marked with 1s, while the training samples marked with 0s are not part of the subsample.

On the other hand, to reduce the processing time of the GA, in addition to the 0s, some chromosomes are reduced by 20%; that is to say, during the evolutionary process, several genes marked with a value different from 0 or 1 are ignored. The initial and subsequent populations (up to 30 epochs) consisted of 15 chromosomes.

Regarding the fitness function, the leave-one-out method was employed. For this purpose, the following function $e_j$ is defined for each solution:

$$e_j = \frac{1}{m} \sum_{x \in T} e(y, x) \tag{4}$$

where $m$ denotes the number of patterns in a training sample $T$, $x$ represents a training instance, $y$ is the nearest neighbor of $x$ in $T - \{x\}$, and $e(y, x)$ is defined as follows:

$$e(y, x) = \begin{cases} 0, & \text{if } L(y) = L(x); \\ 1, & \text{otherwise,} \end{cases} \tag{5}$$

where $L(x)$ is the class label of a pattern $x$, and $L(y)$ indicates the class label of a pattern $y$. Each individual solution (chromosome) is weighted according to the function in Eq. (4), using the error function just introduced.

An elitist method then selects the best solutions in each step and uses these chromosomes to apply the genetic operators: crossover and mutation. The former is uniform crossover, and the latter randomly changes 10% of the genes in each chromosome. An important aspect is that the best solutions are not included in the next epoch. When the evolutionary process has finished, the best five solutions over all epochs are chosen to build the ensemble.

IV. NEAREST NEIGHBOR ENSEMBLE

The Nearest Neighbor (NN) rule [9] is one of the most celebrated algorithms in machine learning. In recent years, interest in these methods has flourished again in several fields of science; due to their conceptual simplicity and to an asymptotic error rate conveniently bounded in terms of the optimal Bayes error, they have proved to be powerful non-parametric classification systems in real-world problems. In its classical form, given a set of $n$ previously labeled prototypes or training sample (TS), this classifier assigns a given sample to the class indicated by the label of the closest prototype in the TS.
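The binary chromosome of Section III and its leave-one-out fitness (Eqs. 4 and 5) under the 1-NN rule can be sketched as follows. This is only an illustration under stated assumptions: samples are `(feature_vector, label)` pairs, distance is Euclidean, and a subsample with fewer than two patterns is given the worst error. The paper does not specify the last point, and all function names are hypothetical.

```python
import math
import random


def random_chromosome(m, rng):
    """Random binary chromosome: gene i is 1 (include sample i) or 0 (exclude it)."""
    return [rng.randint(0, 1) for _ in range(m)]


def select_subsample(samples, chromosome):
    """Keep only the training samples whose gene is marked with 1."""
    return [s for s, gene in zip(samples, chromosome) if gene == 1]


def leave_one_out_error(subsample):
    """Leave-one-out 1-NN error: fraction of patterns whose nearest neighbor
    among the remaining patterns carries a different class label."""
    m = len(subsample)
    if m < 2:
        return 1.0  # assumption: no neighbor available, treat as worst case
    errors = 0
    for i, (x, label_x) in enumerate(subsample):
        # nearest neighbor y of x in the subsample without x itself
        nearest = min(
            (s for j, s in enumerate(subsample) if j != i),
            key=lambda s: math.dist(s[0], x),
        )
        if nearest[1] != label_x:
            errors += 1
    return errors / m
```

A lower value of `leave_one_out_error(select_subsample(samples, chromosome))` corresponds to a fitter chromosome; selection, uniform crossover and mutation would then operate on the chromosomes themselves.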
2008 IEEE Congress on Evolutionary Computation (CEC 2008) 4181
V. FUSION METHOD

The most popular method for combining decisions is majority voting [8]. Let $w_j$ be the weight of the $j$-th classifier $D_j$; then the final output of the ensemble, where the majority voting takes a linear combination of the classifiers, is computed as:

$$r = \sum_{j=1}^{L} w_j D_j(x) \tag{6}$$

where $\forall j,\ w_j \ge 0$ and $\sum_{j=1}^{L} w_j = 1$.

If each classifier just provides the class of the input pattern $x$, then one can only use simple majority voting, where all classifiers have equal weight $w_j = 1/L$.

An important issue that has strongly attracted the attention of many researchers is the error rate associated with the simple voting method and with the individual components of an ensemble. Assuming that each one of the combined classifiers has an error rate below 50%, Hansen and Salamon [1] show that the accuracy of the ensemble improves when more components are added to the system; however, this assumption is not always fulfilled. In this context, Matan [18] asserts that in some cases simple voting might perform even worse than any of the members of the ensemble. Thus, some weighting method can be employed in order to partially overcome these difficulties.

A weighted voting method has the potential to make the ensemble more robust to the choice of the number of individual classifiers. Two general approaches to weighting can be distinguished: dynamic weighting and static weighting of classifiers. In the dynamic strategy, the weights assigned to the individual classifiers can change for each test pattern. In contrast, in static weighting, the weights are computed for each classifier in the training phase and kept constant during the classification of the test patterns.

If the classifiers can also supply additional information, then their votes can be weighted [14], [13], for example, by a function of their distance to the input pattern ($d_j$).

In this work, we use a weighting function for classifier ensembles. Dudani [4] proposes several methods for weighting the k-NN rule: the votes of the k nearest neighbors are weighted by a function of their distances to the test pattern. In his original proposal, a neighbor at a smaller distance is weighted more heavily than one at a greater distance. Based on that work, we use the inverse distance to weight the individual components of the ensemble. The inverse distance can be expressed as follows:

$$w(D_j) = \frac{1}{d_j} \quad \text{if } d_j \neq 0 \tag{7}$$

VI. EXPERIMENTAL RESULTS

In this section, we present the results of the experiments carried out over eight data sets taken from the UCI Machine Learning Database Repository (http://www.ics.uci.edu/~mlearn). We adopted a 5-fold cross-validation process: for each data set, 5-fold cross-validation was used to estimate the average predictive accuracy and processing time, with 80% of the patterns used for training and 20% for testing.

Each classifier ensemble consists of five individual classifiers ($L = 5$). The single classifier used for training on all subsets is the 1-NN decision rule. The ensembles have been constructed through a class-dependent (or stratified) resampling method [12] using Bagging (E1) and Boosting (E2), and two GA configurations: without reduction (E3) and with 20% reduction (E4).

The importance of analyzing different methods for obtaining diversity in ensembles comes from the fact that, with these, it is feasible to establish an appropriate policy for selecting the most suitable method for constructing classifier ensembles. Table I reports the average accuracy (with standard deviations in the second row) obtained with the different ensembles, together with simple majority voting and the dynamic weighting method described in Section V.

From the results in Table I, we can make some comments. First, all ensembles provide similar performance, showing a slight improvement over the average accuracy of the single classifier. Second, when focusing on the diversity methods, the GA strategies provide slightly better results than the resampling methods. Third, among the GA schemes, the GA without reduction seems to be the method with the highest performance, although in general the differences are not statistically significant. Comparing the different voting strategies, the best results correspond to the inverse distance.

Taking into account these preliminary results, it is possible to conclude that it was a good decision to use either strategy, resampling or GA methods, above all when these methods have little statistical dependence. To analyze the differences between the strategies, we apply the resampled paired t test [7]. Under the null hypothesis, this statistic has a t distribution with $n - 1$ degrees of freedom. For five trials, the null hypothesis can be rejected if $|t| > t_{4,\,0.975} = 2.776$.

One of the purposes of the t analysis is to identify the degree of statistical difference between two methods (Type I error). A Type I error occurs when the null hypothesis is true (i.e., there is no difference between the two methodologies) and the resampled test rejects the null hypothesis.

Figure 1 shows the comparison between the different resampling methods tested in this work. This comparison was carried out over different pairs of the methods. The combinations studied are shown at the top left of the figure. The observed t-value axis gives the values of the t statistic, from which the null hypothesis can be rejected.

We can see that with all of these combinations the null hypothesis is rejected, because all classifiers have values outside the marked interval, which are significantly different from the reference value. Regarding the processing time, Figure 2 shows the time (in minutes) obtained with both types of GA ensembles.
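The paired t test used to compare two strategies can be sketched as below. The sketch assumes the usual form of the paired t statistic over the $n$ per-trial accuracy differences, since the paper does not write the formula out. The function name is hypothetical, and the threshold 2.776 is the one quoted in the text for five trials.

```python
import math


def paired_t_statistic(acc_a, acc_b):
    """Paired t statistic over per-trial accuracies of two methods.

    With n trials the statistic is compared against a t distribution
    with n - 1 degrees of freedom.
    """
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if variance == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean * math.sqrt(n) / math.sqrt(variance)


# Critical value quoted in the text for five trials (4 degrees of freedom).
T_CRITICAL_5_TRIALS = 2.776
```

As a usage illustration with made-up accuracies (not results from the paper), `paired_t_statistic([0.9, 0.8, 0.85, 0.95, 0.9], [0.8, 0.75, 0.8, 0.85, 0.8])` gives a value above `T_CRITICAL_5_TRIALS`, so the null hypothesis of no difference would be rejected for those hypothetical numbers.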
TABLE I
AVERAGE ACCURACIES (AND STANDARD DEVIATIONS) WITH DIFFERENT RESAMPLING AND FUSION METHODS. VALUES IN BOLD TYPE DENOTE THE HIGHEST ACCURACY FOR EACH DATABASE

                         Simple Voting               Weighted Voting
Database  Single     E1     E2     E3     E4     E1     E2     E3     E4
Cancer      95.6   95.3   91.7   96.5   96.8   95.3   92.1   96.6   96.8
             2.5    3.2    3.6    2.1    2.0    3.2    2.6    1.9    2.0
German      65.2   67.3   68.8   66.3   69.3   67.3   68.6   66.6   68.9
             2.6    3.8    4.1    1.8    1.6    3.7    3.4    1.8    1.9
Heart       58.2   64.4   57.4   66.3   64.4   64.8   58.5   65.6   63.3
             6.2    6.6    6.8    6.9    4.2    5.9    7.2    6.8    4.8
Liver       65.2   65.8   62.6   65.5   66.1   64.4   60.6   65.2   67.0
             4.8    6.6    4.5    4.0    6.1    7.0    5.2    3.5    6.8
Sonar       82.0   73.7   69.8   82.4   75.1   77.6   73.2   83.4   76.1
             9.4   11.0   10.0   11.2    8.5    7.8   11.3   10.0    8.2
Vehicle     64.2   62.2   61.9   64.4   61.4   63.4   62.9   64.7   62.1
             1.8    3.0    4.4    2.6    3.8    2.5    3.4    2.5    4.2
Wine        72.4   65.9   70.0   72.9   72.9   65.9   65.9   72.9   71.8
             3.4   11.7    4.4    4.4    4.4   11.5    4.9    5.7    3.4
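The two fusion schemes compared in Table I, simple voting (equal weights $1/L$) and inverse-distance weighted voting (Eqs. 6 and 7), can be sketched as follows. The paper does not say how a zero distance is handled, so the `eps` floor is an assumption made only to keep the sketch runnable. The function names are hypothetical, and $d_j$ is taken to be the distance from the test pattern to the nearest neighbor found by classifier $D_j$.

```python
from collections import defaultdict


def simple_vote(predictions):
    """Simple majority voting: every classifier has the same weight 1/L."""
    support = defaultdict(float)
    for label in predictions:
        support[label] += 1.0 / len(predictions)
    return max(support, key=support.get)


def inverse_distance_vote(predictions, distances, eps=1e-12):
    """Weighted voting with w(D_j) = 1 / d_j.

    The weights are left unnormalized; dividing them by their sum so that
    they add up to 1 would not change which label wins.
    eps is an assumed floor for d_j = 0, a case the paper leaves open.
    """
    support = defaultdict(float)
    for label, d in zip(predictions, distances):
        support[label] += 1.0 / max(d, eps)
    return max(support, key=support.get)
```

With predictions `["a", "a", "b"]` and distances `[2.0, 2.0, 0.5]`, simple voting returns `"a"`, while the inverse-distance scheme returns `"b"`, because the single close classifier outweighs the two distant ones. This shows how the weights depend on each test pattern (dynamic weighting).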
The benefit of using a GA with reduction is clearly observed when bigger databases are used. In Figure 2, important differences in processing time are observed for the Vehicle and German databases.

[Figure 1: observed t value (y-axis) per database (x-axis: Cancer, German, Wine, Heart, Liver, Sonar, Vehicle) for the pairs E1 vs E2, E1 vs E3, E1 vs E4, E2 vs E3, E2 vs E4 and E3 vs E4]
Fig. 1. Test comparisons of four resampling methods

[Figure 2: processing time in minutes (y-axis) of E3 and E4 per database (x-axis: Cancer, Wine, Heart, Liver, Sonar, Vehicle, German)]
Fig. 2. Processing time of the ensembles based on Genetic Algorithms

VII. CONCLUSIONS

This paper analyzes the behavior of four diversity methods for building ensembles and the overall accuracy obtained with an ensemble of five individual classifiers using majority voting with two schemes: weighted and non-weighted voting. In addition, a t test was used to validate the statistical dependence of the resampling methods.

From the experiments carried out, it seems that, in general, the GA provides better levels of accuracy than the resampling methods. We also showed that the method for reducing the computational cost of the GA, in which some genes are ignored, does not substantially affect the precision of the ensemble. With the measure employed for the voting scheme, less benefit is obtained.

Future work will be directed at employing other reduction methods for the GA and at validating the proposal with other resampling methods using other weighting measures. More comparisons on various problems from the UCI repository, such as the treatment of imbalance among classes, the dimensionality and the noisy patterns contained in the database, will be carried out as soon as possible.

REFERENCES

[1] L.K. Hansen, P. Salamon: Neural network ensembles. IEEE Trans. on Pattern Analysis and Machine Intelligence 12 (1990) 993–1001.
[2] B.E. Banfield, L.O. Hall, K.W. Bowyer, W.P. Kegelmeyer Jr.: A new ensemble diversity measure applied to thinning ensembles. In: Proc. 4th Intl. Workshop on Multiple Classifier Systems, Guildford, UK (2003) 306–316.
[3] L. Breiman: Bagging predictors. Machine Learning 26 (1996) 123–140.
[4] S.A. Dudani: The distance weighted k-nearest neighbor rule. IEEE Trans. on Systems, Man and Cybernetics 6 (1976) 325–327.
[5] Y. Freund, R.E. Schapire: Experiments with a new boosting algorithm. In: Proc. 13th Intl. Conference on Machine Learning, Morgan Kaufmann (1996) 148–156.
[6] T.G. Dietterich: Machine learning research: four current directions. AI Magazine 18 (1997) 97–136.
[7] J. Demsar: Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of Machine Learning Research 7 (2006) 1–30.
[8] L.I. Kuncheva, K.R. Kountchev: Generating classifier outputs of fixed accuracy and diversity. Pattern Recognition Letters 23 (2002) 593–600.
[9] B.V. Dasarathy: Nearest Neighbor Norms: NN Pattern Classification Techniques. IEEE Computer Society Press, Los Alamos, CA, 1991.
[10] L.I. Kuncheva, C.J. Whitaker: Measures of diversity in classifier ensembles. Machine Learning 51 (2003) 181–207.
[11] A. Narasimhamurthy: Evaluation of diversity measures for binary classifier ensembles. In: Proc. 6th Intl. Workshop on Multiple Classifier Systems, Seaside, CA (2005) 13–15.
[12] R.M. Valdovinos, J.S. Sánchez: Class-dependant resampling for medical applications. In: Proc. 4th Intl. Conf. on Machine Learning and Applications, Los Angeles, CA (2005) 351–356.
[13] N. Wanas, M. Kamel: Weighted combining of neural network ensembles. In: Proc. Intl. Joint Conf. on Neural Networks, Vol. 2, Honolulu, HI (2002) 1748–1752.
[14] F. Xue, R. Subbu, P. Bonissone: Locally weighted fusion of multiple predictive models. In: Proc. IEEE World Congress on Computational Intelligence, Vancouver, BC, Canada (2006).
[15] Z. Zhou, J. Wu, W. Tang, Z. Chen: Selectively Ensembling Neural Classifiers. In: Proc. of the International Joint Conference on Neural Networks, Honolulu (2002) 1411–1415.
[16] J. Holland: Adaptation in Natural and Artificial Systems. The University of Michigan Press (1975).
[17] K. Sirlantzis, M.C. Fairhurst, R.M. Guest: An Evolutionary Algorithm for Classifier and Combination Rule selection in Multiple Classifier Systems. In: Proc. 16th Conference on Pattern Recognition, Quebec, Canada 2 (2002) 771–774.
[18] H. Iba: Bagging, Boosting and Bloating in Genetic Programming. In: Proceedings of the Genetic and Evolutionary Computation Conference GECCO-99 (1999).