# 10_Bagging+boosting


Chapter 7: Ensemble Methods

Outline:
• Rationale
• Combining classifiers
• Bagging
• Boosting
## Rationale

• In any application we can use several learning algorithms, and hyperparameters affect the final learner.
• The No Free Lunch Theorem: no single learning algorithm always induces the most accurate learner in every domain.
• One option: try many and choose the one with the best cross-validation results.
• On the other hand …
  – Each learning model comes with a set of assumptions and thus a bias.
  – Learning is an ill-posed problem (finite data): each model converges to a different solution and fails under different circumstances.
  – Why not combine multiple learners intelligently? This may lead to improved results.
• Would it help to combine learners that always make similar decisions? No: the combined learners should be complementary.
• Ways to build an ensemble:
  – Different learning algorithms (L)
  – Same algorithm, different hyperparameters (P)
  – Different training data (D)
  – Different feature subsets (F)
• Why does it work?
• Suppose there are 25 base classifiers:
  – Each classifier has error rate ε = 0.35.
  – If the base classifiers are identical, the ensemble misclassifies exactly the examples that each base classifier predicts incorrectly.
  – If instead the classifiers are independent (their errors are uncorrelated), the ensemble makes a wrong prediction only if more than half of the base classifiers predict incorrectly.
  – Probability that the ensemble classifier makes a wrong prediction:

$$\sum_{i=13}^{25} \binom{25}{i}\, \varepsilon^{i}\, (1-\varepsilon)^{25-i} \approx 0.06$$
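As a quick check, the tail probability above can be computed directly; a minimal sketch using only the Python standard library (the numbers are the ones from the slide):

```python
from math import comb

eps = 0.35   # error rate of each base classifier
n = 25       # number of base classifiers

# The ensemble errs only when a majority (13 or more) of the
# independent base classifiers err on the same example.
p_ensemble_error = sum(
    comb(n, i) * eps**i * (1 - eps)**(n - i)
    for i in range(13, n + 1)
)
print(round(p_ensemble_error, 3))   # about 0.06
```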
## Works if …

• The base classifiers are independent.
• The base classifiers do better than random guessing (error < 0.5).
• In practice it is hard to make the base classifiers perfectly independent. Nevertheless, improvements have been observed in ensemble methods even when the base classifiers are slightly correlated.
• One important note:
  – When we generate multiple base learners, we want them to be reasonably accurate, but we do not require them to be very accurate individually; they are not, and need not be, optimized separately for best accuracy. The base learners are chosen for their simplicity, not for their accuracy.
## Combining classifiers
• Examples: classification trees and neural networks, several neural networks, several classification trees, etc.
• Average the results from the different models.
• Why?
  – Better classification performance than individual classifiers
  – More resilience to noise
• Why not?
  – Time consuming
  – Overfitting
## Why
• Besides avoiding the selection of the worst classifier under a particular hypothesis, fusion of multiple classifiers can improve on the performance of the best individual classifier.
• This is possible if the individual classifiers make “different” errors.
• For linear combiners, Tumer and Ghosh (1996) showed that averaging the outputs of individual classifiers with unbiased and uncorrelated errors can improve on the performance of the best individual classifier and, for an infinite number of classifiers, provides the optimal Bayes classifier (a small simulation follows below).
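As an illustration of that averaging result, here is a small simulation sketch (numpy assumed; the Gaussian noise model is an assumption made for this sketch, not something stated in the slides). It compares the mean squared error of a single predictor with unbiased, uncorrelated errors against the average of K such predictors:

```python
import numpy as np

rng = np.random.default_rng(0)
K, n = 10, 100_000            # number of models, number of test points
true_y = np.zeros(n)          # target taken as zero without loss of generality

# Each model's prediction = truth + unbiased, uncorrelated noise.
preds = true_y + rng.normal(scale=1.0, size=(K, n))

single_mse = np.mean((preds[0] - true_y) ** 2)               # close to 1.0
averaged_mse = np.mean((preds.mean(axis=0) - true_y) ** 2)   # close to 1/K
print(single_mse, averaged_mse)
```

With uncorrelated, zero-mean errors the variance of the average drops roughly as 1/K, which is the effect the linear-combiner analysis formalizes.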
## Different classifier architectures
• Serial
• Parallel
• Hybrid

[Architecture diagrams omitted from this text version.]
## Classifier fusion
• Fusion is useful only if the combined classifiers are mutually complementary.
• With a majority-vote fuser, the majority should always be correct.
## Complementary classifiers
• Several approaches have been proposed to build complementary classifiers, among others:
  – Using problem and designer knowledge
  – Injecting randomness
  – Varying the classifier type, architecture, or parameters
  – Manipulating the training data
  – Manipulating the features
## If you are interested …
• L. Xu, A. Krzyzak, C. Y. Suen, “Methods of Combining Multiple Classifiers and Their Applications to Handwriting Recognition”, IEEE Transactions on Systems, Man, and Cybernetics, 22(3), 1992, pp. 418-435.
• J. Kittler, M. Hatef, R. Duin, J. Matas, “On Combining Classifiers”, IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3), March 1998, pp. 226-239.
• D. Tax, M. van Breukelen, R. Duin, J. Kittler, “Combining Multiple Classifiers by Averaging or by Multiplying?”, Pattern Recognition, 33, 2000, pp. 1475-1485.
• L. I. Kuncheva, “A Theoretical Study on Six Classifier Fusion Strategies”, IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(2), 2002, pp. 281-286.
## Alternatively …
• Instead of training different classifiers on the same dataset, we can manipulate the training set: multiple training sets are created by resampling the original data according to some distribution, e.g., bagging and boosting.
## Bagging
• Breiman, 1996
• Derived from the bootstrap (Efron, 1993)
• Create classifiers using training sets that are bootstrapped from the original data (drawn with replacement).
• Average the results for each case.
Example: four bootstrapped training sets drawn from an original set of 8 examples.

| Original       | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|----------------|---|---|---|---|---|---|---|---|
| Training set 1 | 2 | 7 | 8 | 3 | 7 | 6 | 3 | 1 |
| Training set 2 | 7 | 8 | 5 | 6 | 4 | 2 | 7 | 1 |
| Training set 3 | 3 | 6 | 2 | 7 | 5 | 6 | 2 | 2 |
| Training set 4 | 4 | 5 | 1 | 4 | 6 | 4 | 3 | 8 |
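Training sets like the rows above can be generated by sampling with replacement; a minimal sketch (numpy assumed, seed and loop count arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
original = np.arange(1, 9)   # examples 1..8

# Draw 4 bootstrap training sets, each the same size as the original,
# sampling with replacement so duplicates and omissions both occur.
for k in range(4):
    boot = rng.choice(original, size=original.size, replace=True)
    print(f"Training set {k + 1}: {boot}")
```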
• Sampling (with replacement) according to a uniform probability distribution
  – Each bootstrap sample D has the same size as the original data.
  – Some instances may appear several times in the same training set, while others may be omitted.
• Build a classifier on each bootstrap sample D.
• Each data object has probability $1 - (1 - 1/n)^n$ of being selected in D, so D contains approximately 63% of the original data.
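A quick check of the 63% figure: the chance that a particular object is never drawn in $n$ draws with replacement is $(1 - 1/n)^n$, which tends to $e^{-1} \approx 0.368$ as $n$ grows; the chance of appearing at least once is therefore $1 - (1 - 1/n)^n \to 1 - e^{-1} \approx 0.632$ (already about 0.656 for $n = 8$).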
• Bagging improves generalization performance by reducing the variance of the base classifiers. The performance of bagging depends on the stability of the base classifier (see the sketch below).
  – If a base classifier is unstable, bagging helps to reduce the errors associated with random fluctuations in the training data.
  – If a base classifier is stable, bagging may not improve, and could even degrade, its performance.
• Bagging is less susceptible to model overfitting when applied to noisy data.
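A minimal sketch of bagging an unstable base learner with scikit-learn (assumed installed); the synthetic dataset and all parameters are illustrative, not from the slides:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Unstable base learner: an unpruned decision tree (high variance).
tree = DecisionTreeClassifier(random_state=0)

# Bagging: 50 trees, each fit on a bootstrap sample, predictions averaged.
bagged = BaggingClassifier(DecisionTreeClassifier(random_state=0),
                           n_estimators=50, random_state=0)

print("single tree :", cross_val_score(tree, X, y, cv=5).mean())
print("bagged trees:", cross_val_score(bagged, X, y, cv=5).mean())
```

Typically the bagged ensemble scores at least as well as the single deep tree, which illustrates the variance-reduction argument above.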
## Boosting
• Classifiers are produced sequentially.
• Each classifier depends on the previous one and focuses on the previous one's errors.
• Examples that are incorrectly predicted by previous classifiers are chosen more often or weighted more heavily.
• Records that are wrongly classified will have their weights increased.
• Records that are classified correctly will have their weights decreased.

| Original data      | 1 | 2 | 3 | 4  | 5 | 6 | 7 | 8  | 9 | 10 |
|--------------------|---|---|---|----|---|---|---|----|---|----|
| Boosting (Round 1) | 7 | 3 | 2 | 8  | 7 | 9 | 4 | 10 | 6 | 3  |
| Boosting (Round 2) | 5 | 4 | 9 | 4  | 2 | 5 | 1 | 7  | 4 | 2  |
| Boosting (Round 3) | 4 | 4 | 8 | 10 | 4 | 5 | 4 | 6  | 3 | 4  |

• Example 4 is hard to classify.
• Its weight is increased, so it is more likely to be chosen again in subsequent rounds (a resampling sketch follows below).
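Rounds like those in the table are produced by sampling with replacement using the current weights as selection probabilities; a minimal sketch (the weight values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
examples = np.arange(1, 11)

# Hypothetical weights after some round: example 4 has been misclassified
# repeatedly, so its weight (sampling probability) is larger.
weights = np.array([1, 1, 1, 4, 1, 1, 1, 1, 1, 1], dtype=float)
weights /= weights.sum()

next_round = rng.choice(examples, size=examples.size, replace=True, p=weights)
print(next_round)   # example 4 tends to appear several times
```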

• Boosting algorithms differ in terms of (1) how the weights of the training examples are updated at the end of each round, and (2) how the predictions made by each classifier are combined.
• Freund and Schapire, 1997 (AdaBoost)
• Ideas
  – Complex hypotheses tend to overfit.
  – Simple hypotheses may not explain the data well.
  – Combine many simple hypotheses into a complex one.
  – Issues: how to design the simple hypotheses, and how to combine them.
• Two approaches
  – Select examples according to the errors of the previous classifier (more representatives of misclassified cases are selected); this is the more common approach.
  – Weight the errors of the misclassified cases more heavily (all cases are incorporated, but with different weights); this does not work for some algorithms.
Example: training sets produced by boosting (compare with the bagging example above).

| Original       | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|----------------|---|---|---|---|---|---|---|---|
| Training set 1 | 2 | 7 | 8 | 3 | 7 | 6 | 3 | 1 |
| Training set 2 | 1 | 4 | 5 | 4 | 1 | 5 | 6 | 4 |
| Training set 3 | 7 | 1 | 5 | 8 | 1 | 8 | 1 | 4 |
| Training set 4 | 1 | 1 | 6 | 1 | 1 | 3 | 1 | 5 |
• Input:
  – Training samples S = {(xi, yi)}, i = 1, 2, …, N
  – Weak learner h
• Initialization
  – Each sample gets equal weight wi = 1/N
• For k = 1 … T
  – Train weak learner hk on the weighted sample set
  – Compute the classification error
  – Update the sample weights wi
• Output
  – Final model: a linear combination of the hk (a code sketch follows below)
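A minimal from-scratch sketch of this loop (discrete AdaBoost with decision stumps as the weak learner, labels assumed coded as -1/+1; scikit-learn is assumed only for the stump):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def adaboost(X, y, T=50):
    """Discrete AdaBoost sketch. y must be coded as -1/+1."""
    n = len(y)
    w = np.full(n, 1.0 / n)                # initial sample weights
    learners, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)    # train on the weighted sample
        miss = (stump.predict(X) != y).astype(float)
        err = np.dot(w, miss) / w.sum()     # weighted error of this round
        if err >= 0.5:                      # no better than random guessing
            break
        err = max(err, 1e-10)               # guard against a perfect stump
        alpha = np.log((1 - err) / err)
        w = w * np.exp(alpha * miss)        # boost weights of misclassified cases
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas


def adaboost_predict(learners, alphas, X):
    """Weighted majority vote: H(x) = sign(sum_k alpha_k * h_k(x))."""
    scores = sum(a * h.predict(X) for a, h in zip(alphas, learners))
    return np.sign(scores)
```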
## Some details
• Weak learner: its error rate is only slightly better than random guessing.
• Boosting: sequentially apply the weak learner to repeatedly modified versions of the data, producing a sequence of weak classifiers hk(x). The predictions of all the weak classifiers are combined through a weighted majority vote:

$$H(x) = \operatorname{sign}\Big[\sum_k \alpha_k\, h_k(x)\Big]$$

[Schematic (figure omitted): training samples → h1(x); reweighted samples → h2(x), h3(x), …, hT(x); the outputs are combined by the sign of the weighted sum.]
• For k = 1 to T
  – Fit a learner hk to the training data using weights wi
  – Compute the weighted error and the coefficient

$$\mathrm{err}_k = \frac{\sum_{i=1}^{N} w_i\, I\big(y_i \neq h_k(x_i)\big)}{\sum_{i=1}^{N} w_i}, \qquad \alpha_k = \log\frac{1 - \mathrm{err}_k}{\mathrm{err}_k}$$

  – Update the weights

$$w_i \leftarrow w_i \exp\big[\alpha_k\, I\big(y_i \neq h_k(x_i)\big)\big]$$
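A small worked example (the numbers are chosen for illustration, not taken from the slides): if round k gives err_k = 0.3, then α_k = log(0.7/0.3) ≈ 0.847, so every misclassified sample has its weight multiplied by e^0.847 ≈ 2.33, while correctly classified samples keep their weights; after this update the hard examples carry more influence in round k + 1.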


• The coefficient αk penalizes models that have poor accuracy.
• If any intermediate round produces an error rate higher than 50%, the weights are reverted back to 1/n and the resampling procedure is repeated.
• Because of its tendency to focus on training examples that are wrongly classified, the boosting technique can be quite susceptible to overfitting.

[Slides illustrating boosting for classification and for regression omitted.]
## Who is doing better?
• “Popular Ensemble Methods: An Empirical Study” by David Opitz and Richard Maclin
• Presents a comprehensive evaluation of both bagging and boosting on 23 datasets, using decision trees and neural networks.
## Classifier ensembles
• Neural networks are the basic classification method.
• An effective combining scheme is to simply average the predictions of the networks.
• An ideal ensemble consists of highly accurate classifiers that disagree as much as possible.
## Bagging vs. boosting
Training data: 1, 2, 3, 4, 5, 6, 7, 8

| Bagging training sets         | Boosting training sets        |
|-------------------------------|-------------------------------|
| Set 1: 2, 7, 8, 3, 7, 6, 3, 1 | Set 1: 2, 7, 8, 3, 7, 6, 3, 1 |
| Set 2: 7, 8, 5, 6, 4, 2, 7, 1 | Set 2: 1, 4, 5, 4, 1, 5, 6, 4 |
| Set 3: 3, 6, 2, 7, 5, 6, 2, 2 | Set 3: 7, 1, 5, 8, 1, 8, 1, 4 |
| Set 4: 4, 5, 1, 4, 6, 4, 3, 8 | Set 4: 1, 1, 6, 1, 1, 3, 1, 5 |

Error rates (%) on the 23 datasets (columns follow the legend of the original slide):

| Dataset | Single NN | NN ensemble | NN bagging | NN arcing | NN Ada | Decision tree | DT bagging | DT arcing | DT Ada |
|---|---|---|---|---|---|---|---|---|---|
| breast-cancer-w | 3.4 | 3.5 | 3.4 | 3.8 | 4 | 5 | 3.7 | 3.5 | 3.5 |
| credit-a | 14.8 | 13.7 | 13.8 | 15.8 | 15.7 | 14.9 | 13.4 | 14 | 13.7 |
| credit-g | 27.9 | 24.7 | 24.2 | 25.2 | 25.3 | 29.6 | 25.2 | 25.9 | 26.7 |
| diabetes | 23.9 | 23 | 22.8 | 24.4 | 23.3 | 27.8 | 24.4 | 26 | 25.7 |
| glass | 38.6 | 35.2 | 33.1 | 32 | 31.1 | 31.3 | 25.8 | 25.5 | 23.3 |
| heart-cleveland | 18.6 | 17.4 | 17 | 20.7 | 21.1 | 24.3 | 19.5 | 21.5 | 20.8 |
| hepatitis | 20.1 | 19.5 | 17.8 | 19 | 19.7 | 21.2 | 17.3 | 16.9 | 17.2 |
| house-votes-84 | 4.9 | 4.8 | 4.1 | 5.1 | 5.3 | 3.6 | 3.6 | 5 | 4.8 |
| hypo | 6.4 | 6.2 | 6.2 | 6.2 | 6.2 | 0.5 | 0.4 | 0.4 | 0.4 |
| ionosphere | 9.7 | 7.5 | 9.2 | 7.6 | 8.3 | 8.1 | 6.4 | 6 | 6.1 |
| iris | 4.3 | 3.9 | 4 | 3.7 | 3.9 | 5.2 | 4.9 | 5.1 | 5.6 |
| kr-vs-kp | 2.3 | 0.8 | 0.8 | 0.4 | 0.3 | 0.6 | 0.6 | 0.3 | 0.4 |
| labor | 6.1 | 3.2 | 4.2 | 3.2 | 3.2 | 16.5 | 13.7 | 13 | 11.6 |
| letter | 18 | 12.8 | 10.5 | 5.7 | 4.6 | 14 | 7 | 4.1 | 3.9 |
| promoters-936 | 5.3 | 4.8 | 4 | 4.5 | 4.6 | 12.8 | 10.6 | 6.8 | 6.4 |
| ribosome-bind | 9.3 | 8.5 | 8.4 | 8.1 | 8.2 | 11.2 | 10.2 | 9.3 | 9.6 |
| satellite | 13 | 10.9 | 10.6 | 9.9 | 10 | 13.8 | 9.9 | 8.6 | 8.4 |
| segmentation | 6.6 | 5.3 | 5.4 | 3.5 | 3.3 | 3.7 | 3 | 1.7 | 1.5 |
| sick | 5.9 | 5.7 | 5.7 | 4.7 | 4.5 | 1.3 | 1.2 | 1.1 | 1 |
| sonar | 16.6 | 15.9 | 16.8 | 12.9 | 13 | 29.7 | 25.3 | 21.5 | 21.7 |
| soybean | 9.2 | 6.7 | 6.9 | 6.7 | 6.3 | 8 | 7.9 | 7.2 | 6.7 |
| splice | 4.7 | 4 | 3.9 | 4 | 4.2 | 5.9 | 5.4 | 5.1 | 5.3 |
| vehicle | 24.9 | 21.2 | 20.7 | 19.1 | 19.7 | 29.4 | 27.1 | 22.5 | 22.9 |
[Figure slides omitted. They showed: boosting, arcing, and bagging of neural networks as a percentage of the original error rate, with standard deviations (white bar = 1 standard deviation); the corresponding plots for decision trees; composite error rates; neural-network bagging vs. a single network; neural networks vs. decision trees (box = reduction in error); and arcing vs. bagging.]
## Noise
• Noise hurts boosting the most.
## Conclusions
• Performance depends on the data and on the classifier.
• In some cases, ensembles can overcome the bias of the component learning algorithm.
• Bagging is more consistent than boosting.
• Boosting can give much better results on some data.
