10_Bagging+boosting

        Chapter 7: Ensemble Methods
       Ensemble Methods

•   Rationale
•   Combining classifiers
•   Bagging
•   Boosting
    – Ada-Boosting
            Rationale
• In any application, we can choose among several
  learning algorithms, and hyperparameters
  affect the performance of the final learner
• The No Free Lunch Theorem: no single
  learning algorithm always induces the most
  accurate learner in every domain
• Try many and choose the one with the best
  cross-validation results
              Rationale
• On the other hand …
  – Each learning model comes with a set of
    assumptions and thus a bias
  – Learning is an ill-posed problem (finite data):
    each model converges to a different solution
    and fails under different circumstances
  – Why not combine multiple learners
    intelligently, which may lead to improved
    results?
                  Rationale
• How about combining learners that always make
  similar decisions?
   – Advantages?
   – Disadvantages?

• Complementary?
• To build ensemble: Your suggestions?
   –   Different learning algorithms
   –   Same algorithm, different hyperparameters
   –   Different training data
   –   Different feature subsets
                  Rationale
• Why does it work?
• Suppose there are 25 base classifiers
   – Each classifier has an error rate ε = 0.35
   – If the base classifiers are identical, then the ensemble
     will misclassify the same examples predicted incorrectly
     by the base classifiers.
   – Assume classifiers are independent, i.e., their errors are
     uncorrelated. Then the ensemble makes a wrong prediction
     only if more than half of the base classifiers predict
     incorrectly.
   – Probability that the ensemble classifier makes a wrong
     prediction:

                 $\displaystyle\sum_{i=13}^{25} \binom{25}{i}\, \varepsilon^{i}\,(1-\varepsilon)^{25-i} \approx 0.06$
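This sum is easy to check numerically. A minimal sketch in plain Python (the function name ensemble_error is ours):

from math import comb

def ensemble_error(eps, n_classifiers):
    # Probability that more than half of n independent base classifiers,
    # each with error rate eps, are wrong at the same time.
    k = n_classifiers // 2 + 1          # at least 13 wrong votes out of 25
    return sum(comb(n_classifiers, i) * eps**i * (1 - eps)**(n_classifiers - i)
               for i in range(k, n_classifiers + 1))

print(ensemble_error(0.35, 25))         # about 0.06, versus 0.35 for a single classifier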
           Works if …

• The base classifiers should be independent.
• The base classifiers should do better than
  a classifier that performs random guessing.
  (error < 0.5)
• In practice, it is hard to make base
  classifiers perfectly independent.
  Nevertheless, ensembles have shown
  improvements even when the base classifiers
  are slightly correlated.
              Rationale
• One important note is that:
  – When we generate multiple base-learners, we
    want them to be reasonably accurate but do
    not require them to be very accurate
    individually, so they are not, and need not be,
    optimized separately for best accuracy. The
    base learners are not chosen for their
    accuracy, but for their simplicity.
       Ensemble Methods

•   Rationale
•   Combining classifiers
•   Bagging
•   Boosting
    – Ada-Boosting
   Combining classifiers
• Examples: classification trees and neural
  networks, several neural networks, several
  classification trees, etc.
• Average results from different models
• Why?
  – Better classification performance than
    individual classifiers
  – More resilience to noise
• Why not?
  – Time consuming
  – Overfitting
                     Why
• Why?
  – Better classification performance than individual
    classifiers
  – More resilience to noise
       • Besides avoiding the selection of the worst classifier
         under a particular hypothesis, the fusion of multiple
         classifiers can improve on the performance of the best
         individual classifier
       • This is possible if the individual classifiers make
         “different” errors
       • For linear combiners, Tumer and Ghosh (1996)
         showed that averaging the outputs of individual
         classifiers with unbiased and uncorrelated errors
         improves on the best individual classifier and, in the
         limit of infinitely many classifiers, yields the optimal
         Bayes classifier (a minimal averaging sketch follows)
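A minimal sketch of such a linear (averaging) combiner, assuming each base classifier outputs class probabilities (the names here, such as average_combiner, are ours):

import numpy as np

def average_combiner(prob_list):
    # prob_list: one (n_samples, n_classes) probability array per classifier.
    # Average the outputs and predict the class with the highest mean probability.
    mean_probs = np.mean(np.stack(prob_list), axis=0)
    return mean_probs.argmax(axis=1)

# Three classifiers, two samples, two classes
p1 = np.array([[0.9, 0.1], [0.4, 0.6]])
p2 = np.array([[0.7, 0.3], [0.6, 0.4]])
p3 = np.array([[0.2, 0.8], [0.3, 0.7]])
print(average_combiner([p1, p2, p3]))    # -> [0 1]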
      Different Classifier Architectures
• Combined classifiers can be arranged in three basic
  ways (shown schematically in the figures): serial,
  parallel, and hybrid
          Classifier Fusion
• Fusion is useful only if the combined classifiers are
  mutually complementary
• Majority vote fuser: the majority should always be
  correct
 Complementary classifiers

• Several approaches have been proposed to
  construct ensembles made up of
  complementary classifiers. Among the others:
  –   Using problem and designer knowledge
  –   Injecting randomness
  –   Varying the classifier type, architecture, or parameters
  –   Manipulating training data
  –   Manipulating features
  If you are interested …
• L. Xu, A. Krzyzak, C. Y. Suen, “Methods of Combining Multiple
  Classifiers and Their Applications to Handwriting
  Recognition”, IEEE Transactions on Systems, Man, and
  Cybernetics, 22(3), 1992, pp. 418-435.
• J. Kittler, M. Hatef, R. Duin and J. Matas, “On Combining
  Classifiers”, IEEE Transactions on Pattern Analysis and
  Machine Intelligence, 20(3), March 1998, pp. 226-239.
• D. Tax, M. van Breukelen, R. Duin, J. Kittler, “Combining Multiple
  Classifiers by Averaging or by Multiplying?”, Pattern
  Recognition, 33 (2000), pp. 1475-1485.
• L. I. Kuncheva, “A Theoretical Study on Six Classifier Fusion
  Strategies”, IEEE Transactions on Pattern Analysis and
  Machine Intelligence, 24(2), 2002, pp. 281-286.
        Alternatively …

•   Instead of designing multiple
    classifiers with the same dataset,
    we can manipulate the training set:
    multiple training sets are created
    by resampling the original data
    according to some distribution. E.g.,
    bagging and boosting
       Ensemble Methods

•   Rationale
•   Combining classifiers
•   Bagging
•   Boosting
    – Ada-Boosting
              Bagging
• Breiman, 1996
• Derived from bootstrap (Efron, 1993)

• Create classifiers using training sets that
  are bootstrapped (drawn with
  replacement)

• Average results for each case
       Bagging Example

Original         1   2   3   4   5   6   7   8
Training set 1   2   7   8   3   7   6   3   1
Training set 2   7   8   5   6   4   2   7   1
Training set 3   3   6   2   7   5   6   2   2
Training set 4   4   5   1   4   6   4   3   8
                       Bagging
• Sampling (with replacement) according to a uniform
  probability distribution
   – Each bootstrap sample D has the same size as the original data.
   – Some instances could appear several times in the same training set,
     while others may be omitted.


• Build a classifier on each bootstrap sample D
• Each data object has probability $1 - (1 - 1/n)^n$ of being
  selected for D, which approaches $1 - 1/e \approx 0.632$ for large n
• D will therefore contain approximately 63% of the distinct
  original data objects
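A minimal sketch of bootstrap resampling that also checks the 63% figure empirically (plain NumPy; the helper name bootstrap_indices is ours):

import numpy as np

rng = np.random.default_rng(0)

def bootstrap_indices(n):
    # One bootstrap sample: n draws with replacement from {0, ..., n-1}.
    return rng.integers(0, n, size=n)

n = 10_000
idx = bootstrap_indices(n)
print(len(np.unique(idx)) / n)      # fraction of distinct objects drawn, ~0.632
print(1 - (1 - 1/n) ** n)           # the formula above, also ~0.632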
                     Bagging
• Bagging improves generalization performance by reducing
  variance of the base classifiers. The performance of
  bagging depends on the stability of the base classifier.

   – If a base classifier is unstable, bagging helps to reduce the
     errors associated with random fluctuations in the training
     data.
   – If a base classifier is stable, bagging may not improve
     performance and can even degrade it.

• Bagging is less susceptible to model overfitting when applied
  to noisy data.
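A minimal sketch of bagging with unstable base learners (unpruned decision trees) and majority voting; scikit-learn offers this directly as sklearn.ensemble.BaggingClassifier, but spelling it out makes the procedure explicit (the helper names are ours; integer class labels are assumed):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_estimators=25, seed=0):
    # Train one tree per bootstrap sample of (X, y).
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_estimators):
        idx = rng.integers(0, len(X), size=len(X))      # sample with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    # Majority vote over the base trees (assumes integer labels).
    votes = np.stack([m.predict(X) for m in models])
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

X, y = load_iris(return_X_y=True)
models = bagging_fit(X, y)
print((bagging_predict(models, X) == y).mean())   # training accuracy of the ensemble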
            Boosting

• Sequential production of classifiers
• Each classifier is dependent on the
  previous one, and focuses on the previous
  one’s errors
• Examples that are incorrectly predicted in
  previous classifiers are chosen more often
  or weighted more heavily
                           Boosting
• Records that are wrongly classified will have their weights
  increased
• Records that are classified correctly will have their weights
  decreased
  Original Data         1   2   3   4   5   6   7   8   9  10
  Boosting (Round 1)    7   3   2   8   7   9   4  10   6   3
  Boosting (Round 2)    5   4   9   4   2   5   1   7   4   2
  Boosting (Round 3)    4   4   8  10   4   5   4   6   3   4

• Example 4 is hard to classify
• Its weight is increased, so it is more likely to be
  chosen again in subsequent rounds

• Boosting algorithms differ in terms of (1) how the weights of
  the training examples are updated at the end of each round, and
  (2) how the predictions made by each classifier are combined.
         Ada-Boosting
• Freund and Schapire, 1997
• Ideas
  – Complex hypotheses tend to overfit
  – Simple hypotheses may not explain the data
    well
  – Combine many simple hypotheses into a
    complex one
  – Key questions: how to design the simple
    hypotheses, and how to combine them
          Ada-Boosting
• Two approaches

  – Select examples according to error in previous
    classifier (more representatives of
    misclassified cases are selected) – more
    common

  – Weigh errors of the misclassified cases higher
    (all cases are incorporated, but weights are
    different) – does not work for some algorithms
      Boosting Example

Original         1   2   3   4   5   6   7   8
Training set 1   2   7   8   3   7   6   3   1
Training set 2   1   4   5   4   1   5   6   4
Training set 3   7   1   5   8   1   8   1   4
Training set 4   1   1   6   1   1   3   1   5
               Ada-Boosting
• Input:
   – Training samples S = {(xi, yi)}, i = 1, 2, …, N
   – Weak learner h
• Initialization
   – Each sample has equal weight wi = 1/N
• For k = 1 … T
   – Train weak learner hk on the weighted training set
   – Compute the weighted classification error
   – Update the sample weights wi
• Output
   – Final model: a weighted (linear) combination of the hk
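This loop is what off-the-shelf implementations run. A minimal usage sketch with scikit-learn, assuming it is available (its default weak learner is a depth-1 decision tree, i.e. a stump):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# T = 50 boosting rounds over decision stumps
clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))        # test accuracy of the boosted ensemble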
           Some Details
• Weak learner: a classifier whose error rate is only slightly
  better than random guessing

• Boosting: sequentially apply the weak learner to
  repeatedly modified versions of the data, thereby
  producing a sequence of weak classifiers hk(x). The
  predictions of all the weak classifiers are then
  combined through a weighted majority vote

• H(x) = sign[ Σk αk hk(x) ]
Schematic of AdaBoost

  Training samples  →  h1(x)
  Weighted samples  →  h2(x)
  Weighted samples  →  h3(x)
        …
  Weighted samples  →  hT(x)

  The weak classifiers are combined as sign[ Σk αk hk(x) ]
                     AdaBoost
• For k = 1 to T
   – Fit a learner hk to the training data using weights wi
   – Compute the weighted error

         $err_k = \dfrac{\sum_{i=1}^{N} w_i \, I\big(y_i \neq h_k(x_i)\big)}{\sum_{i=1}^{N} w_i}$

   – Compute the classifier weight

         $\alpha_k = \log \dfrac{1 - err_k}{err_k}$

   – Update the sample weights

         $w_i \leftarrow w_i \exp\big[\alpha_k \, I\big(y_i \neq h_k(x_i)\big)\big]$
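A minimal sketch of this loop with decision stumps as the weak learner, using the update rules above (scikit-learn is used only for the stump; the function names and the choice of labels in {-1, +1} are ours):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    # Discrete AdaBoost; assumes binary labels y in {-1, +1}.
    N = len(X)
    w = np.full(N, 1.0 / N)                              # w_i = 1/N
    learners, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = (h.predict(X) != y).astype(float)         # I(y_i != h_k(x_i))
        err = max(np.dot(w, miss) / w.sum(), 1e-10)      # err_k (floored to avoid log of 0)
        if err >= 0.5:                                   # no better than chance: stop
            break
        alpha = np.log((1 - err) / err)                  # alpha_k
        w = w * np.exp(alpha * miss)                     # boost weights of misclassified points
        w /= w.sum()
        learners.append(h)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(learners, alphas, X):
    # H(x) = sign[ sum_k alpha_k * h_k(x) ]
    return np.sign(sum(a * h.predict(X) for a, h in zip(alphas, learners)))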
              AdaBoost

• The weight αk penalizes classifiers that have poor accuracy
• If any intermediate round produces an error rate higher
  than 50%, the weights are reset to 1/N and the resampling
  procedure is repeated

• Because of its tendency to focus on training
  examples that are wrongly classified, the boosting
  technique can be quite susceptible to overfitting
           AdaBoost

• Classification
  – AdaBoost.M1 (two-class problems)
  – AdaBoost.M2 (multi-class problems)


• Regression
  – AdaBoost.R
  Who is doing better?
• Popular Ensemble Methods: An
  Empirical Study by David Opitz and
  Richard Maclin
• Presents a comprehensive evaluation
  of both bagging and boosting on 23
  datasets, using decision trees and
  neural networks as base classifiers
      Classifier Ensemble
• Neural networks are the basic classification method
• An effective combining scheme is to simply average the
  predictions of the networks
• An ideal ensemble consists of highly accurate classifiers that
  disagree as much as possible
         Bagging vs. Boosting
                            Training Data
                            1, 2, 3, 4, 5, 6, 7, 8

Bagging training set                          Boosting training set
Set 1: 2, 7, 8, 3, 7, 6, 3, 1                 Set 1: 2, 7, 8, 3, 7, 6, 3, 1
Set 2: 7, 8, 5, 6, 4, 2, 7, 1                 Set 2: 1, 4, 5, 4, 1, 5, 6, 4
Set 3: 3, 6, 2, 7, 5, 6, 2, 2                 Set 3: 7, 1, 5, 8, 1, 8, 1, 4
Set 4: 4, 5, 1, 4, 6, 4, 3, 8                 Set 4: 1, 1, 6, 1, 1, 3, 1, 5
Test-set error rates (%) reported by Opitz and Maclin.
Columns 1-5: neural networks (stan = single NN, simple = simple NN
ensemble, bag = bagging, arc = arcing, ada = Ada-boosting).
Columns 6-9: decision trees (stan = single tree, bag = bagging,
arc = arcing, ada = Ada-boosting).

Dataset            stan  simple   bag    arc    ada     stan    bag    arc    ada
breast-cancer-w     3.4    3.5    3.4    3.8    4.0      5.0    3.7    3.5    3.5
credit-a           14.8   13.7   13.8   15.8   15.7     14.9   13.4   14.0   13.7
credit-g           27.9   24.7   24.2   25.2   25.3     29.6   25.2   25.9   26.7
diabetes           23.9   23.0   22.8   24.4   23.3     27.8   24.4   26.0   25.7
glass              38.6   35.2   33.1   32.0   31.1     31.3   25.8   25.5   23.3
heart-cleveland    18.6   17.4   17.0   20.7   21.1     24.3   19.5   21.5   20.8
hepatitis          20.1   19.5   17.8   19.0   19.7     21.2   17.3   16.9   17.2
house-votes-84      4.9    4.8    4.1    5.1    5.3      3.6    3.6    5.0    4.8
hypo                6.4    6.2    6.2    6.2    6.2      0.5    0.4    0.4    0.4
ionosphere          9.7    7.5    9.2    7.6    8.3      8.1    6.4    6.0    6.1
iris                4.3    3.9    4.0    3.7    3.9      5.2    4.9    5.1    5.6
kr-vs-kp            2.3    0.8    0.8    0.4    0.3      0.6    0.6    0.3    0.4
labor               6.1    3.2    4.2    3.2    3.2     16.5   13.7   13.0   11.6
letter             18.0   12.8   10.5    5.7    4.6     14.0    7.0    4.1    3.9
promoters-936       5.3    4.8    4.0    4.5    4.6     12.8   10.6    6.8    6.4
ribosome-bind       9.3    8.5    8.4    8.1    8.2     11.2   10.2    9.3    9.6
satellite          13.0   10.9   10.6    9.9   10.0     13.8    9.9    8.6    8.4
segmentation        6.6    5.3    5.4    3.5    3.3      3.7    3.0    1.7    1.5
sick                5.9    5.7    5.7    4.7    4.5      1.3    1.2    1.1    1.0
sonar              16.6   15.9   16.8   12.9   13.0     29.7   25.3   21.5   21.7
soybean             9.2    6.7    6.9    6.7    6.3      8.0    7.9    7.2    6.7
splice              4.7    4.0    3.9    4.0    4.2      5.9    5.4    5.1    5.3
vehicle            24.9   21.2   20.7   19.1   19.7     29.4   27.1   22.5   22.9
Neural Networks
• Figure: reduction in error for Ada-boosting, arcing, and
  bagging of neural networks, as a percentage of the
  original error rate; the white bar represents one
  standard deviation
Further figures (panel titles only): Decision Trees;
Composite Error Rates; Neural Networks, Bagging vs. Simple;
Ada-Boost, Neural Networks vs. Decision Trees (boxes
represent the reduction in error); Arcing; Bagging
                Noise
• Hurts boosting the most
             Conclusions
• Performance depends on data and classifier
• In some cases, ensembles can overcome the bias of the
  component learning algorithm
• Bagging is more consistent than boosting
• Boosting can give much better results on some
  data

				