Ensemble Learning
• what is an ensemble?
• why use an ensemble?
• selecting component classifiers
• selecting combining mechanism
• some results


A Classifier Ensemble
[Figure: the input features are fed to each component classifier (Classifier 1, Classifier 2, ..., Classifier N); their individual class predictions go to a combiner, which produces the ensemble's single class prediction.]
Key Ensemble Questions
Which components to combine?
• different learning algorithms
• same learning algorithm trained in different ways
• same learning algorithm trained the same way
How to combine classifications?
• majority vote
• weighted (confidence of classifier) vote
• weighted (confidence in classifier) vote
• learned combiner
What makes a good (accurate) ensemble?


Why Do Ensembles Work?
Hansen and Salamon, 1990
If we can assume the classifiers make independent errors and each has accuracy > 50%, we can push ensemble accuracy arbitrarily high by combining more classifiers (see the sketch below)
Key assumption: classifiers are independent in their predictions
• not a very reasonable assumption
• more realistic: for the data points where classifiers predict with > 50% accuracy, we can push accuracy arbitrarily high (some data points are just too hard)
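To make the Hansen and Salamon argument concrete, here is a small sketch (ours, not from the slides) computing the probability that a majority vote of n independent classifiers, each correct with probability p, picks the right class; the binomial sum climbs toward 1 as n grows whenever p > 0.5:

from math import comb

def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a majority of n independent classifiers,
    each correct with probability p, votes for the right class.
    n is assumed odd so there are no ties."""
    # sum binomial probabilities over all outcomes with > n/2 correct votes
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

for n in (1, 5, 21, 101):
    print(n, round(majority_vote_accuracy(0.6, n), 4))
# accuracy rises from 0.60 (n=1) to about 0.68 (n=5) and on toward 1.0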
What Makes a Good Ensemble?
Krogh and Vedelsby, 1995
Can show that the error of an ensemble is mathematically related to the error and diversity of its components:

   $\hat{E} = \bar{E} - \bar{D}$

   $\hat{E}$ is the error of the entire ensemble
   $\bar{E}$ is the average error of the component classifiers
   $\bar{D}$ is a term measuring the diversity of the components

Effective ensembles have accurate and diverse components
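For regression ensembles this relation has an exact form, the ambiguity decomposition; the block below is a sketch of that standard statement (the symbols $f_i$, $w_i$, and $\bar{f}$ are our notation, not the slides'):

% Ambiguity decomposition (Krogh & Vedelsby, 1995), for a weighted
% ensemble \bar{f}(x) = \sum_i w_i f_i(x), with w_i \ge 0 and \sum_i w_i = 1:
\[
\underbrace{\bigl(\bar{f}(x) - y\bigr)^2}_{\hat{E}}
  \;=\; \underbrace{\sum_i w_i \bigl(f_i(x) - y\bigr)^2}_{\bar{E}}
  \;-\; \underbrace{\sum_i w_i \bigl(f_i(x) - \bar{f}(x)\bigr)^2}_{\bar{D}}
\]
% Because \bar{D} \ge 0, the ensemble error never exceeds the average
% component error, and it drops as the components disagree more.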
Ensemble Mechanisms - Components
• Separate learning methods
   – not often used
   – very effective in certain problems (e.g., protein folding; Rost and Sander, Zhang)
• Same learning method
   – generally still need to vary something externally
      • exception: some good results with neural networks
   – most often, the data set used for training is varied:
      • Bagging (Bootstrap Aggregating), Breiman
      • Boosting, Freund & Schapire
         – Ada, Freund & Schapire
         – Arcing, Breiman
Ensemble Mechanisms - Combiners
• Voting
• Averaging (if predictions are not 0/1)
• Weighted averaging
   – base weights on confidence in the component
• Learned combiner
   – Stacking, Wolpert
      • general combiner
   – RegionBoost, Maclin
      • piecewise combiner


Bagging
Varies the data set
Each training set is a bootstrap sample
   bootstrap sample - a set of examples selected (with replacement) from the original sample
Algorithm:
   for k = 1 to #classifiers
      train′ = bootstrap sample of the train set
      create classifier k using train′ as its training set
   combine classifications using simple voting (see the sketch below)
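A minimal runnable sketch of the loop above, assuming scikit-learn-style classifiers and integer class labels (the decision tree is our choice of component; the slides do not fix one):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging(X, y, n_classifiers=10, seed=0):
    """Train n_classifiers trees, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    ensemble = []
    for _ in range(n_classifiers):
        # bootstrap sample: draw len(X) indices with replacement
        idx = rng.integers(0, len(X), size=len(X))
        ensemble.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return ensemble

def vote(ensemble, X):
    """Combine classifications by simple (unweighted) majority vote;
    assumes integer class labels."""
    preds = np.stack([clf.predict(X) for clf in ensemble])  # (n_clf, n_pts)
    return np.array([np.bincount(col).argmax() for col in preds.T])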


Weak Learning
Schapire showed that a set of weak learners (learners with accuracy > 50%, but not much greater) can be combined into a strong learner
Idea: weight the data set based on how well we have predicted the data points so far
   – data points predicted accurately - low weight
   – data points mispredicted - high weight
Result: focuses the components on the portion of the data space not previously well predicted


Boosting - Ada
Varies the weights on the training data
Algorithm:
   for each data point i: set weight wi to 1/#datapoints
   for k = 1 to #classifiers
      generate classifier k with the current weighted train set
      εk = sum of the wi's of the misclassified points
      βk = (1 - εk) / εk
      multiply the weights of all misclassified points by βk
      normalize the weights to sum to 1
   combine: weighted vote, where the weight for classifier k is log(βk)
Q: what to do if εk = 0.0 or εk > 0.5? (see the sketch below)
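A minimal sketch of the weight-update loop above; fit_weighted is a hypothetical helper (not from the slides) that trains a classifier on a weighted training set, and the early exit is one common answer to the slide's closing question:

import numpy as np

def ada_boost(X, y, fit_weighted, n_classifiers=10):
    """fit_weighted(X, y, w) must return a trained classifier
    with a .predict(X) method (hypothetical helper)."""
    n = len(X)
    w = np.full(n, 1.0 / n)            # wi = 1/#datapoints
    classifiers, vote_weights = [], []
    for _ in range(n_classifiers):
        clf = fit_weighted(X, y, w)
        miss = clf.predict(X) != y
        eps = w[miss].sum()            # weighted error of this classifier
        if eps == 0.0 or eps >= 0.5:   # the slide's question: stop here
            break
        beta = (1 - eps) / eps
        w[miss] *= beta                # boost weights of misclassified points
        w /= w.sum()                   # normalize to sum to 1
        classifiers.append(clf)
        vote_weights.append(np.log(beta))  # vote weight log(beta_k)
    return classifiers, vote_weights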
Boosting - Arcing
Sample the data set (as in Bagging), but weight the probability of each data point being chosen (as in Boosting)
mi = number of mistakes made on point i by the previous classifiers
probability of selecting point i:

   $prob_i = \frac{1 + m_i^4}{\sum_{j=1}^{N} \left(1 + m_j^4\right)}$

The value 4 was chosen empirically
Combine using simple voting
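A one-function sketch of the arcing selection distribution above (our code):

import numpy as np

def arcing_probs(mistakes):
    """mistakes[i] = number of previous classifiers that got point i wrong.
    Returns each point's selection probability: (1 + m_i^4) / sum_j (1 + m_j^4)."""
    m = np.asarray(mistakes, dtype=float)
    weights = 1.0 + m**4
    return weights / weights.sum()

# e.g., np.random.default_rng(0).choice(len(X), size=len(X), p=arcing_probs(m))
# draws the next weighted bootstrap-style training sample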
Some Results - BP, C4.5 Components (error rates)
 Dataset    C4.5   BP    BagC4  BagBP  AdaC4  AdaBP  ArcC4  ArcBP
 letter     14.0   18.0   7.0   10.5    4.1    5.7    3.9    4.6
 segment     3.7    6.6   3.0    5.4    1.7    3.5    1.5    3.3
 promoter   12.8    5.3  10.6    4.0    6.8    4.5    6.4    4.6
 kr-vs-kp    0.6    2.3   0.6    0.8    0.3    0.4    0.4    0.3
 splice      5.9    4.7   5.4    3.9    5.1    4.0    5.3    4.2
 breastc     5.0    3.4   3.7    3.4-   3.5    3.8-   3.5    4.0-
 housev      3.6    4.9   3.6    4.1    5.0-   5.1-   4.8-   5.3-
Some Theories on Bagging/Boosting
Error = Bayes Optimal Error + Bias + Variance
Bayes Optimal Error = noise error
Theories:
   Bagging can reduce the variance part of the error
   Boosting can reduce the variance AND the bias part of the error
   Bagging will hardly ever increase error
   Boosting may increase error
   Boosting is susceptible to noise
   Boosting increases margins


Combiner - Stacking
Idea:
   generate the component (level 0) classifiers with part of the data (half, three quarters)
   train the combiner (level 1) classifier to combine the predictions of the components, using the remaining data
   retrain the component classifiers with all of the training data
In practice, often equivalent to voting (a sketch follows below)
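A minimal sketch of the stacking recipe above, assuming scikit-learn-style estimators (the 50/50 split and the logistic-regression combiner are our illustrative choices, not fixed by the slides):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

def stack(X, y, components, combiner=None, seed=0):
    """components: list of unfitted level-0 classifiers."""
    combiner = combiner or LogisticRegression()
    # level 0: train the components on part of the data
    X0, X1, y0, y1 = train_test_split(X, y, test_size=0.5, random_state=seed)
    for clf in components:
        clf.fit(X0, y0)
    # level 1: train the combiner on component predictions for held-out data
    meta = np.column_stack([clf.predict(X1) for clf in components])
    combiner.fit(meta, y1)
    # finally, retrain the components on all of the training data
    for clf in components:
        clf.fit(X, y)
    return components, combiner

def stacked_predict(components, combiner, X):
    meta = np.column_stack([clf.predict(X) for clf in components])
    return combiner.predict(meta)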




Combiner - RegionBoost
• Train a "weight" classifier for each component classifier
• The "weight" classifier predicts how likely a point is to be predicted correctly by its component
• "weight" classifiers used: k-Nearest Neighbor, Backprop
• To combine: generate each component classifier's prediction and weight it using the corresponding "weight" classifier (see the sketch below)
• Small gains in accuracy
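A sketch of the RegionBoost idea with k-Nearest Neighbor as the "weight" classifier; this is our reconstruction from the bullets above, not Maclin's actual code:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def train_weight_classifiers(components, X, y, k=5):
    """For each trained component, fit a kNN that predicts whether
    the component classifies a point correctly (1) or not (0)."""
    weighters = []
    for clf in components:
        correct = (clf.predict(X) == y).astype(int)
        weighters.append(KNeighborsClassifier(n_neighbors=k).fit(X, correct))
    return weighters

def region_boost_predict(components, weighters, X, classes):
    """Weight each component's vote by its predicted chance of being right."""
    scores = {c: np.zeros(len(X)) for c in classes}
    for clf, wclf in zip(components, weighters):
        pred = clf.predict(X)
        # P(component correct | x): class 1's probability (classes_ is
        # sorted, so class 1 is the last column when present)
        conf = (wclf.predict_proba(X)[:, -1]
                if 1 in wclf.classes_ else np.zeros(len(X)))
        for c in classes:
            scores[c] += conf * (pred == c)
    stacked = np.stack([scores[c] for c in classes])
    return np.array(classes)[stacked.argmax(axis=0)]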

CS 5751 Machine          Ensemble Learning             15
Learning