Study on Ensemble Learning


        By Feng Zhou
Content
• Introduction
• A Statistical View of M3 Network
• Future Work
Introduction
• Ensemble learning:
   – Combine a group of classifiers rather than design a new one.
   – The decisions of multiple hypotheses are combined to produce more
     accurate results.
• Problems in traditional learning algorithms:
   – Statistical problem
   – Computational problem
   – Representation problem
• Related work:
   – Resampling techniques: Bagging, Boosting
   – Approaches for extending to the multi-class problem:
     One-vs-One, One-vs-All
Min-Max-Modular (M3) Network (Lu, IEEE TNN 1999)

• Steps
   – Dividing the training sets (Chen, IJCNN 2006; Wen, ICONIP 2005)
   – Training pair-wise classifiers
   – Integrating the outcomes (Zhao, IJCNN 2005); see the sketch after the figure below
      • Min process
      • Max process
  [Figure: Min-Max integration example. Four rows of module outputs,
  (0.1 0.5 0.7 0.2), (0.4 0.3 0.5 0.6), (0.8 0.5 0.4 0.2) and (0.5 0.9 0.7 0.3),
  are each reduced by a Min unit to 0.1, 0.3, 0.2 and 0.3; the Max unit then
  outputs 0.3.]
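A minimal sketch of the Min-Max integration step shown above, assuming the
pair-wise module outputs are already collected in a matrix (the function name
and array layout are illustrative, not from the original work):

```python
import numpy as np

def min_max_integrate(module_outputs):
    """Combine pair-wise module outputs in an M3 network.

    module_outputs[i][j] is the score of the module trained on the i-th
    positive subset against the j-th negative subset.  Each row is first
    reduced by a Min unit; the row results are then fed to a Max unit.
    """
    scores = np.asarray(module_outputs, dtype=float)
    row_minima = scores.min(axis=1)   # Min process: one value per positive subset
    return row_minima.max()           # Max process: final score for the positive class

# The figure's example: row minima are 0.1, 0.3, 0.2, 0.3, so the Max unit gives 0.3
outputs = [[0.1, 0.5, 0.7, 0.2],
           [0.4, 0.3, 0.5, 0.6],
           [0.8, 0.5, 0.4, 0.2],
           [0.5, 0.9, 0.7, 0.3]]
print(min_max_integrate(outputs))  # 0.3
```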
A Statistical View

• Assumption
   – The pair-wise classifier outputs a probabilistic value.
     Sigmoid function (J. C. Platt, ALMC 1999):

       P(\omega \mid x) = \frac{1}{1 + e^{Ax + B}}

• Bayesian decision theory

    \hat{\omega} = \arg\max_{\omega \in \{\omega^+,\, \omega^-\}} P(\omega \mid x),
    \quad \text{where} \quad
    P(\omega \mid x) = \frac{P(x \mid \omega)\, P(\omega)}
                            {P(x \mid \omega^+)\, P(\omega^+) + P(x \mid \omega^-)\, P(\omega^-)}
A Simple Discrete Example

        P(w | x)
                 w+      w-
        x1       1/2
        x2       1/2     2/5
        x3               2/5
        x4               1/5
A Simple Discrete Example (II)

  Classifier 0 (w+ : w-):    Pc0(w+ | x = x2) = 1/3
  Classifier 1 (w+ : w1-):   Pc1(w+ | x = x2) = 1/2
  Classifier 2 (w+ : w2-):   Pc2(w+ | x = x2) = 1/2

  Pc0 < min(Pc1, Pc2)
A More Complicated Example

• When one more classifier is considered, the evidence that x belongs to w+
  keeps shrinking, while the information about w- keeps increasing.

• Pglobal(w+) < min(Ppartial(w+))

• The classifier reporting the minimum value contains the most information
  about w- (minimization principle).

• If Ppartial(w+) = 1, no information about w- is contained.

  [Figure: Classifier 1 (w+ : w1-), Classifier 2 (w+ : w2-), ...]
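A small numeric check of the minimization principle, under the simplifying
assumption that all sub-classes are equiprobable; the likelihood values are
invented for illustration:

```python
# Hypothetical class-conditional likelihoods at a fixed x (illustrative numbers only)
p_x_pos = 0.5                  # p(x | w+)
p_x_negs = [0.4, 0.2, 0.7]     # p(x | w1-), p(x | w2-), p(x | w3-)

# Pairwise estimates: w+ against each negative sub-class alone
partial = [p_x_pos / (p_x_pos + p_neg) for p_neg in p_x_negs]

# Global estimate: w+ against the union of all negative sub-classes
global_p = p_x_pos / (p_x_pos + sum(p_x_negs))

print([round(p, 3) for p in partial])   # [0.556, 0.714, 0.417]
print(round(global_p, 3))               # 0.278
print(global_p <= min(partial))         # True: Pglobal(w+) <= min(Ppartial(w+))
```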
Analysis

• For each classifier c_ij:

    M_{ij} \triangleq P(\omega_i^+ \mid x,\ \omega_i^+ \cup \omega_j^-)
           = \frac{P(x, \omega_i^+)}{P(x, \omega_i^+) + P(x, \omega_j^-)}

• For each sub-positive class w_i+:

    q_i \triangleq P(\omega_i^+ \mid x,\ \omega_i^+ \cup \omega^-)
        = \frac{1}{\sum_j \frac{1}{M_{ij}} - (n^- - 1)}

• For the positive class w+:

    P(\omega^+ \mid x) = 1 - \frac{1}{\sum_i \frac{1}{1 - q_i} - (n^+ - 1)}
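The two combination formulas can be sketched as follows, assuming the pairwise
outputs M_ij are probability-calibrated and the sub-classes partition w+ and w-
(function and variable names are illustrative):

```python
import numpy as np

def combine_pairwise(M):
    """Recover P(w+ | x) from the matrix of pairwise posteriors.

    M[i, j] = P(w_i+ | x, w_i+ or w_j-), the calibrated output of classifier c_ij.
    First combine over the negative sub-classes to obtain q_i for each positive
    sub-class, then combine the q_i into the posterior of the whole positive class.
    """
    M = np.asarray(M, dtype=float)
    n_pos, n_neg = M.shape
    q = 1.0 / ((1.0 / M).sum(axis=1) - (n_neg - 1))              # per sub-positive class
    return 1.0 - 1.0 / ((1.0 / (1.0 - q)).sum() - (n_pos - 1))   # whole positive class

# Sanity check against the discrete example: M_1 = M_2 = 1/2 gives Pc0 = 1/3
print(combine_pairwise([[0.5, 0.5]]))  # 0.333...
```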
Analysis (II)

• Decomposition of a complex problem

• Restoration to the original resolution
Composition of Training Sets

  [Figure: a grid of class pairs over w1+ ... wn++ and w1- ... wn--, marking
  which pairwise training sets have been used, which are not used yet, and
  which are trivial and therefore useless.]
Another Way of Combination

  [Figure: the same grid of class pairs over w1+ ... wn++ and w1- ... wn--,
  now also drawing on the previously unused pairwise training sets.]

    q_k = \frac{1}{\sum_{i \neq k} \frac{1}{M'_{ki}} + \sum_j \frac{1}{M_{kj}} - (n^+ + n^- - 2)}

  Training and testing time: (n+ * n-) vs. (n+ + n-)
Experiments – Synthetic Data
Experiments – Text Categorization (20 Newsgroups corpus)
Experiments Setup

• Removing words:
   – stemming
   – stop words
   – words occurring < 30 times

• Using Naïve Bayes as the elementary classifier

• Estimating the probability with a sigmoid function (see the sketch below)
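A rough reconstruction of this setup with scikit-learn; the library, the
min_df=30 cut-off standing in for "words < 30", and the calibration wrapper are
my assumptions rather than the study's actual code:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.calibration import CalibratedClassifierCV
from sklearn.pipeline import make_pipeline

train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test", remove=("headers", "footers", "quotes"))

pipeline = make_pipeline(
    # Stop-word removal plus a document-frequency cut-off stand in for the
    # "stemming / stop words / rare words" filtering listed on the slide.
    CountVectorizer(stop_words="english", min_df=30),
    # Naive Bayes as the elementary classifier, wrapped in a sigmoid
    # (Platt-style) calibration layer to obtain probabilistic outputs.
    CalibratedClassifierCV(MultinomialNB(), method="sigmoid", cv=3),
)
pipeline.fit(train.data, train.target)
print(pipeline.score(test.data, test.target))
```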
Future Work

• Situation with consideration of noise
   – The essence of the problem: to estimate the underlying distribution
   – Independent parameters of the model: n^+ + n^-
   – Constraints we get: \binom{n^+ + n^-}{2}
   – To obtain the best estimation:
     Kullback-Leibler distance (T. Hastie, Ann Statist 1998)
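For reference, the weighted Kullback-Leibler criterion minimized in pairwise
coupling [1] can be written as below (notation adapted: r_ij are the observed
pairwise estimates, mu_ij = p_i / (p_i + p_j) the model's pairwise
probabilities, and n_ij the number of training examples in pair (i, j)):

```latex
% Weighted KL distance between observed pairwise estimates r_{ij} and the
% model's induced pairwise probabilities \mu_{ij} = p_i / (p_i + p_j),
% minimized over the class probabilities p (Hastie & Tibshirani, 1998).
\ell(p) = \sum_{i<j} n_{ij} \left[
    r_{ij} \log \frac{r_{ij}}{\mu_{ij}}
  + (1 - r_{ij}) \log \frac{1 - r_{ij}}{1 - \mu_{ij}}
\right]
```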
References

[1] T. Hastie & R. Tibshirani, Classification by pairwise coupling,
    Ann. Statist., 1998.
[2] J. C. Platt, Probabilistic outputs for support vector machines and
    comparisons to regularized likelihood methods, ALMC, 1999.
[3] B. Lu & M. Ito, Task decomposition and module combination based on
    class relations: a modular neural network for pattern classification,
    IEEE Trans. Neural Networks, 1999.
[4] Y. M. Wen & B. Lu, Equal clustering makes min-max modular support
    vector machines more efficient, ICONIP 2005.
[5] H. Zhao & B. Lu, On efficient selection of binary classifiers for
    min-max modular classifier, IJCNN 2005.
[6] K. Chen & B. Lu, Efficient classification of multi-label and
    imbalanced data using min-max modular classifiers, IJCNN 2006.

				