

Some Aspects of Bayesian Approach to Model Selection

Vetrov Dmitry
Dorodnicyn Computing Centre of RAS, Moscow
Our research team

My colleagues:
- Dmitry Kropotov, PhD student of MSU
- Nikita Ptashko
- Pavel Tolpegin
- Igor Tolstov

   Problem formulation
   Ways of solution
   Bayesian paradigm
   Bayesian regularization of kernel classifiers
Quality vs. Reliability

A general problem:
What means should we use to solve a task: sophisticated and complex but accurate, or simple but reliable?

A trade-off between quality and reliability is needed.
Machine learning interpretation

The easiest way to establish a compromise is to regularize the criterion function with some heuristic regularizer:

J(w) = Φ(w) + λ R(w)

The general problem is HOW to express accuracy and reliability in the same terms. In other words, how do we define the regularization coefficient λ?
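As a toy illustration (an assumed example, not from the slides): a criterion of the form J(w) = Φ(w) + λR(w), where Φ(w) is the squared training error of a 1-D linear fit and R(w) = w² is the complexity penalty. The data and the grid search are made up for the sketch; the point is only that a larger λ pulls the chosen w toward zero.

```python
# Toy sketch: regularized criterion J(w) = Phi(w) + lambda * R(w).
def J(w, lam):
    xs, ys = [1.0, 2.0, 3.0], [2.0, 4.1, 5.9]
    phi = sum((w * x - y) ** 2 for x, y in zip(xs, ys))  # Phi(w): accuracy term
    r = w * w                                            # R(w): reliability term
    return phi + lam * r

# Minimize J over a grid of candidate weights for two values of lambda.
grid = [i / 100 for i in range(301)]
best_small = min(grid, key=lambda w: J(w, lam=0.01))   # nearly unregularized
best_large = min(grid, key=lambda w: J(w, lam=100.0))  # heavily shrunk
```

With a small λ the minimizer stays near the pure least-squares fit; with a large λ it collapses toward zero, which is exactly the quality/reliability trade-off the coefficient controls.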
General ways of compromise I

Structural Risk Minimization (SRM) penalizes the flexibility of classifiers, expressed via the VC-dimension of a given classifier:

P_test ≤ P_train + Φ(VC_dim)

Drawback: the VC-dimension is very difficult to compute and its estimates are too rough. The upper bound for the test error is too high and often exceeds 1.
General ways of compromise II

Minimal Description Length (MDL) penalizes the algorithmic complexity of a classifier. The classifier is considered as a coding algorithm: we encode both the training data and the algorithm itself, trying to minimize the total description length:

l_description = l_encoded + l_coder
Important aspect

All the described schemes penalize the flexibility or complexity of a classifier, but is it what we really want?

"Complex classifier does not always mean bad classifier."
                        Ludmila Kuncheva, private communication
Maximal likelihood principle

The well-known maximal likelihood principle states that we should select the classifier with the largest likelihood (i.e. accuracy on the training sample):

w_ML = arg max_w P(D_train | w)

P(D_test | D_train) ≈ P(D_test | w_ML)
Bayesian view

P(D_test | D_train) = ∫ P(D_test | w) P(w | D_train) dw

The posterior combines the likelihood with the prior:

P(w | D_train) = P(D_train | w) P(w) / ∫ P(D_train | w) P(w) dw
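A minimal numerical sketch of the posterior formula (an assumed coin-flipping example, not from the slides): discretize the parameter w (the coin's head-probability), multiply likelihood by prior at each grid point, and normalize by the sum, which stands in for the integral.

```python
# Sketch: posterior P(w|D) = P(D|w) P(w) / Z on a discretized parameter grid.
def posterior(grid, prior, heads, tails):
    # Unnormalized posterior: Bernoulli likelihood times prior at each point.
    unnorm = [(w ** heads) * ((1 - w) ** tails) * p
              for w, p in zip(grid, prior)]
    z = sum(unnorm)                      # normalizer: the integral, as a sum
    return [u / z for u in unnorm]

grid = [i / 100 for i in range(1, 100)]  # grid over (0, 1)
prior = [1.0 / len(grid)] * len(grid)    # flat prior P(w)
post = posterior(grid, prior, heads=7, tails=3)
w_map = grid[post.index(max(post))]      # posterior mode, near 0.7
```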

Model Selection

Suppose we have different classifier families

Ω(α_1), …, Ω(α_p)

and want to know which family is better, without performing computationally expensive cross-validation techniques.

This problem is also known as the model selection task.
Bayesian framework I

Find the best model, i.e. the optimal value of the hyperparameter α:

α_MP = arg max_{α ∈ A} P(α | D_train)

If all models are equally likely, then

P(α | D_train) ∝ P(D_train | α) = ∫ P(D_train | w) P(w | α) dw

Note that it is exactly the evidence which should be maximized to find the best model.
Bayesian framework II

Now compute the posterior parameter distribution…

P(w | D_train) = P(D_train | w) P(w | α_MP) / P(D_train | α_MP)

… and the final likelihood of the test data:

P(D_test | D_train) = ∫ P(D_test | w) P(w | D_train) dw
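The evidence comparison can be sketched numerically (an assumed coin example, not from the slides): two models of a coin, one with a flat prior over the bias w and one that concentrates the prior at w = 0.5, compared by their evidence P(D_train | α) = Σ_w P(D_train | w) P(w | α).

```python
# Sketch: model selection by evidence, with the integral discretized to a sum.
def evidence(heads, tails, grid, prior):
    return sum((w ** heads) * ((1 - w) ** tails) * p
               for w, p in zip(grid, prior))

grid = [i / 100 for i in range(1, 100)]
flat_prior = [1.0 / len(grid)] * len(grid)        # model 1: any bias allowed
fair_prior = [1.0 if w == 0.5 else 0.0 for w in grid]  # model 2: fair coin

# Strongly biased data: the flexible model wins the evidence comparison...
biased_flat = evidence(9, 1, grid, flat_prior)
biased_fair = evidence(9, 1, grid, fair_prior)
# ...while balanced data favours the simpler "fair coin" model.
balanced_flat = evidence(5, 5, grid, flat_prior)
balanced_fair = evidence(5, 5, grid, fair_prior)
```

This is the automatic Occam's razor effect of the evidence: the more flexible model spreads its prior mass over many datasets and is only preferred when the data demand it.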
Why do we need model selection?

The answer is simple: many classifiers (e.g. neural networks or support vector machines) require some additional parameters to be set by the user before training.

IDEA: These parameters can be viewed as model hyperparameters, and the Bayesian framework can be applied to select their best values.
What is evidence

[Figure: likelihood curves P(D_train | w) for two models.] The red model has larger likelihood, but the green model has better evidence: it is more stable, and we may hope for better generalization.
Support vector machines

The separating surface is defined as a linear combination of kernel functions:

f(x) = Σ_{i=1}^{N} w_i K(x, x_i) + b

The weights are determined by solving a QP optimization problem:

‖w‖² + C Σ_{k=1}^{N} ξ_k → min
Bottlenecks of SVM

SVM proved to be one of the best classifiers due to the use of the maximal margin principle and the kernel trick, BUT…

How do we define the best kernel for a particular task, and the regularization coefficient C?

Bad kernels may lead to very poor performance due to overfitting or undertraining.
Relevance Vector Machines

A probabilistic approach to kernel models. The weights are interpreted as random variables with a Gaussian prior distribution:

w_i ~ N(0, α_i⁻¹)

The maximal evidence principle is used to select the best α_i values. Most of them tend to infinity; hence the corresponding weights become zero, which makes the classifier quite sparse.
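The pruning mechanism can be seen in a 1-D sketch (an assumed example, not from the slides): a Gaussian prior w ~ N(0, α⁻¹) acts on the MAP estimate like a ridge penalty with coefficient α, so as α → ∞ the weight is driven to zero and the corresponding basis function drops out.

```python
# Sketch: MAP weight for 1-D least squares under a zero-mean Gaussian prior
# with precision alpha; the closed form is w = sum(x*y) / (sum(x^2) + alpha).
def w_map(xs, ys, alpha):
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
w_weak = w_map(xs, ys, alpha=0.1)    # near the unpenalized solution w = 2
w_huge = w_map(xs, ys, alpha=1e9)    # effectively pruned to zero
```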
Sparseness of RVM

[Figure: selected support/relevance vectors for SVM (C=10) vs. RVM.]
Numerical implementation of RVM

We use the Laplace approximation to avoid integration. The likelihood around its maximum is then approximated by a Gaussian:

P(D_train | w) ≈ N(w_max, H⁻¹)

H = −∇_w ∇_w log P(D_train | w)

Then the evidence can be computed analytically, and iterative optimization of α_i becomes possible.
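The resulting evidence estimate needs only the value and curvature of the log-likelihood at its mode. A 1-D sketch (assumed example, not from the slides) where the check is easy: for a Gaussian-shaped log-likelihood, the Laplace approximation is exact, so the estimate matches the closed-form integral.

```python
# Sketch of the Laplace evidence: P(D|alpha) ≈ (2*pi)^(N/2) P(D|w_max) |H|^(-1/2),
# with H the negative Hessian of log P(D|w) at the maximum w_max.
import math

def laplace_evidence(log_p_max, hessian, n_params=1):
    # Evidence estimate from the height and curvature at the mode.
    return ((2 * math.pi) ** (n_params / 2)
            * math.exp(log_p_max) / math.sqrt(hessian))

# Gaussian-shaped unnormalized log-likelihood: log p(w) = -0.5*h*(w - m)^2 + c.
h, c = 4.0, -1.0
est = laplace_evidence(log_p_max=c, hessian=h)
exact = math.exp(c) * math.sqrt(2 * math.pi / h)  # closed-form ∫ exp(log p) dw
```

Sharper curvature (larger H) shrinks the estimate, which is the stability interpretation of the evidence described on the next slide.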
Evidence interpretation

The evidence is then given by

P(D_train | α) ≈ (2π)^{N/2} P(D_train | w_max) |H|^{−1/2}

This is exactly STABILITY with respect to weight changes! The larger the Hessian, the smaller the evidence.
Kernel selection

IDEA: Use the same technique for kernel determination, e.g. for finding the best width of a Gaussian kernel:

K(x, y) = exp(−‖x − y‖² / (2σ²))
Sudden problem

It turned out that narrow Gaussians are more stable with respect to weight changes.

We allow the centres of kernels to be located at arbitrary points (relevance points). The trade-off between narrow (high accuracy on the training set) and wide (stable answers) Gaussians can then be found.

The resulting classifier turned out to be even sparser than RVM!
Sparseness of GRVM

[Figure: selected kernels for RVM vs. GRVM.]
Some experimental results

                     Errors                    Kernels
             RVM LOO  SVM LOO  RVM ME   RVM LOO  SVM LOO  RVM ME
Australian    14.9     11.54   10.58       37      188      19
Bupa          25       26.92   21.15        6      179       7
Hepatitis     36.17    31.91   31.91       34      102      11
Pima          22.08    21.65   21.21       29      309      13
Credit        16.35    15.38   15.87       57      217      36
Future work

- Develop quick optimization procedures
- Optimize α_i and σ simultaneously during evidence maximization
- Use a different width for each feature to get more sophisticated kernels
- Apply this approach to polynomial kernels
- Apply this approach to regression tasks
            Thank you!

         Contact information:
VetrovD@yandex.ru, DKropotov@yandex.ru
