Bayesian Support Vector Machine Classification

Bayesian Support Vector Machine Classification

Vasilis A. Sotiris
AMSC663 Midterm Presentation
December 2007

University of Maryland
College Park, MD 20783
Objectives

• Develop an algorithm to detect anomalies in multivariate data from electronic systems
• Improve the detection sensitivity of classical Support Vector Machines (SVMs)
• Decrease false alarms
• Predict future system performance
Methodology

• Use linear Principal Component Analysis (PCA) to decompose and compress the raw data into two models: a) a PCA model and b) a Residual model
• Use Support Vector Machines to classify the data in each model into normal and abnormal classes
• Assign probabilities to the classification output of the SVMs using a sigmoid function
• Use Maximum Likelihood Estimation to find the optimal sigmoid function parameters in each model
• Determine the joint class probability from both models
• Track changes in the joint probability to:
   – improve detection sensitivity
   – decrease false alarms
   – predict future system performance
Flow Chart of Probabilistic SVC Detection Methodology

[Flow chart] Training data from the baseline population database, in the input space R^{n x m}, is decomposed by PCA into a PCA model (R^{k x m}) and a Residual model (R^{l x m}). An SVC is trained in each model, giving decision boundaries D1(y1) (PCA model) and D2(y2) (Residual model). A new observation (R^{1 x m}) is projected onto both models and classified by each SVC; sigmoid likelihood functions map the decision values D(x) to probability matrices, which are combined into joint probabilities (the probability model). Trending of the joint probability distributions supports the health decision.
Principal Component Analysis
Principal Component Analysis – Statistical Properties

• Decompose the data into two models:
   – PCA model (maximum variance) – y1
   – Residual model – y2
  [Figure: input axes x1, x2 with the principal directions y1 (PC1) and y2 (PC2)]

   y_1 = sum_{i=1}^{2} a_i x_i = a_1 x_1 + a_2 x_2 = [a_1  a_2] [x_1  x_2]^T = a^T x

• The direction of y1 is the eigenvector with the largest associated eigenvalue λ:

   max var(y_1) = var( sum_{i=1}^{2} a_i x_i )
   var(y_1) = E[y_1^2] - (E[y_1])^2 = E[a^T x x^T a] - E[a^T x] E[x^T a]
   var(y_1) = a^T E[x x^T] a - a^T E[x] E[x^T] a

• The vector a is chosen as the eigenvector of the covariance matrix C:

   var(y_1) = a^T C a,   C = E[x x^T] - E[x] E[x^T]
   var(y_1) = λ_1
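A minimal numpy sketch (an assumed illustration, not from the slides) of the property above: the variance of the data projected onto the leading eigenvector of the covariance matrix equals the largest eigenvalue λ_1.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.5], [0.5, 1.0]])  # correlated 2-D data

C = np.cov(X, rowvar=False)                 # covariance matrix C
eigvals, eigvecs = np.linalg.eigh(C)        # eigenanalysis of C
a = eigvecs[:, np.argmax(eigvals)]          # eigenvector with the largest eigenvalue

y1 = X @ a                                  # first principal component scores, y1 = a^T x
print(np.var(y1, ddof=1), eigvals.max())    # var(y1) matches lambda_1
```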
Singular Value Decomposition (SVD) – Eigenanalysis

• SVD is used in this algorithm to perform PCA:

   X = U Λ V^T,   Λ = diag(λ_1, ..., λ_m)

• SVD
   – performs the eigenanalysis without first computing the covariance matrix
   – speeds up computations
   – computes the basis functions used in the projection (next)
• The output of the SVD is:
   – U (n x m) – basis functions for the PCA and residual models
   – Λ – the singular values, from which the eigenvalues of the covariance matrix are obtained
   – V (m x m) – eigenvectors of the covariance matrix
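A minimal sketch (an assumed illustration, not the author's code) of PCA via a thin SVD of the centered data matrix, with rows as observations; in this row-wise convention the principal directions of the m-dimensional parameter space come from the right singular vectors, which play the role of the slides' basis functions.

```python
import numpy as np

def pca_by_svd(X, k):
    """PCA via thin SVD of the centered data matrix (no covariance matrix formed)."""
    Xc = X - X.mean(axis=0)                       # center the data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    eigvals = s**2 / (Xc.shape[0] - 1)            # eigenvalues of the covariance matrix
    Uk = Vt[:k].T                                 # m x k basis of the k-dimensional PCA model
    return Uk, eigvals

# usage: keep the first two principal directions of some training data
X = np.random.default_rng(1).normal(size=(200, 5))
Uk, eigvals = pca_by_svd(X, k=2)
print(Uk.shape, eigvals)
```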
Subspace Decomposition

• [S] – PCA model (signal) subspace
   – detects dominant parameter variation
• [R] – Residual subspace
   – detects hidden anomalies
• The analysis of the system behavior can therefore be decoupled into what are called the signal subspace and the residual subspace
• To obtain xS and xR, the input data x is projected onto [S] and [R]:

   x = xS + xR

  [Figure: raw data x split into its PCA model projection xS on [S] and its residual model projection xR on [R]]
Least Squares Projections

• u – basis vector of the PCA model (a principal direction, e.g. PC1 or PC2)
• v – vector from the centered training data to the new observation
• Objective: find the optimal p that minimizes ||v - p u||
   – this gives the projection Vp
• Setting the residual v - p u orthogonal to u gives the normal equation:

   u^T u p = u^T v
   p_opt u = Vp
   Vp = u (u^T u)^{-1} u^T v

• The projection matrix is finally put in terms of the SVD basis:

   H = u (u^T u)^{-1} u^T
   H = Uk Uk^T

   – k is the number of principal components (the dimension of the PCA model)
• The projection pursuit is optimized based on the PCA model
  [Figure: new observation projected onto the PCA model subspace [S] spanned by PC1 and PC2, with the residual in [R]]
Data Decomposition

• With the projection matrix H, any incoming signal can be projected onto the signal subspace [S] and the residual subspace [R]:

   x = xS + xR
   H = Uk Uk^T
   G = I - Uk Uk^T

• G is the matrix, analogous to H, that creates the projection onto [R]
• H is the projection onto [S] and G is the projection onto [R]:

   x = H x + (I - H) x

  (first term: projection onto [S]; second term: projection onto [R]; a small numerical sketch follows below)
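A minimal numpy sketch (an assumed illustration) of the decomposition above: build H and G from the first k principal directions and verify that x = xS + xR.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))   # training data, n x m
Xc = X - X.mean(axis=0)

k = 2                                                     # dimension of the PCA model
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Uk = Vt[:k].T                                             # m x k basis of the signal subspace [S]

H = Uk @ Uk.T                                             # projection onto [S]
G = np.eye(X.shape[1]) - H                                # projection onto [R]

x = rng.normal(size=4) + X.mean(axis=0)                   # a new observation
xc = x - X.mean(axis=0)
xS, xR = H @ xc, G @ xc
print(np.allclose(xc, xS + xR))                           # x = xS + xR
```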
Support Vector Machines
Support Vector Machines

• The performance of a system is characterized by the distribution of its parameters
• SVMs estimate the decision boundary for the given distribution
• Areas with less information are allowed a larger margin of error (soft versus hard decision boundary)
• New observations are classified using the decision boundary and labeled as:
   – (-1) outside
   – (+1) inside
  [Figure: parameter space (x1, x2) with a hard and a soft decision boundary around the training data]
Linear Classification – Separable Input Space

• The SVM finds a function D(x) that best separates the two classes (maximum margin M)
• D(x) can be used as a classifier
• Through the support vectors we can
   – compress the input space by excluding all data except the support vectors
   – the support vectors tell us everything we need to know about the system in order to perform detection
• By minimizing the norm of w we find the line (or linear surface) that best separates the two classes:

   M = 2 / ||w||
   min (1/2) ||w||^2 = (1/2) w^T w
   w = sum_{i=1}^{n} a_i y_i x_i

• The decision function is the linear combination of the weight vector w:

   D(x) = sum_{i=1}^{n} w_i x_i + b = sum_{i=1}^{n} y_i a_i x_i^T x + b

  [Figure: normal and abnormal classes in (x1, x2), margin M, weight vector w, decision boundary D(x), training support vectors a_i, and a new observation vector]
  (A sketch recovering w and b from a trained SVM follows below.)
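A minimal sketch with scikit-learn (an assumed tool choice, not named in the slides) showing that the weight vector of a linear SVM is the combination w = sum_i a_i y_i x_i over the support vectors, and that the margin is 2/||w||.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X_pos = rng.normal(loc=[2, 2], size=(40, 2))          # normal class (+1)
X_neg = rng.normal(loc=[-2, -2], size=(40, 2))        # abnormal class (-1)
X = np.vstack([X_pos, X_neg])
y = np.r_[np.ones(40), -np.ones(40)]

clf = SVC(kernel="linear", C=1e6).fit(X, y)           # large C approximates a hard margin

# dual_coef_ holds y_i * a_i for the support vectors
w = clf.dual_coef_ @ clf.support_vectors_             # w = sum_i a_i y_i x_i
print(np.allclose(w, clf.coef_))                      # matches the fitted weight vector
print("margin M =", 2 / np.linalg.norm(w))            # M = 2 / ||w||
```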
Linear Classification – Inseparable Input Space

• For inseparable data the SVM finds a function D(x) that best separates the two classes by:
   – maximizing the margin M while minimizing the sum of the slack errors ξ_i
• The function D(x) can be used as a classifier
   – in the illustration, a new observation that falls to the right of D(x) is considered abnormal
   – points below and to the left are considered normal
• By minimizing the norm of w together with the sum of the slack errors ξ_i we find the line (or linear surface) that best separates the two classes:

   M = 2 / ||w||
   min (1/2) ||w||^2 + C sum_{i=1}^{n} ξ_i = (1/2) w^T w + C sum_{i=1}^{n} ξ_i

  [Figure: overlapping normal and abnormal classes in (x1, x2), margin M, slack variables ξ1 and ξ2, decision boundary D(x), training support vectors, and a new observation vector]
  (A short sketch of the effect of C follows below.)
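A small scikit-learn sketch (assumed, not from the slides) of the role of the penalty C in the soft-margin objective: a smaller C tolerates more slack and gives a wider margin with more support vectors.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = np.vstack([rng.normal([1, 1], 1.0, (50, 2)),      # normal class, overlapping
               rng.normal([-1, -1], 1.0, (50, 2))])   # abnormal class
y = np.r_[np.ones(50), -np.ones(50)]

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    print(f"C={C:>6}: margin M = {2 / np.linalg.norm(w):.2f}, "
          f"support vectors = {len(clf.support_)}")
```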
Nonlinear Classification

• For data that is not linearly separable, the SVM finds a nonlinear function D(x) that best separates the two classes by:
   – use of a kernel map k(·,·)
   – K = Φ(x_i)^T Φ(x)
   – feature map, e.g. Φ(x) = [x^2, √2 x, 1]^T
• The decision function D(x) requires only the dot product of the feature map Φ, using the same mathematical framework as the linear classifier:

   D(x) = w^T Φ(x) + b = sum_{i=1}^{n} y_i a_i Φ(x_i)^T Φ(x) + b

• This is called the kernel trick (a worked example follows later)
  [Figure: normal and abnormal classes in (x1, x2) separated by a nonlinear boundary D(x)]
SVM Training
Training SVMs for Classification

• We need an effective way to train the SVM without the presence of negative-class data
   – convert the outer part of the positive-class distribution to a negative class
• Confidence limit training uses a defined confidence level around the positive class, around which a negative class is generated; this yields a decision boundary D1(x) enclosing a volume VS1
• One-class training takes a percentage of the positive-class data and converts it to the negative class
   – it is an optimization problem
   – it minimizes the volume VS enclosed by the decision surface, yielding D2(x) with volume VS2
   – it does not need negative-class information
• The one-class boundary is tighter: VS1 > VS2
  [Figure: scatter of training data in (x1, x2) with the confidence-limit boundary D1(x) (volume VS1) and the one-class boundary D2(x) (volume VS2)]
One Class Training

• The negative class is important for SVM accuracy
• The data is partitioned using K-means
• The negative class is computed around each cluster centroid (centroids found by unsupervised clustering)
• The negative class is selected from the positive-class data as the points that have:
   – the fewest neighbors
   – denoted by D
• Computationally this is done by maximizing the sum of Euclidean distances between the points and the cluster centroids (see the sketch below):

   d_i = sum_{j=1}^{k} sum_{i=1}^{n} || x_i^(j) - c_j ||^2
   D = arg max_d f(d)

  [Figure: performance region in (x1, x2) with SVM decision functions around each K-means centroid]
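A minimal sketch (an assumed reading of the slide, not the author's code) of generating a negative class from positive-class data: cluster with K-means and flag, in each cluster, the points farthest from their centroid, i.e. those with the fewest neighbors.

```python
import numpy as np
from sklearn.cluster import KMeans

def make_negative_class(X_pos, k=3, frac=0.1):
    """Label the farthest `frac` of points in each K-means cluster as the negative class."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_pos)
    dist = np.linalg.norm(X_pos - km.cluster_centers_[km.labels_], axis=1)
    y = np.ones(len(X_pos))                              # start with everything positive
    for j in range(k):
        idx = np.where(km.labels_ == j)[0]
        n_neg = max(1, int(frac * len(idx)))
        farthest = idx[np.argsort(dist[idx])[-n_neg:]]   # largest distance to centroid c_j
        y[farthest] = -1                                 # converted to the negative class
    return y

X_pos = np.random.default_rng(5).normal(size=(300, 2))
y = make_negative_class(X_pos, k=3, frac=0.1)
print((y == -1).sum(), "points converted to the negative class")
```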
Class Prediction Probabilities and Maximum Likelihood Estimation
Fitting a Sigmoid Function

• In this project we are interested in finding the probability that our class prediction is correct
   – i.e. modeling the misclassification rate
• The class prediction in PHM is the prediction of normality or abnormality
• With an MLE estimate of the density function of these class probabilities we can determine the uncertainty of the prediction
  [Figure: training data in (x1, x2) with the decision boundary D(x); a sigmoid maps the distance from the hard decision boundary, D(x) = sum_i y_i a_i k + b, to a probability]
MLE and SVMs

• Using a semi-parametric approach, a sigmoid function S is fitted along the hard decision boundary to model the class probability
• We are interested in determining the density function that best describes this probability
• The likelihood is computed from the decision function values D(x_i) in the parameter space:

   P(y | D(x_i)) – likelihood function

  [Figure: decision boundary D(x) in (x1, x2), and the likelihood P(y | D(x)) plotted against D(x)]
MLE and the Sigmoid Function

• Parameters a* and b* are determined by solving a maximum likelihood estimation (MLE) for y:

   P(y = 1) = f(D, a, b) = 1 / (1 + exp(a D(x) + b))

   ln L(y = 1 | x) = ln[ f(D_1 | y = 1) · f(D_2 | y = 1) · ... · f(D_m | y = 1) ]
                   = sum_{i=1}^{m} ln f(D_i | y = 1)        (since ln(a·b) = ln a + ln b)

   F = - sum_{i=1}^{m} [ t_i ln f(D_i, a, b) + (1 - t_i) ln(1 - f(D_i, a, b)) ],   t_i = 1 if y_i = +1, t_i = 0 if y_i = -1
   min F

• The minimization is a two-parameter optimization problem in F, a function of a and b
• Depending on the parameters a* and b*, the shape of the sigmoid will change
• It can be proven that the MLE optimization problem is convex
• Newton's method with a backtracking line search can be used (a minimal sketch follows below)
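A minimal sketch (assumed; essentially Platt scaling, which the sigmoid fit above resembles) of finding a and b by minimizing the negative log-likelihood F with scipy. Newton's method with backtracking, as mentioned above, could replace the generic minimizer used here.

```python
import numpy as np
from scipy.optimize import minimize

def fit_sigmoid(D, y):
    """Fit P(y=1 | D) = 1 / (1 + exp(a*D + b)) by maximum likelihood."""
    t = (y + 1) / 2.0                                   # targets: 1 for class +1, 0 for class -1

    def neg_log_likelihood(params):
        a, b = params
        p = 1.0 / (1.0 + np.exp(a * D + b))
        p = np.clip(p, 1e-12, 1 - 1e-12)                # avoid log(0)
        return -np.sum(t * np.log(p) + (1 - t) * np.log(1 - p))

    res = minimize(neg_log_likelihood, x0=[-1.0, 0.0])  # convex problem, any reasonable start works
    return res.x                                        # a*, b*

# usage: D are SVM decision values, y in {-1, +1}
rng = np.random.default_rng(6)
D = rng.normal(size=200)
y = np.where(D + 0.3 * rng.normal(size=200) > 0, 1, -1)
a_star, b_star = fit_sigmoid(D, y)
print(a_star, b_star)
```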
Joint Probability Model
Joint Probability Model

   P(y | xS, xR)
   (final class probability for x, given the projection onto the PCA model and the projection onto the residual model)

• The class prediction P(y | xS, xR) is based on the joint class probabilities from:
   – the PCA model: p(y | xS)
   – the Residual model: p(y | xR)
• p(y = c | xS) – the probability that a point xS is classified as c in the PCA model
• p(y = c | xR) – the probability that a point xR is classified as c in the residual model
• P(y = c | xS, xR) – the final probability that a point x is classified as c
• We anticipate better accuracy and sensitivity to the onset of anomalies
Joint Probability Model

   [Equation: Bayes rule combination of the two model probabilities under the independence assumption]

• The joint probability model depends on the results of the SVC from both models (PCA and Residual)
   – Assumption: the data in the two models is linearly independent
• Changes in the joint classification probability can be used as a precursor to anomalies and used for prediction (one possible form of the combination is sketched below)
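The Bayes-rule equation itself did not survive the slide export; one standard way to write the combination, assuming the two projections are conditionally independent given the class, is shown below. This is an illustration of the idea, not necessarily the exact form used by the author.

```latex
P(y = c \mid x_S, x_R)
  = \frac{p(x_S, x_R \mid y = c)\, P(y = c)}{p(x_S, x_R)}
  \propto p(x_S \mid y = c)\, p(x_R \mid y = c)\, P(y = c)
  \propto \frac{p(y = c \mid x_S)\, p(y = c \mid x_R)}{P(y = c)}
```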
Schedule/Progress
SVM Classification Example
Example Non-Linear Classification

• We have 4 one-dimensional data points in the vector x with a label vector y:
   – x = [1, 2, 5, 6]^T
   – y = [-1, -1, +1, -1]^T
   – this means that points x(1), x(2) and x(4) belong to class I (circles) and x(3) is its own class II (squares)
• The decision function D(x) is the nonlinear combination of the weight vector, expressed in terms of the Lagrange multipliers:

   D(x) = sum_{i=1}^{n} y_i a_i k(x, x_i) + b

• The Lagrange multipliers are computed in the quadratic optimization problem:

   L_d(a) = -(1/2) a^T H a + f^T a
   H_NL = y_i y_j Φ(x_i)^T Φ(x_j) = y_i y_j k(x_i, x_j)

• We use a polynomial kernel of degree two, because we can see that some kind of parabola will separate the classes:

   Φ(x) = [1, √2 x_1, √2 x_2, √2 x_1 x_2, x_1^2, x_2^2]^T

  [Figure: the four points on the x axis with labels ±1 and the quadratic decision function D(x)]
Example Non-Linear Classification – Construct Hessian for Quadratic Optimization

   H_NL = y_i y_j Φ(x_i)^T Φ(x_j) = y_i y_j k(x_i, x_j)

   k(x_i, x_j) = Φ(x_i)^T Φ(x_j)
               = 1 + 2 x_i1 x_j1 + 2 x_i2 x_j2 + 2 x_i1 x_i2 x_j1 x_j2 + x_i1^2 x_j1^2 + x_i2^2 x_j2^2
               = 1 + 2 x_i^T x_j + (x_i^T x_j)^2

   k(x_i, x_j) = (x_i^T x_j + 1)^2

• For the 1-D example data, H_ij = y_i y_j (x_i x_j + 1)^2, e.g. H_11 = (-1)(-1)(1·1 + 1)^2 = 4 and H_12 = (-1)(-1)(1·2 + 1)^2 = 9, giving

   H = [  4     9   -36    49
          9    25  -121   169
        -36  -121   676  -961
         49   169  -961  1369 ]

• Notice that in order to calculate the scalar product Φ(x_i)^T Φ(x_j) in the feature space, we do not need to perform the mapping Φ explicitly. Instead we calculate this product directly in the input space by computing the kernel of the map.
• This is called the kernel trick. (The matrix above is verified numerically in the sketch below.)
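A small numpy check (an assumed illustration) of the Hessian for this example:

```python
import numpy as np

x = np.array([1.0, 2.0, 5.0, 6.0])        # the four 1-D training points
y = np.array([-1.0, -1.0, 1.0, -1.0])     # class labels

K = (np.outer(x, x) + 1.0) ** 2           # degree-2 polynomial kernel: k(xi, xj) = (xi*xj + 1)^2
H = np.outer(y, y) * K                    # H_ij = yi * yj * k(xi, xj)
print(H)
# [[   4.    9.  -36.   49.]
#  [   9.   25. -121.  169.]
#  [ -36. -121.  676. -961.]
#  [  49.  169. -961. 1369.]]
```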
Example Non-Linear Classification – The Kernel Trick

• Let x belong to the real 2-D input space:

   x = [x_1, x_2]^T

• Choose a mapping function Φ of degree two:

   Φ(x) = [x_1^2, √2 x_1 x_2, x_2^2]^T

• The required dot product of the map function can be expressed as a dot product in the input space:

   Φ(x_i)^T Φ(x_j) = [x_i1^2, √2 x_i1 x_i2, x_i2^2] [x_j1^2, √2 x_j1 x_j2, x_j2^2]^T
                   = x_i1^2 x_j1^2 + 2 x_i1 x_i2 x_j1 x_j2 + x_i2^2 x_j2^2
                   = (x_i^T x_j)^2

   k(x_i, x_j) = (x_i^T x_j)^2

   – this is the kernel trick
• The kernel trick means that, for a suitable feature map, the feature-space dot product can be expressed in terms of a dot product of the input-space data raised to some degree
   – here, to the second degree
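A two-line numpy check (assumed, for illustration) that the explicit degree-two feature map and the kernel (x_i^T x_j)^2 agree:

```python
import numpy as np

phi = lambda x: np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])  # degree-2 feature map

xi, xj = np.array([1.5, -0.5]), np.array([2.0, 3.0])
print(phi(xi) @ phi(xj), (xi @ xj) ** 2)    # identical values: the kernel trick
```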
Example Non-Linear Classification – Decision Function D(x)

• Compute the Lagrange multipliers a through the quadratic optimization problem:

   a = [0, 2.49, 7.33, 4.83]

• Plug them into the equation for D(x):

   D(x) = sum_{i=1}^{4} y_i a_i k(x, x_i) + b
   D(x) = sum_{i=1}^{4} y_i a_i (x x_i + 1)^2 + b

• Determine b using the class constraints y = [-1, -1, +1, -1], which gives b = -9:

   D(x) = -2.49 (2x + 1)^2 + 7.33 (5x + 1)^2 - 4.83 (6x + 1)^2 - 9
   D(x) = -0.667 x^2 + 5.33 x - 9

• The end result is a nonlinear (quadratic) decision function:
   – for x(1) = 1: D = -4.33 < 0 → class I
   – for x(2) = 2: D = -1.00 < 0 → class I
   – for x(3) = 5: D = +0.99 > 0 → class II
   – for x(4) = 6: D = -1.01 < 0 → class I
• The nonlinear classifier correctly classifies all the data
  [Figure: the four points with labels ±1 and the fitted parabola D(x)]
  (A sketch reproducing this example with an off-the-shelf solver follows below.)
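A short scikit-learn sketch (an assumed tool choice) that should approximately reproduce the hand-computed multipliers, intercept, and decision values of this example:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0], [2.0], [5.0], [6.0]])   # the four 1-D points
y = np.array([-1, -1, 1, -1])

# Polynomial kernel (x*x' + 1)^2, large C to approximate a hard margin
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=1e6).fit(X, y)

print(clf.dual_coef_)                 # ~ [-2.5, 7.33, -4.83]  (these are y_i * a_i)
print(clf.intercept_)                 # ~ -9
print(clf.decision_function(X))       # ~ [-4.33, -1.0, 1.0, -1.0]
```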
Quadratic Optimization and Global Solutions

• What do all these methods have in common?
   – a quadratic optimization over the weight vector w (through its dual variables)
   – H is the Hessian matrix
   – y is the class membership of each training point

   L_d(a) = -(1/2) a^T H a + f^T a
   H_linear = y_i y_j x_i^T x_j
   H_NL = y_i y_j Φ(x_i)^T Φ(x_j) = y_i y_j k(x_i, x_j)

• This is a quadratic optimization problem, the solution of which gives:
   – the Lagrange multipliers a, which in turn are used in D(x)
• In Matlab, “quadprog” is used to solve the quadratic optimization
• Because the quadratic problem is convex, it has a unique optimum, so the solver is guaranteed to find the global solution.
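A minimal Python analogue (an assumed substitute for MATLAB's quadprog, using scipy) of solving the dual QP for the worked example above:

```python
import numpy as np
from scipy.optimize import minimize

# Dual SVM quadratic program: minimize (1/2) a^T H a - 1^T a
# subject to sum_i a_i y_i = 0 and 0 <= a_i <= C
x = np.array([1.0, 2.0, 5.0, 6.0])
y = np.array([-1.0, -1.0, 1.0, -1.0])
H = np.outer(y, y) * (np.outer(x, x) + 1.0) ** 2      # nonlinear Hessian, k(xi,xj) = (xi*xj + 1)^2
C = 1e6                                               # effectively a hard margin

objective = lambda a: 0.5 * a @ H @ a - a.sum()
constraints = {"type": "eq", "fun": lambda a: a @ y}  # sum_i a_i y_i = 0
bounds = [(0.0, C)] * len(x)

res = minimize(objective, x0=np.zeros(len(x)), bounds=bounds,
               constraints=constraints, method="SLSQP")
print(res.x)          # Lagrange multipliers a, ~ [0, 2.49, 7.33, 4.83]
```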