Bayesian Support Vector Machine Classification

Document Sample

Bayesian Support Vector Machine Classification

Vasilis A. Sotiris
AMSC663 Midterm Presentation
December 2007

University of Maryland
College Park, MD 20783
Objectives
• Develop an algorithm to detect anomalies in electronic systems (multivariate)
• Improve detection sensitivity of classical Support Vector Machines (SVM)
• Decrease false alarms
• Predict future system performance
Methodology
• Use linear Principal Component Analysis to decompose and compress raw data into two models: a) PCA model, and b) Residual model
• Use Support Vector Machines to classify data (in each model) into normal and abnormal classes
• Assign probabilities to the classification output of the SVMs using a sigmoid function
• Use Maximum Likelihood Estimation to find the optimal sigmoid function parameters (in each model)
• Determine the joint class probability from both models
• Track changes to the joint probability to:
  – improve detection sensitivity
  – decrease false alarms
  – predict future system performance
Flow chart of Probabilistic SVC Detection Methodology
[Flow chart: a new observation (R^{1×m}) and baseline training data from the population database (R^{k×m}, R^{l×m}) in the input space R^{n×m} are passed through PCA. One branch uses the PCA model and the other the Residual model; each branch applies an SVC decision function (D1(y1), D2(y2)) with its decision boundary D(x), a sigmoid likelihood function, and a probability matrix. The two branches are combined into joint probabilities, the joint probability distributions are trended, and the probability model produces a health decision.]
Principal Component Analysis
Principal Component Analysis – Statistical Properties
• Decompose data into two models:
  – PCA model (maximum variance) – y1
  – Residual model – y2
• The direction of y1 is the eigenvector with the largest associated eigenvalue λ1
• Vector a is chosen as the eigenvector of the covariance matrix C (a small numeric check follows)

[Figure: principal directions y1 (PC1) and y2 (PC2) of a data cloud in the (x1, x2) plane]

y_1 = \sum_{i=1}^{2} a_i x_i = a_1 x_1 + a_2 x_2 = \begin{bmatrix} a_1 & a_2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = a^T x

\max \operatorname{var}(y_1) = \operatorname{var}\!\left( \sum_{i=1}^{2} a_i x_i \right)

\operatorname{var}(y_1) = E[y_1^2] - (E[y_1])^2 = E[a^T x x^T a] - E[a^T x]\,E[x^T a]

\operatorname{var}(y_1) = a^T E[x x^T] a - a^T E[x] E[x^T] a

\operatorname{var}(y_1) = a^T C a, \qquad C = E[x x^T] - E[x] E[x]^T

\operatorname{var}(y_1) = \lambda_1
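As a sanity check on the property var(y1) = aᵀCa = λ1, the following minimal numpy sketch (variable names are illustrative, not from the slides) projects centered data onto the leading eigenvector of its covariance matrix and compares the projected variance with the largest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[3.0, 1.2], [1.2, 1.0]], size=2000)  # n x 2 data
Xc = X - X.mean(axis=0)                      # center the data

C = np.cov(Xc, rowvar=False)                 # 2 x 2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)         # eigenvalues in ascending order
a = eigvecs[:, -1]                           # eigenvector with the largest eigenvalue
lam1 = eigvals[-1]

y1 = Xc @ a                                  # y1 = a^T x for every observation
print(np.var(y1, ddof=1), lam1)              # the two numbers should agree closely
```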
Singular Value Decomposition (SVD) – Eigenanalysis
• SVD is used in this algorithm to perform PCA: X = U Σ V^T
• SVD
  – performs the eigenanalysis without first computing the covariance matrix (a short numeric sketch follows)
  – speeds up computations
  – computes basis functions (used in the projection – next)
• The output of SVD is:
  – U – basis functions for the PCA and residual models
  – Λ – eigenvalues of the covariance matrix
  – V – eigenvectors of the covariance matrix

X = U \Sigma V^T, \qquad
\Sigma = \begin{bmatrix} \sigma_1 & & 0 \\ & \ddots & \\ 0 & & \sigma_m \end{bmatrix}, \qquad
U = \begin{bmatrix} u_{11} & u_{12} & \cdots & u_{1m} \\ u_{21} & u_{22} & & \vdots \\ \vdots & & \ddots & \\ u_{n1} & \cdots & & u_{nm} \end{bmatrix}_{n \times m}, \qquad
V^T = \begin{bmatrix} v_{11} & v_{12} & \cdots & v_{1m} \\ v_{21} & v_{22} & & \vdots \\ \vdots & & \ddots & \\ v_{m1} & \cdots & & v_{mm} \end{bmatrix}_{m \times m}
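A minimal numpy sketch of the claim above, assuming a centered data matrix whose rows are observations and whose columns are parameters: the SVD yields the same eigen-quantities as an explicit eigenanalysis of the covariance matrix, without ever forming that matrix. Names and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
X -= X.mean(axis=0)                           # center the columns

# Eigenanalysis the "slow" way: form the covariance matrix first
C = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)          # ascending order

# Eigenanalysis via SVD, without forming C: X = U S V^T
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# The right singular vectors match the covariance eigenvectors (up to sign),
# and the eigenvalues are recovered as S**2 / (n - 1)
print(np.allclose(np.sort(S**2 / (X.shape[0] - 1)), np.sort(eigvals)))
```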
Subspace Decomposition
• [S] – PCA model subspace
  – detects dominant parameter variation
• [R] – residual subspace
  – detects hidden anomalies
• Therefore, analysis of the system behavior can be decoupled into what is called the signal subspace and the residual subspace
• To get xS and xR we project the input data onto [S] and [R]:

x = x_S + x_R

[Figure: raw data projected onto the PCA model, splitting it into the signal subspace [S] component x_S and the residual subspace [R] component x_R]
Least Squares Projections
• u – basis vector for PC1 and PC2
• v – vector from the centered training data to the new observation
• Objective: find the optimal p that minimizes v − pu
  – this gives the projection V_p
• The projection equation is finally put in terms of the SVD:
  – H = U_k U_k^T
  – k – number of principal components (dimensions of the PCA model)
• The projection pursuit is optimized based on the PCA model

v - pu \approx 0 \;\Rightarrow\; u^T u\, p = u^T v

p_{opt}\, u = V_p

V_p = u \left( u^T u \right)^{-1} u^T v

H = u \left( u^T u \right)^{-1} u^T = U_k U_k^T

[Figure: least-squares projection V_p of a new observation v onto the PCA model [S] spanned by PC1 and PC2, with the residual component in [R]]
Data Decomposition
• With the projection matrix H, we can project any incoming signal onto the signal subspace [S] and the residual subspace [R] (a short projection sketch follows)
• G is the matrix analogous to H that creates the projection onto [R]
• H is the projection onto [S], and G is the projection onto [R]

x = x_S + x_R

H = U_k U_k^T

G = I - U_k U_k^T

x = Hx + (I - H)x
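A minimal numpy sketch of this decomposition on synthetic data, assuming a centered training matrix and a choice of k principal components. The projector is built from the k leading principal directions in parameter space (taken here from the right singular vectors, written as U_k in the slide notation), so that it acts on observation vectors x.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 6))
X -= X.mean(axis=0)                      # centered training data, n x m

k = 2                                    # number of principal components kept
U, S, Vt = np.linalg.svd(X, full_matrices=False)
Uk = Vt[:k].T                            # m x k basis for the PCA model subspace [S]

H = Uk @ Uk.T                            # projection onto the signal subspace [S]
G = np.eye(X.shape[1]) - H               # projection onto the residual subspace [R]

x = rng.normal(size=6)                   # a new observation
x_S, x_R = H @ x, G @ x
print(np.allclose(x, x_S + x_R))         # x = x_S + x_R
```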
Support Vector Machines
Support Vector Machines
• The performance of a system can be fully explained by the distribution of its parameters
• SVMs estimate the decision boundary for the given distribution
• Areas with less information are allowed a larger margin of error
• New observations can be classified using the decision boundary and are labeled as:
  – (−1) outside
  – (+1) inside

[Figure: a data distribution in the (x1, x2) plane with a hard decision boundary and a soft decision boundary around it]
Linear Classification – Separable Input Space
• The SVM finds a function D(x) that best separates the two classes (maximizes the margin M)
• D(x) can be used as a classifier
• Through the support vectors we can
  – compress the input space by excluding all other data except for the support vectors
  – the support vectors tell us everything we need to know about the system in order to perform detection
• By minimizing the norm of w we find the line or linear surface that best separates the two classes (a short sketch follows)
• The decision function is the linear combination of the weight vector w

M = \frac{2}{\| w \|}

\min \; \tfrac{1}{2} \| w \|^2 = \tfrac{1}{2}\, w^T w

w = \sum_{i=1}^{n} \alpha_i y_i x_i

D(x) = \sum_i w_i x_i + b = \sum_{i=1}^{n} y_i \alpha_i\, x_i^T x + b

[Figure: normal and abnormal classes in the (x1, x2) plane separated by D(x) with margin M; the training support vectors (multipliers α_i) and a new observation vector are marked]
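As an illustration of the relations w = Σ α_i y_i x_i and D(x) = wᵀx + b, the sketch below fits a nearly hard-margin linear SVM on separable toy data with scikit-learn (an assumption of this example; the slides themselves mention Matlab's quadprog) and rebuilds w from the dual coefficients.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X_pos = rng.normal(loc=[2, 2], scale=0.4, size=(40, 2))    # "normal" class (+1)
X_neg = rng.normal(loc=[-2, -2], scale=0.4, size=(40, 2))  # "abnormal" class (-1)
X = np.vstack([X_pos, X_neg])
y = np.hstack([np.ones(40), -np.ones(40)])

clf = SVC(kernel="linear", C=1e6)          # very large C approximates the hard margin
clf.fit(X, y)

# dual_coef_ holds y_i * alpha_i for the support vectors, so
# w = sum_i alpha_i y_i x_i can be rebuilt directly:
w = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w, clf.coef_))           # matches the fitted weight vector
print(clf.decision_function([[0.5, 0.5]])) # D(x) = w^T x + b for a new observation
```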
Linear Classification – Inseparable Input Space
• For inseparable data the SVM finds a function D(x) that best separates the two classes by:
  – maximizing the margin M and minimizing the sum of slack errors ξ_i
• Function D(x) can be used as a classifier
  – in this illustration, a new observation point that falls to the right of it is considered abnormal
  – points below and to the left are considered normal
• By minimizing the norm of w and the sum of slack errors ξ_i we find the line or linear surface that best separates the two classes (the effect of the penalty C is sketched below)

M = \frac{2}{\| w \|}

\min \; \tfrac{1}{2} \| w \|^2 + C \sum_{i=1}^{n} \xi_i = \tfrac{1}{2}\, w^T w + C \sum_{i=1}^{n} \xi_i

[Figure: overlapping normal and abnormal classes in the (x1, x2) plane with slack errors ξ1 and ξ2, margin M, decision boundary D(x), training support vectors, and a new observation vector]
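A short scikit-learn sketch (again an illustrative substitute for the Matlab tooling mentioned later in the slides) showing how the slack penalty C in the objective above trades margin width against training errors on overlapping data.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = np.vstack([rng.normal([1, 1], 1.0, (50, 2)),     # overlapping classes:
               rng.normal([-1, -1], 1.0, (50, 2))])  # not linearly separable
y = np.hstack([np.ones(50), -np.ones(50)])

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_)         # M = 2 / ||w||
    print(f"C={C:7.2f}  margin={margin:.3f}  support vectors={len(clf.support_vectors_)}")
```

Small C tolerates more slack and keeps a wide margin; large C shrinks the margin to reduce the slack errors.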
Nonlinear Classification
• For inseparable data the SVM finds a nonlinear function D(x) that best separates the two classes by:
  – use of a kernel map k(·)
  – K = Φ(x_i)^T Φ(x)
  – feature map Φ(x) = [x², √2 x, 1]^T
• The decision function D(x) requires the dot product of the feature map Φ, using the same mathematical framework as the linear classifier
• This is called the kernel trick (a small numeric check follows; a worked example comes later)

D(x) = \sum_i w_i x_i + b = \sum_{i=1}^{n} y_i \alpha_i\, \Phi(x_i)^T \Phi(x) + b

[Figure: nonlinear decision boundary D(x) in the (x1, x2) plane separating the normal class (a normal observation is marked) from the abnormal class]
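A three-line numpy check of the kernel trick for the 1-D feature map Φ(x) = [x², √2·x, 1]ᵀ quoted above: the dot product of the mapped points equals the degree-two polynomial kernel (x_i·x_j + 1)² evaluated directly in the input space (the kernel form is the one used in the worked example later in the slides).

```python
import numpy as np

phi = lambda x: np.array([x**2, np.sqrt(2) * x, 1.0])   # feature map Phi(x)
xi, xj = 2.0, 5.0
print(phi(xi) @ phi(xj))        # dot product in feature space: 121.0
print((xi * xj + 1) ** 2)       # kernel evaluated in the input space: same value
```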
SVM Training
Training SVMs for Classification
Confidence Limit training
• Need an effective way to train the SVM without the presence of negative-class data
  – convert the outer distribution of the positive class to a negative class
• Confidence limit training uses a defined confidence level around which a negative class is generated

[Figure: positive-class data in the (x1, x2) plane; a negative class is generated on a confidence limit around it, and the resulting decision boundary D1(x) encloses the volume VS1]
One Class training
• One class training takes a percentage of the positive class data and converts it to the negative class
  – is an optimization problem
  – minimizes the volume enclosed by the decision surface, VS
  – does not need negative class information

[Figure: the same data trained one-class; the decision boundary D2(x) encloses the volume VS2, with VS1 > VS2]
One Class Training
• The negative class is important for SVM accuracy
• The data is partitioned using K-means
• The negative class is computed around each cluster centroid
• The negative class is selected from the positive class data as the points that have:
  – the fewest neighbors
  – denoted by D
• Computationally this is done by maximizing the sum of Euclidean distances between all points (a clustering sketch follows)

d = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2

D = \arg\max_{d} f(d)

[Figure: performance region in the (x1, x2) plane; centroids computed using unsupervised clustering, with SVM decision functions around each centroid]
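A rough sketch of one possible reading of this step, using scikit-learn's KMeans (an assumption; the original work uses Matlab): cluster the positive data, then relabel as the generated negative class the points farthest from their cluster centroid, i.e. the most isolated points with the fewest neighbors.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
X_pos = np.vstack([rng.normal([50, 50], 2.0, (100, 2)),
                   rng.normal([46, 54], 1.5, (100, 2))])   # positive (healthy) data only

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_pos)
centroids = km.cluster_centers_

# Distance of every point to its own cluster centroid
dist = np.linalg.norm(X_pos - centroids[km.labels_], axis=1)

# Relabel the most isolated points (largest distance) as the negative class
n_neg = int(0.10 * len(X_pos))
neg_idx = np.argsort(dist)[-n_neg:]
y = np.ones(len(X_pos), dtype=int)
y[neg_idx] = -1                       # converted to the negative class for SVM training
print(f"negative class points: {np.sum(y == -1)} of {len(y)}")
```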
Class Prediction Probabilities and Maximum Likelihood Estimation
Fitting a Sigmoid Function
• In this project we are interested in finding the probability that our class prediction is correct
  – modeling the misclassification rate
• The class prediction in PHM is the prediction of normality or abnormality
• With an MLE estimate of the density function of these class probabilities we can determine the uncertainty of the prediction

D(x) = \sum_{i=1}^{n} y_i \alpha_i\, k(x, x_i) + b

[Figure: data and the hard decision boundary D(x) in the (x1, x2) plane; the class probability is plotted against the distance from the hard decision boundary]
MLE and SVMs
• Using a semi-parametric approach, a sigmoid function S is fitted along the hard decision boundary to model the class probability
• We are interested in determining the density function that best prescribes this probability
• The likelihood is computed based on the knowledge of the decision function values D(x_i) in the parameter space

[Figure: decision boundary D(x) in the (x1, x2) plane and the likelihood function P(y | D(x_i)) plotted against D(x)]
MLE and the Sigmoid Function
• Parameters a* and b* are determined by solving a maximum likelihood estimation (MLE) of y
• The minimization is a two-parameter optimization problem in F, a function of a and b
• Depending on the parameters a* and b*, the shape of the sigmoid will change
• It can be proven that the MLE optimization problem is convex
• Newton's method with a backtracking line search can be used (a small optimizer sketch follows)

P(y = 1) = f(D, a, b) = \frac{1}{1 + \exp\!\big( a D(x) + b \big)}

\ln L(y = 1 \mid x) = \ln\!\big[ f(D_1 \mid y = 1)\, f(D_2 \mid y = 1) \cdots f(D_m \mid y = 1) \big] = \sum_{i=1}^{m} \ln f(D_i \mid y = 1) \qquad \text{(using } \ln(ab) = \ln a + \ln b \text{)}

F = -\sum_{i=1}^{m} \Big[ t_i \ln f(D_i, a, b) + (1 - t_i) \ln\!\big( 1 - f(D_i, a, b) \big) \Big], \qquad \min F

where t_i ∈ {0, 1} is the target class of observation i.
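A minimal sketch of fitting the sigmoid parameters (a, b) by minimizing the negative log-likelihood F above. For brevity it uses scipy.optimize.minimize with a quasi-Newton method instead of the Newton/backtracking scheme named in the slide; the decision values D and the targets are synthetic.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
D = np.concatenate([rng.normal(1.5, 1.0, 200), rng.normal(-1.5, 1.0, 200)])  # D(x_i)
t = np.concatenate([np.ones(200), np.zeros(200)])                            # 1 = positive class

def sigmoid(D, a, b):
    return 1.0 / (1.0 + np.exp(a * D + b))

def neg_log_likelihood(params):
    a, b = params
    p = np.clip(sigmoid(D, a, b), 1e-12, 1 - 1e-12)   # avoid log(0)
    return -np.sum(t * np.log(p) + (1 - t) * np.log(1 - p))

res = minimize(neg_log_likelihood, x0=[-1.0, 0.0], method="BFGS")
a_star, b_star = res.x
print(a_star, b_star)                  # fitted sigmoid parameters a*, b*
print(sigmoid(2.0, a_star, b_star))    # P(y = 1) for a point with decision value D = 2
```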
Joint Probability Model
Joint Probability Model

P(y | x_S, x_R)

[Diagram: the projection onto the PCA model and the projection onto the residual model are each classified, and the results are combined into the final class probability for x]

• Class prediction P(y|xS,xR) is based on the joint class probabilities from:
  – PCA model: p(y|xS)
  – Residual model: p(y|xR)
• p(y=c|xS) – the probability that a point xS is classified as c in the PCA model
• p(y=c|xR) – the probability that a point is classified as c in the residual model
• P(y|xS,xR) – the final probability that a point x is classified as c
• We anticipate better accuracy and sensitivity to the onset of anomalies

Joint Probability Model
Bayes Rule

Assumption

• The joint probability model depends on the results of the SVC from both models (PCA and Residual)
  – Assumption: the data in the two models is linearly independent
• Changes in the joint classification probability can be used as a precursor to anomalies and for prediction (a small combination sketch follows)
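Under the slide's independence assumption, one plausible reading of the combination step is a product of the per-model class probabilities, renormalized over the two classes. The sketch below illustrates that reading only; it is not claimed to be the project's exact rule.

```python
import numpy as np

def joint_class_probability(p_S, p_R):
    """Combine P(y=+1 | x_S) and P(y=+1 | x_R), assuming the two models are independent."""
    joint_pos = p_S * p_R                    # unnormalized P(y=+1 | x_S, x_R)
    joint_neg = (1 - p_S) * (1 - p_R)        # unnormalized P(y=-1 | x_S, x_R)
    return joint_pos / (joint_pos + joint_neg)

# PCA model fairly confident the point is healthy, residual model less so:
print(joint_class_probability(0.9, 0.6))     # ~0.93
print(joint_class_probability(0.9, 0.2))     # residual model flags an anomaly -> ~0.69
```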
Schedule/Progress
SVM Classification Example
Example Non-Linear Classification
• Have 4 one-dimensional data points represented in vector x and a label vector y given by:
  – x = [1, 2, 5, 6]^T
  – y = [−1, −1, +1, −1]^T
  – this means that coordinates x(1), x(2) and x(4) belong to the same class I (circles) and x(3) is its own class II (squares)
• The decision function D(x) is given as the nonlinear combination of the weight vector, which is expressed in terms of the Lagrange multipliers
• The Lagrange multipliers are found by solving a quadratic optimization problem
• We are going to use a polynomial kernel of degree two because we can see that some kind of parabola will separate the classes

D(x) = \sum_{i=1}^{n} y_i \alpha_i\, k(x, x_i) + b

L_d(\alpha) = -\tfrac{1}{2}\, \alpha^T H \alpha + f^T \alpha

H_{NL} = y_i y_j\, \Phi(x_i)^T \Phi(x_j) = y_i y_j\, k(x_i, x_j)

\Phi(x) = \begin{bmatrix} 1 & \sqrt{2}\, x_1 & \sqrt{2}\, x_2 & \sqrt{2}\, x_1 x_2 & x_1^2 & x_2^2 \end{bmatrix}^T

[Figure: the four 1-D points on the x axis with labels ±1 (y, D axis) and the nonlinear decision function D(x)]
Example Non-Linear Classification – Construct H

H_{NL} = y_i y_j\, \Phi(x_i)^T \Phi(x_j) = y_i y_j\, k(x_i, x_j)

k(x_i, x_j) = \Phi(x_i)^T \Phi(x_j) = 1 + 2 x_{i1} x_{j1} + 2 x_{i2} x_{j2} + 2 x_{i1} x_{i2} x_{j1} x_{j2} + x_{i1}^2 x_{j1}^2 + x_{i2}^2 x_{j2}^2 = 1 + 2\, x_i^T x_j + \big( x_i^T x_j \big)^2

\Rightarrow \; k(x_i, x_j) = \big( x_i^T x_j + 1 \big)^2

For this example H_{ij} = y_i y_j (x_i x_j + 1)^2, e.g. H_{11} = (-1)(-1)\big((1)(1) + 1\big)^2 = 4 and H_{12} = (-1)(-1)\big((1)(2) + 1\big)^2 = 9, giving

H = \begin{bmatrix} 4 & 9 & -36 & 49 \\ 9 & 25 & -121 & 169 \\ -36 & -121 & 676 & -961 \\ 49 & 169 & -961 & 1369 \end{bmatrix}

• Notice that in order to calculate the scalar product Φ(x_i)^T Φ(x_j) in the feature space, we do not need to perform the mapping using the equation for Φ. Instead we calculate this product directly in the input space, from the input data, by computing the kernel of the map (a numeric construction of H follows).
• This is called the kernel trick
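The kernel matrix above can be reproduced in a couple of numpy lines; this sketch just evaluates H_ij = y_i y_j (x_i x_j + 1)² for the four 1-D training points.

```python
import numpy as np

x = np.array([1.0, 2.0, 5.0, 6.0])       # 1-D training points
y = np.array([-1.0, -1.0, 1.0, -1.0])    # class labels

K = (np.outer(x, x) + 1.0) ** 2           # polynomial kernel of degree two
H = np.outer(y, y) * K                    # H_ij = y_i y_j k(x_i, x_j)
print(H)
```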
Example Non-Linear Classification – The Kernel Trick
• Let x belong to the real 2-D input space: x ∈ R², x = [x_1, x_2]^T
• Choose a mapping function Φ of degree two: Φ(x) = [x_1², √2 x_1 x_2, x_2²]^T
• The required dot product of the map function can be expressed as a dot product in the input space
  – this is the kernel trick
• The kernel trick basically says that such a mapping can be expressed in terms of a dot product of the input space data raised to some degree
  – here, to the second degree

\Phi(x_i)^T \Phi(x_j) = \big[ x_{i1}^2,\ \sqrt{2}\, x_{i1} x_{i2},\ x_{i2}^2 \big] \big[ x_{j1}^2,\ \sqrt{2}\, x_{j1} x_{j2},\ x_{j2}^2 \big]^T = x_{i1}^2 x_{j1}^2 + 2 x_{i1} x_{i2} x_{j1} x_{j2} + x_{i2}^2 x_{j2}^2 = \big( x_i^T x_j \big)^2

k(x_i, x_j) = \big( x_i^T x_j \big)^2
Example Non-Linear Classification – Decision Function D(x)
• Compute the Lagrange multipliers α from the quadratic optimization problem: α = [0, 2.49, 7.33, 4.83]
• Plug them into the equation for D(x)
• Determine b using the class constraints:
  – y = [−1, −1, +1, −1]
  – b = −9
• The end result is a nonlinear decision function:

D(x) = \sum_{i=1}^{4} y_i \alpha_i\, k(x, x_i) + b = \sum_{i=1}^{4} y_i \alpha_i \big( x x_i + 1 \big)^2 + b

D(x) = -2.49\,(2x + 1)^2 + 7.33\,(5x + 1)^2 - 4.833\,(6x + 1)^2 - 9 \approx -0.667 x^2 + 5.33 x - 9

• For x(1) = 1: D(x) = −4.33 < 0 → class I
• For x(2) = 2: D(x) = −1.00 < 0 → class I
• For x(3) = 5: D(x) = 0.994 > 0 → class II
• For x(4) = 6: D(x) = −1.009 < 0 → class I
• The nonlinear classifier correctly classified the data! (A quick numeric check follows.)

[Figure: the four 1-D points with labels ±1 and the fitted parabola D(x)]
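A quick numpy check of the worked example. It uses the unrounded multipliers α = (0, 2.5, 22/3, 29/6), of which the slide values (2.49, 7.33, 4.83) appear to be roundings, together with b = −9; the decision values then match those quoted above.

```python
import numpy as np

x = np.array([1.0, 2.0, 5.0, 6.0])
y = np.array([-1.0, -1.0, 1.0, -1.0])
alpha = np.array([0.0, 2.5, 22.0 / 3.0, 29.0 / 6.0])   # Lagrange multipliers
b = -9.0

def D(x_new):
    """Decision function D(x) = sum_i y_i alpha_i (x x_i + 1)^2 + b."""
    return np.sum(y * alpha * (x_new * x + 1.0) ** 2) + b

for x_new in x:
    print(x_new, round(D(x_new), 3), "class II" if D(x_new) > 0 else "class I")
```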
• What do all these methods have in common?
  – quadratic optimization of the weight vector w:

L_d(\alpha) = -\tfrac{1}{2}\, \alpha^T H \alpha + f^T \alpha

  – where H is the Hessian matrix and y is the class membership of each training point:

H_{linear} = y_i y_j\, x_i^T x_j

H_{NL} = y_i y_j\, \Phi(x_i)^T \Phi(x_j) = y_i y_j\, k(x_i, x_j)

• This type of equation defines a quadratic programming problem, the solution of which gives the Lagrange multipliers α, which in turn are used in D(x)
• In Matlab, "quadprog" is used to solve this quadratic program