
Bayesian Support Vector Machine Classification
Vasilis A. Sotiris
AMSC663 Midterm Presentation, December 2007
University of Maryland, College Park, MD 20783

Objectives
• Develop an algorithm to detect anomalies in electronic systems (multivariate)
• Improve detection sensitivity of classical Support Vector Machines (SVM)
• Decrease false alarms
• Predict future system performance

Methodology
• Use linear Principal Component Analysis (PCA) to decompose and compress raw data into two models: a) a PCA model and b) a residual model
• Use Support Vector Machines to classify data (in each model) into normal and abnormal classes
• Assign probabilities to the classification output of the SVMs using a sigmoid function
• Use maximum likelihood estimation (MLE) to find the optimal sigmoid function parameters (in each model)
• Determine the joint class probability from both models
• Track changes to the joint probability to:
– improve detection sensitivity
– decrease false alarms
– predict future system performance

Flow Chart of Probabilistic SVC Detection Methodology
[Flow chart: training data (input space R^{n×m}) and each new observation (R^{1×m}) pass through PCA into a PCA model (R^{k×m}) and a residual model (R^{l×m}); each model feeds an SVC whose decision values D1(y1) and D2(y2) are mapped to class probabilities through a sigmoid likelihood function; the two probabilities are combined into a joint probability used for the health decision and for trending of the joint probability distributions against a baseline population database.]

Principal Component Analysis

Principal Component Analysis – Statistical Properties
• Decompose the data into two models:
– PCA model (maximum variance) – y1
– Residual model – y2
[Figure: data in the (x1, x2) plane with principal directions y1 (PC1) and y2 (PC2)]
• The first principal component is the linear combination y1 = a^T x = a1 x1 + a2 x2 = Σ_i a_i x_i chosen to maximize var(y1) = var(a^T x)
• var(y1) = E[y1^2] − (E[y1])^2 = a^T E[x x^T] a − a^T E[x] E[x]^T a = a^T C a, where C = E[x x^T] − E[x] E[x]^T is the covariance matrix
• Vector a is chosen as an eigenvector of the covariance matrix C; the direction of y1 is the eigenvector with the largest associated eigenvalue λ1, and var(y1) = λ1

Singular Value Decomposition (SVD) – Eigenanalysis
• SVD is used in this algorithm to perform PCA: X = U Σ V^T, with U ∈ R^{n×m}, Σ = diag(σ1, …, σm), and V^T ∈ R^{m×m}
• SVD:
– performs the eigenanalysis without first computing the covariance matrix
– speeds up computations
– computes the basis functions (used in the projection – next)
• The output of SVD is:
– U – basis functions for the PCA and residual models
– Λ – eigenvalues of the covariance matrix (from the singular values)
– V – eigenvectors of the covariance matrix

Subspace Decomposition
• [S] – PCA model subspace: detects dominant parameter variation
• [R] – residual subspace: detects hidden anomalies
• The analysis of system behavior can therefore be decoupled into what is called the signal subspace and the residual subspace
• To get xS and xR we project the input data onto [S] and [R]: x = xS + xR
[Figure: raw data decomposed into a PCA model projection xS and a residual model projection xR]

Least Squares Projections
• u – basis vector for PC1 and PC2
• v – vector from the centered training data to the new observation
• Objective: find the optimal p that minimizes ||v − p u||
• This gives (v − p u) ⊥ u, i.e. u^T (v − p u) = 0, so p_opt = (u^T u)^{-1} u^T v
• The projection is v_p = p u = u (u^T u)^{-1} u^T v, so the projection matrix is H = u (u^T u)^{-1} u^T
• In terms of the SVD this becomes H = U_k U_k^T, where k is the number of principal components (the dimension of the PCA model)
• The projection pursuit is optimized based on the PCA model (a numerical sketch follows)
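The following is a minimal sketch of the subspace projection described above, not the author's code. It assumes the data matrix is arranged with observations in rows, so numpy's right singular vectors play the role of the slides' U_k basis (the slides' orientation convention may differ); the function name pca_projections and the mean-centering step are illustrative assumptions.

```python
import numpy as np

def pca_projections(X_train, x_new, k):
    """Split a new observation into PCA-subspace [S] and residual-subspace [R] parts.

    X_train : (n, m) training data, rows are observations.
    x_new   : (m,) new observation.
    k       : number of principal components kept in the PCA model.
    """
    mu = X_train.mean(axis=0)
    Xc = X_train - mu                          # center the training data
    # SVD of the centered data; rows of Vt are the principal directions
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Uk = Vt[:k].T                              # (m, k) basis of the PCA subspace
    H = Uk @ Uk.T                              # projection onto [S]
    G = np.eye(X_train.shape[1]) - H           # projection onto [R]
    x = x_new - mu
    return H @ x, G @ x                        # xS, xR with x = xS + xR

# toy usage: the two projections reconstruct the centered observation exactly
X = np.random.default_rng(1).normal(size=(100, 5))
xS, xR = pca_projections(X, X[0], k=2)
print(np.allclose(xS + xR, X[0] - X.mean(axis=0)))   # True
```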
Data Decomposition
• With the projection matrix H = U_k U_k^T we can project any incoming signal onto the signal subspace [S] and the residual subspace [R]: x = xS + xR
• G = I − U_k U_k^T is the matrix analogous to H that is used to create the projection onto [R]
• H is the projection onto [S] and G is the projection onto [R]: x = H x + (I − H) x = H x + G x

Support Vector Machines

Support Vector Machines
• The performance of a system can be fully explained with the distribution of its parameters
• SVMs estimate the decision boundary for the given distribution
• Areas with less information are allowed a larger margin of error (soft decision boundary vs. hard decision boundary)
• New observations can be classified using the decision boundary and are labeled as:
– (−1) outside
– (+1) inside

Linear Classification – Separable Input Space
• The SVM finds a function D(x) that best separates the two classes (maximum margin M)
• D(x) can be used as a classifier
• Through the support vectors we can compress the input space by excluding all other data except the support vectors
– The support vectors tell us everything we need to know about the system in order to perform detection
• By minimizing the norm of w we find the line or linear surface that best separates the two classes:
  min ½ ||w||^2 = ½ w^T w,  with margin M = 2/||w||  and  w = Σ_{i=1}^{n} α_i y_i x_i
• The decision function is the linear combination of the weight vector:
  D(x) = w^T x + b = Σ_{i=1}^{n} y_i α_i (x_i · x) + b

Linear Classification – Inseparable Input Space
• For inseparable data the SVM finds a function D(x) that best separates the two classes by:
– maximizing the margin M and minimizing the sum of the slack errors ξ_i
• Function D(x) can be used as a classifier
– In this illustration, a new observation point that falls to the right of it is considered abnormal
– Points below and to the left are considered normal
• By minimizing the norm of w and the sum of the slack errors ξ_i we find the line or linear surface that best separates the two classes (see the soft-margin sketch below):
  min ½ w^T w + C Σ_{i=1}^{n} ξ_i
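As a concrete illustration of the soft-margin linear case, the sketch below uses scikit-learn's SVC rather than the author's own implementation; the synthetic 2-D clusters, the choice C = 1.0, and the test point are assumptions made only for the example.

```python
import numpy as np
from sklearn.svm import SVC

# Two toy 2-D clusters standing in for the normal (+1) and abnormal (-1) classes
rng = np.random.default_rng(0)
X_normal = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
X_abnormal = rng.normal(loc=[3.0, 3.0], scale=0.5, size=(50, 2))
X = np.vstack([X_normal, X_abnormal])
y = np.hstack([np.ones(50), -np.ones(50)])

# Soft-margin linear SVM: minimizes 0.5*||w||^2 + C*sum(slack errors)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

w = clf.coef_[0]                  # weight vector w = sum_i alpha_i y_i x_i
b = clf.intercept_[0]             # offset b
print("number of support vectors:", clf.support_vectors_.shape[0])

# D(x) = w.x + b; its sign labels a new observation as normal or abnormal
x_new = np.array([[1.5, 1.5]])
print("D(x_new) =", clf.decision_function(x_new)[0],
      "-> predicted class", clf.predict(x_new)[0])
```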
Nonlinear Classification
• For inseparable data the SVM finds a nonlinear function D(x) that best separates the two classes by:
– use of a kernel map k(·): K = Φ(x_i)^T Φ(x)
– example feature map of degree two (1-D input): Φ(x) = [x^2, √2 x, 1]^T
• The decision function D(x) requires only the dot product of the feature map Φ, using the same mathematical framework as the linear classifier:
  D(x) = Σ_{i=1}^{n} y_i α_i k(x_i, x) + b
• This is called the kernel trick (worked example later)

SVM Training

Training SVMs for Classification
• We need an effective way to train the SVM without the presence of negative-class data
– convert the outer distribution of the positive class to a negative class
• Confidence limit training uses a defined confidence level around which a negative class is generated (decision surface D1(x) enclosing volume VS1)
• One-class training takes a percentage of the positive-class data and converts it to the negative class (decision surface D2(x) enclosing volume VS2)
– is an optimization problem
– minimizes the volume VS enclosed by the decision surface
– does not need negative-class information
• VS1 > VS2

One Class Training
• The negative class is important for SVM accuracy
• The data is partitioned into clusters using unsupervised K-means clustering, and SVM decision functions are built around each centroid
• The negative class is computed around each cluster centroid
• The negative class is selected from the positive-class data as the points that have the fewest neighbors, denoted by D
• Computationally this is done by maximizing the sum of Euclidean distances from each point to the cluster centroids:
  d_i = Σ_{j=1}^{k} ||x_i − c_j||^2,   D = arg max_i d_i

Class Prediction Probabilities and Maximum Likelihood Estimation

Fitting a Sigmoid Function
• In this project we are interested in finding the probability that our class prediction is correct
– i.e., modeling the misclassification rate
• The class prediction in PHM is the prediction of normality or abnormality
• With an MLE estimate of the density function of these class probabilities we can determine the uncertainty of the prediction as a function of distance from the hard decision boundary D(x) = Σ_{i=1}^{n} y_i α_i k + b

MLE and SVMs
• Using a semi-parametric approach, a sigmoid function S is fitted along the hard decision boundary to model the class probability
• We are interested in determining the density function that best prescribes this probability: the likelihood P(y | D(x_i))
• The likelihood is computed based on knowledge of the decision function values D(x_i) in the parameter space

MLE and the Sigmoid Function
• Parameters a* and b* are determined by solving a maximum likelihood estimation (MLE) problem for the sigmoid
  P(y = 1 | D) = f(D, a, b) = 1 / (1 + exp(a D(x) + b))
• Since ln(ab) = ln a + ln b, the log-likelihood factors into a sum over the training decision values:
  ln L(y = 1 | x) = ln[ f(D1) f(D2) … f(Dm) ] = Σ_{i=1}^{m} ln f(Di)
• The minimization is a two-parameter optimization problem for F, a function of a and b:
  F = − Σ_i [ t_i ln f(Di, a, b) + (1 − t_i) ln(1 − f(Di, a, b)) ],  with class targets t_i ∈ {0, 1}
• Depending on the parameters a* and b* the shape of the sigmoid will change
• It can be proven that the MLE optimization problem (min F) is convex
• Newton's method with a backtracking line search can be used to solve it (a sketch using a general-purpose optimizer follows)
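The sketch below fits the sigmoid parameters (a*, b*) by minimizing the negative log-likelihood F defined above. It uses scipy's general-purpose BFGS optimizer instead of the Newton/backtracking scheme mentioned on the slide, and the toy decision values and labels are assumptions made only to exercise the function.

```python
import numpy as np
from scipy.optimize import minimize

def fit_sigmoid(D, y):
    """Fit P(y=1|D) = 1 / (1 + exp(a*D + b)) by maximum likelihood.

    D : array of SVM decision values D(x_i); y : labels in {-1, +1}.
    """
    t = (y + 1) / 2.0                      # class targets t_i in {0, 1}

    def neg_log_likelihood(params):
        a, b = params
        p = 1.0 / (1.0 + np.exp(a * D + b))
        eps = 1e-12                        # guard against log(0)
        return -np.sum(t * np.log(p + eps) + (1 - t) * np.log(1 - p + eps))

    res = minimize(neg_log_likelihood, x0=np.array([-1.0, 0.0]))
    return res.x                           # fitted (a*, b*)

# toy usage: slightly overlapping decision values so the MLE stays bounded
D_vals = np.array([-2.1, -1.3, 0.2, -0.4, 1.1, 2.0])
y_vals = np.array([-1, -1, -1, 1, 1, 1])
a_star, b_star = fit_sigmoid(D_vals, y_vals)
p = 1.0 / (1.0 + np.exp(a_star * D_vals + b_star))
print(np.round(p, 2))                      # class-membership probabilities
```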
Joint Probability Model

Joint Probability Model
• The class prediction P(y | xS, xR) for an observation x is based on the joint class probabilities from its projections onto the PCA model and the residual model:
– PCA model: p(y | xS)
– Residual model: p(y | xR)
• p(y = c | xS) – the probability that a point xS is classified as c in the PCA model
• p(y = c | xR) – the probability that a point xR is classified as c in the residual model
• P(y | xS, xR) – the final probability that a point x is classified as c
• We anticipate better accuracy and sensitivity to the onset of anomalies

Joint Probability Model – Bayes Rule Assumption
• The joint probability model depends on the results of the SVC from both models (PCA and residual)
– Assumption: the data in the two models is linearly independent
• Changes in the joint classification probability can be used as a precursor to anomalies and used for prediction

Schedule/Progress

SVM Classification Example

Example – Non-Linear Classification
• We have four 1-D data points in the vector x with a label vector y:
– x = [1, 2, 5, 6]^T
– y = [−1, −1, 1, −1]^T
– This means that coordinates x(1), x(2) and x(4) belong to the same class I (circles) and x(3) is its own class II (squares)
• The decision function D(x) is given as the nonlinear combination of the weight vector, expressed in terms of the Lagrange multipliers:
  D(x) = Σ_{i=1}^{n} y_i α_i k(x, x_i) + b
• The Lagrange multipliers are computed from the quadratic optimization problem
  L_d(α) = ½ α^T H α − f^T α,   with H_NL = y_i y_j k(x_i, x_j)
• We use a polynomial kernel of degree two, k(x_i, x_j) = (x_i x_j + 1)^2, because we can see that some kind of parabola will separate the classes

Example – Non-Linear Classification: Construct the Hessian for the Quadratic Optimization
• The Hessian is H_NL = y_i y_j Φ(x_i)^T Φ(x_j) = y_i y_j k(x_i, x_j), where
  k(x_i, x_j) = (x_i^T x_j + 1)^2
• For x = [1, 2, 5, 6]^T and y = [−1, −1, 1, −1]^T, e.g. H_11 = (−1)(−1)(1·1 + 1)^2 = 4 and H_12 = (−1)(−1)(1·2 + 1)^2 = 9, giving
  H = [  4    9  −36    49
         9   25 −121   169
       −36 −121  676  −961
        49  169 −961  1369 ]
• Notice that in order to calculate the scalar product Φ(x_i)^T Φ(x_j) in the feature space, we do not need to perform the mapping Φ explicitly; instead we calculate this product directly in the input space by computing the kernel of the map
• This is called the kernel trick

Example – Non-Linear Classification: The Kernel Trick
• Let x belong to the real 2-D input space, x = [x1, x2]^T, and choose a mapping function Φ of degree two: Φ(x) = [x1^2, √2 x1 x2, x2^2]^T
• The required dot product of the map function can be expressed as a dot product in the input space:
  Φ(x_i)^T Φ(x_j) = x_i1^2 x_j1^2 + 2 x_i1 x_i2 x_j1 x_j2 + x_i2^2 x_j2^2 = (x_i^T x_j)^2 = k(x_i, x_j)
• This is the kernel trick
• The kernel trick basically says that the mapping can be expressed in terms of a dot product of the input-space data raised to some degree – here the second degree (a short numerical check follows)
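The following short check, an illustrative sketch rather than part of the original deck, verifies the kernel trick numerically for the 1-D worked example using the degree-two feature map Φ(x) = [x^2, √2 x, 1]^T given earlier, and builds the dual Hessian H (note that the printed H carries the y_i y_j signs).

```python
import numpy as np

x = np.array([1.0, 2.0, 5.0, 6.0])           # the four 1-D training points
y = np.array([-1.0, -1.0, 1.0, -1.0])        # their class labels

# Kernel of the worked example: k(xi, xj) = (xi*xj + 1)^2
K = (np.outer(x, x) + 1.0) ** 2

# Same values via the explicit degree-2 feature map phi(x) = [x^2, sqrt(2)*x, 1]
phi = np.column_stack([x ** 2, np.sqrt(2.0) * x, np.ones_like(x)])
print(np.allclose(K, phi @ phi.T))            # True: the kernel trick

# Hessian of the dual problem, H_ij = y_i * y_j * k(x_i, x_j)
H = np.outer(y, y) * K
print(H.astype(int))
```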
Example – Non-Linear Classification: Decision Function D(x)
• Compute the Lagrange multipliers α through the quadratic optimization problem:
  α = [0, 2.49, 7.33, 4.83]^T
• Plug them into the equation for D(x):
  D(x) = Σ_{i=1}^{4} y_i α_i k(x, x_i) + b = Σ_{i=1}^{4} y_i α_i (x x_i + 1)^2 + b
• Determine b using the class constraints y = [−1, −1, +1, −1]^T on the support vectors, giving b = −9
• The end result is a nonlinear (quadratic) decision function:
  D(x) = 2.49(−1)(2x + 1)^2 + 7.33(+1)(5x + 1)^2 + 4.83(−1)(6x + 1)^2 + b = −0.667 x^2 + 5.33 x − 9
• Evaluating at the training points:
– For x(1) = 1, sign(D(x) = −4.33) < 0, so C1
– For x(2) = 2, sign(D(x) = −1.00) < 0, so C1
– For x(3) = 5, sign(D(x) = 0.994) > 0, so C2
– For x(4) = 6, sign(D(x) = −1.009) < 0, so C1
• The nonlinear classifier correctly classified the data!

Quadratic Optimization and Global Solutions
• What do all these methods have in common?
– Quadratic optimization for the weight vector w through the Lagrange multipliers α:
  L_d(α) = ½ α^T H α − f^T α
– H is the Hessian matrix: H_linear = y_i y_j x_i^T x_j in the linear case and H_NL = y_i y_j k(x_i, x_j) in the nonlinear case
– y is the class membership of each training point
• This type of equation defines a quadratic optimization problem, the solution of which gives the Lagrange multipliers α, which in turn are used in D(x)
• In Matlab, "quadprog" is used to solve the quadratic optimization (a scipy-based sketch of the same dual problem follows)
• Because the quadratic problem is convex, its solution is unique and is guaranteed to be a global solution
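To close the worked example, the sketch below solves the same dual quadratic program with scipy's SLSQP solver in place of Matlab's quadprog. The hard-margin constraints (α ≥ 0, Σ α_i y_i = 0) and the kernel follow the slides; the starting point, the choice of a support vector for recovering b, and the rounding are illustrative assumptions. It should reproduce α ≈ [0, 2.5, 7.33, 4.83], b ≈ −9, and decision values ≈ [−4.33, −1, 1, −1].

```python
import numpy as np
from scipy.optimize import minimize

# 1-D training points and labels from the worked example
x = np.array([1.0, 2.0, 5.0, 6.0])
y = np.array([-1.0, -1.0, 1.0, -1.0])

# Polynomial kernel of degree 2 and dual Hessian H_ij = y_i y_j k(x_i, x_j)
K = (np.outer(x, x) + 1.0) ** 2
H = np.outer(y, y) * K

# Dual problem: minimize 0.5*a^T H a - sum(a)  s.t.  sum(a*y) = 0, a >= 0
obj = lambda a: 0.5 * a @ H @ a - a.sum()
cons = {"type": "eq", "fun": lambda a: a @ y}
bounds = [(0.0, None)] * len(x)
res = minimize(obj, x0=np.zeros(len(x)), bounds=bounds, constraints=cons)
a = res.x

# Recover b from a support vector (alpha_i > 0): y_i*(sum_j a_j y_j k(x_j, x_i) + b) = 1
sv = np.argmax(a)
b = y[sv] - np.sum(a * y * K[:, sv])

def D(t):
    """Decision function D(t) = sum_i y_i a_i (x_i*t + 1)^2 + b."""
    return np.sum(a * y * (x * t + 1.0) ** 2) + b

print(np.round(a, 2), round(b, 2))
print([round(D(t), 2) for t in x])
```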
