
Part 2: Support Vector Machines
Vladimir Cherkassky, University of Minnesota, cherk001@umn.edu
Presented at Tech Tune Ups, ECE Dept, June 1, 2011
Electrical and Computer Engineering

SVM: Brief History
• 1963 Margin (Vapnik & Lerner)
• 1964 Margin (Vapnik & Chervonenkis)
• 1964 RBF kernels (Aizerman)
• 1965 Optimization formulation (Mangasarian)
• 1971 Kernels (Kimeldorf and Wahba)
• 1992–1994 SVMs (Vapnik et al.)
• 1996–present Rapid growth, numerous applications
• 1996–present Extensions to other problems

MOTIVATION for SVM
• Problems with 'conventional' methods:
  - model complexity ~ dimensionality (# of features)
  - nonlinear methods → multiple local minima
  - hard to control complexity
• SVM solution approach:
  - adaptive loss function (controls complexity independently of dimensionality)
  - flexible nonlinear models
  - tractable optimization formulation

SVM APPROACH
• Linear approximation in Z-space using a special adaptive loss function
• Complexity independent of dimensionality: x → g(x) → z, then ŷ = w·z

OUTLINE
• Margin-based loss
• SVM for classification
• SVM examples
• Support vector regression
• Summary

Example: binary classification
• Given: linearly separable data. How to construct a linear decision boundary?

Linear Discriminant Analysis
• LDA solution and its separation margin

Perceptron (linear NN)
• Perceptron solutions and their separation margins

Largest-margin solution
• All solutions explain the data well (zero error)
• All solutions share the same linear parameterization
• Larger margin ~ more confidence (falsifiability); total margin width M = 2Δ

Complexity of Δ-margin hyperplanes
• If the data samples belong to a sphere of radius R, then the set of Δ-margin hyperplanes has VC dimension bounded by
  h ≤ min(R²/Δ², d) + 1
• For large-margin hyperplanes, the VC dimension is controlled independently of the dimensionality d.
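As a quick numeric illustration of this bound (a sketch added here, not part of the original slides), the function below evaluates h ≤ min(R²/Δ², d) + 1; the radius R = 1 and dimensionality d = 256 are made-up values chosen only to show that a large margin caps the bound well below d + 1:

```python
def vc_bound(R, delta, d):
    """Upper bound on the VC dimension of delta-margin hyperplanes
    for data inside a sphere of radius R in d dimensions:
    min(R^2 / delta^2, d) + 1."""
    return min(R ** 2 / delta ** 2, d) + 1

# Hypothetical setting: R = 1, d = 256 (e.g. a 16x16 pixel image).
for delta in (0.05, 0.5):
    print("margin", delta, "-> bound", vc_bound(1.0, delta, 256))
# A small margin leaves the bound at d + 1 = 257, while a large margin
# (delta = 0.5) caps it at 5, independent of d.
```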
Motivation: philosophical
• Classical view: a good model explains the data + has low complexity → Occam's razor (complexity ~ # of parameters)
• VC theory: a good model explains the data + has low VC dimension
  ~ VC falsifiability: a good model explains the data + has large falsifiability
• The idea: falsifiability ~ empirical loss function

Adaptive loss functions
• Both goals (explanation + falsifiability) can be encoded into an empirical loss function where
  - a (large) portion of the data has zero loss
  - the rest of the data has non-zero loss, i.e. it falsifies the model
• The trade-off between the two goals is controlled adaptively → adaptive loss function
• Examples of such loss functions for different learning problems are shown next

Margin-based loss for classification
• L(y, f(x,ω)) = max(Δ − y·f(x,ω), 0), with a margin of width 2Δ

Margin-based loss for classification: the margin is adapted to the training data
• The loss L(y, f(x,ω)) = max(Δ − y·f(x,ω), 0) is plotted against y·f(x,ω) for class +1 and class −1

Epsilon loss for regression
• L(y, f(x,ω)) = max(|y − f(x,ω)| − ε, 0)

The parameter ε is adapted to the training data
• Example: linear regression y = x + noise, where noise = N(0, 0.36), x ~ [0,1], 4 samples
• Compare squared, linear, and SVM loss (ε = 0.6) [plot of y vs. x omitted]

OUTLINE
• Margin-based loss
• SVM for classification: linear SVM classifier, inner-product kernels, nonlinear SVM classifier
• SVM examples
• Support vector regression
• Summary

SVM Loss for Classification
• The continuous quantity y·f(x,w) measures how close a sample x is to the decision boundary

Optimal Separating Hyperplane
• Distance between the hyperplane and a sample: |f(x′)|/‖w‖
• Margin: 1/‖w‖; shaded points are SVs

Linear SVM Optimization Formulation (for separable data)
• Given training data (x_i, y_i), i = 1,…,n
• Find parameters (w, b) of the linear hyperplane f(x) = (w·x) + b that minimize (1/2)‖w‖² under the constraints y_i[(w·x_i) + b] ≥ 1
• Quadratic optimization with linear constraints → tractable for moderate dimensions d
• For large dimensions, use the dual formulation:
  - scales with the sample size n rather than d
  - uses only the dot products (x_i · x_j)

Classification for non-separable data
• Introduce slack variables ξ_i; the loss L(y, f(x,ω)) = max(Δ − y·f(x,ω), 0) as a function of y·f(x,ω)

SVM for non-separable data
• With f(x) = (w·x) + b and margin lines f(x) = +1, f(x) = 0, f(x) = −1, the slacks of samples inside the margin or misclassified are, e.g.,
  ξ_1 = 1 − f(x_1), ξ_2 = 1 − f(x_2), ξ_3 = 1 + f(x_3)
• Minimize C Σ_{i=1..n} ξ_i + (1/2)‖w‖² under the constraints y_i[(w·x_i) + b] ≥ 1 − ξ_i

SVM Dual Formulation
• Given training data (x_i, y_i), i = 1,…,n
• Find parameters α_i*, b* of an optimal hyperplane as the solution of the maximization problem
  L(α) = Σ_{i=1..n} α_i − (1/2) Σ_{i,j=1..n} α_i α_j y_i y_j (x_i · x_j) → max
  under the constraints Σ_{i=1..n} y_i α_i = 0 and 0 ≤ α_i ≤ C
• Solution: f(x) = Σ_{i=1..n} α_i* y_i (x · x_i) + b*, where the samples with nonzero α_i* are the SVs
• Needs only the inner products (x · x′)

Nonlinear Decision Boundary
• A fixed (linear) parameterization is too rigid
• A nonlinear curved margin may yield a larger margin (falsifiability) and lower error

Nonlinear Mapping via Kernels
Nonlinear f(x,w) + margin-based loss = SVM
• Nonlinear mapping to feature z-space, i.e.
  x = (x1, x2) → z = (1, x1, x2, x1·x2, x1², x2²)
• Linear in z-space ~ nonlinear in x-space
• BUT z·z′ = H(x, x′) ~ the kernel trick: compute the dot product analytically via a kernel
• Pipeline: x → g(x) → z, then ŷ = w·z

SVM Formulation (with kernels)
• Replacing the dot product z·z′ with H(x, x′) leads to:
• Find parameters α_i*, b* of an optimal hyperplane D(x) = Σ_{i=1..n} α_i* y_i H(x, x_i) + b* as the solution of the maximization problem
  L(α) = Σ_{i=1..n} α_i − (1/2) Σ_{i,j=1..n} α_i α_j y_i y_j H(x_i, x_j) → max
  under the constraints Σ_{i=1..n} y_i α_i = 0 and 0 ≤ α_i ≤ C
• Given: the training data (x_i, y_i), i = 1,…,n, an inner-product kernel H(x, x′), and the regularization parameter C

Examples of Kernels
A kernel H(x, x′) is a symmetric function satisfying general mathematical conditions (Mercer's conditions). Examples of kernels for different mappings x → z:
• Polynomial of degree q: H(x, x′) = ((x·x′) + 1)^q
• RBF kernel: H(x, x′) = exp(−‖x − x′‖²/γ²)
• Neural network: H(x, x′) = tanh(v(x·x′) + a) for given parameters v, a
• Automatic selection of the number of hidden units (= number of SVs)

More on Kernels
• The kernel matrix holds all the information (data + kernel):
  H(1,1) H(1,2) … H(1,n)
  H(2,1) H(2,2) … H(2,n)
  …
  H(n,1) H(n,2) … H(n,n)
• A kernel defines a distance in some feature space (aka the kernel-induced feature space)
• Kernels can incorporate a priori knowledge
• Kernels can be defined over complex structures (trees, sequences, sets, etc.)

Support Vectors
• SVs ~ training samples with non-zero loss
• SVs are the samples that falsify the model
• The model depends only on the SVs → SVs ~ a robust characterization of the data
  WSJ, Feb 27, 2004: "About 40% of us (Americans) will vote for a Democrat, even if the candidate is Genghis Khan. About 40% will vote for a Republican, even if the candidate is Attila the Hun. This means that the election is left in the hands of one-fifth of the voters."
• SVM generalization ~ data compression

New insights provided by SVM
• Why can linear classifiers generalize? Recall h ≤ min(R²/Δ², d) + 1:
  (1) the margin is large (relative to R)
  (2) the fraction of SVs is small
  (3) the ratio d/n is small
• SVM offers an effective way to control complexity (via margin + kernel selection), i.e. by implementing (1), (2), or both
• Requires common-sense parameter tuning

OUTLINE
• Margin-based loss
• SVM for classification
• SVM examples
• Support vector regression
• Summary

Ripley's data set
• 250 training samples, 1,000 test samples
• SVM using the RBF kernel H(u, v) = exp(−‖u − v‖²/γ²)
• Model selection via 10-fold cross-validation

Ripley's data set: SVM model
• Decision boundary and margin borders; SVs are circled [scatter plot of x2 vs. x1 omitted]

Ripley's data set: model selection
• SVM tuning parameters: C, γ
• Select optimal parameter values via 10-fold cross-validation
• Cross-validation error rates:

         C=0.1   C=1     C=10    C=100   C=1000  C=10000
  γ=2⁻³  98.4%   23.6%   18.8%   20.4%   18.4%   14.4%
  γ=2⁻²  51.6%   22%     20%     20%     16%     14%
  γ=2⁻¹  33.2%   19.6%   18.8%   15.6%   13.6%   14.8%
  γ=2⁰   28%     18%     16.4%   14%     12.8%   15.6%
  γ=2¹   20.8%   16.4%   14%     12.8%   16%     17.2%
  γ=2²   19.2%   14.4%   13.6%   15.6%   15.6%   16%
  γ=2³   15.6%   14%     15.6%   16.4%   18.4%   18.4%

Noisy Hyperbolas data set
• This example shows the application of different kernels (RBF vs. polynomial)
• Note: the resulting decision boundaries are quite different [side-by-side plots omitted]

Many challenging applications
• Mimic human recognition capabilities:
  - high-dimensional data
  - content-based
  - context-dependent
• Example: read the sentence
  Sceitnitss osbevred: it is nt inptrant how lteters are msspled isnide the word.
  It is ipmoratnt that the fisrt and lsat letetrs do not chngae, tehn the txet is itneprted corrcetly.
• SVM is suitable for sparse high-dimensional data

Example SVM Applications
• Handwritten digit recognition
• Genomics
• Face detection in unrestricted images
• Text/document classification
• Image classification and retrieval
• …

Handwritten Digit Recognition (mid-90s)
• Data set: postal (zip-code) images, segmented and cropped; ~7K training samples and ~2K test samples
• Data encoding: each 16×16 pixel image → a 256-dimensional vector
• Original motivation: compare SVM with a custom MLP network (LeNet) designed for this application
• Multi-class problem: one-vs-all approach → 10 SVM classifiers (one per digit)

Digit Recognition Results
• Summary:
  - prediction accuracy better than custom NNs
  - accuracy does not depend on the kernel type
  - 100–400 support vectors per class (digit)
• Details:
  Kernel type      No. of support vectors   Error %
  Polynomial       274                      4.0
  RBF              291                      4.1
  Neural network   254                      4.2
• ~80–90% of the SVs coincide across the different kernels

Document Classification (Joachims, 1998)
• The problem: classification of text documents in large databases, for text indexing and retrieval
• Traditional approach: human categorization (i.e. via feature selection) relies on a good indexing scheme; this is time-consuming and costly
• Predictive learning approach (SVM): construct a classifier using all possible features (words)
• Document/text representation: individual words = input features (possibly weighted)
• SVM performance:
  - very promising (~90% accuracy vs. ~80% for other classifiers)
  - most problems are linearly separable → use a linear SVM

OUTLINE
• Margin-based loss
• SVM for classification
• SVM examples
• Support vector regression
• Summary

Linear SVM regression
• Assume the linear parameterization f(x, ω) = (w·x) + b
• ε-insensitive loss: L(y, f(x,ω)) = max(|y − f(x,ω)| − ε, 0)

Direct Optimization Formulation
• Given training data (x_i, y_i), i = 1,…,n
• Minimize (1/2)(w·w) + C Σ_{i=1..n} (ξ_i + ξ_i*)
  under the constraints
  y_i − (w·x_i) − b ≤ ε + ξ_i
  (w·x_i) + b − y_i ≤ ε + ξ_i*
  ξ_i, ξ_i* ≥ 0, i = 1,…,n

Example: SVM regression using an RBF kernel
• The SVM estimate is shown as a dashed line [plot of y vs. x omitted]
• The SVM model uses only 5 SVs (out of the 40 data points)

RBF regression model
• f(x, w) = Σ_{j=1..m} w_j exp(−(x − c_j)²/(2σ²)), with width σ = 0.2
• A weighted sum of 5 RBF kernels gives the SVM model [plot omitted]

Summary
• Margin-based loss: robust + performs complexity control
• Nonlinear feature selection (~ SVs): performed automatically
• Tractable model selection: easier than for most nonlinear methods
• SVM is not a magic-bullet solution:
  - similar to other methods when n >> h
  - SVM is better when n << h or n ~ h
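The ε-insensitive loss and the RBF expansion used in the regression section can be sketched in a few lines of Python. This is a minimal illustration only: the centers and weights below are made up for demonstration, not the 5-SV model fitted in the slides.

```python
import math

def eps_loss(y, f, eps):
    """Epsilon-insensitive loss: zero inside the eps-tube, linear outside."""
    return max(abs(y - f) - eps, 0.0)

def rbf_model(x, centers, weights, width):
    """Weighted sum of RBF kernels: f(x) = sum_j w_j * exp(-(x - c_j)^2 / (2*width^2))."""
    return sum(w * math.exp(-(x - c) ** 2 / (2 * width ** 2))
               for c, w in zip(centers, weights))

# Hypothetical 5-kernel model (values chosen arbitrarily for illustration).
centers = [0.1, 0.3, 0.5, 0.7, 0.9]
weights = [0.5, -1.0, 2.0, -1.0, 0.5]
prediction = rbf_model(0.5, centers, weights, width=0.2)

# A prediction within eps of the target incurs zero loss:
print(eps_loss(1.0, 1.05, eps=0.1))  # -> 0.0
```

Only samples whose residual falls outside the ε-tube contribute loss, which is why the fitted model ends up depending on a small set of support vectors.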
