# Part 2


**Support Vector Machines**

University of Minnesota (cherk001@umn.edu)
Presented at Tech Tune Ups, ECE Dept, June 1, 2011
Electrical and Computer Engineering

## SVM: Brief History

- 1963: Margin (Vapnik & Lerner)
- 1964: Margin (Vapnik & Chervonenkis)
- 1964: RBF kernels (Aizerman)
- 1965: Optimization formulation (Mangasarian)
- 1971: Kernels (Kimeldorf and Wahba)
- 1992-1994: SVMs (Vapnik et al.)
- 1996-present: rapid growth, numerous applications
- 1996-present: extensions to other problems

## MOTIVATION for SVM

- Problems with "conventional" methods:
  - model complexity grows with dimensionality (number of features)
  - nonlinear methods suffer from multiple local minima
  - complexity is hard to control
- SVM solution approach:
  - adaptive loss function (to control complexity independently of dimensionality)
  - flexible nonlinear models
  - tractable optimization formulation

## SVM APPROACH

- Linear approximation in Z-space using a nonlinear mapping of the inputs:

$$\mathbf{x} \;\to\; g(\mathbf{x}) \;\to\; \mathbf{z} \;\to\; (\mathbf{w} \cdot \mathbf{z}) \;\to\; \hat{y}$$

- Complexity independent of dimensionality

## OUTLINE

- Margin-based loss
- SVM for classification
- SVM examples
- Support vector regression
- Summary

## Example: binary classification

- Given: linearly separable data
- How do we construct a linear decision boundary?

## Linear Discriminant Analysis

(Figure: the LDA solution and its separation margin.)

## Perceptron (linear NN)

(Figure: several perceptron solutions and their separation margins.)

## Largest-margin solution

- All solutions explain the data equally well (zero error)
- All solutions share the same linear parameterization
- Larger margin ~ more confidence (falsifiability)

$$M = 2\Delta$$

## Complexity of Δ-margin hyperplanes

- If data samples belong to a sphere of radius R, then the set of Δ-margin hyperplanes has VC dimension bounded by

$$h \le \min\left(\frac{R^2}{\Delta^2},\, d\right) + 1$$

- For large-margin hyperplanes, the VC dimension is controlled independently of dimensionality d.

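As a quick sanity check of this bound, the numbers below are made up for illustration: with a large margin relative to R, the bound stays small no matter how large d is.

```python
# Toy check of the VC bound h <= min(R^2 / Delta^2, d) + 1:
# R = 1, Delta = 0.5 in d = 100 dimensions gives a bound of 5,
# far below the d + 1 = 101 of an unconstrained linear classifier.
R, delta, d = 1.0, 0.5, 100
h_bound = min(R**2 / delta**2, d) + 1
print(h_bound)  # 5.0
```
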
## Motivation: philosophical

- Classical view: a good model explains the data + has low complexity
  → Occam's razor (complexity ~ number of parameters)
- VC theory: a good model explains the data + has low VC dimension
  ~ VC falsifiability: a good model explains the data + has large falsifiability
- The idea: falsifiability ~ empirical loss function

- Both goals (explanation + falsifiability) can be encoded into an empirical loss function where:
  - a (large) portion of the data has zero loss
  - the rest of the data has non-zero loss, i.e. it falsifies the model
- The trade-off (between the two goals) is controlled by a tunable parameter of the loss function
- Examples of such loss functions for different learning problems are shown next

## Margin-based loss for classification

$$\text{Margin} = 2\Delta, \qquad L_\Delta(y, f(\mathbf{x}, \omega)) = \max\left(\Delta - y f(\mathbf{x}, \omega),\; 0\right)$$

## Margin-based loss for classification: margin is adapted to training data

(Figure: class +1 and class -1 samples with margin borders; horizontal axis shows y f(x, ω).)

$$L_\Delta(y, f(\mathbf{x}, \omega)) = \max\left(\Delta - y f(\mathbf{x}, \omega),\; 0\right)$$

## Epsilon loss for regression

$$L_\varepsilon(y, f(\mathbf{x}, \omega)) = \max\left(|y - f(\mathbf{x}, \omega)| - \varepsilon,\; 0\right)$$

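As a minimal sketch, both loss functions translate directly into a few lines of NumPy; the margin `delta` and tube width `eps` below are free parameters, with made-up example values:

```python
import numpy as np

def margin_loss(y, f, delta=1.0):
    """Margin-based loss: zero once y*f clears the margin delta."""
    return np.maximum(delta - y * f, 0.0)

def epsilon_loss(y, f, eps=0.1):
    """Epsilon-insensitive loss: zero inside the eps-tube."""
    return np.maximum(np.abs(y - f) - eps, 0.0)

# Samples inside the margin/tube incur zero loss:
print(margin_loss(np.array([+1, -1]), np.array([1.5, 0.2])))      # [0.  1.2]
print(epsilon_loss(np.array([0.5, 0.9]), np.array([0.45, 0.2])))  # [0.  0.6]
```
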
## Parameter epsilon is adapted to training data

- Example: linear regression y = x + noise, where noise = N(0, 0.36), x ~ [0, 1], 4 samples
- Compare: squared, linear and SVM loss (eps = 0.6)

(Figure: the three fitted lines on the 4 samples; x in [0, 1], y in [-1, 2].)

## OUTLINE

- Margin-based loss
- SVM for classification
  - Linear SVM classifier
  - Inner product kernels
  - Nonlinear SVM classifier
- SVM examples
- Support vector regression
- Summary

## SVM Loss for Classification

The continuous quantity $y f(\mathbf{x}, \mathbf{w})$ measures how close a sample $\mathbf{x}$ is to the decision boundary.

## Optimal Separating Hyperplane

- Distance between the hyperplane and a sample $\mathbf{x}'$: $|f(\mathbf{x}')| \,/\, \|\mathbf{w}\|$
- Margin $\Delta = 1 / \|\mathbf{w}\|$

(Figure: optimal separating hyperplane; shaded points are support vectors.)

## Linear SVM Optimization Formulation (for separable data)

- Given training data $(\mathbf{x}_i, y_i),\ i = 1, \dots, n$
- Find parameters $\mathbf{w}, b$ of the linear hyperplane $f(\mathbf{x}) = (\mathbf{w} \cdot \mathbf{x}) + b$ that minimize

$$R(\mathbf{w}) = 0.5\,\|\mathbf{w}\|^2$$

under constraints $y_i \left[ (\mathbf{w} \cdot \mathbf{x}_i) + b \right] \ge 1$

- Quadratic optimization with linear constraints, tractable for moderate dimensions d
- For large dimensions use the dual formulation:
  - scales with sample size (n) rather than d
  - uses only dot products $(\mathbf{x}_i \cdot \mathbf{x}_j)$

## Classification for non-separable data

Slack variables $\xi_i$ measure the margin violations:

$$L_\Delta(y, f(\mathbf{x}, \omega)) = \max\left(\Delta - y f(\mathbf{x}, \omega),\; 0\right), \qquad \xi = \Delta - y f(\mathbf{x}, \omega)$$

## SVM for non-separable data

(Figure: hyperplane $f(\mathbf{x}) = (\mathbf{w} \cdot \mathbf{x}) + b$ with margin borders $f(\mathbf{x}) = \pm 1$; slack variables $\xi_1 = 1 - f(\mathbf{x}_1)$, $\xi_2 = 1 - f(\mathbf{x}_2)$, $\xi_3 = 1 + f(\mathbf{x}_3)$.)

Minimize

$$C \sum_{i=1}^{n} \xi_i + \frac{1}{2}\,\|\mathbf{w}\|^2 \;\to\; \min$$

under constraints $y_i \left[ (\mathbf{w} \cdot \mathbf{x}_i) + b \right] \ge 1 - \xi_i$

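The same trade-off constant C appears in library implementations; below is a minimal sketch using scikit-learn's SVC on made-up two-class data:

```python
# Soft-margin linear SVM: C is the trade-off constant from the slide.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.8, (20, 2)),   # class -1
               rng.normal(+1, 0.8, (20, 2))])  # class +1
y = np.array([-1] * 20 + [+1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print("w =", clf.coef_[0], " b =", clf.intercept_[0])
print("number of support vectors:", len(clf.support_))
```
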
## SVM Dual Formulation

- Given training data $(\mathbf{x}_i, y_i),\ i = 1, \dots, n$
- Find parameters $\alpha_i^*, b^*$ of an optimal hyperplane as a solution to the maximization problem

$$L(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j) \;\to\; \max$$

under constraints $\sum_{i=1}^{n} y_i \alpha_i = 0, \quad 0 \le \alpha_i \le C$

- Solution: $f(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i^* y_i (\mathbf{x} \cdot \mathbf{x}_i) + b^*$, where samples with nonzero $\alpha_i^*$ are the support vectors
- Needs only inner products $(\mathbf{x} \cdot \mathbf{x}')$

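scikit-learn's SVC exposes the dual solution, so the expansion above can be checked directly (a sketch on made-up data; `dual_coef_` stores the products alpha_i * y_i for the support vectors only):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1, (30, 2)), rng.normal(1, 1, (30, 2))])
y = np.array([-1] * 30 + [1] * 30)
clf = SVC(kernel="linear", C=10.0).fit(X, y)

# f(x) = sum_i alpha_i* y_i (x . x_i) + b*, summing over SVs only
x = np.array([0.3, -0.2])
f = clf.dual_coef_[0] @ (clf.support_vectors_ @ x) + clf.intercept_[0]
print(f, clf.decision_function([x])[0])  # the two values agree
```
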
## Nonlinear Decision Boundary

- A fixed (linear) parameterization is too rigid
- A nonlinear curved boundary may yield a larger margin (falsifiability) and lower error

## Nonlinear Mapping via Kernels

Nonlinear f(x, w) + margin-based loss = SVM

- Nonlinear mapping to feature space Z, e.g.

$$\mathbf{x} \sim (x_1, x_2) \;\to\; \mathbf{z} \sim (1,\; x_1,\; x_2,\; x_1 x_2,\; x_1^2,\; x_2^2)$$

- Linear in Z-space ~ nonlinear in X-space
- BUT $(\mathbf{z} \cdot \mathbf{z}') = H(\mathbf{x}, \mathbf{x}')$ ~ the kernel trick
  → compute the dot product via the kernel analytically

$$\mathbf{x} \;\to\; g(\mathbf{x}) \;\to\; \mathbf{z} \;\to\; (\mathbf{w} \cdot \mathbf{z}) \;\to\; \hat{y}$$

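For the quadratic mapping above, the kernel trick can be verified numerically: the polynomial kernel $((\mathbf{x} \cdot \mathbf{x}') + 1)^2$ reproduces the feature-space dot product (with a standard sqrt(2) scaling on the cross terms):

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map for 2-d input (sqrt(2) scaling)."""
    x1, x2 = x
    return np.array([1, np.sqrt(2)*x1, np.sqrt(2)*x2,
                     np.sqrt(2)*x1*x2, x1**2, x2**2])

x, xp = np.array([0.5, -1.0]), np.array([2.0, 0.3])
print(phi(x) @ phi(xp))   # dot product computed in feature space: 2.89
print((x @ xp + 1) ** 2)  # same value, computed in input space:   2.89
```
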
## SVM Formulation (with kernels)

- Replacing $(\mathbf{z} \cdot \mathbf{z}')$ with $H(\mathbf{x}, \mathbf{x}')$ leads to:
- Find parameters $\alpha_i^*, b^*$ of an optimal hyperplane

$$D(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i^* y_i H(\mathbf{x}_i, \mathbf{x}) + b^*$$

as a solution to the maximization problem

$$L(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j H(\mathbf{x}_i, \mathbf{x}_j) \;\to\; \max$$

under constraints $\sum_{i=1}^{n} y_i \alpha_i = 0, \quad 0 \le \alpha_i \le C$

- Given: the training data $(\mathbf{x}_i, y_i),\ i = 1, \dots, n$, an inner product kernel $H(\mathbf{x}, \mathbf{x}')$, and the regularization parameter C

## Examples of Kernels

The kernel $H(\mathbf{x}, \mathbf{x}')$ is a symmetric function satisfying general mathematical conditions (Mercer's conditions). Examples of kernels for different mappings x → z:

- Polynomial of degree q: $H(\mathbf{x}, \mathbf{x}') = \left( (\mathbf{x} \cdot \mathbf{x}') + 1 \right)^q$
- RBF kernel: $H(\mathbf{x}, \mathbf{x}') = \exp\left( -\dfrac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\sigma^2} \right)$
- Neural network: $H(\mathbf{x}, \mathbf{x}') = \tanh\left( v\,(\mathbf{x} \cdot \mathbf{x}') + a \right)$ for given parameters v, a
  → automatic selection of the number of hidden units (SVs)

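These three kernels are one-liners; a sketch in plain NumPy, with `sigma`, `q`, `v`, `a` as the user-chosen parameters from the slide:

```python
import numpy as np

def poly_kernel(x, xp, q=2):
    return (np.dot(x, xp) + 1) ** q

def rbf_kernel(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

def nn_kernel(x, xp, v=1.0, a=-1.0):
    # Note: tanh kernels satisfy Mercer's conditions only for some (v, a)
    return np.tanh(v * np.dot(x, xp) + a)

x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(poly_kernel(x, xp), rbf_kernel(x, xp), nn_kernel(x, xp))
```
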
## More on Kernels

- The kernel matrix holds all the information (data + kernel):

$$\begin{pmatrix} H(1,1) & H(1,2) & \cdots & H(1,n) \\ H(2,1) & H(2,2) & \cdots & H(2,n) \\ \vdots & \vdots & \ddots & \vdots \\ H(n,1) & H(n,2) & \cdots & H(n,n) \end{pmatrix}$$

- A kernel defines a distance in some feature space (aka kernel-induced feature space)
- Kernels can incorporate a priori knowledge
- Kernels can be defined over complex structures (trees, sequences, sets, etc.)

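Building the kernel matrix is just a pairwise evaluation of H; a sketch for the RBF kernel defined earlier:

```python
import numpy as np

def kernel_matrix(X, sigma=1.0):
    """n x n RBF kernel matrix; pairwise squared distances via broadcasting."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * sigma ** 2))

X = np.random.default_rng(0).normal(size=(5, 2))
K = kernel_matrix(X)
print(K.shape, np.allclose(K, K.T))  # (5, 5) True -- symmetric, as required
```
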
## Support Vectors

- SVs ~ training samples with non-zero loss
- SVs are samples that falsify the model
- The model depends only on SVs → SVs ~ robust characterization of the data

WSJ, Feb 27, 2004:
> About 40% of us (Americans) will vote for a Democrat, even if the candidate is Genghis Khan. About 40% will vote for a Republican, even if the candidate is Attila the Hun. This means that the election is left in the hands of one-fifth of the voters.

- SVM generalization ~ data compression

## New insights provided by SVM

- Why can linear classifiers generalize? Recall the VC bound

$$h \le \min\left(\frac{R^2}{\Delta^2},\, d\right) + 1$$

  Generalization is possible when:
  1. the margin is large (relative to R)
  2. the percentage of SVs is small
  3. the ratio d/n is small
- SVM offers an effective way to control complexity (via margin + kernel selection), i.e. implementing (1) or (2) or both
- Requires common-sense parameter tuning

## OUTLINE

- Margin-based loss
- SVM for classification
- SVM examples
- Support vector regression
- Summary

## Ripley's data set

- 250 training samples, 1,000 test samples
- SVM using the RBF kernel $H(\mathbf{u}, \mathbf{v}) = \exp\left( -\gamma\, \|\mathbf{u} - \mathbf{v}\|^2 \right)$
- Model selection via 10-fold cross-validation

## Ripley's data set: SVM model

- Decision boundary and margin borders
- SVs are circled

(Figure: decision boundary and margin borders in the (x1, x2) plane; support vectors circled.)

## Ripley's data set: model selection

- SVM tuning parameters: C, γ
- Select optimal parameter values via 10-fold cross-validation
- Results of cross-validation (error rates) are summarized below:

| γ | C = 0.1 | C = 1 | C = 10 | C = 100 | C = 1000 | C = 10000 |
|------|-------|-------|-------|-------|-------|-------|
| 2^-3 | 98.4% | 23.6% | 18.8% | 20.4% | 18.4% | 14.4% |
| 2^-2 | 51.6% | 22%   | 20%   | 20%   | 16%   | 14%   |
| 2^-1 | 33.2% | 19.6% | 18.8% | 15.6% | 13.6% | 14.8% |
| 2^0  | 28%   | 18%   | 16.4% | 14%   | 12.8% | 15.6% |
| 2^1  | 20.8% | 16.4% | 14%   | 12.8% | 16%   | 17.2% |
| 2^2  | 19.2% | 14.4% | 13.6% | 15.6% | 15.6% | 16%   |
| 2^3  | 15.6% | 14%   | 15.6% | 16.4% | 18.4% | 18.4% |

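This grid search is what scikit-learn's GridSearchCV automates; a sketch assuming Ripley's training set is loaded into X, y elsewhere (it is not bundled with sklearn):

```python
# 10-fold cross-validation over the same (C, gamma) grid as the table.
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

param_grid = {
    "C": [0.1, 1, 10, 100, 1000, 10000],
    "gamma": [2.0 ** k for k in range(-3, 4)],
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10)
# search.fit(X, y)               # X, y: Ripley's 250 training samples
# print(search.best_params_)     # the (C, gamma) cell with lowest CV error
```
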
## Noisy Hyperbolas data set

- This example shows the application of different kernels
- Note: the decision boundaries are quite different

(Figure: two panels, RBF kernel (left) and polynomial kernel (right), showing the fitted decision boundaries.)

## Many challenging applications

- Mimic human recognition capabilities:
  - high-dimensional data
  - content-based
  - context-dependent
- Example (intentionally scrambled text):
> Sceitnitss osbevred: it is nt inptrant how lteters are msspled isnide the word. It is ipmoratnt that the fisrt and lsat letetrs do not chngae, tehn the txet is itneprted corrcetly
- SVM is suitable for sparse high-dimensional data

## Example SVM Applications

- Handwritten digit recognition
- Genomics
- Face detection in unrestricted images
- Text/document classification
- Image classification and retrieval
- ...

## Handwritten Digit Recognition (mid-90s)

- Data set: postal (zip-code) images, segmented and cropped; ~7K training samples and 2K test samples
- Data encoding: 16x16 pixel image → 256-dimensional vector
- Original motivation: compare SVM with a custom MLP network (LeNet) designed for this application
- Multi-class problem: one-vs-all approach → 10 SVM classifiers (one per digit)

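A sketch of the one-vs-all scheme with scikit-learn; the built-in 8x8 digits set stands in here for the 16x16 postal images:

```python
# One-vs-all: 10 binary SVMs, one per digit, wired up by OneVsRestClassifier.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ova = OneVsRestClassifier(SVC(kernel="poly", degree=3)).fit(X_tr, y_tr)
print("test accuracy:", ova.score(X_te, y_te))
```
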
## Digit Recognition Results

- Summary:
  - prediction accuracy better than the custom NN's
  - accuracy does not depend on the kernel type
  - 100-400 support vectors per class (digit)
- More details:

| Type of kernel | No. of support vectors | Error % |
|----------------|------------------------|---------|
| Polynomial     | 274                    | 4.0     |
| RBF            | 291                    | 4.1     |
| Neural network | 254                    | 4.2     |

- ~80-90% of the SVs coincide (for different kernels)

## Document Classification (Joachims, 1998)

- The problem: classification of text documents in large databases, for text indexing and retrieval
- Traditional approach: human categorization (i.e. via feature selection); relies on a good indexing scheme, which is time-consuming and costly
- Predictive learning approach (SVM): construct a classifier using all possible features (words)
- Document/text representation: individual words = input features (possibly weighted)
- SVM performance:
  - very promising (~90% accuracy vs 80% by other classifiers)
  - most problems are linearly separable → use a linear SVM

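A sketch of this setup in scikit-learn: raw words become TF-IDF weighted features feeding a linear SVM. The two-category corpus chosen here is arbitrary, for illustration only:

```python
# Words-as-features text classification with a linear SVM.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

cats = ["sci.space", "rec.autos"]
train = fetch_20newsgroups(subset="train", categories=cats)
test = fetch_20newsgroups(subset="test", categories=cats)

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(train.data, train.target)
print("test accuracy:", clf.score(test.data, test.target))
```
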
## OUTLINE

- Margin-based loss
- SVM for classification
- SVM examples
- Support vector regression
- Summary

## Linear SVM regression

Assume the linear parameterization $f(\mathbf{x}, \omega) = (\mathbf{w} \cdot \mathbf{x}) + b$

(Figure: epsilon-tube around the regression line, with slack variables $\xi_1$ above and $\xi_2^*$ below the tube.)

$$L_\varepsilon(y, f(\mathbf{x}, \omega)) = \max\left(|y - f(\mathbf{x}, \omega)| - \varepsilon,\; 0\right)$$

## Direct Optimization Formulation

Given training data $(\mathbf{x}_i, y_i),\ i = 1, \dots, n$, minimize

$$\frac{1}{2} (\mathbf{w} \cdot \mathbf{w}) + C \sum_{i=1}^{n} (\xi_i + \xi_i^*)$$

under constraints

$$\begin{cases} y_i - (\mathbf{w} \cdot \mathbf{x}_i) - b \le \varepsilon + \xi_i \\ (\mathbf{w} \cdot \mathbf{x}_i) + b - y_i \le \varepsilon + \xi_i^* \\ \xi_i,\, \xi_i^* \ge 0, \quad i = 1, \dots, n \end{cases}$$

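A sketch of epsilon-SVR with an RBF kernel on made-up 1-d data; epsilon sets the tube width and thus how few support vectors remain:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 40)).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.1, 40)

# Wider epsilon-tube -> fewer samples outside it -> fewer support vectors
svr = SVR(kernel="rbf", C=10.0, epsilon=0.2, gamma=10.0).fit(x, y)
print("support vectors used:", len(svr.support_), "out of", len(x))
```
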
## Example: SVM regression using RBF kernel

(Figure: 40 noisy samples in the (x, y) plane with the SVM estimate shown as a dashed line.)

- The SVM model uses only 5 SVs (out of the 40 points)

 xc 2 
m
       j 
RBF regression model                             f  x, w    w j exp         2 
j 1        0.20  
           
4

3

2

1

0
y

-1

-2

-3

-4
0   0.1   0.2   0.3   0.4   0.5     0.6   0.7   0.8   0.9   1
x

Weighted sum of 5 RBF kernels gives the SVM model
45
## Summary

- Margin-based loss: robust + performs complexity control
- Nonlinear feature selection (~SVs): performed automatically
- Tractable model selection: easier than for most nonlinear methods
- SVM is not a magic-bullet solution:
  - similar to other methods when n >> h
  - SVM is better when n << h or n ~ h
