What is a Support Vector Machine?

• An optimally defined surface
• Typically nonlinear in the input space
• Linear in a higher-dimensional space
• Implicitly defined by a kernel function

Acknowledgments: These slides combine and modify ones
provided by Andrew Moore (CMU), Glenn Fung (Wisconsin), and
Olvi Mangasarian (Wisconsin)

CS 540, University of Wisconsin-Madison, C. R. Dyer
What are Support Vector Machines Used For?
• Classification
• Regression and data-fitting
• Supervised and unsupervised learning

Linear Classifiers
x → f → y, where f(x, w, b) = sign(w · x + b)
[Figure: 2-D training points; one marker denotes +1, the other denotes -1]
How would you classify this data?
Linear Classifiers (aka Linear Discriminant Functions)
• Definition: a function that is a linear combination of the components of the input x:

    f(x) = Σ_{j=1}^{m} wj xj + b = wᵀx + b

where w is the weight vector and b is the bias
• A two-category classifier then uses the rule:
  Decide class c1 if f(x) > 0 and class c2 if f(x) < 0
  ⇒ decide c1 if wᵀx > -b and c2 otherwise
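The decision rule above can be sketched in a few lines of NumPy; the weight vector, bias, and input points here are made-up values chosen only to exercise the rule:

```python
import numpy as np

def linear_classify(x, w, b):
    """Decide class c1 (+1) if w.x + b > 0, else class c2 (-1)."""
    return 1 if w @ x + b > 0 else -1

# Hypothetical weights and inputs for illustration
w = np.array([2.0, -1.0])
b = 0.5
print(linear_classify(np.array([1.0, 1.0]), w, b))   # w.x + b = 1.5 > 0  -> +1
print(linear_classify(np.array([-1.0, 1.0]), w, b))  # w.x + b = -2.5 <= 0 -> -1
```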

Linear Classifiers
x → f → y, where f(x, w, b) = sign(w · x + b)
Any of these would be fine …
… but which is best?
Classifier Margin
f(x, w, b) = sign(w · x + b)
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a data point.
Maximum Margin
f(x, w, b) = sign(w · x + b)
The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM, called a Linear SVM (LSVM).
Support Vectors are those data points that the margin pushes up against.
Why Maximum Margin?
1. Intuitively this feels safest.
2. If we've made a small error in the location of the boundary (it's been jolted in its perpendicular direction), this gives us the least chance of causing a misclassification.
3. Robust to outliers, since the model is immune to change/removal of any non-support-vector data points.
4. There's some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.
5. Empirically it works very well.
Specifying a Line and Margin
[Figure: Plus-Plane, Classifier Boundary, Minus-Plane]
• How do we represent this mathematically?
• … in d input dimensions?
• An example: x = (x1, …, xd)ᵀ
Specifying a Line and Margin
Weight vector: w = (w1, …, wd)ᵀ
Bias or threshold: b
• Plus-plane = { x : wᵀx + b = +1 }
• Minus-plane = { x : wᵀx + b = -1 }

Classify as:
  +1 if wᵀx + b ≥ +1
  -1 if wᵀx + b ≤ -1
  Universe explodes if -1 < wᵀx + b < 1
Computing the Margin
M = margin (width)
How do we compute M in terms of w and b?
• Plus-plane = { x : wᵀx + b = +1 }
• Minus-plane = { x : wᵀx + b = -1 }
Claim: the vector w is perpendicular to the plus-plane.
w is the plane's normal vector; b sets the plane's offset (the distance from the origin to the plane is |b| / ||w||).
Computing the Margin
How do we compute M in terms of w and b?
• Plus-plane = { x : wᵀx + b = +1 }
• Minus-plane = { x : wᵀx + b = -1 }
• The vector w is perpendicular to the plus-plane
• Let x- be any point on the minus-plane (any location in Rᵐ, not necessarily a data point)
• Let x+ be the closest plus-plane point to x-
• Claim: x+ = x- + λw for some value of λ. Why? The line from x- to x+ is perpendicular to the planes, so to get from x- to x+ you travel some distance in the direction of w.
Computing the Margin
What we know:
• wᵀx+ + b = +1
• wᵀx- + b = -1
• x+ = x- + λw
• |x+ - x-| = M
It's now easy to get M in terms of w and b:

  wᵀ(x- + λw) + b = 1
  ⇒ wᵀx- + b + λ wᵀw = 1
  ⇒ -1 + λ wᵀw = 1
  ⇒ λ = 2 / (wᵀw)
Computing the Margin
  M = |x+ - x-| = |λw| = λ|w| = λ√(wᵀw)
    = (2 / (wᵀw)) · √(wᵀw)
    = 2 / √(wᵀw) = 2 / ||w||
So the margin is M = 2 / √(wᵀw).
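The derivation can be sanity-checked numerically; the w and b below are arbitrary values chosen for illustration:

```python
import numpy as np

w = np.array([3.0, 4.0])  # arbitrary weight vector, ||w|| = 5
b = 1.0

# Pick a point on the minus-plane: w.x + b = -1  =>  x_minus = -(1 + b) w / (w.w)
x_minus = -(1 + b) * w / (w @ w)

# Step lambda = 2 / (w.w) along w to reach the plus-plane
lam = 2 / (w @ w)
x_plus = x_minus + lam * w

# The distance travelled is the margin M = 2 / ||w||
M = np.linalg.norm(x_plus - x_minus)
print(M, 2 / np.linalg.norm(w))  # both 0.4 here, since ||w|| = 5
```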
Learning the Maximum Margin Classifier
M = margin = 2 / √(wᵀw)
Given a guess of w and b we can
• Compute whether all data points are in the correct half-planes
• Compute the width of the margin
So now we just need to write a program to search the space of w's and b's to find the widest margin that matches all the data points. How?
Learning via Quadratic Programming
• Quadratic Programming (QP) is a well-studied class of optimization algorithms for optimizing a quadratic function of some real-valued variables subject to linear constraints
• Minimize ||w||² subject to
    wᵀxk + b ≥ +1 if xk is in class 1
    wᵀxk + b ≤ -1 if xk is in class 2
  (Minimizing ||w||² maximizes the margin 2 / ||w||.)
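As a rough sketch of that optimization, here is the same problem handed to scipy's general-purpose SLSQP solver (a dedicated QP solver would normally be used); the tiny linearly separable data set is made up for illustration:

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy data (invented for illustration)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

# Variables v = (w1, w2, b); minimize ||w||^2 s.t. y_k (w.x_k + b) >= 1
objective = lambda v: v[0] ** 2 + v[1] ** 2
constraints = [{"type": "ineq",
                "fun": lambda v, k=k: y[k] * (X[k] @ v[:2] + v[2]) - 1}
               for k in range(len(y))]

# Start from a feasible (but not maximum-margin) separating plane w = (1, 0), b = 0
res = minimize(objective, x0=np.array([1.0, 0.0, 0.0]),
               constraints=constraints, method="SLSQP")
w, b = res.x[:2], res.x[2]
print("margin:", 2 / np.linalg.norm(w))  # widest margin over all separating planes
```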
Uh-oh! This is going to be a problem! What should we do?
[Figure: data that is not linearly separable; one marker denotes +1, the other -1]

Idea 1:
  Find minimum ||w||² while minimizing the number of training set errors
  Problem: two things to minimize makes for an ill-defined optimization

Idea 1.1:
  Minimize ||w||² + C (#train errors)
  There's a serious practical problem that's about to make us reject this approach. Can you guess what it is?
  It can't be expressed as a Quadratic Programming problem, so solving it may be too slow. (Also, it doesn't distinguish between disastrous errors and near misses.)

Idea 2.0:
  Minimize ||w||² + C (distance of error points to their correct place)
Learning Maximum Margin with Noise
M = 2 / √(wᵀw)
Given a guess of w, b we can
• Compute the sum of distances of points to their correct zones
• Compute the margin width
Assume N examples, each (xk, yk) where yk = ±1
What should our quadratic optimization criterion be? How many constraints will we have? What should they be?
Learning Maximum Margin with Noise
[Figure: margin with slack variables ε2, ε7, ε11 marking points in the wrong zone]
What should our quadratic optimization criterion be?
  Minimize  ½ wᵀw + C Σ_{k=1}^{N} εk
How many constraints will we have? N. What should they be?
  wᵀxk + b ≥ +1 - εk if yk = +1
  wᵀxk + b ≤ -1 + εk if yk = -1
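scikit-learn's SVC implements this soft-margin formulation, with its C parameter playing the same role as the C above; the toy data, with one deliberately mislabeled point, is invented for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# Two separable clusters plus one mislabeled point (made up)
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.0],
              [-1.0, -1.0], [-2.0, -1.0], [2.2, 2.4]])
y = np.array([1, 1, 1, -1, -1, -1])  # last point is "noise" inside the +1 cluster

# C trades margin width against the total slack paid for violations
clf = SVC(kernel="linear", C=1.0).fit(X, y)
print("training accuracy:", clf.score(X, y))
print("number of support vectors:", len(clf.support_vectors_))
```

With a moderate C the solver prefers a wide margin and simply pays slack for the noisy point rather than contorting the boundary around it.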
Suppose We're in 1 Dimension
What would SVMs do with this data?
[Figure: labeled points on the line, with x = 0 marked]
Suppose We're in 1 Dimension
Not a big surprise
[Figure: the max-margin threshold, with the positive "plane" and negative "plane" marked, x = 0]
Harder 1-Dimensional Dataset
That's wiped the smirk off SVM's face. What can be done about this?
[Figure: 1-D data that no single threshold separates, x = 0]
Harder 1-Dimensional Dataset
The Kernel Trick: preprocess the data, mapping x into a higher dimensional space, F(x):

  zk = (xk, xk²)

[Figure: the lifted 2-D data, now linearly separable]
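A quick numerical illustration of that mapping; the 1-D points (negatives near the origin, positives farther out) are invented for the example:

```python
import numpy as np

# 1-D data that no single threshold can separate (made up)
x = np.array([-3.0, -2.0, -0.5, 0.5, 2.0, 3.0])
y = np.array([1, 1, -1, -1, 1, 1])  # +1 far from 0, -1 near 0

# Lift each point with z = (x, x^2)
z = np.stack([x, x ** 2], axis=1)

# In the lifted space the rule "x^2 > 2" is a linear separator: w = (0, 1), b = -2
w, b = np.array([0.0, 1.0]), -2.0
pred = np.sign(z @ w + b)
print(pred)  # matches y
```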
• Project examples into some higher dimensional space where the data is linearly separable, defined by z = F(x)
• Training depends only on dot products of the form F(xi) · F(xj)
• Example:

    F(x) = (x1², √2 x1x2, x2²)
    K(xi, xj) = F(xi) · F(xj) = (xi · xj)²

• Dimensionality of z space is generally much larger than the dimension of input space x
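The identity in the example can be checked directly; the vectors a and b are arbitrary:

```python
import numpy as np

def F(x):
    # Explicit feature map for the quadratic kernel: F(x) = (x1^2, sqrt(2) x1 x2, x2^2)
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

lhs = F(a) @ F(b)    # dot product in the 3-D feature space
rhs = (a @ b) ** 2   # kernel evaluated in the 2-D input space
print(lhs, rhs)      # both 121.0
```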

Common SVM Basis Functions
zk = (polynomial terms of xk of degree 1 to q)
For example, when q = 2 and m = 2,
  K(x, y) = (x1y1 + x2y2 + 1)²
          = 1 + 2x1y1 + 2x2y2 + 2x1x2y1y2 + x1²y1² + x2²y2²

zk = (radial basis functions of xk)
  zk[j] = φj(xk) = KernelFn( |xk - cj| / KW )

zk = (sigmoid functions of xk)
SVM Kernel Functions
• K(a, b) = (a · b + 1)^d is an example of an SVM kernel function
• Beyond polynomials there are other very high dimensional basis functions that can be made practical by finding the right kernel function
• Radial-Basis-style kernel function:

    K(a, b) = exp( -(a - b)² / (2σ²) )

  σ, κ, and δ are magic parameters that must be chosen by a model selection method such as CV or VCSRM
• Neural-Net-style kernel function:

    K(a, b) = tanh(κ a · b - δ)
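A sketch of the radial-basis kernel on a few made-up points; σ here is one of the "magic parameters" above, and any valid kernel matrix should come out symmetric and positive semidefinite:

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    # K(a, b) = exp(-||a - b||^2 / (2 sigma^2))
    d = a - b
    return np.exp(-(d @ d) / (2 * sigma ** 2))

# A few arbitrary 2-D points
pts = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 2.0])]
K = np.array([[rbf_kernel(p, q) for q in pts] for p in pts])

print(np.allclose(K, K.T))                    # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-9)   # positive semidefinite
```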
The Federalist Papers
• Written in 1787-1788 by Alexander Hamilton, John Jay, and James Madison to persuade the citizens of New York to ratify the constitution
• Papers consisted of short essays, 900 to 3500 words in length
• Authorship of 12 of those papers has been in dispute (Madison or Hamilton); these papers are referred to as the disputed Federalist papers
Description of the Data
• For every paper:
  • Machine readable text was created using a scanner
  • Computed relative frequencies of 70 words that Mosteller and Wallace identified as good candidates for author attribution
  • Each document is represented as a vector containing the 70 real numbers corresponding to the 70 word frequencies
• The dataset consists of 118 papers:
  • 50 Madison papers
  • 56 Hamilton papers
  • 12 disputed papers
Function Words Based on Relative Frequencies
[Figure: the function words and their relative frequencies]
SLA Feature Selection for Classifying the Disputed Federalist Papers
• Apply the SVM Successive Linearization Algorithm for feature selection to:
  • Train on the 106 Federalist papers with known authors
  • Find a classification hyperplane that uses as few words as possible
• Use the hyperplane to classify the 12 disputed papers
Hyperplane Classifier Using 3 Words
• A hyperplane depending on three words was found:

    0.537 to + 24.663 upon + 2.953 would = 66.616

• All disputed papers ended up on the Madison side of the plane
Results: 3D Plot of Hyperplane
[Figure: 3-D plot of the separating hyperplane in the (to, upon, would) space]
Multi-Class Classification
• SVMs can only handle two-class outputs
• What can be done?
• Answer: for N-class problems, learn N SVMs:
  • SVM 1, f1, learns "Output = 1" vs "Output ≠ 1"
  • SVM 2, f2, learns "Output = 2" vs "Output ≠ 2"
  • ⋮
  • SVM N, fN, learns "Output = N" vs "Output ≠ N"
Multi-Class Classification
• Ideally, only one fi(x) > 0 and all others < 0, but this is often not the case in practice
• Instead, to predict the output for a new input, just predict with each SVM and find out which one puts the prediction the furthest into the positive region:
• Classify as class Ci if fi(x) = max { fj(x) } over all j
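The one-SVM-per-class scheme can be sketched with scikit-learn's SVC; the three well-separated clusters are invented for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# Three well-separated clusters (made-up data)
X = np.array([[0.0, 0.0], [0.3, 0.2], [5.0, 5.0],
              [5.2, 4.8], [0.0, 5.0], [0.2, 5.1]])
y = np.array([0, 0, 1, 1, 2, 2])

# Train one SVM per class: "class c" vs "not class c"
svms = [SVC(kernel="linear").fit(X, (y == c).astype(int)) for c in range(3)]

# Score every input with every SVM; pick the class whose SVM is
# most confidently positive (furthest into the positive region)
scores = np.stack([m.decision_function(X) for m in svms], axis=1)
pred = np.argmax(scores, axis=1)
print(pred)  # recovers the original labels on this easy data
```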
Summary
• Learning linear functions
• Pick separating plane that maximizes margin
• Separating plane defined in terms of support
vectors only
• Learning non-linear functions
• Project examples into higher dimensional space
• Use kernel functions for efficiency
• Generally avoids over-fitting problem
• Global optimization method; no local optima
• Can be expensive to apply, especially for multi-
class problems

