Chapter 8a: SVM


MITM 613
Intelligent System
Chapter 8a:
Support Vector Machine
Abdul Rahim Ahmad
Chapter Eight (a): SVM
• Introduction
• Theory
• Implementation
• Tools Comparison
• LIBSVM practical
Introduction (1 of 9)
SVM is mainly used for classification and regression problems.
In classification,
we want to estimate a decision function f from a
set of labelled training data such that f will
correctly classify unseen test examples.
Definition of SVM:
“The Support Vector Machine is a learning machine
for pattern recognition and regression problems which
constructs its solution (decision function f) in terms of
a subset of the training data, the Support Vectors.”
Introduction (2 of 9)
Why the name "machine"?
It is implemented in software – a software machine.
It receives input and produces output – a classification.
What are support vectors?
A (small) subset of the set of input vectors that is
needed for the final machine implementation, i.e. they
support the final machine functionality.
What is the relation to Neural Networks (NN)?
It performs functions similar to an NN – pattern
recognition, function estimation, interpolation,
regression, etc. –
usually with better generalization.
Introduction (3 of 9)
History
SVM came from the idea of the "Generalized Portrait"
algorithm (1963) for constructing separating
hyperplanes with optimal margin.
It was introduced as a large margin classifier at the COLT
1992 conference by Boser, Guyon and Vapnik in the
paper
"A Training Algorithm for Optimal Margin Classifiers".
What is an optimal margin classifier?
A classification algorithm that maximizes the margin
between the nearest points of the separate classes.
Introduction (4 of 9)
Why the need to achieve optimal margin?
An optimal margin leads to better generalization,
implying minimization of the overall risk.
Two kinds of Risk Minimization :
Structural Risk Minimization (SRM)
As in SVM
Empirical Risk Minimization (ERM)
As in Neural Network
Introduction (5 of 9)
What is risk minimization?
Choosing appropriate values for the parameters α that
minimize the expected risk:
  $R(\alpha) = \int Q(z, \alpha)\, dP(z)$
where
α defines the parameterisation
Q is the loss function
z belongs to the union of input and output spaces
P describes the distribution of z
P can only be estimated; this is normally avoided
(to simplify) by using the empirical risk over the l training examples:
  $R_{emp}(\alpha) = \frac{1}{l} \sum_{i=1}^{l} Q(z_i, \alpha)$
Minimizing this is called empirical risk minimisation (as in
NN).
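The empirical risk above can be sketched in a few lines of Python. This is only an illustration: the 0-1 loss, the one-parameter threshold classifier and the toy data are assumptions made for the example, not part of the original slides.

```python
def zero_one_loss(label, prediction):
    """Loss Q: 0 for a correct prediction, 1 for a wrong one."""
    return 0.0 if label == prediction else 1.0


def empirical_risk(samples, predict, loss=zero_one_loss):
    """R_emp = (1/l) * sum of Q over the l labelled samples (x, y)."""
    if not samples:
        raise ValueError("need at least one sample")
    return sum(loss(y, predict(x)) for x, y in samples) / len(samples)


def make_threshold_classifier(alpha):
    """Hypothetical one-parameter machine: predict +1 when x >= alpha."""
    return lambda x: 1 if x >= alpha else -1


# Toy 1-D data set (made up for illustration); the last point acts as noise.
samples = [(-2.0, -1), (-0.5, -1), (0.3, 1), (1.5, 1), (-0.1, 1)]

# Empirical risk minimisation picks the parameter with the lower R_emp.
risk_at_zero = empirical_risk(samples, make_threshold_classifier(0.0))
risk_at_one = empirical_risk(samples, make_threshold_classifier(1.0))
```

Here the parameter α is the threshold itself, so comparing `risk_at_zero` with `risk_at_one` is a two-candidate version of empirical risk minimisation.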
Introduction (6 of 9)
Vapnik (Vapnik, 1995) proved that the bound on the
expected risk is:
  $R(\alpha) \le R_{emp}(\alpha) + f(h)$
where h is the VC dimension, a measure of the capacity of
the learning machine, and f(h) provides the confidence in the
risk. With probability 1 − η:
  $R(\alpha) \le R_{emp}(\alpha) + \Phi\left(\frac{h}{l}, \frac{\log(\eta)}{l}\right)$,
  $\Phi\left(\frac{h}{l}, \frac{\log(\eta)}{l}\right) = \sqrt{\frac{h\left(\log\frac{2l}{h} + 1\right) - \log\frac{\eta}{4}}{l}}$
SRM identifies the optimal point on the curve for the bound on
the expected risk (i.e. the trade-off between expected risk
and complexity of the approximating function)
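To get a feel for how the confidence term behaves, the bound can be evaluated numerically. The sketch below assumes the standard form of the VC confidence term, sqrt((h(log(2l/h) + 1) − log(η/4)) / l); the function names and the sample values of h, l and η are chosen only for illustration.

```python
import math


def vc_confidence(h, l, eta=0.05):
    """Confidence term sqrt((h*(log(2l/h) + 1) - log(eta/4)) / l)."""
    if h <= 0 or l <= 0 or not 0 < eta <= 1:
        raise ValueError("require h > 0, l > 0 and 0 < eta <= 1")
    return math.sqrt((h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l)


def risk_bound(r_emp, h, l, eta=0.05):
    """Upper bound on the expected risk: empirical risk plus confidence term."""
    return r_emp + vc_confidence(h, l, eta)


# With l fixed, a larger VC dimension h loosens the bound.
for h in (10, 50, 200):
    print(h, round(risk_bound(0.05, h, 1000), 3))
```

The term grows with the capacity h and shrinks as the number of training examples l grows, which is the trade-off SRM exploits.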
Introduction (7 of 9)
Risk minimization – two distinct ways:
• Fix the confidence in the risk, optimize the empirical risk – Neural network.
• Fix the empirical risk, optimize the confidence interval – SVM.
• In NN: fix the network structure;
learning -> minimizes the empirical risk
(using gradient descent).
• In SVM: fix the empirical risk
(to a minimum, or 0 for a separable data set);
learning -> optimizes for a minimum confidence
interval
(maximizing the margin of the
separating hyperplane).
Introduction (8 of 9)
To implement SRM
-> Find the largest margin by either of the following methods:
• Find the optimal plane that bisects the closest points in the convex hulls.
• Find the optimal plane that maximizes the margin (more often used).
Introduction (9 of 9)
Most popular classifiers are trained using a Neural
network (NN).
The NN decision function might not be the same for every
training run and for different initial parameter values.
It is not guaranteed to be optimal, since training stops once
convergence is achieved.
For better generalization, we need the optimal decision
function – the one and only.
[Figure: decision functions NN 1 and NN 2 compared with the optimal decision function, which has a large margin between the nearest points of the 2 classes; A, B, C, D are the support vectors.]
Theory (1/15)
3 cases of SVM :
Linearly separable case.
Non-linearly separable case.
Non-separable or imperfect separation case
(allowing for noise).
Theory (2/15)
Linearly separable case.
Specifically, we want to find a plane H: y = w.x + b = 0 and
two planes parallel to it, H1 and H2, such that they are
equidistant from H:
H1: y = w.x + b = +1 and
H2: y = w.x + b = -1.
Also, there should be no data points between H1 and H2,
and the distance M between H1 and H2 is maximized.
[Figure: separating plane H with the parallel margin planes H1 and H2]
Theory (3/15)
The distance of a point on H1 to H is:
|w.x + b|/||w|| = 1/||w||.
Therefore the distance between H1 and H2 is 2/||w||.
Theory (4/15)
In order to maximize the distance, we minimize ||w||.
Furthermore, we do not want any data points between
the two planes. Thus we have:
H1: y = w.xi + b ≥ +1 for positive examples, yi = +1
H2: y = w.xi + b ≤ -1 for negative examples, yi = -1
The two inequalities can be combined: yi (w.xi + b) ≥ 1
The formulation for the optimal hyperplane is:
Minimize ||w|| subject to yi (w.xi + b) ≥ 1 for all i
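As a small numerical check of the margin width 2/||w|| and the constraints yi (w.xi + b) ≥ 1, the following Python sketch uses a hand-picked 2-D data set and a hand-picked (w, b). All names and values are assumptions chosen for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def margin_width(w):
    """Distance between H1 and H2, i.e. 2 / ||w||."""
    return 2.0 / math.sqrt(dot(w, w))


def functional_margins(w, b, points, labels):
    """y_i * (w . x_i + b) for every training point."""
    return [y * (dot(w, x) + b) for x, y in zip(points, labels)]


def is_feasible(w, b, points, labels, tol=1e-12):
    """True when no point lies between H1 and H2 or on the wrong side."""
    return all(m >= 1.0 - tol for m in functional_margins(w, b, points, labels))


# Toy separable data: two positive and two negative examples.
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [1, 1, -1, -1]

w, b = (0.5, 0.5), -1.0
print(is_feasible(w, b, points, labels), margin_width(w))

# A shorter w would give a wider margin but violates the constraints.
print(is_feasible((0.4, 0.4), -0.8, points, labels))
```

The points (2, 2) and (0, 0) sit exactly on H1 and H2 here, so their functional margin is 1 and the width equals the distance between them along w.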
Theory (5/15)
This is a convex quadratic programming problem (in w, b) over a
convex set, which can be solved by introducing N non-negative
Lagrange multipliers α1, α2, …, αN ≥ 0 associated with the
constraints (theory of Lagrange multipliers).
Thus we have the following Lagrangian to solve for the αi's:
  $L(w, b, \alpha) = \frac{1}{2} w^T w - \sum_{i=1}^{N} \alpha_i y_i (w \cdot x_i + b) + \sum_{i=1}^{N} \alpha_i$
We have to minimize this function over w and b and maximize it
over the αi's.
We can solve the Wolfe dual of the Lagrangian instead:
Maximize L(w, b, α) w.r.t. α, subject to the constraints that
the gradient of L(w, b, α) w.r.t. the primal variables w and
b vanishes, i.e. ∂L/∂w = 0
and ∂L/∂b = 0, and that α ≥ 0.
We thus have
  $w = \sum_{i=1}^{N} \alpha_i y_i x_i$ and $\sum_{i=1}^{N} \alpha_i y_i = 0$
Theory (6/15)
Putting $w = \sum_{i=1}^{N} \alpha_i y_i x_i$ and $\sum_{i=1}^{N} \alpha_i y_i = 0$ into L(w, b, α), we get the
Wolfe dual:
  $L_d = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$
in which the input data only appear in a dot product.
We solve for the αi's which maximize Ld subject to αi ≥ 0, i = 1, …, N,
and $\sum_{i=1}^{N} \alpha_i y_i = 0$.
The hyperplane decision function is thus:
  $f(x) = \mathrm{sgn}\left(\sum_{i=1}^{N} \alpha_i y_i (x_i \cdot x) + b\right)$ or $f(x) = \mathrm{sgn}(w \cdot x + b)$
Since αi > 0 only for points on the margin and αi = 0 for the others, only
the points with αi > 0 play a role in the decision function. They are called
support vectors.
The number of support vectors is usually small, thus we say that
the solution to SVM is sparse.
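A worked example may help here. For the two 1-D points x1 = 1 (y1 = +1) and x2 = −1 (y2 = −1), the constraint Σ αi yi = 0 forces α1 = α2 = a, and the dual reduces to Ld = 2a − 2a², which peaks at a = 0.5. The Python sketch below checks this by a simple grid search; the data set, the grid and the helper names are assumptions for illustration, and a real solver would use a QP method instead.

```python
def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def dual_objective(alpha, points, labels):
    """L_d = sum(alpha_i) - 1/2 * sum_ij alpha_i alpha_j y_i y_j (x_i . x_j)."""
    n = len(alpha)
    quad = sum(
        alpha[i] * alpha[j] * labels[i] * labels[j] * dot(points[i], points[j])
        for i in range(n)
        for j in range(n)
    )
    return sum(alpha) - 0.5 * quad


def primal_from_dual(alpha, points, labels):
    """Recover w = sum alpha_i y_i x_i and b from one support vector."""
    dim = len(points[0])
    w = [sum(a * y * x[k] for a, y, x in zip(alpha, labels, points)) for k in range(dim)]
    sv = next(i for i, a in enumerate(alpha) if a > 0)
    b = labels[sv] - dot(w, points[sv])  # from y_sv * (w . x_sv + b) = 1
    return w, b


points = [(1.0,), (-1.0,)]
labels = [1, -1]

# Grid search over a in [0, 1] with alpha_1 = alpha_2 = a.
candidates = [k / 100 for k in range(101)]
best_a = max(candidates, key=lambda a: dual_objective([a, a], points, labels))
w, b = primal_from_dual([best_a, best_a], points, labels)
```

The recovered hyperplane is w = 1, b = 0, which puts H1 and H2 through the two training points as expected.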
Theory (7/15)
Non-linear (separable) case
In this case, we can transform the data points into another
high-dimensional space such that the data points will be
linearly separable in the new space. We construct the optimal
separating hyperplane in that space.
Let the transformation be Φ(.). In the high-dimensional
space, we solve:
  $L_d = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, \Phi(x_i) \cdot \Phi(x_j)$
[Figure: example of a mapping from 2D to 3D]
Theory (8/15)
Non-linear (separable) case
In place of the dot product, if we can find a kernel function
which performs this dot product implicitly, we can replace it with
that kernel (i.e. perform a kernel evaluation instead of explicitly
mapping the training data):
  $L_d = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$
The hyperplane decision function is now:
  $f(x) = \mathrm{sgn}\left(\sum_{i=1}^{N} \alpha_i y_i K(x_i, x) + b\right)$
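The kernel decision function can be illustrated on the XOR problem, which is not linearly separable in the input space. The sketch below assumes a degree-2 polynomial kernel and uses the multipliers αi = 1/8 with b = 0; these values satisfy Σ αi yi = 0 and place every point exactly on the margin, but the data set and names are chosen here for illustration only.

```python
def poly_kernel(x, z, degree=2):
    """Polynomial kernel (x . z + 1)^degree."""
    return (sum(p * q for p, q in zip(x, z)) + 1.0) ** degree


def decision_value(x, support_vectors, labels, alphas, b, kernel):
    """sum_i alpha_i y_i K(x_i, x) + b, before taking the sign."""
    return sum(
        a * y * kernel(sv, x) for a, y, sv in zip(alphas, labels, support_vectors)
    ) + b


def classify(x, support_vectors, labels, alphas, b, kernel):
    return 1 if decision_value(x, support_vectors, labels, alphas, b, kernel) >= 0 else -1


# XOR-style data: the label is +1 when the two coordinates differ.
xor_points = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]
xor_labels = [-1, 1, 1, -1]
xor_alphas = [0.125, 0.125, 0.125, 0.125]  # assumed multipliers, all support vectors
xor_b = 0.0

predictions = [
    classify(x, xor_points, xor_labels, xor_alphas, xor_b, poly_kernel)
    for x in xor_points
]
```

No explicit mapping Φ is ever computed; only kernel evaluations between the query point and the support vectors are needed.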
Theory (9/15)
SVM for the non-linear separable case
An SVM corresponds to a non-linear
decision surface in the input space R2.
[Figure: data points in the input space R2; mapping from R2 via Φ into R3; hyperplane in the feature space R3]
Theory (10/15)
Non-linear (separable) case
To determine whether a dot product in a high-dimensional space is
equivalent to a kernel function in input space, i.e. Φ(xi).Φ(xj) =
K(xi, xj),
use Mercer's condition.
We need not be explicit about the transformation Φ(.) as
long as we know that K(xi, xj) is equivalent to the dot product in
some other high-dimensional space.
Kernel functions that can be used this way:
Linear kernel: $K(x, y) = x \cdot y$
Polynomial kernels: $K(x, y) = (x \cdot y + 1)^d$
Radial basis function (Gaussian kernel): $K(x, y) = e^{-\|x - y\|^2 / (2\sigma^2)}$
Hyperbolic tangent kernel: $K(x, y) = \tanh(a\, x \cdot y + b)$
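The four kernels listed above are short enough to write out directly. The sketch below also checks, for 2-D inputs and d = 2, that the polynomial kernel equals an explicit dot product in a 6-dimensional feature space; the feature map `poly2_feature_map` and the default parameter values are assumptions for the example. Note that the hyperbolic tangent kernel satisfies Mercer's condition only for some choices of a and b.

```python
import math


def dot(x, y):
    return sum(p * q for p, q in zip(x, y))


def linear_kernel(x, y):
    return dot(x, y)


def polynomial_kernel(x, y, d=2):
    return (dot(x, y) + 1.0) ** d


def rbf_kernel(x, y, sigma=1.0):
    sq_dist = sum((p - q) ** 2 for p, q in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def tanh_kernel(x, y, a=1.0, b=0.0):
    return math.tanh(a * dot(x, y) + b)


def poly2_feature_map(x):
    """Explicit map for (x . y + 1)^2 on 2-D inputs (illustrative only)."""
    x1, x2 = x
    r = math.sqrt(2.0)
    return (1.0, r * x1, r * x2, x1 * x1, x2 * x2, r * x1 * x2)


u, v = (1.0, 2.0), (3.0, -1.0)
implicit = polynomial_kernel(u, v, d=2)                   # one kernel evaluation
explicit = dot(poly2_feature_map(u), poly2_feature_map(v))  # dot product in 6-D
```

`implicit` and `explicit` agree up to rounding, which is the point of the kernel trick: the high-dimensional dot product is obtained without ever building the mapped vectors.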
Theory (11/15)
Imperfect separation case
There is no strict enforcement that there be no data points
between hyperplanes H1 and H2,
but data points that are on the wrong side are penalized.
The penalty C is finite and has to be chosen by the user.
A large C means a higher penalty.
We introduce non-negative slack variables ξi ≥ 0 so that:
w.xi + b ≥ +1 - ξi for yi = +1
w.xi + b ≤ -1 + ξi for yi = -1
ξi ≥ 0 for all i.
Theory (12/15)
We add a penalising term to the objective function:
  $\min_{w, b, \xi} \frac{1}{2} w^T w + C \sum_i (\xi_i)^m$
where m is usually set to 1, which gives us
  $\min_{w, b, \xi} \frac{1}{2} w^T w + C \sum_{i=1}^{N} \xi_i$
  subject to $y_i (w^T x_i + b) + \xi_i - 1 \ge 0$, $1 \le i \le N$
  $\xi_i \ge 0$, $1 \le i \le N$
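For a fixed (w, b), the smallest slack that satisfies the constraints is ξi = max(0, 1 − yi (w.xi + b)), so the objective with m = 1 can be evaluated directly. The Python sketch below does this on a toy data set with one noisy point; the data, the chosen (w, b) and the function names are assumptions for illustration.

```python
def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def slacks(w, b, points, labels):
    """Smallest feasible slack per point: max(0, 1 - y_i (w . x_i + b))."""
    return [max(0.0, 1.0 - y * (dot(w, x) + b)) for x, y in zip(points, labels)]


def soft_margin_objective(w, b, points, labels, C=1.0):
    """1/2 w^T w + C * sum of slacks (the m = 1 case)."""
    return 0.5 * dot(w, w) + C * sum(slacks(w, b, points, labels))


# Separable toy data plus one negative example placed among the positives.
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0), (1.5, 1.5)]
labels = [1, 1, -1, -1, -1]
w, b = (0.5, 0.5), -1.0

xi = slacks(w, b, points, labels)
objective_small_c = soft_margin_objective(w, b, points, labels, C=1.0)
objective_large_c = soft_margin_objective(w, b, points, labels, C=10.0)
```

Only the noisy point gets a non-zero slack, and raising C from 1 to 10 makes that single violation dominate the objective, which is what "large C means higher penalty" amounts to.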
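Theory (13/15)
Imperfect separation case
Introducing Lagrange multipliers α and μ, the Lagrangian is:
  $L(w, b, \xi, \alpha, \mu) = \frac{1}{2} w^T w + C \sum_{i=1}^{N} \xi_i - \sum_{i=1}^{N} \alpha_i [y_i (w \cdot x_i + b) + \xi_i - 1] - \sum_{i=1}^{N} \mu_i \xi_i$
  $= \frac{1}{2} w^T w + \sum_{i=1}^{N} (C - \alpha_i - \mu_i) \xi_i - \left(\sum_{i=1}^{N} \alpha_i y_i x_i\right)^T w - \left(\sum_{i=1}^{N} \alpha_i y_i\right) b + \sum_{i=1}^{N} \alpha_i$
• Similarly, solving for the Wolfe dual, neither the ξi nor their
Lagrange multipliers μi appear in the dual problem. Maximize
  $L_d = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i \cdot x_j$
subject to $0 \le \alpha_i \le C$ and $\sum_{i=1}^{N} \alpha_i y_i = 0$.
• The only difference from the perfectly separating case is that
αi is now bounded by C. The solution is again given by
  $w = \sum_{i=1}^{N} \alpha_i y_i x_i$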
Theory (14/15)
Different SVM objective functions lead to
different SVM variations:
Using the l1 norm (most commonly used):
  $\min_{w, \xi, b} \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i$
Using the l2 norm:
  $\min_{w, \xi, b} \frac{1}{2} w^T w + \frac{1}{2} C \sum_{i=1}^{l} \xi_i^2$
Using the l1 norm for w – linear programming (LP) SVM:
  $\min_{w, \xi, b} \sum_{i=1}^{l} |w_i| + C \sum_{i=1}^{l} \xi_i$
Using the ν parameter for controlling the number of support
vectors
l
1 l 2
min wi i
w , ,b l i 1
i 1
Theory (15/15)
SVM architecture (for Neural Network users)
The kernel function K is chosen a priori (it determines the type of classifier).
Training – solve a quadratic programming problem to find
the number of hidden units (number of support vectors),
weights (w),
threshold (b).
The first-layer weights xi are a subset of the training set (the support
vectors).
The second-layer weights yi αi are computed from the Lagrange
multipliers.
  $f(x) = \mathrm{sgn}\left(\sum_{i=1}^{N} \alpha_i y_i K(x_i, x) + b\right)$
Application (1/1)
SVM applications
SVM has been applied to a number of applications such as:
Image classification
Time series prediction
Face recognition
Biological data processing for medical diagnosis
Digit recognition (MLP-SVM)
Text categorisation
Speech recognition (using hybrid SVM/HMM)
Implementation (1/6)
SVM implementation
High-performance classifiers through the
use of kernels.
Different kernel functions lead to
very similar classification accuracies
and produce similar SV sets
(that is, the SV set seems to characterize the given task
up to a certain degree, independent of the type of
kernel).
Implementation (2/6)
SVM implementation
The main issues are classification accuracy and speed.
To improve speed, a number of improvements to the
original SVM have been developed:
Chunking – Osuna
Sequential Minimal Optimization (SMO) – Platt
Nearest Point Algorithm – Keerthi
Implementation (3/6)
SVM software implementation
In high-level languages (C, C++, FORTRAN):
SVMlight – Thorsten Joachims
mySVM – Ruping
SMO in C++ – XiaPing Yi
LIBSVM – Chih-Jen Lin
MATLAB toolboxes:
OSU SVM Toolbox – Junshui Ma and Stanley Ahalt
MATLAB Support Vector Machine Toolbox – Gavin Cawley
Matlab routines for support vector machine classification – Anton Schwaighofer
MATLAB Support Vector Machine Toolbox – Steve Gunn
LearnSC – Vojislav Kecman
LIBSVM interface – students of C.J. Lin
Implementation (4/6)
Steps in SVM training
Select the parameter C (representing the
tradeoff between minimizing the training error
and margin maximization), kernel function and
any kernel parameters.
Solve the dual QP or alternative problem
formulation using appropriate QP or LP algorithm
to obtain the support vectors.
Calculate threshold b using the support vectors.
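The last step, calculating the threshold b, can be sketched as follows. For any support vector xk with 0 < αk < C the margin condition gives b = yk − Σ αi yi K(xi, xk), and averaging over all such vectors is a common way to reduce rounding noise. The 1-D data set, the multipliers (taken as already solved) and the function names below are assumptions for illustration.

```python
def linear_kernel(x, z):
    return sum(p * q for p, q in zip(x, z))


def compute_threshold(points, labels, alphas, kernel, C, tol=1e-9):
    """Average b = y_k - sum_i alpha_i y_i K(x_i, x_k) over SVs with 0 < alpha_k < C."""
    estimates = []
    for xk, yk, ak in zip(points, labels, alphas):
        if tol < ak < C - tol:
            s = sum(a * y * kernel(x, xk) for a, y, x in zip(alphas, labels, points))
            estimates.append(yk - s)
    if not estimates:
        raise ValueError("no support vector strictly inside (0, C)")
    return sum(estimates) / len(estimates)


def decision_value(x, points, labels, alphas, b, kernel):
    return sum(a * y * kernel(p, x) for a, y, p in zip(alphas, labels, points)) + b


# 1-D toy problem: x = 2 and x = 0 are the support vectors, x = 4 is not.
points = [(2.0,), (0.0,), (4.0,)]
labels = [1, -1, 1]
alphas = [0.5, 0.5, 0.0]  # assumed output of the QP step

b = compute_threshold(points, labels, alphas, linear_kernel, C=1.0)
```

With these values the decision boundary falls at x = 1, midway between the two support vectors, and the non-support vector at x = 4 has no influence on b.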
Implementation (5/6)
Model selection:
Minimize an estimate of the generalization error or some
related performance measure.
K-fold cross-validation and leave-one-out (LOO) estimates.
Other recent model selection strategies are based on a
bound determined by a quantity (through theoretical
analysis) that is obtained without retraining with data
points left out (as in cross-validation or LOO):
SV count / Jaakkola-Haussler bound / Opper-Winther bound /
radius-margin bound / span bound
10-fold cross-validation is popularly used and is used in my
work.
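K-fold cross-validation itself is independent of the SVM solver, so it can be sketched with a stand-in model. In the Python sketch below, `train_threshold` and `error_rate` are hypothetical placeholders for an SVM trainer and its test error; only the fold-splitting logic is the point. Setting k equal to the number of samples gives the leave-one-out estimate.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("require 2 <= k <= n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def cross_validate(samples, k, train_fn, error_fn, seed=0):
    """Mean held-out error over the k folds."""
    errors = []
    for train_idx, test_idx in k_fold_indices(len(samples), k, seed):
        model = train_fn([samples[i] for i in train_idx])
        errors.append(error_fn(model, [samples[i] for i in test_idx]))
    return sum(errors) / len(errors)


def train_threshold(train):
    """Stand-in model: midpoint between the two class means (not an SVM)."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == -1]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0


def error_rate(threshold, test):
    return sum((1 if x >= threshold else -1) != y for x, y in test) / len(test)


samples = [(float(x), 1 if x >= 10 else -1) for x in range(20)]
cv_error = cross_validate(samples, 5, train_threshold, error_rate)
```

For model selection, this estimate would be computed for each candidate setting of C and the kernel parameters, and the setting with the lowest value kept.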
Implementation (6/6)
Different methods for QP optimization:
(a) Techniques in which kernel components are
evaluated and discarded during learning
Kernel Adatron
(b) Decomposition methods in which an evolving
subset of the data is used
Sequential Minimal Optimization (SMO)
SVMlight / LIBSVM
(c) New optimization approaches that specifically
exploit the structure of the SVM problem
Nearest point algorithm (NPA)
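As an illustration of category (a), the sketch below follows the general idea of the Kernel Adatron: each multiplier is nudged by η(1 − yi Σ αj yj K(xj, xi)) and clipped to [0, C], with kernel values computed on the fly rather than stored. It is a simplified version without a bias term, so the constraint Σ αi yi = 0 is not enforced; the step size, sweep count and XOR data are assumptions chosen for the example.

```python
def poly_kernel(x, z, degree=2):
    return (sum(p * q for p, q in zip(x, z)) + 1.0) ** degree


def kernel_adatron(points, labels, kernel, C=1.0, eta=0.1, sweeps=200):
    """Simplified Kernel Adatron-style updates (no bias term)."""
    n = len(points)
    alphas = [0.0] * n
    for _ in range(sweeps):
        for i in range(n):
            # Kernel values are evaluated when needed and then discarded.
            margin = labels[i] * sum(
                alphas[j] * labels[j] * kernel(points[j], points[i]) for j in range(n)
            )
            # Gradient step on the dual, clipped to the box [0, C].
            alphas[i] = min(C, max(0.0, alphas[i] + eta * (1.0 - margin)))
    return alphas


xor_points = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]
xor_labels = [-1, 1, 1, -1]
adatron_alphas = kernel_adatron(xor_points, xor_labels, poly_kernel)
```

On this symmetric data set the multipliers settle near 0.125 each, so Σ αi yi = 0 happens to hold even though the simplified update does not impose it.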
Tools Comparison – SVMTorch / SVMLight / LIBSVM

Features | SVMTorch | SVMLight | LIBSVM
Developer | Ronan Collobert | Thorsten Joachims | Chih-Jen Lin
Uses | Classification; regression | Classification; regression; ranking | C-SVC / ν-SVC; regression: ε-SVR / ν-SVR; distribution estimation / one-class SVM
Language | C++ | C | C/C++/Java, with Python/Matlab/R/Perl interfaces
Optimization method | Decomposition; working set of size 2 | Decomposition; working set of size 2 or more | Decomposition; working set of size 2 or more
Internal cache | Yes | Yes | Yes
Shrinking | Optional | Yes | Yes
Generalization performance estimates | None | Yes: LOO and Xi-alpha estimates | Yes: automatic cross-validation functionality
Multiclass | Yes: one against all | No: needs to be added by the user | Yes: one against all; one against one with DAG
Extras | | | Weighted SVM for unbalanced datasets

Shrinking: removing variables that have been equal to the bounds 0 or C for a long time.
Implementation
SVMTorch (III)
Implementation
SVMLight (III)
Implementation
LIBSVM (III)
LIBSVM
LIBSVM History
1.0 : June 2000 First Release.
2.0 : Aug 2000 Major updates – add nu-svm, one-class
svm, and svr
2.1 : Dec 2000 Java version added, regression demonstrated in
svm-toy
2.2 : Jan 2001 Multi-class classification, nu-SVR
2.3 : Mar 2001 Cross validation, fix some minor bugs
2.31: April 2001 Fix one bug on one-class SVM, use float for Cache
2.33: Dec 2001 Python interface added
2.36: Aug 2002 grid.py added: contour plot of CV accuracy
2.4 : April 2003 improvements of scaling
2.5 : Nov 2003 some minor updates
2.6 : April 2004 Probability estimates for
classification/regression
2.7 : Nov 2004 Stratified cross validation
2.8 : April 2005 New working set selection via
second order information
LIBSVM Current Version
2.81: Nov 2005
2.82: Apr 2006
2.83: Nov 2006
2.84: April 2007
2.85: Nov 2007
2.86: April 2008
2.87: October 2008
2.88: October 2008
2.89: April 2009
2.9: November 2009
2.91: April 2010
3.0 : September 13, 2010
3.12: April Fools' day, 2012
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
LIBSVM for Windows
Java
C/C++
LIBSVM in MATLAB
LIBSVM in R package
LIBSVM in WEKA