A Simple Introduction to Support Vector Machines

Vectors: notations
A vector in an n-dimensional space is described by an n-tuple of real numbers:

$$A = \begin{pmatrix} A_1 \\ A_2 \\ \vdots \\ A_n \end{pmatrix} \qquad A^T = (A_1 \;\; A_2 \;\; \cdots \;\; A_n)$$

[Figure: vectors A = (A_1, A_2)^T and B = (B_1, B_2)^T drawn in the (x_1, x_2) plane.]
Vectors: sum
The components of the sum vector are the sums of the components:

$$C = A + B \quad\Longleftrightarrow\quad \begin{pmatrix} C_1 \\ C_2 \end{pmatrix} = \begin{pmatrix} A_1 + B_1 \\ A_2 + B_2 \end{pmatrix}$$

[Figure: C is the diagonal of the parallelogram built on A and B.]
Vectors: difference
The components of the difference vector are the differences of the components:

$$C = B - A \quad\Longleftrightarrow\quad \begin{pmatrix} C_1 \\ C_2 \end{pmatrix} = \begin{pmatrix} B_1 - A_1 \\ B_2 - A_2 \end{pmatrix}$$

[Figure: C = B - A is obtained by adding -A to B.]
Vectors: product by a scalar
The components of the product by a scalar are the components multiplied by the scalar:

$$C = a A \quad\Longleftrightarrow\quad \begin{pmatrix} C_1 \\ C_2 \end{pmatrix} = \begin{pmatrix} a A_1 \\ a A_2 \end{pmatrix}$$

[Figure: C = 3A points in the same direction as A, with three times its length.]
Vectors: Norm
The simplest definition of a norm is the Euclidean length computed from the components:

$$\|A\| = \sqrt{\sum_i A_i^2} \qquad \text{(in 2-D: } \|A\| = \sqrt{A_1^2 + A_2^2}\text{)}$$

Properties of a norm:

1. $\|x + y\| \le \|x\| + \|y\|$
2. $\|\alpha x\| = |\alpha| \, \|x\|$
3. $\|x\| > 0$ if $x \ne 0$
Vectors: distance between two points
The distance between two points is the norm of the difference vector:

$$d(A, B) = \|A - B\| = \|B - A\| = \sqrt{(A_1 - B_1)^2 + (A_2 - B_2)^2}$$

[Figure: d(A, B) is the length of the segment joining A and B.]
Vectors: Scalar product
The scalar product of two vectors is the sum of the products of their components:

$$c = \langle A, B \rangle = A^T B = \sum_i A_i B_i$$

Properties:

1. $\langle x, y \rangle = \langle y, x \rangle$
2. $\langle x + y, z \rangle = \langle x, z \rangle + \langle y, z \rangle$ and $\langle x, y + z \rangle = \langle x, y \rangle + \langle x, z \rangle$
3. $\langle \alpha x, y \rangle = \alpha \langle x, y \rangle$ and $\langle x, \alpha y \rangle = \alpha \langle x, y \rangle$
4. $\langle x, x \rangle \ge 0$

Geometrically, $c = \|A\| \, \|B\| \cos\theta$, where θ is the angle between A and B.
Vectors: Scalar product

The sign of the scalar product reflects the angle between the two vectors:

• θ < 90°: ⟨v, u⟩ > 0
• θ > 90°: ⟨v, u⟩ < 0
• θ = 90°: ⟨v, u⟩ = 0
Vectors: Norm and scalar product
The norm can be expressed through the scalar product:

$$\|A\| = \sqrt{\sum_i A_i^2} = \sqrt{A^T A} = \sqrt{\langle A, A \rangle}$$
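The operations above map directly onto a few lines of NumPy. The following is a minimal sketch; the vectors A and B are illustrative values, not data from the slides.

```python
import numpy as np

A = np.array([3.0, 1.0])   # illustrative vectors
B = np.array([1.0, 2.0])

C_sum  = A + B                        # component-wise sum
C_diff = B - A                        # component-wise difference
C_scal = 3 * A                        # product by a scalar

norm_A = np.sqrt(np.sum(A**2))        # Euclidean norm, same as np.linalg.norm(A)
dot_AB = A @ B                        # scalar product <A, B>
dist   = np.linalg.norm(A - B)        # distance between the two points

# <A, B> = ||A|| ||B|| cos(theta)
cos_theta = dot_AB / (np.linalg.norm(A) * np.linalg.norm(B))
print(C_sum, C_diff, C_scal, norm_A, dot_AB, dist, cos_theta)
```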
Vectors: Definition of a hyperplane

In R², a hyperplane is a line.
A line passing through the origin can be defined as the set of vectors that are perpendicular to a given vector W:

$$\langle X, W \rangle = W^T X = 0 \quad\Longleftrightarrow\quad W_1 X_1 + W_2 X_2 = 0$$
Vectors: Definition of a hyperplane

In R³, a hyperplane is a plane.
A plane passing through the origin can be defined as the set of vectors that are perpendicular to a given vector W:

$$\langle X, W \rangle = W^T X = 0 \quad\Longleftrightarrow\quad W_1 X_1 + W_2 X_2 + W_3 X_3 = 0$$
Vectors: Definition of a hyperplane

In R², a hyperplane is a line.
A line perpendicular to W, at distance |b|/||W|| from the origin, is defined by the points whose scalar product with W is equal to -b:

$$\frac{\langle X, W \rangle}{\|W\|} = \frac{W^T X}{\|W\|} = \frac{-b}{\|W\|} \quad\Longleftrightarrow\quad W_1 X_1 + W_2 X_2 + b = 0$$

[Figure: the case -b > 0, with the line on the positive side of W at distance -b/||W|| from the origin.]
Vectors: Definition of a hyperplane

The same definition holds when -b < 0: the line W_1 X_1 + W_2 X_2 + b = 0 lies on the negative side of W, at distance b/||W|| from the origin.

$$\frac{\langle X, W \rangle}{\|W\|} = \frac{W^T X}{\|W\|} = \frac{-b}{\|W\|}$$
Vectors: Definition of a hyperplane

In R^n, a hyperplane is defined by

$$\langle X, W \rangle + b = W^T X + b = 0$$
A hyperplane divides the space

The hyperplane W^T X + b = 0 divides the space into two half-spaces:

$$\langle A, W \rangle = W^T A > -b \qquad \langle B, W \rangle = W^T B < -b$$

[Figure: the projections ⟨A, W⟩/||W|| and ⟨B, W⟩/||W|| fall on opposite sides of the hyperplane offset -b/||W||.]
Distance between a hyperplane and a point

The distance between a point and the hyperplane r: W^T X + b = 0 is obtained by projecting the point onto the direction of W:

$$d(A, r) = \frac{|\langle A, W \rangle + b|}{\|W\|} \qquad d(B, r) = \frac{|\langle B, W \rangle + b|}{\|W\|}$$
Distance between two parallel hyperplanes

Two parallel hyperplanes share the same normal vector W:

$$r: \; W^T X + b = 0 \qquad r': \; W^T X + b' = 0$$

Their distance is the difference between their offsets from the origin:

$$d(r, r') = \frac{|b - b'|}{\|W\|}$$
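These distance formulas are easy to check numerically. A minimal sketch, with an assumed normal vector W and assumed offsets b, b':

```python
import numpy as np

W = np.array([2.0, 1.0])      # illustrative normal vector
b, b_prime = -4.0, 2.0        # offsets of two parallel hyperplanes

def signed_distance(X, W, b):
    """Signed distance of point X from the hyperplane W.X + b = 0."""
    return (W @ X + b) / np.linalg.norm(W)

A = np.array([3.0, 2.0])
print(signed_distance(A, W, b))                # > 0: A is on the positive side of W
print(abs(b - b_prime) / np.linalg.norm(W))    # distance between the two hyperplanes
```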
Lagrange Multipliers
Aim
We want to maximise the function z = f(x, y) subject to the constraint g(x, y) = c (a curve in the x, y plane).
Simple solution
Solve the constraint g(x, y) = c and express, for example, y = h(x).

Then substitute into f and find the maximum in x of

f(x, h(x))

The analytical solution of the constraint can be very difficult.
Geometrical interpretation

The level contours of f(x, y) are defined by f(x, y) = d_n.
Lagrange Multipliers

Suppose we walk along the contour line with g = c.

In general the contour lines of f and g are distinct: traversing the contour line for g = c, we cross the contour lines of f.

While moving along the contour line for g = c, the value of f can vary.

Only where the contour line for g = c touches a contour line of f tangentially do we neither increase nor decrease the value of f; that is, where the contour lines touch but do not cross.
Normal to a curve

Given a curve g(x, y) = c, its gradient is

$$\nabla g = \left( \frac{\partial g}{\partial x}, \frac{\partial g}{\partial y} \right)$$

Consider two points of the curve, (x, y) and (x + ε_x, y + ε_y), for small ε. A first-order expansion gives

$$g(x + \varepsilon_x, y + \varepsilon_y) \approx g(x, y) + \varepsilon_x \left.\frac{\partial g}{\partial x}\right|_{(x, y)} + \varepsilon_y \left.\frac{\partial g}{\partial y}\right|_{(x, y)} = g(x, y) + \varepsilon^T \nabla g(x, y)$$
Given a curve g(x, y) = c

Since both points satisfy the curve equation:

$$c = c + \varepsilon^T \nabla g(x, y) \quad\Longrightarrow\quad \varepsilon^T \nabla g(x, y) = 0$$

For small ε, ε is parallel to the curve; consequently, the gradient is perpendicular to the curve.
Lagrange Multipliers

The point on g(x,y)=c that
of f is perpendicular to the curve
g, otherwise we should increase or decrease f by moving
locally on the curve
So, the two gradients are parallel

for some scalar λ (where  is the gradient).

27
Lagrange Multipliers

Thus we want points (x,y) where g(x,y) = c and
,

To incorporate these conditions into one equation, we introduce
an auxiliary function (Lagrangian)
F ( x, y, )  f ( x, y)  g( x, y)  c

and solve
.

28
Recap of Constrained Optimization
• Suppose we want to: minimize/maximize f(x) subject to g(x) = 0
• A necessary condition for x_0 to be a solution:

$$\nabla f(x_0) - a \nabla g(x_0) = 0$$

• a: the Lagrange multiplier
• For multiple constraints g_i(x) = 0, i = 1, ..., m, we need a Lagrange multiplier a_i for each of the constraints:

$$\nabla f(x_0) - \sum_{i=1}^{m} a_i \nabla g_i(x_0) = 0$$
Constrained Optimization: inequality
• We want to maximize f(x, y) with the inequality constraint g(x, y) ≤ c
• The search must be confined to the allowed region g(x, y) ≤ c

(The gradient of a function points in the direction along which the function increases.)
Constrained Optimization: inequality
• Maximize f(x, y) with the inequality constraint g(x, y) ≤ c
• If the gradients are opposite (λ < 0), the function increases in the allowed portion: the maximum cannot be on the curve g(x, y) = c
• The maximum is on the curve only if λ > 0

$$F(x, y, \lambda) = f(x, y) - \lambda \left( g(x, y) - c \right) \qquad \lambda > 0$$

[Figure: the region g(x, y) ≤ c, with f increasing across its border.]
Constrained Optimization: inequality
• Minimize f(x, y) with the inequality constraint g(x, y) ≤ c
• If the gradients are opposite (λ < 0), the function increases in the allowed portion: the minimum is on the curve
• The minimum is on the curve only if λ < 0

$$F(x, y, \lambda) = f(x, y) - \lambda \left( g(x, y) - c \right) \qquad \lambda < 0$$
Constrained Optimization: inequality
• Maximize f(x, y) with the inequality constraint g(x, y) ≥ c
• If the gradients are opposite (λ < 0), the function decreases in the allowed portion: the maximum is on the curve
• The maximum is on the curve only if λ < 0

$$F(x, y, \lambda) = f(x, y) - \lambda \left( g(x, y) - c \right) \qquad \lambda < 0$$
Constrained Optimization: inequality
• Minimize f(x, y) with the inequality constraint g(x, y) ≥ c
• If the gradients are opposite (λ < 0), the function decreases in the allowed portion: the minimum cannot be on the curve
• The minimum is on the curve only if λ > 0

$$F(x, y, \lambda) = f(x, y) - \lambda \left( g(x, y) - c \right) \qquad \lambda > 0$$
Karush-Kuhn-Tucker conditions
The function f(x) subject to constraints g_i(x) ≤ 0 or g_i(x) ≥ 0 is maximized/minimized by optimizing the Lagrange function

$$F(x, a_i) = f(x) - \sum_i a_i \, g_i(x)$$

with the a_i satisfying the following conditions:

         g_i(x) ≤ 0     g_i(x) ≥ 0
MIN      a_i ≥ 0        a_i ≤ 0
MAX      a_i ≤ 0        a_i ≥ 0

and

$$a_i \, g_i(x_0) = 0 \quad \forall i$$
Constrained Optimization: inequality
The Karush-Kuhn-Tucker complementarity condition

$$a_i \, g_i(x_0) = 0 \quad \forall i$$

means that

$$a_i \ne 0 \;\Longrightarrow\; g_i(x_0) = 0$$

The constraint is active only on the border, and cancels out in the internal regions.
Concave-Convex functions

[Figure: a concave function, with a single maximum, and a convex function, with a single minimum.]
Dual problem
• If f(x) is a convex function, the constrained problem is solved by:

$$\frac{\partial L(x, a_i)}{\partial x} = 0 \qquad a_i \, g_i(x) = 0 \quad \forall i$$

• From the first equation we can find x as a function of the a_i. These can be substituted into the Lagrangian function, obtaining the dual Lagrangian function

$$L(a_i) = \inf_x L(x, a_i) = \inf_x \left( f(x) - \sum_i a_i \, g_i(x) \right)$$
Dual problem

$$L(a_i) = \inf_x L(x, a_i) = \inf_x \left( f(x) - \sum_i a_i \, g_i(x) \right)$$

• The dual Lagrangian is concave: maximising it with respect to the a_i, with a_i > 0, solves the original constrained problem. We compute the a_i as:

$$\max_{a_i} L(a_i) = \max_{a_i} \inf_x L(x, a_i) = \max_{a_i} \inf_x \left( f(x) - \sum_i a_i \, g_i(x) \right)$$

• Then we can obtain x by substituting the optimal a_i into the expression of x as a function of the a_i
Dual problem: trivial example

• Minimize the function f(x) = x² with the constraint x ≤ -1 (trivial: x = -1)

The Lagrangian is

$$L(x, a) = x^2 + a \,(x + 1)$$

Minimising with respect to x:

$$\frac{\partial L}{\partial x} = 0 \;\Longrightarrow\; 2x + a = 0 \;\Longrightarrow\; x = -\frac{a}{2}$$

The dual Lagrangian is

$$L(a) = \frac{a^2}{4} - \frac{a^2}{2} + a = a - \frac{a^2}{4}$$

Maximising it gives a = 2. Then, substituting,

$$x = -\frac{a}{2} = -1$$
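The same result can be checked numerically. A small sketch, assuming SciPy's generic scalar optimizer in place of the analytical maximization:

```python
from scipy.optimize import minimize_scalar

# Dual Lagrangian of (min x^2 s.t. x <= -1):  L(a) = a - a^2/4
neg_dual = lambda a: -(a - a**2 / 4)          # negate because scipy minimizes

res = minimize_scalar(neg_dual, bounds=(0, 10), method='bounded')
a_star = res.x                                 # expected: a = 2
x_star = -a_star / 2                           # primal solution from x = -a/2
print(a_star, x_star)                          # ~2.0, ~-1.0
```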
An Introduction to Support Vector Machines
What is a good Decision Boundary?

• Consider a two-class, linearly separable classification problem
• Many decision boundaries separate the data!
• The Perceptron algorithm can be used to find such a boundary
• Are all decision boundaries equally good?

[Figure: points of Class 1 and Class 2 with several candidate boundaries between them.]

[Figure: two alternative decision boundaries for the same two-class data, each passing close to the points of one class.]
Large-margin Decision Boundary
• The decision boundary should be as far away from the data of both classes as possible
• We should maximize the margin, m

[Figure: the margin m between the decision boundary and the closest points of Class 1 and Class 2.]
Hyperplane Classifiers

$$w \cdot x_i + b \ge +1 \quad \text{for } y_i = +1$$
$$w \cdot x_i + b \le -1 \quad \text{for } y_i = -1$$
Finding the Decision Boundary
• Let {x_1, ..., x_n} be our data set and let y_i ∈ {1, -1} be the class label of x_i

For y_i = 1: $\; w^T x_i + b \ge 1$
For y_i = -1: $\; w^T x_i + b \le -1$

So:

$$y_i \,(w^T x_i + b) \ge 1 \quad \forall (x_i, y_i)$$

[Figure: Class 1 and Class 2 separated by the margin m between the hyperplanes w^T x + b = ±1.]
Finding the Decision Boundary
• The decision boundary should classify all points correctly:

$$y_i \,(w^T x_i + b) \ge 1 \quad \forall i$$

• The decision boundary can be found by solving the following constrained optimization problem:

$$\text{minimize } \; \frac{1}{2} \|w\|^2 \quad \text{subject to} \quad y_i \,(w^T x_i + b) \ge 1 \quad \forall i$$

• This is a constrained optimization problem. Solving it requires Lagrange multipliers
Finding the Decision Boundary

• The Lagrangian is

$$L = \frac{1}{2} w^T w + \sum_{i=1}^{n} a_i \left( 1 - y_i \,(w^T x_i + b) \right) \qquad a_i \ge 0$$

• Note that $\|w\|^2 = w^T w$
Gradient with respect to w and b
• Setting the gradient of L w.r.t. w and b to zero, we have

$$L = \frac{1}{2} w^T w + \sum_{i=1}^{n} a_i \left( 1 - y_i \,(w^T x_i + b) \right) = \frac{1}{2} \sum_{k=1}^{m} \left( w^k \right)^2 + \sum_{i=1}^{n} a_i \left( 1 - y_i \left( \sum_{k=1}^{m} w^k x_i^k + b \right) \right)$$

n: number of examples, m: dimension of the space

$$\frac{\partial L}{\partial w^k} = 0 \;\; \forall k \qquad \frac{\partial L}{\partial b} = 0$$
The Dual Problem
• The two gradient conditions give $w = \sum_{i=1}^{n} a_i y_i x_i$ and $\sum_{i=1}^{n} a_i y_i = 0$
• If we substitute the expression for w into L, the term in b vanishes (since Σ a_i y_i = 0) and we have

$$L = \sum_{i=1}^{n} a_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j \, y_i y_j \, x_i^T x_j$$

• This is a function of the a_i only
The Dual Problem
• The new objective function is in terms of the a_i only
• It is known as the dual problem: if we know w, we know all the a_i; if we know all the a_i, we know w
• The original problem is known as the primal problem
• The objective function of the dual problem needs to be maximized (this comes out of the KKT theory)
• The dual problem is therefore:

$$\max_a \; \sum_{i=1}^{n} a_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j \, y_i y_j \, x_i^T x_j \quad \text{subject to} \quad a_i \ge 0, \;\; \sum_{i=1}^{n} a_i y_i = 0$$

The condition a_i ≥ 0 comes from the properties of the Lagrange multipliers; Σ a_i y_i = 0 is the result of differentiating the original Lagrangian w.r.t. b.
The Dual Problem

• This is a quadratic programming (QP) problem
• A global maximum over the a_i can always be found
• w can be recovered by

$$w = \sum_{i=1}^{n} a_i y_i x_i$$

A minimal sketch of solving this dual with a general-purpose optimizer follows.
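This is a hedged illustration, not how production SVMs are trained (those use dedicated QP or SMO solvers); the toy data and the use of scipy.optimize.minimize are assumptions:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (illustrative values)
X = np.array([[1.0, 1.0], [2.0, 2.5], [0.0, 0.5], [3.0, 3.0]])
y = np.array([-1.0, 1.0, -1.0, 1.0])

G = (y[:, None] * X) @ (y[:, None] * X).T      # G_ij = y_i y_j x_i . x_j

def neg_dual(a):                               # negate: scipy minimizes
    return -(a.sum() - 0.5 * a @ G @ a)

cons = {'type': 'eq', 'fun': lambda a: a @ y}  # sum_i a_i y_i = 0
bnds = [(0.0, None)] * len(y)                  # a_i >= 0
res = minimize(neg_dual, np.zeros(len(y)), bounds=bnds, constraints=cons)

a = res.x
w = (a * y) @ X                                # w = sum_i a_i y_i x_i
sv = a > 1e-6                                  # support vectors: a_i > 0
b = np.mean(y[sv] - X[sv] @ w)                 # from y_i (w.x_i + b) = 1 on the SVs
print(w, b)
```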
Characteristics of the Solution
• Many of the a_i are zero
  • w is a linear combination of a small number of data points
  • This "sparse" representation can be viewed as data compression, as in the construction of the kNN classifier
• The x_i with non-zero a_i are called support vectors (SV)
  • The decision boundary is determined only by the SVs
  • Let t_j (j = 1, ..., s) be the indices of the s support vectors. We can write

$$w = \sum_{j=1}^{s} a_{t_j} y_{t_j} x_{t_j}$$

• Note: w need not be formed explicitly
A Geometrical Interpretation

[Figure: a trained SVM on two classes. Most points have a_i = 0 (a_2 = a_3 = a_4 = a_5 = a_7 = a_9 = a_10 = 0); the support vectors, which lie on the margins, have a_1 = 0.8, a_6 = 1.4, a_8 = 0.6.]
Characteristics of the Solution
• For testing with a new data point z, compute

$$f(z) = w^T z + b = \sum_{j=1}^{s} a_{t_j} y_{t_j} \,(x_{t_j}^T z) + b$$

and classify z as class 1 if the sum is positive, and class 2 otherwise

• Note: w need not be formed explicitly
• Many approaches to solving the QP have been proposed
  • LOQO, CPLEX, etc. (see http://www.numerical.rl.ac.uk/qp/qp.html)
• Most are "interior-point" methods
  • Start with an initial solution that may violate the constraints
  • Improve this solution by optimizing the objective function and/or reducing the amount of constraint violation
• For SVM, sequential minimal optimization (SMO) seems to be the most popular
  • A QP with two variables is trivial to solve
  • Each iteration of SMO picks a pair (a_i, a_j) and solves the QP with these two variables; repeat until convergence
• In practice, we can just regard the QP solver as a "black box" without bothering with how it works
Non-linearly Separable Problems
• We allow "errors" ξ_i in classification; they are based on the output of the discriminant function w^T x + b
• Σ ξ_i approximates the number of misclassified samples

[Figure: two overlapping classes; points on the wrong side of their margin have slack ξ_i > 0.]
Soft Margin Hyperplane
• The new conditions become

$$y_i \,(w^T x_i + b) \ge 1 - \xi_i \qquad \xi_i \ge 0 \quad \forall i$$

  • The ξ_i are "slack variables" in the optimization
  • Note that ξ_i = 0 if there is no error for x_i
  • Σ ξ_i is an upper bound on the number of errors
• We want to minimize

$$\frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i$$

• C: tradeoff parameter between error and margin
The Optimization Problem

$$L = \frac{1}{2} w^T w + C \sum_{i=1}^{n} \xi_i + \sum_{i=1}^{n} a_i \left( 1 - \xi_i - y_i \,(w^T x_i + b) \right) - \sum_{i=1}^{n} \mu_i \xi_i$$

with a_i and μ_i Lagrange multipliers, POSITIVE.

$$\frac{\partial L}{\partial w_j} = w_j - \sum_{i=1}^{n} a_i y_i x_i^j = 0 \quad\Longrightarrow\quad w = \sum_{i=1}^{n} a_i y_i x_i$$

$$\frac{\partial L}{\partial \xi_j} = C - a_j - \mu_j = 0$$

$$\frac{\partial L}{\partial b} = \sum_{i=1}^{n} y_i a_i = 0$$
The Dual Problem

$$L = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j \, y_i y_j \, x_i^T x_j + C \sum_{i=1}^{n} \xi_i + \sum_{i=1}^{n} a_i \left( 1 - \xi_i - y_i \left( \sum_{j=1}^{n} a_j y_j \, x_j^T x_i + b \right) \right) - \sum_{i=1}^{n} \mu_i \xi_i$$

With

$$\sum_{i=1}^{n} y_i a_i = 0 \qquad C = a_j + \mu_j$$

the terms in b and ξ cancel, leaving

$$L = \sum_{i=1}^{n} a_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j \, y_i y_j \, x_i^T x_j$$
The Optimization Problem
• The dual of this new constrained optimization problem is

$$\max_a \; \sum_{i=1}^{n} a_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j \, y_i y_j \, x_i^T x_j \quad \text{subject to} \quad 0 \le a_i \le C, \;\; \sum_{i=1}^{n} a_i y_i = 0$$

• The new constraints derive from C = a_j + μ_j, since both μ_j and a_j are positive
• w is recovered as

$$w = \sum_{i=1}^{n} a_i y_i x_i$$

• This is very similar to the optimization problem in the linearly separable case, except that there is an upper bound C on the a_i now
• Once again, a QP solver can be used to find the a_i
$$\frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i$$

• The algorithm tries to keep ξ null while maximising the margin
• The algorithm does not minimise the number of errors; instead, it minimises the sum of the distances of the misclassified points from the hyperplane
• When C increases, the number of errors tends to decrease. In the limit of C tending to infinity, the solution tends to the one given by the hard-margin formulation, with 0 errors
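A brief sketch of this tradeoff using scikit-learn's SVC (a library choice assumed here, not prescribed by the slides): small C tolerates slack, large C approaches the hard margin.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian clouds (illustrative data)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1, 100):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    # larger C -> fewer margin violations, typically fewer support vectors
    print(C, clf.n_support_, clf.score(X, y))
```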
Soft margin is more robust

[Figure: comparison showing that the soft-margin boundary is less affected by outliers than the hard-margin one.]
Extension to Non-linear Decision Boundary
• So far, we have only considered large-margin classifiers with a linear decision boundary
• How can we generalize it to become nonlinear?
• Key idea: transform x_i to a higher-dimensional space to "make life easier"
  • Input space: the space where the points x_i are located
  • Feature space: the space of the f(x_i) after transformation
• Why transform?
  • A linear operation in the feature space is equivalent to a non-linear operation in the input space
  • Classification can become easier with a proper transformation. In the XOR problem, for example, adding the new feature x_1 x_2 makes the problem linearly separable (see the tables and the sketch below)
XOR
XOR is not linearly separable:

X   Y   | XOR
0   0   | 0
0   1   | 1
1   0   | 1
1   1   | 0

With the extra feature XY it is linearly separable:

X   Y   XY  | XOR
0   0   0   | 0
0   1   0   | 1
1   0   0   | 1
1   1   1   | 0
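A small sketch (assumed, not part of the slides) confirming that a linear SVM separates XOR once the product feature is added; a very large C is used so the soft margin behaves like a hard margin:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])                            # XOR labels

X3 = np.hstack([X, (X[:, 0] * X[:, 1])[:, None]])     # append the x1*x2 feature

print(SVC(kernel='linear', C=1e6).fit(X, y).score(X, y))    # < 1.0: not separable in 2-D
print(SVC(kernel='linear', C=1e6).fit(X3, y).score(X3, y))  # 1.0: separable in 3-D
```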
Find a feature space

[Figure: a problem that is not linearly separable in the input space becomes linearly separable in a suitable feature space. From S. Mika, Kernel Fisher Discriminant.]
Transforming the Data
[Figure: a transformation f(.) maps the points from the input space to the feature space. Note: the feature space is of higher dimension than the input space in practice.]

• Computation in the feature space can be costly because it is high-dimensional
  • The feature space can even be infinite-dimensional!
• The kernel trick comes to the rescue
The Kernel Trick
• Recall the SVM optimization problem: the data points only appear as inner products x_i^T x_j
• As long as we can calculate the inner product in the feature space, we do not need the mapping explicitly
• Many common geometric operations (angles, distances) can be expressed by inner products
• Define the kernel function K by

$$K(x_i, x_j) = f(x_i)^T f(x_j)$$
An Example for f(.) and K(.,.)
• Suppose f(.) is given as follows:

$$f\left( (x_1, x_2) \right) = \left( 1, \; \sqrt{2}\, x_1, \; \sqrt{2}\, x_2, \; x_1^2, \; x_2^2, \; \sqrt{2}\, x_1 x_2 \right)$$

• An inner product in the feature space is

$$f(x)^T f(y) = (1 + x^T y)^2$$

• So, if we define the kernel function as K(x, y) = (1 + x^T y)², there is no need to carry out f(.) explicitly
• This use of a kernel function to avoid carrying out f(.) explicitly is known as the kernel trick
Kernels
• Given a mapping

$$x \to \varphi(x)$$

a kernel is represented as the inner product

$$K(x, y) = \sum_i \varphi_i(x) \, \varphi_i(y)$$

• A kernel must satisfy Mercer's condition:

$$\forall g(x) \text{ such that } \int g^2(x)\, dx < \infty: \quad \iint K(x, y)\, g(x)\, g(y)\, dx\, dy \ge 0$$
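Mercer's condition implies that every kernel matrix is positive semi-definite. A quick numerical spot-check (an illustration on random points, not a proof):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))            # illustrative points

K = (X @ X.T + 1) ** 2                  # polynomial kernel matrix
eigvals = np.linalg.eigvalsh(K)         # K symmetric -> real eigenvalues
print(eigvals.min() >= -1e-9)           # True: PSD up to round-off
```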
Modification Due to Kernel Function
• Change all inner products to kernel functions
• For training:

Original:

$$\max_a \; \sum_{i=1}^{n} a_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j \, y_i y_j \, x_i^T x_j$$

With kernel function:

$$\max_a \; \sum_{i=1}^{n} a_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j \, y_i y_j \, K(x_i, x_j)$$

(in both cases subject to 0 ≤ a_i ≤ C and Σ a_i y_i = 0)
Modification Due to Kernel Function
• For testing, the new data z is classified as class 1 if f ≥ 0, and as class 2 if f < 0

Original:

$$f = w^T z + b = \sum_{j=1}^{s} a_{t_j} y_{t_j} \,(x_{t_j}^T z) + b$$

With kernel function:

$$f = \sum_{j=1}^{s} a_{t_j} y_{t_j} \, K(x_{t_j}, z) + b$$
More on Kernel Functions
• Since the training of an SVM only requires the value of K(x_i, x_j), there is no restriction on the form of x_i and x_j
  • x_i can be a sequence or a tree, instead of a feature vector
• K(x_i, x_j) is just a similarity measure comparing x_i and x_j
• For a test object z, the discriminant function essentially is a weighted sum of the similarities between z and a pre-selected set of objects (the support vectors)
Example
• Suppose we have five 1-D data points: x_1 = 1, x_2 = 2, x_3 = 4, x_4 = 5, x_5 = 6, with 1, 2, 6 as class 1 and 4, 5 as class 2, so y_1 = 1, y_2 = 1, y_3 = -1, y_4 = -1, y_5 = 1
Example

[Figure: the five points on the line: 1 and 2 (class 1), then 4 and 5 (class 2), then 6 (class 1).]
Example
• We use the polynomial kernel of degree 2: K(x, y) = (xy + 1)²
• C is set to 100
• We first find the a_i (i = 1, ..., 5) by solving

$$\max_a \; \sum_{i=1}^{5} a_i - \frac{1}{2} \sum_{i=1}^{5} \sum_{j=1}^{5} a_i a_j \, y_i y_j \,(x_i x_j + 1)^2 \quad \text{subject to} \quad 0 \le a_i \le 100, \;\; \sum_{i=1}^{5} a_i y_i = 0$$
Example
• By using a QP solver, we get
  • a_1 = 0, a_2 = 2.5, a_3 = 0, a_4 = 7.333, a_5 = 4.833
  • Note that the constraints are indeed satisfied
  • The support vectors are {x_2 = 2, x_4 = 5, x_5 = 6}
• The discriminant function is

$$f(z) = 2.5 \cdot 1 \cdot (2z + 1)^2 + 7.333 \cdot (-1) \cdot (5z + 1)^2 + 4.833 \cdot 1 \cdot (6z + 1)^2 + b = 0.6667\, z^2 - 5.333\, z + b$$

• b is recovered by solving f(2) = 1, or by f(5) = -1, or by f(6) = 1 (the support vectors lie on the margins). All three give b = 9, so

$$f(z) = 0.6667\, z^2 - 5.333\, z + 9$$
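A short sketch (assumed) that plugs the a_i from the slide into the kernel expansion and checks the values at the support vectors:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0, 6.0])
y = np.array([1, 1, -1, -1, 1])
a = np.array([0.0, 2.5, 0.0, 7.333, 4.833])   # multipliers from the QP solver
b = 9.0

K = lambda u, v: (u * v + 1) ** 2             # polynomial kernel of degree 2

def f(z):
    return np.sum(a * y * K(x, z)) + b        # kernel expansion of the discriminant

for z in (2, 5, 6):                           # support vectors: f should be +/-1
    print(z, round(f(z), 2))                  # close to 1, -1, 1 (small deviations
                                              # come from the rounded a_i values)
```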
Example

[Figure: the value of the discriminant function along the line: positive around 1, 2 and 6 (class 1), negative around 4 and 5 (class 2).]
Kernel Functions
• In practical use of SVM, the user specifies the kernel function; the transformation f(.) is not explicitly stated
• Given a kernel function K(x_i, x_j), the transformation f(.) is given by its eigenfunctions (a concept in functional analysis)
  • Eigenfunctions can be difficult to construct explicitly
  • This is why people only specify the kernel function, without worrying about the exact transformation
• Another view: the kernel function, being an inner product, is really a similarity measure between the objects
A kernel is associated with a transformation
• Given a kernel, in principle the transformation in the feature space that originates it can be recovered
• For example:

$$K(x, y) = (xy + 1)^2 = x^2 y^2 + 2xy + 1$$

It corresponds to the transformation

$$x \;\to\; \left( x^2, \; \sqrt{2}\, x, \; 1 \right)$$
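This equivalence is easy to verify numerically; a tiny sketch with illustrative scalars:

```python
import numpy as np

phi = lambda x: np.array([x**2, np.sqrt(2) * x, 1.0])  # explicit transformation

x, y = 3.0, -2.0                   # illustrative scalars
print(phi(x) @ phi(y))             # inner product in the feature space: ~25
print((x * y + 1) ** 2)            # kernel value: 25.0
```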
Examples of Kernel Functions

• Polynomial kernel of degree d

$$K(x, y) = \left( x^T y \right)^d$$

• Polynomial kernel up to degree d

$$K(x, y) = \left( x^T y + 1 \right)^d$$

• Radial basis function kernel with width σ

$$K(x, y) = \exp\!\left( -\frac{\|x - y\|^2}{2\sigma^2} \right)$$

  • The feature space is infinite-dimensional
• Sigmoid with parameters κ and θ

$$K(x, y) = \tanh\!\left( \kappa\, x^T y + \theta \right)$$

  • It does not satisfy the Mercer condition for all κ and θ
Building new kernels

• If k_1(x, y) and k_2(x, y) are two valid kernels, then the following kernels are valid:
  • Linear combination (c_1, c_2 ≥ 0)

$$k(x, y) = c_1 k_1(x, y) + c_2 k_2(x, y)$$

  • Exponential

$$k(x, y) = \exp\!\left( k_1(x, y) \right)$$

  • Product

$$k(x, y) = k_1(x, y) \, k_2(x, y)$$

  • Polynomial transformation (Q: polynomial with non-negative coefficients)

$$k(x, y) = Q\!\left( k_1(x, y) \right)$$

  • Function product (f: any function)

$$k(x, y) = f(x) \, k_1(x, y) \, f(y)$$
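A numerical spot-check (illustrative, not a proof) that these combinations keep the kernel matrix positive semi-definite:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(15, 2))

k1 = X @ X.T                        # linear kernel matrix
k2 = (X @ X.T + 1) ** 2             # polynomial kernel matrix
f = np.cos(X[:, 0])                 # an arbitrary function of x, as a vector

candidates = {
    'linear combination': 2 * k1 + 3 * k2,
    'exponential': np.exp(k1),
    'product': k1 * k2,             # element-wise product of kernel matrices
    'function product': f[:, None] * k1 * f[None, :],
}
for name, K in candidates.items():
    print(name, np.linalg.eigvalsh(K).min() >= -1e-8)   # all True
```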
Polynomial kernel

[Figure: SVM decision boundaries obtained with polynomial kernels of different degrees. From Ben-Hur et al., PLoS Computational Biology 4 (2008).]
Gaussian RBF kernel

[Figure: SVM decision boundaries obtained with Gaussian RBF kernels of different widths. From Ben-Hur et al., PLoS Computational Biology 4 (2008).]
Spectral kernel for sequences
• Given a DNA sequence x, we can count the number of bases (a 4-D feature space)

$$f_1(x) = (n_A, n_C, n_G, n_T)$$

• Or the number of dimers (a 16-D space)

$$f_2(x) = (n_{AA}, n_{AC}, n_{AG}, n_{AT}, n_{CA}, n_{CC}, n_{CG}, n_{CT}, \ldots)$$

• Or l-mers (a 4^l-D space)
• The spectral kernel is

$$k_l(x, y) = f_l(x)^T f_l(y)$$
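A compact sketch of the spectral kernel via l-mer counts (the sequences are illustrative):

```python
from collections import Counter

def spectrum_kernel(x, y, l=2):
    """Spectral kernel: inner product of the l-mer count vectors of x and y."""
    cx = Counter(x[i:i + l] for i in range(len(x) - l + 1))
    cy = Counter(y[i:i + l] for i in range(len(y) - l + 1))
    return sum(cx[m] * cy[m] for m in cx)   # only shared l-mers contribute

print(spectrum_kernel("ACGTACGT", "ACGTTT", l=2))   # 6 for these two sequences
```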
Choosing the Kernel Function
• Probably the most tricky part of using SVM
• The kernel function is important because it creates the kernel matrix, which summarizes all the data
• Many principles have been proposed (diffusion kernel, Fisher kernel, string kernel, ...)
• There is even research on estimating the kernel matrix from available information
• In practice, a low-degree polynomial kernel or an RBF kernel with a reasonable width is a good initial try
• Note that SVM with an RBF kernel is closely related to RBF neural networks, with the centers of the radial basis functions automatically chosen by the SVM
Why Does SVM Work?
• The feature space is often very high-dimensional. Why don't we have the curse of dimensionality?
• A classifier in a high-dimensional space has many parameters and is hard to estimate
• Vapnik argues that the fundamental problem is not the number of parameters to be estimated. Rather, the problem is the flexibility of the classifier
• Typically, a classifier with many parameters is very flexible, but there are also exceptions
  • Let x_i = 10^{-i}, where i ranges from 1 to n. The one-parameter classifier sign(sin(a x)) can classify all the x_i correctly for every possible combination of class labels on the x_i, for a suitable choice of a
  • This 1-parameter classifier is very flexible
Why Does SVM Work?
• Vapnik argues that the flexibility of a classifier should not be characterized by the number of parameters, but by its capacity
• This is formalized by the "VC-dimension" of a classifier
  • Consider a linear classifier in two-dimensional space
  • If we have three training data points, no matter how those points are labeled, we can classify them perfectly
VC-dimension
• However, if we have four points, we can find a labeling such that the linear classifier fails to be perfect

• We can see that 3 is the critical number
• The VC-dimension of a linear classifier in a 2-D space is 3, because with 3 points in the training set perfect classification is always possible irrespective of the labeling, whereas for 4 points perfect classification can be impossible
VC-dimension
• The VC-dimension of the nearest-neighbor classifier is infinite, because no matter how many points you have, you get perfect classification on the training data
• The higher the VC-dimension, the more flexible the classifier
• VC-dimension, however, is a theoretical concept; in practice, the VC-dimension of most classifiers is difficult to compute exactly
• Qualitatively, if we think a classifier is flexible, it probably has a high VC-dimension
Other Aspects of SVM
• How to use SVM for multi-class classification?
  • One can change the QP formulation to become multi-class
  • More often, multiple binary classifiers are combined
    • See DHS 5.2.2 for some discussion
    • One can train multiple one-versus-all classifiers, or combine multiple pairwise classifiers "intelligently"
• How to interpret the SVM discriminant function value as a probability?
  • By performing logistic regression on the SVM output of a set of data (a validation set) that is not used for training
• Some SVM software (like libsvm) has these features built-in (see the sketch below)
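A hedged sketch of both features with scikit-learn's SVC, whose multi-class handling (pairwise combination of binary SVMs) and probability outputs (a logistic fit on SVM outputs, Platt scaling) mirror libsvm; the iris data set is just an example:

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)            # 3 classes: a multi-class problem

# probability=True fits a logistic model on held-out SVM outputs (Platt scaling)
clf = SVC(kernel='rbf', probability=True).fit(X, y)
print(clf.predict(X[:3]))                    # predicted class labels
print(clf.predict_proba(X[:3]).round(2))     # calibrated class probabilities
```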
Software
• A list of SVM implementations can be found at http://www.kernel-machines.org/software.html
• Some implementations (such as LIBSVM) can handle multi-class classification
• SVMlight is among the earliest implementations of SVM
• Several Matlab toolboxes for SVM are also available
Summary: Steps for Classification
• Prepare the pattern matrix
• Select the kernel function to use
• Select the parameters of the kernel function and the value of C
  • You can use the values suggested by the SVM software, or you can set apart a validation set to determine the values of the parameters
• Execute the training algorithm and obtain the a_i
• Unseen data can be classified using the a_i and the support vectors
Strengths and Weaknesses of SVM
• Strengths
  • Training is relatively easy
    • No local optima, unlike in neural networks
  • It scales relatively well to high-dimensional data
  • The tradeoff between classifier complexity and error can be controlled explicitly
  • Non-traditional data like strings and trees can be used as input to SVM, instead of feature vectors
• Weaknesses
  • Need to choose a "good" kernel function
Other Types of Kernel Methods
• A lesson learnt in SVM: a linear algorithm in the feature space is equivalent to a non-linear algorithm in the input space
• Standard linear algorithms can be generalized to their non-linear versions by going to the feature space
  • Kernel principal component analysis, kernel independent component analysis, kernel canonical correlation analysis, kernel k-means, and 1-class SVM are some examples
Conclusion
• SVM is a useful alternative to neural networks
• Two key concepts of SVM: maximize the margin and the kernel trick
• Many SVM implementations are available on the web for you to try on your data set!
Resources
• http://www.kernel-machines.org/
• http://www.support-vector.net/
• http://www.support-vector.net/icml-tutorial.pdf
• http://www.kernel-machines.org/papers/tutorial-nips.ps.gz
• http://www.clopinet.com/isabelle/Projects/SVM/applist.html
SVM-light
• http://svmlight.joachims.org
• Author: Thorsten Joachims, Cornell University

• To install SVMlight, download svm_light.tar.gz and create a new directory: mkdir svm_light
• Move svm_light.tar.gz to this directory and unpack it with: gunzip -c svm_light.tar.gz | tar xvf -
• Now execute make (or make all)

Two programs are compiled:
svm_learn (learning module)
svm_classify (classification module)
SVM-light: Training Input
1 1:2 2:1 3:4 4:3
1 1:2 2:1 3:4 4:3
-1 1:2 2:1 3:3 4:0
1 1:2 2:2 3:3 4:3
1 1:2 2:4 3:3 4:2
-1 1:2 2:2 3:3 4:0
-1 1:2 2:0 3:3
-1 1:2 2:4 3:3
-1 1:4 2:5 3:3
1 1:2 2:2 3:3 4:2

Each line: <class> <feature>:<value> <feature>:<value> ...
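A small helper (an assumption, not part of SVM-light) that writes a label vector and a feature matrix in this sparse format:

```python
import numpy as np

def write_svmlight(X, y, path):
    """Write labels y and feature matrix X in SVM-light's sparse input format."""
    with open(path, 'w') as fh:
        for label, row in zip(y, X):
            feats = ' '.join(f'{j + 1}:{v:g}' for j, v in enumerate(row) if v != 0)
            fh.write(f'{int(label)} {feats}\n')   # feature indices are 1-based

X = np.array([[2, 1, 4, 3], [2, 1, 3, 0]])
write_svmlight(X, [1, -1], 'train.dat')           # matches the first rows above
```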
SVM-light: Training
svm_learn [options] example_file model_file

SOME OPTIONS

General options:
  -?        : this help

Learning options:
  -c float  : trade-off between training error and margin (default [avg. x*x]^-1)

Performance estimation options:
  -x [0,1]  : compute leave-one-out estimates (default 0)

Kernel options:
  -t int    : type of kernel function: 0: linear (default); 1: polynomial (s a*b + c)^d;
              2: radial basis function exp(-gamma ||a-b||^2); 3: sigmoid tanh(s a*b + c);
              4: user-defined kernel from kernel.h
  -d int    : parameter d in polynomial kernel
  -g float  : parameter gamma in RBF kernel
  -s float  : parameter s in sigmoid/poly kernel
  -r float  : parameter c in sigmoid/poly kernel
  -u string : parameter of user-defined kernel
SVM-light: Trained Model
SVM-light Version V6.02
0 # kernel type
3 # kernel parameter -d
1 # kernel parameter -g
1 # kernel parameter -s
1 # kernel parameter -r
empty# kernel parameter -u
4 # highest feature index
12 # number of training documents
13 # number of support vectors plus 1
1.0380931 # threshold b, each following line is a SV (starting with alpha*y)
0.03980964156284725469214791360173 1:2 2:4 3:3 4:0 #
-0.018316632908270628204983054843069 1:4 2:5 3:3 4:0 #
-0.03980964156284725469214791360173 1:2 2:1 3:3 4:0 #
0.03980964156284725469214791360173 1:2 2:1 3:4 4:3 #
-0.03980964156284725469214791360173 1:2 2:0 3:3 4:0 #
0.03980964156284725469214791360173 1:2 2:4 3:3 4:2 #
0.03980964156284725469214791360173 1:2 2:2 3:3 4:2 #
0.03980964156284725469214791360173 1:2 2:1 3:4 4:3 #
-0.037841055392657176048576417315417 1:3 2:1 3:3 4:0 #
-0.03980964156284725469214791360173 1:2 2:2 3:3 4:0 #
0.03980964156284725469214791360173 1:2 2:2 3:3 4:3 #
0.016345916179801231460366750525282 1:1 2:2 3:4 4:3 #

SVM-light: Predicting

svm_classify [options] example_file model_file output_file

SVM-light: Prediction
The output file contains one value of the discriminant function per test example; the sign gives the predicted class:
0.88647894
0.88647894
0.81321667
0.24665358
0.29204665
0.99999997
-0.8864726
-0.93186567
-0.84107953
-1.000032
-0.90916914
-0.99999364
