A Simple Introduction to Support Vector Machines

Vectors
Vectors: notations

A vector in an n-dimensional space is described by an n-tuple of real numbers:

    A = (A_1, A_2, ..., A_n)^T  (a column vector), with transpose A^T = (A_1  A_2  ...  A_n)

[Figure: two vectors A and B in the (x_1, x_2) plane, with components (A_1, A_2) and (B_1, B_2).]
Vectors: sum

The components of the sum vector are the sums of the components:

    C = A + B,   where  C_1 = A_1 + B_1,  C_2 = A_2 + B_2

[Figure: C is the diagonal of the parallelogram formed by A and B in the (x_1, x_2) plane.]
Vectors: difference

The components of the difference vector are the differences of the components:

    C = B - A,   where  C_1 = B_1 - A_1,  C_2 = B_2 - A_2

[Figure: C = B - A drawn as the vector from A to B in the (x_1, x_2) plane.]
Vectors: product by a scalar

The components of the product vector are the components multiplied by the scalar:

    C = a A,   where  C_1 = a A_1,  C_2 = a A_2

[Figure: the vector 3A points in the same direction as A, with three times its length.]
Vectors: Norm

The simplest norm is the Euclidean norm, computed from the components:

    ||A|| = √( Σ_i A_i^2 )

A norm satisfies:
    1. ||x + y|| ≤ ||x|| + ||y||
    2. ||a x|| = |a| ||x||
    3. ||x|| > 0 if x ≠ 0

[Figure: in the plane, ||A|| = √(A_1^2 + A_2^2) is the length of A.]
Vectors: distance between two points

The distance between two points is the norm of the difference vector:

    d(A, B) = ||A - B|| = ||B - A||

    d(A, B) = √( (B_1 - A_1)^2 + (B_2 - A_2)^2 )

[Figure: points A and B in the (x_1, x_2) plane; d(A, B) is the length of the vector joining them.]
Vectors: Scalar product

The scalar product of two vectors is the sum of the products of the corresponding components:

    c = A·B = <A, B> = A^T B = Σ_i A_i B_i

Properties:
    1. <x, y> = <y, x>
    2. <x + y, z> = <x, z> + <y, z>   and   <x, y + z> = <x, y> + <x, z>
    3. <λx, y> = λ <x, y>   and   <x, λy> = λ <x, y>
    4. <x, x> ≥ 0

[Figure: vectors A and B with angle θ between them;  c = ||A|| ||B|| cos θ.]
Vectors: Scalar product

[Figure: three pairs of vectors u, v with angle θ between them.]

    θ < 90°  ⇒  <v, u> > 0
    θ = 90°  ⇒  <v, u> = 0
    θ > 90°  ⇒  <v, u> < 0
Vectors: Norm and scalar product

The norm can be expressed through the scalar product:

    ||A|| = √( Σ_i A_i^2 ) = √( A^T A ) = √( <A, A> )
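As a quick numerical illustration of the definitions above, here is a minimal NumPy sketch (the vectors and values are my own, not from the slides):

import numpy as np

A = np.array([3.0, 4.0])
B = np.array([1.0, 2.0])

norm_A = np.sqrt(np.sum(A ** 2))                     # ||A|| = sqrt(sum_i A_i^2) = 5.0
dot_AB = np.sum(A * B)                               # <A, B> = sum_i A_i B_i = 11.0
cos_theta = dot_AB / (norm_A * np.linalg.norm(B))    # <A, B> = ||A|| ||B|| cos(theta)

print(norm_A, dot_AB, np.degrees(np.arccos(cos_theta)))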
Vectors: Definition of a hyperplane

In R^2, a hyperplane is a line.
A line passing through the origin can be defined as the set of vectors that are perpendicular to a given vector W:

    <X, W> = W^T X = 0

    W_1 X_1 + W_2 X_2 = 0

[Figure: the line through the origin perpendicular to W, drawn in the (x_1, x_2) plane.]
Vectors: Definition of a hyperplane

In R^3, a hyperplane is a plane.
A plane passing through the origin can be defined as the set of vectors that are perpendicular to a given vector W:

    <X, W> = W^T X = 0

    W_1 X_1 + W_2 X_2 + W_3 X_3 = 0

[Figure: the plane through the origin perpendicular to W, drawn in (x_1, x_2, x_3) space.]
Vectors: Definition of a hyperplane

In R^2, a hyperplane is a line.
A line perpendicular to W, at distance |b|/||W|| from the origin, is defined by the points whose scalar product with W is equal to -b:

    <X, W> / ||W|| = W^T X / ||W|| = -b / ||W||

    W_1 X_1 + W_2 X_2 + b = 0

[Figure: the line at signed distance -b/||W|| from the origin along W; here -b > 0.]
Vectors: Definition of a hyperplane

In R^2, a hyperplane is a line.
A line perpendicular to W, at distance |b|/||W|| from the origin, is defined by the points whose scalar product with W is equal to -b:

    <X, W> / ||W|| = W^T X / ||W|| = -b / ||W||

    W_1 X_1 + W_2 X_2 + b = 0

[Figure: the line at distance |b|/||W|| from the origin; here -b < 0, so the line lies on the opposite side of the origin.]
Vectors: Definition of a hyperplane

In R^n, a hyperplane is defined by

    <X, W> = -b    ⇔    W^T X + b = 0
A hyperplane divides the space

[Figure: points A and B on opposite sides of the hyperplane W^T X + b = 0; their projections onto W, <A,W>/||W|| and <B,W>/||W||, fall on opposite sides of -b/||W||.]

    <A, W> = W^T A > -b     (A lies on one side of the hyperplane)
    <B, W> = W^T B < -b     (B lies on the other side)
Distance between a hyperplane and a point

[Figure: points A and B and the hyperplane r: W^T X + b = 0.]

    d(A, r) = |<A, W> + b| / ||W||

    d(B, r) = |<B, W> + b| / ||W||
Distance between two parallel hyperplanes

    r:  W^T X + b  = 0
    r': W^T X + b' = 0

[Figure: two parallel lines perpendicular to W, at distances -b/||W|| and -b'/||W|| from the origin.]

    d(r, r') = |b - b'| / ||W||
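A minimal NumPy sketch of these distance formulas (W, b and the point are my own illustrative values):

import numpy as np

W = np.array([3.0, 4.0])
b, b_prime = -5.0, 10.0          # two parallel hyperplanes W.X + b = 0 and W.X + b' = 0
A = np.array([4.0, 1.0])

d_point = abs(np.dot(W, A) + b) / np.linalg.norm(W)    # d(A, r) = |<A, W> + b| / ||W||
d_planes = abs(b - b_prime) / np.linalg.norm(W)        # d(r, r') = |b - b'| / ||W||
side = np.sign(np.dot(W, A) + b)                       # which half-space A lies in

print(d_point, d_planes, side)   # 2.2  3.0  1.0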
Lagrange Multipliers

Aim
We want to maximise the function z = f(x,y) subject to the constraint g(x,y) = c (a curve in the x,y plane).
Simple solution

Solve the constraint g(x,y) = c and express, for example, y = h(x).

Then substitute into f and find the maximum over x of

    f(x, h(x))

Solving the constraint analytically can be very difficult.
Geometrical interpretation

[Figure: the level contours of f(x,y), defined by f(x,y) = d_n, together with the constraint curve g(x,y) = c.]
Lagrange Multipliers

Suppose we walk along the contour line with g = c.

In general the contour lines of f and g are distinct: traversing the contour line for g = c, we cross the contour lines of f.

While moving along the contour line for g = c, the value of f can vary.

Only when the contour line for g = c touches a contour line of f tangentially do we neither increase nor decrease the value of f: that is, when the contour lines touch but do not cross.
Normal to a curve

[Figure: a curve with the normal vector drawn at one of its points.]
Gradient of a curve

Given a curve g(x,y) = c, the gradient of g is:

    ∇g = ( ∂g/∂x , ∂g/∂y )

Consider two points of the curve, (x, y) and (x + ε_x, y + ε_y), for small ε:

    g(x + ε_x, y + ε_y) ≈ g(x,y) + ε_x ∂g/∂x|_(x,y) + ε_y ∂g/∂y|_(x,y)
                        = g(x,y) + ε^T ∇g(x,y)
Gradient of a curve

[Figure: the curve g(x,y) = c, the small displacement ε between two nearby points on it, and grad(g) perpendicular to the curve.]

Since both points satisfy the curve equation:

    c = c + ε^T ∇g(x,y)    ⇒    ε^T ∇g(x,y) = 0

For small ε, ε is parallel to the curve and, consequently, the gradient is perpendicular to the curve.
Lagrange Multipliers

At the point on g(x,y) = c that maximizes or minimizes f(x,y), the gradient of f is perpendicular to the curve g; otherwise we could increase or decrease f by moving locally along the curve.

So the two gradients are parallel:

    ∇f(x,y) = λ ∇g(x,y)

for some scalar λ (where ∇ is the gradient).
Lagrange Multipliers

Thus we want points (x,y) where g(x,y) = c and

    ∇f(x,y) = λ ∇g(x,y)

To incorporate these conditions into one equation, we introduce an auxiliary function (the Lagrangian)

    F(x,y,λ) = f(x,y) - λ (g(x,y) - c)

and solve

    ∇_{x,y,λ} F(x,y,λ) = 0
Recap of Constrained Optimization

Suppose we want to: minimize/maximize f(x) subject to g(x) = 0.

A necessary condition for x_0 to be a solution:

    ∇_x f(x_0) + α ∇_x g(x_0) = 0,    g(x_0) = 0

  α: the Lagrange multiplier

For multiple constraints g_i(x) = 0, i = 1, ..., m, we need a Lagrange multiplier α_i for each of the constraints:

    ∇_x f(x_0) + Σ_i α_i ∇_x g_i(x_0) = 0,    g_i(x_0) = 0  ∀i
Constrained Optimization: inequality

We want to maximize f(x,y) with the inequality constraint g(x,y) ≤ c.
The search must be confined to the region where g(x,y) ≤ c.

(The gradient of a function points in the direction along which the function increases.)

[Figure: the constraint region g(x,y) ≤ c, bounded by the curve g(x,y) = c.]
Constrained Optimization: inequality

Maximize f(x,y) with the inequality constraint g(x,y) ≤ c.

If the gradients are opposite (λ < 0), the function increases in the allowed region, so the maximum cannot be on the curve g(x,y) = c.

The maximum is on the curve only if λ > 0.

[Figure: the region g(x,y) ≤ c with an arrow showing the direction in which f increases.]

    F(x,y,λ) = f(x,y) - λ (g(x,y) - c)
Constrained Optimization: inequality

Minimize f(x,y) with the inequality constraint g(x,y) ≤ c.

If the gradients are opposite (λ < 0), the function increases in the allowed region.

The minimum is on the curve only if λ < 0.

[Figure: the region g(x,y) ≤ c with an arrow showing the direction in which f increases.]

    F(x,y,λ) = f(x,y) - λ (g(x,y) - c)
Constrained Optimization: inequality

Maximize f(x,y) with the inequality constraint g(x,y) ≥ c.

If the gradients are opposite (λ < 0), the function decreases in the allowed region.

The maximum is on the curve only if λ < 0.

[Figure: the region g(x,y) ≥ c with an arrow showing the direction in which f decreases.]

    F(x,y,λ) = f(x,y) - λ (g(x,y) - c)
Constrained Optimization: inequality

Minimize f(x,y) with the inequality constraint g(x,y) ≥ c.

If the gradients are opposite (λ < 0), the function decreases in the allowed region.

The minimum is on the curve only if λ > 0.

[Figure: the region g(x,y) ≥ c with an arrow showing the direction in which f decreases.]

    F(x,y,λ) = f(x,y) - λ (g(x,y) - c)
Karush-Kuhn-Tucker conditions

The function f(x) subject to constraints g_i(x) ≤ 0 or g_i(x) ≥ 0 is maximized/minimized by optimizing the Lagrange function

    F(x, α_i) = f(x) + Σ_i α_i g_i(x)

with the α_i satisfying the following conditions:

            g_i(x) ≤ 0     g_i(x) ≥ 0
    MIN     α_i ≥ 0        α_i ≤ 0
    MAX     α_i ≤ 0        α_i ≥ 0

and

    α_i g_i(x_0) = 0,  ∀i
Constrained Optimization: inequality

The Karush-Kuhn-Tucker complementarity condition

    α_i g_i(x_0) = 0,  ∀i

means that

    α_i ≠ 0  ⇒  g_i(x_0) = 0

The constraint is active only on the border, and drops out of the Lagrangian in the interior of the feasible region.
Concave-Convex functions

[Figure: a concave function (curving downward) and a convex function (curving upward).]
Dual problem

If f(x) is a convex function, the constrained problem is solved by setting:

    ∂L(x, α_i)/∂x = 0

From this equation we can find x as a function of the α_i.
These can be substituted into the Lagrangian function, obtaining the dual Lagrangian function:

    L(α_i) = inf_x L(x, α_i) = inf_x [ f(x) + Σ_i α_i g_i(x) ]
Dual problem

    L(α_i) = inf_x L(x, α_i) = inf_x [ f(x) + Σ_i α_i g_i(x) ]

The dual Lagrangian is concave: maximising it with respect to the α_i, with α_i ≥ 0, solves the original constrained problem. We compute the α_i as:

    max_{α_i} L(α_i) = max_{α_i} inf_x L(x, α_i) = max_{α_i} inf_x [ f(x) + Σ_i α_i g_i(x) ]

Then we can obtain x by substituting the α_i into the expression of x as a function of the α_i.
Dual problem: trivial example

Minimize the function f(x) = x^2 with the constraint x ≤ -1 (trivial solution: x = -1).

The Lagrangian is
    L(x, α) = x^2 + α (x + 1)

Minimising with respect to x:
    ∂L/∂x = 0  ⇒  2x + α = 0  ⇒  x = -α/2

The dual Lagrangian is
    L(α) = α^2/4 - α^2/2 + α = α - α^2/4

Maximising it gives α = 2.

Then, substituting,
    x = -α/2 = -1
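A hedged numerical check of this trivial example, using SciPy instead of the analytic solution above (the solver choice is mine):

from scipy.optimize import minimize

# Primal: minimize f(x) = x^2 subject to x <= -1, written as -(x + 1) >= 0.
primal = minimize(lambda x: x[0] ** 2, x0=[0.0],
                  constraints=[{"type": "ineq", "fun": lambda x: -(x[0] + 1.0)}])

# Dual: maximize L(a) = a - a^2 / 4 over a >= 0 (minimize its negative).
dual = minimize(lambda a: -(a[0] - a[0] ** 2 / 4.0), x0=[0.0], bounds=[(0.0, None)])

print(primal.x)          # ~[-1.0]
print(dual.x)            # ~[2.0]
print(-dual.x[0] / 2.0)  # x = -a/2 = -1, recovering the primal solution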
An Introduction to Support Vector Machines
What is a good Decision Boundary?

  Consider a two-class, linearly separable classification problem.
  Many decision boundaries!
    The Perceptron algorithm can be used to find such a boundary.
  Are all decision boundaries equally good?

[Figure: two linearly separable classes (Class 1 and Class 2) with several candidate separating lines.]
Examples of Bad Decision Boundaries

[Figure: two separating lines that pass very close to the training points of Class 1 and Class 2.]
Large-margin Decision Boundary

  The decision boundary should be as far away from the data of both classes as possible.
    We should maximize the margin, m.

[Figure: Class 1 and Class 2 separated by a hyperplane with margin m.]
Hyperplane Classifiers (2)

    w · x_i + b ≥ +1   for y_i = +1
    w · x_i + b ≤ -1   for y_i = -1

[Figure: the separating hyperplane together with the two margin hyperplanes w · x + b = ±1.]
Finding the Decision Boundary

Let {x_1, ..., x_n} be our data set and let y_i ∈ {1, -1} be the class label of x_i.

    For y_i = +1:   w^T x_i + b ≥ 1
    For y_i = -1:   w^T x_i + b ≤ -1

So:
    y_i (w^T x_i + b) ≥ 1,   ∀(x_i, y_i)

[Figure: the margin m between the hyperplanes w^T x + b = 1 (points labelled y = 1, Class 2) and w^T x + b = -1 (points labelled y = -1, Class 1).]
Finding the Decision Boundary

The decision boundary should classify all points correctly:

    y_i (w^T x_i + b) ≥ 1,   ∀i

The decision boundary can be found by solving the following constrained optimization problem:

    minimize    (1/2) ||w||^2
    subject to  y_i (w^T x_i + b) ≥ 1,   i = 1, ..., n

This is a constrained optimization problem. Solving it requires the use of Lagrange multipliers.
Finding the Decision Boundary

The Lagrangian is

    L = (1/2) w^T w + Σ_{i=1}^{n} α_i (1 - y_i (w^T x_i + b)),    α_i ≥ 0

Note that ||w||^2 = w^T w.
Gradient with respect to w and b

Setting the gradient of L with respect to w and b to zero, we have:

    L = (1/2) w^T w + Σ_{i=1}^{n} α_i (1 - y_i (w^T x_i + b))
      = (1/2) Σ_{k=1}^{m} w^k w^k + Σ_{i=1}^{n} α_i (1 - y_i (Σ_{k=1}^{m} w^k x_i^k + b))

    (n: number of examples, m: dimension of the space)

    ∂L/∂w^k = 0,  ∀k
    ∂L/∂b = 0
The Dual Problem

If we substitute  w = Σ_{i=1}^{n} α_i y_i x_i  into L, we have

    L = Σ_{i=1}^{n} α_i - (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j x_i^T x_j - b Σ_{i=1}^{n} α_i y_i

Since  Σ_{i=1}^{n} α_i y_i = 0,  the last term vanishes.

This is a function of the α_i only.
The Dual Problem

  The new objective function is in terms of the α_i only.
  It is known as the dual problem: if we know w, we know all α_i; if we know all α_i, we know w.
  The original problem is known as the primal problem.

  The objective function of the dual problem needs to be maximized (this comes out of the KKT theory).
  The dual problem is therefore:

    maximize    W(α) = Σ_{i=1}^{n} α_i - (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j x_i^T x_j
    subject to  α_i ≥ 0                 (the properties of the α_i when we introduce the Lagrange multipliers)
                Σ_{i=1}^{n} α_i y_i = 0 (the result of differentiating the original Lagrangian w.r.t. b)
The Dual Problem

  This is a quadratic programming (QP) problem.
    A global maximum of the α_i can always be found.

  w can be recovered by

    w = Σ_{i=1}^{n} α_i y_i x_i
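A hedged sketch of solving this dual QP numerically with a general-purpose optimizer (the toy data set and the use of scipy.optimize.minimize are my own choices; the slides only define the problem):

import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy set (not from the slides).
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

G = (y[:, None] * X) @ (y[:, None] * X).T            # G_ij = y_i y_j x_i . x_j

def neg_dual(a):                                     # maximize W(a) = sum(a) - 1/2 a^T G a
    return 0.5 * a @ G @ a - a.sum()

res = minimize(neg_dual, np.zeros(n),
               bounds=[(0.0, None)] * n,                              # a_i >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum_i a_i y_i = 0
a = res.x
w = (a * y) @ X                                      # w = sum_i a_i y_i x_i
sv = a > 1e-6                                        # support vectors have a_i > 0
b = np.mean(y[sv] - X[sv] @ w)                       # b from y_i (w . x_i + b) = 1 on the SVs
print(a.round(3), w.round(3), round(b, 3))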
Characteristics of the Solution

  Many of the α_i are zero.
    w is a linear combination of a small number of data points.
    This "sparse" representation can be viewed as data compression, as in the construction of a k-NN classifier.

  x_i with non-zero α_i are called support vectors (SV).
    The decision boundary is determined only by the SVs.
    Let t_j (j = 1, ..., s) be the indices of the s support vectors. We can write

        w = Σ_{j=1}^{s} α_{t_j} y_{t_j} x_{t_j}

    Note: w need not be formed explicitly.
A Geometrical Interpretation

[Figure: Class 1 and Class 2 separated by the margin; the points on the margin are the support vectors, with non-zero multipliers (α_8 = 0.6, α_6 = 1.4, α_1 = 0.8), while all other points have α_i = 0 (α_2 = α_3 = α_4 = α_5 = α_7 = α_9 = α_10 = 0).]
Characteristics of the Solution

  For testing with a new data point z:

    Compute
        f(z) = w^T z + b = Σ_{j=1}^{s} α_{t_j} y_{t_j} (x_{t_j}^T z) + b

    and classify z as class 1 if the sum is positive, and class 2 otherwise.

    Note: w need not be formed explicitly.
The Quadratic Programming Problem
   Many approaches have been proposed
       Loqo, cplex, etc. (see http://www.numerical.rl.ac.uk/qp/qp.html)
   Most are “interior-point” methods
     Start with an initial solution that can violate the constraints
     Improve this solution by optimizing the objective function

      and/or reducing the amount of constraint violation
   For SVM, sequential minimal optimization (SMO) seems
    to be the most popular
     A QP with two variables is trivial to solve
     Each iteration of SMO picks a pair (α_i, α_j) and solves the QP with these two variables; repeat until convergence.
   In practice, we can just regard the QP solver as a
    “black-box” without bothering how it works

                                                                    56
Non-linearly Separable Problems

We allow "errors" ξ_i in classification; they are based on the output of the discriminant function w^T x + b.

Σ_i ξ_i approximates the number of misclassified samples.

[Figure: Class 1 and Class 2 overlap; points on the wrong side of the margin have ξ_i > 0.]
Soft Margin Hyperplane

  The new conditions become

    y_i (w^T x_i + b) ≥ 1 - ξ_i,    ξ_i ≥ 0,   ∀i

    ξ_i are "slack variables" in the optimization.
    Note that ξ_i = 0 if there is no error for x_i.
    Σ_i ξ_i is an upper bound on the number of errors.

  We want to minimize

    (1/2) ||w||^2 + C Σ_{i=1}^{n} ξ_i

  C: tradeoff parameter between error and margin.
The Optimization Problem

    L = (1/2) w^T w + C Σ_{i=1}^{n} ξ_i + Σ_{i=1}^{n} α_i (1 - ξ_i - y_i (w^T x_i + b)) - Σ_{i=1}^{n} μ_i ξ_i

with α and μ Lagrange multipliers, both non-negative.

    ∂L/∂w^j = w^j - Σ_{i=1}^{n} α_i y_i x_i^j = 0    ⇒    w = Σ_{i=1}^{n} α_i y_i x_i

    ∂L/∂ξ_j = C - α_j - μ_j = 0

    ∂L/∂b = Σ_{i=1}^{n} y_i α_i = 0
The Dual Problem

    L = (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j x_i^T x_j + C Σ_{i=1}^{n} ξ_i
        + Σ_{i=1}^{n} α_i (1 - ξ_i - y_i (Σ_{j=1}^{n} α_j y_j x_j^T x_i + b)) - Σ_{i=1}^{n} μ_i ξ_i

With

    Σ_{i=1}^{n} y_i α_i = 0    and    C = α_j + μ_j

this simplifies to

    L = Σ_{i=1}^{n} α_i - (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j x_i^T x_j
The Optimization Problem

  The dual of this new constrained optimization problem is

    maximize    W(α) = Σ_{i=1}^{n} α_i - (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j x_i^T x_j
    subject to  0 ≤ α_i ≤ C,   Σ_{i=1}^{n} α_i y_i = 0

  The new constraints derive from C = α_j + μ_j, since μ and α are non-negative.

  w is recovered as
    w = Σ_{i=1}^{n} α_i y_i x_i

  This is very similar to the optimization problem in the linearly separable case, except that there is now an upper bound C on the α_i.

  Once again, a QP solver can be used to find the α_i.
    (1/2) ||w||^2 + C Σ_{i=1}^{n} ξ_i

  The algorithm tries to keep ξ null, maximising the margin.

  The algorithm does not minimise the number of errors. Instead, it minimises the sum of the distances from the hyperplane.

  When C increases, the number of errors tends to decrease. In the limit of C tending to infinity, the solution tends to that given by the hard-margin formulation, with 0 errors.
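A hedged illustration of the role of C using scikit-learn's SVC (the library choice, data and values are mine, not from the slides):

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2]])
y = np.array([1] * 20 + [-1] * 20)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_[0])    # geometric margin 2 / ||w||
    print(f"C={C:7.2f}  support vectors={int(clf.n_support_.sum()):2d}  margin={margin:.3f}")

# Small C tolerates slack (wider margin, more support vectors); large C approaches
# the hard-margin solution.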
Soft margin is more robust

[Figure: hard-margin and soft-margin decision boundaries on the same data.]
Extension to Non-linear Decision Boundary

  So far, we have only considered large-margin classifiers with a linear decision boundary.
  How can we generalize this to become non-linear?

  Key idea: transform x_i to a higher dimensional space to "make life easier".
    Input space: the space where the points x_i are located.
    Feature space: the space of φ(x_i) after transformation.

  Why transform?
    A linear operation in the feature space is equivalent to a non-linear operation in the input space.
    Classification can become easier with a proper transformation. In the XOR problem, for example, adding the new feature x_1 x_2 makes the problem linearly separable.
XOR

Is not linearly separable:

    X   Y   | XOR
    0   0   |  0
    0   1   |  1
    1   0   |  1
    1   1   |  0

Is linearly separable (after adding the feature XY):

    X   Y   XY  | XOR
    0   0   0   |  0
    0   1   0   |  1
    1   0   0   |  1
    1   1   1   |  0
Find a feature space

[Figure: a data set that is not linearly separable in the input space becomes linearly separable after mapping to a feature space. From S. Mika, Kernel Fisher Discriminant.]
Transforming the Data

[Figure: the map φ(.) takes the points from the input space to the feature space. Note: in practice the feature space is of higher dimension than the input space.]

  Computation in the feature space can be costly because it is high dimensional.
    The feature space is typically infinite-dimensional!
  The kernel trick comes to the rescue.
The Kernel Trick

  Recall the SVM optimization problem:

    maximize    W(α) = Σ_{i=1}^{n} α_i - (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j x_i^T x_j
    subject to  0 ≤ α_i ≤ C,   Σ_{i=1}^{n} α_i y_i = 0

  The data points only appear as inner products.
  As long as we can calculate the inner product in the feature space, we do not need the mapping explicitly.
  Many common geometric operations (angles, distances) can be expressed by inner products.
  Define the kernel function K by

    K(x_i, x_j) = φ(x_i)^T φ(x_j)
An Example for φ(.) and K(.,.)

  Suppose φ(.) is given as follows, for x = (x_1, x_2):

    φ(x) = ( 1, √2 x_1, √2 x_2, x_1^2, x_2^2, √2 x_1 x_2 )

  An inner product in the feature space is

    <φ(x), φ(y)> = (1 + x_1 y_1 + x_2 y_2)^2

  So, if we define the kernel function as follows, there is no need to carry out φ(.) explicitly:

    K(x, y) = (1 + x^T y)^2

  This use of a kernel function to avoid carrying out φ(.) explicitly is known as the kernel trick.
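A small check of this idea (my own code; the 2-D feature map is the standard one for the degree-2 polynomial kernel): the kernel value equals the inner product of the explicit feature vectors.

import numpy as np

def phi(x):                       # explicit feature map for K(x, y) = (1 + x.y)^2 in 2-D
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

def K(x, y):
    return (1.0 + np.dot(x, y)) ** 2

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi(x), phi(y)))     # 4.0
print(K(x, y))                    # 4.0 -- same value, without ever forming phi explicitly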
Kernels

  Given a mapping

    x → φ(x)

  a kernel is represented as the inner product

    K(x, y) = Σ_i φ_i(x) φ_i(y)

  A kernel must satisfy Mercer's condition:

    for every g(x) such that ∫ g^2(x) dx is finite:   ∫∫ K(x, y) g(x) g(y) dx dy ≥ 0
Modification Due to Kernel Function

  Change all inner products to kernel functions.
  For training:

  Original:
    maximize    W(α) = Σ_{i=1}^{n} α_i - (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j (x_i^T x_j)
    subject to  0 ≤ α_i ≤ C,   Σ_{i=1}^{n} α_i y_i = 0

  With kernel function:
    maximize    W(α) = Σ_{i=1}^{n} α_i - (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j K(x_i, x_j)
    subject to  0 ≤ α_i ≤ C,   Σ_{i=1}^{n} α_i y_i = 0
Modification Due to Kernel Function

  For testing, the new data point z is classified as class 1 if f ≥ 0, and as class 2 if f < 0.

  Original:
    f(z) = w^T z + b = Σ_{j=1}^{s} α_{t_j} y_{t_j} (x_{t_j}^T z) + b

  With kernel function:
    f(z) = Σ_{j=1}^{s} α_{t_j} y_{t_j} K(x_{t_j}, z) + b
More on Kernel Functions

  Since the training of an SVM only requires the values of K(x_i, x_j), there is no restriction on the form of x_i and x_j.
    x_i can be a sequence or a tree, instead of a feature vector.

  K(x_i, x_j) is just a similarity measure comparing x_i and x_j.

  For a test object z, the discriminant function essentially is a weighted sum of the similarities between z and a pre-selected set of objects (the support vectors):

    f(z) = Σ_{x_i ∈ SV} α_i y_i K(x_i, z) + b
Example
   Suppose we have 5 1D data points
       x1=1, x2=2, x3=4, x4=5, x5=6, with 1, 2, 6 as class 1 and 4,
        5 as class 2  y1=1, y2=1, y3=-1, y4=-1, y5=1




                                                                75
Example

[Figure: the five 1-D points on a line: 1 and 2 (class 1), 4 and 5 (class 2), and 6 (class 1).]
Example

  We use the polynomial kernel of degree 2:
    K(x, y) = (xy + 1)^2
    C is set to 100.

  We first find the α_i (i = 1, ..., 5) by

    maximize    Σ_{i=1}^{5} α_i - (1/2) Σ_{i=1}^{5} Σ_{j=1}^{5} α_i α_j y_i y_j (x_i x_j + 1)^2
    subject to  0 ≤ α_i ≤ 100,   Σ_{i=1}^{5} α_i y_i = 0
Example

  By using a QP solver, we get
    α_1 = 0, α_2 = 2.5, α_3 = 0, α_4 = 7.333, α_5 = 4.833
    Note that the constraints are indeed satisfied.
    The support vectors are {x_2 = 2, x_4 = 5, x_5 = 6}.

  The discriminant function is

    f(z) = 2.5 (1) (2z + 1)^2 + 7.333 (-1) (5z + 1)^2 + 4.833 (1) (6z + 1)^2 + b
         = 0.6667 z^2 - 5.333 z + b

  b is recovered by solving f(2) = 1, or by f(5) = -1, or by f(6) = 1. All three give b = 9.
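A hedged check of this worked example in plain Python (using the exact fractions 22/3 and 29/6 for the rounded multipliers 7.333 and 4.833 is my assumption):

sv_x = [2.0, 5.0, 6.0]            # support vectors
sv_y = [1.0, -1.0, 1.0]           # their labels
alpha = [2.5, 22.0 / 3.0, 29.0 / 6.0]
b = 9.0

def K(x, y):                      # polynomial kernel of degree 2
    return (x * y + 1.0) ** 2

def f(z):                         # discriminant function
    return sum(a * y * K(x, z) for a, y, x in zip(alpha, sv_y, sv_x)) + b

for z in (1, 2, 4, 5, 6):
    print(z, round(f(z), 3))      # f(2) = 1, f(5) = -1, f(6) = 1; the sign gives the class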
Example

[Figure: value of the discriminant function f(z) along the 1-D input; it is positive near the class 1 points (1, 2, 6) and negative near the class 2 points (4, 5).]
Kernel Functions

  In practical use of SVM, the user specifies the kernel function; the transformation φ(.) is not explicitly stated.
  Given a kernel function K(x_i, x_j), the transformation φ(.) is given by its eigenfunctions (a concept in functional analysis).
    Eigenfunctions can be difficult to construct explicitly.
    This is why people only specify the kernel function, without worrying about the exact transformation.

  Another view: the kernel function, being an inner product, is really a similarity measure between the objects.
A kernel is associated to a transformation

  Given a kernel, in principle one can recover the transformation into the feature space that originates it.

    K(x, y) = (xy + 1)^2 = x^2 y^2 + 2xy + 1

  It corresponds to the transformation

    x  →  φ(x) = ( x^2, √2 x, 1 )
Examples of Kernel Functions

  Polynomial kernel of degree d:
    K(x, y) = (x · y)^d

  Polynomial kernel up to degree d:
    K(x, y) = (x · y + 1)^d

  Radial basis function kernel with width σ:
    K(x, y) = exp( -||x - y||^2 / (2σ^2) )
      The feature space is infinite-dimensional.

  Sigmoid with parameters κ and θ:
    K(x, y) = tanh( κ x · y + θ )
      It does not satisfy the Mercer condition for all κ and θ.
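A hedged sketch of these kernel families in NumPy (the parameter names and the 2*sigma^2 normalisation of the RBF kernel are my choices):

import numpy as np

def poly_kernel(x, y, d=3):                        # polynomial kernel up to degree d
    return (np.dot(x, y) + 1.0) ** d

def rbf_kernel(x, y, sigma=1.0):                   # radial basis function kernel, width sigma
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=1.0, theta=0.0):    # not a Mercer kernel for every kappa, theta
    return np.tanh(kappa * np.dot(x, y) + theta)

x, y = np.array([1.0, 2.0]), np.array([2.0, 0.0])
print(poly_kernel(x, y), rbf_kernel(x, y), sigmoid_kernel(x, y))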
Building new kernels

  If k1(x, y) and k2(x, y) are two valid kernels, then the following kernels are valid:

    Linear combination (c1, c2 ≥ 0):
        k(x, y) = c1 k1(x, y) + c2 k2(x, y)
    Exponential:
        k(x, y) = exp( k1(x, y) )
    Product:
        k(x, y) = k1(x, y) · k2(x, y)
    Polynomial transformation (Q: polynomial with non-negative coefficients):
        k(x, y) = Q( k1(x, y) )
    Function product (f: any function):
        k(x, y) = f(x) k1(x, y) f(y)
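A hedged numerical illustration of these closure rules (my own construction): build combined kernels from two valid ones and check that the Gram matrix on a random sample stays positive semi-definite.

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(6, 2)

k1 = lambda a, b: (np.dot(a, b) + 1.0) ** 2          # valid kernel
k2 = lambda a, b: np.exp(-np.sum((a - b) ** 2))      # valid kernel

def gram(k):
    return np.array([[k(a, b) for b in X] for a in X])

combos = {
    "linear combination": lambda a, b: 2.0 * k1(a, b) + 0.5 * k2(a, b),
    "product":            lambda a, b: k1(a, b) * k2(a, b),
    "exponential":        lambda a, b: np.exp(k1(a, b)),
    "f(x) k1 f(y)":       lambda a, b: np.sin(a[0]) * k1(a, b) * np.sin(b[0]),
}
for name, k in combos.items():
    eigmin = np.linalg.eigvalsh(gram(k)).min()       # >= 0 up to round-off for a valid kernel
    print(f"{name:20s} min eigenvalue = {eigmin:.2e}")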
Polynomial kernel

[Figure: decision boundaries obtained with a polynomial kernel. From Ben-Hur et al., PLoS Computational Biology 4 (2008).]
Gaussian RBF kernel

[Figure: decision boundaries obtained with a Gaussian RBF kernel. From Ben-Hur et al., PLoS Computational Biology 4 (2008).]
Spectral kernel for sequences

  Given a DNA sequence x, we can count the number of bases (a 4-D feature space):

    φ_1(x) = (n_A, n_C, n_G, n_T)

  Or the number of dimers (a 16-D space):

    φ_2(x) = (n_AA, n_AC, n_AG, n_AT, n_CA, n_CC, n_CG, n_CT, ...)

  Or l-mers (a 4^l-D space).

  The spectral kernel is

    k_l(x, y) = φ_l(x) · φ_l(y)
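A minimal sketch of the spectral kernel (the helper names are mine; counting is done over overlapping l-mers):

from itertools import product

def lmer_counts(seq, l):
    # Feature map phi_l: counts of every l-mer over the alphabet ACGT (4^l dimensions).
    kmers = ["".join(p) for p in product("ACGT", repeat=l)]
    return [sum(1 for i in range(len(seq) - l + 1) if seq[i:i + l] == k) for k in kmers]

def spectral_kernel(x, y, l):
    # k_l(x, y) = phi_l(x) . phi_l(y)
    return sum(a * b for a, b in zip(lmer_counts(x, l), lmer_counts(y, l)))

x, y = "ACGTACGT", "ACGGACGT"
print(spectral_kernel(x, y, 1))   # inner product of base counts (4-D feature space)
print(spectral_kernel(x, y, 2))   # inner product of dimer counts (16-D feature space)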
Choosing the Kernel Function
 Probably the most tricky part of using SVM.
 The kernel function is important because it creates the
  kernel matrix, which summarizes all the data
 Many principles have been proposed (diffusion kernel,
  Fisher kernel, string kernel, …)
 There is even research to estimate the kernel matrix

  from available information

 In practice, a low degree polynomial kernel or RBF
  kernel with a reasonable width is a good initial try
 Note that SVM with RBF kernel is closely related to RBF
  neural networks, with the centers of the radial basis
  functions automatically chosen for SVM
                                                       88
Why Does SVM Work?
 The feature space is often very high dimensional. Why
  don’t we have the curse of dimensionality?
 A classifier in a high-dimensional space has many

  parameters and is hard to estimate
 Vapnik argues that the fundamental problem is not the

  number of parameters to be estimated. Rather, the
  problem is about the flexibility of a classifier
 Typically, a classifier with many parameters is very
  flexible, but there are also exceptions
      Let x_i = 10^(-i), where i ranges from 1 to n. The one-parameter classifier y = sign(sin(θ x)) can classify all x_i correctly for every possible combination of class labels on the x_i.
      This 1-parameter classifier is very flexible.



                                                                   89
Why Does SVM Work?
   Vapnik argues that the flexibility of a classifier should
    not be characterized by the number of parameters, but
    by the flexibility (capacity) of a classifier
       This is formalized by the “VC-dimension” of a classifier
 Consider a linear classifier in two-dimensional space
 If we have three training data points, no matter how

  those points are labeled, we can classify them perfectly




                                                                   90
VC-dimension
   However, if we have four points, we can find a labeling
    such that the linear classifier fails to be perfect




 We can see that 3 is the critical number
 The VC-dimension of a linear classifier in a 2D space is 3

  because, if we have 3 points in the training set, perfect
  classification is always possible irrespective of the
  labeling, whereas for 4 points, perfect classification can
  be impossible



                                                         91
VC-dimension
 The VC-dimension of the nearest neighbor classifier is
  infinity, because no matter how many points you have,
  you get perfect classification on training data
 The higher the VC-dimension, the more flexible a
  classifier is
  VC-dimension, however, is a theoretical concept; the VC-dimension of most classifiers is, in practice, difficult to compute exactly.
       Qualitatively, if we think a classifier is flexible, it probably
        has a high VC-dimension




                                                                      92
Other Aspects of SVM
   How to use SVM for multi-class classification?
     One can change the QP formulation to become multi-class
     More often, multiple binary classifiers are combined

           See DHS 5.2.2 for some discussion
       One can train multiple one-versus-all classifiers, or combine
        multiple pairwise classifiers “intelligently”
   How to interpret the SVM discriminant function value as
    probability?
       By performing logistic regression on the SVM output of a
        set of data (validation set) that is not used for training
    Some SVM software (like libsvm) has these features built-in.


                                                                 93
Software

  A list of SVM implementations can be found at http://www.kernel-machines.org/software.html
  Some implementations (such as LIBSVM) can handle multi-class classification.
  SVMlight is among the earliest implementations of SVM.
  Several Matlab toolboxes for SVM are also available.
Summary: Steps for Classification

  Prepare the pattern matrix.
  Select the kernel function to use.
  Select the parameters of the kernel function and the value of C.
    You can use the values suggested by the SVM software, or you can set apart a validation set to determine the values of the parameters.
  Execute the training algorithm and obtain the α_i.
  Unseen data can be classified using the α_i and the support vectors (a minimal end-to-end sketch follows).
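A hedged end-to-end sketch of these steps with scikit-learn (the library, data set and parameter grid are my own illustration, not from the slides):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=4, random_state=0)  # pattern matrix
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Select the kernel, its parameter (gamma) and C; here chosen by cross-validation.
grid = GridSearchCV(SVC(kernel="rbf"),
                    param_grid={"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]},
                    cv=5)
grid.fit(X_tr, y_tr)              # training: the alpha_i are obtained internally

print(grid.best_params_)
print("held-out accuracy:", grid.score(X_te, y_te))   # classify unseen data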
Strengths and Weaknesses of SVM
   Strengths
       Training is relatively easy
            No local optima, unlike in neural networks
     It scales relatively well to high dimensional data
     Tradeoff between classifier complexity and error can be

      controlled explicitly
     Non-traditional data like strings and trees can be used as

      input to SVM, instead of feature vectors
   Weaknesses
       Need to choose a “good” kernel function.




                                                               96
Other Types of Kernel Methods
 A lesson learnt in SVM: a linear algorithm in the feature
  space is equivalent to a non-linear algorithm in the input
  space
 Standard linear algorithms can be generalized to their non-linear versions by going to the feature space
       Kernel principal component analysis, kernel independent
        component analysis, kernel canonical correlation analysis,
        kernel k-means, 1-class SVM are some examples




                                                                97
Conclusion
 SVM is a useful alternative to neural networks
 Two key concepts of SVM: maximize the margin and the
  kernel trick
 Many SVM implementations are available on the web for
  you to try on your data set!




                                                    98
Resources

  http://www.kernel-machines.org/
  http://www.support-vector.net/
  http://www.support-vector.net/icml-tutorial.pdf
  http://www.kernel-machines.org/papers/tutorial-nips.ps.gz
  http://www.clopinet.com/isabelle/Projects/SVM/applist.html
SVM-light

  http://svmlight.joachims.org
  Author: Thorsten Joachims, Cornell University

  Can be downloaded and easily installed:
  http://download.joachims.org/svm_light/current/svm_light.tar.gz

  To install SVMlight you need to download svm_light.tar.gz. Create a new directory: mkdir svm_light
  Move svm_light.tar.gz to this directory and unpack it with
      gunzip -c svm_light.tar.gz | tar xvf -
  Now execute make or make all

Two programs are compiled:
  svm_learn (learning module)
  svm_classify (classification module)
SVM-light: Training Input
1 1:2 2:1 3:4 4:3
1 1:2 2:1 3:4 4:3
-1 1:2 2:1 3:3 4:0
1 1:2 2:2 3:3 4:3
1 1:2 2:4 3:3 4:2
-1 1:2 2:2 3:3 4:0
-1 1:2 2:0 3:3
-1 1:2 2:4 3:3
-1 1:4 2:5 3:3
1 1:2 2:2 3:3 4:2

Class FeatureN:ValueN
                            101
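A hedged helper (my own code, not part of SVM-light) that writes a NumPy data set in the sparse "class feature:value" format shown above:

import numpy as np

def write_svmlight(path, X, y):
    with open(path, "w") as fh:
        for label, row in zip(y, X):
            feats = " ".join(f"{j + 1}:{float(v):g}" for j, v in enumerate(row) if v != 0)
            fh.write(f"{int(label)} {feats}\n")

X = np.array([[2, 1, 4, 3], [2, 1, 3, 0]])
y = np.array([1, -1])
write_svmlight("train.dat", X, y)   # produces lines like "1 1:2 2:1 3:4 4:3"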
SVM-light: Training
svm_learn [options] example_file model_file
SOME OPTIONS
General options:
-? - this help
Learning options:
-c float: trade-off between training error and margin (default [avg. x*x]^-1)
Performance estimation options:
-x [0,1] - compute leave-one-out estimates (default 0)
Kernel options:
-t int - type of kernel function: 0: linear (default) 1: polynomial (s a*b+c)^d 2: radial
   basis function exp(-gamma ||a-b||^2) 3: sigmoid tanh(s a*b + c) 4: user defined
   kernel from kernel.h
-d int - parameter d in polynomial kernel
-g float - parameter gamma in rbf kernel
-s float - parameter s in sigmoid/poly kernel -r float - parameter c in sigmoid/poly
   kernel
-u string - parameter of user defined kernel Optimization

                                                                                    102
 SVM-light: Trained Model
SVM-light Version V6.02
0 # kernel type
3 # kernel parameter -d
1 # kernel parameter -g
1 # kernel parameter -s
1 # kernel parameter -r
empty# kernel parameter -u
4 # highest feature index
12 # number of training documents
13 # number of support vectors plus 1
1.0380931 # threshold b, each following line is a SV (starting with alpha*y)
0.03980964156284725469214791360173 1:2 2:4 3:3 4:0 #
-0.018316632908270628204983054843069 1:4 2:5 3:3 4:0 #
-0.03980964156284725469214791360173 1:2 2:1 3:3 4:0 #
 0.03980964156284725469214791360173 1:2 2:1 3:4 4:3 #
-0.03980964156284725469214791360173 1:2 2:0 3:3 4:0 #
 0.03980964156284725469214791360173 1:2 2:4 3:3 4:2 #
 0.03980964156284725469214791360173 1:2 2:2 3:3 4:2 #
 0.03980964156284725469214791360173 1:2 2:1 3:4 4:3 #
-0.037841055392657176048576417315417 1:3 2:1 3:3 4:0 #
-0.03980964156284725469214791360173 1:2 2:2 3:3 4:0 #
 0.03980964156284725469214791360173 1:2 2:2 3:3 4:3 #
 0.016345916179801231460366750525282 1:1 2:2 3:4 4:3 #



                                                                               103
      SVM-light: Predicting


   svm_classify [options] example_file model_file output_file




                                                           104
SVM-light: Prediction
0.88647894
0.88647894
0.81321667
0.24665358
0.29204665
0.99999997
-0.8864726
-0.93186567
-0.84107953
-1.000032
-0.90916914
-0.99999364




                        105

				