Part 2:
Support Vector Machines

Vladimir Cherkassky
University of Minnesota
cherk001@umn.edu

Presented at Tech Tune Ups, ECE Dept, June 1, 2011


                           Electrical and Computer Engineering
                                                          1
SVM: Brief History



1963 Margin (Vapnik & Lerner)
1964 Margin (Vapnik & Chervonenkis)
1964 RBF Kernels (Aizerman et al.)
1965 Optimization formulation (Mangasarian)
1971 Kernels (Kimeldorf and Wahba)
1992-1994 SVMs (Vapnik et al.)
1996 – present Rapid growth, numerous apps
1996 – present Extensions to other problems
                                              2
        MOTIVATION for SVM
•   Problems with ‘conventional’ methods:
    - model complexity ~ dimensionality (# of features)
    - nonlinear methods → multiple local minima
    - hard to control complexity

•   SVM solution approach:
    - adaptive loss function (controls complexity
      independently of dimensionality)
    - flexible nonlinear models
    - tractable optimization formulation

                                                      3
        SVM APPROACH
•   Linear approximation in Z-space using
    special adaptive loss function
•   Complexity independent of dimensionality




    x  →  g(x)  →  z  →  (w · z)  →  ŷ
                                               4
              OUTLINE

•   Margin-based loss
•   SVM for classification
•   SVM examples
•   Support vector regression
•   Summary



                                5
   Example: binary classification
• Given: Linearly separable data
How to construct linear decision boundary?




                                             6
  Linear Discriminant Analysis
LDA solution    Separation margin




                                    7
        Perceptron (linear NN)
• Perceptron solutions and separation margin




                                               8
         Largest-margin solution
• All solutions explain the data well (zero error)
• All solutions ~ the same linear parameterization
• Larger margin ~ more confidence (falsifiability)




                                             M = 2Δ



                                                      9
  Complexity of Δ-margin hyperplanes

• If data samples belong to a sphere of radius R,
  then the set of Δ-margin hyperplanes has VC
  dimension bounded by

         $h \le \min(R^2 / \Delta^2, d) + 1$

• For large-margin hyperplanes, the VC-dimension is
  controlled independently of the dimensionality d.



                                                    10
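As a rough illustration (not from the original slides), the bound above can be evaluated numerically; the values of R, Δ, and d below are made-up examples.

```python
import numpy as np

def vc_bound(R, delta, d):
    """Upper bound on the VC-dimension of Delta-margin hyperplanes:
    h <= min(R^2 / Delta^2, d) + 1."""
    return min(R**2 / delta**2, d) + 1

# Hypothetical numbers: data inside a sphere of radius R = 1.0,
# margin Delta = 0.25, input dimensionality d = 1000.
print(vc_bound(R=1.0, delta=0.25, d=1000))   # 17.0  (much smaller than d + 1 = 1001)
```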
       Motivation: philosophical
• Classical view: a good model
  explains the data + has low complexity
  → Occam's razor (complexity ~ # of parameters)
• VC theory: a good model
  explains the data + has low VC-dimension
  ~ VC-falsifiability: a good model
  explains the data + has large falsifiability
→ The idea: falsifiability is encoded in the empirical loss function

                                                11
       Adaptive loss functions
• Both goals (explanation + falsifiability) can be
  encoded into an empirical loss function where
  - a (large) portion of the data has zero loss
  - the rest of the data has non-zero loss,
    i.e. it falsifies the model
• The trade-off (between the two goals) is
  adaptively controlled → adaptive loss function
• Examples of such loss functions for
  different learning problems are shown next
                                              12
Margin-based loss for classification




  Margin = 2Δ        $L(y, f(\mathbf{x}, \omega)) = \max(\Delta - y f(\mathbf{x}, \omega),\ 0)$
                                                          13
Margin-based loss for classification:
 margin is adapted to training data
[Figure: margin-based loss as a function of y · f(x, ω), with the Class +1 and Class -1 regions and the margin marked]

      $L(y, f(\mathbf{x}, \omega)) = \max(\Delta - y f(\mathbf{x}, \omega),\ 0)$
Epsilon loss for regression




L ( y, f (x,  ))  max | y  f (x,  ) |  ,0
                                                     15
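A corresponding sketch for the ε-insensitive loss (the sample values and ε are arbitrary, not from the slides):

```python
import numpy as np

def epsilon_loss(y, f_x, eps=0.6):
    """Epsilon-insensitive loss: max(|y - f(x)| - eps, 0).
    Residuals smaller than eps cost nothing."""
    return np.maximum(np.abs(y - f_x) - eps, 0.0)

y   = np.array([1.0, 0.2, -0.5])
f_x = np.array([0.5, 0.3, -1.4])
print(epsilon_loss(y, f_x))   # [0.  0.  0.3]
```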
Parameter epsilon is adapted to training data
  Example: linear regression y = x + noise
  where noise = N(0, 0.36), x ~ [0,1], 4 samples
Compare: squared, linear and SVM loss (eps = 0.6)
[Figure: the 4 samples and the fitted regression lines under squared, linear, and SVM (ε-insensitive) loss; y vs. x]
             OUTLINE
• Margin-based loss
• SVM for classification
  - Linear SVM classifier
  - Inner product kernels
  - Nonlinear SVM classifier
• SVM examples
• Support Vector regression
• Summary
                               17
    SVM Loss for Classification




The continuous quantity y · f(x, w) measures how
close a sample x is to the decision boundary.

                                         18
  Optimal Separating Hyperplane




Distance between the hyperplane and a sample:  $f(\mathbf{x}') \,/\, \|\mathbf{w}\|$
→ Margin  $\Delta = 1 / \|\mathbf{w}\|$      Shaded points are SVs       19
    Linear SVM Optimization Formulation
            (for separable data)
•    Given training data $(\mathbf{x}_i, y_i),\ i = 1, \ldots, n$
•    Find parameters w, b of the linear hyperplane
            $f(\mathbf{x}) = (\mathbf{w} \cdot \mathbf{x}) + b$
     that minimize
            $0.5\,\|\mathbf{w}\|^2$
     under the constraints   $y_i [(\mathbf{w} \cdot \mathbf{x}_i) + b] \ge 1$
•    Quadratic optimization with linear constraints,
     tractable for moderate dimensions d
•    For large dimensions use the dual formulation:
     - scales with the sample size (n) rather than d
     - uses only the dot products $(\mathbf{x}_i \cdot \mathbf{x}_j)$
                                                           20
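A minimal sketch of this formulation using scikit-learn's SVC with a linear kernel (not from the slides); the toy data below is made up, and a very large C approximates the hard-margin, separable case.

```python
import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable toy set (hypothetical data, n = 6, d = 2).
X = np.array([[0., 0.], [1., 0.], [0., 1.],
              [3., 3.], [4., 3.], [3., 4.]])
y = np.array([-1, -1, -1, +1, +1, +1])

# Large C ~ hard margin: minimize 0.5 * ||w||^2 s.t. y_i [(w . x_i) + b] >= 1.
clf = SVC(kernel='linear', C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, " b =", b)
print("margin = 1/||w|| =", 1.0 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)
```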
Classification for non-separable data




[Figure: margin-based loss vs. y · f(x, ω); samples violating the margin incur slack variables]

      $L(y, f(\mathbf{x}, \omega)) = \max(\Delta - y f(\mathbf{x}, \omega),\ 0)$
                                                                     21
      SVM for non-separable data
[Figure: soft-margin hyperplane f(x) = (w · x) + b with margin borders f(x) = +1 and f(x) = -1;
 slack variables ξ1 = 1 - f(x1), ξ2 = 1 - f(x2), ξ3 = 1 + f(x3) for the samples x1, x2, x3
 that violate the margin]

Minimize
      $C \sum_{i=1}^{n} \xi_i + \frac{1}{2} \|\mathbf{w}\|^2 \rightarrow \min$

under constraints   $y_i [(\mathbf{w} \cdot \mathbf{x}_i) + b] \ge 1 - \xi_i$                    22
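A sketch of the soft-margin trade-off (not from the slides): the same kind of toy data as before with one overlapping point added; C controls the penalty on the slacks ξ_i.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical overlapping data: one "+1" point sits among the "-1" class.
X = np.array([[0., 0.], [1., 0.], [0., 1.], [0.5, 0.5],
              [3., 3.], [4., 3.], [3., 4.]])
y = np.array([-1, -1, -1, +1, +1, +1, +1])

for C in (0.1, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    # Slack xi_i = max(1 - y_i * f(x_i), 0), nonzero only for margin violators.
    slack = np.maximum(1 - y * clf.decision_function(X), 0)
    print(f"C={C:6}: #SV={len(clf.support_)}, total slack={slack.sum():.2f}")
```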
         SVM Dual Formulation
•   Given training data $(\mathbf{x}_i, y_i),\ i = 1, \ldots, n$
•   Find parameters $\alpha_i^*, b^*$ of an optimal hyperplane
    as the solution to the maximization problem

      $L(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j) \rightarrow \max$

    under constraints   $\sum_{i=1}^{n} y_i \alpha_i = 0, \quad 0 \le \alpha_i \le C$

•   Solution   $f(\mathbf{x}) = \sum_{i=1}^{n} \alpha_i^* y_i (\mathbf{x} \cdot \mathbf{x}_i) + b^*$
    where samples with nonzero $\alpha_i^*$ are the SVs
•   Needs only the inner products $(\mathbf{x} \cdot \mathbf{x}')$
                                                                     23
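A sketch of how the dual solution appears in scikit-learn (an assumption about tooling, not part of the slides): `dual_coef_` stores α_i* y_i for the support vectors, and the decision function can be recomputed from inner products with the SVs only. The toy data is made up.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0., 0.], [1., 0.], [0., 1.],
              [3., 3.], [4., 3.], [3., 4.]])
y = np.array([-1, -1, -1, +1, +1, +1])

clf = SVC(kernel='linear', C=10.0).fit(X, y)

# dual_coef_[0, k] = alpha_k* * y_k for the k-th support vector.
alpha_y = clf.dual_coef_[0]
SVs     = clf.support_vectors_
b       = clf.intercept_[0]

# f(x) = sum_k alpha_k* y_k (x . x_k) + b*, using only the SVs.
x_new = np.array([2.0, 2.0])
f_manual = np.dot(alpha_y, SVs @ x_new) + b
print(f_manual, clf.decision_function([x_new])[0])   # the two values agree
```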
     Nonlinear Decision Boundary
• A fixed (linear) parameterization is too rigid
• A nonlinear (curved) decision boundary may yield a larger
  margin (falsifiability) and a lower error




                                               24
   Nonlinear Mapping via Kernels
Nonlinear f(x,w) + margin-based loss = SVM
• Nonlinear mapping to a feature space z, e.g.
  $\mathbf{x} \sim (x_1, x_2) \rightarrow \mathbf{z} \sim (1,\ x_1,\ x_2,\ x_1 x_2,\ x_1^2,\ x_2^2)$
• Linear in z-space ~ nonlinear in x-space
• BUT  $\mathbf{z} \cdot \mathbf{z}' = H(\mathbf{x}, \mathbf{x}')$  ~ kernel trick
→ Compute the dot product via the kernel analytically

    x  →  g(x)  →  z  →  (w · z)  →  ŷ
                                                   25
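A numerical check of the kernel trick (a sketch, not from the slides): with the standard degree-2 polynomial feature map, which carries √2 scaling factors on the cross terms, the explicit dot product z · z' equals the kernel (x · x' + 1)².

```python
import numpy as np

def phi(x):
    """Explicit degree-2 polynomial feature map for 2-D input
    (standard form with sqrt(2) factors)."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2,
                     x1**2, x2**2])

def H(x, xp):
    """Polynomial kernel of degree 2: ((x . x') + 1)^2."""
    return (np.dot(x, xp) + 1.0) ** 2

x, xp = np.array([0.3, -1.2]), np.array([2.0, 0.5])
print(np.dot(phi(x), phi(xp)))   # dot product computed in z-space
print(H(x, xp))                  # same value, computed analytically in x-space
```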
SVM Formulation (with kernels)
•   Replacing z  z   H x, x         leads to:
•   Find parameters  , b of an optimal
                                    *   *
                                    i
                                n
    hyperplane Dx     y H x , x   b
                                        *
                                        i   i       i
                                                               *

                               i 1
    as a solution to maximization problem
     L     i    i j y i y j H x i , x j   max
              n
                    1 n
             i 1   2 i , j 1
                                n

    under constraints         y           i   i        0,       0  i  C
                                                x i ,yi 
                               i 1
    Given: the training data                                       i  1,...,n
              an inner product kernel H x, x
              regularization parameter C
                                                                                 26
            Examples of Kernels
Kernel H x, x is a symmetric function satisfying general
  math conditions (Mercer’s conditions)
 Examples of kernels for different mappings xz
• Polynomials of degree q
                     H x, x   x  x'  1q

• RBF kernel                             x  x'
                                        
                                                   2
                                                       
                                                       
                         H x, x  exp             
                                        
                                        
                                            2         
                                                       
• Neural Networks         H x, x  tanhv(x  x' )  a
  for given parameters    v, a
Automatic selection of the number of hidden units (SV’s)     27
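The three kernels above written out as plain functions (a sketch; the values of q, σ, v, a are arbitrary choices, not from the slides):

```python
import numpy as np

def poly_kernel(x, xp, q=3):
    """Polynomial kernel of degree q: ((x . x') + 1)^q."""
    return (np.dot(x, xp) + 1.0) ** q

def rbf_kernel(x, xp, sigma=1.0):
    """RBF kernel: exp(-||x - x'||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - xp) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, xp, v=0.5, a=-1.0):
    """'Neural network' (sigmoid) kernel: tanh(v (x . x') + a).
    Note: satisfies Mercer's conditions only for some choices of v, a."""
    return np.tanh(v * np.dot(x, xp) + a)

x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(poly_kernel(x, xp), rbf_kernel(x, xp), sigmoid_kernel(x, xp))
```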
            More on Kernels
• The kernel matrix has all info (data + kernel)
 H(1,1) H(1,2)…….H(1,n)
 H(2,1) H(2,2)…….H(2,n)
 ………………………….
 H(n,1) H(n,2)…….H(n,n)
• Kernel defines a distance in some feature
  space (aka kernel-induced feature space)
• Kernels can incorporate a priori knowledge
• Kernels can be defined over complex
  structures (trees, sequences, sets, etc.)
                                              28
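A sketch of building the n × n kernel (Gram) matrix from data and a kernel function; the random data and the RBF kernel here are just placeholders, not from the slides.

```python
import numpy as np

def gram_matrix(X, kernel):
    """Kernel matrix H with H[i, j] = kernel(x_i, x_j)."""
    n = len(X)
    H = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            H[i, j] = kernel(X[i], X[j])
    return H

rbf = lambda x, xp: np.exp(-np.sum((x - xp) ** 2))   # RBF with 2*sigma^2 = 1
X = np.random.default_rng(0).normal(size=(5, 2))     # 5 random 2-D samples
H = gram_matrix(X, rbf)
print(H.shape, np.allclose(H, H.T))                  # (5, 5) and symmetric
```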
                   Support Vectors
• SV’s ~ training samples with non-zero loss
• SV’s are samples that falsify the model
• The model depends only on SVs
→ SV's ~ robust characterization of the data
WSJ Feb 27, 2004:
  About 40% of us (Americans) will vote for a Democrat, even if the
  candidate is Genghis Khan. About 40% will vote for a Republican,
  even if the candidate is Attila the Hun. This means that the election
  is left in the hands of one-fifth of the voters.


• SVM Generalization ~ data compression
                                                                          29
   New insights provided by SVM
• Why can linear classifiers generalize?

      $h \le \min(R^2 / \Delta^2, d) + 1$


  (1) Margin is large (relative to R)
  (2) % of SV’s is small
  (3) ratio d/n is small
• SVM offers an effective way to control
  complexity (via margin + kernel selection)
  i.e. implementing (1) or (2) or both
• Requires common-sense parameter tuning
                                           30
              OUTLINE
•   Margin-based loss
•   SVM for classification
•   SVM examples
•   Support Vector regression
•   Summary



                                31
                Ripley’s data set
• 250 training samples, 1,000 test samples
• SVM using RBF kernel  $H(\mathbf{u}, \mathbf{v}) = \exp(-\gamma \|\mathbf{u} - \mathbf{v}\|^2)$
• Model selection via 10-fold cross-validation




                                                       32
      Ripley’s data set: SVM model
• Decision boundary and margin borders
• SV’s are circled
[Figure: SVM decision boundary and margin borders in the (x1, x2) plane, with the SVs circled]
                                                    33
      Ripley’s data set: model selection
• SVM tuning parameters: C, γ
• Select optimal parameter values via 10-fold cross-validation
• The results of cross-validation are summarized below:

             C=0.1    C=1      C=10     C=100    C=1000   C=10000

     γ=2^-3  98.4%    23.6%    18.8%    20.4%    18.4%    14.4%
     γ=2^-2  51.6%    22%      20%      20%      16%      14%
     γ=2^-1  33.2%    19.6%    18.8%    15.6%    13.6%    14.8%
     γ=2^0   28%      18%      16.4%    14%      12.8%    15.6%
     γ=2^1   20.8%    16.4%    14%      12.8%    16%      17.2%
     γ=2^2   19.2%    14.4%    13.6%    15.6%    15.6%    16%
     γ=2^3   15.6%    14%      15.6%    16.4%    18.4%    18.4%

                                                                   34
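A sketch of this kind of model selection with scikit-learn's GridSearchCV (an assumed tool, not named in the slides); the synthetic two-class data stands in for Ripley's set, which is not bundled with scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Stand-in for Ripley's data: 250 noisy two-class training samples.
X, y = make_moons(n_samples=250, noise=0.3, random_state=0)

param_grid = {
    'C':     [0.1, 1, 10, 100, 1000, 10000],
    'gamma': [2.0**k for k in range(-3, 4)],
}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=10)   # 10-fold CV
search.fit(X, y)

print("best (C, gamma):", search.best_params_)
print("CV error: %.1f%%" % (100 * (1 - search.best_score_)))
```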
                       Noisy Hyperbolas data set
• This example shows application of different kernels
• Note: decision boundaries are quite different

[Figure: decision boundaries for the Noisy Hyperbolas data. Left panel: RBF kernel; right panel: polynomial kernel]
                                                                                                                 35
    Many challenging applications
•   Mimic human recognition capabilities
    - high-dimensional data
    - content-based
    - context-dependent
•   Example: read the sentence
    Sceitnitss osbevred: it is nt inptrant how
    lteters are msspled isnide the word. It is
    ipmoratnt that the fisrt and lsat letetrs do not
    chngae, tehn the txet is itneprted corrcetly

•   SVM is suitable for sparse high-dimensional
    data
                                                 36
    Example SVM Applications
• Handwritten digit recognition
• Genomics
• Face detection in unrestricted
  images
• Text/ document classification
• Image classification and retrieval
• …….                                  37
Handwritten Digit Recognition (mid-90’s)
 • Data set:
   postal images (zip-code), segmented, cropped;
   ~ 7K training samples, and 2K test samples
 • Data encoding:
   16x16 pixel image  256-dim. vector
 • Original motivation: Compare SVM with custom
   MLP network (LeNet) designed for this application
 • Multi-class problem: one-vs-all approach
    → 10 SVM classifiers (one per digit)
                                                       38
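A sketch of the one-vs-all setup, using scikit-learn's small built-in digits data (8x8 images rather than the 16x16 postal images described above); the kernel and C are arbitrary choices.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)          # 10 classes, 64 features
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# One binary SVM per digit (one-vs-all), each with a polynomial kernel.
ova = OneVsRestClassifier(SVC(kernel='poly', degree=3, C=10.0))
ova.fit(X_tr, y_tr)

print("test error: %.1f%%" % (100 * (1 - ova.score(X_te, y_te))))
print("number of binary SVMs:", len(ova.estimators_))   # 10
```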
        Digit Recognition Results
• Summary
  - prediction accuracy better than custom NN’s
  - accuracy does not depend on the kernel type
  - 100 – 400 support vectors per class (digit)
• More details
  Type of kernel No. of Support Vectors   Error%
  Polynomial            274               4.0
  RBF                   291               4.1
  Neural Network        254               4.2

• ~ 80-90% of SV’s coincide (for different kernels)
                                                   39
Document Classification (Joachims, 1998)
• The Problem: classification of text documents in
  large databases, for text indexing and retrieval
• Traditional approach: human categorization (i.e. via
  feature selection) – relies on a good indexing scheme.
  This is time-consuming and costly
• Predictive Learning Approach (SVM): construct a
  classifier using all possible features (words)
• Document/ Text Representation:
  individual words = input features (possibly weighted)
• SVM performance:
   – Very promising (~ 90% accuracy vs 80% by other classifiers)
   – Most problems are linearly separable → use a linear SVM
                                                                   40
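A sketch of the "all words as features" approach with a linear SVM; scikit-learn's 20 newsgroups corpus is used here only as a stand-in for the data bases described above (it downloads on first use).

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

cats = ['sci.space', 'rec.autos']                      # a two-class subset
train = fetch_20newsgroups(subset='train', categories=cats)
test  = fetch_20newsgroups(subset='test',  categories=cats)

# Every word becomes a (tf-idf weighted) feature; a linear SVM on top.
model = make_pipeline(TfidfVectorizer(), LinearSVC(C=1.0))
model.fit(train.data, train.target)

print("test accuracy: %.1f%%" % (100 * model.score(test.data, test.target)))
```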
             OUTLINE

•   Margin-based loss
•   SVM for classification
•   SVM examples
•   Support vector regression
•   Summary



                                41
           Linear SVM regression
Assume linear parameterization   $f(\mathbf{x}, \omega) = (\mathbf{w} \cdot \mathbf{x}) + b$

[Figure: ε-insensitive tube of half-width ε around the regression line in the (x, y) plane,
 with slack variables ξ above the tube and ξ* below it]

            $L_\varepsilon(y, f(\mathbf{x}, \omega)) = \max(|y - f(\mathbf{x}, \omega)| - \varepsilon,\ 0)$
                                                                 42
  Direct Optimization Formulation
Given training data  $(\mathbf{x}_i, y_i),\ i = 1, \ldots, n$

[Figure: ε-insensitive tube around the linear fit in the (x, y) plane, with slack variables ξ and ξ*]

Minimize
      $\frac{1}{2} (\mathbf{w} \cdot \mathbf{w}) + C \sum_{i=1}^{n} (\xi_i + \xi_i^*)$

under constraints
      $y_i - (\mathbf{w} \cdot \mathbf{x}_i) - b \le \varepsilon + \xi_i$
      $(\mathbf{w} \cdot \mathbf{x}_i) + b - y_i \le \varepsilon + \xi_i^*$
      $\xi_i, \xi_i^* \ge 0, \quad i = 1, \ldots, n$
                                                                 43
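A sketch of ε-insensitive regression with scikit-learn's SVR, mirroring the formulation above; the noisy linear data, ε, and C are made-up choices, not the example from the slides.

```python
import numpy as np
from sklearn.svm import SVR

# Hypothetical data: y = x + Gaussian noise, 40 samples on [0, 1].
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(40, 1))
y = X.ravel() + rng.normal(0, 0.2, size=40)

svr = SVR(kernel='linear', C=1.0, epsilon=0.3).fit(X, y)

# Support vectors are the samples lying on or outside the epsilon-tube.
print("w =", svr.coef_[0], " b =", svr.intercept_[0])
print("number of SVs:", len(svr.support_))
```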
Example: SVM regression using an RBF kernel

[Figure: SVM regression estimate fitted to the noisy samples; y vs. x]

The SVM estimate is shown as a dashed line.
The SVM model uses only 5 SV's (out of the 40 points).
                                                                               44
RBF regression model
      $f(x, \mathbf{w}) = \sum_{j=1}^{m} w_j \exp\!\left(-\frac{(x - c_j)^2}{2 \cdot 0.2^2}\right)$

[Figure: the weighted RBF kernel components and the resulting SVM model; y vs. x]


 Weighted sum of 5 RBF kernels gives the SVM model
                                                                                   45
              Summary
• Margin-based loss: robust + performs complexity control
• Nonlinear feature selection (~ SV's): performed automatically
• Tractable model selection – easier than most nonlinear methods
• SVM is not a magic-bullet solution
  - similar to other methods when n >> h
  - SVM is better when n << h or n ~ h                              46

				