# SVM

Summer Course: Data Mining

Support Vector Machines and other penalization classifiers

Presenter: Georgi Nalbantov

August 2009

Contents

- Purpose
- Linear Support Vector Machines
- Nonlinear Support Vector Machines
- (Theoretical justifications of SVM)
- Marketing Examples
- Other penalization classification methods (some extensions)
- Conclusion and Q & A

Purpose

- Classify cases (customers) into “type 1” or “type 2” on the basis of some known attributes (characteristics)
- Chosen tool to solve this task: Support Vector Machines

- Given data on explanatory and explained variables, where the explained variable can take two values {±1}, find a function that gives the “best” separation between the “−1” cases and the “+1” cases:

  Given: $(x_1, y_1), \ldots, (x_m, y_m) \in \mathbb{R}^n \times \{\pm 1\}$

  Find: $f : \mathbb{R}^n \to \{\pm 1\}$

  “Best” function = one whose expected error on unseen data $(x_{m+1}, y_{m+1}), \ldots, (x_{m+k}, y_{m+k})$ is minimal

- Existing techniques to solve the classification task:
  - Linear and Quadratic Discriminant Analysis
  - Logit choice models (Logistic Regression)
  - Decision trees, Neural Networks, Least Squares SVM

Support Vector Machines: Definition

- Support Vector Machines are a non-parametric tool for classification/regression
- Support Vector Machines are used for prediction rather than description purposes
- Support Vector Machines were developed by Vapnik and co-workers

Linear Support Vector Machines

- A direct marketing company wants to sell a new book: “The Art History of Florence” (Nissan Levin and Jacob Zahavi in Lattin, Carroll and Green, 2003)
- Problem: How to identify buyers (∆) and non-buyers (●) using the two variables:
  - Months since last purchase
  - Number of art books purchased

[Figure: scatter plot of buyers (∆) and non-buyers (●); x-axis: months since last purchase, y-axis: number of art books purchased]

Linear SVM: Separable Case

- Main idea of SVM: separate the groups by a line.
- However: there are infinitely many lines that have zero training error… which line shall we choose?

[Figure: the same scatter plot of buyers (∆) and non-buyers (●) with several candidate separating lines, all with zero training error]

Linear SVM: Separable Case

- SVM uses the idea of a margin around the separating line.
- The thinner the margin, the more complex the model.
- The best line is the one with the largest margin.

[Figure: scatter plot of buyers (∆) and non-buyers (●) with a separating line and its margin]

Linear SVM: Separable Case

- The line having the largest margin is:

  $w_1 x_1 + w_2 x_2 + b = 0$

  where
  - $x_1$ = months since last purchase
  - $x_2$ = number of art books purchased
  - $w = (w_1, w_2)$ is the normal vector of the line

- Note:
  - $w_1 x_{i1} + w_2 x_{i2} + b \geq +1$ for $i \in \Delta$
  - $w_1 x_{j1} + w_2 x_{j2} + b \leq -1$ for $j \in$ ●

[Figure: the separating line with its margin; the arrow w is perpendicular to the line]

Linear SVM: Separable Case

- The width of the margin is given by (derivation below):

  $\text{margin} = \dfrac{1 - (-1)}{\|w\|} = \dfrac{2}{\sqrt{w_1^2 + w_2^2}} = \dfrac{2}{\|w\|}$

- Note:

  maximize the margin $\dfrac{2}{\|w\|}$ $\;\Leftrightarrow\;$ minimize $\dfrac{\|w\|}{2}$ $\;\Leftrightarrow\;$ minimize $\dfrac{\|w\|^2}{2}$

[Figure: the margin between the two groups, of width 2/‖w‖]

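Why the margin equals 2/‖w‖ (a short derivation added for completeness, using only the definitions above):

```latex
% The two margin boundaries are the lines  w . x + b = +1  and  w . x + b = -1.
% The distance from a point x0 to the line  w . x + b = 0  is  |w . x0 + b| / ||w||,
% so each boundary lies at distance 1/||w|| from the separating line and the total width is
\[
\text{margin} \;=\; \frac{(+1) - (-1)}{\|w\|} \;=\; \frac{2}{\|w\|},
\]
% which is why maximizing the margin is the same as minimizing \|w\| (or \|w\|^2 / 2).
```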

Linear SVM: Separable Case

- Recall: maximize the margin $\dfrac{2}{\|w\|}$ ⇔ minimize $\dfrac{\|w\|}{2}$ ⇔ minimize $\dfrac{\|w\|^2}{2}$

- The optimization problem for SVM is (a code sketch follows below):

  minimize $L(w) = \dfrac{\|w\|^2}{2}$

  subject to:
  - $w_1 x_{i1} + w_2 x_{i2} + b \geq +1$ for $i \in \Delta$
  - $w_1 x_{j1} + w_2 x_{j2} + b \leq -1$ for $j \in$ ●

[Figure: the maximum-margin line for the separable case]

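As a minimal sketch of this problem in code (not part of the original slides): scikit-learn's `SVC` with a linear kernel and a very large `C` approximates the hard-margin problem above; the toy data below is illustrative, not the book-marketing data.

```python
# Hard-margin linear SVM, approximated with a very large C (illustrative toy data).
import numpy as np
from sklearn.svm import SVC

# Columns: (months since last purchase, number of art books purchased); labels +1 / -1.
X = np.array([[1.0, 4.0], [2.0, 5.0], [2.5, 3.5], [3.0, 4.5],   # buyers (+1)
              [6.0, 1.0], [7.5, 0.5], [8.0, 2.0], [9.0, 1.5]])  # non-buyers (-1)
y = np.array([+1, +1, +1, +1, -1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # huge C: (almost) no violations allowed

w, b = clf.coef_[0], clf.intercept_[0]        # the line w1*x1 + w2*x2 + b = 0
print("w =", w, " b =", b)
print("margin width = 2/||w|| =", 2.0 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)   # points on the margin boundaries
```

The last line anticipates the next slide: only the support vectors determine the fitted line.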

Linear SVM: Separable Case

- “Support vectors” are those points that lie on the boundaries of the margin.
- The decision surface (line) is determined only by the support vectors. All other points are irrelevant.

[Figure: the separable case with the support vectors highlighted on the margin boundaries]

Linear SVM: Nonseparable Case

- Non-separable case: there is no line separating the two groups without errors (training set: 1000 targeted customers).
- Here, SVM minimizes $L(w, C)$:

  minimize $L(w, C) = \underbrace{\dfrac{\|w\|^2}{2}}_{\text{maximize the margin}} + \underbrace{C \sum_i \xi_i}_{\text{minimize the training errors}}$

  $L(w, C)$ = Complexity + Errors

  subject to:
  - $w_1 x_{i1} + w_2 x_{i2} + b \geq +1 - \xi_i$ for $i \in \Delta$
  - $w_1 x_{j1} + w_2 x_{j2} + b \leq -1 + \xi_j$ for $j \in$ ●
  - $\xi_i, \xi_j \geq 0$

[Figure: non-separable scatter plot of buyers (∆) and non-buyers (●)]

Linear SVM: The Role of C

[Figure: two fits of the same data, one with C = 5 (thinner margin) and one with C = 1 (wider margin)]

- Bigger C → increased complexity (thinner margin), smaller number of errors (better fit on the data)
- Smaller C → decreased complexity (wider margin), bigger number of errors (worse fit on the data)
- Vary both complexity and empirical error via C, which affects the optimal w and the optimal number of training errors (illustrated in the sketch below)

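A hedged illustration of this trade-off (on synthetic data, not the slides' customer data): fit the soft-margin linear SVM with a larger and a smaller `C` and compare margin width and training errors.

```python
# Effect of the trade-off parameter C in a soft-margin linear SVM (synthetic data).
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(loc=[2.0, 4.0], scale=1.2, size=(50, 2)),   # class +1
               rng.normal(loc=[5.0, 1.5], scale=1.2, size=(50, 2))])  # class -1
y = np.array([+1] * 50 + [-1] * 50)

for C in (5.0, 1.0):                            # "bigger C" vs "smaller C", as on the slide
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_[0])
    errors = int(np.sum(clf.predict(X) != y))
    print(f"C = {C}: margin width = {margin:.2f}, training errors = {errors}")
# Expected pattern: larger C -> thinner margin and fewer training errors (better fit),
# smaller C -> wider margin and more training errors (simpler model).
```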

From Regression into Classification

- We have a linear model, such as

  $y = b \cdot x + \text{const}$

- We have to estimate this relation using our training data set, having in mind the so-called “accuracy”, or “0–1”, loss function (our evaluation criterion).

- The training data set we have consists of many observations, for instance:

  | Output (y) | Input (x) |
  |-----------:|----------:|
  | −1         | 0.2       |
  | 1          | 0.5       |
  | 1          | 0.7       |
  | ...        | ...       |
  | −1         | −0.7      |

From Regression into Classification

- We have a linear model, such as $y = b \cdot x + \text{const}$, to be estimated from the training data above under the “accuracy” (“0–1”) loss function.

[Figure: the training points plotted against x, with outputs −1 and +1; the fitted line crosses the levels −1 and +1 at the two “support vectors”, which delimit the “margin”]

From Regression into Classification: Support Vector Machines

- Flatter line ⇒ greater penalization; equivalently: smaller slope ⇒ bigger margin.

[Figure: the one-dimensional fit y = b·x + const, crossing the class levels −1 and +1 at the two support vectors, which delimit the “margin”]

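The algebra behind this equivalence (a short addition, using the one-dimensional model above):

```latex
% For f(x) = b x + const, the "margin" is the x-interval between the points where
% f crosses the two class levels -1 and +1:
\[
\frac{1-\mathrm{const}}{b} \;-\; \frac{-1-\mathrm{const}}{b} \;=\; \frac{2}{b},
\qquad\text{so the margin width is } \frac{2}{|b|}.
\]
% A flatter line (smaller |b|) therefore gives a wider margin -- the same complexity
% control as minimizing ||w|| in the classification picture.
```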

From Regression into Classification: Support Vector Machines

- The same idea with two inputs: $y = b_1 x_1 + b_2 x_2 + \text{const}$.
- Flatter plane ⇒ greater penalization; equivalently: smaller slope ⇒ bigger margin.

[Figure: the fitted plane over (x₁, x₂); the “margin” lies between the lines where the plane crosses the class levels −1 and +1]

Nonlinear SVM: Nonseparable Case

- Mapping into a higher-dimensional space: each observation $(x_{i1}, x_{i2})$, $i = 1, \ldots, l$, is replaced by

  $\left( x_{i1}^{2},\; \sqrt{2}\, x_{i1} x_{i2},\; x_{i2}^{2} \right)$

- The SVM problem in the transformed space becomes:

  minimize $L(w, C) = \dfrac{\|w\|^2}{2} + C \sum_i \xi_i$

  subject to:
  - $w_1 x_{i1}^2 + w_2 \sqrt{2}\, x_{i1} x_{i2} + w_3 x_{i2}^2 + b \geq +1 - \xi_i$ for $i \in \Delta$
  - $w_1 x_{j1}^2 + w_2 \sqrt{2}\, x_{j1} x_{j2} + w_3 x_{j2}^2 + b \leq -1 + \xi_j$ for $j \in$ ●

[Figure: the non-separable scatter plot of ∆ and ● points in the original space (x₁, x₂)]

Nonlinear SVM: Nonseparable Case

- Map the data into a higher-dimensional space: $\mathbb{R}^2 \to \mathbb{R}^3$,

  $\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \mapsto \begin{pmatrix} x_1^2 \\ \sqrt{2}\, x_1 x_2 \\ x_2^2 \end{pmatrix}$

- For the four example points (reproduced in the sketch below):
  - $(1, 1) \mapsto (1, \sqrt{2}, 1)$  ∆
  - $(-1, -1) \mapsto (1, \sqrt{2}, 1)$  ∆
  - $(1, -1) \mapsto (1, -\sqrt{2}, 1)$  ●
  - $(-1, 1) \mapsto (1, -\sqrt{2}, 1)$  ●

[Figure: left, the four points (±1, ±1) in the original space, where the two classes are not linearly separable; right, their images in the transformed space, where each class collapses onto a single point along the √2·x₁x₂ axis]

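A small sketch (added here) that applies the slide's mapping φ(x₁, x₂) = (x₁², √2·x₁x₂, x₂²) to the four points above and shows the two classes becoming separable along the √2·x₁x₂ axis:

```python
# Explicit feature map from the slide: phi(x1, x2) = (x1^2, sqrt(2)*x1*x2, x2^2).
import numpy as np

def phi(x1, x2):
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

points = [((+1, +1), "∆"), ((-1, -1), "∆"),    # one class
          ((+1, -1), "●"), ((-1, +1), "●")]    # the other class

for (x1, x2), label in points:
    print((x1, x2), "->", np.round(phi(x1, x2), 3), label)
# The four points, not linearly separable in 2D, map onto just two points in 3D:
# (1, 1.414, 1) for the ∆ class and (1, -1.414, 1) for the ● class,
# so the plane  sqrt(2)*x1*x2 = 0  separates them in the transformed space.
```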

Nonlinear SVM: Nonseparable Case

- Find the optimal hyperplane in the transformed space.

[Figure: the same mapped points in the transformed space, now shown with the optimal separating hyperplane between the two classes]

Nonlinear SVM: Nonseparable Case

- Observe the decision surface in the original space (optional).

[Figure: the separating hyperplane mapped back to the original (x₁, x₂) space, where it appears as a nonlinear decision surface]

Nonlinear SVM: Nonseparable Case

- Dual formulation of the (primal) SVM minimization problem:

  Primal:

  $\min_{w,\, b,\, \xi} \;\; \dfrac{\|w\|^2}{2} + C \sum_i \xi_i$

  subject to:
  - $y_i \left( w \cdot x_i + b \right) \geq 1 - \xi_i$
  - $\xi_i \geq 0$
  - $y_i = \pm 1$

  Dual:

  $\max_{\alpha} \;\; \sum_i \alpha_i \;-\; \dfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j\, y_i y_j\, (x_i \cdot x_j)$

  subject to:
  - $0 \leq \alpha_i \leq C$
  - $\sum_i \alpha_i y_i = 0$
  - $y_i = \pm 1$

Nonlinear SVM: Nonseparable Case

- Dual formulation with the mapping $\phi(x_1, x_2) = \left( x_1^2,\; \sqrt{2}\, x_1 x_2,\; x_2^2 \right)$:

  $\max_{\alpha} \;\; \sum_i \alpha_i \;-\; \dfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j\, y_i y_j\, \bigl( \phi(x_i) \cdot \phi(x_j) \bigr)$

  where

  $\phi(x_i) \cdot \phi(x_j) = \left( x_{i1}^2,\; \sqrt{2}\, x_{i1} x_{i2},\; x_{i2}^2 \right) \cdot \left( x_{j1}^2,\; \sqrt{2}\, x_{j1} x_{j2},\; x_{j2}^2 \right) = \bigl( (x_{i1}, x_{i2}) \cdot (x_{j1}, x_{j2}) \bigr)^2 = (x_i \cdot x_j)^2$

  so the inner product in the transformed space can be computed directly from the original data via the kernel function

  $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j) = (x_i \cdot x_j)^2$

  subject to:
  - $0 \leq \alpha_i \leq C$
  - $\sum_i \alpha_i y_i = 0$, $\; y_i = \pm 1$

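A two-line numerical check of this kernel identity (with illustrative values): the inner product of the mapped vectors equals the squared inner product of the original vectors, so the mapping never has to be carried out explicitly.

```python
# Kernel trick check: phi(xi) . phi(xj) == (xi . xj)^2 for the quadratic map above.
import numpy as np

def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

xi = np.array([0.3, -1.2])   # arbitrary illustrative points
xj = np.array([2.0,  0.7])

lhs = phi(xi) @ phi(xj)      # explicit mapping, then inner product
rhs = (xi @ xj) ** 2         # kernel function K(xi, xj) = (xi . xj)^2
print(lhs, rhs, np.isclose(lhs, rhs))   # the two numbers coincide
```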

Nonlinear SVM: Nonseparable Case

- Substituting the kernel into the dual, the whole problem is expressed in terms of the original data only:

  $\max_{\alpha} \; \sum_i \alpha_i - \dfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j\, y_i y_j\, \bigl( \phi(x_i) \cdot \phi(x_j) \bigr) \;=\; \max_{\alpha} \; \sum_i \alpha_i - \dfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j\, y_i y_j\, (x_i \cdot x_j)^2$

  subject to:
  - $0 \leq \alpha_i \leq C$
  - $\sum_i \alpha_i y_i = 0$, $\; y_i = \pm 1$

  with $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$ (kernel function).

Strengths and Weaknesses of SVM

- Strengths of SVM:
  - Training is relatively easy
  - No local minima
  - It scales relatively well to high-dimensional data
  - The trade-off between classifier complexity and error can be controlled explicitly via C
  - Robustness of the results
  - The “curse of dimensionality” is avoided

- Weaknesses of SVM:
  - What is the best trade-off parameter C?
  - Need a good transformation of the original space

The Ketchup Marketing Problem

- Two types of ketchup: Heinz and Hunts
- Seven attributes:
  - Feature Heinz
  - Feature Hunts
  - Display Heinz
  - Display Hunts
  - Feature & Display Heinz
  - Feature & Display Hunts
  - Log price difference between Heinz and Hunts
- Training data: 2498 cases (89.11% Heinz is chosen)
- Test data: 300 cases (88.33% Heinz is chosen)

The Ketchup Marketing Problem

- Choose a kernel mapping:
  - Linear kernel: $K(x_i, x_j) = x_i \cdot x_j$
  - Polynomial kernel: $K(x_i, x_j) = (x_i \cdot x_j + 1)^d$
  - RBF kernel: $K(x_i, x_j) = e^{-\|x_i - x_j\|^2 / 2\sigma^2}$
- Do a (5-fold) cross-validation procedure to find the best combination of the manually adjustable parameters (here: C and σ); see the sketch below.

[Figure: cross-validation mean squared errors for the SVM with RBF kernel over a grid of C and σ values, ranging from min to max]

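A sketch of this tuning step with scikit-learn (the data and grid values below are placeholders, not the actual ketchup data; note that scikit-learn parameterizes the RBF kernel by gamma = 1/(2σ²) rather than by σ):

```python
# 5-fold cross-validation over (C, gamma) for an RBF-kernel SVM (placeholder data).
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X_train = rng.normal(size=(2498, 7))                       # stand-in for the 7 attributes
y_train = rng.choice([0, 1], size=2498, p=[0.11, 0.89])    # ~89% "Heinz", as in the data

param_grid = {"C": [1, 100, 1000],                         # illustrative grid values
              "gamma": [0.01, 0.1, 1]}                     # gamma = 1 / (2 * sigma^2)
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print("best parameters:", search.best_params_)
print("cross-validated accuracy:", search.best_score_)
```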

The Ketchup Marketing Problem – Training Set

Model: Linear Discriminant Analysis (hit rate: 89.51%)

| Original group | Predicted Hunts | Predicted Heinz | Total   |
|----------------|----------------:|----------------:|--------:|
| Hunts (count)  | 68              | 204             | 272     |
| Heinz (count)  | 58              | 2168            | 2226    |
| Hunts (%)      | 25.00%          | 75.00%          | 100.00% |
| Heinz (%)      | 2.61%           | 97.39%          | 100.00% |

The Ketchup Marketing Problem – Training Set

Model: Logit Choice Model (hit rate: 77.79%)

| Original group | Predicted Hunts | Predicted Heinz | Total   |
|----------------|----------------:|----------------:|--------:|
| Hunts (count)  | 214             | 58              | 272     |
| Heinz (count)  | 497             | 1729            | 2226    |
| Hunts (%)      | 78.68%          | 21.32%          | 100.00% |
| Heinz (%)      | 22.33%          | 77.67%          | 100.00% |

The Ketchup Marketing Problem – Training Set

Model: Support Vector Machines (hit rate: 99.08%)

| Original group | Predicted Hunts | Predicted Heinz | Total   |
|----------------|----------------:|----------------:|--------:|
| Hunts (count)  | 255             | 17              | 272     |
| Heinz (count)  | 6               | 2220            | 2226    |
| Hunts (%)      | 93.75%          | 6.25%           | 100.00% |
| Heinz (%)      | 0.27%           | 99.73%          | 100.00% |

The Ketchup Marketing Problem – Training Set

Model: Majority Voting (hit rate: 89.11%)

| Original group | Predicted Hunts | Predicted Heinz | Total   |
|----------------|----------------:|----------------:|--------:|
| Hunts (count)  | 0               | 272             | 272     |
| Heinz (count)  | 0               | 2226            | 2226    |
| Hunts (%)      | 0%              | 100%            | 100.00% |
| Heinz (%)      | 0%              | 100%            | 100.00% |

The Ketchup Marketing Problem – Test Set

Model: Linear Discriminant Analysis (hit rate: 88.33%)

| Original group | Predicted Hunts | Predicted Heinz | Total   |
|----------------|----------------:|----------------:|--------:|
| Hunts (count)  | 3               | 32              | 35      |
| Heinz (count)  | 3               | 262             | 265     |
| Hunts (%)      | 8.57%           | 91.43%          | 100.00% |
| Heinz (%)      | 1.13%           | 98.87%          | 100.00% |

The Ketchup Marketing Problem – Test Set

Model: Logit Choice Model (hit rate: 77%)

| Original group | Predicted Hunts | Predicted Heinz | Total   |
|----------------|----------------:|----------------:|--------:|
| Hunts (count)  | 29              | 6               | 35      |
| Heinz (count)  | 63              | 202             | 265     |
| Hunts (%)      | 82.86%          | 17.14%          | 100.00% |
| Heinz (%)      | 23.77%          | 76.23%          | 100.00% |

The Ketchup Marketing Problem – Test Set

Model: Support Vector Machines (hit rate: 95.67%)

| Original group | Predicted Hunts | Predicted Heinz | Total   |
|----------------|----------------:|----------------:|--------:|
| Hunts (count)  | 25              | 10              | 35      |
| Heinz (count)  | 3               | 262             | 265     |
| Hunts (%)      | 71.43%          | 28.57%          | 100.00% |
| Heinz (%)      | 1.13%           | 98.87%          | 100.00% |

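For reference, the hit rates in these tables are the overall accuracy of the predicted group membership; a minimal sketch of producing such a table with scikit-learn (on placeholder labels, not the study's predictions):

```python
# Confusion matrix and hit rate (overall accuracy), as reported in the tables above.
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

# Placeholder labels: 0 = Hunts, 1 = Heinz (not the actual test-set predictions).
y_true = np.array([1, 1, 0, 1, 0, 1, 1, 1, 0, 1])
y_pred = np.array([1, 1, 0, 1, 1, 1, 1, 1, 0, 1])

print(confusion_matrix(y_true, y_pred))           # rows: actual group, columns: predicted
print("hit rate:", accuracy_score(y_true, y_pred))
```
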
Part II: Penalized classification and regression methods

- Support Hyperplanes
- Nearest Convex Hull classifier
- Soft Nearest Neighbor
- Application: an example Support Vector Regression financial study
- Conclusion

Classification: Support Hyperplanes

- Consider a (separable) binary classification case: training data (+, −) and a test point x.
- There are infinitely many hyperplanes that are semi-consistent (= commit no error) with the training data.

[Figure: a separable (+, −) training set with a test point x, and several hyperplanes that commit no error on the training data]

Classification: Support Hyperplanes

- For the classification of the test point x, use the farthest-away hyperplane that is semi-consistent with the training data (the “support hyperplane” of x).
- The SH decision surface: each point on it has two support hyperplanes.

[Figure: left, the support hyperplane of the test point x; right, the resulting SH decision surface]

Classification: Support Hyperplanes

- Toy problem experiment with Support Hyperplanes and Support Vector Machines.

[Figure: six panels comparing decision surfaces on the same toy data; top row: SH with a linear kernel, an RBF kernel (parameter = 5), and an RBF kernel (parameter = 35); bottom row: SVM with the same three kernels]

Classification: Support Vector Machines and Support Hyperplanes

[Figure: side-by-side decision surfaces of Support Vector Machines (left) and Support Hyperplanes (right) on the same data]

Classification: Support Vector Machines and Nearest Convex Hull classification

[Figure: side-by-side decision surfaces of Support Vector Machines (left) and Nearest Convex Hull classification (right)]

Classification: Support Vector Machines and Soft Nearest Neighbor

[Figure: side-by-side decision surfaces of Support Vector Machines (left) and Soft Nearest Neighbor (right)]

Classification: Support Hyperplanes

[Figure: Support Hyperplanes decision surface (left) and with bigger penalization (right)]

Classification: Nearest Convex Hull classification

[Figure: Nearest Convex Hull classification decision surface (left) and with bigger penalization (right)]

Classification: Soft Nearest Neighbor

[Figure: Soft Nearest Neighbor decision surface (left) and with bigger penalization (right)]

Classification: Support Vector Machines, Nonseparable Case

[Figure: Support Vector Machines decision surface on nonseparable data]

Classification: Support Hyperplanes, Nonseparable Case

[Figure: Support Hyperplanes decision surface on nonseparable data]

Classification: Nearest Convex Hull classification, Nonseparable Case

[Figure: Nearest Convex Hull classification decision surface on nonseparable data]

Classification: Soft Nearest Neighbor, Nonseparable Case

[Figure: Soft Nearest Neighbor decision surface on nonseparable data]

Summary: Penalization Techniques for Classification

Penalization methods for classification: Support Vector Machines (SVM), Support Hyperplanes (SH), Nearest Convex Hull classification (NCH), and Soft Nearest Neighbour (SNN). In all cases, the classification of a test point x is determined using the hyperplane h. Equivalently, x is labelled +1 (−1) if it is farther away from the set S− (S+).

Conclusion

- Support Vector Machines (SVM) can be applied to binary and multi-class classification problems
- SVM behaves robustly in multivariate problems
- Further research in various Marketing areas is needed to justify or refute the applicability of SVM
- Support Vector Regression (SVR) can also be applied
- http://www.kernel-machines.org
- Email: nalbantov@few.eur.nl
