# Support Vector Regression

David R. Musicant and O.L. Mangasarian
International Symposium on Mathematical Programming
Thursday, August 10, 2000
http://www.cs.wisc.edu/~musicant
## Outline

- Robust Regression
  - Huber M-estimator loss function
  - Numerical comparisons
  - Nonlinear kernels
- Tolerant Regression
  - New formulation of Support Vector Regression (SVR)
  - Numerical comparisons
  - Massive regression: row-column chunking
- Conclusions & Future Work
## Focus 1: Robust Regression (a.k.a. Huber Regression)

[Plot: the Huber loss function, quadratic between $-\gamma$ and $\gamma$ and linear outside.]
## "Standard" Linear Regression

Find $w$, $b$ such that $Aw + be \approx y$.

- $m$ points in $R^n$, represented by an $m \times n$ matrix $A$.
- $y \in R^m$ is the vector to be approximated.

[Figure: the data points of $A$ fitted by the line $\hat{y} = Aw + be$ with intercept $b$.]
## Optimization Problem

Find $w$, $b$ such that $Aw + be \approx y$. Bound the error by $s$:

$$-s \le Aw + be - y \le s$$

Minimize the error:

$$\min_{w,b,s} \; \|s\|_2^2 \quad \text{s.t.} \quad -s \le Aw + be - y \le s$$
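This squared-error problem is ordinary least squares; as a minimal illustration (my addition, not from the slides), it can be solved directly with numpy:

```python
import numpy as np

def least_squares_fit(A, y):
    """Minimize ||Aw + be - y||_2^2 over w and b."""
    m = A.shape[0]
    # Append a column of ones so the intercept b is fit jointly with w.
    A1 = np.hstack([A, np.ones((m, 1))])
    coef, *_ = np.linalg.lstsq(A1, y, rcond=None)
    return coef[:-1], coef[-1]  # w, b

# Synthetic usage example (illustrative only).
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))
y = A @ np.array([1.0, -2.0, 0.5]) + 3.0 + 0.1 * rng.normal(size=100)
w, b = least_squares_fit(A, y)
```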
## Examining the Loss Function

Standard regression uses a squared error loss function.

- Points which are far from the predicted line (outliers) are overemphasized.

[Plot: the squared error loss $t^2$.]
## Alternative Loss Function

Instead of squared error, try the absolute value of the error:

$$\min_{w,b,s} \; e's \quad \text{s.t.} \quad -s \le Aw + be - y \le s$$

[Plot: the absolute value loss $|t|$.]

This is the 1-norm loss function.
## 1-Norm Problems and Solution

- The 1-norm loss overemphasizes error on points close to the predicted line.
- Solution: the Huber loss function, a hybrid approach.

[Plot: the Huber loss, with its linear tails labeled.]

Many practitioners prefer the Huber loss function.
## Mathematical Formulation

$\gamma$ indicates the switchover from quadratic to linear:

$$\rho(t) = \begin{cases} t^2/2, & |t| \le \gamma \\ \gamma|t| - \gamma^2/2, & |t| > \gamma \end{cases}$$

[Plot: the Huber loss, quadratic on $[-\gamma, \gamma]$ and linear outside.]
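A direct transcription of $\rho(t)$ in Python (a sketch for illustration; `gamma` is the quadratic-to-linear switchover):

```python
import numpy as np

def huber_loss(t, gamma):
    """Huber loss: quadratic for |t| <= gamma, linear beyond."""
    t = np.asarray(t, dtype=float)
    quadratic = t**2 / 2
    linear = gamma * np.abs(t) - gamma**2 / 2
    return np.where(np.abs(t) <= gamma, quadratic, linear)
```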
## Regression Approach Summary

- Squared loss function (2-norm)
  - Standard method in statistics
  - Overemphasizes outliers
- Linear loss function (1-norm)
  - Formulates well as a linear program
  - Overemphasizes small errors
- Huber loss function (hybrid approach)
  - Appropriate emphasis on large and small errors

[Plots: the three loss functions.]
## Previous Attempts Were Complicated

Earlier efforts to solve Huber regression:

- Huber: Gauss-Seidel method

Our new approach is a convex quadratic program (a sketch follows this slide):

$$\min_{w,z,t} \; \tfrac{1}{2}\|z\|_2^2 + \gamma e't \quad \text{s.t.} \quad z - t \le Aw + be - y \le z + t$$

Our new approach is simpler and faster.
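A minimal sketch of this QP using cvxpy (my choice of modeling tool, not the authors'; they used CPLEX):

```python
import cvxpy as cp

def huber_regression_qp(A, y, gamma):
    """Solve min 0.5*||z||_2^2 + gamma*e't  s.t.  z - t <= Aw + be - y <= z + t."""
    m, n = A.shape
    w = cp.Variable(n)
    b = cp.Variable()
    z = cp.Variable(m)
    # t >= 0 is implied by the two-sided constraint; stated explicitly here.
    t = cp.Variable(m, nonneg=True)
    r = A @ w + b - y  # residual Aw + be - y
    prob = cp.Problem(
        cp.Minimize(0.5 * cp.sum_squares(z) + gamma * cp.sum(t)),
        [z - t <= r, r <= z + t],
    )
    prob.solve()
    return w.value, b.value
```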
## Experimental Results: Census20k

[Bar chart: CPU time (sec) on Census20k (20,000 points, 11 features), comparing the Li, Huber, Smola, and MM methods at $\gamma$ = 1.345, 1, and 0.1. MM is faster.]

## Experimental Results: CPUSmall

[Bar chart: CPU time (sec) on CPUSmall (8,192 points, 12 features), same four methods and $\gamma$ values. MM is faster.]
## Introduce Nonlinear Kernel

Begin with the previous formulation:

$$\min_{w,z,t} \; \tfrac{1}{2}\|z\|_2^2 + \gamma e't \quad \text{s.t.} \quad z - t \le Aw + be - y \le z + t$$

Substitute $w = A'\alpha$ and minimize over $\alpha$ instead:

$$z - t \le AA'\alpha + be - y \le z + t$$

Substitute $K(A,A')$ for $AA'$:

$$z - t \le K(A,A')\alpha + be - y \le z + t$$
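For concreteness, a common choice of nonlinear kernel is the Gaussian; the sketch below is my addition (the slides do not specify the parameterization) and builds the kernel matrix in numpy:

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix: K[i, j] = exp(-||A_i - B_j||^2 / (2*sigma^2))."""
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2 * A @ B.T)
    return np.exp(-sq_dists / (2 * sigma**2))

# gaussian_kernel(A, A) plays the role of K(A, A') in the constraints;
# the linear kernel would simply be A @ A.T.
```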
## Nonlinear Results

| Dataset        | Kernel   | Training Accuracy | Testing Accuracy |
|----------------|----------|-------------------|------------------|
| CPUSmall       | Linear   | 94.50%            | 94.06%           |
| CPUSmall       | Gaussian | 97.26%            | 95.90%           |
| Boston Housing | Linear   | 85.60%            | 83.81%           |
| Boston Housing | Gaussian | 92.36%            | 88.15%           |

Nonlinear kernels improve accuracy.
## Focus 2: Support Vector Tolerant Regression
(The regression approach summary from Focus 1 is repeated here: squared, 1-norm, and Huber loss functions.)
## Optimization Problem

Find $w$, $b$ such that $Aw + be \approx y$. Bound the error by $s$ and minimize the magnitude of the error:

$$\min_{w,b,s} \; e's \quad \text{s.t.} \quad -s \le Aw + be - y \le s$$
## The Overfitting Issue

Noisy training data can be fitted "too well", which leads to poor generalization on future data. Prefer simpler regressions, i.e. those where:

- some $w$ coefficients are zero
- the line is "flatter"

[Figure: a fitted line $\hat{y} = Aw + be$ through the points of $A$.]
## Reducing Overfitting

To achieve both goals, also minimize the magnitude of the $w$ vector (a sketch follows this slide):

$$\min_{w,b,s} \; \|w\|_1 + C e's \quad \text{s.t.} \quad -s \le Aw + be - y \le s$$

- $C$ is a parameter that balances the two goals, chosen by experimentation.
- This reduces overfitting due to points far from the surface.
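A hedged cvxpy sketch of this 1-norm regularized LP (again my stand-in for the authors' CPLEX setup):

```python
import cvxpy as cp

def l1_regularized_regression(A, y, C):
    """Solve min ||w||_1 + C*e's  s.t.  -s <= Aw + be - y <= s."""
    m, n = A.shape
    w = cp.Variable(n)
    b = cp.Variable()
    s = cp.Variable(m, nonneg=True)
    r = A @ w + b - y
    prob = cp.Problem(cp.Minimize(cp.norm(w, 1) + C * cp.sum(s)),
                      [-s <= r, r <= s])
    prob.solve()
    return w.value, b.value
```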
## Overfitting Again: "Close" Points

"Close" points may be wrong due to noise only; the line should be influenced by "real" data, not noise. Therefore: ignore errors from those points which are close!

[Figure: a tube of width $\varepsilon$ around the line $\hat{y} = Aw + be$.]
## Tolerant Regression

Allow an interval of size $\varepsilon$ with uniform error:

$$\min_{w,b,s} \; \|w\|_1 + C e's \quad \text{s.t.} \quad -s \le Aw + be - y \le s, \quad e\varepsilon \le s$$

How large should $\varepsilon$ be? As large as possible, while preserving accuracy:

$$\min_{w,b,s,\varepsilon} \; \tfrac{1}{m}\|w\|_1 + \tfrac{C}{m} e's - C\mu\varepsilon \quad \text{s.t.} \quad -s \le Aw + be - y \le s, \quad e\varepsilon \le s$$
## Introduce Nonlinear Kernel

Begin with the previous formulation:

$$\min_{w,b,s,\varepsilon} \; \tfrac{1}{m}\|w\|_1 + \tfrac{C}{m} e's - C\mu\varepsilon \quad \text{s.t.} \quad -s \le Aw + be - y \le s, \quad e\varepsilon \le s$$

Substitute $w = A'\alpha$ and minimize over $\alpha$ instead:

$$-s \le AA'\alpha + be - y \le s$$

Substitute $K(A,A')$ for $AA'$:

$$-s \le K(A,A')\alpha + be - y \le s$$

$K(A,A')$ is a nonlinear kernel function.
## Equivalent to the Smola, Schölkopf, Rätsch (SSR) Formulation

Our formulation, with the tolerance as a constraint and a single error bound (a sketch follows this comparison):

$$\min_{\alpha,b,s,\varepsilon,a} \; \tfrac{1}{m} e'a + \tfrac{C}{m} e's - C\mu\varepsilon \quad \text{s.t.} \quad -s \le K(A,A')\alpha + be - y \le s, \quad e\varepsilon \le s, \quad -a \le \alpha \le a$$

The Smola, Schölkopf, Rätsch formulation, with multiple error bounds:

$$\min_{\alpha_1,\alpha_2,b,\xi_1,\xi_2,\varepsilon} \; \tfrac{1}{m} e'(\alpha_1 + \alpha_2) + \tfrac{C}{m} e'(\xi_1 + \xi_2) + C(1-\mu)\varepsilon$$

$$\text{s.t.} \quad -\xi_2 - e\varepsilon \le K(A,A')(\alpha_1 - \alpha_2) + be - y \le \xi_1 + e\varepsilon, \quad \alpha_1, \alpha_2, \xi_1, \xi_2 \ge 0$$

Reduction in:

- Variables: 4m+2 → 3m+2
- Solution time
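A cvxpy transcription of our (MM) LP above, as a hedged sketch (variable names are mine; `K` is a precomputed kernel matrix such as `gaussian_kernel(A, A)`):

```python
import cvxpy as cp

def tolerant_svr_lp(K, y, C, mu):
    """min (1/m)e'a + (C/m)e's - C*mu*eps
       s.t. -s <= K@alpha + b*e - y <= s,  e*eps <= s,  -a <= alpha <= a."""
    m = K.shape[0]
    alpha = cp.Variable(m)
    a = cp.Variable(m)          # e'a equals ||alpha||_1 at the optimum
    b = cp.Variable()
    s = cp.Variable(m)
    eps = cp.Variable(nonneg=True)
    r = K @ alpha + b - y
    prob = cp.Problem(
        cp.Minimize(cp.sum(a) / m + C * cp.sum(s) / m - C * mu * eps),
        [-s <= r, r <= s, eps <= s, -a <= alpha, alpha <= a],
    )
    prob.solve()
    return alpha.value, b.value, eps.value
```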
## Natural Interpretation for μ

- $\mu = 0$: our linear program is equivalent to the classical stabilized least 1-norm approximation problem
  $$\min_{\alpha,b} \; \|\alpha\|_1 + C\|K(A,A')\alpha + be - y\|_1$$
- Perturbation theory results show there exists a fixed $\bar\mu \in (0,1]$ such that for all $\mu \in (0,\bar\mu]$:
  - we solve the above stabilized least 1-norm problem
  - additionally we maximize $\varepsilon$, the least error component
- As $\mu$ goes from 0 to 1, the least error component $\varepsilon$ is a monotonically nondecreasing function of $\mu$.
## Numerical Testing

Two sets of tests:

- Compare computational times of our method (MM) and the SSR method
- Row-column chunking for massive datasets

Datasets:

- US Census Bureau Adult dataset: 300,000 points in $R^{11}$
- Delve Comp-Activ dataset: 8,192 points in $R^{13}$
- UCI Boston Housing dataset: 506 points in $R^{13}$
- Gaussian noise was added to each of these datasets.

Hardware (Locop2, a Dell PowerEdge 6300 server):

- Four gigabytes of memory, 36 gigabytes of disk space
- Windows NT Server 4.0
- CPLEX 6.5 solver
## Experimental Process

$\mu$ is a parameter which needs to be determined experimentally. Use a hold-out tuning set to determine its optimal value. Algorithm (a Python sketch follows this slide):

    μ = 0
    while (tuning set accuracy continues to improve) {
        solve LP
        μ = μ + 0.1
    }

Run for both our method and the SSR method and compare times.
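The same loop as a runnable sketch (my framing; `solve_lp` and `tuning_error` are hypothetical callbacks standing in for the LP solve and the hold-out evaluation):

```python
def tune_mu(solve_lp, tuning_error, step=0.1):
    """Increase mu in steps while the tuning-set error keeps improving;
    return the last model that improved and its mu."""
    mu, best_err, best_model, best_mu = 0.0, float("inf"), None, 0.0
    while True:
        model = solve_lp(mu)
        err = tuning_error(model)
        if err >= best_err:      # accuracy stopped improving: done
            return best_model, best_mu
        best_err, best_model, best_mu = err, model, mu
        mu += step
```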
## Comparison Results

| Dataset        | Measure          | μ = 0  | μ = 0.1 | … μ = 0.7 | Total time (sec) | Time improvement |
|----------------|------------------|--------|---------|-----------|------------------|------------------|
| Census         | Tuning set error | 5.10%  | 4.74%   | …         |                  | Max: 79.7%       |
|                | ε                | 0.00   | 0.02    | …         |                  | Avg: 26.0%       |
|                | SSR time (sec)   | 980    | 935     | …         | 5086             |                  |
|                | MM time (sec)    | 199    | 294     | …         | 3765             |                  |
| Comp-Activ     | Tuning set error | 6.60%  | 6.32%   | …         |                  | Max: 65.7%       |
|                | ε                | 0.00   | 3.09    | …         |                  | Avg: 14.1%       |
|                | SSR time (sec)   | 1364   | 1286    | …         | 7604             |                  |
|                | MM time (sec)    | 468    | 660     | …         | 6533             |                  |
| Boston Housing | Tuning set error | 14.69% | 14.62%  | …         |                  | Max: 52.0%       |
|                | ε                | 0.00   | 0.42    | …         |                  | Avg: 17.6%       |
|                | SSR time (sec)   | 36     | 34      | …         | 170              |                  |
|                | MM time (sec)    | 17     | 23      | …         | 140              |                  |
## Linear Programming Row Chunking

- Basic approach (PSB/OLM) for classification problems.
- The classification problem is solved for a subset, or chunk, of constraints (data points).
- Constraints with positive multipliers are preserved and integrated into the next chunk (support vectors).
- The objective function is monotonically nondecreasing.
- The dataset is repeatedly scanned until the objective function stops increasing.
## Innovation: Simultaneous Row-Column Chunking

- Row chunking
  - Cannot handle problems with large numbers of variables
  - Therefore: linear kernel only
- Row-column chunking
  - New data increase the dimensionality of $K(A,A')$ by adding both rows and columns (variables) to the problem.
  - We handle this with row-column chunking.
  - Works with a general nonlinear kernel.
## Row-Column Chunking Algorithm

    while (problem termination criteria not satisfied) {
        choose set of rows as row chunk
        while (row chunk termination criteria not satisfied) {
            from row chunk, select set of columns
            solve LP allowing only these columns to vary
            add columns with nonzero values to next column chunk
        }
        add rows with nonzero multipliers to next row chunk
    }
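The same control flow as a Python skeleton (all callbacks are hypothetical stand-ins; the termination criteria and the restricted LP solve are left abstract):

```python
def row_column_chunking(choose_rows, choose_cols, solve_lp,
                        problem_done, chunk_done):
    """solve_lp(rows, cols) solves the LP over constraint rows `rows` with
    only the variables in `cols` free, returning
    (solution, nonzero_cols, support_rows)."""
    row_chunk = choose_rows(set())       # initial row chunk
    col_chunk, solution, support_rows = set(), None, set()
    while not problem_done(solution):
        while not chunk_done(solution):
            cols = col_chunk | choose_cols(row_chunk)
            solution, nonzero_cols, support_rows = solve_lp(row_chunk, cols)
            col_chunk = nonzero_cols     # columns with nonzero values carry over
        # rows with nonzero multipliers carry over, plus a fresh chunk of rows
        row_chunk = support_rows | choose_rows(row_chunk)
    return solution
```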
## Row-Column Chunking Diagram

[Series of diagrams animating how the row and column chunks grow over the constraint matrix.]
## Chunking Experimental Results

| Quantity                                       | Value                                              |
|------------------------------------------------|----------------------------------------------------|
| Dataset                                        | 16,000-point subset of Census in $R^{11}$, + noise |
| LP size                                        | 32,000 nonsparse rows and columns                  |
| Problem size                                   | 1.024 billion nonzero values                       |
| Time to termination                            | 18.8 days                                          |
| Number of SVs                                  | 1,621 support vectors                              |
| Solution variables                             | 33 nonzero components                              |
| Final tuning set error                         | 9.8%                                               |
| Tuning set error on first chunk (1,000 points) | 16.2%                                              |
## Objective Value & Tuning Set Error for Billion-Element Matrix

[Two plots versus row-column chunk iteration number (elapsed time 0 to 18 days): the objective value increases monotonically and levels off, while the tuning set error decreases from about 16% to under 10%.]
## Conclusions and Future Work

Conclusions:

- Robust regression can be modeled simply and solved efficiently.
- Tolerant regression can be handled more efficiently using improvements on previous formulations.
- Row-column chunking is a new approach which can handle massive regression problems.

Future work:

- Chunking via parallel and distributed approaches
- Scaling Huber regression to larger problems
## Questions?
## LP Perturbation Regime #1

Our LP is given by:

$$\min_{\alpha,b,s,\varepsilon,a} \; \tfrac{1}{m} e'a + \tfrac{C}{m} e's - C\mu\varepsilon \quad \text{s.t.} \quad -s \le K(A,A')\alpha + be - y \le s, \quad e\varepsilon \le s, \quad -a \le \alpha \le a$$

When $\mu = 0$, the solution is the stabilized least 1-norm solution. Therefore, by LP perturbation theory, there exists a $\bar\mu \in (0,1]$ such that the solution to the LP with $\mu \in (0,\bar\mu]$ is a solution to the least 1-norm problem that also maximizes $\varepsilon$.
## LP Perturbation Regime #2

Our LP can be rewritten as:

$$\min_{\alpha,b,\varepsilon,d} \; \tfrac{1}{m}\|\alpha\|_1 + \tfrac{C}{m} e'\left(|d| - e\varepsilon\right)_+ + C(1-\mu)\varepsilon \quad \text{s.t.} \quad d = K(A,A')\alpha + be - y$$

Similarly, by LP perturbation theory, there exists a $\underline\mu \in [0,1)$ such that the solution to the LP with $\mu \in [\underline\mu, 1)$ is the solution that minimizes the least error ($\varepsilon$) among all minimizers of the average tolerated error.
## Motivation for Dual Variable Substitution

Primal:

$$\min_{w,b,s} \; \nu e's + \tfrac{1}{2}\|w\|_2^2 \quad \text{s.t.} \quad -s \le Aw + be - y \le s$$

Dual:

$$\min_{u,v} \; \tfrac{1}{2}\|A'(u-v)\|_2^2 + y'(u-v) \quad \text{s.t.} \quad e'u = e'v, \quad u + v = \nu e, \quad u, v \ge 0$$

$$w = A'\alpha = A'(u-v)$$
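A one-step justification of the substitution (my addition, using standard Lagrangian stationarity): attach multipliers $u \ge 0$ to the constraint $y - Aw - be \le s$ and $v \ge 0$ to $Aw + be - y \le s$, so the Lagrangian is

$$L = \nu e's + \tfrac{1}{2} w'w + u'(y - Aw - be - s) + v'(Aw + be - y - s).$$

Setting its gradients to zero recovers the dual constraints and the substitution:

$$\nabla_w L = w - A'(u-v) = 0 \;\Rightarrow\; w = A'(u-v), \qquad \nabla_b L = e'(v-u) = 0, \qquad \nabla_s L = \nu e - (u+v) = 0.$$

This is why setting $w = A'\alpha$ (here $\alpha = u - v$) in the kernel formulations loses no generality.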
