
Support Vector Regression

David R. Musicant and O.L. Mangasarian
      International Symposium on
      Mathematical Programming
       Thursday, August 10, 2000
      http://www.cs.wisc.edu/~musicant
Outline
- Robust Regression
  – Huber M-Estimator loss function
  – New quadratic programming formulation
  – Numerical comparisons
  – Nonlinear kernels
- Tolerant Regression
  – New formulation of Support Vector Regression (SVR)
  – Numerical comparisons
  – Massive regression: row-column chunking
- Conclusions & Future Work
Focus 1:
Robust Regression
a.k.a. Huber Regression

[Figure: Huber loss function, quadratic for |t| ≤ γ and linear beyond ±γ]
“Standard” Linear Regression

[Figure: data points and fitted line ŷ = Aw + be, plotted as y vs. A]
- Find w, b such that:  Aw + be ≈ y
- m points in Rⁿ, represented by an m × n matrix A.
- y in Rᵐ is the vector to be approximated.
Optimization problem
- Find w, b such that:  Aw + be ≈ y
- Bound the error by s:

    −s ≤ Aw + be − y ≤ s

- Minimize the error:

    min_{w,b,s}  ‖s‖₂²
    subject to   −s ≤ Aw + be − y ≤ s

- Traditional approach: minimize the squared error.
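For concreteness (this is not from the talk), a minimal NumPy sketch of the squared-error fit above, assuming A is an m × n array and y an m-vector:

import numpy as np

def least_squares_fit(A, y):
    """Fit y ≈ Aw + be by minimizing the squared error (ordinary least squares)."""
    m = A.shape[0]
    # Append a column of ones so the intercept b is estimated along with w.
    A1 = np.hstack([A, np.ones((m, 1))])
    coef, *_ = np.linalg.lstsq(A1, y, rcond=None)
    return coef[:-1], coef[-1]   # w, b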
Examining the loss function
- Standard regression uses a squared-error loss function.
  – Points which are far from the predicted line (outliers) are overemphasized.

[Figure: squared-error loss function]
Alternative loss function
- Instead of squared error, try the absolute value of the error:

    min_{w,b,s}  e′s
    subject to   −s ≤ Aw + be − y ≤ s

  (e denotes the vector of ones, so e′s is the sum of the componentwise errors.)

[Figure: absolute-value loss function]
- This is the 1-norm loss function.
1-Norm Problems And Solution
- The 1-norm loss overemphasizes error on points close to the predicted line.
- Solution: the Huber loss function, a hybrid approach (quadratic near zero, linear in the tails).

[Figure: Huber loss function, quadratic center with linear tails]

Many practitioners prefer the Huber loss function.
Mathematical Formulation
- γ indicates the switchover from quadratic to linear:

    ρ(t) = t²/2           if |t| ≤ γ
    ρ(t) = γ|t| − γ²/2    if |t| > γ

- Larger γ means “more quadratic.”

[Figure: Huber loss with switchover points at ±γ]
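A minimal sketch (not from the slides) of this piecewise loss in NumPy, with gamma playing the role of γ:

import numpy as np

def huber_loss(t, gamma):
    """Huber loss: quadratic for |t| <= gamma, linear beyond that."""
    t = np.asarray(t, dtype=float)
    quadratic = 0.5 * t**2
    linear = gamma * np.abs(t) - 0.5 * gamma**2
    return np.where(np.abs(t) <= gamma, quadratic, linear)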
Regression Approach Summary
- Quadratic loss function
  – Standard method in statistics
  – Over-emphasizes outliers
- Linear loss function (1-norm)
  – Formulates well as a linear program
  – Over-emphasizes small errors
- Huber loss function (hybrid approach)
  – Appropriate emphasis on large and small errors

[Figure: the three loss functions side by side]
Previous attempts complicated
- Earlier efforts to solve Huber regression:
  – Huber: Gauss–Seidel method
  – Madsen/Nielsen: Newton method
  – Li: conjugate gradient method
  – Smola: dual quadratic program
- Our new approach: a convex quadratic program

    min_{w,z,t}  ½‖z‖₂² + γ·e′t
    subject to   z − t ≤ Aw + be − y ≤ z + t

- Our new approach is simpler and faster.
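A minimal sketch of this convex QP, assuming the CVXPY modeling library (this is an illustration, not the authors' code):

import cvxpy as cp

def huber_regression_qp(A, y, gamma):
    """Solve min 0.5*||z||^2 + gamma*e't  s.t.  z - t <= Aw + be - y <= z + t."""
    m, n = A.shape
    w, b = cp.Variable(n), cp.Variable()
    z, t = cp.Variable(m), cp.Variable(m)
    residual = A @ w + b - y
    objective = cp.Minimize(0.5 * cp.sum_squares(z) + gamma * cp.sum(t))
    constraints = [residual <= z + t, residual >= z - t]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value

At the optimum, z carries the quadratic part of each residual while t absorbs the excess beyond it, which is what ties the objective to the Huber loss.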
Experimental Results: Census20k
[Bar chart: CPU time (sec) for γ = 1.345, 1, and 0.1, comparing Li, Madsen/Nielsen, Huber, Smola, and MM (ours). Dataset: 20,000 points, 11 features. MM is faster.]
Experimental Results: CPUSmall
[Bar chart: CPU time (sec) for γ = 1.345, 1, and 0.1, comparing Li, Madsen/Nielsen, Huber, Smola, and MM (ours). Dataset: 8,192 points, 12 features. MM is faster.]
Introduce nonlinear kernel
- Begin with the previous formulation:

    min_{w,z,t}  ½‖z‖₂² + γ·e′t
    subject to   z − t ≤ Aw + be − y ≤ z + t

- Substitute w = A′α and minimize over α instead:

    z − t ≤ AA′α + be − y ≤ z + t

- Substitute K(A,A′) for AA′:

    z − t ≤ K(A,A′)α + be − y ≤ z + t
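For concreteness, a minimal sketch (an assumption, not taken from the talk) of one common choice for K(A,A′), the Gaussian radial basis kernel used in the experiments:

import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """K[i, j] = exp(-||A_i - B_j||^2 / (2 * sigma^2))."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-sq_dists / (2.0 * sigma**2))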
Nonlinear results

Dataset          Kernel     Training Accuracy   Testing Accuracy
CPUSmall         Linear     94.50%              94.06%
CPUSmall         Gaussian   97.26%              95.90%
Boston Housing   Linear     85.60%              83.81%
Boston Housing   Gaussian   92.36%              88.15%

Nonlinear kernels improve accuracy.
Focus 2:
Support Vector Tolerant Regression
Regression Approach Summary
- Quadratic loss function
  – Standard method in statistics
  – Over-emphasizes outliers
- Linear loss function (1-norm)
  – Formulates well as a linear program
  – Over-emphasizes small errors
- Huber loss function (hybrid approach)
  – Appropriate emphasis on large and small errors

[Figure: the three loss functions side by side]
Optimization problem
- Find w, b such that:  Aw + be ≈ y
- Bound the error by s:

    −s ≤ Aw + be − y ≤ s

- Minimize the error:

    min_{w,b,s}  e′s
    subject to   −s ≤ Aw + be − y ≤ s

- Minimize the magnitude of the error.
The overfitting issue
- Noisy training data can be fitted “too well”
  – leads to poor generalization on future data

[Figure: overfitted curve ŷ = Aw + be through noisy points, plotted as y vs. A]

- Prefer simpler regressions, i.e. where
  – some w coefficients are zero
  – the line is “flatter”
Reducing overfitting
- To achieve both goals, also minimize the magnitude of the w vector:

    min_{w,b,s}  ‖w‖₁ + C·e′s
    subject to   −s ≤ Aw + be − y ≤ s

- C is a parameter that balances the two goals
  – chosen by experimentation
- Reduces overfitting due to points far from the surface
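A minimal sketch of this 1-norm-regularized linear program, again assuming CVXPY (not the authors' code):

import cvxpy as cp

def l1_regularized_fit(A, y, C):
    """Solve min ||w||_1 + C*e's  s.t.  -s <= Aw + be - y <= s."""
    m, n = A.shape
    w, b, s = cp.Variable(n), cp.Variable(), cp.Variable(m)
    residual = A @ w + b - y
    objective = cp.Minimize(cp.norm(w, 1) + C * cp.sum(s))
    constraints = [residual <= s, residual >= -s]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value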
Overfitting again: “close” points
- “Close” points may be wrong due to noise only
  – The line should be influenced by “real” data, not noise

[Figure: ε-tube around the fitted line ŷ = Aw + be]

- Ignore errors from those points which are close!
Tolerant regression
- Allow an interval of size ε with uniform error:

    min_{w,b,s,ε}  ‖w‖₁ + C·e′s
    subject to     −s ≤ Aw + be − y ≤ s
                   eε ≤ s

- How large should ε be?
  – As large as possible, while preserving accuracy:

    min_{w,b,s,ε}  (1/m)·‖w‖₁ + (C/m)·e′s − Cμ·ε
    subject to     −s ≤ Aw + be − y ≤ s
                   eε ≤ s
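A minimal sketch of this tolerant-regression LP, again assuming CVXPY rather than the authors' implementation:

import cvxpy as cp

def tolerant_regression(A, y, C, mu):
    """Solve min (1/m)||w||_1 + (C/m)e's - C*mu*eps
       s.t. -s <= Aw + be - y <= s  and  eps <= s componentwise."""
    m, n = A.shape
    w, b, s = cp.Variable(n), cp.Variable(), cp.Variable(m)
    eps = cp.Variable()   # interval half-width; nonnegative at the optimum since eps <= s
    residual = A @ w + b - y
    objective = cp.Minimize(cp.norm(w, 1) / m + C * cp.sum(s) / m - C * mu * eps)
    constraints = [residual <= s, residual >= -s, s >= eps]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value, eps.value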
How about a nonlinear surface?

[Figure: data better fit by a nonlinear regression surface]
Introduce nonlinear kernel
- Begin with the previous formulation:

    min_{w,b,s,ε}  (1/m)·‖w‖₁ + (C/m)·e′s − Cμ·ε
    subject to     −s ≤ Aw + be − y ≤ s
                   eε ≤ s

- Substitute w = A′α and minimize over α instead:

    −s ≤ AA′α + be − y ≤ s

- Substitute K(A,A′) for AA′:

    −s ≤ K(A,A′)α + be − y ≤ s

- K(A,A′) = nonlinear kernel function
Equivalent to Smola, Schölkopf, Rätsch (SSR) Formulation

Our formulation:

    min_{α,b,s,ε,a}  (1/m)·e′a + (C/m)·e′s − Cμ·ε
    subject to       −s ≤ K(A,A′)α + be − y ≤ s    (single error bound s)
                     eε ≤ s                         (tolerance as a constraint)
                     −a ≤ α ≤ a
Smola, Schölkopf, Rätsch formulation:

    min_{α₁,α₂,b,ξ₁,ξ₂,ε}  (1/m)·e′(α₁ + α₂) + (C/m)·e′(ξ₁ + ξ₂) + C(1 − μ)·ε
    subject to             −ξ₂ − eε ≤ K(A,A′)(α₁ − α₂) + be − y ≤ ξ₁ + eε
                           α₁, α₂, ξ₁, ξ₂ ≥ 0    (multiple error bounds)

Compared with SSR, our formulation gives a reduction in:
- Variables: 4m+2 → 3m+2
- Solution time
Natural interpretation for μ
- μ = 0: our linear program is equivalent to the classical stabilized least 1-norm approximation problem:

    min_{α,b}  ‖α‖₁ + C·‖K(A,A′)α + be − y‖₁

- Perturbation theory results show there exists a fixed μ̄ ∈ (0, 1] such that for all μ ∈ (0, μ̄]:
  – we solve the above stabilized least 1-norm problem
  – additionally we maximize ε, the least error component
- As μ goes from 0 to 1, the least error component ε is a monotonically nondecreasing function of μ.
Numerical Testing
- Two sets of tests
  – Compare computational times of our method (MM) and the SSR method
  – Row-column chunking for massive datasets
- Datasets:
  – US Census Bureau Adult Dataset: 300,000 points in R¹¹
  – Delve Comp-Activ Dataset: 8,192 points in R¹³
  – UCI Boston Housing Dataset: 506 points in R¹³
  – Gaussian noise was added to each of these datasets.
- Hardware: Locop2, a Dell PowerEdge 6300 server with:
  – Four gigabytes of memory, 36 gigabytes of disk space
  – Windows NT Server 4.0
  – CPLEX 6.5 solver
Experimental Process
- μ is a parameter which needs to be determined experimentally
- Use a hold-out tuning set to determine the optimal value for μ
- Algorithm:

    μ = 0
    while (tuning set accuracy continues to improve)
    {
        Solve LP
        μ = μ + 0.1
    }

- Run for both our method and the SSR method and compare times
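A minimal Python sketch of this tuning loop; solve_lp and tuning_error are caller-supplied callables standing in for the LP solve and the hold-out error measurement (illustrative assumptions, not the authors' interfaces):

def tune_mu(solve_lp, tuning_error, step=0.1):
    """Increase mu in steps of `step` while the tuning-set error keeps improving."""
    mu, best_mu, best_err = 0.0, 0.0, float("inf")
    while True:
        model = solve_lp(mu)          # hypothetical: solve the LP at this mu
        err = tuning_error(model)     # hypothetical: error on the hold-out tuning set
        if err >= best_err:           # accuracy stopped improving, so stop
            return best_mu
        best_mu, best_err = mu, err
        mu += step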
Comparison Results
(values are shown for μ = 0 and μ = 0.1; the original table continued in steps of 0.1 up to μ = 0.7)

Census:
    Tuning set error:  5.10% (μ = 0),  4.74% (μ = 0.1)
    ε:                 0.00,  0.02
    SSR time (sec):    980,  935      (total 5086)
    MM time (sec):     199,  294      (total 3765)
    Time improvement:  max 79.7%, avg 26.0%

Comp-Activ:
    Tuning set error:  6.60%,  6.32%
    ε:                 0.00,  3.09
    SSR time (sec):    1364,  1286    (total 7604)
    MM time (sec):     468,  660      (total 6533)
    Time improvement:  max 65.7%, avg 14.1%

Boston Housing:
    Tuning set error:  14.69%,  14.62%
    ε:                 0.00,  0.42
    SSR time (sec):    36,  34        (total 170)
    MM time (sec):     17,  23        (total 140)
    Time improvement:  max 52.0%, avg 17.6%
Linear Programming Row Chunking
- Basic approach (PSB/OLM) for classification problems
- The classification problem is solved for a subset, or chunk, of constraints (data points)
- Constraints with positive multipliers are preserved and integrated into the next chunk (support vectors)
- The objective function is monotonically nondecreasing
- The dataset is repeatedly scanned until the objective function stops increasing
Innovation: Simultaneous Row-Column Chunking
- Row chunking
  – Cannot handle problems with large numbers of variables
  – Therefore: linear kernel only
- Row-column chunking
  – New data increase the dimensionality of K(A,A′) by adding both rows and columns (variables) to the problem.
  – We handle this with row-column chunking.
  – General nonlinear kernel
Row-Column Chunking Algorithm

    while (problem termination criteria not satisfied)
    {
        choose set of rows as row chunk
        while (row chunk termination criteria not satisfied)
        {
            from row chunk, select set of columns
            solve LP allowing only these columns to vary
            add columns with nonzero values to next column chunk
        }
        add rows with nonzero multipliers to next row chunk
    }
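A minimal Python skeleton of the loop above, assuming caller-supplied callables (choose_rows, choose_cols, solve_lp) for the chunk-selection and restricted-LP steps; these names are illustrative, not the authors' interfaces:

def row_column_chunking(choose_rows, choose_cols, solve_lp):
    """Skeleton of simultaneous row-column chunking.

    choose_rows(previous_rows) returns the next row chunk;
    choose_cols(rows, cols) returns candidate columns;
    solve_lp(rows, cols) solves the restricted LP and returns
    (nonzero_cols, nonzero_multiplier_rows, objective).
    """
    row_chunk, col_chunk = choose_rows(set()), set()
    best_obj = float("-inf")
    while True:
        # Inner loop: let only the selected columns vary; keep columns that end up nonzero.
        while True:
            cols = col_chunk | choose_cols(row_chunk, col_chunk)
            nonzero_cols, nonzero_rows, obj = solve_lp(row_chunk, cols)
            if nonzero_cols <= col_chunk:
                break
            col_chunk |= nonzero_cols
        # Outer loop: the objective is nondecreasing; stop when it no longer increases.
        if obj <= best_obj:
            return row_chunk, col_chunk, obj
        best_obj = obj
        # Carry rows with nonzero multipliers into the next row chunk.
        row_chunk = set(nonzero_rows) | choose_rows(row_chunk)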
Row-Column Chunking Diagram

[Series of seven diagram slides illustrating successive row and column chunks]
Chunking Experimental Results

Dataset:                     16,000-point subset of Census in R¹¹ + noise
Kernel:                      Gaussian radial basis kernel
LP size:                     32,000 nonsparse rows and columns
Problem size:                1.024 billion nonzero values
Time to termination:         18.8 days
Number of SVs:               1,621 support vectors
Solution variables:          33 nonzero components
Final tuning set error:      9.8%
Tuning set error on first
chunk (1,000 points):        16.2%
Objective Value & Tuning Set Error for Billion-Element Matrix

[Two plots against row-column chunk iteration number (0–2000 iterations; roughly 0–18 days): left, objective value on a 0–25,000 scale; right, tuning set error on an 8%–20% scale]
Conclusions and Future Work
- Conclusions
  – Robust regression can be modeled simply and efficiently as a quadratic program
  – Tolerant regression can be handled more efficiently using improvements on previous formulations
  – Row-column chunking is a new approach which can handle massive regression problems
- Future work
  – Chunking via parallel and distributed approaches
  – Scaling Huber regression to larger problems
Questions?
LP Perturbation Regime #1
- Our LP is given by:

    min_{α,b,s,ε,a}  (1/m)·e′a + (C/m)·e′s − Cμ·ε
    subject to       −s ≤ K(A,A′)α + be − y ≤ s
                     eε ≤ s
                     −a ≤ α ≤ a

- When μ = 0, the solution is the stabilized least 1-norm solution.
- Therefore, by LP perturbation theory, there exists a μ̄ ∈ (0, 1] such that:
  – The solution to the LP with μ ∈ (0, μ̄] is a solution to the least 1-norm problem that also maximizes ε.
LP Perturbation Regime #2
- Our LP can be rewritten as:

    min_{α,b,ε,d}  (1/m)·‖α‖₁ + (C/m)·e′(|d| − eε)₊ + C(1 − μ)·ε
    where          d = K(A,A′)α + be − y

  ((·)₊ denotes the plus function, the componentwise maximum with zero.)

- Similarly, by LP perturbation theory, there exists a μ̲ ∈ [0, 1) such that:
  – The solution to the LP with μ ∈ [μ̲, 1) is the solution that minimizes the least error component (ε) among all minimizers of the average tolerated error.
Motivation for dual variable substitution
- Primal:

    min_{w,b,s}  ν·e′s + ½‖w‖₂²
    subject to   −s ≤ Aw + be − y ≤ s

- Dual:

    min_{u,v}  ½‖A′(u − v)‖₂² + y′(u − v)
    subject to  e′u = e′v
                u + v = νe
                u, v ≥ 0

- The dual solution gives  w = A′α = A′(u − v),  which motivates the substitution w = A′α.

								