Predictive Learning from Data


  LECTURE SET 7
Methods for Regression


              Electrical and Computer Engineering
                                             1
          OUTLINE of Set 7
•   Objectives
    - introduce taxonomy of methods for regression;
    - describe several representative nonlinear methods;
    - empirical comparisons illustrating advantages and limitations
    of these methods
•   Methods taxonomy
•   Linear methods
•   Adaptive dictionary methods
•   Kernel methods and local risk minimization
•   Empirical comparisons
•   Combining methods
•   Summary and discussion

                                                                  2
         Motivation and issues
•   Importance of regression for implementation of
    - classification
    - density estimation
•   Estimation of a real-valued function when data
    (x, y) is generated as  y = g(x) + noise

•   Major issues for regression
    - parameterization (representation) of f(x,w)
    - optimization formulation (~ empirical loss)
    - complexity control (model selection)
•   These issues are inter-related

                                                     3
    Loss function and noise model
•    Fundamental problem: how to distinguish
     between true signal and noise?
            y = g(x) + noise
•    Classical statistical view
     - noise density p(noise) is known
     → the statistically optimal loss function in the
     maximum likelihood sense is
        L(y, f(x, w)) = -log p(y - f(x, w))
     → for Gaussian noise, use squared loss
     (MSE) as the empirical loss function
                                                4
    Loss functions for linear regression
•    Consider linear regression only
           f(x, w) = w0 + w · x
•    Several unimodal noise models:
     - Gaussian, Laplacian, unimodal
•    Statistical view:
     - Optimal loss for known noise density
     - asymptotic setting
     - robust strategies when noise model unknown
•    Practical situations
     - noise model unknown
     - finite (sparse) sample setting

                                                    5
(a) Linear loss for Laplacian noise       (b) Squared loss for Gaussian noise

  [Figure: the two loss functions plotted against the residual y - f(x, w) over the range -3 to 3.]
                                                                         6
ε-insensitive loss (SVM) has a common-sense interpretation.
The optimal ε depends on the noise level and the sample size.


  [Figure: ε-insensitive loss plotted against the residual; the loss is zero inside
   the interval (-ε, +ε) and increases linearly outside it.]

                                                        7
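As a concrete illustration of the loss above, here is a minimal numpy sketch of the ε-insensitive loss (the function name and the value of eps in the usage line are my own choices, not part of the lecture):

```python
import numpy as np

def eps_insensitive_loss(y, f, eps):
    """SVM eps-insensitive loss: zero inside the tube |y - f| <= eps,
    growing linearly outside it."""
    return np.maximum(np.abs(y - f) - eps, 0.0)

# residuals inside the eps-tube contribute no loss
residuals = np.array([-2.0, -0.3, 0.0, 0.4, 1.5])
print(eps_insensitive_loss(residuals, 0.0, eps=0.5))  # [1.5 0.  0.  0.  1. ]
```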
Comparison for low-dimensional data:
     g1(x) = x1 + x2,   x ∈ [0,1]^2,   noise σ = 2,   n = 30
  [Figure: prediction error of OLS, LM, and SVM estimates under Gaussian noise (left)
   and Laplacian noise (right).]
                                                                 8
Comparison for high-dimensional data:
g4(x) = x1 + x2 + ... + x20,   x ∈ [0,1]^20,   noise σ = 1,   n = 30
  [Figure: prediction error of OLS, LM, and SVM estimates under Gaussian noise (left)
   and Laplacian noise (right).]
                                                                   9
           Methods’ Taxonomy
•   Recall implementation of SRM:
    - fix complexity (VC-dimension)
    - minimize empirical risk (squared-loss)
•   Two interrelated issues:
    - parameterization (of possible models)
    - optimization method
•   Taxonomy will be based on
    - parameterization: dictionary vs kernel
    - flexibility: non-adaptive vs adaptive
                                               10
• Dictionary representation:   f_m(x, w, V) = Σ_{i=0}^{m} w_i g(x, v_i)

Two possibilities:
•   Linear (non-adaptive) methods
    ~ predetermined (fixed) basis functions g_i(x)
    → only the parameters w_i have to be estimated,
    via standard optimization methods (linear least squares)
Examples: linear regression, polynomial regression,
          linear classifiers, quadratic classifiers
• Nonlinear (adaptive) methods
    ~ basis functions g(x, v_i) depend on the training data
Possibilities: nonlinear basis functions (in parameters v_i), e.g. MLP;
               feature selection (e.g. wavelet denoising)
                                                                11
 Example Nonlinear Parameterizations
• Basis functions of the form  g_i(x) = g(x · v_i + b_i),
  e.g. the sigmoid, a.k.a. logistic, function

      s(t) = 1 / (1 + exp(-t))




    - commonly used in artificial neural networks
    - combination of sigmoids ~ universal approximator
                                                         12
      Neural Network Representation
•   MLP or RBF networks:
         f_m(x, w, V) = Σ_{i=0}^{m} w_i g(x, v_i),    i.e.   ŷ = Σ_{j=1}^{m} w_j z_j

    [Figure: two-layer network diagram: inputs x1 ... xd, hidden-unit outputs
     z_j = g(x, v_j) with parameter matrix V (d × m), and output weights W (m × 1).]

    - dimensionality reduction
    - universal approximation property – see example at
    http://www.mathworks.com/products/demos/nnettlbx/radial/index.html

                                                                                     13
               Kernel Methods
•   Model estimated as   f(x) = Σ_{i=1}^{n} K_i(x, x_i) y_i
    where the symmetric kernel function is
    - non-negative:  K(x, x') ≥ 0
    - radially symmetric:  K(x, x') = K(||x - x'||)
    - monotonically decreasing with t = ||x - x'||:   lim_{t→∞} K(t) = 0
•   Duality between dictionary and kernel representation:
    model ~ weighted combination of basis functions
    model ~ weighted combination of output values
•   Selection of kernel functions Ki x,xi 
    non-adaptive ~ depends only on x-values
    adaptive ~ depends on y-values of training data
Note: kernel methods may require local complexity control
                                                                14
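A minimal sketch of a non-adaptive kernel estimate of the form above, assuming a Gaussian kernel with a fixed user-chosen width (the lecture does not prescribe a specific kernel here):

```python
import numpy as np

def kernel_regression(x_train, y_train, x_query, width):
    """Kernel estimate f(x) = sum_i K_i(x, x_i) * y_i, with Gaussian kernel
    weights normalized to sum to one at each query point."""
    d2 = ((x_query[:, None, :] - x_train[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-d2 / (2.0 * width ** 2))
    K /= K.sum(axis=1, keepdims=True)   # normalize weights per query point
    return K @ y_train

# toy usage: noisy 1-D sine data, estimates at 5 query points
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=(30, 1))
y = np.sin(2 * np.pi * x[:, 0]) + 0.1 * rng.standard_normal(30)
print(kernel_regression(x, y, np.linspace(0, 1, 5).reshape(-1, 1), width=0.1))
```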
               OUTLINE
•   Objectives
•   Methods taxonomy
•   Linear methods
    Estimation of linear models
    Equivalent Representations
    Non-adaptive methods
    Application Example
•   Adaptive dictionary methods
•   Kernel methods and local risk minimization
•   Empirical comparisons
•   Combining methods
•   Summary and discussion
                                                 15
        Estimation of Linear Models
Dictionary representation:   f_m(x, w) = Σ_{i=0}^{m} w_i g_i(x)

•   Parameters w estimated via least-squares
•   Denote the training data as a matrix X = [x1, ..., xn]
    and the vector of response values y = (y1, ..., yn)
•   OLS solution ~ solving the matrix equation  Zw = y
    by minimizing  R_emp(w) = (1/n) || Zw - y ||^2

    where    Z = [ g1(x1) ... gm(x1) ]
                 [ ................. ]   =  [ g1(X)  g2(X) ... gm(X) ]
                 [ g1(xn) ... gm(xn) ]
                                                                   16
    Estimation of Linear Models (cont’d)
•     A solution exists if the columns of Z are linearly
      independent (m < n):   Zw = y
•     Solving the normal equation   Z^T Z w = Z^T y
      yields the OLS solution   w* = (Z^T Z)^{-1} Z^T y

•     Similar math holds for penalized OLS, where
           R_pen(w) = (1/n) || Zw - y ||^2 + λ w^T w
      with solution   w* = (Z^T Z + λ I)^{-1} Z^T y
      (see the numpy sketch below)
                                                                    17
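A minimal numpy sketch of the OLS and penalized (ridge) solutions above; the quadratic polynomial basis used in the toy example is my own choice:

```python
import numpy as np

def ols_fit(Z, y, lam=0.0):
    """Solve the (penalized) normal equations (Z'Z + lam*I) w = Z'y.
    lam = 0 gives ordinary least squares; lam > 0 gives the ridge solution."""
    return np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)

# toy usage: basis matrix Z = [1, x, x^2] for a quadratic target
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 40)
y = 1.0 + 2.0 * x - 0.5 * x**2 + 0.1 * rng.standard_normal(40)
Z = np.column_stack([np.ones_like(x), x, x**2])
print(ols_fit(Z, y))            # OLS estimate w*
print(ols_fit(Z, y, lam=0.1))   # penalized (ridge) estimate
```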
         Equivalent Representation
For the dictionary representation   f_m(x, w, V) = Σ_{i=0}^{m} w_i g(x, v_i)

the OLS solution is   ŷ = Z w* = S y

where the n × n matrix   S = Z (Z^T Z)^{-1} Z^T
is a projection matrix: it projects y onto the column space of Z, so that ŷ = Z w* = S y.

  [Figure: geometric interpretation: y is projected onto the column space of Z.]

Matrix S ~ 'equivalent kernel' of the OLS model w*:
              S(x, x_i) = g(x) (Z^T Z)^{-1} g^T(x_i)
                                                                               18
  Equivalent Representation (cont’d)
• Equivalent kernel:   ŷ = Z w* = S y,
  S(x, x_i) = g(x) (Z^T Z)^{-1} g^T(x_i)   (may not be local)

•   Equivalent 'kernels' of a 3rd-degree polynomial:




                                                  19
    Equivalent BFs for Symmetric Kernel
•   Eigenfunction decomposition of a kernel:
         K(x, x') = Σ_{i=1}^{∞} e_i g_i(x) g_i(x')   →   f(x) = Σ_i w_i g_i(x)
•   The eigenvalues tend to fall off rapidly with i

    [Figure: 4 basis functions (eigenfunctions) for the kernel
     g(t) = exp( -t^2 / (2 · 0.55^2) ); the corresponding eigenvalues fall off
     rapidly: e1 = 1.0, e2 = 0.45, e3 = 0.10, e4 = 0.02.]
                                                                                   20
    Equivalent Representation: summary
• Equivalence of representations is due to the
  duality of the OLS solution:
                    ŷ = Z w* = S y
•   Equivalent 'kernels' are just math artifacts (they may be
    non-local). Notational distinction: K vs S
• Practical use of matrix S for:
  - analytic form of LOO cross-validation
  - estimating model complexity for penalized
  linear estimators (ridge regression)
                                                     21
               Estimating Complexity
•   A linear estimator is specified via its matrix S. Its complexity ~
    the number of parameters m of an equivalent linear estimator

        var(ŷ_i) = E[ (ŷ_i - E ŷ_i)^2 ] = E[ (s_i y - E(s_i y))^2 ]
                 = E[ (s_i (y - E y))^2 ] = σ^2 s_i s_i^T

     → average variance of the training-data estimates:
           var(ŷ) = (σ^2 / n) trace(S S^T)

•   Consider an equivalent linear estimator whose matrix S̃
    is a symmetric projection of rank m:
         trace(S̃ S̃^T) = trace(S̃) = rank(S̃) = m
    so the average variance is   var(ŷ) = σ^2 m / n

 → effective DoF of an estimator with matrix S is   DoF = trace(S S^T)
                                                                       22
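A short numpy sketch of the effective-DoF formula above: for an OLS projection matrix S, trace(S S^T) recovers the number of basis functions m (the cubic polynomial basis in the example is an illustrative assumption):

```python
import numpy as np

def effective_dof(Z):
    """Effective degrees of freedom trace(S S^T) of the linear estimator
    y_hat = S y, with S = Z (Z'Z)^{-1} Z'."""
    S = Z @ np.linalg.solve(Z.T @ Z, Z.T)
    return np.trace(S @ S.T)

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 50)
Z = np.column_stack([np.ones_like(x), x, x**2, x**3])  # cubic polynomial, m = 4
print(effective_dof(Z))   # ~ 4.0
```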
          Non-adaptive methods

•   Dictionary representation   f_m(x, w, V) = Σ_{i=0}^{m} w_i g(x, v_i),
    where the basis functions g(x, v_i) depend only on x-values




•   Representative methods include:
    - local polynomials (splines) from statistics
    where parameters   v i are knot locations
    - RBF networks from neural networks
    where parameters v i are RBF center and width
    Only non-adaptive implementation of RBF
    will be considered
                                                            23
       Local polynomials and splines
•   Problem setting: data interpolation (univariate)
    problems with global polynomials → local low-order polynomials
    knot location strategies: a subset of training samples, or
    uniformly spaced in the x-domain




                                                           24
         RBF Networks for Regression
•    RBF networks:

          f_m(x, w) = Σ_{j=1}^{m} w_j g( ||x - v_j|| / σ_j ) + w_0,    i.e.   ŷ = Σ_{j=1}^{m} w_j z_j

      → typically local basis functions

      [Figure: RBF network diagram: inputs x1 ... xd, hidden-unit outputs z_j = g(x, v_j),
       parameter matrix V (d × m), output weights W (m × 1).]

•    Training ~ estimating
      - parameters of the basis functions (centers and widths)
      - linear weights W
      via either
      - a non-adaptive implementation (described next), or
      - an adaptive implementation
                                                                                          25
     Non-adaptive RBF training algorithm
1. Choose the number of basis functions
   (centers) m.
2. Estimate centers v_j using x-values of the training data
   via unsupervised training (SOM, GLA, clustering, etc.)
3. Determine width parameters σ_j using the heuristic:
   For a given center v_j,
   (a) find the distance to the closest center:
          r_j = min_{k ≠ j} || v_k - v_j ||
   (b) set the width parameter σ_j = α r_j,
   where parameter α controls the degree of overlap between
   adjacent basis functions. Typically 1 ≤ α ≤ 3.
4. Estimate weights w via linear least squares
   (minimization of the empirical risk). A code sketch of steps 1-4 is given below.
                                                             26
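A sketch of the four training steps above in numpy. The simple k-means loop used for step 2 and the Gaussian basis function are my own choices (the lecture allows SOM, GLA, or any clustering method):

```python
import numpy as np

def train_rbf(X, y, m, alpha=2.0, seed=0):
    """Non-adaptive RBF training: cluster centers, heuristic widths,
    least-squares weights (steps 1-4 of the algorithm above)."""
    rng = np.random.default_rng(seed)
    # step 2: centers via a few iterations of k-means (unsupervised)
    centers = X[rng.choice(len(X), m, replace=False)].astype(float)
    for _ in range(10):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(m):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    # step 3: widths sigma_j = alpha * (distance to the closest other center)
    d = np.sqrt(((centers[:, None] - centers[None]) ** 2).sum(-1))
    np.fill_diagonal(d, np.inf)
    widths = alpha * d.min(axis=1)
    # step 4: linear weights (with a bias term) via least squares
    Z = np.exp(-((X[:, None] - centers[None]) ** 2).sum(-1) / (2 * widths ** 2))
    Z = np.column_stack([np.ones(len(X)), Z])
    w, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return centers, widths, w
```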
Application Example: Predicting NAV of
        Domestic Mutual Funds
•   Motivation
•   Background on mutual funds
•   Problem specification + experimental setup
•   Modeling results
•   Discussion



                                            27
 Background: pricing mutual funds
• Mutual funds trivia
• Mutual fund pricing:
  - priced once a day (after market close)
   NAV unknown when order is placed
• How to estimate NAV accurately?
 Approach 1: Estimate holdings of a fund (~200-
 400 stocks), then find NAV
 Approach 2: Estimate NAV via correlations
 btwn NAV and major market indices (learning)

                                             28
    Problem specs and experimental setup

• Domestic fund: Fidelity OTC (FOCPX)
• Possible Inputs:
       SP500, DJIA, NASDAQ, ENERGY SPDR
• Data Encoding:
   Output ~ % daily price change in NAV
   Inputs ~ % daily price changes of market indices

• Modeling period: 2003.

• Issues: modeling method? Selection of input
variables? Experimental setup?
                                                      29
Experimental Design and Modeling Setup

 Possible variable selection:

  Mutual Funds                  Input Variables
        Y             X1            X2            X3
     FOCPX           ^IXIC           -             -
     FOCPX          ^GSPC          ^IXIC           -
     FOCPX          ^GSPC          ^IXIC          XLE

 • All variables represent % daily price changes.
 • Modeling method: linear regression
 • Data obtained from Yahoo Finance.
 • Time period for modeling 2003.

                                                        30
        Specification of Training and Test Data

                         Year 2003

 1, 2        3, 4       5, 6       7, 8      9, 10       11, 12



Training     Test

            Training   Test

                       Training   Test

                                  Training   Test

                                              Training     Test



        Two-Month Training/ Test Set-up
         Total 6 regression models for 2003



                                                                  31
Results for Fidelity OTC Fund (GSPC+IXIC)


      Coefficients         w0      w1 (^GSPC) W2(^IXIC)
       Average            -0.027     0.173        0.771
Standard Deviation (SD)   0.043      0.150        0.165




 Average model: Y =-0.027+0.173^GSPC+0.771^IXIC
^IXIC is the main factor affecting FOCPX’s daily price change

 Prediction error: MSE (GSPC+IXIC) = 5.95%




                                                            32
Results for Fidelity OTC Fund (GSPC+IXIC)
  [Figure: daily account value for 2003 (Jan to Dec), FOCPX vs Model(GSPC+IXIC).]
                         Daily closing prices for 2003: NAV vs synthetic model
                                                                                                                        33
    Results for Fidelity OTC Fund (GSPC+IXIC+XLE)


       Coefficients        w0     w1 (^GSPC) W2(^IXIC) W3(XLE)
        Average          -0.029     0.147      0.784      0.029
Standard Deviation (SD) 0.044       0.215      0.191      0.061



 Average Model: Y=-0.029+0.147^GSPC+0.784^IXIC+0.029XLE
 ^IXIC is the main factor affecting FOCPX daily price change

 Prediction error: MSE (GSPC+IXIC+XLE) = 6.14%




                                                                  34
Results for Fidelity OTC Fund (GSPC+IXIC+XLE)
  [Figure: daily account value for 2003 (Jan to Dec), FOCPX vs Model(GSPC+IXIC+XLE).]
                            Daily closing prices for 2003: NAV vs synthetic model
                                                                                                                         35
              Effect of Variable Selection

Different linear regression models for FOCPX:
•    Y =-0.035+0.897^IXIC
•   Y =-0.027+0.173^GSPC+0.771^IXIC
•   Y=-0.029+0.147^GSPC+0.784^IXIC+0.029XLE
•   Y=-0.026+0.226^GSPC+0.764^IXIC+0.032XLE-0.06^DJI
These models have different prediction errors (MSE):
•   MSE (IXIC) = 6.44%
•   MSE (GSPC + IXIC) = 5.95%
•   MSE (GSPC + IXIC + XLE) = 6.14%
•   MSE (GSPC + IXIC + XLE + DJIA) = 6.43%
(1) Variable Selection is a form of complexity control
(2) Good selection can be performed by domain experts
                                                       36
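A sketch of the variable-selection experiment above. The array names (`nav`, `gspc`, `ixic`, `xle` for daily closing prices) are hypothetical placeholders; the data encoding (% daily price changes) and the OLS fit follow the slides:

```python
import numpy as np

def pct_change(prices):
    """Percent daily price change, the encoding used for inputs and output."""
    return 100.0 * (prices[1:] - prices[:-1]) / prices[:-1]

def fit_and_test_mse(X_train, y_train, X_test, y_test):
    """Fit linear regression by OLS on the training period; return test MSE."""
    Z = np.column_stack([np.ones(len(X_train)), X_train])
    w, *_ = np.linalg.lstsq(Z, y_train, rcond=None)
    pred = np.column_stack([np.ones(len(X_test)), X_test]) @ w
    return np.mean((y_test - pred) ** 2)

# hypothetical usage: y = pct_change(nav), X = np.column_stack([pct_change(gspc),
# pct_change(ixic), ...]); compare test MSE across different input subsets.
```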
              Discussion
• Many funds simply mimic major indices
statistical NAV models can be used for
  ranking/evaluating mutual funds
• Statistical models can be used for
  - hedging risk, and
  - overcoming restrictions on trading
  (market timing) of domestic funds
• Since 70% of the funds under-perform their
  benchmark indices, better use index funds
                                          37
              OUTLINE
•   Objectives
•   Methods taxonomy
•   Linear methods
•   Adaptive dictionary methods
    - additive modeling and projection pursuit
    - MLP networks
    - CART and MARS
•   Kernel methods and local risk minimization
•   Empirical comparisons
•   Combining methods
•   Summary and discussion

                                            38
    Additive Modeling & Projection Pursuit
•   Additive models have the parameterization (for
    regression)
            f(x, V) = Σ_{j=1}^{m} g_j(x, v_j) + w_0

    where g_j(x, v_j) is an adaptive basis function
•   Backfitting is a greedy optimization approach
    for estimating basis functions sequentially:
    - basis function g_k(x, v_k) is estimated by holding
    all other basis functions (j ≠ k) fixed



                                                         39
•   By fixing all basis functions j ≠ k, the empirical
    risk (MSE) can be decomposed as

      R_emp(V) = (1/n) Σ_{i=1}^{n} ( y_i - f(x_i, V) )^2
               = (1/n) Σ_{i=1}^{n} [ y_i - Σ_{j≠k} g_j(x_i, v_j) - w_0 - g_k(x_i, v_k) ]^2
               = (1/n) Σ_{i=1}^{n} ( r_i - g_k(x_i, v_k) )^2

 → Each basis function g_k(x, v_k) is estimated via an
   iterative backfitting algorithm (until some
   stopping criterion is met)
Note: r_i can be interpreted as the response variable
   for the adaptive method g_k(x, v_k)                 40
                   Backfitting Algorithm
•  Consider regression estimation of a function of
   two variables of the form  y = g1(x1) + g2(x2) + noise
   from training data (x_1i, x_2i, y_i), i = 1, 2, ..., n
   For example,  t(x1, x2) = x1^2 + sin(2π x2),   x ∈ [0,1]^2

Backfitting method:      (1) estimate g1(x1) for fixed g2
                         (2) estimate g2(x2) for fixed g1
                          iterate the above two steps
•   Estimation via minimization of the empirical risk
      R_emp( g1(x1), g2(x2) ) = (1/n) Σ_{i=1}^{n} ( y_i - g1(x_1i) - g2(x_2i) )^2
      (first step)           = (1/n) Σ_{i=1}^{n} ( y_i - g2(x_2i) - g1(x_1i) )^2
                             = (1/n) Σ_{i=1}^{n} ( r_i - g1(x_1i) )^2
                                                                               41
           Backfitting Algorithm (cont'd)
•  Estimation of g1(x1) via minimization of MSE:
       R_emp( g1(x1) ) = (1/n) Σ_{i=1}^{n} ( r_i - g1(x_1i) )^2  → min
• This is a univariate regression problem of
   estimating g1(x1) from n data points (x_1i, r_i),
   where r_i = y_i - g2(x_2i)
• It can be estimated by smoothing (e.g. kNN regression)
• Estimation of g2(x2) (the second step) proceeds in a
   similar manner, via minimization of
       R_emp( g2(x2) ) = (1/n) Σ_{i=1}^{n} ( r_i - g2(x_2i) )^2,   where r_i = y_i - g1(x_1i)
                                                                    42
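A minimal sketch of the backfitting iteration above for y = g1(x1) + g2(x2) + noise, using kNN smoothing as the univariate estimator (an assumed choice; any smoother would do):

```python
import numpy as np

def knn_smooth(x, r, k=5):
    """Univariate kNN regression: smooth the residuals r against inputs x."""
    nn = np.argsort(np.abs(x[:, None] - x[None, :]), axis=1)[:, :k]
    return r[nn].mean(axis=1)

def backfit(x1, x2, y, n_iter=20, k=5):
    """Alternate the two backfitting steps until the iteration settles."""
    g1, g2 = np.zeros_like(y), np.zeros_like(y)
    for _ in range(n_iter):
        g1 = knn_smooth(x1, y - g2, k)   # step (1): fit g1 to residuals y - g2
        g2 = knn_smooth(x2, y - g1, k)   # step (2): fit g2 to residuals y - g1
    return g1, g2

# toy data from the slide's example: t(x1, x2) = x1^2 + sin(2*pi*x2)
rng = np.random.default_rng(3)
x1, x2 = rng.uniform(0, 1, (2, 100))
y = x1**2 + np.sin(2 * np.pi * x2) + 0.1 * rng.standard_normal(100)
g1_hat, g2_hat = backfit(x1, x2, y)
```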
        Projection Pursuit regression
•   Projection Pursuit is an additive model:
         f(x, V, W) = Σ_{j=1}^{m} g_j( w_j · x, v_j ) + w_0

    where the basis functions g_j(z, v_j) are univariate
    functions (of the projections z = w_j · x)

•   The backfitting algorithm is used to estimate
    iteratively
    (a) the basis functions (parameters v_j) via
    scatterplot smoothing
    (b) the projection parameters w_j (via gradient
    descent)
                                                        43
EXAMPLE: estimation of a two-dimensional function via projection pursuit
(a) Projections are found that minimize unexplained variance.
    Smoothing is performed to create adaptive basis functions.

(b)   The final model is a sum of two univariate adaptive basis functions.




                                                                        44
            Multilayer Perceptrons (MLP)
•       Recall MLP networks for regression:   ŷ = Σ_{j=1}^{m} w_j z_j

        where
              g(x, v_i) = s( v_{i0} + Σ_{k=1}^{d} x_k v_{ik} ) = s( x · v_i )
        with activation
              s(t) = 1 / (1 + exp(-t))
        or
              s(t) = tanh(t) = ( exp(t) - exp(-t) ) / ( exp(t) + exp(-t) )

        [Figure: two-layer MLP diagram: inputs x1 ... xd, hidden units z_j = g(x, v_j),
         parameter matrix V (d × m), output weights W (m × 1).]

•       Parameters (weights) estimated via
        backpropagation
                                                                                                  45
        Details of backpropagation
•   Sigmoid activation:   s(t) = 1 / (1 + exp(-t))
•   Simple derivative:   s'(t) = s(t) ( 1 - s(t) )
     → poor behaviour for large |t| ~ saturation (see the sketch below)
•   How to avoid saturation?
    - Proper initialization (small weights)
    - Pre-scaling of inputs (zero mean, unit variance)
•   Learning rate schedule (initial, final)
•   Stopping rules, number of epochs
•   Number of hidden units
                                                         46
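A short numpy sketch of the sigmoid and its derivative, illustrating the saturation effect mentioned above:

```python
import numpy as np

def sigmoid(t):
    """Logistic activation s(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

def sigmoid_deriv(t):
    """Derivative s'(t) = s(t) * (1 - s(t)); it vanishes for large |t| (saturation)."""
    s = sigmoid(t)
    return s * (1.0 - s)

t = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid_deriv(t))   # tiny at |t| = 10, maximum 0.25 at t = 0
```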
         Additional enhancements
•   The problem: convergence may be very slow
    for error functional with different curvatures:




•   Solution: add a momentum term to smooth
    oscillations:   w_{k+1} = w_k + γ_k z_k + μ Δw_k
    where Δw_k = w_k - w_{k-1} and μ is the momentum parameter



                                                       47
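A sketch of gradient descent with a momentum term on a toy error surface with very different curvatures; the learning rate, momentum value, and toy gradient are illustrative assumptions:

```python
import numpy as np

def gd_momentum(grad, w0, lr=0.05, mu=0.9, n_steps=200):
    """Gradient descent where each step adds mu * (w_k - w_{k-1}),
    smoothing the oscillations caused by unequal curvatures."""
    w = np.array(w0, dtype=float)
    w_prev = w.copy()
    for _ in range(n_steps):
        step = -lr * grad(w) + mu * (w - w_prev)
        w_prev, w = w, w + step
    return w

# toy error surface with curvatures 0.1 and 10 along the two axes
grad = lambda w: np.array([0.1 * w[0], 10.0 * w[1]])
print(gd_momentum(grad, [5.0, 5.0]))   # converges close to the minimum at (0, 0)
```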
    Various forms of complexity control
•   MLP topology ~ number of hidden units
•   Constraints on parameters (weights) ~
    weight decay
•   Type of optimization algorithm (many
    versions of backprop., other opt. methods)
•   Stopping rules
•   Initial conditions (initial ‘small’ weights)
•   So many factors make it difficult to control
    complexity; usually vary 1 complexity factor
    while keeping all others fixed
                                              48
         Toy example: regression
•   Data set: 25 samples generated using sine-squared
    target function with Gaussian noise (st. deviation 0.1).
•   MLP network
    (two hidden units)
     underfitting




                                                           49
         Toy example: regression
•   Data set: 25 samples generated using sine-squared
    target function with Gaussian noise (st. deviation 0.1).
•   MLP network
    (10 hidden units)
     near-optimal




                                                           50
    Backpropagation for classification
•   The original MLP is for regression:   ŷ = Σ_{j=1}^{m} w_j z_j
    (as introduced above)

    [Figure: the same two-layer MLP diagram: inputs x1 ... xd, hidden units
     z_j = g(x, v_j), parameters V (d × m), output weights W (m × 1).]

•   For classification:

    - use sigmoid output unit
    - during training, use real-values 0/1 for class labels
    - during operation, threshold the output of a trained MLP
    classifier at 0.5 to predict class labels (as in HW2)
                                                                               51
        Toy example: classification
•   Data set: 250 samples ~ mixture of gaussians, where
    Class 0 data has centers (-0.3, 0.7) and (0.4, 0.7), and
    Class 1 data has centers (-0.7, 0.3) and (0.3, 0.3).
    The variance of all gaussians is 0.03.
•   MLP classifier
    (two hidden units)




                                                           52
       Toy example: classification

•   MLP classifier (six hidden units)
    ~ near optimal solution




                                        53
             MLP architectures
•   Supervised learning: single output
    i.e., classification, regression
•   Supervised learning:
    multiple outputs (1, 2, ..., k)

    [Figure: MLP with k output units: hidden units s(x · v_j), parameter
     matrix V = [v_1 v_2 ... v_m] (d × m), output weight matrix W (m × k).]

         f(x, w, V) = Σ_{j=1}^{m} w_j s( x · v_j ) + w_0    for each output unit

    In matrix notation:   F(x, W, V) = s(xV) W

•   Unsupervised learning (data compression)
                                                                                      54
                       NetTalk
            (Sejnowski and Rosenberg, 1987)
One of the first successful applications of backpropagation:
http://www.cnl.salk.edu/ParallelNetsPronounce/index.php
•   Goal: Learning to read (English text) aloud, i.e.
    Learn Mapping: English text  phonemes
    using MLP network

•   Network inputs encode 7-letter window (the 4-th letter
    in the middle needs to be pronounced)
•   Network outputs encode phonemes (used in English)
•   The MLP network is trained using labeled data (both
    individual words and unrestricted text)

                                                           55
          NetTalk architecture
Input encoding: 7x29 = 203 units
Output encoding: 26 units (phonemes)
Hidden layer: 80 hidden units




                                       56
           MLP networks: summary
•   MLP and Projection Pursuit models have the same
    mathematical parameterization but very different
    statistical properties:
    MLP model ~ sum of many basis functions of projections
    (basis functions are the same)
    PP model ~ sum of a few basis functions of projections
    (basis functions are adapted to data)
•   Model complexity control for MLP:
    - may be tricky as it depends on many factors
    (optimization method, weight initialization, network
    topology)
    - in practice, tune just one factor (with others fixed) using
    resampling

NOTE: implementation of resampling may be tricky (with
   nonlinear optimization)
                                                               57
                          Regression Trees (CART)
•          Minimization of empirical risk (squared error)
           via partitioning of the input space into regions
                                     f(x) = Σ_{j=1}^{m} w_j I( x ∈ R_j )
•          Example of CART partitioning for a function of 2 inputs
   [Figure: a partition of the (x1, x2) input space into regions R1 ... R5 and the
    corresponding binary tree: split 1 on (x1, s1), split 2 on (x2, s2),
    split 3 on (x2, s3), split 4 on (x1, s4).]

                                                                                                                          58
              Growing CART tree
•   Recursive partitioning for estimating regions
    (via binary splitting)
•   Initial model ~ region R_0 (the whole input domain)
    is divided into two regions R_1 and R_2
•   A split is defined by one of the inputs (k) and a split point s
•   Optimal values of (k, s) are chosen so that splitting a region
    into two daughter regions minimizes the empirical risk
    (see the split-search sketch below)
•   Issues:
    - efficient implementation (selection of the optimal split)
    - optimal tree size ~ model selection (complexity control)
•   Advantages and limitations

                                                             59
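A brute-force sketch of one splitting step: search over every input variable k and candidate split point s, and pick the pair that minimizes the combined squared error of the two daughter regions (an illustrative implementation, not the optimized search used in CART software):

```python
import numpy as np

def best_split(X, y):
    """Return (k, s, sse): the split variable, split point, and empirical risk
    of the best binary split of the region containing (X, y)."""
    best = (None, None, np.inf)
    for k in range(X.shape[1]):
        for s in np.unique(X[:, k]):
            left, right = y[X[:, k] <= s], y[X[:, k] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best[2]:
                best = (k, s, sse)
    return best

rng = np.random.default_rng(4)
X = rng.uniform(0, 1, (100, 2))
y = np.where(X[:, 0] > 0.5, 1.0, 0.0) + 0.1 * rng.standard_normal(100)
print(best_split(X, y))   # picks k = 0 with s near 0.5
```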
           CART model selection
•   Model selection strategy
    (1) Grow a large tree (subject to min leaf node size)
    (2) Tree pruning by selectively merging tree nodes

•   The final model ~ minimizes penalized risk
         R_pen(λ) = R_emp + λ |T|
    where empirical risk ~ MSE
          number of leaf nodes ~ |T|
          regularization parameter ~ λ
•   Note: larger λ → smaller trees
•   In practice: often user-defined (splitmin in Matlab)
                                                       60
     Example: Boston Housing data set
•       Objective: to predict the value of homes in Boston

•       Data set ~ 506 samples total
        Output: value of owner-occupied homes (in $1,000’s)
        Inputs: 13 variables
    1. CRIM    per capita crime rate by town
    2. ZN    proportion of residential land zoned for lots over 25,000 sq.ft.
    3. INDUS proportion of non-retail business acres per town
    4. CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
    5. NOX     nitric oxides concentration (parts per 10 million)
    6. RM     average number of rooms per dwelling
    7. AGE    proportion of owner-occupied units built prior to 1940
    8. DIS   weighted distances to five Boston employment centres
    9. RAD    index of accessibility to radial highways
    10. TAX    full-value property-tax rate per $10,000
    11. PTRATIO pupil-teacher ratio by town
    12. B    1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
    13. LSTAT % lower status of the population


                                                                                        61
  Example CART trees for Boston Housing
1.Training set: 450 samples Splitmin =100 (user-defined)

    [Figure: the resulting CART regression tree (root region R0 split into R1, R2, ...).]




                                                      62
  Example CART trees for Boston Housing
2.Training set: 450 samples Splitmin =50 (user-defined)

    [Figure: the resulting CART regression tree (root region R0 split into R1, R2, ...).]




                                                      63
  Example CART trees for Boston Housing
3.Training set: 455 samples Splitmin =100 (user-defined)
    Note: CART model is sensitive to training samples (vs model 1)




                                                                64
         Decision Trees: summary
•   Advantages
    - speed
    - interpretability
    - different types of input variables
•   Limitations: sensitivity to
    - correlated inputs
    - affine transformations (of input variables)
    - general instability of trees
•   Variations: ID3 (in machine learning), linear CART

                                                     65
                         MARS
•   MARS features and improvements (over CART)
    - continuous approximation (via tensor-product splines)
    - greedy selection of low-order basis functions
    - variable selection (local + global)
•   MARS complexity control
    - lack of fit measure based on Generalized Cross
    Validation (GCV), i.e. MSE on the training set penalized by
    model complexity (tree size)
•   MARS applicability
    - good for high- and low-D problems with a small number
    of low-order interactions (local or global)
    An interaction occurs when the effect of one variable (on the
    output) depends on the level of another variable

                                                              66
         MARS Basis Functions
Truncated splines (+) and (-):

  [Figure: the two truncated linear splines (x - t)_+ and (t - x)_+, each hinged at the knot t.]

→ a pair of truncated linear splines   b_±(x) = [ ±(x - t) ]_+

Basic building block:

  [Figure: y = (x - t)_+ plotted against x, with the knot at t.]
                                                            67
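The MARS building block in a few lines of numpy (a sketch; the knot value in the usage line is arbitrary):

```python
import numpy as np

def hinge_pair(x, t):
    """A pair of truncated linear splines (x - t)_+ and (t - x)_+ with knot t."""
    return np.maximum(x - t, 0.0), np.maximum(t - x, 0.0)

x = np.linspace(0, 1, 5)
print(hinge_pair(x, t=0.5))   # ((x - 0.5)_+, (0.5 - x)_+)
```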
          Tensor-Product Splines
Multivariate Splines
- Tensor product:   g(x, u, v) = Π_{j=1}^{d} [ u_j ( x_j - v_j ) ]_+^q
  → adaptive selection of knot locations
-   Valid knot locations




                                                   68
                           MARS Tree Structure
                                          B1(x) = 1

     B2(x) = B1(x) · b+(x1 - t1)     B3(x) = B1(x) · b-(x1 - t1)     B4(x) = B1(x) · b+(x2 - t2)     B5(x) = B1(x) · b-(x2 - t2)

                 B6(x) = B3(x) · b+(x3 - t3)     B7(x) = B3(x) · b-(x3 - t3)

•   Each node ~ an active basis function:   y = Σ_{i=1}^{7} a_i B_i(x)
•   Basis functions estimated recursively
•   On each path, variables are split at most once
•   Depth of tree indicates interaction level                                                                                     69
            Algorithm for MARS
•   Forward stepwise: search over each node to find
    - split variable
    - split point t
    - coefficients (weights of basis functions)
    that minimize lack of fit criterion (GCV)
•   Backwards stepwise: remove nodes which cause
    - decrease of gcv or
    - the smallest increase of gcv
•   GCV criterion:   R_gcv = R_emp / ( 1 - p/n )^2
    where p ~ the effective number of parameters of the
    MARS model (p grows with the number of basis functions m)
                                                             70
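A sketch of the GCV lack-of-fit computation, assuming the generic form R_gcv = R_emp / (1 - p/n)^2 with p the effective number of parameters (how p is counted for a MARS model is left as an input here):

```python
import numpy as np

def gcv(residuals, dof):
    """GCV lack of fit: training MSE inflated by a complexity penalty."""
    n = len(residuals)
    r_emp = np.mean(residuals ** 2)
    return r_emp / (1.0 - dof / n) ** 2

# usage: residuals of a fitted MARS model with an assumed effective DoF
print(gcv(np.array([0.1, -0.2, 0.05, 0.3]), dof=2))
```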
               MARS Summary
•   Advantages
-   Provides variable subset selection
-   Continuous approximation
-   Works well for low-order interactions and
    additive functions
-   Interpretable
•   Limitations
    - sensitive to coordinate rotation
    - problems in dealing with collinear variables
    - lack of stability of MARS modeling for small samples
                                                     71
                OUTLINE
•   Objectives
•   Methods taxonomy
•   Linear methods
•   Adaptive dictionary methods
•   Kernel methods and local risk minimization
    - kernel methods and local risk minimization
    - Generalized Memory-Based Learning
    - Constrained Topological Mapping
•   Empirical comparisons
•   Combining methods
•   Summary and discussion

                                              72
           Local Risk Minimization
•   Local learning (memory-based learning):
    estimate a function at a single point x0
•   Local risk minimization
                                         K x, x 0
    R ,  ;x 0    Ly, f x,                   px,y dxdy
                                           x 0 
    local neighborhood function K x,x 0 
    normalizing function   x 0    K x, x0  pxdx
•   The goal is minimization of local prediction
    risk over a set of f x,  and over the kernel
    width  using only training data
                                                                     73
    Practical Implementation of LRM
•   Simultaneous minimization of the local risk over f(x, ω)
    and over the kernel width β is hard
• Practical methods assume a fixed (constant or
    linear) parameterization and then adjust only the
    kernel width.
Local estimation at a point x_0:
(1) Select approximating functions of fixed low
    complexity, and select a kernel function (e.g.
    Gaussian or hard threshold).
(2) Select the optimal kernel width, providing minimum
    estimated local risk. That is, selectively reduce the
    training sample (to samples near x_0) to make an estimate.
                                                     74
          LRM and Kernel Methods
•   Consider minimization of the local empirical risk
         R_emp(local) = (1/n) Σ_{i=1}^{n} K_β(x_i, x_0) ( y_i - f(x_i, ω) )^2

    assuming the constant parameterization f(x, w_0) = w_0
     → local average:   f(x_0) = w_0 = (1/n) Σ_{i=1}^{n} y_i K_β(x_i, x_0)
       (for kernel weights normalized to average one)
    (similarly, for a local linear parameterization)
•   Solution to LRM leads to adaptive kernel
    method, because the kernel width  is adapted
    to data at each estimation point x0. However,
    adaptive selection of kernel width is hard.
                                                                75
    Practical Selection of Kernel Width
•   Global Adaptive Approach: the kernel width is
    estimated globally, independent of a particular
    estimation point.
• Global model selection for k-nn regression:
For a given value of k:
(1) Compute a local estimate ŷ_i at each input x_i
(2) Compute the total empirical risk of these estimates:
            R_emp(k) = (1/n) Σ_{i=1}^{n} ( y_i - ŷ_i )^2
(3) Estimate prediction risk using (analytic) model
    selection criterion.
Minimize this risk through appropriate selection of k.
                                                    76
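A sketch of the global selection procedure above for kNN regression. The analytic criterion used here (a GCV-style penalty with effective DoF ≈ n/k) is an assumed choice, not necessarily the one used in the lecture:

```python
import numpy as np

def knn_estimates(X, y, k):
    """Local kNN estimate y_hat_i at every training input x_i."""
    d2 = ((X[:, None] - X[None]) ** 2).sum(-1)
    nn = np.argsort(d2, axis=1)[:, :k]
    return y[nn].mean(axis=1)

def select_k(X, y, k_values=(2, 4, 8, 16)):
    """Pick k by penalizing the empirical risk R_emp(k) with a GCV-style factor."""
    n = len(y)
    risks = {}
    for k in k_values:
        r_emp = np.mean((y - knn_estimates(X, y, k)) ** 2)
        risks[k] = r_emp / (1.0 - (n / k) / n) ** 2   # effective DoF ~ n/k
    return min(risks, key=risks.get)
```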
    Generalized Memory-Based Learning
•   For a given new input, an output is
    estimated via local learning using past data
•   GMBL implements locally weighted linear
    approximation minimizing
      R_emp(local)(w, w_0) = (1/n) Σ_{i=1}^{n} K(x_i, x_0) ( w · x_i + w_0 - y_i )^2

 where the kernel   K(x, x', v) = ( Σ_{k=1}^{d} (x_k - x'_k)^2 v_k^2 )^q
    has adaptable width and scale parameters
    estimated via cross-validation using all data
                                                                  77
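A locally weighted linear fit at a single query point, in the spirit of GMBL. The Gaussian kernel and fixed width are simplifying assumptions; GMBL itself tunes the width and per-variable scale parameters by cross-validation:

```python
import numpy as np

def local_linear_predict(X, y, x0, width=0.3):
    """Solve the kernel-weighted least-squares problem at x0 and evaluate
    the fitted local linear model at x0."""
    Kw = np.exp(-((X - x0) ** 2).sum(axis=1) / (2 * width ** 2))  # kernel weights
    Z = np.column_stack([np.ones(len(X)), X])                     # [1, x] design
    W = np.diag(Kw)
    coef = np.linalg.solve(Z.T @ W @ Z, Z.T @ W @ y)
    return coef[0] + coef[1:] @ x0

rng = np.random.default_rng(5)
X = rng.uniform(0, 1, (100, 2))
y = X[:, 0] + 2 * X[:, 1] + 0.05 * rng.standard_normal(100)
print(local_linear_predict(X, y, np.array([0.5, 0.5])))   # close to 1.5
```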
Constrained Topological Mapping
Recall applying SOM
to regression problem

Nonadaptive
CTM Approach:
Given training data (x,y) perform
1. Dimensionality reduction x → z
   (Apply SOM to x-values of training data)
2. Apply kernel regression to estimate
   y=f(z) at discrete points in z-space
                                          78
      Adaptive CTM Implementation
•   Batch implementation
•   Local linear modeling for each CTM unit j:

        R_emp(local)(w_j, w_0j) = (1/n) Σ_{i=1}^{n} K(z_i, j) ( w_j · x_i + w_0j - y_i )^2

•   Variable selection via adaptive scaling of the distance to unit centers c_j:

        || x_i - c_j ||_v^2 = Σ_{l=1}^{d} v_l^2 ( x_il - c_jl )^2,    where   v_l ~ Σ_j | w_jl |



•   Final neighborhood width (model complexity)
    selected via cross-validation


                                                                              79
                OUTLINE
•   Objectives
•   Methods taxonomy
•   Linear methods
•   Adaptive dictionary methods
•   Kernel methods and local risk minimization
•   Empirical comparisons
•   Combining methods
•   Summary and discussion

                                            80
        Empirical Comparisons
Ref: Cherkassky et al. (1996), Comparison of adaptive
    methods for function estimation from samples, IEEE
    Transactions on Neural Networks, 7, 969-984
•   Challenge of comparisons
    - who performs comparisons (experts vs general
    users)
    - goals of comparison
    - synthetic vs real-life data
    - importance of experimental procedure (i.e. for
    model selection, double resampling etc.)
                                                         81
       Example comparison study
Time series prediction (Weigend & Gershenfeld 1992)
•   Performed by experts on time series 1K-100K
    samples long
•   Lessons learned/ conclusions
    - knowledge of application domain is important
    (simplistic black-box approaches usually fail)
    - successful methods are nonlinear
    - custom/manual control of method’s parameters
    (model selection)

                                                  82
     Application of Adaptive Methods
1. Choose flexible method (parameterization)
2. Choose complexity parameter
   - automatic (from data) or user-selected
3. Estimate model (from training data)
4. Estimate prediction performance(on test data)

NOTE: empirical comparison (of methods) is difficult
  because prediction performance depends on all
  factors (1) - (3), in addition to data itself

                                                   83
      Example Comparison Study
• Objectives and assumptions
  - non-expert users
  - public-domain s/w for regression methods
  - manual model selection using test set; just
  1 or 2 user-defined parameters
  - off-line training (batch mode)
  - comparison focus on methods’
  parameterization (1) and model selection (2)
  for different synthetic data sets
                                            84
    Objectives and assumptions (cont’d)
•    Methods in XTAL package
     k- nearest neighbors regression (k-NN)
     Linear Regression (LR)
     Projection Pursuit (PPR)
     Multivariate Adaptive Regression Splines (MARS)
     Generalized Memory-Based Learning (GMBL)
     Constrained Topological Mapping (CTM)
     Artificial Neural Network / backpropagation (ANN)
•    Synthetic data
     low- and high-dimensional
     uniformly distributed in x-space


                                                         85
             Experimental Set-Up
•   Specification of
    properties of synthetic data: target functions,
    training/test set size, x-distribution, noise level
    performance metric: NRMS error (for test set)
    4 parameter settings for each method:
    KNN: k = 2, 4, 8, 16
    GMBL: no parameters (run only once)
    CTM: smoothing parameter = 0, 2, 5, 9
    MARS: smoothing parameter = 0, 2, 5, 9
    PPR: number of terms (in the smallest model) = 1, 2, 5, 8
    ANN: number of hidden units = 5, 10, 20, 40
                                                          86
•   Training Data
    - uniform distribution (random and spiral)
    - size: small (25), medium (100), large (400)
    - noise level: none, medium (SNR = 4), large (SNR = 2)
•   Test data: 961 samples (no noise)
    - spaced uniformly on 2D grid (for 2-dimensional data)
    - randomly sampled for high-dimensional data




                                                             87
•   2D Target functions




             Function 1       Function 2




                 Function 3       Function 4

                                               88
•   2D Target functions (cont’d)




              Function 5           Function 6




                    Function 7           Function 8


                                                      89
              Comparison Summary
                                       BEST        WORST

Prediction accuracy (dense samples)    ANN         KNN, GMBL
Prediction accuracy (sparse samples)   GMBL, KNN   MARS, PP
Additive target functions              MARS, PP    KNN, GMBL
Harmonic functions                     CTM, ANN    PP
Radial functions                       ANN, PP     KNN
Robustness wrt parameter tuning        ANN, GMBL   PP
Robustness wrt sample properties       ANN, GMBL   PP, MARS

•    Methods performance
     - similar at dense (large) samples
     - uneven at sparse samples and depends significantly on
     the properties of data
                                                               90
     Comments on Specific Methods
•   Comparison metrics:
    (a) generalization
    (b) robust parameter tuning
    (c)robust to data characteristics
•   kNN and GMBL:
    (a) inferior to other methods when accurate
    prediction is possible
    (b) very robust
    (c) very robust

                                                  91
     Comments on Specific Methods
•   MARS
    (a) good for additive functions
    (b) somewhat brittle
    (c) rather unpredictable
•   PPR
    (a) good for additive functions, functions of linear
    combinations of inputs; poor for harmonic
    functions
    (b) brittle
    (c) rather unpredictable

                                                      92
      Comments on Specific Methods
•   ANN
    (a) good for functions of linear combinations,
    harmonic and radial-type functions
    (b) very robust
    (c) very predictable
•   CTM
    (a) very good for harmonic functions, poor for
    functions of linear combinations
    (b) robust
    (c)predictable; best for spiral distribution in x-space

                                                       93
         Conclusions and Caveats
•   Comparison results always biased by
    - selection of data sets
    - s/w implementation of adaptive methods
    - (expert) user bias
•   Relative performance varies with properties of
    data sets (i.e. sample size, noise level etc)
•   Heuristic optimization methods (ANN, CTM)
    are computationally intensive but often more
    robust than faster statistical methods
•   Nonlinear methods should be robust: only for
    robust methods is it possible to develop
    automatic parameter tuning (complexity control)

                                                 94
                OUTLINE
•   Objectives
•   Methods taxonomy
•   Linear methods
•   Adaptive dictionary methods
•   Kernel methods and local risk minimization
•   Empirical comparisons
•   Combining methods
•   Summary and discussion

                                            95
    Motivation for Combining Methods
•   General setting (used in this course)
    - given training data set
    - apply different learning methods
    - select the best model (method)

Learning Method + Data  Predictive Model

•   Why discard other models?
             Motivation (cont’d)
Learning Method + Data  Predictive Model
• Theoretical and empirical evidence
   - no single ‘best’ method exists
• Always possible to find:
   - best method for given data set
   - best data set for given method
• Philosophical + statistical connections,
   Eastern philosophy, Bayesian averaging:
   Combine several theories (models)
   explaining the data
      Strategies for Combining Methods
•   Predictive model depends on 3 factors
    (a) parameterization of admissible models
    (b) random training sample
    (c) empirical loss (for risk minimization)
•   Three combining strategies (for improved generalization)
    1. Different (a), the same (b) and (c)
     Committee of Networks, Stacking, Bayesian
    averaging
    2. Different (b), the same (a) and (c)
     Bagging
    3. Different (c), the same (a) and (b)
     Boosting
             Combining Strategy 1
•   Apply N different methods (parameterizations) to
    the same data  N distinct models
•   Form (linear) combination of N models
        Combining Strategy 1 (cont’d)
Design issues:
•   What parameterizations (methods) to use?
    - as different as possible
•   How many component models?
•   How to combine component models?
    - via empirical risk minimization (neural
    network strategy)
    - Bayesian averaging (statistical strategy)
      Committee of networks approach
Given training data (x_i, y_i), i = 1, ..., n
• Estimate N candidate (regression) models
      f_1(x, ω_1*), f_2(x, ω_2*), ..., f_N(x, ω_N*)
   using different methods
• Construct the combined model as
      f_com(x, λ) = Σ_{j=1}^{N} λ_j f_j(x, ω_j*)
   where the coefficients λ_j are estimated via minimization of the empirical risk
      R(λ) = (1/n) Σ_{i=1}^{n} ( f_com(x_i, λ) - y_i )^2
   under the constraints   Σ_{j=1}^{N} λ_j = 1,   λ_j ≥ 0
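For the two-model case used in the example that follows, the constrained minimization reduces to a one-dimensional search over the mixing weight α; a grid-search sketch (the grid resolution is an arbitrary choice):

```python
import numpy as np

def best_alpha(f1_train, f2_train, y_train, alphas=np.linspace(0.0, 1.0, 101)):
    """Choose alpha in [0, 1] (so the weights are non-negative and sum to one)
    minimizing the empirical risk of alpha * f1 + (1 - alpha) * f2."""
    mse = [np.mean((a * f1_train + (1 - a) * f2_train - y_train) ** 2)
           for a in alphas]
    return alphas[int(np.argmin(mse))]
```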
       Example of Committee Approach
                                               y  0.8sin2 x  0.2x  
                                                                      2
•   Regression data set:
    with x-values uniform in [0,1] and noise variance  2  0.25
•   Regression methods used                   m1 1

    (a) polynomial f polyx ,um  uj x j m 1
                                  1



    (b) trigonometric f trigx ,v m , wm  vj sinjx  w j cos jx w0
                                         j 0             2


                                          2           2
                                                 j 1
    (c) combined (Committee of Networks)
       f comb1x ,    f poly x,um  1    f trigx,v m ,wm 
                                      1                       2   2


•   Model selection:
    - VC model selection for (a) and (b)
    - empirical risk minimization for (c)
      Comparison: 25 training samples
Red ~ target function; Blue dashed ~ polynomial model
Blue dotted ~ trigonometric model; Black ~ combined model
      Comparison: 50 training samples
Red ~ target function; Blue dashed ~ polynomial model
Blue dotted ~ trigonometric model; Black ~ combined model
              Comparison results
25 training samples: Model f(x) MSE(f(x),target)
                      Poly (d=3)     0.0857
                      Trigon (d=3)   0.0237
                      Combined       0.0239
                     (alpha=0.5)


50 training samples: Model f(x) MSE(f(x),target)
                      Poly (d=4)     0.0046
                      Trigon (d=4)   0.0044
                      Combined       0.0038
                     (alpha=0.2)
                 Stacking approach
Given training data (x_i, y_i), i = 1, ..., n
• Estimate N candidate (regression) models
      f_1(x, ω_1*), f_2(x, ω_2*), ..., f_N(x, ω_N*)
   using different methods
• Construct the combined model as
      f_com(x, λ) = Σ_{j=1}^{N} λ_j f_j(x, ω_j*)
   where the coefficients λ_j are estimated via resampling, by minimizing
      R(λ) = (1/n) Σ_{i=1}^{n} ( f_com(x_i, λ) - y_i )^2
   under the constraints   Σ_{j=1}^{N} λ_j = 1,   λ_j ≥ 0
          Empirical Comparison
•   The same data set and experimental setup
•   Committee approach ~ Comb 1
    Stacking approach ~ Comb 2




                                               107
       Summary and Discussion
•   Linear (nonadaptive) methods for regression
    - theoretically well-understood
    - effective methods for complexity control
•   Nonlinear (adaptive) methods
    - inherently complex (non-tractable optimization)
    - difficult to apply analytic model selection and
    resampling
    - no single best method exists for all data sets
•   Combining methods often results in better
    predictions
                                                   108

				