Basis Expansions and Regularization


   Selection of Smoothing Parameters
   Non-parametric Logistic Regression

   Multi-dimensional Splines



            - Nagaraj Prasanth
Smoothing Parameters

   Regression Splines
       Degree
       Number of knots
       Placement of knots


   Smoothing Splines
    Only the penalty parameter λ
    (since the knots are placed at all the x_i and a cubic degree is
       almost always used)
Smoothing spline

   Cubic smoothing spline: \hat{f} is the minimizer of

        \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \int (f''(x))^2 \, dx

   \lambda: smoothing parameter (a positive constant)

   Controls the trade-off between the bias and the variance of \hat{f}_\lambda
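
   Purely as illustration, a minimal sketch of fitting such a penalized cubic
   smoothing spline, assuming a SciPy version (>= 1.10) that provides
   scipy.interpolate.make_smoothing_spline; the data below are made up for the example.

```python
# Sketch: fit a cubic smoothing spline minimizing
#   sum_i (y_i - f(x_i))^2 + lam * int (f''(x))^2 dx
# Assumption: SciPy >= 1.10 provides make_smoothing_spline.
import numpy as np
from scipy.interpolate import make_smoothing_spline

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 100))        # design points (must be increasing)
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=x.size)

lam = 1e-4                                     # smoothing parameter lambda (> 0)
spline = make_smoothing_spline(x, y, lam=lam)  # returns a callable BSpline object
fitted = spline(x)                             # \hat{f}_lambda evaluated at the x_i
```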
Selection of Smoothing Parameters

   Fixing Degrees of Freedom

   Bias-Variance Trade-off
       Cross-Validation
        C_p Criterion
       Improved AIC


   Risk Estimation Methods
       Risk Estimation using Classical Pilots (RECP)
       Exact Double Smoothing (EDS)
Fixing Degrees of Freedom

        \hat{f}_\lambda = N (N^T N + \lambda \Omega_N)^{-1} N^T y = S_\lambda y

        df_\lambda = \mathrm{trace}(S_\lambda)
    (monotone in \lambda for smoothing splines)

   Fix df => get the corresponding value of \lambda
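
   A possible sketch of this idea: since \hat{f} = S_\lambda y is linear in y, the
   columns of S_\lambda can be obtained by smoothing unit vectors, df follows as the
   trace, and \lambda can be root-found for a target df. The helper names
   (smoother_matrix, lam_for_df) and the use of make_smoothing_spline are assumptions
   of this sketch, not part of the source.

```python
# Sketch: build S_lambda column by column, compute df(lambda) = trace(S_lambda),
# and search for the lambda that yields a target df (df is monotone in lambda).
import numpy as np
from scipy.interpolate import make_smoothing_spline
from scipy.optimize import brentq

def smoother_matrix(x, lam):
    n = len(x)
    S = np.empty((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = 1.0                                   # j-th unit vector as "data"
        S[:, j] = make_smoothing_spline(x, e, lam=lam)(x)
    return S

def df(x, lam):
    return np.trace(smoother_matrix(x, lam))

def lam_for_df(x, target_df, lo=1e-8, hi=1e2):
    # df decreases as lambda increases; the bracket [lo, hi] is assumed
    # to straddle the target so that brentq finds the root.
    return brentq(lambda lam: df(x, lam) - target_df, lo, hi)
```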
Bias-Variance Tradeoff
Sample Data Generation
   N = 100 pairs of (x_i, y_i) drawn independently from the
    model below:
     1)  X \sim U[0, 1]
     2)  f(X) = \frac{\sin(12(X + 0.2))}{X + 0.2}
     3)  \varepsilon \sim N(0, 1)
     4)  Y = f(X) + \varepsilon


   Standard error bands at a given point x:

        \hat{f}_\lambda(x) \pm 2 \, \mathrm{se}(\hat{f}_\lambda(x))
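
   The simulation setup above can be reproduced directly; a short sketch
   (variable names are arbitrary):

```python
# Sketch of the simulation: N = 100 pairs with X ~ U[0, 1],
# f(X) = sin(12(X + 0.2)) / (X + 0.2), eps ~ N(0, 1), Y = f(X) + eps.
import numpy as np

rng = np.random.default_rng(42)
N = 100
X = rng.uniform(0.0, 1.0, N)
f_true = np.sin(12 * (X + 0.2)) / (X + 0.2)
eps = rng.normal(0.0, 1.0, N)
Y = f_true + eps
```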
Bias-Variance Tradeoff


        \hat{f}_\lambda = S_\lambda y

        \mathrm{Cov}(\hat{f}_\lambda) = S_\lambda \mathrm{Cov}(y) S_\lambda^T = S_\lambda S_\lambda^T

        \mathrm{Bias}(\hat{f}_\lambda) = f - E(\hat{f}_\lambda) = f - S_\lambda f
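
   These identities translate directly into code. A sketch, reusing the hypothetical
   smoother_matrix helper and the simulated f_true from the earlier sketches, with
   \sigma^2 = 1 as in the simulation:

```python
# Sketch: pointwise bias and variance of f_hat = S_lambda y,
# using Cov(f_hat) = sigma^2 S S^T and Bias(f_hat) = f - S f.
import numpy as np

def bias_variance(S, f_true, sigma2=1.0):
    cov = sigma2 * S @ S.T          # Cov(f_hat) = sigma^2 S S^T (sigma^2 = 1 here)
    var = np.diag(cov)              # pointwise variances Var(f_hat(x_i))
    bias = f_true - S @ f_true      # Bias(f_hat) = f - S f
    return bias, var

# Standard error bands at each x_i:  (S @ y) +/- 2 * np.sqrt(var)
```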
Observations (Bias-Variance Tradeoff)
   df=5
        Spline underfits the data
       Bias that is most dramatic in regions of high curvature
       Narrow se band => bad estimate with great reliability


   df=9
       Close to true function
       Slight bias
       Small variance


   df=15
        Wiggly, but close to the true function
       Increased width of se bands
Integrated Squared Prediction Error (EPE)

    Combines bias and variance…
        EPE(\hat{f}_\lambda) = E(Y - \hat{f}_\lambda(X))^2
                             = \mathrm{Var}(Y) + E[\mathrm{Bias}^2(\hat{f}_\lambda(X)) + \mathrm{Var}(\hat{f}_\lambda(X))]


        MSE(\hat{f}_\lambda) = E[(\hat{f}_\lambda(x) - f(x))^2]
                             = \mathrm{Var}(\hat{f}_\lambda) + (\mathrm{Bias}(\hat{f}_\lambda))^2

        \mathrm{Bias}(\hat{f}_\lambda) = E[\hat{f}_\lambda(x)] - f(x)

        \mathrm{Var}(\hat{f}_\lambda(x)) = E[(\hat{f}_\lambda(x) - E(\hat{f}_\lambda(x)))^2]
Smoothing parameter selection methods

   EPE calculation needs true function

   Other methods
       Cross-Validation
       Generalized Cross Validation
        C_p Criterion
       Improved AIC
Cross Validation

   Leave-one-out: for each i, fit \hat{f}_\lambda^{(-i)} on the remaining
    N - 1 points

   Compute the squared error (y_i - \hat{f}_\lambda^{(-i)}(x_i))^2
    over all examples (x_i, y_i)
    and take its average

   \lambda is chosen as the minimizer of

        CV(\lambda) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{f}_\lambda^{(-i)}(x_i) \right)^2
                    = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{f}_\lambda(x_i)}{1 - S_\lambda(i, i)} \right)^2
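
   A sketch of the leave-one-out shortcut, again reusing the hypothetical
   smoother_matrix helper; \lambda would then be chosen by minimizing cv_score
   over a grid:

```python
# Sketch: CV(lambda) = (1/n) sum_i [ (y_i - f_hat(x_i)) / (1 - S_ii) ]^2
import numpy as np

def cv_score(x, y, lam):
    S = smoother_matrix(x, lam)           # assumed helper from the earlier sketch
    resid = y - S @ y                     # y_i - f_hat_lambda(x_i)
    return np.mean((resid / (1.0 - np.diag(S))) ** 2)

# Example grid search:
# lams = np.logspace(-8, 1, 50)
# best = lams[np.argmin([cv_score(X, Y, l) for l in lams])]
```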
Cross-Validation Estimate of EPE




   EPE and CV curves have similar shape
   The CV curve is approximately unbiased as an estimate of EPE
Other Classical Methods
   Generalized Cross-Validation

        GCV(\lambda) = \frac{\sum_{i=1}^{n} \left( y_i - \hat{f}_\lambda(x_i) \right)^2}{n \left[ 1 - n^{-1} \mathrm{tr}(S_\lambda) \right]^2}

   Mallows' C_p Criterion

        C_p(\lambda) = \frac{1}{n} \| (S_\lambda - I) y \|^2 - \hat{\sigma}_p^2 + \frac{2 \hat{\sigma}_p^2 \, \mathrm{tr}(S_\lambda)}{n}

        E[C_p(\lambda)] = \frac{1}{n} E \| \hat{f}_\lambda - f \|^2 = R(f, \hat{f}_\lambda) \equiv \mathrm{RISK}

        \hat{\sigma}_p^2 = \frac{\| (S_{\lambda_p} - I) y \|^2}{\mathrm{tr}(I - S_{\lambda_p})}

        \lambda_p: pre-chosen by CV
Classical Methods (contd…)
   Improved AIC

        AIC_C(\lambda) = \log\!\left( \frac{\| (S_\lambda - I) y \|^2}{n} \right) + 1 + \frac{2 \left[ \mathrm{tr}(S_\lambda) + 1 \right]}{n - \mathrm{tr}(S_\lambda) - 2}

       "Improved": the finite-sample bias of the classical AIC is corrected

       \lambda is chosen as the minimizer

   Classical methods:
       Tend to be highly variable
       Tendency to undersmooth
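
   For concreteness, the three classical criteria above can be written down directly
   from their formulas. In this sketch S is the smoother matrix S_\lambda, S_pilot the
   smoother at the CV-chosen pilot \lambda_p, and sigma2_p the pilot variance estimate;
   the function names are arbitrary.

```python
# Sketch of the classical criteria: GCV, Mallows' Cp and improved AIC (AICc).
import numpy as np

def gcv(y, S):
    n = len(y)
    rss = np.sum((y - S @ y) ** 2)
    return rss / (n * (1.0 - np.trace(S) / n) ** 2)

def pilot_sigma2(y, S_pilot):
    # sigma_hat_p^2 = ||(S_{lambda_p} - I) y||^2 / tr(I - S_{lambda_p})
    n = len(y)
    return np.sum((S_pilot @ y - y) ** 2) / (n - np.trace(S_pilot))

def mallows_cp(y, S, sigma2_p):
    n = len(y)
    rss = np.sum((S @ y - y) ** 2)
    return rss / n - sigma2_p + 2.0 * sigma2_p * np.trace(S) / n

def aic_c(y, S):
    n = len(y)
    rss = np.sum((S @ y - y) ** 2)
    tr = np.trace(S)
    return np.log(rss / n) + 1.0 + 2.0 * (tr + 1.0) / (n - tr - 2.0)
```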
Risk Estimation Methods
       Require choosing pilot estimates
       Risk Estimation using Classical Pilots (RECP)

                 1                       1
                                                 S  I  f                         
                                                                      2tr  S S T 
                                     2
               ˆ          ˆ
         R f , f  E f  f             
                                                                2

                   n                       n

          Need pilot estimates for f and  2
          Method 1: “Blocking Method”
          Method 2:
        1.   Get a pilot  p using a classical method
        2.
                                   ˆ
             Use it to compute f  and    ˆ2
                                                      
                                 p                p

                                           ˆ ˆ
           Final  as minimizer of R f , f , hoping that it is a good
                                
                                 ˆ
                                             p
           estimator of R f , f
                             
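
   A sketch of the RECP recipe: plug the pilot quantities into the risk expression
   and minimize over \lambda. The smoother_matrix helper is the same hypothetical one
   as before; f_pilot and sigma2_pilot stand for \hat{f}_{\lambda_p} and \hat{\sigma}^2_{\lambda_p}.

```python
# Sketch: estimated risk
#   R_hat(lambda) = (1/n)||(S_lambda - I) f_pilot||^2
#                   + (sigma2_pilot / n) tr(S_lambda S_lambda^T)
import numpy as np

def recp_risk(lam, x, f_pilot, sigma2_pilot):
    S = smoother_matrix(x, lam)                      # assumed helper
    n = len(x)
    bias_term = np.sum((S @ f_pilot - f_pilot) ** 2) / n
    var_term = sigma2_pilot * np.trace(S @ S.T) / n
    return bias_term + var_term

# Final lambda: minimize recp_risk over a grid of candidate lambdas.
```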
Risk Estimation Methods (contd…)
   Exact Double Smoothing (EDS)
       Another approach to choose pilot estimates

       Two “levels” of pilot estimates

        0 : optimal parameter that minimizes R  f , fˆ  (Assume)
                                         
    

                            E  0   
                                         2
        p1 : minimizer of

                                                         
        Derive closed form expression L  0 ,  , for E  0   
                                                                       2
                                                                           
       Replace unknown 0 with  p 2
    
                                      ˆ    ˆ    
        Choose  as minimizer of R f  p1 , f  , where  p1      minimizes
                
         L p 2 , 
Speed Comparisons

   CV, GCV, AIC: roughly same computational time

   Cp, RECP: longer computational time
     (2 numerical minimizations)


   EDS: even longer (3 minimizations)
Conclusions from simulations

   No method performed uniformly the best
   The three classical methods (CV, GCV and C_p) gave
    very similar results
   The improved AIC method never performed worse
    than CV, GCV or C_p
   For a simple regression function with a high noise level,
    the two risk estimation methods (RECP and EDS) seem to be
    superior
Nonparametric Logistic Regression

    Goal: approximate

        \log \frac{\Pr(Y = 1 \mid X = x)}{\Pr(Y = 0 \mid X = x)} = f(x)

        \Pr(Y = 1 \mid X = x) = \frac{e^{f(x)}}{1 + e^{f(x)}}
   Penalized log-likelihood:
        l(f; \lambda) = \sum_{i=1}^{N} \left[ y_i \log p(x_i) + (1 - y_i) \log(1 - p(x_i)) \right] - \frac{1}{2} \lambda \int (f''(t))^2 \, dt

                      = \sum_{i=1}^{N} \left[ y_i f(x_i) - \log(1 + e^{f(x_i)}) \right] - \frac{1}{2} \lambda \int (f''(t))^2 \, dt
Nonparametric Logistic Regression (contd…)

   Parameters in f(x) are set so as to maximize log-
    likelihood
   Parametric: the logit is linear in x,

        f(x) = \beta^T x

   Non-parametric: the logit is a basis expansion,

        f(x) = \sum_{j=1}^{N} N_j(x) \theta_j
Nonparametric Logistic Regression (contd…)

        \frac{\partial l(\theta)}{\partial \theta} = N^T (y - p) - \lambda \Omega \theta

        \frac{\partial^2 l(\theta)}{\partial \theta \, \partial \theta^T} = -N^T W N - \lambda \Omega

      p: N-vector with elements p(x_i)
      W: diagonal matrix of weights p(x_i)(1 - p(x_i))

        \{\Omega\}_{jk} = \int N_j''(t) \, N_k''(t) \, dt

   First derivative is non-linear in \theta =>
     Iterative algorithm (Newton-Raphson)
Nonparametric Logistic Regression (contd…)

        \theta^{new} = (N^T W N + \lambda \Omega)^{-1} N^T W z

        f^{new} = N (N^T W N + \lambda \Omega)^{-1} N^T W \left( f^{old} + W^{-1}(y - p) \right) = S_{\lambda, w} z

    (compare with \hat{f} = N (N^T N + \lambda \Omega_N)^{-1} N^T y = S_\lambda y in the unweighted case)

   Update fits a weighted smoothing spline to the working
    response z

   Although x is one-dimensional here, this generalizes to higher-dimensional x
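
   A sketch of the resulting penalized Newton-Raphson (IRLS) loop, with the same
   assumed Nmat and Omega as above; the clipping of p is only a numerical safeguard
   added for the example.

```python
# Sketch: each step solves theta_new = (N^T W N + lambda*Omega)^{-1} N^T W z,
# with working response z = N theta_old + W^{-1}(y - p).
import numpy as np

def fit_penalized_logistic(Nmat, y, Omega, lam, n_iter=25):
    theta = np.zeros(Nmat.shape[1])
    for _ in range(n_iter):
        f = Nmat @ theta
        p = 1.0 / (1.0 + np.exp(-f))                   # Pr(Y = 1 | X = x_i)
        p = np.clip(p, 1e-8, 1.0 - 1e-8)               # numerical safeguard only
        w = p * (1.0 - p)                              # diagonal of W
        z = f + (y - p) / w                            # working response
        A = Nmat.T @ (w[:, None] * Nmat) + lam * Omega # N^T W N + lambda*Omega
        theta = np.linalg.solve(A, Nmat.T @ (w * z))   # weighted, penalized LS step
    return theta
```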
Multidimensional Splines
   Suppose X \in \mathbb{R}^2


   We have a separate basis of functions for representing
    functions of X1 and X2


        h_{1k}(X_1), \; k = 1, \ldots, M_1
        h_{2k}(X_2), \; k = 1, \ldots, M_2

   The M_1 \times M_2 dimensional tensor product basis is

        g_{jk}(X) = h_{1j}(X_1) \, h_{2k}(X_2), \quad j = 1, \ldots, M_1, \; k = 1, \ldots, M_2
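
   A sketch of forming this tensor product basis numerically, assuming two
   univariate basis matrices H1 (n x M_1) and H2 (n x M_2) evaluated at the same
   n observations:

```python
# Sketch: tensor-product basis g_jk(X) = h_1j(X1) * h_2k(X2).
import numpy as np

def tensor_product_basis(H1, H2):
    n = H1.shape[0]
    # einsum forms all products h_1j(x_i) * h_2k(x_i); reshape to n x (M1*M2)
    return np.einsum('ij,ik->ijk', H1, H2).reshape(n, -1)
```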
2D Function Basis
        g(X) = \sum_{j=1}^{M_1} \sum_{k=1}^{M_2} \theta_{jk} \, g_{jk}(X)


   Can be generalized to higher dimensions

     
    Dimension of the basis grows exponentially in the
    number of coordinates

   MARS: greedy algorithm for including only the basis
    functions deemed necessary by least squares
Higher Dimensional Smoothing Splines

   Suppose we have pairs (y_i, x_i) with x_i \in \mathbb{R}^d
   Want to find a d-dimensional regression function f(x)

   Solve

        \min_f \; \sum_{i=1}^{N} \{ y_i - f(x_i) \}^2 + \lambda J[f]

    (1-D case:  \min_f \; \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \int (f''(x))^2 \, dx )

   J is a penalty functional for stabilizing f in \mathbb{R}^d

2D Roughness Penalty

   Generalization of 1D penalty

                         2 f (x) 2      2 f (x) 2  2 f (x) 2
J[ f ]       R2
                     [(
                          x1 2
                                 )  2(
                                        x1 x 2
                                                 ) (
                                                      x 22
                                                             ) ]dx1dx 2


   Yields thin plate spline (smooth 2D surface)
Thin Plate Spline

   Properties in common with 1D cubic smoothing spline

        As \lambda \to 0, the solution approaches an interpolating function

        As \lambda \to \infty, the solution approaches the least squares plane

        For intermediate values of \lambda, the solution is a linear expansion of
         basis functions
Thin Plate Spline
    Solution has form
                                   N
           f (x)   0   T x   j h j (x)
                                   j1

          h j (x)  ( x  x j )
          (z)  z 2 log z 2


    Can be generalized to arbitrary dimension

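   A sketch of evaluating a fitted thin-plate spline of this form; the coefficients
   beta0, beta and alpha are assumed to have been estimated already, and eta(0) is
   taken to be 0 by continuity.

```python
# Sketch: f(x) = beta0 + beta^T x + sum_j alpha_j * eta(||x - x_j||),
# with the radial basis eta(z) = z^2 log z^2.
import numpy as np

def eta(z):
    out = np.zeros_like(z)
    nz = z > 0
    out[nz] = z[nz] ** 2 * np.log(z[nz] ** 2)         # eta(0) = 0 by convention
    return out

def thin_plate_eval(x, knots, beta0, beta, alpha):
    r = np.linalg.norm(x[None, :] - knots, axis=1)    # ||x - x_j|| for all knots
    return beta0 + x @ beta + alpha @ eta(r)
```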
Computation Speeds

   1D splines: O(N)

   Thin plate splines: O(N^3)

   Can use fewer than N knots
    Using K knots reduces order to O(NK^2+K^3)
Additive Splines
    Solution of the form
               f(X) = \alpha + f_1(X_1) + \cdots + f_d(X_d)

     Each f_j is a univariate spline

     Assume f is additive and impose a penalty on each of the
      component functions:

               J[f] = J(f_1 + \cdots + f_d) = \sum_{j=1}^{d} \int f_j''(t_j)^2 \, dt_j

     ANOVA Spline Decomposition

               f(X) = \alpha + \sum_j f_j(X_j) + \sum_{j < k} f_{jk}(X_j, X_k) + \cdots
Additive Vs Tensor Product




   The tensor product basis can achieve more flexibility, but it also introduces
    some spurious structure.
Conclusion

   Smoothing parameter selection and the bias-variance trade-off
   Comparison of various classical and risk estimation
    methods for parameter selection
   Non-parametric logistic regression
   Extension to higher dimensions
       Tensor Product
       Thin plate splines
       Additive Splines
References

   Lee, T. (2002). Smoothing parameter selection for Smoothing
    Splines: A Simulation Study
    (www.stat.colostate.edu/~tlee/PSfiles/spline.ps.gz)
   Hastie, T., Tibshirani, R., Friedman, J. (2001). The Elements of
    Statistical Learning: Data Mining, Inference, and Prediction

				