# Basis Expansions and Regularization

Topics:

- Selection of Smoothing Parameters
- Non-parametric Logistic Regression
- Multi-dimensional Splines

Presented by Nagaraj Prasanth
## Smoothing Parameters

- Regression splines:
  - Degree
  - Number of knots
  - Placement of knots

- Smoothing splines:
  - Only the penalty parameter $\lambda$ (since knots are placed at all the $x_i$ and cubic degree is almost always used)
## Smoothing Spline

- Cubic smoothing spline: $\hat{f}_\lambda$ is the minimizer of

$$\sum_{i=1}^{n} \big(y_i - f(x_i)\big)^2 + \lambda \int \big(f''(x)\big)^2 \, dx$$

- $\lambda$: smoothing parameter (a positive constant)
- Controls the trade-off between bias and variance of $\hat{f}_\lambda$
## Selection of Smoothing Parameters

- Fixing Degrees of Freedom
- Cross-Validation
- $C_p$ Criterion
- Improved AIC
- Risk Estimation Methods:
  - Risk Estimation using Classical Pilots (RECP)
  - Exact Double Smoothing (EDS)
## Fixing Degrees of Freedom

$$\hat{f} = N \big( N^T N + \lambda \Omega_N \big)^{-1} N^T y = S_\lambda y$$

$$df_\lambda = \mathrm{trace}(S_\lambda) \quad (\text{monotone in } \lambda \text{ for smoothing splines})$$

- Fix $df_\lambda$ $\Rightarrow$ solve for the corresponding value of $\lambda$, as in the sketch below
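As a concrete illustration, here is a minimal Python sketch of this procedure. It stands in for the natural-spline basis $N$ and penalty $\Omega_N$ above with a B-spline design matrix and a second-difference (P-spline) penalty; that substitution, and all helper names, are assumptions of the sketch rather than the slides' exact construction.

```python
import numpy as np
from scipy.interpolate import BSpline
from scipy.optimize import brentq

def design_and_penalty(x, n_basis=20, degree=3):
    """B-spline design matrix plus a second-difference (P-spline) penalty,
    standing in for the slide's N and Omega_N."""
    xmin, xmax = x.min(), x.max()
    inner = np.linspace(xmin, xmax, n_basis - degree + 1)
    knots = np.r_[[xmin] * degree, inner, [xmax] * degree]
    B = BSpline(knots, np.eye(n_basis), degree)(x)   # shape (n, n_basis)
    D = np.diff(np.eye(n_basis), n=2, axis=0)        # second-difference matrix
    return B, D.T @ D

def smoother_matrix(B, Omega, lam):
    """S_lambda = B (B^T B + lam * Omega)^{-1} B^T."""
    return B @ np.linalg.solve(B.T @ B + lam * Omega, B.T)

def lambda_for_df(B, Omega, target_df):
    """trace(S_lambda) falls monotonically from n_basis (lam -> 0) down to 2,
    the penalty's null-space dimension (lam -> inf): root-find on log10(lam)."""
    g = lambda t: np.trace(smoother_matrix(B, Omega, 10.0 ** t)) - target_df
    return 10.0 ** brentq(g, -8.0, 8.0)
```

Because $df_\lambda$ is monotone in $\lambda$, any target df between 2 and the basis size pins down a unique $\lambda$.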
## Sample Data Generation

- $N = 100$ pairs $(x_i, y_i)$ drawn independently from the model below:

1. $X \sim U[0, 1]$
2. $f(X) = \dfrac{\sin\big(12(X + 0.2)\big)}{X + 0.2}$
3. $\varepsilon \sim N(0, 1)$
4. $Y = f(X) + \varepsilon$
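A minimal NumPy sketch of this generating process (the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
x = rng.uniform(0.0, 1.0, N)                     # X ~ U[0, 1]
f_true = np.sin(12 * (x + 0.2)) / (x + 0.2)      # f(X) = sin(12(X+0.2)) / (X+0.2)
y = f_true + rng.standard_normal(N)              # Y = f(X) + eps, eps ~ N(0, 1)
```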

- Standard error bands at a given point $x$:

$$\hat{f}_\lambda(x) \pm 2 \cdot \widehat{se}\big(\hat{f}_\lambda(x)\big)$$

$$\hat{f} = S_\lambda y$$

$$\mathrm{Cov}(\hat{f}) = S_\lambda \, \mathrm{Cov}(y) \, S_\lambda^T = S_\lambda S_\lambda^T \quad (\text{since } \sigma^2 = 1)$$

$$\mathrm{Bias}(\hat{f}) = f - E(\hat{f}) = f - S_\lambda f$$
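In code, the bands come straight from the smoother matrix. This snippet assumes `B`, `Omega`, `smoother_matrix`, and the simulated `y` from the sketches above, plus some chosen `lam`:

```python
S = smoother_matrix(B, Omega, lam)
f_hat = S @ y
se = np.sqrt(np.diag(S @ S.T))     # Cov(f_hat) = S S^T, since sigma^2 = 1 here
upper, lower = f_hat + 2 * se, f_hat - 2 * se
```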
- df = 5:
  - Spline underfits
  - Bias is most dramatic in regions of high curvature
  - Narrow se band $\Rightarrow$ a bad estimate reported with great reliability

- df = 9:
  - Close to the true function
  - Slight bias
  - Small variance

- df = 15:
  - Wiggly, but close to the true function
  - Increased width of the se bands
## Integrated Squared Prediction Error (EPE)

- Combines bias and variance:

$$EPE(\hat{f}_\lambda) = E\big(Y - \hat{f}_\lambda(X)\big)^2 = \mathrm{Var}(Y) + E\big[\mathrm{Bias}^2(\hat{f}_\lambda(X)) + \mathrm{Var}(\hat{f}_\lambda(X))\big]$$

$$MSE\big(\hat{f}_\lambda(x)\big) = E\big[\big(\hat{f}_\lambda(x) - f(x)\big)^2\big] = \mathrm{Var}\big(\hat{f}_\lambda(x)\big) + \big(\mathrm{Bias}(\hat{f}_\lambda(x))\big)^2$$

$$\mathrm{Bias}\big(\hat{f}_\lambda(x)\big) = E\big[\hat{f}_\lambda(x)\big] - f(x)$$

$$\mathrm{Var}\big(\hat{f}_\lambda(x)\big) = E\Big[\big(\hat{f}_\lambda(x) - E[\hat{f}_\lambda(x)]\big)^2\Big]$$
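The link between the two quantities is one line of algebra. Writing $Y = f(X) + \varepsilon$ with $\varepsilon$ independent of $X$ (and of the training sample behind $\hat{f}_\lambda$), the cross term vanishes:

$$\begin{aligned}
EPE(\hat{f}_\lambda) &= E\big(f(X) + \varepsilon - \hat{f}_\lambda(X)\big)^2 = \underbrace{E\,\varepsilon^2}_{\sigma^2} + E\big(f(X) - \hat{f}_\lambda(X)\big)^2 \\
&= \sigma^2 + E\big[\mathrm{Bias}^2(\hat{f}_\lambda(X)) + \mathrm{Var}(\hat{f}_\lambda(X))\big] = \sigma^2 + E\big[MSE(\hat{f}_\lambda(X))\big].
\end{aligned}$$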
## Smoothing parameter selection methods

- The EPE calculation needs the true function (unavailable in practice)

- Other methods:
  - Cross-Validation
  - Generalized Cross-Validation
  - $C_p$ Criterion
  - Improved AIC
## Cross-Validation

- Leave-one-out: for each $i$, fit on the $N - 1$ points excluding $(x_i, y_i)$
- Compute the squared error $\big(y_i - \hat{f}_\lambda^{(-i)}(x_i)\big)^2$ over all examples and take its average
- $\lambda$ is chosen as the minimizer of

$$CV(\lambda) = \frac{1}{n} \sum_{i=1}^{n} \big(y_i - \hat{f}_\lambda^{(-i)}(x_i)\big)^2 = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{f}_\lambda(x_i)}{1 - S_\lambda(i, i)} \right)^2$$
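The second equality is what makes leave-one-out CV cheap for linear smoothers: one fit per $\lambda$, no refitting. A sketch reusing `smoother_matrix`, `B`, `Omega`, and `y` from the earlier snippets (the search grid is arbitrary):

```python
def cv_score(y, S):
    """Leave-one-out CV via the hat-matrix identity: no model refits needed."""
    resid = y - S @ y
    return np.mean((resid / (1.0 - np.diag(S))) ** 2)

lams = 10.0 ** np.linspace(-6, 2, 50)   # arbitrary grid of lambda values
best_lam = min(lams, key=lambda lam: cv_score(y, smoother_matrix(B, Omega, lam)))
```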
## Cross-Validation Estimate of EPE

- The EPE and CV curves have a similar shape
- The CV curve is approximately unbiased as an estimate of EPE
## Other Classical Methods

- Generalized Cross-Validation:

$$GCV(\lambda) = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{f}_\lambda(x_i)}{1 - n^{-1}\,\mathrm{tr}(S_\lambda)} \right)^2$$

- Mallows' $C_p$ criterion:

$$C_p(\lambda) = \frac{1}{n} \big\| (S_\lambda - I)\, y \big\|^2 + \frac{2 \hat{\sigma}_p^2}{n}\, \mathrm{tr}(S_\lambda) - \hat{\sigma}_p^2$$

$$E\big[C_p(\lambda)\big] = \frac{1}{n}\, E\big\| \hat{f}_\lambda - f \big\|^2 = R\big(f, \hat{f}_\lambda\big) \quad (\text{the RISK})$$

$$\hat{\sigma}_p^2 = \frac{\big\| (S_{\lambda_p} - I)\, y \big\|^2}{\mathrm{tr}\big(I - S_{\lambda_p}\big)}, \qquad \lambda_p: \text{ pre-chosen by CV}$$
## Classical Methods (contd…)

- Improved AIC:

$$AIC_C(\lambda) = \log\left( \frac{\big\| (S_\lambda - I)\, y \big\|^2}{n} \right) + 1 + \frac{2\big(\mathrm{tr}(S_\lambda) + 1\big)}{n - \mathrm{tr}(S_\lambda) - 2}$$

- "Improved" means the finite-sample bias of the classical AIC is corrected
- $\lambda$ is chosen as the minimizer

- Classical methods:
  - Tend to be highly variable
  - Have a tendency to undersmooth
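All three criteria are inexpensive functions of the residuals and the smoother matrix. A sketch under the same setup as before, with the pilot variance $\hat{\sigma}_p^2$ supplied as `sigma2` for $C_p$:

```python
def gcv(y, S):
    n = len(y)
    rss = np.sum((y - S @ y) ** 2)
    return rss / (n * (1.0 - np.trace(S) / n) ** 2)

def cp(y, S, sigma2):
    n = len(y)
    return np.sum((S @ y - y) ** 2) / n + 2.0 * sigma2 * np.trace(S) / n - sigma2

def aicc(y, S):
    n, tr = len(y), np.trace(S)
    rss = np.sum((S @ y - y) ** 2)
    return np.log(rss / n) + 1.0 + 2.0 * (tr + 1.0) / (n - tr - 2.0)
```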
## Risk Estimation Methods

- Require choosing pilot estimates
- Risk Estimation using Classical Pilots (RECP):

$$R\big(f, \hat{f}_\lambda\big) = E\left[ \frac{1}{n} \big\| \hat{f}_\lambda - f \big\|^2 \right] = \frac{1}{n} \big\| (S_\lambda - I)\, f \big\|^2 + \frac{\sigma^2}{n}\, \mathrm{tr}\big(S_\lambda S_\lambda^T\big)$$

- Need pilot estimates for $f$ and $\sigma^2$
- Method 1: "Blocking Method"
- Method 2:
  1. Get a pilot $\lambda_p$ using a classical method
  2. Use it to compute $\hat{f}_{\lambda_p}$ and $\hat{\sigma}^2_{\lambda_p}$
- Final $\lambda$ is the minimizer of $R\big(\hat{f}_{\lambda_p}, \hat{f}_\lambda\big)$, hoping that it is a good estimator of $R\big(f, \hat{f}_\lambda\big)$

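A sketch of Method 2 under the earlier setup (`smoother_matrix`, `B`, `Omega`, `y`, `lams`, and a CV-chosen `best_lam` are assumed from the previous snippets):

```python
def recp_risk(lam, f_pilot, sigma2_pilot, B, Omega):
    """Plug-in version of the risk formula above, using pilot f and sigma^2."""
    n = len(f_pilot)
    S = smoother_matrix(B, Omega, lam)
    bias2 = np.sum(((S - np.eye(n)) @ f_pilot) ** 2) / n
    var = sigma2_pilot * np.trace(S @ S.T) / n
    return bias2 + var

S_p = smoother_matrix(B, Omega, best_lam)                 # pilot fit
f_pilot = S_p @ y
sigma2_pilot = np.sum((y - f_pilot) ** 2) / np.trace(np.eye(len(y)) - S_p)
lam_final = min(lams, key=lambda l: recp_risk(l, f_pilot, sigma2_pilot, B, Omega))
```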
## Risk Estimation Methods (contd…)

- Exact Double Smoothing (EDS):
  - Another approach to choosing the pilot estimates
  - Two "levels" of pilot estimates

- $\lambda_0$: the optimal parameter that minimizes $R\big(f, \hat{f}_\lambda\big)$ (assumed)
- Derive a closed-form expression $L(\lambda_0, \lambda)$ for $E\big(\hat{\lambda}_0 - \lambda\big)^2$
- $\lambda_{p1}$: the minimizer of $E\big(\hat{\lambda}_0 - \lambda\big)^2$
- Replace the unknown $\lambda_0$ with $\lambda_{p2}$
- Choose $\lambda$ as the minimizer of $R\big(\hat{f}_{\lambda_{p1}}, \hat{f}_\lambda\big)$, where $\lambda_{p1}$ minimizes $L(\lambda_{p2}, \lambda)$
## Speed Comparisons

- CV, GCV, AIC: roughly the same computational time
- $C_p$, RECP: longer computational time (two numerical minimizations)
- EDS: even longer (three minimizations)
## Conclusions from simulations

- No method performed uniformly best
- The three classical methods (CV, GCV and $C_p$) gave very similar results
- The AIC method never performed worse than CV, GCV or $C_p$
- For a simple regression function with a high noise level, the two risk estimation methods (RECP and EDS) appear superior
## Nonparametric Logistic Regression

Goal: approximate $\Pr(Y = 1 \mid X = x)$ through the logit

$$\log \frac{\Pr(Y = 1 \mid X = x)}{\Pr(Y = 0 \mid X = x)} = f(x)
\qquad \Longrightarrow \qquad
\Pr(Y = 1 \mid X = x) = \frac{e^{f(x)}}{1 + e^{f(x)}}$$

- Penalized log-likelihood:

$$\begin{aligned}
l(f; \lambda) &= \sum_{i=1}^{N} \Big[ y_i \log p(x_i) + (1 - y_i) \log\big(1 - p(x_i)\big) \Big] - \frac{1}{2} \lambda \int f''(t)^2 \, dt \\
&= \sum_{i=1}^{N} \Big[ y_i f(x_i) - \log\big(1 + e^{f(x_i)}\big) \Big] - \frac{1}{2} \lambda \int f''(t)^2 \, dt
\end{aligned}$$
## Nonparametric Logistic Regression (contd…)

- The parameters of $f(x)$ are set so as to maximize the penalized log-likelihood
- Parametric case: the logit is linear, $f(x) = \beta^T x$
- Non-parametric case: $f(x) = \sum_{j=1}^{N} N_j(x)\, \theta_j$, an expansion in the spline basis functions $N_j$
## Nonparametric Logistic Regression (contd…)

$$\frac{\partial l(\theta)}{\partial \theta} = N^T (y - p) - \lambda \Omega\, \theta$$

$$\frac{\partial^2 l(\theta)}{\partial \theta\, \partial \theta^T} = -N^T W N - \lambda \Omega$$

- $p$: $N$-vector with elements $p(x_i)$
- $W$: diagonal matrix of weights $p(x_i)\big(1 - p(x_i)\big)$
- $(\Omega_N)_{jk} = \int N_j''(t)\, N_k''(t)\, dt$

- The first derivative is non-linear in $\theta$, so an iterative algorithm (Newton-Raphson) is needed
## Nonparametric Logistic Regression (contd…)

$$\theta^{new} = \big( N^T W N + \lambda \Omega \big)^{-1} N^T W z$$

$$f^{new} = N \big( N^T W N + \lambda \Omega \big)^{-1} N^T W \big( f^{old} + W^{-1}(y - p) \big) = S_{\lambda, w}\, z$$

(compare the unweighted case: $\hat{f} = N(N^T N + \lambda \Omega_N)^{-1} N^T y = S_\lambda y$)

- The update fits a weighted smoothing spline to the working response $z = f^{old} + W^{-1}(y - p)$, as sketched below
- Although here $x$ is one-dimensional, the method generalizes to higher-dimensional $x$
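A compact Newton-Raphson (IRLS) sketch of this update for a binary response `y`, again substituting a generic penalized basis (`B`, `Omega`) for the natural-spline $N$ and $\Omega$ on the slide; the function name and fixed iteration count are ours:

```python
def penalized_logistic(B, Omega, y, lam, n_iter=25):
    """Each Newton step solves the weighted, penalized least squares problem
    theta = (B^T W B + lam*Omega)^{-1} B^T W z for the working response z."""
    theta = np.zeros(B.shape[1])
    for _ in range(n_iter):
        f = B @ theta
        p = 1.0 / (1.0 + np.exp(-f))                 # current probabilities
        w = p * (1.0 - p)                            # IRLS weights
        z = f + (y - p) / np.maximum(w, 1e-10)       # working response
        BW = B.T * w                                 # B^T W (W is diagonal)
        theta = np.linalg.solve(BW @ B + lam * Omega, BW @ z)
    return theta
```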
## Multidimensional Splines

- Suppose $X \in \mathbb{R}^2$
- We have a separate basis of functions for representing functions of $X_1$ and of $X_2$:

$$h_{1k}(X_1), \; k = 1, \ldots, M_1 \qquad\qquad h_{2k}(X_2), \; k = 1, \ldots, M_2$$

- The $M_1 \times M_2$-dimensional tensor product basis is

$$g_{jk}(X) = h_{1j}(X_1)\, h_{2k}(X_2), \quad j = 1, \ldots, M_1, \; k = 1, \ldots, M_2$$
## 2D Function Basis

$$g(X) = \sum_{j=1}^{M_1} \sum_{k=1}^{M_2} \theta_{jk} \, g_{jk}(X)$$

- Can be generalized to higher dimensions
- The dimension of the basis grows exponentially in the number of coordinates
- MARS: a greedy algorithm for including only the basis functions deemed necessary by least squares
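Building the tensor product design matrix is a row-wise outer product of the two univariate design matrices; a small NumPy sketch (the helper name is ours):

```python
def tensor_product_basis(B1, B2):
    """Column (j, k) holds h_{1j}(x_i1) * h_{2k}(x_i2).
    B1: (n, M1), B2: (n, M2)  ->  (n, M1 * M2)."""
    n = B1.shape[0]
    return np.einsum('ij,ik->ijk', B1, B2).reshape(n, -1)
```

With $M_1 = M_2 = 10$ this already gives 100 coefficients; in $d$ coordinates the count is $M^d$, which is the exponential growth noted above.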
## Higher Dimensional Smoothing Splines

- Suppose we have pairs $(y_i, x_i)$ with $x_i \in \mathbb{R}^d$
- We want to find a $d$-dimensional regression function $f(x)$
- Solve

$$\min_f \sum_{i=1}^{N} \big\{ y_i - f(x_i) \big\}^2 + \lambda J[f]$$

(the 1-D case: $\min_f \sum_{i=1}^{n} \big(y_i - f(x_i)\big)^2 + \lambda \int f''(x)^2 \, dx$)

- $J$ is a penalty functional for stabilizing $f$ in $\mathbb{R}^d$
## 2D Roughness Penalty

- Generalization of the 1D penalty:

$$J[f] = \iint_{\mathbb{R}^2} \left[ \left( \frac{\partial^2 f(x)}{\partial x_1^2} \right)^2 + 2 \left( \frac{\partial^2 f(x)}{\partial x_1 \partial x_2} \right)^2 + \left( \frac{\partial^2 f(x)}{\partial x_2^2} \right)^2 \right] dx_1 \, dx_2$$

- Yields the thin plate spline (a smooth 2D surface)
## Thin Plate Spline

- Properties in common with the 1D cubic smoothing spline:
  - As $\lambda \to 0$, the solution approaches an interpolating function
  - As $\lambda \to \infty$, the solution approaches the least squares plane
  - For intermediate values of $\lambda$, the solution is a linear expansion of basis functions
## Thin Plate Spline (contd…)

- The solution has the form

$$f(x) = \beta_0 + \beta^T x + \sum_{j=1}^{N} \alpha_j h_j(x)$$

$$h_j(x) = \eta\big( \| x - x_j \| \big), \qquad \eta(z) = z^2 \log z^2$$

- Can be generalized to arbitrary dimension (see the sketch below)
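SciPy ships the thin plate spline as a radial basis kernel; a minimal sketch on synthetic 2D data (the data, smoothing value, and grid are arbitrary):

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (100, 2))                      # x_i in R^2
y = np.sin(6 * X[:, 0]) * X[:, 1] + 0.1 * rng.standard_normal(100)

# smoothing=0 interpolates; large smoothing approaches the least squares plane
tps = RBFInterpolator(X, y, kernel='thin_plate_spline', smoothing=1.0)

grid = np.stack(np.meshgrid(np.linspace(0, 1, 50),
                            np.linspace(0, 1, 50)), axis=-1).reshape(-1, 2)
surface = tps(grid)                                  # fitted smooth surface
```

Passing `neighbors=K` restricts each evaluation to nearby points, in the spirit of the knot-reduction trick on the next slide.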

## Computation Speeds

- 1D splines: $O(N)$
- Thin plate splines: $O(N^3)$
- Can use fewer than $N$ knots: using $K$ knots reduces the order to $O(NK^2 + K^3)$
## Additive Splines

- Solution of the form

$$f(X) = \alpha + f_1(X_1) + \cdots + f_d(X_d)$$

- Each $f_j$ is a univariate spline
- Assume $f$ is additive and impose a penalty on each of the component functions:

$$J[f] = J(f_1 + \cdots + f_d) = \sum_{j=1}^{d} \int f_j''(t_j)^2 \, dt_j$$

- ANOVA spline decomposition:

$$f(X) = \alpha + \sum_j f_j(X_j) + \sum_{j<k} f_{jk}(X_j, X_k) + \cdots$$

- A tensor product basis can achieve more flexibility but also introduces some spurious structure; a backfitting sketch for the additive case follows
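Additive models of this form are classically fit by backfitting: cycle through the coordinates, smoothing the partial residuals against one coordinate at a time. A sketch using SciPy's univariate smoothing spline as the component smoother (it assumes no duplicate values within a coordinate, and the helper name is ours):

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def backfit_additive(X, y, n_iter=10, s=None):
    """Backfitting for f(X) = alpha + sum_j f_j(X_j)."""
    n, d = X.shape
    alpha = y.mean()
    F = np.zeros((n, d))                                  # fitted f_j(x_ij)
    for _ in range(n_iter):
        for j in range(d):
            partial = y - alpha - F.sum(axis=1) + F[:, j] # partial residuals
            order = np.argsort(X[:, j])
            spl = UnivariateSpline(X[order, j], partial[order], s=s)
            fit = spl(X[:, j])
            F[:, j] = fit - fit.mean()                    # center each f_j
    return alpha, F
```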
## Conclusion

- Smoothing parameter selection and the bias-variance trade-off
- Comparison of various classical and risk-estimation methods for parameter selection
- Non-parametric logistic regression
- Extension to higher dimensions:
  - Tensor products
  - Thin plate splines