Posted: 4/19/2011. Public Domain.
Basis Expansions and Regularization
Selection of Smoothing Parameters, Non-parametric Logistic Regression, Multi-dimensional Splines
- Nagaraj Prasanth

Smoothing Parameters
- Regression splines: degree, number of knots, placement of knots
- Smoothing splines: only the penalty parameter \lambda (the knots are at all the x_i, and cubic degree is almost always used)

Smoothing Spline
Cubic smoothing spline: \hat{f}_\lambda is the minimizer of
  \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \int f''(x)^2 \, dx
\lambda is the smoothing parameter (a positive constant); it controls the trade-off between the bias and the variance of \hat{f}.

Selection of Smoothing Parameters
- Fixing degrees of freedom
- Bias-variance trade-off
- Cross-validation
- C_p criterion
- Improved AIC
- Risk estimation methods
  - Risk Estimation using Classical Pilots (RECP)
  - Exact Double Smoothing (EDS)

Fixing Degrees of Freedom
  \hat{f} = N (N^T N + \lambda \Omega_N)^{-1} N^T y = S_\lambda y
  df_\lambda = \mathrm{trace}(S_\lambda), which is monotone in \lambda for smoothing splines
Fix df => the corresponding value of \lambda is determined.

Bias-Variance Trade-off: Sample Data Generation
N = 100 pairs (x_i, y_i) drawn independently from the model below:
  1) X ~ U[0, 1]
  2) f(X) = \sin(12(X + 0.2)) / (X + 0.2)
  3) \varepsilon ~ N(0, 1)
  4) Y = f(X) + \varepsilon
Standard error bands at a given point x: \hat{f}(x) \pm 2 \, se(\hat{f}(x))

Bias-Variance Trade-off
  \hat{f} = S_\lambda y
  Cov(\hat{f}) = S_\lambda Cov(y) S_\lambda^T = S_\lambda S_\lambda^T  (here \sigma^2 = 1)
  Bias(\hat{f}) = f - E(\hat{f}) = f - S_\lambda f

Observations (Bias-Variance Trade-off)
- df = 5: the spline underfits; the bias is most dramatic in regions of high curvature; the narrow se band means a bad estimate is reported with great reliability
- df = 9: close to the true function; slight bias; small variance
- df = 15: wiggly, but close to the true function; increased width of the se bands

Integrated Squared Prediction Error (EPE)
Combines bias and variance:
  EPE(\hat{f}_\lambda) = E(Y - \hat{f}_\lambda(X))^2 = Var(Y) + E[Bias^2(\hat{f}_\lambda(X)) + Var(\hat{f}_\lambda(X))]
using the decomposition
  MSE(\hat{f}) = Var(\hat{f}) + Bias^2(\hat{f})
  Bias(\hat{f}) = E[\hat{f}(X)] - f(X)
  Var(\hat{f}(X)) = E[(\hat{f}(X) - E\hat{f}(X))^2]

Smoothing Parameter Selection Methods
Computing EPE requires the true function, so in practice other criteria are used:
- Cross-validation
- Generalized cross-validation
- C_p criterion
- Improved AIC
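The smoother matrix, its effective degrees of freedom, and the bias-variance quantities above can be sketched numerically. The snippet below is a minimal illustration, not the spline basis itself: a discrete second-difference penalty (a Whittaker-style stand-in for the cubic-spline penalty, an assumption of this sketch) gives a linear smoother S_lambda = (I + lambda D^T D)^{-1}, for which df_lambda = trace(S_lambda) shrinks as lambda grows, on data from the simulated model above.

```python
import numpy as np

def smoother_matrix(n, lam):
    """Linear smoother S_lam = (I + lam * D'D)^{-1}, where D is the
    second-difference operator -- a discrete stand-in for the cubic
    smoothing spline on ordered data (an assumption of this sketch)."""
    D = np.diff(np.eye(n), n=2, axis=0)   # (n-2) x n second differences
    return np.linalg.inv(np.eye(n) + lam * D.T @ D)

rng = np.random.default_rng(0)
n = 100
x = np.sort(rng.uniform(0.0, 1.0, n))
f_true = np.sin(12.0 * (x + 0.2)) / (x + 0.2)   # test function from the model above
y = f_true + rng.normal(0.0, 1.0, n)

for lam in (0.1, 10.0, 1000.0):
    S = smoother_matrix(n, lam)
    f_hat = S @ y
    df = np.trace(S)                    # effective degrees of freedom
    bias = f_true - S @ f_true          # Bias(f_hat) = f - S f
    var = np.diag(S @ S.T)              # Cov(f_hat) = S S^T since sigma^2 = 1
    print(f"lambda={lam:8.1f}  df={df:5.1f}  "
          f"avg bias^2={np.mean(bias**2):.3f}  avg var={np.mean(var):.3f}")
```

Larger lambda trades variance for bias, mirroring the df = 5 / 9 / 15 comparison on the slides.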
Cross-Validation
Divide the data into N blocks of N - 1 points each, leaving out one pair (x_i, y_i) at a time. Compute the squared error (y_i - \hat{f}_\lambda^{(-i)}(x_i))^2 over all pairs (x_i, y_i) and take the average. \lambda is chosen as the minimizer of
  CV(\lambda) = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{f}_\lambda^{(-i)}(x_i))^2 = \frac{1}{n} \sum_{i=1}^n \left( \frac{y_i - \hat{f}_\lambda(x_i)}{1 - S_\lambda(i, i)} \right)^2

Cross-Validation Estimate of EPE
- The EPE and CV curves have a similar shape
- The CV curve is approximately unbiased as an estimate of EPE

Other Classical Methods
Generalized cross-validation:
  GCV(\lambda) = \frac{1}{n} \sum_{i=1}^n \left( \frac{y_i - \hat{f}_\lambda(x_i)}{1 - \mathrm{tr}(S_\lambda)/n} \right)^2
Mallows' C_p criterion:
  C_p(\lambda) = \frac{1}{n} \|(S_\lambda - I) y\|^2 - \hat{\sigma}^2_{\lambda_p} + \frac{2 \hat{\sigma}^2_{\lambda_p}}{n} \mathrm{tr}(S_\lambda)
  E(C_p(\lambda)) \approx E \frac{1}{n} \|\hat{f}_\lambda - f\|^2 = R(f, \hat{f}_\lambda), the RISK
  \hat{\sigma}^2_{\lambda_p} = \frac{\|(I - S_{\lambda_p}) y\|^2}{\mathrm{tr}(I - S_{\lambda_p})}, \quad \lambda_p pre-chosen by CV

Classical Methods (contd.)
Improved AIC:
  AIC_c(\lambda) = \log\left( \frac{1}{n} \|(I - S_\lambda) y\|^2 \right) + 1 + \frac{2(\mathrm{tr}(S_\lambda) + 1)}{n - \mathrm{tr}(S_\lambda) - 2}
- "Improved" means the finite-sample bias of the classical AIC is corrected
- \lambda is chosen as the minimizer
- Classical methods tend to be highly variable and have a tendency to undersmooth

Risk Estimation Methods
These require choosing pilot estimates.

Risk Estimation using Classical Pilots (RECP)
  R(f, \hat{f}_\lambda) = E \frac{1}{n} \|\hat{f}_\lambda - f\|^2 = \frac{1}{n} \|(S_\lambda - I) f\|^2 + \frac{\sigma^2}{n} \mathrm{tr}(S_\lambda S_\lambda^T)
Pilot estimates are needed for f and \sigma^2:
- Method 1: the "blocking method"
- Method 2:
  1. Get a pilot \lambda_p using a classical method.
  2. Use it to compute \hat{f}_{\lambda_p} and \hat{\sigma}^2_{\lambda_p}.
The final \lambda is the minimizer of \hat{R}(\hat{f}_{\lambda_p}, \hat{f}_\lambda), hoping that it is a good estimator of R(f, \hat{f}_\lambda).
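The CV and GCV criteria only require a linear smoother and its diagonal, so they are easy to sketch. Below, a discrete second-difference smoother again stands in for the spline smoother matrix (an assumption of this sketch); lambda is picked on a grid by the leave-one-out shortcut and by GCV.

```python
import numpy as np

def smoother_matrix(n, lam):
    # Discrete second-difference smoother standing in for S_lambda
    # (the CV/GCV formulas below only need some linear smoother S).
    D = np.diff(np.eye(n), n=2, axis=0)
    return np.linalg.inv(np.eye(n) + lam * D.T @ D)

def cv_score(y, S):
    """Leave-one-out CV via the shortcut (1/n) sum [(y_i - fhat_i)/(1 - S_ii)]^2."""
    resid = y - S @ y
    return np.mean((resid / (1.0 - np.diag(S))) ** 2)

def gcv_score(y, S):
    """GCV replaces each S_ii by the average tr(S)/n."""
    n = len(y)
    resid = y - S @ y
    return np.mean(resid ** 2) / (1.0 - np.trace(S) / n) ** 2

rng = np.random.default_rng(1)
n = 100
x = np.sort(rng.uniform(0.0, 1.0, n))
y = np.sin(12.0 * (x + 0.2)) / (x + 0.2) + rng.normal(0.0, 1.0, n)

grid = np.logspace(-2, 4, 25)
cv = [cv_score(y, smoother_matrix(n, lam)) for lam in grid]
gcv = [gcv_score(y, smoother_matrix(n, lam)) for lam in grid]
print("lambda_CV  =", grid[int(np.argmin(cv))])
print("lambda_GCV =", grid[int(np.argmin(gcv))])
```

For penalized least-squares smoothers of this form the leave-one-out shortcut is exact, so no refitting is needed; GCV replaces the per-point factor 1 - S_ii with the average 1 - tr(S)/n.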
Risk Estimation Methods (contd.)
Exact Double Smoothing (EDS)
- Another approach to choosing the pilot estimates, with two "levels" of pilots
- \lambda_0: the optimal parameter that minimizes R(f, \hat{f}_\lambda)
- \lambda_{p1}: the minimizer of E(\lambda - \lambda_0)^2 (assumed)
- Derive a closed-form expression L(\lambda_0, \lambda) for E(\lambda - \lambda_0)^2
- Replace the unknown \lambda_0 with a pilot \lambda_{p2}
- Choose \lambda as the minimizer of \hat{R}(\hat{f}_{\lambda_{p1}}, \hat{f}_\lambda), where \lambda_{p1} minimizes L(\lambda_{p2}, \lambda)

Speed Comparisons
- CV, GCV, AIC: roughly the same computational time
- C_p, RECP: longer (two numerical minimizations)
- EDS: even longer (three minimizations)

Conclusions from Simulations
- No method performed uniformly best
- The three classical methods (CV, GCV and C_p) gave very similar results
- The AIC method never performed worse than CV, GCV or C_p
- For a simple regression function with a high noise level, the two risk estimation methods (RECP and EDS) appear to be superior

Nonparametric Logistic Regression
Goal: approximate
  \log \frac{\Pr(Y=1 \mid X=x)}{\Pr(Y=0 \mid X=x)} = f(x), \quad i.e. \quad \Pr(Y=1 \mid X=x) = \frac{e^{f(x)}}{1 + e^{f(x)}}
Penalized log-likelihood:
  l(f; \lambda) = \sum_{i=1}^N [ y_i \log p(x_i) + (1 - y_i) \log(1 - p(x_i)) ] - \frac{1}{2} \lambda \int f''(t)^2 \, dt
              = \sum_{i=1}^N [ y_i f(x_i) - \log(1 + e^{f(x_i)}) ] - \frac{1}{2} \lambda \int f''(t)^2 \, dt

Nonparametric Logistic Regression (contd.)
The parameters in f(x) are set so as to maximize the log-likelihood.
- Parametric: p(x) = \Pr(Y=1 \mid X=x) with f(x) = \beta^T x
- Nonparametric: f(x) = \sum_{j=1}^N N_j(x) \theta_j

Nonparametric Logistic Regression (contd.)
  \partial l / \partial \theta = N^T (y - p) - \lambda \Omega \theta
  \partial^2 l / \partial \theta \, \partial \theta^T = -(N^T W N + \lambda \Omega)
where p is the N-vector with elements p(x_i), W is the diagonal matrix of weights p(x_i)(1 - p(x_i)), and \{\Omega\}_{jk} = \int N_j''(t) N_k''(t) \, dt.
The first derivative is nonlinear in \theta, so an iterative algorithm (Newton-Raphson) is needed.

Nonparametric Logistic Regression (contd.)
  \theta^{new} = (N^T W N + \lambda \Omega)^{-1} N^T W z
  f^{new} = N (N^T W N + \lambda \Omega)^{-1} N^T W z = S_{\lambda, w} z, \quad z = f^{old} + W^{-1}(y - p)
(compare \hat{f} = N (N^T N + \lambda \Omega_N)^{-1} N^T y = S_\lambda y in the unweighted case)
Each update fits a weighted smoothing spline to the working response z. Although x is one-dimensional here, the procedure generalizes to higher-dimensional x.
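The Newton-Raphson update above can be sketched on a grid. This is again a toy stand-in: a discrete second-difference penalty K replaces the spline penalty Omega, and the basis matrix N is the identity (both assumptions of this sketch), so each step solves (W + lambda K) f_new = W z with working response z = f + W^{-1}(y - p).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = np.sort(rng.uniform(0.0, 1.0, n))
f_true = np.sin(12.0 * (x + 0.2)) / (x + 0.2)
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-f_true))).astype(float)

# Discrete second-difference penalty K standing in for the spline
# penalty Omega (an assumption of this sketch).
D = np.diff(np.eye(n), n=2, axis=0)
K = D.T @ D
lam = 1.0

f = np.zeros(n)                       # current estimate of f(x_i)
for _ in range(40):                   # Newton-Raphson / IRLS iterations
    p = 1.0 / (1.0 + np.exp(-f))
    W = p * (1.0 - p)                 # diagonal of the weight matrix
    z = f + (y - p) / W               # working response z = f + W^{-1}(y - p)
    f_new = np.linalg.solve(np.diag(W) + lam * K, W * z)
    if np.max(np.abs(f_new - f)) < 1e-8:
        f = f_new
        break
    f = f_new

p_hat = 1.0 / (1.0 + np.exp(-f))
print("max |gradient| at convergence:",
      np.max(np.abs((y - p_hat) - lam * (K @ f))))
```

Each iteration is exactly a weighted smoothing of the working response z, matching the remark above that the update fits a weighted smoothing spline.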
Multidimensional Splines
Suppose X \in R^2. We have a separate basis of functions for representing functions of X_1 and of X_2:
  h_{1k}(X_1), k = 1, ..., M_1
  h_{2k}(X_2), k = 1, ..., M_2
The M_1 \times M_2-dimensional tensor product basis is
  g_{jk}(X) = h_{1j}(X_1) h_{2k}(X_2), \quad j = 1, ..., M_1, \; k = 1, ..., M_2

2D Function Basis
  g(X) = \sum_{j=1}^{M_1} \sum_{k=1}^{M_2} \theta_{jk} g_{jk}(X)
- Can be generalized to higher dimensions
- The dimension of the basis grows exponentially in the number of coordinates
- MARS: a greedy algorithm for including only the basis functions deemed necessary by least squares

Higher-Dimensional Smoothing Splines
Suppose we have pairs (y_i, x_i) with x_i \in R^d and want to find a d-dimensional regression function f(x). Solve
  \min_f \sum_{i=1}^N \{y_i - f(x_i)\}^2 + \lambda J[f]
(in 1-D: \min_f \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \int f''(x)^2 \, dx)
J is a penalty functional for stabilizing f in R^d.

2D Roughness Penalty
Generalization of the 1D penalty:
  J[f] = \iint_{R^2} \left[ \left( \frac{\partial^2 f(x)}{\partial x_1^2} \right)^2 + 2 \left( \frac{\partial^2 f(x)}{\partial x_1 \partial x_2} \right)^2 + \left( \frac{\partial^2 f(x)}{\partial x_2^2} \right)^2 \right] dx_1 \, dx_2
This yields the thin plate spline, a smooth 2D surface.

Thin Plate Spline
Properties in common with the 1D cubic smoothing spline:
- As \lambda \to 0, the solution approaches an interpolating function
- As \lambda \to \infty, the solution approaches the least squares plane
- For intermediate values of \lambda, the solution is a linear expansion of basis functions:
  f(x) = \beta_0 + \beta^T x + \sum_{j=1}^N \alpha_j h_j(x), \quad h_j(x) = \eta(\|x - x_j\|), \quad \eta(z) = z^2 \log z^2
Can be generalized to arbitrary dimension.

Computation Speeds
- 1D splines: O(N)
- Thin plate splines: O(N^3)
- Fewer than N knots can be used; with K knots the cost drops to O(N K^2 + K^3)

Additive Splines
Solutions of the form
  f(X) = f_1(X_1) + ... + f_d(X_d)
where each f_i is a univariate spline. Assume f is additive and impose a penalty on each of the component functions:
  J[f] = J(f_1 + ... + f_d) = \sum_{j=1}^d \int f_j''(t_j)^2 \, dt_j

ANOVA Spline Decomposition
  f(X) = \alpha + \sum_j f_j(X_j) + \sum_{j<k} f_{jk}(X_j, X_k) + ...

Additive vs. Tensor Product
The tensor product basis can achieve more flexibility, but it introduces some spurious structure too.
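The tensor-product construction g_jk(X) = h_1j(X_1) h_2k(X_2) is easy to demonstrate. In this sketch plain polynomial bases stand in for the univariate spline bases (an assumption); the M_1 x M_2 product basis is then fit to a smooth two-dimensional surface by least squares, illustrating how quickly the basis dimension grows.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 400
X1 = rng.uniform(-1.0, 1.0, N)
X2 = rng.uniform(-1.0, 1.0, N)
y = np.sin(np.pi * X1) * np.cos(np.pi * X2)    # smooth 2D target, no noise

# Simple univariate bases: plain polynomials stand in for the spline
# bases h_1k, h_2k (an assumption of this sketch).
M1, M2 = 8, 8
H1 = np.vander(X1, M1, increasing=True)        # columns h_{1j}(X1), j = 1..M1
H2 = np.vander(X2, M2, increasing=True)        # columns h_{2k}(X2), k = 1..M2

# Tensor product basis g_{jk}(X) = h_{1j}(X1) h_{2k}(X2):
# one column per (j, k) pair -> M1 * M2 columns in total.
G = np.einsum('nj,nk->njk', H1, H2).reshape(N, M1 * M2)

theta, *_ = np.linalg.lstsq(G, y, rcond=None)  # least squares fit of theta_jk
resid = y - G @ theta
print("basis dimension:", G.shape[1])
print("max |residual| :", np.max(np.abs(resid)))
```

With d coordinates the same construction needs M^d columns, which is the exponential growth noted above and the motivation for greedy selection (MARS) or additive models.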
Conclusion
- Smoothing parameter selection and the bias-variance trade-off
- Comparison of various classical and risk estimation methods for parameter selection
- Nonparametric logistic regression
- Extension to higher dimensions: tensor products, thin plate splines, additive splines

References
- Lee, T. (2002). Smoothing parameter selection for smoothing splines: a simulation study. (www.stat.colostate.edu/~tlee/PSfiles/spline.ps.gz)
- Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.