Principal Component Regression Analysis
• Pseudo Inverse
• “Heisenberg Uncertainty” for Data Mining
• Explicit Principal Components
• Implicit Principal Components
• NIPALS Algorithm for Eigenvalues and Eigenvectors
• Scripts
- PCA transformation of data
- Pharma-plots
- PCA training and testing
- Bootstrap PCA
- NIPALS and other PCA algorithms
• Examples
• Feature selection
Classical Regression Analysis
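$$X_{nm} w_m = y_n$$
$$X^T_{mn} X_{nm} w_m = X^T_{mn} y_n$$
$$\left(X^T_{mn} X_{nm}\right)^{-1} \left(X^T_{mn} X_{nm}\right) w_m = \left(X^T_{mn} X_{nm}\right)^{-1} X^T_{mn} y_n$$
$$w_m = \left(X^T_{mn} X_{nm}\right)^{-1} X^T_{mn} y_n$$
• $\left(X^T_{mn} X_{nm}\right)^{-1} X^T_{mn}$ is the pseudo inverse (Penrose inverse) of $X_{nm}$
• Solving for $w_m$ this way is least-squares optimization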
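The normal-equation solution can be checked numerically. The sketch below is a minimal illustration, assuming NumPy and synthetic data; the sizes, the noise level, and the names `w_true`, `w_normal`, and `w_pinv` are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 3                       # n patterns, m features (illustrative sizes)
X = rng.normal(size=(n, m))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=n)

# Normal equations: w = (X^T X)^{-1} X^T y
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# The same weights obtained from the pseudo inverse of X
w_pinv = np.linalg.pinv(X) @ y

print(np.allclose(w_normal, w_pinv))
```

Both routes give the same weights here because the synthetic features are independent, so $X^T X$ is well-conditioned.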
The Machine Learning Paradox
$$X_{nm} w_m = y_n$$
$$w_m = \left(X^T_{mn} X_{nm}\right)^{-1} X^T_{mn} y_n$$
• If data can be learned from, they must have redundancy
• If there is redundancy, $\left(X^T X\right)^{-1}$ is ill-conditioned
- similar data patterns
- closely correlated descriptive features
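A small numerical experiment makes the ill-conditioning visible. This is only a sketch on synthetic data, assuming NumPy; the size of the perturbation (1e-4) is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# Two independent features
X_indep = np.column_stack([x1, x2])
# Two closely correlated features: the second is almost a copy of the first
X_redundant = np.column_stack([x1, x1 + 1e-4 * rng.normal(size=n)])

cond_indep = np.linalg.cond(X_indep.T @ X_indep)
cond_redundant = np.linalg.cond(X_redundant.T @ X_redundant)

print(cond_indep, cond_redundant)
```

The condition number of $X^T X$ grows by many orders of magnitude once the second feature nearly duplicates the first, which is the situation the bullets above describe.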
Beyond Regression
$$w_m = \left(X^T_{mn} X_{nm}\right)^{-1} X^T_{mn} y_n$$
• Paul Werbos motivated beyond regression in 1972
• In addition, there are related statistical “duals” (PCA, PLS, SVM)
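• Principal component analysis:
$$X_{nm} = T_{nm} B_{mm}$$
  $B_{mm}$: eigenvectors
  $T_{nm}$: scores (the data projected onto the eigenvectors)
$$X_{nm} = T_{nm} B_{mm} \approx T_{nh} B_{hm}$$
$$T_{nm} = X_{nm} B^T_{mm}, \qquad T_{nh} = X_{nm} B^T_{mh}, \qquad h = \text{number of principal components}$$
• Trick: eliminate poor conditioning by using only the h PCs with the largest eigenvalues $\lambda$
$$w_m = B^T_{mh} \left(B_{hm} X^T_{mn} X_{nm} B^T_{mh}\right)^{-1} B_{hm} X^T_{mn} y_n$$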
• Now matrix to invert is small and well-conditioned
• Generally include roughly 2 to 6 PCAs
• A Better PCA Regression is PLS (Please Listen to Svante Wold)
• A Better PLS is nonlinear PNLS
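The conditioning trick can be sketched in a few lines. This is an illustrative example only, assuming NumPy and synthetic data in which two latent factors drive six correlated features; the `mixing` matrix, the noise level, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, h = 200, 6, 2
# Hypothetical data: two latent factors drive six strongly correlated features
latent = rng.normal(size=(n, h))
mixing = np.array([[1.0, 1.0, 0.0, 0.5, 0.0, 1.0],
                   [0.0, 1.0, 1.0, -0.5, 1.0, 0.0]])
X = latent @ mixing + 0.01 * rng.normal(size=(n, m))
X -= X.mean(axis=0)
y = latent @ np.array([1.0, -1.0])
y -= y.mean()

# Eigenvectors of X^T X; eigh returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
order = np.argsort(eigvals)[::-1][:h]
B = eigvecs[:, order].T                 # B_hm: rows are the h leading eigenvectors

# w_m = B^T (B X^T X B^T)^{-1} B X^T y
small = B @ X.T @ X @ B.T               # h x h matrix to invert
w = B.T @ np.linalg.solve(small, B @ X.T @ y)

cond_full = np.linalg.cond(X.T @ X)
cond_small = np.linalg.cond(small)
print(cond_full, cond_small)
```

On data like this, the h x h matrix has a far smaller condition number than the full $X^T X$, while `X @ w` still tracks `y` closely.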
Explicit PCA Regression
• We had
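$$X_{nm} w_m = y_n$$
$$w_m = \left(X^T_{mn} X_{nm}\right)^{-1} X^T_{mn} y_n$$
• Assume we derive PCA features for X according to
$$X_{nm} \approx T_{nh} B_{hm}$$
$$T_{nh} = X_{nm} B^T_{mh}, \qquad h = \text{number of principal components}$$
• We now have
$$\hat{y}_n = T_{nh} w_h \approx y_n$$
$$w_h = \left(T^T_{hn} T_{nh}\right)^{-1} T^T_{hn} y_n$$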
Explicit PCA Regression on training/test set
• We have for training set:
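$$\hat{y}^{\mathrm{train}}_n = T^{\mathrm{train}}_{nh} w_h \approx y^{\mathrm{train}}_n$$
$$\hat{y}^{\mathrm{train}}_n = T^{\mathrm{train}}_{nh} \left(T^{\mathrm{train}\,T}_{hn} T^{\mathrm{train}}_{nh}\right)^{-1} T^{\mathrm{train}\,T}_{hn} y^{\mathrm{train}}_n$$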
• And for the test set:
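$$\hat{y}^{\mathrm{test}}_k = T^{\mathrm{test}}_{kh} w^{\mathrm{train}}_h \approx y^{\mathrm{test}}_k$$
$$\hat{y}^{\mathrm{test}}_k = T^{\mathrm{test}}_{kh} \left(T^{\mathrm{train}\,T}_{hn} T^{\mathrm{train}}_{nh}\right)^{-1} T^{\mathrm{train}\,T}_{hn} y^{\mathrm{train}}_n$$
$$\hat{y}^{\mathrm{test}}_k = X^{\mathrm{test}}_{km} B^T_{mh} \left(T^{\mathrm{train}\,T}_{hn} T^{\mathrm{train}}_{nh}\right)^{-1} T^{\mathrm{train}\,T}_{hn} y^{\mathrm{train}}_n$$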
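The training and test formulas above can be sketched as two small functions. This is a minimal illustration, not the Analyze implementation: it assumes NumPy, synthetic data, and an eigendecomposition in place of NIPALS, and the names `fit_pcr`, `predict_pcr`, and `mixing` are hypothetical. The test data are centered with the training means only.

```python
import numpy as np

def fit_pcr(X_train, y_train, h):
    """Return (B_hm, w_h) from zero-centered training data."""
    eigvals, eigvecs = np.linalg.eigh(X_train.T @ X_train)
    B = eigvecs[:, np.argsort(eigvals)[::-1][:h]].T   # h largest eigenvalues
    T_train = X_train @ B.T                            # training scores T_nh
    w_h = np.linalg.solve(T_train.T @ T_train, T_train.T @ y_train)
    return B, w_h

def predict_pcr(X_test, B, w_h):
    """y_hat_test = X_test B^T w_h, reusing the training B and w_h."""
    return (X_test @ B.T) @ w_h

rng = np.random.default_rng(3)
latent = rng.normal(size=(120, 2))
mixing = np.array([[1.0, 1.0, 0.0, 0.5, 0.0],
                   [0.0, 1.0, 1.0, -0.5, 1.0]])        # hypothetical loadings
X = latent @ mixing + 0.05 * rng.normal(size=(120, 5))
y = latent @ np.array([2.0, 1.0]) + 0.05 * rng.normal(size=120)

X_train, X_test = X[:100], X[100:]
y_train, y_test = y[:100], y[100:]
mu_X, mu_y = X_train.mean(axis=0), y_train.mean()      # training statistics only

B, w_h = fit_pcr(X_train - mu_X, y_train - mu_y, h=2)
y_hat_test = predict_pcr(X_test - mu_X, B, w_h) + mu_y
rmse = np.sqrt(np.mean((y_hat_test - y_test) ** 2))
print(rmse)
```

Note that nothing from the test set enters `B` or `w_h`; the test patterns are only projected onto the eigenvectors found on the training set.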
Implicit PCA Regression
$$X_{nm} w_m = y_n$$
$$w_m = \left(X^T_{mn} X_{nm}\right)^{-1} X^T_{mn} y_n$$
$$X_{nm} \approx T_{nh} B_{hm}$$
$$T_{nh} = X_{nm} B^T_{mh}, \qquad h = \text{number of principal components}$$
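$$w_m = \left(B^T_{mh} T^T_{hn} T_{nh} B_{hm}\right)^{-1} B^T_{mh} T^T_{hn} y_n$$
$$w_m = B^T_{mh} \left(T^T_{hn} T_{nh}\right)^{-1} I_{hh} T^T_{hn} y_n \qquad \left(\text{using } B_{hm} B^T_{mh} = I_{hh}\right)$$
$$w_m = B^T_{mh} \left(T^T_{hn} T_{nh}\right)^{-1} T^T_{hn} y_n$$
How to apply? Calculate T and B with the NIPALS algorithm
Determine the weight vector $w_m$ and apply it to the data matrix
$$\hat{y}_n = X_{nm} w_{m1} \approx y_n$$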
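The implicit form folds the eigenvectors into the weight vector, so predictions come straight from the data matrix. The sketch below checks that both routes agree; it assumes NumPy, random synthetic data, and an eigendecomposition standing in for NIPALS.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, h = 80, 5, 3
X = rng.normal(size=(n, m)) @ rng.normal(size=(m, m))
X -= X.mean(axis=0)
y = rng.normal(size=n)
y -= y.mean()

eigvals, eigvecs = np.linalg.eigh(X.T @ X)
B = eigvecs[:, np.argsort(eigvals)[::-1][:h]].T    # B_hm
T = X @ B.T                                        # T_nh

# Explicit: regress on the scores, predict from the scores
w_h = np.linalg.solve(T.T @ T, T.T @ y)
y_hat_explicit = T @ w_h

# Implicit: w_m = B^T (T^T T)^{-1} T^T y, predict from X directly
w_m = B.T @ w_h
y_hat_implicit = X @ w_m

print(np.allclose(y_hat_explicit, y_hat_implicit))
```

The practical benefit is that only the m-vector `w_m` needs to be stored and applied to new data; the scores never have to be computed at prediction time.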
Algorithm
$$X_{nm} \approx T_{nh} B_{hm}$$
$$T_{nh} = X_{nm} B^T_{mh}, \qquad h = \text{number of principal components}$$
• The B matrix is a matrix of eigenvectors of the correlation matrix C
• If the features are zero centered we have:
$$C_{mm} = \frac{1}{n-1} X^T_{mn} X_{nm}$$
• We only consider the h eigenvectors corresponding to largest eigenvalues
• The eigenvalues are the variances
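• Eigenvectors are normalized to 1 and are solutions of:
$$C_{mm} w_m = \lambda w_m \qquad \text{s.t. } \|w\| = 1$$
$$C_{mm} B^T_{mm} = B^T_{mm} \Lambda_{mm} \qquad \left(\text{eigenvectors as rows of } B,\ \text{eigenvalues on the diagonal of } \Lambda\right)$$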
• Use NIPALS algorithm to build up B and T
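The statements above (unit-norm eigenvectors, eigenvalues equal to variances) can be verified directly. This is a sketch assuming NumPy and synthetic data; the features are only zero centered here, so `C` is a covariance matrix, and it becomes the correlation matrix when the features are also scaled to unit variance.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, h = 150, 4, 2
X = rng.normal(size=(n, m)) @ rng.normal(size=(m, m))
X -= X.mean(axis=0)                       # zero center the features

C = X.T @ X / (n - 1)                     # C_mm
eigvals, eigvecs = np.linalg.eigh(C)      # ascending order, unit-norm columns
order = np.argsort(eigvals)[::-1][:h]     # keep the h largest eigenvalues
lam = eigvals[order]
B = eigvecs[:, order].T                   # B_hm

T = X @ B.T                               # scores T_nh
score_var = T.var(axis=0, ddof=1)         # variance of each score column

print(lam, score_var)
```

The two printed vectors coincide: each eigenvalue is the variance of the corresponding score.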
NIPALS Algorithm: Part 2
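$$X_{nm} \approx T_{nh} B_{hm}$$
$$T_{nh} = X_{nm} B^T_{mh}, \qquad h = \text{number of principal components}$$
1. Estimate t, e.g. $t^{est}_{n1}$ = a column of $X$
2. $b_{m1} = \dfrac{X^T_{mn} t^{est}_{n1}}{t^{est\,T}_{1n} t^{est}_{n1}}$
3. $b_{m1} = \dfrac{b_{m1}}{\sqrt{b^T_{1m} b_{m1}}}$
4. $t_n = X_{nm} b_m$
5. $t_n = \dfrac{t_n}{b^T_m b_m}$
6. Set $t^{est}_n = t_n$ and go to step 2 until convergence ($t_n \approx t^{est}_n$)
7. $\lambda = \dfrac{t^T_n t_n}{n - 1}$ (using $b^T_m b_m = 1$)
8. Deflate X according to $X_{nm} = X_{nm} - t_n b^T_m$
9. Go to step 1 and repeat for h PCAs
10. Put the t's in $T_{nh}$ and the b's in $B^T_{mh}$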
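The ten steps above can be sketched as a short function. This is an illustrative NumPy version, not the code used by Analyze; the starting estimate (the column with the largest variance), the tolerance `tol`, and the iteration cap `max_iter` are assumptions made for the example.

```python
import numpy as np

def nipals(X, h, tol=1e-10, max_iter=1000):
    """Return (T_nh, B_hm, eigenvalues) for a zero-centered X."""
    X = np.array(X, dtype=float)          # work on a copy; X is deflated below
    n, m = X.shape
    T = np.zeros((n, h))
    B = np.zeros((h, m))
    lam = np.zeros(h)
    for k in range(h):
        # Step 1: initial estimate, here the column with the largest variance
        t = X[:, np.argmax(X.var(axis=0))].copy()
        for _ in range(max_iter):
            b = X.T @ t / (t @ t)         # step 2
            b /= np.sqrt(b @ b)           # step 3: normalize b to unit length
            t_new = X @ b / (b @ b)       # steps 4 and 5
            converged = np.linalg.norm(t_new - t) <= tol * np.linalg.norm(t_new)
            t = t_new
            if converged:                 # step 6
                break
        lam[k] = t @ t / (n - 1)          # step 7
        X -= np.outer(t, b)               # step 8: deflate
        T[:, k], B[k] = t, b              # step 10
    return T, B, lam

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
X -= X.mean(axis=0)
T, B, lam = nipals(X, h=2)

# Reference: the two largest eigenvalues of C = X^T X / (n - 1)
ref = np.sort(np.linalg.eigvalsh(X.T @ X / (len(X) - 1)))[::-1][:2]
print(lam, ref)
```

Only the h requested components are computed, which is why the iterative scheme can be cheaper than a full eigendecomposition when there are many features.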
PRACTICAL TIPS FOR PCA
• NIPALS algorithm assumes the features are zero centered
• It is standard practice to do a Mahalanobis scaling of the data
$$x^{scaled}_i = \frac{x_i - \bar{x}}{\sigma_x}$$
• PCA regression does not consider the response data
• The t’s are called the scores
• Use 3-10 PCA’s
• I usually use 4 PCA’s
• It is common practice to drop 4 sigma outlier features
(if there are many features)
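The scaling formula above is a per-feature operation. The sketch below is a minimal NumPy version; the function name `mahalanobis_scale`, the use of the sample standard deviation (`ddof=1`), and the small data matrix are assumptions for illustration.

```python
import numpy as np

def mahalanobis_scale(X, mean=None, std=None):
    """Scale each feature (column) to zero mean and unit standard deviation.

    Pass the training mean and std to scale a test set consistently.
    """
    X = np.asarray(X, dtype=float)
    if mean is None:
        mean = X.mean(axis=0)
    if std is None:
        std = X.std(axis=0, ddof=1)
    return (X - mean) / std, mean, std

# A few illustrative rows with three features
X = np.array([[5.1, 3.5, 1.4],
              [4.9, 3.0, 1.4],
              [6.2, 3.4, 5.4],
              [5.9, 3.0, 5.1]])
X_scaled, mean, std = mahalanobis_scale(X)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0, ddof=1))
```

Returning `mean` and `std` lets the same transformation be reapplied to test data, which keeps the zero-centering assumption of NIPALS tied to the training set.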
PCA with Analyze
• Several options: option #17 for training and #18 for testing
(the weight vectors after training are in file bbmatrixx.txt)
• The file num_eg.txt contains a number equal to # PCAs
• Option –17 is the NIPALS algorithm and generally faster than 17
• Analyze has options for calculating T's, B's and λ's
- option #36 transforms a data matrix to its PCAs
- option #36 also saves eigenvalues and eigenvectors of $X^T X$
• Analyze also has an option for bootstrap PCA (-33)
StripMiner Scripts
• last lecture: iris_pca.bat (make PCAs and visualize)
• iris.bat (split up data in training and validation set and predict)
• iris_boot.bat (bootstrap prediction)
REM PCA REGRESSION MODEL FOR IRIS DATA
REM GENERATE IRIS DATA (5)
analyze iris.txt 3301
REM ELIMINATE COMMAS
analyze iris.txt 100
REM MAHALANOBIS SCALE
analyze iris.txt.txt 3
REM GENERATE # PCAs (5)
analyze num_eg.txt 105
REM SPLIT DATA (100 2)
analyze iris.txt.txt.txt 20
copy cmatrix.txt a.pat
copy dmatrix.txt a.tes
REM MAKE PCA REGRESSION MODEL
analyze a.pat 17
analyze a.tes 18
pause
REM VISUALIZE RESULTS
analyze resultss.xxx 4
copy results.ttt results.xxx
analyze resultss.ttt 4
analyze results.ttt 3313
pause
Bootstrap Prediction (iris_boo.bat)
• Build several models on bootstrap resamples of the training set
• Predict the test set with the average model
REM PCA BOOTSTRAP REGRESSION MODEL FOR IRIS DATA
REM GENERATE IRIS DATA (5)
analyze iris.txt 3301
REM ELIMINATE COMMAS
analyze iris.txt 100
REM MAHALANOBIS SCALE
analyze iris.txt.txt 3
REM GENERATE # PCAs (5)
analyze num_eg.txt 105
REM SPLIT DATA (100 2)
analyze iris.txt.txt.txt 20
copy cmatrix.txt a.pat
copy dmatrix.txt a.tes
REM MAKE PCA BOOTSTRAP REGRESSION MODEL (7 100)
analyze a.pat 33
REM MAKE PREDICTIONS
analyze a.tes 18
pause
REM VISUALIZE RESULTS
analyze resultss.xxx 4
copy results.ttt results.xxx
analyze resultss.ttt 4
analyze results.ttt 3313
pause
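The bootstrap idea behind the script can be sketched outside Analyze. This is a hedged illustration only: it assumes that the "average model" means averaging the PCR weight vectors over resamples, which the slides do not spell out, and it uses NumPy with synthetic data. The names `pcr_weights`, `bootstrap_pcr`, and `mixing` are hypothetical.

```python
import numpy as np

def pcr_weights(X, y, h):
    """PCR weights w_m for zero-centered X and y, using h principal components."""
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)
    B = eigvecs[:, np.argsort(eigvals)[::-1][:h]].T
    T = X @ B.T
    return B.T @ np.linalg.solve(T.T @ T, T.T @ y)

def bootstrap_pcr(X, y, h, n_models, rng):
    """Average the PCR weights over bootstrap resamples of the training rows."""
    n = len(X)
    weights = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)          # sample rows with replacement
        Xb, yb = X[idx], y[idx]
        weights.append(pcr_weights(Xb - Xb.mean(axis=0), yb - yb.mean(), h))
    return np.mean(weights, axis=0)

rng = np.random.default_rng(7)
latent = rng.normal(size=(150, 2))
mixing = np.array([[1.0, 1.0, 0.0, 0.5],
                   [0.0, 1.0, 1.0, -0.5]])        # hypothetical loadings
X = latent @ mixing + 0.05 * rng.normal(size=(150, 4))
y = latent @ np.array([1.0, 0.5]) + 0.05 * rng.normal(size=150)
X_train, y_train, X_test, y_test = X[:100], y[:100], X[100:], y[100:]
mu_X, mu_y = X_train.mean(axis=0), y_train.mean()

w_avg = bootstrap_pcr(X_train, y_train, h=2, n_models=100, rng=rng)
y_hat_test = (X_test - mu_X) @ w_avg + mu_y
rmse = np.sqrt(np.mean((y_hat_test - y_test) ** 2))
print(rmse)
```

Averaging the predictions of the individual models instead of their weights gives the same result for a linear model, apart from the small differences in the per-resample centering.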
Neural Network Interpretation of PCA
PCA in DATA SPACE
[Figure: feed-forward network with inputs $x_1, \ldots, x_i, \ldots, x_M$, layers of summation nodes (Σ), and output $\hat{y}_i$]
• First layer, weights $B^T_{MH}$: weights correspond to the H eigenvectors corresponding to the largest eigenvalues of $X^T X$
• Second layer, weights $T^T_{HN}$: weights correspond to the scores or PCAs for the entire training set; this layer gives a similarity score with each data point
• Output layer: weights correspond to the dependent variable for the entire training data; kind of a nearest-neighbor weighted prediction score
• The factor $\left(T_{NH} T^T_{HN}\right)^{-1}$ means that the similarity score with each data point will be weighted (i.e., effectively incorporating Mahalanobis scaling in data space)
$$b_M = B^T_{MH} T^T_{HN} \left(T_{NH} T^T_{HN}\right)^{-1} y_N$$
$$\hat{y}_i = x^T_i B^T_{MH} T^T_{HN} \left(T_{NH} T^T_{HN}\right)^{-1} y_N$$
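The data-space expression can be compared numerically with the feature-space weights. One caveat is an assumption of this sketch rather than something stated on the slide: the N x N matrix $T_{NH} T^T_{HN}$ has rank at most H, so the inverse is computed here as a pseudo-inverse. The example uses NumPy and synthetic data.

```python
import numpy as np

rng = np.random.default_rng(8)
n, m, h = 40, 5, 2
X = rng.normal(size=(n, m)) * np.array([3.0, 2.0, 1.0, 0.5, 0.2])
X -= X.mean(axis=0)
y = rng.normal(size=n)
y -= y.mean()

eigvals, eigvecs = np.linalg.eigh(X.T @ X)
B = eigvecs[:, np.argsort(eigvals)[::-1][:h]].T     # B_HM
T = X @ B.T                                         # T_NH

# Feature-space form: b = B^T (T^T T)^{-1} T^T y
b_feature = B.T @ np.linalg.solve(T.T @ T, T.T @ y)

# Data-space form: b = B^T T^T (T T^T)^{+} y
# (pseudo-inverse, because T T^T is N x N with rank at most H)
b_data = B.T @ T.T @ np.linalg.pinv(T @ T.T, rcond=1e-10) @ y

x_new = X[0]                                        # one pattern x_i
y_hat_i = x_new @ b_data
print(np.allclose(b_feature, b_data))
```

Read layer by layer, `x_new @ B.T` projects the pattern onto the eigenvectors, `@ T.T` compares it with every training pattern, and the remaining factors weight the training responses by those similarities.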