Selection of predictor variables
Statement of problem
• A common problem is that there is a large
set of candidate predictor variables.
• The goal is to choose a small subset from the
larger set so that the resulting regression
model is simple yet has good predictive
ability.
Example: Cement data
• Response y: heat evolved in calories during
hardening of cement on a per gram basis
• Predictor x1: % of tricalcium aluminate
• Predictor x2: % of tricalcium silicate
• Predictor x3: % of tetracalcium alumino
ferrite
• Predictor x4: % of dicalcium silicate
Example: Cement data
[Matrix plot of y, x1, x2, x3, x4]
Two basic methods of selecting
predictors
• Stepwise regression: Enter and remove
variables, in a stepwise manner, until there is no
justifiable reason to enter or remove more.
• Best subsets regression: Select the subset
of variables that does the best at meeting
some well-defined objective criterion.
Stepwise regression: the idea
• Start with no predictors in the model.
• At each step, enter or remove a variable
based on partial F-tests.
• Stop when no more variables can be
justifiably entered or removed.
Stepwise regression: the steps
• Specify an Alpha-to-Enter (0.15) and an
Alpha-to-Remove (0.15).
• Start with no predictors in the model.
• Fit each one-predictor model and enter the
predictor with the smallest P-value for its
partial F-statistic (equivalently, its t-statistic).
If that P-value > 0.15, stop: none of the
predictors has good predictive ability.
Otherwise …
Stepwise regression: the steps
• Among the predictors not yet in the model, add
the one with the smallest P-value (below 0.15)
for its partial F-statistic (equivalently, its
t-statistic). If no predictor yields a P-value
< 0.15, stop.
• If the P-value for any predictor already in the
model is > 0.15, remove that predictor.
• Repeat the above two steps until no more
predictors can be entered or removed.
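The enter/remove loop described in these steps can be sketched in Python. This is a minimal illustration rather than Minitab's implementation: the helper names `fit_pvalues` and `stepwise`, the `max_steps` guard, and the synthetic data are all assumptions made for the example. Because the partial F-test for a single predictor is equivalent to its two-sided t-test, the sketch uses t-test P-values.

```python
import numpy as np
from scipy import stats


def fit_pvalues(X, y):
    """Fit y on an intercept plus the columns of X; return two-sided t-test P-values."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    df = n - A.shape[1]
    mse = resid @ resid / df
    se = np.sqrt(mse * np.diag(np.linalg.inv(A.T @ A)))
    return 2 * stats.t.sf(np.abs(beta / se), df)  # index 0 is the intercept


def stepwise(X, y, alpha_enter=0.15, alpha_remove=0.15, max_steps=100):
    """Return the column indices chosen by an enter/remove stepwise search."""
    selected = []
    for _ in range(max_steps):  # guard against cycling (an assumption of this sketch)
        changed = False
        # Entry step: try each predictor not yet in the model.
        best_p, best_j = 1.0, None
        for j in range(X.shape[1]):
            if j in selected:
                continue
            p = fit_pvalues(X[:, selected + [j]], y)[-1]
            if p < best_p:
                best_p, best_j = p, j
        if best_j is not None and best_p < alpha_enter:
            selected.append(best_j)
            changed = True
        # Removal step: drop the weakest predictor if it no longer qualifies.
        if selected:
            p = fit_pvalues(X[:, selected], y)[1:]
            worst = int(np.argmax(p))
            if p[worst] > alpha_remove:
                selected.pop(worst)
                changed = True
        if not changed:
            break
    return selected


# Synthetic data (not the cement data): only columns 0 and 1 affect y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = 5 + 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)
selected = stepwise(X, y)
print("Selected columns:", sorted(selected))
```

With a signal this strong, columns 0 and 1 should always be selected; a pure-noise column can still enter occasionally because Alpha-to-Enter is as generous as 0.15.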
Stepwise Regression: y versus x1, x2, x3, x4
Alpha-to-Enter: 0.15  Alpha-to-Remove: 0.15
Response is y on 4 predictors, with N = 13

Step              1        2        3        4
Constant     117.57   103.10    71.65    52.58

x4           -0.738   -0.614   -0.237
T-Value       -4.77   -12.62    -1.37
P-Value       0.001    0.000    0.205

x1                      1.44     1.45     1.47
T-Value                10.40    12.41    12.10
P-Value                0.000    0.000    0.000

x2                              0.416    0.662
T-Value                          2.24    14.44
P-Value                         0.052    0.000

S              8.96     2.73     2.31     2.41
R-Sq          67.45    97.25    98.23    97.87
R-Sq(adj)     64.50    96.70    97.64    97.44
C-p           138.7      5.5      3.0      2.7
Drawbacks of stepwise regression
• The final model is not guaranteed to be
optimal in any specified sense.
• The procedure yields a single final model,
although in practice there are often several
almost equally good models.
Best subsets regression
• If there are P-1 possible predictors, then
there are $2^{P-1}$ possible regression models
containing the predictors.
• For example, 10 predictors yield $2^{10} = 1024$
possible regression models.
• A best subsets algorithm determines the
best subsets of each size, so that the choice of
the final model can be made by the researcher.
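A brute-force version of this idea can be sketched in Python: fit every subset of predictors and keep the one with the smallest SSE at each size. The function names `sse` and `best_subsets` and the synthetic data are assumptions for illustration. Production software typically avoids fitting every model explicitly, so treat this as a sketch of the concept only.

```python
from itertools import combinations

import numpy as np


def sse(X, y, cols):
    """Error sum of squares for the model with an intercept plus the given columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)


def best_subsets(X, y):
    """Fit all subsets; return ({size: (columns, SSE)}, number of models fitted)."""
    k = X.shape[1]
    best = {}
    n_models = 0
    for size in range(k + 1):
        for cols in combinations(range(k), size):
            n_models += 1
            s = sse(X, y, cols)
            if size not in best or s < best[size][1]:
                best[size] = (cols, s)
    return best, n_models


# Synthetic data (hypothetical): only columns 0 and 1 affect y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = 5 + 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

best, n_models = best_subsets(X, y)
print("Models fitted:", n_models)  # 2**4 subsets, counting the intercept-only model
for size, (cols, s) in best.items():
    print(size, cols, round(s, 2))
```

Within a fixed size, the smallest SSE also gives the largest R-square, so SSE is enough to rank subsets of the same size; comparing across sizes is where the criteria on the following slides come in.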
What is used to judge “best”?
• R-square
• Adjusted R-square
• MSE (or S = square root of MSE)
• Mallows' Cp
R-square
$$R^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO}$$
Use the R-square values to find the point where
adding more predictors is not worthwhile because
it leads to a very small increase in R-square.
Adjusted R-square or MSE
$$R_a^2 = 1 - \left(\frac{n-1}{n-p}\right)\frac{SSE}{SSTO} = 1 - \left(\frac{n-1}{SSTO}\right) MSE$$
Adjusted R-square increases only if MSE decreases,
so adjusted R-square and MSE provide equivalent
information.
Find a few subsets for which MSE is smallest (or
adjusted R-square is largest) or so close to the
smallest (largest) that adding more predictors is not
worthwhile.
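As a quick arithmetic check, R-square, adjusted R-square, MSE, and S can be recomputed from the analysis-of-variance table of the PIQ example later in this document (the model with MRI and Height). The variable names are chosen for this sketch only.

```python
import math

# Values copied from the PIQ analysis-of-variance table (MRI and Height model).
n = 38          # number of observations
p = 3           # number of parameters: intercept plus two slopes
sse = 13321.8   # error sum of squares
ssto = 18894.6  # total sum of squares

r_sq = 1 - sse / ssto
mse = sse / (n - p)
r_sq_adj = 1 - (n - 1) / (n - p) * sse / ssto
r_sq_adj_via_mse = 1 - (n - 1) / ssto * mse  # same quantity written in terms of MSE
s = math.sqrt(mse)

print(f"R-Sq      = {100 * r_sq:.1f}%")
print(f"R-Sq(adj) = {100 * r_sq_adj:.1f}%")
print(f"MSE       = {mse:.1f}")
print(f"S         = {s:.2f}")
```

The results agree with the reported output (R-Sq about 29.5%, R-Sq(adj) about 25.5%, S about 19.51), up to rounding of the sums of squares. The two expressions for adjusted R-square are algebraically identical, which is why ranking models by adjusted R-square and by MSE gives the same order.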
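Mallows' Cp criterion
Mallows' Cp statistic:
$$C_p = \frac{SSE_p}{MSE(X_1, \ldots, X_{P-1})} - (n - 2p)$$
is an estimator of the total standardized mean square
error of prediction:
$$\Gamma_p = \frac{1}{\sigma^2} \sum_{i=1}^{n} E\left(\hat{Y}_{ip} - E(Y_i)\right)^2$$
which equals:
$$\Gamma_p = \frac{1}{\sigma^2} \left\{ \sum_{i=1}^{n} \left[E(\hat{Y}_{ip}) - E(Y_i)\right]^2 + \sum_{i=1}^{n} Var(\hat{Y}_{ip}) \right\}$$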
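The Cp statistic can be checked by hand against the PIQ example later in this document. The sketch below takes SSE for the MRI and Height model from its analysis-of-variance table and estimates the full-model MSE as the square of S = 19.794 reported for the three-predictor model in the best subsets output. The variable names are assumptions of this sketch.

```python
# Values copied from the PIQ output elsewhere in this document.
n = 38              # number of observations
p = 3               # parameters in the candidate model (intercept, MRI, Height)
sse_p = 13321.8     # SSE of the candidate model
s_full = 19.794     # S for the model with all three predictors
mse_full = s_full ** 2

cp = sse_p / mse_full - (n - 2 * p)
print(f"C-p for MRI + Height: {cp:.1f}")

# For the full model itself, SSE = MSE_full * (n - P), so C-p always equals P.
P = 4               # parameters in the full model
cp_full = (mse_full * (n - P)) / mse_full - (n - 2 * P)
print(f"C-p for the full model: {cp_full:.1f}")
```

The first value comes out near 2.0, matching the reported C-p for the MRI and Height model. The second line shows why the full model always sits exactly on the line Cp = p, so its Cp value says nothing about bias.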
Plots of Cp against p
• Models with little bias will tend to fall near
the line Cp = p.
• Models with substantial bias will tend to fall
considerably above the line Cp = p.
• Cp values below the line Cp = p are
interpreted as showing no bias (being below
the line due to sampling error).
Using the Cp criterion
• Subsets with small Cp values have a small
total (standardized) mean square error of
prediction.
• When the Cp value is also near p, the bias of
the regression model is small.
• So, identify subsets of predictors for which:
– the Cp value is small, and
– the Cp value is near p (if possible)
Best Subsets Regression: y versus x1, x2, x3, x4
Response is y
Vars R-Sq R-Sq(adj) C-p S x1 x2 x3 x4
1 67.5 64.5 138.7 8.9639 X
1 66.6 63.6 142.5 9.0771 X
2 97.9 97.4 2.7 2.4063 X X
2 97.2 96.7 5.5 2.7343 X X
3 98.2 97.6 3.0 2.3087 X X X
3 98.2 97.6 3.0 2.3121 X X X
4 98.2 97.4 5.0 2.4460 X X X X
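The two-part rule for using Cp (small, and near p) can be applied mechanically to the output above. In this sketch the (Vars, C-p) pairs are copied from the cement table, p is taken as Vars + 1 to count the intercept, and the `slack` tolerance is a hypothetical choice, not something the criterion prescribes.

```python
# (number of predictors, C-p) pairs copied from the cement best subsets output.
rows = [(1, 138.7), (1, 142.5), (2, 2.7), (2, 5.5), (3, 3.0), (3, 3.0), (4, 5.0)]
slack = 0.5  # hypothetical tolerance for "near or below the line Cp = p"

candidates = []
for n_vars, cp in rows:
    p = n_vars + 1  # parameters = predictors + intercept
    if cp <= p + slack:
        candidates.append((cp, p, n_vars))

candidates.sort()  # smallest C-p first
for cp, p, n_vars in candidates:
    print(f"{n_vars} predictors: C-p = {cp}, p = {p}")
```

The two-predictor model with C-p = 2.7 heads the list. The four-predictor model also passes the screen, but only because Cp equals p by construction for the full model.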
Example: Modeling PIQ
[Matrix plot of PIQ, MRI, Height, Weight]
Stepwise Regression: PIQ versus MRI, Height, Weight
Alpha-to-Enter: 0.15  Alpha-to-Remove: 0.15
Response is PIQ on 3 predictors, with N = 38

Step              1        2
Constant      4.652  111.276

MRI            1.18     2.06
T-Value        2.45     3.77
P-Value       0.019    0.001

Height                 -2.73
T-Value                -2.75
P-Value                0.009

S              21.2     19.5
R-Sq          14.27    29.49
R-Sq(adj)     11.89    25.46
C-p             7.3      2.0
Best Subsets Regression: PIQ versus MRI, Height, Weight
Response is PIQ
Vars R-Sq R-Sq(adj) C-p S MRI Height Weight
1 14.3 11.9 7.3 21.212 X
1 0.9 0.0 13.8 22.810 X
2 29.5 25.5 2.0 19.510 X X
2 19.3 14.6 6.9 20.878 X X
3 29.5 23.3 4.0 19.794 X X X
The regression equation is
PIQ = 111 + 2.06 MRI - 2.73 Height

Predictor      Coef  SE Coef      T      P
Constant     111.28    55.87   1.99  0.054
MRI          2.0606   0.5466   3.77  0.001
Height      -2.7299   0.9932  -2.75  0.009

S = 19.51   R-Sq = 29.5%   R-Sq(adj) = 25.5%

Analysis of Variance
Source       DF       SS      MS     F      P
Regression    2   5572.7  2786.4  7.32  0.002
Error        35  13321.8   380.6
Total        37  18894.6

Source   DF  Seq SS
MRI       1  2697.1
Height    1  2875.6
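The stepwise slides describe the partial F-statistic as "a t-statistic". The PIQ output gives a way to check this numerically: Height was entered last, so its sequential sum of squares is the extra regression sum of squares for Height given MRI, and dividing it by MSE should reproduce the square of Height's T value. The variable names below are chosen for this sketch.

```python
# Values copied from the PIQ regression output.
seq_ss_height = 2875.6  # Seq SS for Height, entered after MRI
mse = 380.6             # error mean square
t_height = -2.75        # T value reported for Height

partial_f = seq_ss_height / mse  # one numerator degree of freedom
print(f"Partial F for Height given MRI: {partial_f:.2f}")
print(f"Square of the T value:          {t_height ** 2:.2f}")

# The sequential sums of squares also add up to the regression sum of squares.
ssr = 2697.1 + 2875.6
print(f"SSR from sequential SS: {ssr:.1f}")
```

The two statistics agree to within the rounding of the printed output (both about 7.56), and the sequential sums of squares total 5572.7, the regression SS in the analysis-of-variance table.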
Example: Modeling BP
[Matrix plot of BP, Age, Weight, BSA, Duration, Pulse, Stress]
Stepwise Regression: BP versus Age, Weight, BSA, Duration, Pulse, Stress
Alpha-to-Enter: 0.15  Alpha-to-Remove: 0.15
Response is BP on 6 predictors, with N = 20

Step              1        2        3
Constant      2.205  -16.579  -13.667

Weight        1.201    1.033    0.906
T-Value       12.92    33.15    18.49
P-Value       0.000    0.000    0.000

Age                    0.708    0.702
T-Value                13.23    15.96
P-Value                0.000    0.000

BSA                               4.6
T-Value                          3.04
P-Value                         0.008

S              1.74    0.533    0.437
R-Sq          90.26    99.14    99.45
R-Sq(adj)     89.72    99.04    99.35
C-p           312.8     15.1      6.4
Best Subsets Regression: BP versus Age, Weight, ...
Response is BP
Vars R-Sq R-Sq(adj) C-p S Age Weight BSA Duration Pulse Stress
1 90.3 89.7 312.8 1.7405 X
1 75.0 73.6 829.1 2.7903 X
2 99.1 99.0 15.1 0.53269 X X
2 92.0 91.0 256.6 1.6246 X X
3 99.5 99.4 6.4 0.43705 X X X
3 99.2 99.1 14.1 0.52012 X X X
4 99.5 99.4 6.4 0.42591 X X X X
4 99.5 99.4 7.1 0.43500 X X X X
5 99.6 99.4 7.0 0.42142 X X X X X
5 99.5 99.4 7.7 0.43078 X X X X X
6 99.6 99.4 7.0 0.40723 X X X X X X
The regression equation is
BP = - 13.7 + 0.702 Age + 0.906 Weight + 4.63 BSA

Predictor      Coef  SE Coef      T      P
Constant    -13.667    2.647  -5.16  0.000
Age         0.70162  0.04396  15.96  0.000
Weight      0.90582  0.04899  18.49  0.000
BSA           4.627    1.521   3.04  0.008

S = 0.4370   R-Sq = 99.5%   R-Sq(adj) = 99.4%

Analysis of Variance
Source       DF      SS      MS       F      P
Regression    3  556.94  185.65  971.93  0.000
Error        16    3.06    0.19
Total        19  560.00

Source   DF  Seq SS
Age       1  243.27
Weight    1  311.91
BSA       1    1.77
Stepwise regression in Minitab
• Stat >> Regression >> Stepwise …
• Specify response and all possible predictors.
• If desired, specify predictors that must be
included in every model.
• Select OK. Results appear in session
window.
Best subsets regression in Minitab
• Stat >> Regression >> Best subsets …
• Specify response and all possible predictors.
• If desired, specify predictors that must be
included in every model.
• Select OK. Results appear in session
window.