# Selection of predictor variables
## Statement of problem

- A common problem is that there is a large set of candidate predictor variables.
- The goal is to choose a small subset from the larger set so that the resulting regression model is simple yet has good predictive ability.
## Example: Cement data

- Response y: heat evolved, in calories per gram, during hardening of cement
- Predictor x1: % of tricalcium aluminate
- Predictor x2: % of tricalcium silicate
- Predictor x3: % of tetracalcium aluminoferrite
- Predictor x4: % of dicalcium silicate
[Scatterplot matrix of y, x1, x2, x3, and x4 for the cement data; tick labels: y 83.35–105.05, x1 6–16, x2 37.25–59.75, x3 8.75–18.25, x4 19.5–46.5]
## Two basic methods of selecting predictors

- Stepwise regression: enter and remove variables, one step at a time, until there is no justifiable reason to enter or remove any more.
- Best subsets regression: select the subset of variables that does best at meeting some well-defined objective criterion.
## Stepwise regression: the idea

- At each step, enter or remove a variable based on partial F-tests.
- Stop when no more variables can be justifiably entered or removed.
## Stepwise regression: the steps

- Specify an Alpha-to-Enter level (here 0.15) and an Alpha-to-Remove level (here 0.15).
- Put the predictor with the smallest P-value, based on the partial F-statistic (a t-statistic), in the model. If that P-value > 0.15, stop: none of the predictors has good predictive ability.
- Otherwise, add the predictor with the smallest P-value (below 0.15) based on the partial F-statistic. If none of the remaining predictors yields a P-value < 0.15, stop.
- If the P-value of any partial F-statistic in the current model exceeds 0.15, remove the violating predictor.
- Repeat the entry and removal steps until no more predictors can be entered or removed.
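The entry-and-removal loop can be sketched in Python. This is a minimal sketch, not the Minitab procedure: numpy is assumed available, and a fixed critical value `t_crit` stands in for the Alpha-to-Enter/Alpha-to-Remove P-value cutoffs, since exact P-values require a t-distribution CDF.

```python
import numpy as np

def t_stats(X, y):
    """OLS t-statistics for the slope coefficients (intercept added internally)."""
    n, k = X.shape
    Xd = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    mse = resid @ resid / (n - k - 1)
    se = np.sqrt(np.diag(mse * np.linalg.inv(Xd.T @ Xd)))
    return beta[1:] / se[1:]

def stepwise(X, y, t_crit=1.5):
    """Forward entry with a removal check after each entry (a sketch)."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining:
        # Entry step: pick the candidate with the largest partial |t|-statistic.
        scores = {j: abs(t_stats(X[:, selected + [j]], y)[-1]) for j in remaining}
        cand = max(scores, key=scores.get)
        if scores[cand] < t_crit:
            break  # no remaining predictor is significant enough to enter
        selected.append(cand)
        remaining.remove(cand)
        # Removal step: drop the least significant predictor if below the cutoff.
        t = np.abs(t_stats(X[:, selected], y))
        worst = int(np.argmin(t))
        if t[worst] < t_crit:
            dropped = selected.pop(worst)  # not reconsidered, to keep the sketch simple
            if dropped == cand:
                break
    return selected

# Demo on synthetic data where y truly depends on x0 and x1 only.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=60)
print(stepwise(X, y))
```

Removed predictors are not reconsidered here, so the loop is guaranteed to terminate; real implementations allow re-entry but must guard against cycling.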
## Stepwise Regression: y versus x1, x2, x3, x4

    Alpha-to-Enter: 0.15  Alpha-to-Remove: 0.15
    Response is y on 4 predictors, with N = 13

    Step         1        2        3        4
    Constant    117.57   103.10    71.65    52.58

    x4          -0.738   -0.614   -0.237
    T-Value      -4.77   -12.62    -1.37
    P-Value      0.001    0.000    0.205

    x1                     1.44     1.45     1.47
    T-Value               10.40    12.41    12.10
    P-Value               0.000    0.000    0.000

    x2                             0.416    0.662
    T-Value                         2.24    14.44
    P-Value                        0.052    0.000

    S             8.96     2.73     2.31     2.41
    R-Sq         67.45    97.25    98.23    97.87
    C-p          138.7      5.5      3.0      2.7
## Drawbacks of stepwise regression

- The final model is not guaranteed to be optimal in any specified sense.
- The procedure yields a single final model, although in practice there are often several almost equally good models.
## Best subsets regression

- If there are P-1 possible predictors, then there are $2^{P-1}$ possible regression models containing subsets of those predictors.
- For example, 10 predictors yield $2^{10} = 1024$ possible regression models.
- A best subsets algorithm determines the best subsets of each size, so that the choice of the final model can be made by the researcher.
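The exhaustive search behind best subsets regression can be sketched directly (numpy assumed; helper names are illustrative): enumerate every subset of each size and keep the one with the highest R-square.

```python
import numpy as np
from itertools import combinations

def r_squared(X, y):
    """R² of the OLS fit of y on the columns of X (intercept included)."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    sse = np.sum((y - Xd @ beta) ** 2)
    ssto = np.sum((y - y.mean()) ** 2)
    return 1 - sse / ssto

def best_subsets(X, y):
    """For each subset size, return (R², subset) of the best-fitting subset."""
    p = X.shape[1]
    best = {}
    for size in range(1, p + 1):
        for subset in combinations(range(p), size):
            r2 = r_squared(X[:, list(subset)], y)
            if size not in best or r2 > best[size][0]:
                best[size] = (r2, subset)
    return best

# Demo on synthetic data where y truly depends on x0 and x2 only.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 4))
y = 2 * X[:, 0] - X[:, 2] + rng.normal(scale=0.5, size=40)
for size, (r2, subset) in sorted(best_subsets(X, y).items()):
    print(size, subset, round(r2, 3))
```

For more than about 20 candidate predictors, exhaustive enumeration becomes infeasible and branch-and-bound algorithms are used instead.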
## What is used to judge “best”?

- R-square
- MSE (or S, the square root of MSE)
- Mallows’ Cp
## R-square

$$R^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO}$$

Use the R-square values to find the point where adding more predictors is not worthwhile because it leads to only a very small increase in R-square.
 n  1  SSE          n 1 
 n  p  SSTO   1   SSTO  MSE
R  1 
2
a            
                          

Adjusted R-square increases only if MSE decreases,
so adjusted R-square and MSE provide equivalent
information.
Find a few subsets for which MSE is smallest (or
adjusted R-square is largest) or so close to the
smallest (largest) that adding more predictors is not
worthwhile.
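A quick numerical check of this equivalence, using illustrative SSE and SSTO values only loosely consistent with the cement output (the specific numbers are made up for illustration):

```python
# Two algebraically equivalent forms of adjusted R-square.
def adj_r2_from_sse(n, p, sse, ssto):
    return 1 - ((n - 1) / (n - p)) * (sse / ssto)

def adj_r2_from_mse(n, mse, ssto):
    return 1 - (n - 1) * mse / ssto

n, ssto = 13, 2715.76                                 # illustrative SSTO
for p, sse in [(2, 883.87), (3, 74.76), (4, 47.97)]:  # illustrative SSE values
    mse = sse / (n - p)
    a = adj_r2_from_sse(n, p, sse, ssto)
    b = adj_r2_from_mse(n, mse, ssto)
    print(p, round(a, 4), round(b, 4))  # the two forms agree
```

As SSE (hence MSE) drops, both forms of adjusted R-square rise in lockstep, which is why ranking subsets by smallest MSE and by largest adjusted R-square gives the same ordering.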
## Mallows’ Cp criterion

Mallows’ Cp statistic

$$C_p = \frac{SSE_p}{MSE(X_1, \ldots, X_{P-1})} - (n - 2p)$$

is an estimator of the total standardized mean square error of prediction

$$\Gamma_p = \frac{1}{\sigma^2}\sum_{i=1}^{n} E\left[\hat{Y}_{ip} - E(Y_i)\right]^2,$$

which equals

$$\Gamma_p = \frac{1}{\sigma^2}\left\{\sum_{i=1}^{n}\left[E(\hat{Y}_{ip}) - E(Y_i)\right]^2 + \sum_{i=1}^{n}\mathrm{Var}(\hat{Y}_{ip})\right\}.$$
## Plots of Cp against p

- Models with little bias will tend to fall near the line Cp = p.
- Models with substantial bias will tend to fall considerably above the line Cp = p.
- Cp values below the line Cp = p are interpreted as showing no bias (falling below the line due to sampling error).
## Using the Cp criterion

- Subsets with small Cp values have a small total (standardized) mean square error of prediction.
- When the Cp value is also near p, the bias of the regression model is small.
- So, identify subsets of predictors for which:
  - the Cp value is small, and
  - the Cp value is near p (if possible).
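A sketch of the Cp computation (numpy assumed; function names are my own). Note that the MSE in the denominator comes from the full model containing all candidate predictors, and that the full model therefore always has Cp = p exactly.

```python
import numpy as np
from itertools import combinations

def sse(X, y):
    """Error sum of squares of the OLS fit of y on X (intercept included)."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return float(np.sum((y - Xd @ beta) ** 2))

def mallows_cp(X, y, subset):
    """Mallows' Cp for a subset model, with MSE taken from the full model."""
    n, k = X.shape
    mse_full = sse(X, y) / (n - k - 1)   # MSE(X1, ..., X_{P-1})
    p = len(subset) + 1                  # parameters, including the intercept
    return sse(X[:, list(subset)], y) / mse_full - (n - 2 * p)

# Demo: y truly depends on x0 and x1, so subsets omitting them carry bias.
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=30)
for size in range(1, 4):
    for subset in combinations(range(3), size):
        print(subset, round(mallows_cp(X, y, subset), 2))
```

In the printed output, biased subsets (those omitting x0 or x1) land far above p, while unbiased ones land near p, mirroring the Cp-versus-p plot described above.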
## Best Subsets Regression: y versus x1, x2, x3, x4

    Response is y

    (X columns after S indicate, in order: x1, x2, x3, x4)

    Vars   R-Sq     R-Sq(adj)       C-p         S

    1   67.5         64.5      138.7    8.9639           X
    1   66.6         63.6      142.5    9.0771        X
    2   97.9         97.4        2.7    2.4063      X X
    2   97.2         96.7        5.5    2.7343      X     X
    3   98.2         97.6        3.0    2.3087      X X   X
    3   98.2         97.6        3.0    2.3121      X X X
    4   98.2         97.4        5.0    2.4460      X X X X
## Example: Modeling PIQ

[Scatterplot matrix of PIQ, MRI, Height, and Weight; tick labels: PIQ 91.5–130.5, MRI 86.283–100.728, Height 65.75–73.25, Weight 127.5–170.5]
## Stepwise Regression: PIQ versus MRI, Height, Weight

    Alpha-to-Enter: 0.15  Alpha-to-Remove: 0.15
    Response is PIQ on 3 predictors, with N = 38

    Step          1         2
    Constant      4.652   111.276

    MRI           1.18      2.06
    T-Value       2.45      3.77
    P-Value      0.019     0.001

    Height                 -2.73
    T-Value                -2.75
    P-Value                0.009

    S              21.2     19.5
    R-Sq          14.27    29.49
    C-p             7.3      2.0
## Best Subsets Regression: PIQ versus MRI, Height, Weight

    Response is PIQ

    (X columns after S indicate, in order: MRI, Height, Weight)

    Vars   R-Sq    R-Sq(adj)        C-p         S

    1   14.3         11.9        7.3    21.212   X
    1    0.9          0.0       13.8    22.810     X
    2   29.5         25.5        2.0    19.510   X X
    2   19.3         14.6        6.9    20.878   X   X
    3   29.5         23.3        4.0    19.794   X X X
    The regression equation is
    PIQ = 111 + 2.06 MRI - 2.73 Height

    Predictor        Coef        SE Coef           T       P
    Constant       111.28          55.87        1.99   0.054
    MRI            2.0606         0.5466        3.77   0.001
    Height        -2.7299         0.9932       -2.75   0.009

    S = 19.51         R-Sq = 29.5%       R-Sq(adj) = 25.5%

    Analysis of Variance

    Source       DF       SS           MS        F        P
    Regression    2     5572.7       2786.4     7.32    0.002
    Error        35    13321.8        380.6
    Total        37    18894.6

    Source       DF        Seq SS
    MRI           1        2697.1
    Height        1        2875.6
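For illustration, the fitted equation above can be used to predict PIQ. The input values below are hypothetical, chosen only to show the arithmetic; they are not from the study data.

```python
# Fitted model from the output above: PIQ = 111.28 + 2.0606*MRI - 2.7299*Height
def predict_piq(mri, height):
    return 111.28 + 2.0606 * mri - 2.7299 * height

# Hypothetical subject with MRI = 90 and Height = 70.
print(round(predict_piq(90, 70), 1))
```

The negative Height coefficient means that, holding MRI fixed, the predicted PIQ decreases as Height increases.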
## Example: Modeling BP

[Scatterplot matrix of BP, Age, Weight, BSA, Duration, Pulse, and Stress; tick labels: BP 110–120, Age 47.75–53.25, Weight 89.375–97.325, BSA 1.875–2.125, Duration 4.425–8.275, Pulse 65.5–72.5, Stress 30.75–76.25]
## Stepwise Regression: BP versus Age, Weight, BSA, Duration, Pulse, Stress

    Alpha-to-Enter: 0.15  Alpha-to-Remove: 0.15
    Response is BP on 6 predictors, with N = 20

    Step         1         2         3
    Constant     2.205   -16.579   -13.667

    Weight       1.201    1.033     0.906
    T-Value      12.92    33.15     18.49
    P-Value      0.000    0.000     0.000

    Age                   0.708     0.702
    T-Value               13.23     15.96
    P-Value               0.000     0.000

    BSA                               4.6
    T-Value                          3.04
    P-Value                         0.008

    S             1.74    0.533     0.437
    R-Sq         90.26    99.14     99.45
    C-p          312.8     15.1       6.4
## Best Subsets Regression: BP versus Age, Weight, ...

    Response is BP

    (X columns after S indicate, in order: Age, Weight, BSA, Duration, Pulse, Stress)

    Vars   R-Sq    R-Sq(adj)        C-p         S

    1   90.3         89.7     312.8     1.7405    X
    1   75.0         73.6     829.1     2.7903         X
    2   99.1         99.0      15.1    0.53269   X X
    2   92.0         91.0     256.6     1.6246     X           X
    3   99.5         99.4       6.4    0.43705   X X   X
    3   99.2         99.1      14.1    0.52012   X X       X
    4   99.5         99.4       6.4    0.42591   X X   X X
    4   99.5         99.4       7.1    0.43500   X X   X     X
    5   99.6         99.4       7.0    0.42142   X X   X   X X
    5   99.5         99.4       7.7    0.43078   X X   X X X
    6   99.6         99.4       7.0    0.40723   X X   X X X X
    The regression equation is
    BP = - 13.7 + 0.702 Age + 0.906 Weight + 4.63 BSA

    Predictor        Coef     SE Coef          T        P
    Constant      -13.667       2.647      -5.16    0.000
    Age           0.70162     0.04396      15.96    0.000
    Weight        0.90582     0.04899      18.49    0.000
    BSA             4.627       1.521       3.04    0.008

    S = 0.4370      R-Sq = 99.5%     R-Sq(adj) = 99.4%

    Analysis of Variance

    Source       DF      SS        MS        F       P
    Regression    3   556.94    185.65   971.93   0.000
    Error        16     3.06      0.19
    Total        19   560.00

    Source       DF        Seq SS
    Age           1        243.27
    Weight        1        311.91
    BSA           1          1.77
## Stepwise regression in Minitab

- Stat >> Regression >> Stepwise ...
- Specify the response and all possible predictors.
- If desired, specify predictors that must be included in every model.
- Select OK. Results appear in the session window.
## Best subsets regression in Minitab

- Stat >> Regression >> Best subsets ...
- Specify the response and all possible predictors.
- If desired, specify predictors that must be included in every model.
- Select OK. Results appear in the session window.
