Multiple Regression


```
Multiple regression
 Typically, we want to use more than a single predictor
(independent variable) to make predictions

 Regression with more than one predictor is called “multiple
regression”

y_i = β0 + β1 x1i + β2 x2i + ... + βp xpi + εi
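As a concrete illustration of fitting such a model, here is a minimal ordinary-least-squares sketch in Python. The data are synthetic and every name in it is illustrative, not from the bank data set; it only shows where the coefficient estimates come from.

```python
# Minimal OLS sketch: fit y = b0 + b1*x1 + b2*x2 by solving the
# normal equations (X'X) b = X'y with plain Gaussian elimination.
# All data below are synthetic, generated from known coefficients.

def ols(X, y):
    """Ordinary least squares for a small design matrix X (with intercept column)."""
    k = len(X[0])
    # Build X'X and X'y.
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    # Forward elimination with partial pivoting.
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    beta = [0.0] * k
    for r in range(k - 1, -1, -1):
        beta[r] = (b[r] - sum(A[r][c] * beta[c] for c in range(r + 1, k))) / A[r][r]
    return beta

# Five observations of (intercept, x1, x2), with y = 2 + 3*x1 - 1*x2 exactly,
# so OLS recovers the coefficients with no error term.
X = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [1.0, 2.0, 1.0], [1.0, 1.0, 3.0]]
y = [2 + 3 * x1 - 1 * x2 for (_, x1, x2) in X]
beta = [round(v, 6) for v in ols(X, y)]
print(beta)  # [2.0, 3.0, -1.0]
```

In practice a statistics package does this, and also produces the standard errors, t ratios, and p-values that appear later in these slides.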


Motivating example: Sex discrimination in
wages
 In the 1970s, Harris Trust and Savings Bank was sued for
discrimination on the basis of sex.
 Analysis of salaries of employees of one type (skilled, entry-level
clerical) was presented as evidence by the defense.
 Did female employees tend to receive lower starting salaries
than similarly qualified and experienced male employees?
Variables collected
 93 employees on data file (61 female, 32 male).

   bsal: Annual salary at time of hire.
   sal77 : Annual salary in 1977.
   educ: years of education.
   exper: months previous work prior to hire at bank.
   fsex: 1 if female, 0 if male
   senior: months worked at bank since hired
   age: age in months

 So we have six x’s and one y (bsal). However, in what follows
we won’t use sal77.
Comparison for males and females
 This shows men started at higher salaries than women
(t = 6.3, p < .0001).
 But it doesn’t control for other characteristics.

[Figure: Oneway Analysis of bsal By fsex; boxplots of starting salary (4000 to 8000) for female and male employees.]
Relationships of bsal with other variables
 Senior and education predict bsal well. We want to
control for them when judging the gender effect.
[Figure: Bivariate fits of bsal (4000 to 8000) versus senior, age, educ, and exper, each with a linear fit.]
Multiple regression model
 For any combination of values of the predictor variables, the
average value of the response (bsal) lies on a straight line:

bsal_i = β0 + β1 fsex_i + β2 senior_i + β3 age_i + β4 educ_i + β5 exper_i + ε_i

 Just like in simple regression, assume that ε follows a normal
curve within any combination of predictors.
Output from regression
(fsex = 1 for females, = 0 for males)

Response bsal

Summary of Fit
RSquare                        0.515156
Root Mean Square Error         508.0906
Mean of Response               5420.323
Observations (or Sum Wgts)           93

Analysis of Variance
Source      DF    Sum of Squares    Mean Square    F Ratio    Prob > F
Model        5          23863715        4772743    18.4878      <.0001
Error       87          22459575         258156
C. Total    92          46323290

Parameter Estimates
Term          Estimate     Std Error    t Ratio    Prob>|t|
Intercept    6277.8934     652.2713       9.62       <.0001
fsex         -767.9127     128.97        -5.95       <.0001
senior        -22.5823       5.295732    -4.26       <.0001
age             0.6309603    0.720654     0.88       0.3837
educ           92.306023    24.86354      3.71       0.0004
exper           0.5006397    1.055262     0.47       0.6364

Effect Tests
Source    Nparm    DF    Sum of Squares     F Ratio    Prob > F
fsex          1     1         9152264.3     35.4525      <.0001
senior        1     1         4694256.3     18.1838      <.0001
age           1     1          197894.0      0.7666      0.3837
educ          1     1         3558085.8     13.7827      0.0004
exper         1     1           58104.8      0.2251      0.6364

[Figures: Actual by Predicted plot (bsal actual vs. predicted, RSq = 0.52, RMSE = 508.09, P < .0001) and Residual by Predicted plot (bsal residuals vs. predicted values from 4000 to 8000).]
Predictions
 Example: Prediction of beginning wages for a woman with
10 months of seniority, who is 25 years old (age = 300 months),
with 12 years of education, and two years (24 months) of
experience:

bsal_i = β0 + β1 fsex_i + β2 senior_i + β3 age_i + β4 educ_i + β5 exper_i + ε_i

 Pred. bsal = 6277.9 - 767.9*1 - 22.6*10 + .63*300 + 92.3*12 + .50*24
            = 6592.6
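The arithmetic in this prediction can be checked directly. A short sketch using the rounded coefficients from the regression output (age and exper are measured in months, so 25 years is 300 months and two years of experience is 24 months):

```python
# Plug the example's values into the fitted equation.
# Coefficients are the rounded estimates from the regression output.
intercept = 6277.9
coef = {"fsex": -767.9, "senior": -22.6, "age": 0.63, "educ": 92.3, "exper": 0.50}
x = {"fsex": 1, "senior": 10, "age": 300, "educ": 12, "exper": 24}
pred = intercept + sum(coef[name] * x[name] for name in coef)
print(round(pred, 1))  # 6592.6
```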
Interpretation of coefficients in multiple
regression
 Each estimated coefficient is the amount Y is expected to increase when the
value of its corresponding predictor is increased by one, holding
constant the values of the other predictors.

 Example: estimated coefficient of education equals 92.3.

For each additional year of education of employee, we expect salary to
increase by about 92 dollars, holding all other variables constant.

 Estimated coefficient of fsex equals -767.

For employees who started at the same time, had the same education
and experience, and were the same age, women earned $767 less on
average than men.
Which variable is the strongest predictor of
the outcome?
 The predictor with the strongest linear association with the
outcome variable is the one with the largest absolute value of t,
which equals the coefficient divided by its SE.
 It is not the one with the largest coefficient: coefficient size is
sensitive to the scales of the predictors. The t statistic is not,
since it is a standardized measure.
 Example: In the wages regression, seniority is a better predictor
than education because it has a larger |t|.
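A quick sketch of that comparison, using the rounded estimates and standard errors from the regression output:

```python
# |t| = |estimate / SE| puts predictors on a common, scale-free footing.
# (Estimate, SE) pairs are rounded values from the output above.
estimates = {"senior": (-22.58, 5.30), "educ": (92.31, 24.86)}
t = {name: est / se for name, (est, se) in estimates.items()}
strongest = max(t, key=lambda name: abs(t[name]))
print(strongest, round(abs(t[strongest]), 2))  # senior 4.26
```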
Hypothesis tests for coefficients
 The reported t-stats (coef. / SE) and p-values are used to test whether a
particular coefficient equals 0, given that all other coefficients are in the
model.

 Examples:

1) Test whether coefficient of education equals zero has p-value = .0004.
Hence, reject the null hypothesis; it appears that education is a useful
predictor of bsal when all the other predictors are in the model.

 2) Test whether coefficient of experience equals zero has p-value =
.6364. Hence, we cannot reject the null hypothesis; it appears that
experience is not a particularly useful predictor of bsal when all other
predictors are in the model.
Hypothesis tests for coefficients
 The test statistics have the usual form
 (observed – expected)/SE.

 For p-value, use area under a t-curve with
(n-k) degrees of freedom, where k is the number of terms in
the model.

 In this problem, the degrees of freedom equal (93-6=87).
CIs for regression coefficients
 A 95% CI for the coefficients is obtained in the usual way:

coef. ± (multiplier) SE

 The multiplier is obtained from the t-curve with (n-k) degrees of
freedom. (If the degrees of freedom are greater than 26, use the
normal table.)
 Example: A 95% CI for the population regression coefficient of
age equals:
(0.63 – 1.96*0.72, 0.63 + 1.96*0.72)
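This calculation as a sketch, with the normal-table multiplier 1.96 the slide uses (the degrees of freedom here exceed 26):

```python
# 95% CI for the age coefficient: estimate +/- multiplier * SE.
est, se, mult = 0.63, 0.72, 1.96
lo, hi = est - mult * se, est + mult * se
print(round(lo, 2), round(hi, 2))  # -0.78 2.04
```

Note that the interval contains 0, consistent with the non-significant p-value (.3837) for age in the regression output.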
 Hypothesis tests and CIs are meaningful only when the data fits
the model well.

 Remember, when the sample size is large enough, you will
probably reject any null hypothesis of β=0.
 When the sample size is small, you may not have enough evidence
to reject a null hypothesis of β=0.

 When you fail to reject a null hypothesis, don’t be too hasty to say
that a predictor has no linear association with the outcome. It is
likely that there is some association; it just isn’t a very strong one.
Checking assumptions
 Plot the residuals versus the predicted values from the
regression line.

 Also plot the residuals versus each of the predictors.

 If there are non-random patterns in these plots, the assumptions
might be violated.
Plot of residuals versus predicted values
 This plot has a fan shape.
 It suggests non-constant variance (heteroscedasticity).
 We need to transform variables.

[Figure: Residual by Predicted plot for the sal77 regression; residuals spread from about -3000 to 5000, fanning out as predicted sal77 increases from 7000 to 17000.]
Plots of residuals vs. predictors

[Figure: Plots of bsal residuals (about -1000 to 1500) versus each predictor: senior, age, educ, exper, and fsex.]
Summary of residual plots
 There appears to be a non-random pattern in the plot of
residuals versus experience, and also versus age.

 This model can be improved.
Modeling categorical predictors
 When predictors are categorical and assigned numbers,
regressions using those numbers make no sense.

 Instead, we make “dummy variables” to stand in for the
categorical variables.
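A minimal sketch of constructing dummy variables; the job titles below are invented purely for illustration:

```python
# One 0/1 dummy column per category, dropping one reference category
# (otherwise the dummy columns would sum to the intercept column and be collinear).
jobs = ["clerk", "teller", "clerk", "manager", "teller"]
categories = sorted(set(jobs))      # ['clerk', 'manager', 'teller']
reference = categories[0]           # 'clerk' becomes the baseline
dummies = [{c: int(j == c) for c in categories if c != reference} for j in jobs]
print(dummies[1])  # {'manager': 0, 'teller': 1}
```

Each coefficient on a dummy is then interpreted relative to the dropped baseline category, just as the fsex coefficient is interpreted relative to males.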
Collinearity
 When predictors are highly correlated, standard errors are
inflated.
 Conceptual example:
 Suppose two variables Z and X are exactly the same.
 Suppose the population regression line of Y on X is
Y = 10 + 5X

 Fit a regression using sample data of Y on both X and Z. We
could plug in any value for the coefficients of X and Z, so long
as they add up to 5. Equivalently, this means that the standard
errors for the coefficients are huge.
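The conceptual example can be seen numerically. A sketch, assuming Z is an exact copy of X: the X'X matrix of the normal equations is then singular (zero determinant), so the coefficients are not uniquely determined and their standard errors blow up:

```python
# Design matrix with columns: intercept, X, and Z, where Z duplicates X exactly.
xs = [1.0, 2.0, 3.0, 4.0]
X = [[1.0, x, x] for x in xs]
k = 3
A = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]  # X'X

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

print(det3(A))  # 0.0, so (X'X)b = X'y has no unique solution
```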
General warnings for multiple
regression
 Be even more wary of extrapolation. Because there are
several predictors, you can extrapolate in many ways.

 Multiple regression shows association. It does not prove
causality. Only a carefully designed observational study or
randomized experiment can show causality.

```