                        Chapter 12
                Multiple Regression Analysis

 The basic ideas are the same as in Chapters 3 & 11.
 We have one response (dependent) variable, Y.
 The response (Y) is a quantitative variable.

 • There is more than one predictor (independent
   variable): X1, X2, …, Xp, where p = the number of
   predictors in the model.
     o The predictors can be:
           - Quantitative (as before)
           - Categorical (new)
           - Interaction terms (products of predictors)
           - Powers of predictors (e.g., X4², the square of X4)




        In this course we will concentrate on
           o Reading computer output
           o Interpreting coefficients
           o Determining the order in which to interpret things.




                        Some Examples

Example – 1: Suppose we want to predict temperature
for different cities, based on their latitude and
elevation.

In this case, the response and the predictors are
     Y = temperature
     X1 = Latitude
     X2 = Elevation

Possible models are
With p = 2:  y = α + β1x1 + β2x2 + ε                 (stiff surface)
With p = 3:  y = α + β1x1 + β2x2 + β3x1x2 + ε        (twisted surface)

Example – 2: We want to predict patients’ “well-
being” from the dosage of medicine they take (mg.)
using a quadratic model:
              y = α + β1x + β2x² + ε
Here X = Dosage of the active ingredient (in mg’s),
and p = 2.
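
As a rough illustration of how such a quadratic model could be fit by
least squares outside of Minitab (the dosage and well-being numbers
below are made up purely for illustration, and the variable names are
my own), here is a minimal sketch in Python:

    import numpy as np

    # Hypothetical dosage (mg) and well-being scores -- made-up values for illustration only.
    dose = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])
    wellbeing = np.array([2.1, 3.8, 4.9, 5.2, 4.8, 3.9])

    # Design matrix for y = alpha + beta1*x + beta2*x^2 (p = 2 predictors: x and x^2).
    X = np.column_stack([np.ones_like(dose), dose, dose**2])

    # Least-squares estimates a, b1, b2 of alpha, beta1, beta2.
    coef, _, _, _ = np.linalg.lstsq(X, wellbeing, rcond=None)
    a, b1, b2 = coef
    print(f"Fitted equation: y-hat = {a:.3f} + {b1:.4f} x + {b2:.5f} x^2")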




Example – 3: Suppose we want to predict
         Y = the highway mileage of a car using
         X1= its city mileage and
         X2= its size (a categorical variable) where,
               X2 = 0 if the car is compact
               X2 = 1 if the car is larger


The model we may use is
              y = α + β1x1 + β2x2 + β3(x1x2) + ε
Note that the last term, β3(x1x2), is for interaction, which
allows for NON-parallel lines.




In general terms:
                          The model:
              y = α + β1x1 + β2x2 + … + βpxp + ε

Assumptions:
 1) ε ~ N(0, σ)  [error terms are iid normal with
    mean zero and constant standard deviation σ].

 2) As a result of this, we have Y ~ N(µY, σ) for
    every combination of x1, x2, …, xp. That is, the
    response (Y) has a normal distribution with mean
    µY (which depends on the values of the independent
    variables, the x's) and a constant standard deviation
    σ (which does not depend on the values of the x's).


We use data to find the
                  Fitted Equation or
                 Prediction Equation
              ŷ = a + b1x1 + b2x2 + … + bpxp




ANOVA F-test:
Overall test of “goodness” of the model
Ho: β1 = β2 = β3 = … = βp = 0     NOTHING GOOD in the model
Ha: at least one of the β's ≠ 0   SOMETHING IS GOOD.

Test Statistic:  F = MSReg / MSE

P-Value from the tables of the F-distribution with
   df1 = p         = degrees of freedom of MSReg
   df2 = n – p – 1 = degrees of freedom of MSE

        ANOVA for the Multiple Regression Model

Source               df           SS      MS                        F
Regression (Model)   p            SSReg   MSReg = SSReg / p         F = MSReg / MSE
Residual (Error)     n – p – 1    SSE     MSE = SSE / (n – p – 1)
Total                n – 1        SST




Testing for Individual β's:

                        Computer output from Minitab:

Regression Analysis: Y versus X1, X2, …, Xp

Predictor    Coef      SE Coef      T               P
Constant     a         SE(a)        a / SE(a)       .
   X1        b1        SE(b1)       b1 / SE(b1)     .
   X2        b2        SE(b2)       b2 / SE(b2)     .
   ...       ...       ...          ...             ...
   Xp        bp        SE(bp)       bp / SE(bp)     .

(The Coef column gives the estimates of the βi's, SE Coef gives the
standard errors of those estimates, T is the test statistic for
Ho: βi = 0 vs. Ha: βi ≠ 0, and P is the two-sided p-value.)

Look at the p-value for each β:

 • If the p-value for βi is small, then Xi is good.

 • If the p-value for βi is large, then the independent
   variable Xi is NOT ADDING any information to
   the model AFTER all other predictors are taken
   into account.




Example - 1:
   Let Y = height of a person,
   X1 = Length of right arm,
   X2 = Length of left arm.

1. Suppose after collecting data we obtained an
   ANOVA table with a small p-value. What does
   that mean?

2. What is the next step?

3. Let's say you carried out individual t-tests on each
   of the slopes, β1 and β2, and found that the p-values
   for both are large. What does that mean?

4. Can you see a contradiction here?

5. When do we get such contradictory results?

6. So, when do we have multicollinearity?




Example – 2: Suppose we are interested in predicting
the GPA of students in college (CGPA) using 16
different predictor variables. Data were collected from
a random sample of 59 college students.

1. What is the response variable in this problem?

2. What are the values of n and p?

3. What are Ho and Ha that you can test using the
   ANOVA table?

4. What is your decision, based on the following
   ANOVA table? What is your conclusion?

         Analysis of Variance

         Source           DF   SS       MS       F      P
         Regression       16   3.3135   0.2071   1.99   0.037
         Residual Error   42   4.3601   0.1038
         Total            58   7.6736

5. What is the next step?

6. When do you NOT take the next step?




Now look at the following output from Minitab:


Regression Analysis: CGPA versus Height, Gender, ...

The regression equation is
CGPA = 0.53
         + 0.0194 Height       + 0.047 Gender
         – 0.00163 Haircut     – 0.042 Job
         + 0.0004 Studytime – 0.375 Smokecig
         + 0.0488 Dated        + 0.546 HSGPA
         + 0.00315 HomeDist + 0.00069 BrowseInternet
         – 0.00128 WatchTV – 0.0117 Exercise
         + 0.0140 ReadNwsP + 0.039 Vegan
         – 0.0139 PoliticalDeg – 0.0801 PoliticalAff

7. Can you make any decisions based on the above?
   Why or why not?




8. The following is another part of the Minitab output.
   Which predictor(s) is/are “good?”

Predictor                      Coef  SE Coef        T             P
Constant                      0.532     1.496    0.36         0.724
Height                      0.01942  0.01637     1.19         0.242
Gender                       0.0468    0.1429    0.33         0.745
Haircut                 – 0.001633 0.001697     –0.96         0.341
Job                        – 0.0418    0.1024   –0.41         0.685
Studytime                   0.00043  0.01921     0.02         0.982
Smokecig                   – 0.3746    0.2249   –1.67         0.103
Dated                       0.04881  0.07111     0.69         0.496
HSGPA                        0.5457    0.1776    3.07         0.004
HomeDist                  0.003147 0.003400      0.93         0.360
BrowseInternet            0.000689 0.001163      0.59         0.557
WatchTV                 –0.0012840 0.0009710    –1.32         0.193
Exercise                 –0.011657 0.005934     –1.96         0.056
ReadNewsP                   0.01395  0.02272     0.61         0.543
Vegan                        0.0392    0.1578    0.25         0.805
PoliticalDegree           –0.01390   0.03185    –0.44         0.665
PoliticalAff              –0.08006   0.07741    –1.03         0.307


S = 0.322198 R-Sq = 43.2% R-Sq(adj) = 21.5%




9. The following is the last part of the output. What
   does it tell us?
Unusual Observations
Obs Height     CGPA     Fit      SE Fit   Residual   St Resid
28   67.0      2.9800   3.5898   0.2442   –0.6098    –2.90R
40   65.0      3.9300   3.3458   0.2176    0.5842     2.46R
59   62.0      2.5000   3.4718   0.1352   –0.9718    –3.32R

R denotes an observation with a large standardized residual.

Although the individual t-tests indicate that high school GPA
(HSGPA) and Exercise have coefficients (βi's) that are
significantly different from zero when tested one at a time,
with p-values of 0.004 and 0.056, respectively (so they look
promising), we should look at all possible combinations of
the 16 predictors so as not to miss any combination
that may give better results. It is almost impossible to
do this by hand, but fortunately computers can do it
for us.

In this way, we can find the “best subset” of
predictors that will give the “best” prediction
equation. The Minitab output on the next page gives
“all” possible subsets of regression models.




         Best Subsets Regression: CGPA versus Height, Gender, ...
         Response is CGPA
                                                                                                             P
                                                                                         B                   o
                                                                                         r                   l
                                                                                         o                   i   P
                                                                                         w                   t   o
                                                                                         s                   i   l
                                                                   S                     e           R       c   i
                                                                   t     S           H   I       E   e       a   t
                                                               H   u     m           o   n   W   x   a       l   i
                                                         H   G a   d     o           m   t   a   e   d       D   c
                                                         e   e i   y     k   D   H   e   e   t   r   N   V   e   a
                                                         i   n r   t     e   a   S   D   r   c   c   e   e   g   l
                                                         g   d c J i     c   t   G   i   n   h   i   w   g   r   A
                                     Mallows             h   e u o m     i   e   P   s   e   T   s   s   a   e   f
         Vars     R-Sq   R-Sq(adj)       C-p         S   t   r t b e     g   d   A   t   t   V   e   P   n   e   f
            1     25.5        24.2       0.1   0.31667                           X
            1     13.0        11.5       9.3   0.34217                                           X
            2     31.6        29.2      -2.4   0.30613                           X               X
            2     29.4        26.9      -0.8   0.31109                           X           X
            3     33.8        30.2      -2.1   0.30389           X               X               X
            3     33.7        30.0      -2.0   0.30423                           X           X   X
            4     35.7        31.0      -1.5   0.30223                   X       X           X   X
            4     35.3        30.5      -1.2   0.30320           X               X           X   X
            5     37.3        31.4      -0.6   0.30132   X               X       X           X   X
            5     37.0        31.1      -0.4   0.30198           X       X       X           X   X
            6     38.3        31.2       0.6   0.30163   X       X       X       X           X   X
            6     38.3        31.2       0.6   0.30164   X               X       X           X   X               X
            7     39.6        31.3       1.7   0.30150   X       X       X       X           X   X               X
            7     39.3        30.9       1.9   0.30231   X               X       X   X       X   X               X
            8     40.4        30.8       3.1   0.30249   X       X       X       X           X   X           X   X
            8     40.4        30.8       3.1   0.30256   X       X       X       X   X       X   X               X
            9     41.5        30.8       4.2   0.30266   X       X       X       X   X       X   X           X   X
            9     41.0        30.2       4.6   0.30395   X       X   X   X       X   X       X   X               X
           10     41.9        29.8       6.0   0.30478   X       X       X   X   X   X       X   X       X       X
           10     41.8        29.7       6.0   0.30492   X       X   X   X       X   X       X   X       X       X
           11     42.2        28.7       7.7   0.30712   X       X   X   X   X   X   X       X   X       X       X
           11     42.2        28.7       7.7   0.30715   X       X       X   X   X   X       X   X   X   X       X
           12     42.6        27.6       9.4   0.30945   X       X       X   X   X   X   X   X   X   X   X       X
           12     42.6        27.6       9.5   0.30954   X       X   X   X   X   X   X   X   X   X       X       X
           13     42.9        26.4      11.2   0.31205   X       X   X   X   X   X   X   X   X   X   X   X       X
           13     42.8        26.3      11.3   0.31229   X   X   X       X   X   X   X   X   X   X   X   X       X
           14     43.1        25.0      13.1   0.31502   X   X   X   X   X   X   X   X   X   X   X   X   X       X
           14     43.0        24.9      13.1   0.31526   X       X   X   X   X   X   X   X   X   X   X X X       X
           15     43.2        23.4      15.0   0.31843   X   X   X   X   X   X   X   X   X   X   X   X X X       X
           15     43.1        23.2      15.1   0.31866   X   X   X   X X X   X   X   X   X   X   X   X   X       X
           16     43.2        21.5      17.0   0.32220   X   X   X   X X X   X   X   X   X   X   X   X X X       X

Observe that
 • R² never goes down when you add predictors to
   the model, whereas
 • adjusted R² will go down when you add new
   predictors that do not add any information to the
   model.
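
For reference, the adjusted R² that Minitab reports follows the
standard formula (the formula itself is not shown in the output):

    R²(adj) = 1 − (1 − R²) · (n − 1) / (n − p − 1)

For the full 16-predictor model in the last row,
1 − (1 − 0.432)(58 / 42) ≈ 0.215, matching the reported 21.5%.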

Note that when there are more than 2 predictors in the
regression model, the adjusted R² does not change
much from the model that has HSGPA and Exercise as
the predictors.

Another consideration in model selection is
“parsimony.” That is, the preferred model is one that is
as simple as possible and with a high adjusted R2.
Thus, it seems that “the best combination” of
predictors is HSGPA and Exercise.

Now we need to work through the above steps again
and see what we can say for a regression model that
has only HSGPA (X1) and Exercise (X2) as predictors.

We obtain the following output from Minitab:




Regression Analysis: CGPA versus HSGPA, Exercise

The regression equation is
CGPA = 1.55 + 0.560 HSGPA - 0.0111 Exercise

Predictor                    Coef   SE Coef         T      P
Constant                   1.5489    0.5551        2.79 0.007
HSGPA                      0.5599    0.1436        3.90 0.000
Exercise                -0.011138   0.004985       –2.23 0.029

S = 0.306126 R-Sq = 31.6% R-Sq(adj) = 29.2%

Analysis of Variance
Source             DF       SS       MS       F       P
Regression          2   2.4256   1.2128   12.94   0.000
Residual (Error)   56   5.2479   0.0937
Total              58   7.6736

First, using ANOVA we test Ho: β1 = β2 = 0 against
Ha: At least one of β1 and β2 is different from zero.

Since the p-value < 0.0005, we reject Ho. The
observed data give strong evidence that at least one of
the two predictors is good in explaining the variation
in CGPA.
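
As a quick check of where these numbers come from (a sketch only; it
simply recomputes the entries of the ANOVA table above), the F
statistic and its p-value can be reproduced from SSReg and SSE:

    from scipy.stats import f

    # Values from the ANOVA table above (CGPA vs. HSGPA and Exercise).
    SSReg, SSE = 2.4256, 5.2479
    p, n = 2, 59                      # 2 predictors, 59 students

    MSReg = SSReg / p                 # 1.2128
    MSE = SSE / (n - p - 1)           # about 0.0937
    F = MSReg / MSE                   # about 12.94
    p_value = f.sf(F, p, n - p - 1)   # upper-tail area of the F(2, 56) distribution

    print(round(F, 2), p_value)       # the p-value is far below 0.0005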

Next we carry out two separate t-tests:
   Ho: β1 = 0 vs. Ha: β1 ≠ 0 and
   Ho: β2 = 0 vs. Ha: β2 ≠ 0



Using the above output we reject both of the null
hypotheses, with p-value < 0.0005 for HSGPA and
p-value = 0.029 for Exercise. These decisions indicate
that both of the predictors are "good" ones.

               Analyses of Residuals:
Before we move on, we need to look at the last part of
the output, which gives us some warning messages,
based on an analysis of residuals:
Unusual Observations
Obs HSGPA     CGPA   Fit            SE Fit   Residual    St Resid
  3   3.00 3.6000 3.2176            0.1297     0.3824    1.38 X
  9   3.50 2.8800 3.4808            0.0642    -0.6008   -2.01R
 14   3.30 2.6000 2.7284            0.2647    -0.1284   -0.83 X
 27   2.55 3.1400 2.9099            0.1840     0.2301    0.94 X
 28   3.80 2.9800 3.6544            0.0445    -0.6744   -2.23R
 59   3.60 2.5000 3.5424            0.0556    -1.0424   -3.46R

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large
influence.

The above output flags 3 observations as influential (X)
and 3 observations with large standardized residuals (R)
that may also need attention. Let's look at some graphs
of the residuals to see what is happening:

[Figure: Residual Plots for CGPA – four panels: Normal Probability
Plot of the Residuals, Residuals Versus the Fitted Values, Histogram
of the Residuals, and Residuals Versus the Order of the Data (all in
terms of standardized residuals).]

 • In the first panel (top-left) we see that all but
   one of the residuals are close to the blue line,
   indicating that the assumption of normality of the
   residuals is supported by the data. The lowest dot
   on the left-hand side of this graph is the outlier or
   influential observation.
 • The second panel (top-right) shows that the
   standardized residuals are randomly scattered
   around the horizontal line (residual = 0), and all
   except one (the outlier) are within 3 standard
   deviations of the mean, i.e., zero (as expected).
 • The third panel (bottom-left) is a histogram of the
   standardized residuals and supports that the
   residuals have a normal distribution with zero
   mean and some constant variance.
 • Finally, the last graph does not show any funnel
   shape, so the assumption of constant variance is
   supported. We can still see the outlier(s).
In order to see if including a quadratic or higher order
term of one or both of the predictors might improve
the model, we look at a scatter diagram of the
standardized residuals vs. the predictors. These are
given below:
[Figure: Residual Plots for CGPA – Residuals Versus HSGPA and
Residuals Versus Exercise (standardized residuals; response is CGPA).]




We do not see any higher-order relation between the
residuals and the predictors, so we cannot improve the
model by adding quadratic or higher-order terms of
these predictors.

However, there is at least one observation that needs to
be checked and corrected if possible, or removed from
the data set otherwise.

The question is which one of the observations should
we look at and delete first (if we cannot find the
reason why it has such a large residual)?

[We delete observations one at a time because things
may change after deleting one observation.]

The easiest way is to look at the plot of the residuals
against the order of the observations.

[Figure: Residuals Versus the Order of the Data (response is CGPA;
standardized residuals).]




We immediately see that the last observation (#14 in
the data set) has the largest standardized residual and
hence we should start with that.

We see that this student exercises for 60 hours per
week and hence is far from the others in the data set. The
student who has the nearest X2 value exercises for 25
hours. Other students exercise for 15 hours per week
or less. Thus, this student is not typical at all. The
following is the Minitab output when the observations
for this student are deleted.

Regression Analysis: CGPA versus HSGPA, Exercise

         Analysis of Variance

         Source         DF              SS           MS           F      P
         Regression      2          1.45009       0.72504        7.69 0.001
         Residual Error 55          5.18265       0.09423
         Total          57          6.63274

Looking at the ANOVA table we decide to reject Ho
and conclude that at least one of the two predictors is
“good.” [Note the change in the degrees of freedom in
ANOVA. Why should they change?]
         The regression equation is
         CGPA = 1.54 + 0.554 HSGPA - 0.00432 Exercise

         Predictor   Coef           SE Coef                T           P
         Constant   1.5388           0.5568               2.76        0.008
         HSGPA      0.5542           0.1441               3.85        0.000
         Exercise -0.004320         0.009596             -0.45        0.654

         S = 0.306969        R-Sq = 21.9%         R-Sq(adj) = 19.0%


Tests on the β's one at a time show that the second
predictor (Exercise) is not "good," since the
corresponding p-value = 0.654 is larger than any
reasonable α. This means we should try a new model
without Exercise. Because we are going to change the
model, we do not need to do anything based on the
rest of the output.
         Unusual Observations
         Obs HSGPA     CGPA   Fit    SE Fit   Residual   St Resid
           3   3.00 3.6000 3.1970    0.1324     0.4030       1.45 X
          25   3.50 3.3100 3.3705    0.1974    -0.0605      -0.26 X
          26   2.55 3.1400 2.9261    0.1856     0.2139       0.87 X
          27   3.80 2.9800 3.6361    0.0497    -0.6561      -2.17R
          58   3.60 2.5000 3.5252    0.0594    -1.0252      -3.40R

         R denotes an observation with a large standardized residual.
         X denotes an observation whose X value gives it large influence.


 Here is the output for the SLR model with HSGPA as
the predictor:
Regression Analysis: CGPA versus HSGPA

The regression equation is
CGPA = 1.50 + 0.560 HSGPA

Predictor                 Coef    SE Coef      T       P
Constant                1.4964     0.5448   2.75   0.008
HSGPA                   0.5596     0.1426   3.92   0.000

S = 0.304776                R-Sq = 21.6%      R-Sq(adj) = 20.2%

Analysis of Variance

Source                       DF       SS        MS       F       P
Regression                    1   1.4310    1.4310   15.41   0.000
Residual Error               56   5.2017    0.0929
Total                        57   6.6327

Both panels show that β1 is significantly different from
zero and hence we have a "reasonably good" model.
[It is not really a good model, why?]

Next we should look at the unusual-observations panel of the
output to see if we want to delete a few more
observations to improve the model.

We may also try to find other predictors so as to
improve R², which is around 20%; i.e., only about 20% of the
variation in CGPA is explained by changes in
HSGPA. Alternatively, we can say that HSGPA has
reduced the error sum of squares by only 20%.

                        Categorical Variables in MLR

Categorical variables (in multiple linear regression)
are coded as 0 and 1. They are called dummy
variables or indicator variables. When we want to
compare a group of observations with a baseline
group or a control group, we code the dummy
variable as zero for that group.

Example: Suppose we want to predict the wages of
employees using the length of service. Thus we have a
quantitative response ( Y = Wages) and a quantitative
predictor (LOS = Length of service). Of course wages
also depend on the size of the company that employs
these workers, so let’s add another variable and call it
SIZE.

So we have,
    Y = Wages = Response variable
    X1 = LOS = Length of service (predictor 1) and
    X2 = SIZE = Size of company (a categorical
        variable coded small or large).

We will use small companies as the baseline group,
i.e., we will code the two categories of SIZE as
      X2 = 0 if the company is small and
      X2 = 1 if the company is large (not small).

Model:
              y = α + β1X1 + β2X2 + ε
                = β0 + β1X1 + β2X2 + ε

In the above model, when we substitute X2 = 0 we
obtain a SLR model for the small companies:
              y = β0 + β1X1 + ε
Similarly, substituting X2 = 1 gives us another
SLR for the large companies:
              y = β0 + β1X1 + β2(1) + ε
                = (β0 + β2) + β1X1 + ε

Observe that the difference between the two is only in
the intercept: for small companies the intercept is β0,
but for large companies it is (β0 + β2).

However, both models have the same slope β1.
         Thus we have two parallel lines.

Interpretation of the coefficients of the regression model:
         β0: Intercept for the baseline group
         β1: Slope for both groups
         β2: Change (or difference) in intercept for the
             other group compared to the baseline.


The above lines were forced to be parallel by the
choice of model. To allow for non-parallel lines we
will add an interaction term to the model:

   A Multiple Regression Model with Interaction
           y = β0 + β1X1 + β2X2 + β3X1X2 + ε
Now let's see what we get when we substitute 0 or 1
for X2 in the above model:

Small companies: Substitute X2 = 0
         y = β0 + β1X1 + β2(0) + β3X1(0) + ε
           = β0 + β1X1 + ε
     Intercept = β0, Slope = β1.
Large companies: Substitute X2 = 1
         y = β0 + β1X1 + β2(1) + β3X1(1) + ε
           = β0 + β1X1 + β2 + β3X1 + ε
           = (β0 + β2) + (β1 + β3)X1 + ε
         Intercept = β0 + β2, Slope = β1 + β3

Interpretations:
β0: Intercept for the baseline group
β1: Slope of the baseline group
β2: Change in intercept for the other group compared to the baseline
β3: Change in slope for the other group compared to the baseline
Interaction term allows for non-parallel lines.


Steps:
  1. Start with a model that has the interaction term.
  2. Using the ANOVA table, test
      Ho: β1 = β2 = β3 = 0 vs. Ha: At least one βi ≠ 0
  3. Test Ho: β3 = 0 vs. Ha: β3 ≠ 0
       a. If the null hypothesis is not rejected then fit a
          simpler model with no interaction term.
       b. If null hypothesis is rejected then keep the
          interaction term in the model.

What if there are 3 levels for the categorical predictor?
Suppose we have 3 categories of SIZE (small, medium
and large) in addition to the quantitative predictor LOS
to predict the wages.
Then we need two dummy variables for SIZE:

     X2 = 1 if medium, 0 otherwise      and      X3 = 1 if large, 0 otherwise
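
A minimal sketch (assuming the data sit in a pandas DataFrame with a
SIZE column; the column and variable names here are hypothetical, not
from the course data) of how these two indicator variables could be
created:

    import pandas as pd

    # Hypothetical data frame; SIZE takes the values "small", "medium", "large".
    df = pd.DataFrame({"SIZE": ["small", "medium", "large", "medium"]})

    # Small companies are the baseline, so they get 0 on both dummies.
    df["X2"] = (df["SIZE"] == "medium").astype(int)   # 1 if medium, 0 otherwise
    df["X3"] = (df["SIZE"] == "large").astype(int)    # 1 if large, 0 otherwise
    print(df)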

Now let’s first look at a MLR model with no
interaction terms:




                MLR Model without Interaction
           y = β0 + β1X1 + β2X2 + β3X3 + ε

Small companies: X2 = 0, X3 = 0
         y = β0 + β1X1 + β2(0) + β3(0) + ε
           = β0 + β1X1 + ε

Medium companies: X2 = 1, X3 = 0
         y = β0 + β1X1 + β2(1) + β3(0) + ε
           = (β0 + β2) + β1X1 + ε

Large companies: X2 = 0, X3 = 1
         y = β0 + β1X1 + β2(0) + β3(1) + ε
           = (β0 + β3) + β1X1 + ε

Interpretation:
β0: Intercept for the baseline group (small)
β1: Slope for all 3 groups
β2: Change in intercept for medium vs. small
β3: Change in intercept for large vs. small




                MLR Model with Interaction:

  y = β0 + β1X1 + β2X2 + β3X3 + β4X1X2 + β5X1X3 + ε

Small companies: X2 = 0, X3 = 0
  y = β0 + β1X1 + β2(0) + β3(0) + β4X1(0) + β5X1(0) + ε
    = β0 + β1X1 + ε

Medium companies: X2 = 1, X3 = 0
  y = β0 + β1X1 + β2(1) + β3(0) + β4X1(1) + β5X1(0) + ε
    = (β0 + β2) + (β1 + β4)X1 + ε

Large companies: X2 = 0, X3 = 1
  y = β0 + β1X1 + β2(0) + β3(1) + β4X1(0) + β5X1(1) + ε
    = (β0 + β3) + (β1 + β5)X1 + ε

How do you interpret the β's?




Example: Wages vs Length of Service and Size of Company




Coding of size of company:      small = 0   large = 1


 A model with interaction term:
Regression Analysis: Wages versus LOS, size, LOS*size

Analysis of Variance
Source          DF     SS          MS      F       P
Regression       3 2438.1       812.7   6.76   0.001
Residual Error 56 6728.3        120.1
Total           59 9166.4


The above ANOVA table tells us that at least one of
the regression coefficients (β's) is significantly
different from zero, but that is no help, since it does
not say which one(s).
Let’s look at the next panel that gives the estimated
coefficients, test statistics and more:



The regression equation is
Wages = 35.9 + 0.104 LOS + 13.6 size - 0.0483 LOS*size

Predictor                   Coef   SE Coef        T       P
Constant                  35.914     3.562    10.08   0.000
LOS                      0.10424   0.03632     2.87   0.006
size                      13.631     4.910     2.78   0.007
LOS*size                -0.04828   0.05634    -0.86   0.395

S = 10.9612                R-Sq = 26.6%      R-Sq(adj) = 22.7%


Since the test for the coefficient of the interaction term
(LOS*size) has a large p-value, we fail to reject the
hypothesis of no interaction (Ho: β3 = 0); hence we try a
model with no interaction term.
The model with no interaction:
Regression Analysis: Wages versus LOS, size

The regression equation is
Wages = 37.5 + 0.0842 LOS + 10.2 size

Predictor                  Coef    SE Coef       T        P
Constant                 37.466      3.061   12.24    0.000
LOS                     0.08417    0.02770    3.04    0.004
size                     10.228      2.882    3.55    0.001

S = 10.9357                R-Sq = 25.6%      R-Sq(adj) = 23.0%


In the above output we see that both coefficients are
significantly different from zero, so this is the model
we want (or is it?).
We decide to keep both variables in the model.
However, because adjusted R² = 23% is so small, we
are not too happy with the model.
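
To see the "two parallel lines" idea concretely, here is a small
sketch that plugs the fitted coefficients from the output above into
the equation (the LOS values chosen are arbitrary):

    # Fitted model from the output above: Wages = 37.466 + 0.08417*LOS + 10.228*size,
    # where size = 0 for small companies and 1 for large companies.
    def predicted_wages(los, size):
        return 37.466 + 0.08417 * los + 10.228 * size

    for los in (50, 100, 150):
        small = predicted_wages(los, 0)
        large = predicted_wages(los, 1)
        print(los, round(small, 2), round(large, 2), round(large - small, 2))
    # The gap is always 10.228 (the "size" coefficient):
    # the two groups share a slope but have different intercepts -- two parallel lines.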


What else do we need to do? Look at the residuals to
see if they give us any suggestions:
Unusual Observations
Obs LOS Wages      Fit   SE Fit   Residual   St Resid
 15   70 97.68 53.59       1.85      44.09       4.09R
 22 222 54.95 56.15        4.57      -1.21      -0.12 X
 29   98 34.34 55.94       2.05     -21.60      -2.01R
 42 228 67.91 56.66        4.71      11.25       1.14 X
 47 204 50.17 64.87        4.26     -14.69      -1.46 X

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large
influence.



R² may increase by removing observation number 15
from the data set (after first trying to see if we can
make any correction).


Alternatively we could try to find other predictors and
start all over.




Example:
Reaction Time in a Computer Game vs Distance to move mouse and Hand used.




Coding of hand:                  right = 0   left = 1


                             A model with interaction:
         Analysis of Variance
         Source          DF      SS                    MS        F       P
         Regression       3 136948                  45649    17.82   0.000
         Residual Error 36    92198                  2561
         Total           39 229146


What does the ANOVA table tell us?
Now let’s have a look at the estimates:
Regression Analysis: time versus distance, hand, dist*hand

The regression equation is
time = 99.4 + 0.028 distance + 72.2 hand + 0.234 dist*hand

Predictor                 Coef     SE Coef      T        P
Constant                 99.36       25.25   3.93    0.000
distance                0.0283      0.1308   0.22    0.830
hand                     72.18       35.71   2.02    0.051
dist*hand               0.2336      0.1850   1.26    0.215

S = 50.6067                R-Sq = 59.8%       R-Sq(adj) = 56.4%




Since the test on the coefficient of the interaction term
has a large p-value, we fail to reject Ho: β3 = 0 and use
a model with no interaction term. [Ignore the rest of
the output.]
Unusual Observations
Obs distance     time     Fit SE Fit Residual St Resid
 25       163 315.00 214.29     11.38    100.71      2.04R
 30       271 401.00 242.65     17.19    158.35      3.33R
 31        40 320.00 182.09     20.68    137.91      2.99R
R denotes an observation with a large standardized residual.


                          The model with no interaction:
Regression Analysis: time versus distance, hand

         Analysis of Variance
         Source          DF      SS                   MS       F       P
         Regression       2 132865                 66433   25.53   0.000
         Residual Error 37    96281                 2602
         Total           39 229146


The ANOVA table tells us that at least one of the β's
is significantly different from zero. But which one(s)?
The regression equation is
time = 79.2 + 0.145 distance + 112 hand

Predictor                  Coef   SE Coef      T         P
Constant                  79.21     19.72   4.02     0.000
distance                0.14512   0.09324   1.56     0.128
hand                     112.50     16.13   6.97     0.000

S = 51.0116                R-Sq = 58.0%     R-Sq(adj) = 55.7%


Since the p-value for the test on the coefficient of
distance is large, we fail to reject Ho: β1 = 0. This
leaves HAND as the only predictor in the model.


You may try to see if removing observation 30 will
change this result. It didn’t.
Unusual Observations
Obs   distance            time      Fit    SE Fit      Residual   St Resid
 25        163          315.00   215.39     11.44         99.61       2.00R
 30        271          401.00   231.10     14.67        169.90       3.48R
 31         40          320.00   197.55     16.80        122.45       2.54R
 So, what is next?? Simple Linear Regression:
Regression Analysis: time versus hand

         The regression equation is
         time = 104 + 112 hand

         Predictor        Coef   SE Coef      T         P
         Constant       104.25     11.62   8.97     0.000
         hand           112.50     16.43   6.85     0.000

         S = 51.9573       R-Sq = 55.2%     R-Sq(adj) = 54.1%

         Analysis of Variance
         Source          DF     SS             MS        F       P
         Regression       1 126562         126562    46.88   0.000
         Residual Error 38 102583            2700
         Total           39 229146


The above output indicates that both β0 and β1 are
significantly different from zero. How do you interpret
the β's?
Unusual Observations
Obs hand     time     Fit SE Fit Residual St Resid
 30 1.00 401.00 216.75      11.62    184.25      3.64R
 31 1.00 320.00 216.75      11.62    103.25      2.04R
 32 1.00 113.00 216.75      11.62   -103.75     -2.05R
R denotes an observation with a large standardized residual.
You may try to see if removing observation # 30 will
change the results. [It didn’t.]


We may use the methods of Chapter 13 in this case
since the predictor is a categorical variable. Observe
that we have the same ANOVA table in both cases.
Testing the hypothesis of equal population means
using the methods of Chapter 13:
         One-way ANOVA: time versus hand

         Source         DF       SS         MS         F       P
         hand            1   126563     126563     46.88   0.000
         Error          38   102584       2700
         Total          39   229146

         S = 51.96           R-Sq = 55.23%         R-Sq(adj) = 54.05%

                   Individual 95% CIs For Mean Based on Pooled StDev
Level N            Mean       StDev        +---------+---------+---------+---------
0     20           104.25        8.25         (-----*-----)
1     20           216.75      73.01                                  (-----*-----)
                                           +---------+---------+---------+---------
                                        80           120          160           200
Pooled StDev = 51.96



The above graph shows that the CIs for the means of
the two populations do not overlap, indicating that
there is a significant difference in the means of the two
populations.
Since there are only 2 populations (define them) we
can also test the hypothesis of no difference between
the two population means using the methods of
Chapter 9.
Minitab gives the output on the next page:

Testing hypothesis of no difference between the two
population means using methods of Chapter 9:
         Two-Sample T-Test and CI: time, hand

         Two-sample T for time

         hand N Mean StDev SE Mean
         0 20 104.25 8.25  1.8
         1 20 216.8 73.0   16

         Difference = mu (0) – mu (1)
         Estimate for difference: –112.500
         95% CI for difference: (–146.889, –78.111)
         T-Test of difference = 0 (vs not =): T-Value = -6.85 P-Value = 0.000 DF=19




 How will you interpret the above output?


In the last example, you have seen different ways of
getting the same results. Can you see the differences
and similarities between them?
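
One similarity worth noting (a standard fact, not stated explicitly
above): with only two groups, the ANOVA F statistic is the square of
the pooled two-sample t statistic. Here (−6.85)² ≈ 46.9, matching
F = 46.88 from the one-way ANOVA and from the regression on the HAND
dummy, even though the two-sample output shown happens to use the
unpooled version of the test (DF = 19).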




                       12.6 Logistic Regression
              (This section will be included in Exam 3.)

This model is used when the response is categorical with
two categories (Yes = "Success" and No = "Failure").
There may be one or more predictors, at least one of
which is quantitative.

Example: suppose you want to predict whether a
person has cancer or not (Yes/No) based on the
number of cigarettes s/he smokes (Quantitative),
gender (Male/Female), Age (Quantitative) and an
index of family history (Quantitative). Then,

Response:
       Y = Have Cancer? (Categorical)

Predictors:
         X1 = Number of cigarettes (Quantitative)
         X2 = Gender (Categorical)
         X3 = Age (Quantitative)
         X4 = Family History Index (Quantitative)

To analyze such data, we will express the
probabilities in terms of “odds” or “odds ratio” and
use the logarithms (to base e) of these odds. [That is
the reason the model is called the logistic regression
model.]
We will concentrate on transforming the data
  o P(“Success”) to “Odds” to “Log Odds”
  o Interpreting computer output

Let’s first clarify the concepts of “odds” and “logit
function” or “log odds”:

Transforming the data:
   For each unit in the sample, we will have
    observation(s) on X(s) as well as an observation
    on Y (a dummy variable or an indicator) where,
        Y = 1 if response is “Yes” (= “Success”) and
        Y = 0 if response is “No” (= “Failure”)

   • Then the number of "Success"es in the sample is
     ΣYi (summing Y1, …, Yn), and hence

        p̂ = (ΣYi) / n = the sample proportion.

     [Note that the sample proportion is the sample
      mean of a binary (Bernoulli) variable!]

   • The "Odds" or "Odds ratio" is the ratio of the
     probability of "Success" to the probability of
     "Failure," estimated by  p̂ / (1 − p̂).



   • The LOG ODDS (or the logit function) is defined
     as the natural logarithm of the odds ratio, i.e.,

        LOG ODDS = log_e( p̂ / (1 − p̂) ) = ln( p̂ / (1 − p̂) )

                        Interpreting “Odds”

Example – 1:
Suppose the “odds of having a disease” is 0.33, that is,
ODDS = 0.33 = 1/3 = 1 : 3

This means 1 person has disease for every 3 who
don’t. Thus,
  p̂ = probability that a person has the disease
     = (number of people who have the disease) / (total number of people: haves + don't-haves)
     = 1 / (1 + 3) = 1/4 = 0.25 = 25%

Example – 2:

ODDS = 0.5 = 1/2 = 1 : 2
p̂ = probability that a person has the disease
   = 1 / (1 + 2) = 1/3 ≈ 0.33 = 33%


Example – 3:

ODDS = 1.5 = 15/10 = 3/2 = 3 : 2
p̂ = probability that a person has the disease
   = 3 / (3 + 2) = 3/5 = 0.60 = 60%

Note that ODDS > 1 means p̂ > 50%.

Working Backwards:

Example – 1:
Suppose p̂ = 0.90. Then ODDS = p̂ / (1 − p̂) = 0.9 / 0.1 = 9,
that is, ODDS = 9 : 1. So, LogODDS = ln(9) = 2.1972.

Let's work backwards:
Find p̂ when LogODDS is given as 2.1972.

When LogODDS = 2.1972, we take the exponent (the
opposite of natural logs) of both sides to get
ODDS = e^LogODDS = e^2.1972 = 8.9998 ≈ 9 = 9 : 1

So, p̂ = 9 / (9 + 1) = 0.9.




Example – 2:

Suppose p̂ = 0.45. Then
ODDS = p̂ / (1 − p̂) = 0.45 / 0.55 = 0.8181
     = 45/55 = 9/11, that is, 9 : 11

LogODDS = ln(0.8181) = –0.20067

Let's work backwards:
Find p̂ when LogODDS = –0.20067 is given.

When LogODDS = –0.20067,
ODDS = e^LogODDS = e^(–0.20067) = 0.8181 ≈ 0.82 = 82/100 = 82 : 100

Hence, p̂ = 82 / (82 + 100) = 82/182 ≈ 0.45.




Example – 3:

Suppose computer output reports LogODDS = 1.57.
What is the sample proportion?
When LogODDS = 1.57, ODDS = e^1.57 = 4.81.
That is, ODDS = 481/100 = 481 : 100.
So, p̂ = 481 / (481 + 100) = 0.828 = 82.8%
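
For readers who want to check these conversions with software, here
is a minimal sketch in Python reproducing the numbers from the
examples above:

    import math

    def odds_from_p(p):
        # odds of "Success" computed from a probability p
        return p / (1 - p)

    def p_from_log_odds(log_odds):
        # probability of "Success" computed from a LogODDS value
        odds = math.exp(log_odds)
        return odds / (1 + odds)

    print(math.log(odds_from_p(0.90)))   # about 2.1972  (working backwards, Example 1)
    print(p_from_log_odds(-0.20067))     # about 0.45    (working backwards, Example 2)
    print(p_from_log_odds(1.57))         # about 0.828   (Example 3)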

Logistic Regression Model:    p = e^(α + βX) / (1 + e^(α + βX))

Here p = P("Success") = P(Y = 1) and X is the
predictor (a quantitative variable).

Then, the LogODDS (logit) function is
                Log( p / (1 − p) ) = α + βX

Although the right-hand side of the above equation
looks like SLR, a scatter diagram of X against p is an
S-shaped curve. In the above model, β is NOT the
slope, although its sign tells us whether there is an
increasing (β > 0) or a decreasing (β < 0) relation
between X and p.
Fitted equation:
                Log( p̂ / (1 − p̂) ) = a + bX
Interpretation of b (we don't interpret a):
b = the change in the LogODDS of "Success" as X
increases by one unit (check the sign).

Is b significantly different from zero?
We can test this by Ho: β = 0 vs. Ha: β ≠ 0.

ODDS ratio = e^b. Thus, e^b gives us the (multiplicative)
change in the ODDS as X increases by one unit.

Is the ODDS ratio (e^b) significantly different from one?
If we fail to reject Ho: β = 0 vs. Ha: β ≠ 0, then this
means β is NOT significantly different from zero and
hence e^β is not significantly different from 1. [This
means the probability of "Success" is the same for all
values of X.]

On the other hand, if the p-value reported by the computer
is small, we reject Ho: β = 0 vs. Ha: β ≠ 0, which
implies β ≠ 0 and hence e^β ≠ 1, and hence the probability
of "Success" is different for different X values.


Note that β = 0 means there is no linear relationship
between X and the LogODDS, hence no relationship
between p and X.


Example:

How does age affect the chances of developing
osteoporosis?

Data:
   X = Age (predictor):            72     85     84    …
   Osteoporosis?                   Yes    No     Yes   …
   Y (coded 1 = Yes, 0 = No):      1      0      1     …

Logistic Regression Model:
           Log( p / (1 − p) ) = a + bX = β0 + β1X

Let us interpret the Minitab output on the next page.




 Minitab Output:
Logistic Regression of Osteoporosis (yes=1, no=0) on Age (in years)
Logistic Regression Table
                                                            Odds 95% CI
Predictor                Coef     SE Coef    Z     P        Ratio Lower Upper
Constant                -4.353     2.4865   1.75   0.0802
age                       0.038    0.0072   5.28   0.0000   1.04   1.02 1.05



Fitted Equation (from Minitab):
           Log( p̂ / (1 − p̂) ) = −4.353 + 0.038 Age
Testing the hypothesis that age has no effect on
developing osteoporosis, i.e.,
             Ho: β = 0 vs. Ha: β ≠ 0 (Z-test)
Computer gives p-value 0.000. So we reject Ho. Age
is a “good” predictor of whether a woman will develop
osteoporosis.

b = 0.038 (do not interpret b itself)
Interpret e^b = e^0.038 = 1.039 [on the output, ODDS RATIO = 1.04]

Interpretation: As age increases by one year, the
odds of getting osteoporosis are 1.04 times what they
were the year before.

95% CI for ODDS Ratio: (1.02, 1.05)
Since CI does not contain one, age is a “good”
predictor of osteoporosis.

Predict the chance (probability) of getting osteoporosis
at age 65 and also at age 75.

a) When X = 65:
   LogODDS = Log( p̂ / (1 − p̂) ) = −4.353 + 0.038 × Age
           = −4.353 + 0.038 × 65 = −1.883

   ODDS = e^(−1.883) = 0.152 ≈ 0.15 = 15/100

   So p̂ = 15 / (15 + 100) ≈ 0.13, i.e., about 13% of women aged 65
   have osteoporosis.

Another way:
Using the model  p = e^(α + βX) / (1 + e^(α + βX))  we can estimate p as

   p̂ = e^(−4.353 + 0.038(65)) / (1 + e^(−4.353 + 0.038(65))) = 0.152 / 1.152 ≈ 0.13




b) When X = 75:
   LogODDS = Log( p̂ / (1 − p̂) ) = −4.353 + 0.038(75) = −1.503

   Then, ODDS = e^(−1.503) = 0.22 and hence,

   p̂ = ODDS / (1 + ODDS) = 0.22 / 1.22 ≈ 0.18; that is, about 18% of women
   aged 75 will have osteoporosis.
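
The same predictions can be sketched in a few lines of Python (just
plugging the fitted coefficients from the Minitab output into the
model; nothing new is estimated here):

    import math

    # Fitted logistic equation from the output: LogODDS = -4.353 + 0.038*Age
    def prob_osteoporosis(age):
        log_odds = -4.353 + 0.038 * age
        return math.exp(log_odds) / (1 + math.exp(log_odds))

    print(round(prob_osteoporosis(65), 2))   # about 0.13
    print(round(prob_osteoporosis(75), 2))   # about 0.18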




                   Multiple Logistic Regression Model

Example: Predicting chances of cancer (for a
population at a very high risk) from age and smoking
status:

Response = Y (binary, categorical with 2 options)
         Y = 1 if cancer, 0 if not

Predictors:
     X1 = Age (quantitative)
     X2 = 1 if the person smokes, 0 if not (binary)

               Multiple Logistic Regression Model:
         Log( p / (1 − p) ) = β0 + β1(Age) + β2(Smoking)

Here is a Minitab output. Let’s interpret it:




Minitab output:
Example - Logistic Regression of Cancer (yes=1, no=0)
on Age (in years) and Smoking (yes=1, no=0)

Logistic Regression Table
                                                        Odds      95% CI
Predictor             Coef   SE Coef      Z        P   Ratio   Lower Upper
Constant           -4.4777    2.7465   1.63   0.1032
age                 0.1123    0.0386   2.91   0.0036    1.12   1.04   1.21
smoking             1.1638    0.4537   2.57   0.0103    3.21   1.32   7.79

Log-Likelihood = -137.18596
Test that all slopes are zero: G = 18.8479, DF = 2, P-Value = 0.000



Fitted Equation:
   Log( p̂ / (1 − p̂) ) = b0 + b1(Age) + b2(Smoking)
                       = −4.4777 + 0.1123(Age) + 1.1638(Smoking)

Inferences about Age:
   • The p-value = 0.0036 < 0.01; thus age has a
     significant effect on the probability of getting cancer.
   • Odds Ratio = 1.12 means that with each year as one gets
     older, the ODDS of getting cancer become 1.12 times
     what they were during the previous year.
   • CI for the ODDS ratio: (1.04, 1.21) does not include 1;
     hence age has a significant effect on cancer.




Inferences on Smoking:
   • P-value = 0.0103; hence at α = 0.05 and α = 0.10,
     β2 is significantly different from zero, thus
     smoking has a significant effect on cancer.
   • ODDS Ratio = 3.21; that is, the ODDS of getting
     cancer for smokers are 3.21 times what they are for
     non-smokers (at the same age).
   • CI for the ODDS ratio: (1.32, 7.79) does not include 1,
     so it is significant. The ODDS of getting cancer for
     smokers may be as high as 7.79 times that for
     non-smokers. [Are you still smoking?]


Predicting the probability of cancer for an 80-year-old
non-smoker (X1 = 80, X2 = 0):

   Log( p̂ / (1 − p̂) ) = −4.4777 + 0.1123(Age) + 1.1638(Smoking)
                       = −4.4777 + 0.1123(80) + 1.1638(0) = 4.506 ≈ 4.5

   ODDS = e^4.506 = 90.6

   Probability of cancer = 90.6 / (1 + 90.6) = 0.9891 = 98.91%
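
As a quick sketch (again, just plugging the fitted coefficients from
the Minitab output into the model), the probability above and the
reported odds ratios can be reproduced as follows:

    import math

    # Fitted equation: LogODDS = -4.4777 + 0.1123*Age + 1.1638*Smoking
    def prob_cancer(age, smoking):
        z = -4.4777 + 0.1123 * age + 1.1638 * smoking
        return math.exp(z) / (1 + math.exp(z))

    print(round(prob_cancer(80, 0), 4))   # about 0.9891, matching the hand calculation above
    print(round(math.exp(0.1123), 2))     # 1.12 -- the odds ratio for age
    print(round(math.exp(1.1638), 2))     # 3.2  -- essentially the 3.21 odds ratio for smoking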




    Homework:
    1. Find the probability of cancer for a 75 year old
       smoker (Answer = 99%)
    2. Find the probability of cancer for a 40 year old
       non-smoker (Answer = 50%)
    3. Do you think this output is useful to predict the
       probability for a smoker at your age? Why or why
       not? [You need to make some assumptions before
       you can answer this.]
    4. Use the output below to interpret the numbers and
       find if the proportion of binge drinkers differ by
       gender.
     5. Can you answer the same question using the same
        output by two other methods? [You should not
        have difficulty answering this question, since we
        have seen those methods in Chapters 9 and 10.]

Example
Logistic Regression of Frequent Binge Drinking (yes=1, no=0)
on Gender (males=1, females=0)

                             Gender     YES       NO    Total
                              Male      1630     7180     8810
                             Female     1684     9916    11600
                              Total     3314    17096    20410

 Logistic Regression Table
                                                          Odds      95% CI
Predictor            Coef     SE Coef        Z       P   Ratio   Lower Upper
Constant         -1.58686   0.0267449   -59.33   0.000
gender           0.361639   0.0388452     9.31   0.000    1.44   1.33   1.55



