Multicollinearity
• Multicollinearity (or intercorrelation)
  exists when at least some of the predictor
  variables are correlated among themselves.
• In observational studies, multicollinearity
  happens more often than not.
• So, we need to understand the effects of
  multicollinearity on regression analyses.
Example #1

n = 20 hypertensive individuals
p-1 = 6 predictor variables

             BP    Age     Weight   BSA     Duration   Pulse
Age        0.659
Weight     0.950   0.407
BSA        0.866   0.378   0.875
Duration   0.293   0.344   0.201    0.131
Pulse      0.721   0.619   0.659    0.465     0.402
Stress     0.164   0.368   0.034    0.018     0.312    0.506



            Blood pressure (BP) is the response.
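The correlation matrix above is easy to reproduce in Python. A minimal
sketch, assuming the data sit in a hypothetical CSV file "bloodpress.csv"
with columns BP, Age, Weight, BSA, Duration, Pulse, and Stress:

import pandas as pd

# Load the blood-pressure data (hypothetical file name).
df = pd.read_csv("bloodpress.csv")

# Pairwise Pearson correlations among all seven variables;
# the table above is the lower triangle of this matrix.
print(df.corr().round(3))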
What is the effect on regression analyses
if the predictors are perfectly uncorrelated?
x1   x2    y
 2    5    52
 2    5    43
 2    7    49
 2    7    46
 4    5    50
 4    5    48
 4    7    44
 4    7    43

Pearson correlation of x1 and x2 = 0.000
               Regress Y on X1
The regression equation is y = 48.8 - 0.63 x1

Predictor        Coef     SE Coef          T        P
Constant       48.750       4.025      12.11    0.000
x1             -0.625       1.273      -0.49    0.641

Analysis of Variance
Source      DF       SS       MS         F         P
Regression   1     3.13      3.13      0.24     0.641
Error        6    77.75     12.96
Total        7    80.88
                 Regress Y on X2
The regression equation is y = 55.1 - 1.38 x2

Predictor         Coef      SE Coef        T         P
Constant        55.125        7.119     7.74     0.000
x2              -1.375        1.170    -1.17     0.285

Analysis of Variance
Source        DF     SS      MS        F          P
Regression    1     15.13   15.13     1.38      0.285
Error         6     65.75   10.96
Total         7     80.88
            Regress Y on X1 and X2
The regression equation is y = 57.0 - 0.63 x1 - 1.38 x2

Predictor        Coef       SE Coef          T        P
Constant       57.000         8.486       6.72    0.001
x1             -0.625         1.251      -0.50    0.639
x2             -1.375         1.251      -1.10    0.322

Analysis of Variance
Source     DF        SS            MS      F        P
Regression 2      18.25           9.13   0.73    0.528
Error       5     62.63          12.53
Total       7     80.88

Source       DF         Seq SS
x1            1           3.13
x2            1          15.13
            Regress Y on X2 and X1
The regression equation is y = 57.0 - 1.38 x2 - 0.63 x1

Predictor        Coef       SE Coef          T        P
Constant       57.000         8.486       6.72    0.001
x2             -1.375         1.251      -1.10    0.322
x1             -0.625         1.251      -0.50    0.639

Analysis of Variance
Source      DF     SS              MS      F        P
Regression   2     18.25          9.13   0.73    0.528
Error        5     62.63         12.53
Total        7     80.88

Source       DF         Seq SS
x2            1          15.13
x1            1           3.13
       If predictors are perfectly
          uncorrelated, then…
• You get the same slope estimates regardless
  of the first-order regression model used.
• That is, the effect on the response ascribed
  to a predictor doesn’t depend on the other
  predictors in the model.
       If predictors are perfectly
          uncorrelated, then…
• The sum of squares SSR(X1) is the same as
  the sequential sum of squares SSR(X1|X2).
• The sum of squares SSR(X2) is the same as
  the sequential sum of squares SSR(X2|X1).
• That is, the marginal contribution of one
  predictor variable in reducing the error sum
  of squares doesn’t depend on the other
  predictors in the model.
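Both facts are easy to verify numerically. A minimal sketch that refits
the three models on the eight observations above with plain NumPy least
squares and checks the slopes and (sequential) regression sums of squares:

import numpy as np

# The 8 observations with perfectly uncorrelated predictors.
x1 = np.array([2, 2, 2, 2, 4, 4, 4, 4], dtype=float)
x2 = np.array([5, 5, 7, 7, 5, 5, 7, 7], dtype=float)
y  = np.array([52, 43, 49, 46, 50, 48, 44, 43], dtype=float)

def fit(predictors, y):
    """Least-squares fit with an intercept; returns (coefficients, SSR)."""
    X = np.column_stack([np.ones(len(y))] + predictors)
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    ssr = np.sum((X @ b - y.mean()) ** 2)  # regression sum of squares
    return b, ssr

b_1, ssr_1 = fit([x1], y)        # slope -0.625, SSR(X1) = 3.13
b_2, ssr_2 = fit([x2], y)        # slope -1.375, SSR(X2) = 15.13
b_12, ssr_12 = fit([x1, x2], y)  # both slopes unchanged

print(ssr_12 - ssr_2)  # SSR(X1|X2) = 3.13,  same as SSR(X1)
print(ssr_12 - ssr_1)  # SSR(X2|X1) = 15.13, same as SSR(X2)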
Do we see the same effects for "real data"
with nearly uncorrelated predictors?
             BP    Age     Weight   BSA     Duration   Pulse
Age        0.659
Weight     0.950   0.407
BSA        0.866   0.378   0.875
Duration   0.293   0.344   0.201    0.131
Pulse      0.721   0.619   0.659    0.465     0.402
Stress     0.164   0.368   0.034    0.018     0.312    0.506
              Regress BP on Stress
The regression equation is
BP = 113 + 0.0240 Stress

Predictor         Coef       SE Coef        T        P
Constant       112.720         2.193    51.39    0.000
Stress         0.02399       0.03404     0.70    0.490

S = 5.502        R-Sq = 2.7%       R-Sq(adj) = 0.0%

Analysis of Variance
Source        DF     SS         MS        F       P
Regression     1    15.04      15.04     0.50   0.490
Error         18   544.96      30.28
Total         19   560.00
            Regress BP on BSA
The regression equation is
BP = 45.2 + 34.4 BSA

Predictor       Coef         SE Coef         T       P
Constant      45.183           9.392      4.81   0.000
BSA           34.443           4.690      7.34   0.000

S = 2.790       R-Sq = 75.0%       R-Sq(adj) = 73.6%

Analysis of Variance
Source       DF       SS          MS       F       P
Regression    1    419.86       419.86   53.93   0.000
Error        18    140.14         7.79
Total        19    560.00
    Regress BP on BSA and Stress
The regression equation is
BP = 44.2 + 34.3 BSA + 0.0217 Stress

Predictor          Coef      SE Coef         T       P
Constant         44.245        9.261      4.78   0.000
BSA              34.334        4.611      7.45   0.000
Stress          0.02166      0.01697      1.28   0.219

Analysis of Variance
Source        DF        SS        MS       F       P
Regression     2    432.12      216.06   28.72   0.000
Error         17    127.88        7.52
Total         19    560.00

Source         DF      Seq SS
BSA             1      419.86
Stress          1       12.26
   Regress BP on Stress and BSA
The regression equation is
BP = 44.2 + 0.0217 Stress + 34.3 BSA

Predictor          Coef       SE Coef          T        P
Constant         44.245         9.261       4.78    0.000
Stress          0.02166       0.01697       1.28    0.219
BSA              34.334         4.611       7.45    0.000

Analysis of Variance
Source        DF       SS            MS       F        P
Regression     2     432.12        216.06   28.72    0.000
Error         17     127.88          7.52
Total         19     560.00

Source         DF         Seq SS
Stress          1          15.04
BSA             1         417.07
         If predictors are nearly
          uncorrelated, then…
• You get similar slope estimates regardless
  of the first-order regression model used.
• The sum of squares SSR(X1) is similar to the
  sequential sum of squares SSR(X1|X2).
• The sum of squares SSR(X2) is similar to the
  sequential sum of squares SSR(X2|X1).
       What happens if the predictor
      variables are highly correlated?
             BP    Age     Weight   BSA     Duration   Pulse
Age        0.659
Weight     0.950   0.407
BSA        0.866   0.378   0.875
Duration   0.293   0.344   0.201    0.131
Pulse      0.721   0.619   0.659    0.465     0.402
Stress     0.164   0.368   0.034    0.018     0.312    0.506
            Regress BP on Weight
The regression equation is
BP = 2.21 + 1.20 Weight

Predictor        Coef        SE Coef         T        P
Constant        2.205          8.663      0.25    0.802
Weight        1.20093        0.09297     12.92    0.000

S = 1.740       R-Sq = 90.3%       R-Sq(adj) = 89.7%

Analysis of Variance
Source       DF       SS          MS        F        P
Regression    1    505.47       505.47   166.86    0.000
Error        18      54.53        3.03
Total        19    560.00
              Regress BP on BSA
The regression equation is
BP = 45.2 + 34.4 BSA

Predictor         Coef       SE Coef         T        P
Constant        45.183         9.392      4.81    0.000
BSA             34.443         4.690      7.34    0.000

S = 2.790        R-Sq = 75.0%       R-Sq(adj) = 73.6%

Analysis of Variance
Source        DF       SS         MS       F       P
Regression     1    419.86      419.86   53.93   0.000
Error         18    140.14        7.79
Total         19    560.00
  Regress BP on BSA and Weight
The regression equation is
BP = 5.65 + 5.83 BSA + 1.04 Weight

Predictor             Coef    SE Coef         T        P
Constant             5.653      9.392      0.60    0.555
BSA                  5.831      6.063      0.96    0.350
Weight              1.0387     0.1927      5.39    0.000

Analysis of Variance
Source        DF        SS          MS       F       P
Regression     2    508.29        254.14   83.54   0.000
Error         17     51.71          3.04
Total         19    560.00

Source         DF        Seq SS
BSA             1        419.86
Weight          1         88.43
  Regress BP on Weight and BSA
The regression equation is
BP = 5.65 + 1.04 Weight + 5.83 BSA

Predictor             Coef     SE Coef        T        P
Constant             5.653       9.392     0.60    0.555
Weight              1.0387      0.1927     5.39    0.000
BSA                  5.831       6.063     0.96    0.350

Analysis of Variance
Source        DF        SS          MS       F        P
Regression     2     508.29       254.14   83.54    0.000
Error         17       51.71        3.04
Total         19     560.00

Source         DF        Seq SS
Weight          1        505.47
BSA             1          2.81
     Effect #1 of multicollinearity
When predictor variables are correlated, the regression
coefficient of any one variable depends on which other
predictor variables are included in the model.

Variables in model      b1        b2
X1                      1.20      ----
X2                      ----      34.4
X1, X2                  1.04      5.83
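A minimal sketch reproducing this table with statsmodels, again assuming
the hypothetical "bloodpress.csv" from earlier (here X1 = Weight and
X2 = BSA):

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("bloodpress.csv")

# Fit the three first-order models; watch the slopes change.
for formula in ("BP ~ Weight", "BP ~ BSA", "BP ~ Weight + BSA"):
    fit = smf.ols(formula, data=df).fit()
    print(formula, fit.params.round(3).to_dict())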
Even correlated predictors not in
 the model can have an impact!
• Regression of territory sales on territory
  population, per capita income, etc.
• Contrary to expectation, the estimated coefficient
  of territory population was negative.
• Competitor's market penetration, which was
  strongly positively correlated with territory
  population, was not included in the model.
• But the competitor kept sales down in territories
  with large populations.
    Effect #2 of multicollinearity
When predictor variables are correlated, the marginal
contribution of any one predictor variable in reducing
the error sum of squares varies, depending on which
other predictors are already in the model.

    SSR(X1) = 505.47            SSR(X2) = 419.86
    SSR(X1|X2) = 88.43          SSR(X2|X1) = 2.81
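These values come straight from decomposing the regression sum of
squares of the two-predictor model, SSR(X1, X2) = 508.29:

SSR(X1|X2) = SSR(X1, X2) - SSR(X2) = 508.29 - 419.86 = 88.43
SSR(X2|X1) = SSR(X1, X2) - SSR(X1) = 508.29 - 505.47 = 2.81 (after rounding)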
     Effect #3 of multicollinearity
When predictor variables are correlated, the precision of
the estimated regression coefficients decreases as more
predictor variables are added to the model.

Variables in model      se(b1)     se(b2)
X1                      0.093      ----
X2                      ----       4.69
X1, X2                  0.193      6.06
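This inflation of the standard errors is exactly what the variance
inflation factor (VIF), listed under "Diagnosing multicollinearity"
at the end, quantifies: the variance of bk is multiplied by
VIFk = 1/(1 - Rk^2), where Rk^2 is the R-squared obtained from
regressing Xk on the other predictors.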
What is the effect on estimating the
mean or predicting a new response?
Effect #4 of multicollinearity on
estimating the mean or predicting Y
Weight    Fit      SE Fit       95.0% CI            95.0% PI
 92      112.7     0.402    (111.85, 113.54)    (108.94, 116.44)

BSA       Fit      SE Fit       95.0% CI            95.0% PI
  2      114.1     0.624    (112.76, 115.38)    (108.06, 120.08)

BSA  Weight     Fit     SE Fit       95.0% CI            95.0% PI
 2     92      112.8    0.448    (111.93, 113.83)    (109.08, 116.68)

High multicollinearity among the predictor variables
does not prevent good, precise predictions of the
response (within the scope of the model).
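A minimal sketch of obtaining such confidence and prediction intervals
with statsmodels (hypothetical "bloodpress.csv" again):

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("bloodpress.csv")
model = smf.ols("BP ~ BSA + Weight", data=df).fit()

# Predict at BSA = 2, Weight = 92, as in the third table above.
new = pd.DataFrame({"BSA": [2.0], "Weight": [92.0]})
frame = model.get_prediction(new).summary_frame(alpha=0.05)

# mean_ci_lower/upper: 95% CI for the mean response;
# obs_ci_lower/upper:  95% PI for a new observation.
print(frame)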
What is the effect on tests
of individual slopes?
The regression equation is
BP = 45.2 + 34.4 BSA

Predictor          Coef      SE Coef        T        P
Constant         45.183        9.392     4.81    0.000
BSA              34.443        4.690     7.34    0.000

S = 2.790         R-Sq = 75.0%     R-Sq(adj) = 73.6%

Analysis of Variance
Source        DF       SS       MS       F         P
Regression     1    419.86    419.86   53.93     0.000
Error         18    140.14      7.79
Total         19    560.00
What is the effect on tests
of individual slopes?
The regression equation is
BP = 2.21 + 1.20 Weight

Predictor        Coef        SE Coef        T        P
Constant        2.205          8.663     0.25    0.802
Weight        1.20093        0.09297    12.92    0.000

S = 1.740       R-Sq = 90.3%       R-Sq(adj) = 89.7%

Analysis of Variance
Source       DF       SS         MS        F         P
Regression    1    505.47      505.47   166.86     0.000
Error        18      54.53       3.03
Total        19    560.00
What is the effect on tests
of individual slopes?
The regression equation is
BP = 5.65 + 1.04 Weight + 5.83 BSA

Predictor      Coef        SE Coef       T        P
Constant      5.653          9.392    0.60    0.555
Weight        1.0387        0.1927    5.39    0.000
BSA           5.831          6.063    0.96    0.350

Analysis of Variance
Source        DF    SS         MS       F        P
Regression     2 508.29      254.14   83.54    0.000
Error         17   51.71       3.04
Total         19 560.00

Source         DF      Seq SS
Weight          1      505.47
BSA             1        2.81
   Effect #5 of multicollinearity on
              slope tests
When predictor variables are correlated, hypothesis tests
for βk = 0 may yield different conclusions depending on
which predictor variables are in the model.

Variables in model      b2       se(b2)       t       P-value
X2                      34.4     4.7        7.34      0.000
X1, X2                  5.83     6.1        0.96      0.350
         Summary comments
• Tests for slopes should generally be used to
  answer a scientific question, not for
  model-building purposes.
• Even then, interpret results with caution when
  multicollinearity exists. (Think marginal
  effects.)
   Summary comments (cont’d)
• Multicollinearity has little to no effect on
  estimation of mean response or prediction
  of future response.
   Diagnosing multicollinearity
• Observed effects of multicollinearity (changes in
  coefficients, changes in sequential sums of
  squares, etc.).
• Scatter plot matrices.
• Pairwise correlation coefficients among the
  predictor variables.
• Variance inflation factors (VIF), as sketched
  below.
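A minimal sketch of computing VIFs with statsmodels (hypothetical
"bloodpress.csv" once more); variance_inflation_factor implements
VIFk = 1/(1 - Rk^2):

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("bloodpress.csv")
X = sm.add_constant(df[["Age", "Weight", "BSA", "Duration", "Pulse", "Stress"]])

# Column 0 is the constant; report a VIF for each predictor.
for k, name in enumerate(X.columns):
    if name != "const":
        print(name, round(variance_inflation_factor(X.values, k), 2))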

				