INTRODUCTION TO MULTIPLE REGRESSION ANALYSIS

The general format of a multiple regression model:

Yi = 0 + 1X1i + 2X2i + …. + KXKi + e i          where

0 = Regression constant
1 = Regression coefficient for variable X1
K = Regression coefficient for variable XK
K = Number of independent variables
ei = Residual (Error)

Three general assumptions:

1. The errors are normally distributed
2. The mean of the error terms is 0
3. The error terms have a constant variance, σ2, for all
combinations of values of the independent variables

When a decision maker has sample data available for the dependent
variable and for K independent variables, the least squares regression
coefficients are estimated, forming the sample regression model of the
following form:

Yhati = b0 + b1X1i + b2X2i + … + bKXKi          where

b0 = Y-intercept (constant)
b1, b2, ...., bK = Regression slope coefficients
Yhati = ith estimated value of the dependent variable
X1, X2, .... XK = Independent variables

The sample size required to compute a regression model must be
at least one greater than the number of independent variables.

As a practical matter, the sample size should be at least 4 times
the number of independent variables.
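The least squares estimation described above can be sketched in a few lines of Python. The data here are hypothetical (n = 6 observations, K = 2 independent variables, so the practical rule n ≥ 4K holds), and NumPy is assumed:

```python
import numpy as np

# Hypothetical data: 6 observations, 2 independent variables (K = 2).
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 6.0],
              [6.0, 5.0]])
y = np.array([5.1, 4.9, 9.0, 9.1, 13.0, 12.9])

# Add a column of ones so b0 (the Y-intercept) is estimated too.
A = np.column_stack([np.ones(len(X)), X])

# Least squares estimates b0, b1, b2 minimizing the sum of squared residuals.
b, *_ = np.linalg.lstsq(A, y, rcond=None)

yhat = A @ b             # fitted values Yhat_i
residuals = y - yhat     # e_i = Y_i - Yhat_i
print(b)
```

With an intercept in the model, the residuals always sum to zero, which is a quick sanity check on the fit.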

Example: AMERICAN COUNTY APPRAISAL MODEL

The assessor gathered data on the following residential property
variables for 531 houses:

Y = Sales price
X1 = Square feet
X2 = Age of house
X3 = Number of bedrooms
X4 = Number of bathrooms
X5 = Number of fireplaces

Y         X1           X2         X3        X4         X5
Y     1.000      0.841       -0.068     0.494      0.720     0.599
X1    0.841      1.000        0.054     0.644      0.680     0.589
X2   -0.068      0.054        1.000     0.007     -0.149     0.086
X3    0.494      0.644        0.007     1.000      0.551     0.338
X4    0.720      0.680       -0.149     0.551      1.000     0.518
X5    0.599      0.589        0.086     0.338      0.518     1.000

MTB > REGRESS 'Y' on 4 predictors 'X1' 'X3' 'X4' 'X5'

The regression equation is
Y = 20,494.980 + 29.173X1 – 5,050.290X3 + 11,710.938X4 +
4,450.552X5
Predictor        Coef             Stddev          t-ratio
Constant         20494.980
X1                  29.173           1.496        19.504
X3               -5050.290        1094.795        -4.613
X4               11710.938        1250.234         9.367
X5                4450.552        1210.376         3.677

s = 13600.502      R-sq = 76.7%     R-sq(adj) = 76.5%

Analysis of Variance

SOURCE       DF         SS                      MS                     F
Regression   4     320033124448.934             80008281112.233      432.538
Error        526    97296146146.656               184973661.876
Total        530   417329270595.589

To obtain a sales price point estimate for any house, we could substitute
values for X1, X3, X4, and X5 into this regression model. For example,
suppose a property with the following characteristics is considered:
X1 = Square feet = 2,100
X3 = Number of bedrooms = 4
X4 = Number of baths = 1.75
X5 = Number of fireplaces = 2

The point estimate for the sales price is
Yhat = 20,494.98 + 29.173(2,100) – 5,050.290(4) + 11,710.938(1.75)
+ 4,450.552(2) = 90,946.31
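The substitution can be checked directly. Note that the coefficients as printed are rounded, so the total below lands within a few dollars of the text's 90,946.31 (which presumably used unrounded coefficients):

```python
# Appraisal point estimate using the coefficients as printed above.
b0, b1, b3, b4, b5 = 20494.980, 29.173, -5050.290, 11710.938, 4450.552

x1, x3, x4, x5 = 2100, 4, 1.75, 2   # sq. feet, bedrooms, baths, fireplaces

yhat = b0 + b1 * x1 + b3 * x3 + b4 * x4 + b5 * x5
print(round(yhat, 2))
```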

Please study the equation "The point estimate for the sales price is..."
What exactly does it say about what a house-owner should do to
improve the value of her/his house?

i)     Is the overall model significant?
ii)    Are the individual variables significant?
iii)   Is the standard error of the estimate too large to provide
meaningful results?
iv)    Is multicollinearity a problem?

i)   Is the overall model significant?

R-square = R2 = (Sum of squares regression) / (Total sum of squares)
= SSR/TSS

To test the model's significance, we compare the calculated F value
(e.g., 432.538 above) with a table F value for a given α level (here
0.01) and the model's degrees of freedom: ν1 = 4 (regression) and
ν2 = 526 (error; n - K - 1 = 531 - 4 - 1). The table value is
approximately 3.32.

Since 432.538 far exceeds 3.32, we reject the null hypothesis that
the regression model does not explain a significant proportion of
the total variation in the dependent variable.

The adjusted R-square = RA2 = 1 - (1-R2)[(n-1)/(n-K-1)]

Where       n = Sample size
K = # of independent variables in model

In the Appraisal Model example (n = 531, K = 4 variables in the model):
RA2 = 1 - ((1 - 0.767) * ((531 - 1)/(531 - 4 - 1))) = 0.765228

Adding more independent variables to the regression model
generally increases R2. However, each added variable costs a degree
of freedom, and the Adjusted R-square discounts the model's
value accordingly. If a variable is added that does not contribute its
fair share to the explained variance, the Adjusted R-square will
actually decline. The Adjusted R-square is particularly important
when the number of independent variables is large relative to the
sample size.
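The adjusted R-square formula is easy to verify in code. Using K = 4 for the number of independent variables actually in the appraisal model (X1, X3, X4, X5), this reproduces Minitab's reported R-sq(adj) of 76.5%:

```python
def adjusted_r_square(r2, n, k):
    """Adjusted R-square: 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Appraisal model: R^2 = 0.767, n = 531, K = 4 independent variables.
print(round(adjusted_r_square(0.767, 531, 4) * 100, 1))  # → 76.5
```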

ii)    Are individual variables significant?

If the model is significant, at least one independent variable explains
a significant proportion of the variance. Which variables are
significant?

H0: i = 0, given all other variables are already in the model

We can test the significance of each independent variable using a
t test: The t statistic is determined by dividing the regression
coefficient by the standard deviation of the regression coefficient.

Example above: For X1: 29.173 (coef.) / 1.496 (st. dev.) = 19.504 (t)
These tests are conditional tests: the statement that the value of each
slope coefficient is 0 is made recognizing that the other independent
variables are already in the model.
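The t-ratios from the appraisal output can be reproduced from the printed coefficients and standard deviations. Because the printed values are themselves rounded, the computed ratios match the reported t-ratios only to within rounding (e.g., X1 comes out 19.501 versus the reported 19.504):

```python
# Conditional t tests: t = coefficient / standard deviation of the
# coefficient, testing H0: beta_i = 0 given the other variables are
# already in the model (values from the appraisal output above).
coefs = {"X1": (29.173, 1.496),
         "X3": (-5050.290, 1094.795),
         "X4": (11710.938, 1250.234),
         "X5": (4450.552, 1210.376)}

for name, (b, sd) in coefs.items():
    print(name, round(b / sd, 3))
```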

iii)   Is the standard error of the estimate too large?

The sample standard deviation of the regression model, the standard
error of the estimate Se, measures the dispersion of observed values,
Y, around the values predicted by the regression model:

Se = √[SSE / (n - K - 1)]

where      SSE = Sum of squares error
n = sample size
K = # of independent variables

A heuristic: Examine the range ±2Se.
Is this range acceptable from a practical viewpoint?
CV = Coefficient of variation = (Se / Ybar)(100)
As a rule, we like to see a CV of 10% or less.
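Both formulas are direct to compute. Plugging in the SSE from the appraisal ANOVA table reproduces the reported s = 13600.502; the CV call uses a hypothetical mean sales price, since Ybar is not reported in the output:

```python
import math

def standard_error_of_estimate(sse, n, k):
    """Se = sqrt(SSE / (n - K - 1))."""
    return math.sqrt(sse / (n - k - 1))

def coefficient_of_variation(se, ybar):
    """CV = (Se / Ybar) * 100; as a rule, 10% or less is desirable."""
    return se / ybar * 100

# Appraisal model: SSE = 97,296,146,146.656 with n = 531 and K = 4.
se = standard_error_of_estimate(97296146146.656, 531, 4)
print(round(se, 3))  # → 13600.502

# Hypothetical mean sales price of $85,000, for illustration only.
print(round(coefficient_of_variation(se, 85000.0), 1))
```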

iv)   Is multicollinearity a problem?

Multicollinearity occurs when two or more independent variables are
correlated with each other and therefore contribute redundant
information to the model. When highly correlated independent
variables are included in the regression model, the estimated
coefficients and their t-ratios can become unstable and misleading.

Study our previous model of house values carefully!

Dummy Variables in Regression Analysis

Frequently, you may want to use a nominal or an ordinal (i.e., qualitative)
variable as an independent variable in a regression model, e.g.:
- Gender (two dummy variables, say 1 & 0)
- Race (a series of dummy variables)
- Political party
- Marital status (a series of dummy variables)
- Employment status
- Type of degree
- Does house have central air-conditioning
- School district (important determinant of real estate value)

The number of dummy variables must be one less than the number of
possible categories; including one dummy per category (the dummy
variable trap) creates perfect multicollinearity, which makes it
impossible to solve the least squares regression equations.
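A small sketch of this coding scheme, with a hypothetical school-district variable: for m categories we create m - 1 indicator columns, dropping one reference category to avoid the trap:

```python
# Dummy-code a categorical variable: one 0/1 column per category,
# except the chosen reference category, which is dropped.
def dummy_code(values, reference):
    categories = sorted(set(values) - {reference})
    return {c: [1 if v == c else 0 for v in values] for c in categories}

# Hypothetical data: school district for five houses.
districts = ["North", "South", "East", "North", "East"]
dummies = dummy_code(districts, reference="North")
print(dummies)
```

A "North" house is then represented by zeros in both remaining columns, so its effect is absorbed into the regression constant.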

Stepwise Regression Analysis

In ordinary regression analysis we bring all independent variables
into the model in one step.

In stepwise regression analysis we develop the model in steps,
either through backward elimination or forward selection.

Backward elimination

First an ordinary regression model is developed using all
independent variables. Then a t-test for significance is performed on
each regression coefficient at a specified α-level. Provided that at
least one t-value is in the "do not reject" region (βi = 0), the variable
with the t-value closest to zero is removed, and another ordinary
regression model is developed with the remaining independent
variables. This elimination process continues until all independent
variables remaining in the model have coefficients that are
significantly different from 0.
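The elimination loop described above can be sketched with NumPy. This is a minimal sketch, not a production routine: `t_crit` is a hypothetical fixed cutoff standing in for a proper t-table lookup at the chosen α-level, and the OLS algebra is done directly via the normal equations:

```python
import numpy as np

def backward_elimination(X, y, t_crit=2.0):
    # X holds the independent variables (no intercept column);
    # t_crit is a hypothetical fixed cutoff for |t|.
    keep = list(range(X.shape[1]))
    while keep:
        # Refit OLS with the variables still in the model.
        A = np.column_stack([np.ones(len(y)), X[:, keep]])
        b, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ b
        mse = resid @ resid / (len(y) - A.shape[1])
        se = np.sqrt(np.diag(mse * np.linalg.inv(A.T @ A)))
        t = np.abs(b / se)[1:]          # t-ratios of the slopes only
        worst = int(np.argmin(t))
        if t[worst] >= t_crit:
            break                       # every remaining slope significant
        keep.pop(worst)                 # drop the t-value closest to zero
    return keep                         # indices of surviving variables
```

On data where one column drives y and a second column is pure noise, the loop drops the noise column and keeps the informative one.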

Forward selection

The forward selection process starts with the independent variable
that is most highly correlated with the dependent variable (highest
coefficient of partial determination), followed by the second most
highly correlated variable, etc.

Importantly:
If two or more variables overlap, a variable selected in an early step
may become insignificant (and will be dropped from the model)
when other variables are added at later steps. The method thus offers
a means of observing multicollinearity.

Remember, however, that the order of variable selection is
conditional (based on variables already in the model)

Common Problems with Multiple Regression

Beware of the "perfect" model (R2 of 1.0, or very close)

Remember the measurement scale requirements (interval or ratio)

Independent variables should "make sense"

Example: Income Related to Education, Job Experience, and Age

            Income        Education  Job Experience   Age
Individual  (1000s of $)  (years)    (years)          (years)
            Y             X1         X2               X3
A          5.0      2           9         29
B          9.7      4          18         50
C         28.4      8          21         41
D          8.8      8          12         55
E         21.0      8          14         34
F         26.6     10          16         36
G         25.4     12          16         61
H         23.1     12           9         29
I        22.5     12          18         64
J        19.5     12           5         30
K         21.7     12           7         28
L         24.8     13           9         29
M         30.1     14          12         35
N         24.8     14          17         59
O         28.5     15          19         65
P         26.0     15           6         30
Q         38.9     16          17         40
R         22.1     16           1         23
S         33.1     17          10         58
T         48.3     21          17         44

Income     Education   Job Experience    Age
Income              1
Education        0.8457          1
Job Experience   0.2677      -0.1069          1
Age              0.1050       0.0982       0.6755          1

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.972
R Square             0.945
Adjusted R Square    0.934
Standard Error       2.505
Observations            20

ANOVA
            df       SS         MS        F        Significance F
Regression   3    1719.998   573.333    91.343     2.79297E-10
Residual    16     100.428     6.277
Total       19    1820.426

               Coefficients   Standard Error   t Stat    P-value
Intercept          -2.983          2.357       -1.265      0.224
X Variable 1        2.099          0.133       15.817    3.44183E-11
X Variable 2        1.197          0.147        8.151    4.34418E-07
X Variable 3       -0.311          0.058       -5.381      0.000

Yhat = -2.983 + 2.099X1 + 1.197X2 - 0.311X3

If a person's Education = 9, Job Experience = 13, and Age = 39, then

Yhat = 19.36

Contribution of each term:
Intercept:                  -2.98
Education:         9        18.89
Job Experience:   13        15.57
Age:              39       -12.12
                           ------
Yhat:                       19.36
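The breakdown can be checked in code. Summing the unrounded terms gives 19.34; the 19.36 shown above comes from rounding each contribution to two decimals before adding:

```python
# Term-by-term contributions to the income estimate for Education = 9,
# Job Experience = 13, Age = 39, using the coefficients reported above.
terms = [("Intercept", -2.983, 1), ("Education", 2.099, 9),
         ("Job Experience", 1.197, 13), ("Age", -0.311, 39)]

total = sum(b * x for _, b, x in terms)
for name, b, x in terms:
    print(f"{name:15s} {b * x:8.2f}")
print(f"{'Yhat':15s} {total:8.2f}")
```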

If we add gender as X4 –a dummy variable with 0 = Male and
1 = Female– we get the following result:

            Income        Education  Job Experience  Age      Gender
Individual  (1000s of $)  (years)    (years)         (years)  0=M, 1=F
            Y             X1         X2              X3       X4
A           5.0        2            9          29      0
B           9.7        4           18          50      0
C          28.4        8           21          41      0
D           8.8        8           12          55      1
E          21.0        8           14          34      1
F          26.6       10           16          36      1
G          25.4       12           16          61      0
H          23.1       12            9          29      0
I         22.5       12           18          64      1
J         19.5       12            5          30      1
K          21.7       12            7          28      0
L          24.8       13            9          29      1
M          30.1       14           12          35      0
N          24.8       14           17          59      1
O          28.5       15           19          65      0
P          26.0       15            6          30      1
Q          38.9       16           17          40      0
R          22.1       16            1          23      1
S          33.1       17           10          58      1
T          48.3       21           17          44      0

SUMMARY OUTPUT

Regression Statistics
Multiple R              0.973
R Square                0.946
Standard Error          2.561
Observations               20

ANOVA
            df       SS         MS        F       Significance F
Regression   4    1722.077   430.519   65.662     2.52741E-09
Residual    15      98.349     6.557
Total       19    1820.426

Coefficients Standard Error   t Stat      P-value
Intercept                 -2.538          2.536     -1.001        0.333
X Variable 1               2.099          0.136    15.473 1.25108E-10
X Variable 2               1.155          0.168      6.877 5.25828E-06
X Variable 3              -0.300          0.062     -4.829        0.000
X Variable 4              -0.725          1.288     -0.563        0.582

Yhat = -2.538 + 2.099X1 + 1.155X2 - 0.300X3 - 0.725X4

Examples:

First example in full (X1 = 3, X2 = 10, X3 = 30):

                          Male (X4 = 0)   Female (X4 = 1)
Intercept                     -2.538          -2.538
2.099  * X1 (= 3)              6.297           6.297
1.155  * X2 (= 10)            11.550          11.550
-0.300 * X3 (= 30)            -9.000          -9.000
-0.725 * X4                    0.000          -0.725
                              ------          ------
Yhat                           6.309           5.584    Difference: 0.725

Further examples (same pattern):

  X1    X2    X3    Yhat (X4 = 0)   Yhat (X4 = 1)   Difference
   9    13    56       14.568          13.843          0.725
  13     6    31       22.379          21.654          0.725
  15    20    66       32.247          31.522          0.725
  19    18    45       44.633          43.908          0.725

In the above examples all the differences are $725. This means that the
dummy variable produces a parallel shift of the regression surface:
downward by $725 (because of the negative coefficient) whenever X4
takes the value 1.
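The parallel shift is easy to demonstrate: for any fixed education, experience, and age, the male and female estimates differ by exactly the dummy coefficient, 0.725 (i.e., $725):

```python
# Fitted model with the gender dummy (coefficients as reported above).
def income_estimate(education, experience, age, female):
    return (-2.538 + 2.099 * education + 1.155 * experience
            - 0.300 * age - 0.725 * female)

# Same person, X4 = 0 vs. X4 = 1: the gap is always the dummy coefficient.
shift = income_estimate(9, 13, 56, 0) - income_estimate(9, 13, 56, 1)
print(round(shift, 3))  # → 0.725
```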

THE DUMMY VARIABLE METHOD

Quarterly sales with a time trend t and three seasonal dummies
(c1 = Spring, c2 = Summer, c3 = Fall; Winter is the reference season):

Season    Sales      t      c1     c2      c3    Y_hat
97 W      107      0        0      0       0      65.3
S       146      1        1      0       0     105.6
S       177      2        0      1       0     163.5
F       139      3        0      0       1     131.5
98 W      108      4        0      0       0     104.8
S       130      5        1      0       0     145.2
S       169      6        0      1       0     203.0
F       154      7        0      0       1     171.0
99 W      144      8        0      0       0     144.4
S       208      9        1      0       0     184.7
S       292      10       0      1       0     242.6
F       271      11       0      0       1     210.6
00 W      224      12       0      0       0     184.0
S       268      13       1      0       0     224.3
S       264      14       0      1       0     282.1
F       191      15       0      0       1     250.1
01 W      146      16       0      0       0     223.5
S       194      17       1      0       0     263.8
S       238      18       0      1       0     321.7
F       213      19       0      0       1     289.7
02 W      194      20       0      0       0     263.1
S       263      21       1      0       0     303.4
S       340      22       0      1       0     361.3
F       317      23       0      0       1     329.3

03 W       269    24        0        0      0      302.7
S        361    25        1        0      0      343.0
S        495    26        0        1      0      400.8
F        466    27        0        0      1      368.8
04 W       438    28        0        0      0      342.2
S               29        1        0      0      382.5
S               30        0        1      0      440.4
F               31        0        0      1      408.4
05 W              32        0        0      0      381.8
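The Y_hat column above can be reproduced from the coefficients in the regression output that follows (intercept 65.28, trend 9.89, and seasonal shifts 30.43, 78.39, 36.50; Winter quarters have c1 = c2 = c3 = 0):

```python
# Seasonal dummy model: Yhat = b0 + b1*t + b2*c1 + b3*c2 + b4*c3.
def seasonal_estimate(t, c1, c2, c3):
    return 65.28 + 9.89 * t + 30.43 * c1 + 78.39 * c2 + 36.50 * c3

# Winter 1997 (t = 0) and the Winter 2005 forecast (t = 32):
print(round(seasonal_estimate(0, 0, 0, 0), 1))   # → 65.3
print(round(seasonal_estimate(32, 0, 0, 0), 1))  # → 381.8
```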

Regression Statistics
Multiple R           0.859
R Square             0.738
Adjusted R Square    0.694
Standard Error      57.375
Observations            29

ANOVA
            df        SS          MS        F      Significance F
Regression   4    222577.5     55644      16.9     1.02986E-06
Residual    24     79006.6      3291.9
Total       28    301584.1

Coefficients       Standard Error    t Stat P-value
Intercept        65.28                27.05          2.41   0.024
X Variable 1      9.89                 1.28          7.74 5.6E-08
X Variable 2     30.43                29.72          1.02   0.316
X Variable 3     78.39                29.69          2.64   0.014
X Variable 4     36.50                29.72          1.23   0.231

[Figure: time-series plot of the quarterly sales, Winter 1997 through
Winter 2005, on a vertical scale of 0 to 600]
