Lecture8

Document Sample

```					Model Building

Estimating the Standard Error in Multiple Linear Regression

The estimate s  in multiple linear regression is given by
SSE
s             ,
n  k 1

where k is the number of independent variables in the model (k+1 is the total number of coefficients
estimated in the model). This measure of fit should always be analyzed in conjunction with the R 2 .
As before, you don’t need to calculate this estimate of   because Excel provides it as part of the
standard output. However, the formula does tell us something important: s  worsens if additional
independent variables are added to the model and SSE (the residual sum of squares) is not lowered
sufficiently to offset the smaller denominator. Generally speaking, one should not include
independent variables in a predictive model (“predictive” meaning a model used for predictions or
valuations) if it worsens s  , even though R 2 will typically increase. Thus, when analyzing fit, it is
essential to examine s  .

Using Dummy Variables: A Model for Predicting Construction Costs at Carlson
George Melas (SMU MBA Class 47) works for Carlson, a national company that designs and
constructs “mission critical buildings” for Fortune 500 companies. These buildings are facilities
that must operate uninterrupted 7-days per week, 24 hours per day. Interruptions are very costly.
Over the past 4 years the company has performed a number of telecommunication projects
geared toward companies that want to set-up the infrastructure necessary to operate their fiber
networks. George has gathered cost and other data on several facilities.
Our goal will be to analyze the data and build a multiple linear regression model that can
predict the future construction costs for this type of facility given only a few key variables. After
some preliminary screening, 5 variables are identified for possible inclusion in the final model:
Gross Area, Network Area, Building Type, Complexity, and City Cost. Gross area is simply the
total area of the project. Network area is the area devoted to mission critical systems and their
backups. City cost is an index that captures local construction costs. Building Type refers to the
type of constructionType A (high-rise), Type B (mid-rise), Type C (single-story), and Greenfield
(new construction). Complexity is an index that captures the number and complexity of redundant
(backup) systems required (“1” in the data indicates high complexity, “0” means normal
complexity). These variables along with the cost of 33 completed projects are in file
Construction.xls.
One of the complicating features of this problem is the presence of categorical independent
variables. These are independent variables that describe a particular qualitative factor instead of
values from a continuous scale. In our example, “Building Type” and “Complexity” are categorical
variables. The typical manner for dealing with such variables is through the use of 0/1 dummy (also
called indicator) variables that effectively “code” the various possibilities. For example, to model
Building Type, we introduce three dummy variables,

BA 6326, EMBA Statistics, Fall 2007                                                                  60
1     if the construction is Type A
IA  
0     otherwise
1     if the construction is Type B
IB  
0     otherwise
1     if the construction is Type C
IC  
0     otherwise

Observe that four categories require only three dummy variables in the regression model. In
general, m categories require only m-1 dummy variables in a regression model. To see why this
works, observe that in the construction example, Type A construction is indicated by
I A  1, I B  0, I C  0 ; Type B by I A  0, I B  1, I C  0 ; Type C by I A  0, I B  0, I C  1 ; and
Greenfield by I A  0, I B  0, I C  0 . In other words, the final category is obtained when the
remaining dummy variables are all set to 0.
Complexity involves the use of a single dummy variable (two categories, hence one dummy
variable),

1 if the construction is high complexity
IX  
0 otherwise

The initial full model, including both continuous and dummy variables, is

y   0   GA xGA   NA x NA   CI xCI   A I A   B I B   C I C   X I X   .

If we fit this model in Excel, we get the following output.

George's Construction Cost Model
Histogram
15       -1.5
-2.5             -0.5                  0.5 1.5
Regression Statistics                                   Frequency
Bin
Frequency

Multiple R          0.904887751                 10       -2.5           0
R Square            0.818821843                          -1.5           2
Adjusted R Square 0.768091958                    5       -0.5           9
Standard Error      852508.5055                           0.5          14
Observations                 33                  0        1.5           5
-2.5 2.5 -1.5      2
-0.5    0.5    1.5             2.5   More
ANOVA                                      More                         1
df           SS            MS               F      Significance F
Regression                   7 8.21147E+13     1.17307E+13 16.14082 7.7432E-08
Residual                    25 1.81693E+13     7.26771E+11
Total                       32 1.00284E+14

Coefficients Standard Error            t Stat           P-value    Lower 95%     Upper 95%
Intercept         698998.9541 1703695.315               0.410284015       0.685093   -2809824.77    4207822.7
GrossArea         146.8058511 72.39693827               2.027790879       0.053366    -2.2983299    295.91003
NetArea           213.6654041 109.5252025               1.950833226       0.062371   -11.9058149    439.23662
TypeA             609797.6799 484001.0976               1.259909704       0.219337   -387020.542    1606615.9
TypeB             -108064.859 612790.4108              -0.176348808       0.861441   -1370129.45    1153999.7
TypeC             101556.6321 420042.6362               0.241776961       0.810925   -763536.765    966650.03
Complexity        297122.8627 388433.6535               0.764925644       0.451477   -502870.661    1097116.4
CityCost          1140763.445 1581964.589                0.72110555       0.477534   -2117351.33    4398878.2

BA 6326, EMBA Statistics, Fall 2007                                                                                        61
RESIDUAL OUTPUT

Observation        Predicted Cost           Residuals Standard Residuals
1    3498742.768            -216866.768    -0.287805621                                                     GrossArea Residual Plot
4000000
2    3861405.313            -77258.3129    -0.102530124

Residuals
3     3071253.12            -1317232.12    -1.748109272                                2000000
4    4669309.415            248190.5852     0.329375709                                       0
5    3562627.455            1429819.545     1.897524944            X                   -2000000 0               5000      10000 15000 20000 25000 30000
6     5119092.33             -167226.33    -0.221927399                                                                          GrossArea
7    2489294.332            -597747.332    -0.793275261
8    3312755.847            -156173.847     -0.20725956                                                         NetArea Residual Plot
9    2365320.458            -390109.458    -0.517717379                                     4000000

Residual
10    3696099.121            1481463.879     1.966062552            X                        2000000

s
11    3818441.559            -636032.559    -0.844083891                                           0
12    4127309.261            9281.739399     0.012317871                                    -2000000 0                  5000          10000         15000     20000
13    3566836.742            234667.2576     0.311428793                                                                             NetArea
14    4582391.412            2045573.588     2.714697055            X
15    3702083.687            -434332.687     -0.57640638                                                            TypeA Residual Plot
16    4165524.011            -1218710.01    -1.617359794                                           4000000

Residual
17    4766138.171            -237307.171    -0.314932242                                           2000000
18    3379751.054            -162407.054    -0.215531699

s
0
19    3469140.537            110323.4631      0.14641115
-0.2                   0.3                 0.8             1.3
20    9742296.805             469871.195     0.623569818                                           -2000000
TypeA
21    3830230.221            733749.7795     0.973765196
22    3501202.714            -190770.714    -0.253173339
TypeB Residual Plot
23    5494452.442            -947814.442     -1.25785212                                    4000000

Residual
24    3809883.517            -789258.517    -1.047431285                                    2000000

s
25    2837058.579            426318.4207     0.565770583                                                 0
26    2583341.212            171058.7881     0.227013485                                      -0.2                      0.3                   0.8              1.3
-2000000
27    5134572.195            826165.8052     1.096411242                                                                          TypeB
28    7029148.702            -504636.702       -0.6697074
29    4463062.211            987856.7895     1.310992639                                                         TypeC Residual Plot
30    2514006.655             164348.345     0.218108002                                 4000000

Residuals
31    3090735.255            -205051.255    -0.272125159                                 2000000
32     5925052.51             -513052.51    -0.680876086
0
33    8115701.391            -576701.391    -0.765345025
-0.2                      0.3                      0.8               1.3
-2000000
TypeC
1
Complexity Residual Plot                                                             CityCost Residual Plot
4000000                                                                          4000000
Residual
Residual

2000000                                                                          2000000
s
s

0                                                                                   0
-0.2       0     0.2    0.4   0.6     0.8   1   1.2
-2000000                                                                       -20000000.75             0.85      0.95     1.05     1.15   1.25    1.35
Complexity                                                                                      CityCost

Questions

(1) What is the estimated cost of a Greenfield site of low complexity having a gross area of 5000
square feet, a network area of 2500 square feet, in a city whose cost index is 1.15?

(2) What is the cost difference, ceteris paribus, between high rise construction and Greenfield
construction? Is this difference statistically significant? (Use the default value   .05 ).

(3) What is the cost of increasing gross square footage (ceteris paribus)?

(4) What is the cost of increasing network square footage (ceteris paribus)?

Fit and Diagnostics
The R 2  .818 tells us that nearly 82% of the variation in cost is explained by the independent
variables. The standard error of \$852208, however, is still pretty large for predictive purposes. The
BA 6326, EMBA Statistics, Fall 2007                                                                                                                                                        62
overall model is statistically significant based on the F-statistic, but the t-statistics reveal that some
of the variables add little to the model. These are undoubtedly inflating the standard error (why?)
and thus should be removed in some systematic manner so that the standard error is improved. We
will discuss a procedure later when we talk about model building.
The residual plots do not reveal any violations of the homoscedasticity assumption or the
independence assumption. However, the histogram appears slightly positively skewed, mostly due
to three positive residuals (marked with an “X” above) that highlight projects that were much more
expensive than our model would predict. These are potential outliers and possibly influential. A
subsequent consultation with George revealed that these three projects were indeed highly unusual
compared to the remaining sample points. In project #5, the work involved sequentially building
and demolishing parts of the structure, often destroying a portion that was only recently constructed.
The second two projects (observations 10 and 14) were done in office buildings that required
opening up the construction site in the evening after the tenants had left and then closing up the
construction site (so it was concealed) in the early morning before the tenants returned. These three
projects were all highly specialized and therefore not representative of “typical” projects. Our
model cannot address construction costs for these types of idiosyncratic projects because they
involve factors we have not included or measured. If we remove these three projects 1 and build a
model for the homogeneous population of “typical” construction projects that remain, we get the
results below.

George's Construction Model                                                              Histogram
(Reduced data set, n=30)                              -2               12 -1                 0                1       2
10
Regression Statistics
Frequency

8
Multiple R 0.95856189
6
R Square 0.918840896
4
Square
2
583873.281
Standard Error
0
Observations         30
-2    -1        0          1       2   More
Bin
ANOVA
df           SS               MS         F     Significance F
Regression            7 8.49108E+13        1.21301E+13 35.5818   1.44393E-10
Residual             22 7.49998E+12        3.40908E+11
Total                29 9.24108E+13

Coefficients Standard Error       t Stat         P-value              Lower 95%          Upper 95%
Intercept  281625.988 1170746.224          0.240552549     0.81213               -2146355.68       2709607.66
GrossArea139.7814566 49.90307723           2.801058859     0.01041               36.28869761       243.274216
NetArea 228.1872883 75.6225792             3.017449163     0.00633               71.35548968       385.019087
TypeA     373203.8497 342476.6769          1.089720483     0.28763              -337050.0691       1083457.77
TypeB     42605.88107 420907.3498          0.101223894     0.92029              -830303.4726       915515.235
TypeC     -51206.5948 292254.6326         -0.175212261     0.86252              -657306.2568       554893.067
Complexity17588.80265 274676.667           0.064034571     0.94952              -552056.3508       587233.956
CityCost 1615820.164 1088843.147           1.483978816        0.152              -642304.737       3873945.07

These results appear to be much improved (discussion).

1
One should never arbitrarily remove data points without documentation or cause.
BA 6326, EMBA Statistics, Fall 2007                                                                                              63
Model Building: Improving the Model for Construction Costs Using Backward
Elimination
Since the construction model was intended to predict construction costs, we want to improve the
standard error. We can do so by building a better model, that is, deciding which variables to include
and which to exclude. One way to accomplish this is to start with a model that includes everything
(the “full model”) and sequentially remove variables (one at a time!) that do not add much to the
model. This procedure is known as backward elimination.
If any variable is worthy of elimination in the full model, it is the Complexity variable (p-
value = .949). After removing this variable and re-running the regression, we obtain a new model
with s  571092.5 and R 2  .9188 . The Type B variable is now the least significant variable with
a p-value of .9087. If we remove this variable, we obtain a new model with s  559231 and
R 2  .9188 . The Type C variable in this model has the biggest p-value at .8127, so we remove it
and obtain a model with s   548587 and R 2  .9185 . The biggest p-value occurs for City Cost
(p-value = .12), and if we remove it we obtain a model with s  564905 and R 2  .9102 . The
value for s  has gone up, and this suggests we have gone too far in eliminating variables. The best
model for predictive purposes appears to be one that includes Gross Area, Net Area, Type A, and
City Cost as independent variables (see below).

Final Model (Backward Elimination)                                                                                        Histogram
12 -2.5                  -0.5-1.5                                       0.5 1.5
Regression Statistics                                                            10             Frequency
Bin
Frequency

Multiple R                0.958427969                                                      8     -2.5                 0
R Square                  0.918584172                                                      6     -1.5                 3
Adjusted R Square          0.90555764                                                      4     -0.5                 7
Standard Error            548587.2856                                                      2      0.5                10
Observations                       30                                                      0      1.5                 8
-2.5     -1.5 2.5 -0.5      0.5    2 1.5                                 2.5        More
ANOVA                                                                                More                             0
df                  SS              MS                F          Significance F
Regression                                     4      8.48871E+13        2.12218E+13    70.5164001        3.02103E-13
Residual                                      25       7.5237E+12        3.00948E+11
Total                                         29      9.24108E+13

Coefficients Standard Error                 t Stat                         P-value            Lower 95%            Upper 95%
Intercept                         361239.4894   977119.3354                  0.369698435                   0.71472048          -1651174.042         2373653.021
GrossArea                         137.3786516   44.96420971                  3.055288917                  0.005284993           44.77319309         229.9841101
NetArea                           229.1804244   70.58166886                  3.247024732                  0.003310698           83.81485814         374.5459907
TypeA                             402580.9277   231605.6919                  1.738216899                  0.094476673          -74419.58973          879581.445
CityCost                          1536504.891   958479.9938                  1.603064123                  0.121481454          -437520.2252         3510530.007

CityCost Residual Plot                                                                    TypeA Residual Plot
1000000                                                                                 1000000
Residuals

Residuals

0                                                                                       0
-10000000.75     0.85   0.95     1.05     1.15   1.25   1.35                            -1000000 0         0.2         0.4      0.6     0.8          1          1.2

-2000000                                                                                -2000000
CityCost                                                                                       TypeA

GrossArea         Residual Plot                                                             NetArea Residual Plot
1000000                                                                                 1000000
Residuals

Residuals

0                                                                                      0
0     5000 10000 15000 20000 25000 30000                                                0            5000         10000       15000            20000
-1000000                                                                                -1000000
-2000000                                                                                -2000000
GrossArea                                                                                       NetArea

BA 6326, EMBA Statistics, Fall 2007                                                                                                                                                 64
In general, backward elimination proceeds by sequentially removing the variable that adds the least
to the model. Remember that the variable that adds the least to the model is the one with the largest
p-value (i.e., the least significant). Some procedures permanently remove the variable with the
highest p-value and this p-value exceeds some pre-established threshold (like   .10 or   .15 ).
Other procedures, like the one we used for the construction model, permanently remove the variable
with the highest p-value only if it improves the standard error. Like any model building procedure,
the model we have arrived at is not necessarily the “best,” but rather a plausible candidate capable
of doing the job.

Questions. In this final model,

(1) What is the cost of increasing gross square footage (ceteris paribus)?

(2) What is the cost of increasing network square footage (ceteris paribus)?

(3) Predict the cost of a Type C project with 6000 square feet gross area, 2500 square feet network
area, in a city whose cost index was 1.18.

(4) Compute an approximate 95% prediction interval for the same problem.
[Use              the                approximate                       prediction        interval         formula
b0  b1 x1  b2 x2      bk xk  t / 2, nk 1 df  s  1  1 / n . Here, k = # independent variables, n =
sample size].

(Ans. \$3571538  \$1148512)

(Note: The exact 95% prediction interval is \$3571538  \$1227694 using another software package)

BA 6326, EMBA Statistics, Fall 2007                                                                           65
More Applications of Multiple Linear Regression
Online Auctions: An Investigation of When to Buy (or Sell) Laptops
on eBay
Suppose we are interested in buying or selling products through online auctions. What situations
are good for buying? What situations are good for selling?

To investigate this problem more rigorously, a researcher collected data on winning bid prices for
used computers purchased through online auctions. Over an approximately three month interval
beginning in May 2002, 488 purchases of Dell’s Latitude CPXH 500GT 500MHz 128MB laptop on
eBay were recorded. Data included (1) the winning bid for a particular auction, (2) the day of the
week the auction closed, (3) the number of bids in the auction, (4) the number of auctions that
closed that day for the same laptop, and (5) the rank of the auction within a day (the order it closed
among auctions for the same item). This data is included in the file Auction.xls.

The day of the week was coded with dummy variables. SUN = 1 if it was a Sunday (0 otherwise),
MON = 1 if it was a Monday (0 otherwise), etc. The Excel output for the model is given below:

BA 6326, EMBA Statistics, Fall 2007                                                                66
Regression Statistics
Multiple R                0.471383199
R Square                  0.222202121
Standard Error            31.55046086
Observations                      488

ANOVA
df             SS           MS            F     Significance F
Regression                         9    135931.7025     15103.5225 15.17284   8.43639E-22
Residual                         478    475816.2955    995.4315806
Total                            487     611747.998

Coefficients Standard Error      t Stat        P-value     Lower 95%     Upper 95%
Intercept                558.693216   5.726602353       97.56102861           0    547.4407827 569.9456493
SUN                    -4.294706396   5.260381458      -0.816424898   0.414664    -14.63104331 6.041630523
MON                     9.906281109   5.670000132       1.747139485   0.081255    -1.234932159 21.04749438
TUES                    17.39920411   5.252984387       3.312251251   0.000996     7.077401992 27.72100622
WED                     15.38320751   5.471611839       2.811458115   0.005134     4.631815447 26.13459957
THUR                    16.90919123   5.397031356       3.133054103   0.001836     6.304345388 27.51403708
FRI                     10.42141417   5.090961399       2.047042465      0.0412    0.417977601 20.42485074
#Bids                    1.52278322   0.291589103       5.222359836   2.64E-07     0.949827963 2.095738477
#AUCTIONS              -0.839917408   0.356722205      -2.354541982   0.018949     -1.54085534 -0.138979476
Rank-in-Day            -1.761991465   0.411566571      -4.281182168   2.25E-05    -2.570695316 -0.953287614

CORRELATION MATRIX
SUN              MON         TUES             WED        THUR           FRI           #Bids #AUCTIONS       Rank-in-Day
SUN                               1     -0.148563898    -0.16885815   -0.160313 -0.166031678 -0.185440699      0.026809 0.005221       0.003082611
MON                    -0.148563898                1   -0.147337406   -0.139881 -0.144871165 -0.161806532      0.089396 -0.054888     -0.032406068
TUES                    -0.16885815     -0.147337406              1   -0.158989 -0.164660978 -0.183909765      0.048373 0.135914       0.080244533
WED                    -0.160312802     -0.139881152   -0.158989315           1 -0.156328035 -0.174602705      0.018363 -0.111259     -0.065688364
THUR                   -0.166031678     -0.144871165   -0.164660978   -0.156328            1 -0.180831348     -0.009552 -0.138577     -0.081817125
FRI                    -0.185440699     -0.161806532   -0.183909765   -0.174603 -0.180831348            1     -0.015572 -0.048709     -0.028758044
#Bids                   0.026809086       0.08939618    0.048372866    0.018363 -0.009551683 -0.015571909             1 -0.073642     -0.119201966
#AUCTIONS               0.005221148     -0.054887516    0.135913531   -0.111259 -0.138577096 -0.048708705     -0.073642         1      0.590408718
Rank-in-Day             0.003082611     -0.032406068    0.080244533   -0.065688 -0.081817125 -0.028758044     -0.119202 0.590409                 1

INVERSE OF CORRELATION MATRIX
1.686543874      0.768556569    0.811538059 0.812097   0.833052957 0.884575206        -0.135624    0.171282   -0.015765442
0.768556569      1.601056562    0.744962263 0.754937   0.774085283 0.817335382        -0.180339    0.212787   -0.020963297
0.811538059      0.744962263    1.662092941 0.779789   0.796794509 0.856283502         -0.16076     0.04883   -0.018687329
0.812097352      0.754937389    0.779789225 1.672419   0.826821526 0.868783747         -0.11775     0.27359   -0.013687649
0.833052957      0.774085283    0.796794509 0.826822   1.712524719 0.893256151        -0.094042    0.302189   -0.010931806
0.884575206      0.817335382    0.856283502 0.868784   0.893256151   1.77626704       -0.102422    0.230314   -0.011905958
-0.135624146     -0.180339334    -0.16076004 -0.11775 -0.094042202 -0.102422467         1.040647   -0.013248       0.1209686
0.171281717      0.212786718    0.048830138 0.27359    0.302189257 0.230313932        -0.013248     1.62273   -0.907884524
-0.015765442     -0.020963297   -0.018687329 -0.013688 -0.010931806 -0.011905958        0.120969   -0.907885    1.549175529

(a) Do the variables included in the model collectively explain a significant amount of the variation
in winning bids? Cite the appropriate test, your test statistic, and your conclusion at the   .05

(b) What is the price difference between a Dell laptop auctioned on Saturday and one auctioned on
Sunday (all other things held equal)? Is this difference statistically significant? Cite your null and
alternative hypothesis, the relevant test statistic, and your conclusion at the   .05 level. What is
the p-value?

(c) Suppose you are more generally interested in whether it is better to auction laptops on weekdays
or weekends (all other things being equal). What is your conclusion based on this model? Briefly
summarize the appropriate test results and state your general findings. What day would you auction

BA 6326, EMBA Statistics, Fall 2007                                                                                                                   67
(d) Is there a relationship between the winning bid price and the rank of the auction (in this model)?
Cite an appropriate test, your test statistic, and your conclusion (at the   .05 level). What is the
p-value for this test?

(e) Can the coefficient values in this model be interpreted safely? Cite appropriate evidence to

Are “Wins” Related to Ratings for Relief Pitchers?
Relief pitchers are baseball’s equivalent of place kickers in the NFL. You bring in some poor sap
with the game on the line and he’s either a forgotten hero or a memorable goat. Many great relief
pitchers do not have much on their record in the way of wins or losses since their role is to save
games, i.e., protect a lead in the late innings.

If you want to know what the “experts” think, CBS Sportsline.Com (September 24, 2002) posts
ratings for the majority of MLB relief pitchers. Along with the pitchers’ ratings they post assorted
“hard” data on performance. The site does not include any information on how they arrive at their
expert ratings. The data is contained in the file Relief.xls.

(a) Are “wins” related to ratings? Build a simple linear regression model and discuss your results.

(b) There appears to be a lot of residual “noise” in the data. Suppose we include other variables to
account for this noise. Are wins related to ratings in this model?

Assignment 8 (Do Not Hand This In)

1.      Multiple Regression Chapter, problem 37, (page 682-683).
In this problem, also check for multicollinearity using the correlation matrix and VIFs.

2.      Multiple Regression Chapter, problem 38, (page 683-684).

3.      Multiple Regression Chapter, problem 50, (page 694).                   Note: Use Excel’s
standardized residual plot to answer part e and part f.

4.      Multiple Regression Chapter, problem 51, (page 695).

5.      Multiple Regression Chapter, problem 52, (page 695-696).

BA 6326, EMBA Statistics, Fall 2007                                                                68

```
DOCUMENT INFO
Shared By:
Categories:
Tags:
Stats:
 views: 3 posted: 11/30/2011 language: English pages: 9
How are you planning on using Docstoc?