Chapter 13: Multiple Regression Analysis and Model Building
Models that include more than one independent variable are called
multiple regression models. The general form of these models is:
y = β0 + β1x1 + β2x2 + … + βkxk + ε
y is the dependent variable
x1, x2, …, xk are the independent variables
βi determines the contribution of the
independent variable xi
The value of the coefficient βi determines the contribution of the
independent variable xi and β0 is once again the y-intercept. The
coefficients β0, β1, …, βk are usually unknown because they
represent population parameters, and so we generally have to find
estimates of them from our sample.
In forming our model, we have to follow the same Five Step Plan
as we did with Simple Linear Regression.
Step 1: Decide which model we wish to use
Now that we have more than one predictor variable we will have
more complicated models. If we are trying to predict values of our
dependent variable at different levels of 3 different independent
variables, our model would be of the form:
y = β0 + β1x1 + β2x2 + β3x3 + ε
We will still try to find the Least Squares Line through our points,
that is, the line that minimizes SSE = Σ(y – ŷ)².
Step 2: Use sample data to estimate unknown parameters
In multiple regression, whilst it is still possible to estimate all our
different parameters by hand, in actuality the process is tedious and
time consuming and so we will always be given ANOVA tables
with the appropriate estimates.
Step 3: Specify the probability distribution of the random
error term and estimate the standard deviation of this distribution
Our estimator of σ² for a multiple regression model with k
independent variables is:
$$s^2 = \frac{SSE}{n - (k + 1)} = MSE$$
Note that in the ANOVA table, this is the same as our mean
squared error. Also, the assumptions we made over the distribution
of ε in simple linear regression will again apply.
Step 4: Evaluate how useful the model is
If we want to test the usefulness of a particular term in our model,
we would perform a t-test and look at the p-value for that term.
However, if we wanted to test whether any of the terms in our
model are useful in predicting y we would use the F-test.
The F-test is a test of the hypothesis:
H0: β1 = β2 = … = βk = 0
H1: At least one of the coefficients is nonzero
Note that our H0 will always include all of our parameters except
our y-intercept β0.
Note that this test has a general set-up of:
H0: None of the explanatory variables are helping
H1: At least one of the explanatory variables is helping
which shares the general format seen throughout the last couple of
chapters:
H0: Model not useful
H1: Model useful
Once we know the test statistic of our F-test, we will often want to
determine whether it is significant. As in all our tests, if our test
statistic is more extreme (i.e. greater) than our critical value, we
reject H0.
By rejecting H0 we are saying that our model is significantly better
than just estimating y with ȳ. However, we are not necessarily
saying that the model we have is the best model we could find. In
real life, once we have found our model to be useful, we often try
to “fine-tune” it by adding more independent variables or higher-
order terms. We also often look at each term currently in our
model to see which are individually significant.
Step 5: Using The Model For Estimation And Prediction
Calculating confidence or prediction intervals by hand in multiple
regression is incredibly tedious and time consuming. We can
however request them with our computer output and interpret their
meaning in the same way as before.
The department head of a University’s Accounting department
wanted to see if she could predict the GPA of students using the
number of credit hours and total SAT scores of each student. She
takes a sample of students and generates the following Excel
output:

              df    SS       MS       F        p-value
Regression    2     1.4468   0.7234   9.7286   0.0488
Residual      3     0.2231   0.0744
Total         5     1.6698

              Coefficients   Standard Error   t-stat    p-value
Intercept     5.6357         2.1045           2.6779    0.0752
Credit Hours  -0.3155        0.0770           -4.0974   0.0263
SAT Total     0.0014         0.0014           0.9999    0.3923
Q. The standard deviation of the errors is closest to?
A. $$s^2 = \frac{SSE}{n - (k + 1)} = MSE = .0744$$
Therefore $s = \sqrt{.0744} = .2728$
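The quantities in the ANOVA table can be checked with a few lines of code. The following is a minimal sketch in Python that uses only the sums of squares and degrees of freedom printed in the Excel output above; the variable names are our own.

```python
import math

# Values read from the Excel ANOVA table in the GPA example.
ss_regression = 1.4468
ss_residual = 0.2231      # this is SSE
n = 6                     # total df = n - 1 = 5
k = 2                     # credit hours and SAT total

df_error = n - (k + 1)                 # residual degrees of freedom
mse = ss_residual / df_error           # estimate of sigma squared
s = math.sqrt(mse)                     # estimate of sigma
f_stat = (ss_regression / k) / mse     # MS(regression) / MSE

print(f"MSE = {mse:.4f}, s = {s:.4f}, F = {f_stat:.2f}")
```

Small differences from the printed table in the last decimal place are due to rounding of the table entries.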
Q. What is the predicted GPA of an accounting student with 12
credit hours and a 1200 SAT total score?
A. Let y represent a student’s GPA
x1 represent number of credit hours
x2 represent SAT total score
Our model is of the form:
y = β0 + β1x1 + β2x2 + ε
So using our parameter estimates we get:
ŷ = 5.6357 + (-0.3155)x1 + (0.0014)x2
So when x1=12 and x2=1200:
ŷ = 5.6357 + (-0.3155)(12) + (0.0014)(1200) ≈ 3.53
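The same prediction can be written as a small function. This is only a sketch in Python; the function name predict_gpa is hypothetical and the coefficients are the estimates from the Excel output.

```python
def predict_gpa(credit_hours, sat_total):
    """Predicted GPA from the fitted equation in the example."""
    b0, b1, b2 = 5.6357, -0.3155, 0.0014   # estimates from the Excel output
    return b0 + b1 * credit_hours + b2 * sat_total

# Student with 12 credit hours and a 1200 SAT total score.
prediction = predict_gpa(12, 1200)
print(round(prediction, 2))
```

Remember that the prediction is only trustworthy for credit hours and SAT totals within the range of the sampled data.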
Other questions that could be asked are:
• Is the overall regression model useful at a 5% level of
significance?
• Which of the explanatory variables are useful at a 5% level of
significance?
• By how much would we predict that a student’s GPA would
increase based on a 100-point increase in SAT Total score,
holding Credit Hours constant?
• What proportion of the variability in a student’s GPA is
explained by using SAT Total and Credit Hours in a
regression model?
• When using a regression model that predicts student GPA
based on SAT Total and Credit Hours, we would expect
approximately 95% of the data to fall within what distance of
the regression line?
Q. Is the overall regression model useful at a 5% level of
significance?
A. The table that comments on the entire regression model as a
whole (rather than breaking it down variable-by-variable) is the top
table and the test we are interested in is the F-test which tests the
hypotheses:
H0: None of the explanatory variables are helping
H1: At least one of the explanatory variables is helping
Since the p-value of this test is less than alpha (.0488 < .05) we
reject H0 and conclude that the regression model is useful.
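The p-value in the table can be reproduced by hand in this particular example. When the numerator degrees of freedom equal 2, the upper-tail area of the F distribution has a simple closed form; the sketch below (Python, function name hypothetical) relies on that special case and does not apply for other numerator degrees of freedom.

```python
def f_pvalue_num_df_2(f_stat, df_error):
    """Upper-tail area of an F(2, df_error) distribution.

    Closed form valid ONLY when the numerator degrees of freedom are 2.
    """
    return (1 + 2 * f_stat / df_error) ** (-df_error / 2)

alpha = 0.05
p_value = f_pvalue_num_df_2(9.7286, 3)   # F and residual df from the table
reject_h0 = p_value < alpha

print(f"p-value = {p_value:.4f}, reject H0: {reject_h0}")
```

For models with more than two explanatory variables one would normally read the p-value from the computer output or use a statistics library.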
Q. Which of the explanatory variables are useful at a 5% level of
significance?
A. Testing whether the coefficient of Credit Hours is equal to zero
gives us a p-value of .0263 (see table). So we can say that, as our
model stands, credit hours is significantly linearly related to (and is
a useful predictor of) GPA (since .0263 < .05).
Doing the same test with the coefficient of SAT Total Score gives
us a p-value of .3923, which is not significant. So, as our model
stands, SAT Total Score does not have a significant linear
relationship with GPA.
Q. By how much would we predict that a student’s GPA would
increase based on a 100-point increase in SAT Total score, holding
Credit Hours constant?
A. The definition of the slope is that “for every one-unit increase in
x, we predict that y will increase by the coefficient of x, holding all
other explanatory variables constant”. Therefore if SAT Total
increases by 100, we predict that y (Student GPA) will increase by
0.14 (100 times the coefficient of SAT Total), holding credit hours
constant.
Q. What proportion of the variability in a student’s GPA is
explained by using SAT Total and Credit Hours in a regression
model?
A. This is the definition of R-Squared.
$$R^2 = \frac{\text{Amount of error that disappeared}}{\text{Total amount started with}} = \frac{1.4468}{1.6698} \approx 0.866$$
Q. When using a regression model that predicts student GPA based
on SAT Total and Credit Hours, we would expect approximately
95% of the data to fall within what distance of the regression line?
A. The empirical rule says that we expect approximately 95% of
the data to fall within two standard deviations of the mean. In the
context of regression, this tells us that we expect approximately
95% of the data to fall within two standard deviations of the
regression line.
Since s = .2728, we expect 95% of our predictions for student GPA
to fall within .5456 of the actual value.
The Coefficient Of Determination
Recall that R² measures the proportion of the variation in y that
can be explained by using x to predict y. It can be calculated as:
$$R^2 = \frac{SS_{yy} - SSE}{SS_{yy}} = 1 - \frac{SSE}{SS_{yy}}$$
Another way to calculate R2 if we are given an ANOVA table is by
dividing the sum of squares for the model by the total sum of
squares as follows:
$$R^2 = \frac{\text{Explained variability}}{\text{Total variability}} = \frac{SS_{MODEL}}{SS_{TOTAL}}$$
The calculation of R2 does not involve any adjustment for degrees
of freedom. As a result of this, we could add an irrelevant term to
our model and R2 would never decrease and in almost all cases it
would even increase. Because of this, there is a tendency for R2 to
be too large. This bias can be removed by calculating instead an
adjusted R2, using the formula below.
$$\text{Adjusted } R^2 = 1 - \frac{SSE/(n - k - 1)}{SS_{TOTAL}/(n - 1)}$$
The gap between the R2 and the adjusted R2 tends to increase as
non-significant independent variables are added to the regression
model. As n increases, the difference between the R2 and the
adjusted R2 becomes less.
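As an illustration, R² and adjusted R² can be computed from the ANOVA table of the GPA example (SS for the model 1.4468, SSE 0.2231, total SS 1.6698, n = 6, k = 2). This is a minimal Python sketch; the function names are our own.

```python
def r_squared(ss_model, ss_total):
    """Proportion of total variability explained by the model."""
    return ss_model / ss_total

def adjusted_r_squared(sse, ss_total, n, k):
    """R-squared adjusted for the degrees of freedom used by k predictors."""
    return 1 - (sse / (n - k - 1)) / (ss_total / (n - 1))

# Values from the GPA example's ANOVA table.
r2 = r_squared(1.4468, 1.6698)
adj_r2 = adjusted_r_squared(0.2231, 1.6698, n=6, k=2)

print(f"R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}")
```

With only six observations the adjustment is large, which matches the remark that the gap shrinks as n increases.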
Indicator variables (also known as Dummy variables) are used
when we wish to incorporate a categorical explanatory variable
into our analysis. They are just variables that can take on the
values 0 or 1, where a 1 indicates that a subject possesses a
characteristic or is a member of an indicated group, while a 0
indicates the converse.
We will often use indicator variables in our analysis if we take into
account non-numeric variables such as gender (0=male, 1=female)
or, in the case of medical data, treatment (0=placebo, 1=drug).
Suppose that a toy manufacturer wishes to determine whether his
red toys sell better than his blue toys. He gathered data regarding
sales levels, color, price and average age levels for which the toys
are intended. He entered these into a computer and obtained the
multiple regression equation:
y = 70,663 – 713x1 – 59.6x2 + 66.4x3
Where y refers to sales levels (in units), X1 refers to color (0=blue,
1=red), X2 refers to retail price (in dollars) and X3 refers to average
age level (in years).
Q. What is the prediction for the sales level of a red toy costing
$20 and intended for 5-year-old children?
A. ŷ = 70,663 – 713(1) – 59.6(20) + 66.4(5) = 69,090 units
Notice that the equation is basically telling us that a red toy will
sell 713 fewer units than a blue toy, holding all other factors
constant.
It would be tempting, if we had a situation like that above, but the
toys came in three colors, to code them as 0=blue, 1=red, 2=green.
We should not fall into this trap for several reasons:
1. The values 0, 1 and 2 impose an ordering on the colors, with
red always falling between blue and green (for example, green
selling less than red, which in turn sells less than blue).
We really don’t know whether this is the case though. With a
0-1 variable no such ordering is imposed, since the sign of
the coefficient simply tells us which of the two groups sells more.
2. Since the variable has only one coefficient, if we coded it as
mentioned we are committing ourselves to the fact that the
difference between the sales of blue and red toys is the same
as the difference between the sales of red and green toys.
What we should do in a situation like this is create two different
indicator variables. Our first variable will be blue (1=blue, 0=not)
and our second will be red (1=red, 0=not). A variable should not
be assigned to green since all toys for which a 1 was not recorded
for either the blue or red variables must be green.
Similarly, if a categorical variable has c categories, we must create
c-1 indicator variables and put them all in our regression model.
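The coding scheme described above can be sketched in a few lines of Python. The function name indicator_columns and the sample data are hypothetical; the point is only that c categories produce c-1 columns, with the baseline category represented by zeros in every column.

```python
def indicator_columns(values, baseline):
    """Build c-1 indicator (0/1) columns for a categorical variable.

    The baseline category gets no column of its own: it is the case
    where every indicator equals 0.
    """
    levels = sorted(set(values) - {baseline})
    return {level: [1 if v == level else 0 for v in values] for level in levels}

# Hypothetical toy colors; green is treated as the baseline.
colors = ["blue", "red", "green", "red", "blue"]
dummies = indicator_columns(colors, baseline="green")

for name, column in dummies.items():
    print(name, column)
```

Which category is chosen as the baseline does not change the fit of the model, only the interpretation of the coefficients.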
In many situations, especially where there is one class of extreme
interest, everything else is put in the baseline (X=0) class, while
items having the characteristic of interest are put in the (X=1)
class. For example, if our categorical variable was religious
preference, we may code our data so 1=Catholic, 0=non-Catholic,
rather than split our data into smaller denominations and
overcomplicate our model.
More Complex Regression Models
Consider the following regression models:
y = β0 + ε
This is our null model, where we predict y using no explanatory
variables. Note that in this case the least squares estimate of β0 is ȳ.
y = β0 + β1X1 + ε
This is a first order model (meaning the highest power of any
predictor variable in the model is 1) with one independent variable.
This is the model we looked at in simple linear regression.
y = β0 + β1X1 + β2X2 + β3X3 + ε
This is a first order model with three independent variables. We
started looking at this type of model with multiple regression.
y = β0 + β1X1 + β2X1² + ε
This is a second order model (meaning the highest power of any
predictor variable in the model is 2) with one independent variable.
y = β0 + β1X1 + β2X1² + β3X1³ + ε
This is a third order model with one predictor variable.
y = β0 + β1X1 + β2X2 + β3X1X2 + ε
This type of model is considered to be a second order model with
two predictor variables. The X1X2 term is an interaction term. Even
though the model has 1 as the highest power of any one variable, it
is considered to be a second order equation because of the
interaction term.
All of the cases above can be thought of as a special case of a
General Linear Model. The first three cases are models we have
looked at already and the next three are cases we will move on to.
Polynomial regression models are regression models that are
second or higher order models. They contain squared, or higher
powers of the predictor variable.
If the simple model:
y = β0 + β1X1 + ε
appears to be too high for moderate values of X1 and too low in the
extremes, or vice versa, then we should worry about a possible
curvilinear relationship between X1 and Y.
If we suspect a curvilinear relationship exists we could try a
quadratic model such as:
y = β0 + β1X1 + β2X1² + ε
A model like this allows our model to curve with the data.
Even better fits can occasionally be obtained by trying cubic
models such as:
y = β0 + β1X1 + β2X1² + β3X1³ + ε
In general, one could keep trying higher order polynomials, but
that is not advised. Even though adding additional terms will result
in a higher R2 (and therefore a better fitting model) there is always
a danger of over-fitting the model to our sample points.
It is theoretically possible to exactly fit any data set with n points
with an (n-1)th degree polynomial. However, if you really attempt
this, you will get a wildly oscillating function that does nothing but
fit the observed data. It would be utterly useless for predicting how
x actually affected y in the general population.
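To see that an exact fit is always possible, the sketch below evaluates the (n-1)th degree polynomial through n points using the Lagrange form. This is an illustration in Python with hypothetical, roughly linear data; it is not a fitting method one would use in practice.

```python
def lagrange_interpolate(xs, ys, x):
    """Evaluate the degree (n-1) polynomial through the n points (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Hypothetical data: six points that are roughly linear with some noise.
xs = [0, 1, 2, 3, 4, 5]
ys = [1.0, 2.1, 2.9, 4.2, 4.8, 6.1]

# The 5th degree polynomial reproduces every observed point exactly ...
fitted = [lagrange_interpolate(xs, ys, x) for x in xs]

# ... but says nothing reliable about values between or beyond the points.
between = lagrange_interpolate(xs, ys, 4.5)
beyond = lagrange_interpolate(xs, ys, 7)
print(fitted, between, beyond)
```

Comparing the values between and beyond the observed points with what a straight line would give is a quick way to see the over-fitting problem.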
Another drawback of higher-order models is that they tend to
become difficult to interpret. They really don’t help you find trends
or general directions in your data. Due to all this, it is very rare to
use models with higher than second-order terms.
Regression Models With Interaction
Often when two different independent variables are used in a
regression analysis, there is an interaction between the two
variables. An interaction between two explanatory variables (X1
and X2) simply implies a change in the coefficient of X1 from one
value of X2 to another value of X2.
In cases where we suspect there may be an interaction, we can use
a model like this one:
y = β0 + β1X1 + β2X2 + β3X1X2 + ε
or maybe try one like this, if we suspect that the relationship
between our X variables and Y is not linear:
y = β0 + β1X1 + β2X2 + β3X1² + β4X2² + β5X1X2 + ε
In a two-predictor regression with interaction, the response surface
is not a plane but a twisted surface (like "a bent cookie tin"). The
change of slope is quantified by the value of β3 in the first of the
two models above. Including it allows the effect of one explanatory
variable on y to depend on the value of the other, which is a
different idea from correlation between the two explanatory variables.
Once we have included all these terms in our model, we will
almost inevitably find that more than one of the terms does not
appear to be making a significant contribution to our model. It is
not right to automatically discard all these apparently insignificant
terms from our model in one go. However, you may not want to
waste time eliminating one variable at a time, many times
over. The test that follows shows you how to test whether it is
statistically permissible to proceed from the full model of (k+1)
parameters to a reduced model with (g+1) parameters.
1. Perform the regression using the full (k+1) parameter model.
Calculate the SSE and MSE for the full model, and label them
SSE1 and MSE1 = SSE1/[n - (k+1)].
2. Perform the reduced model regression using only the (g+1)
parameters being considered for inclusion in the reduced model.
Calculate the SSE and MSE for this model, and label them
SSE2 and MSE2 = SSE2/[n – (g+1)]. Note that MSE2 is not
really needed for this test procedure.
3. Calculate the increase in SSE caused by dropping the (k-g)
parameters from the full model. This sum of squares due to the
dropping can be calculated as SS(drop) = SSE2 – SSE1. The
MS(drop) = SS(drop)/(k-g).
4. Compare MS(drop) with MSE1 using an F-test with (k-g) and
n-(k+1) degrees of freedom. The F-test will be testing the
hypotheses:
H0: βg+1 = βg+2 = … = βk = 0
H1: At least one of the dropped coefficients is non-zero
If the test is insignificant, then the (k-g) terms can be dropped
from the model, with no significant loss of predictive power. If
the test is significant, we reject H0, which means that we can’t
jump from the full model to the reduced model without a
significant loss of predictive power.
We could summarize the above information in the ANOVA table
below:

                 Df        SS         MS         F
Dropped Terms    k-g       SS(drop)   MS(drop)   MS(drop)/MSE1
Full Model       n-(k+1)   SSE1       MSE1
Reduced Model    n-(g+1)   SSE2
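The four steps above reduce to a short calculation once SSE1 and SSE2 are known. The following Python sketch uses hypothetical values (n = 30, k = 5, g = 2, SSE1 = 120, SSE2 = 150) purely to show the arithmetic; the function name is our own.

```python
def partial_f(sse_full, sse_reduced, n, k, g):
    """F statistic for dropping (k - g) terms from the full model.

    Returns the statistic together with its numerator and
    denominator degrees of freedom.
    """
    df_drop = k - g
    df_error = n - (k + 1)
    ms_drop = (sse_reduced - sse_full) / df_drop   # MS(drop)
    mse_full = sse_full / df_error                 # MSE1
    return ms_drop / mse_full, df_drop, df_error

# Hypothetical values, not taken from any data set in these notes.
f_stat, df1, df2 = partial_f(sse_full=120.0, sse_reduced=150.0, n=30, k=5, g=2)
print(f"F = {f_stat:.2f} on ({df1}, {df2}) degrees of freedom")
```

The resulting statistic would then be compared with the critical value of the F distribution with (k-g) and n-(k+1) degrees of freedom.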
In a case where our F-test is significant and we can’t collapse to
the reduced model, we might consider examining reduced models
which fall somewhere between the full model and the reduced model above.
The drawback to this method is that one must tell the computer
which models to try at each stage. Next we will consider several
methods which will allow the program to give us the “best” model
in one run of the program.
Variable Selection Procedures
There are various ways to select the variables which are used in a
multiple regression model. There is no one "best" way, although
most people would agree that one wants the simplest possible
model which explains the response variable adequately. The
difficulty is in determining what is "adequate" and in deciding
what trade-off to make in terms of model complexity for model fit.
SAS offers nine different /SELECTION= options within PROC
REG. Some of these (such as STEPWISE) actually pick a best
model, while others (such as RSQUARE) list the models which
have the best value of the statistic under consideration for
a given number of explanatory variables.
A brief review of some of these follows:
/SELECTION=NONE; This is the SAS default. If this is used,
the requested model is fit, but no attempt to compare it to other
models is made. This is used when one is sure that one wants to
use a certain model, or after one has used some of the other
procedures shown below to narrow the class of models being
examined to a small group.
/SELECTION=F SLE=(α); This is the FORWARD selection
option. SAS starts with the null model (y = β0 + ε), and then adds
the most significant variable. After that it adds the next most
significant variable (with the first already entered into the
equation). This process continues until none of the variables left
outside of the model meet the entry-level selection value, α,
specified by the SLE statement. If no SLE value is specified, SAS
uses a default of α=.50. This gives a reasonable idea of what
variables might be important, but tends to keep too many unneeded
variables.
/SELECTION=B SLS=(α); This is the BACKWARDS selection
option. SAS starts with the full k-variable model, and deletes the
least significant variable. After that, it deletes the next least
significant variable remaining, and so on, until all variables remaining are
significant at the Stay Selection Level (SLS) specified. If no
selection level is specified, SAS uses SLS=.10. For a small number
of possible predictors, k, Forward and Backward regression (with
SLE=SLS) tend to give the same final models, but as k increases,
there is a good chance that they disagree. This is why the next
procedure was invented.
/SELECTION=STEPWISE SLE=(α1) SLS=(α2); This is the
STEPWISE selection option. It is a combination of Forward and
Backward Regression. For those options, once a variable was
entered (for FORWARD) or deleted (for BACKWARD), that
variable was never re-examined. In STEPWISE, a variable can be
added or deleted several times before the final model is attained,
dependent on the other variables in the model. This is quite
important when collinearity is present, because a variable which
might initially have appeared insignificant (in the presence of some
variables,) might become very significant in the presence of others
(and vice-versa). The final model is achieved when no variables
outside of the model meet the SLE criteria, and all in the model
pass the SLS criteria. The default values for both SLE and SLS are
0.15 under STEPWISE.
/SELECTION=RSQUARE BEST=b; Unlike the F, B and
STEPWISE options, this option doesn't yield a best model. It
yields the b models with the highest R² values for each of p=1, 2, ..., k
predictors. If the BEST=b option is not specified this is done for all
2^k models, which would take an enormous amount of time and
memory. A value of BEST=5 works well in most applications.
Once these models are printed out, one might want to look more
closely at some of the models which were "best in the class of p-
variable models" and use some other criteria to pick which one of
those is best.
/SELECTION=MAXR and /SELECTION=MINR are two
procedures similar to /SELECTION=RSQUARE but are much
more computationally intensive and not really worth using.
/SELECTION=ADJRSQ BEST=b; This is very similar to the
/SELECTION=RSQUARE procedure except it reports the
adjusted R² values for the models. This is another method that tends
to over-parameterize considerably.
/SELECTION=CP BEST=b; This option will rank the models
exactly the same way as RSQUARE does, so most people run them
at the same time, if at all. The CP statistic (called Mallows' C(p)) is
more interesting than R-squared, since it can be used to determine
the best model, although whether the model so determined is really
the best is a matter of interpretation.
Once you have run these selection procedures, you will generally
have several models which could be used. Frequently, it doesn't
really matter which is used, and the decision over which will be
most helpful would be best made based on a general understanding
of the variables themselves.
Two popular ways to choose a final model are:
1. Pick the best model by STEPWISE with SLE=.20, SLS=.10
2. Pick the simplest model such that Mallows' C(p) is
approximately equal to p, where p in this context means the
number of parameters (including the intercept) in the model.
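These notes do not state the formula for Mallows' C(p), so the sketch below assumes the standard definition C(p) = SSE_p / MSE_full - (n - 2p), where SSE_p is the error sum of squares of the candidate model with p parameters and MSE_full is the mean squared error of the full model. The Python function name and all numbers are hypothetical.

```python
def mallows_cp(sse_candidate, mse_full, n, p):
    """Mallows' C(p) for a candidate model with p parameters (intercept included).

    Assumes the standard definition: SSE_p / MSE_full - (n - 2p).
    """
    return sse_candidate / mse_full - (n - 2 * p)

# Hypothetical setting: n = 30 observations, full model has 6 parameters
# with SSE = 120, so MSE_full = 120 / (30 - 6) = 5.
n = 30
mse_full = 120.0 / (n - 6)

cp_full = mallows_cp(120.0, mse_full, n, p=6)      # full model
cp_small = mallows_cp(135.0, mse_full, n, p=3)     # a smaller candidate
print(cp_full, cp_small)
```

By construction the full model always has C(p) equal to its own p, so the rule of thumb is only informative for the smaller candidate models.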
There are many problems that can undermine our attempts to fit a
model to our data. Some of the major problems we can encounter
are outlined below.
Often two or more of the independent variables used in our model
contribute redundant information. That is, the independent
variables are correlated with each other. Suppose we wanted to
construct a model to predict a student’s GPA based on their Total
SAT score (x1), their Verbal SAT score (x2) and their Math SAT
score (x3). Although all three of our independent variables
contribute information for the prediction of GPA, some of the
information is overlapping because Total SAT score is highly
correlated to Verbal SAT score and Math SAT score.
If we were to fit a model using all three of these independent
variables to predict GPA we might find that the t values for β1, β2
and β3 (the coefficients of x1, x2 and x3) are by themselves not
significant, yet our F-test still says the model is useful. This is
because all three of the variables are contributing to the model, but
the contribution of one overlaps with that of the other two.
Another way that multicollinearity can be recognized is by
inspection of a correlation matrix. If we are using three different
explanatory variables (X1, X2 and X3) to improve our prediction of
a dependent variable (Y), we may get a correlation matrix like the
following:

        Y      X1     X2     X3
X1      .84    1
X2      .34    .72    1
X3      .12    .36    .08    1
Since multicollinearity is a problem if the explanatory (X)
variables are highly correlated with one another, the only possible
problem here is that maybe X1 and X2 are providing very similar
information (r = .72). Note that the high correlation between Y and
X1 is a good thing, since we want X and Y to be highly correlated.
Often when we have correlated explanatory variables we choose to
only include the bare minimum of the correlated variables in our
model.
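Screening a correlation matrix for highly correlated explanatory variables is easy to automate. The Python sketch below uses hypothetical SAT scores for five students and an arbitrary cut-off of 0.7, chosen only because the notes treat r = .72 as worth a second look; neither the data nor the cut-off comes from the source.

```python
import math

def pearson(x, y):
    """Sample correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def flag_collinear(predictors, threshold=0.7):
    """List pairs of predictors whose |r| is at least the threshold."""
    names = list(predictors)
    flagged = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            r = pearson(predictors[names[i]], predictors[names[j]])
            if abs(r) >= threshold:
                flagged.append((names[i], names[j], round(r, 2)))
    return flagged

# Hypothetical scores; total is the sum of verbal and math.
verbal = [500, 550, 600, 650, 700]
math_score = [520, 600, 540, 700, 640]
total = [v + m for v, m in zip(verbal, math_score)]

flagged = flag_collinear({"verbal": verbal, "math": math_score, "total": total})
print(flagged)
```

Only the explanatory variables are passed in, since a high correlation between Y and a predictor is desirable rather than a problem.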
Prediction Outside The Experimental Region
Throughout the course we have emphasized that predictions from
our regression model are only valid over the ranges of our
explanatory variables. If we try to use our model to predict outside
of the range of our x variable(s), we can encounter many
problems.
Curvature to the Data
It is important to realize that most of the models we have
considered in the course have been straight-line models. Often
though, if we were to perform some transformation (maybe using
x² rather than x to predict y) we would get a much better-fitting
model.
Consider the scatterplot below:
[Scatterplot of y against x; x-axis from 0 to 10]
In a situation like that above, using x to predict y would clearly be
a significant improvement on just predicting y with ȳ. However, it
is apparent that we would get an even better prediction if we
transformed x and used x², say.
It is therefore important to realize that if you only consider straight
line models, you are seriously limiting your chances of finding the
“best” model. The best linear model will not necessarily be the
best-fitting functional form for the data.
Violation Of Assumptions Concerning The Error Term, ε
All the assumptions made in Step 3 of constructing a regression
model are vitally important. If any of the assumptions do not hold,
the estimates of variability and all hypothesis tests based on them
will no longer be valid.
Chapter 11: Analysis Of Variance and Design Of Experiments
In this section we will explore research scenarios where
hypotheses are tested for more than two populations. For example,
we might wish to examine the average sales of salespeople trained
using five different training programs to see whether they are the
same. Our hypotheses become:
H0 : µ1 = µ2 = µ3 = µ4 = µ5
H1: not all µ’s are equal
We test such a hypothesis by first collecting five samples, one
from each of the training programs (populations). We will see that
to compare these five means one pair at a time is not the correct
approach, as this would result in ten different pairwise tests, and
what was intended to be a testing procedure with, say, a 5%
significance level results in a much higher overall chance of making
at least one Type I error.
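A rough calculation shows how quickly the overall error rate grows. The Python sketch below counts the pairwise comparisons among five means and, under the simplifying assumption that the ten tests are independent (which is not exactly true here, since they share samples), computes the chance of at least one false rejection.

```python
import math

k = 5          # number of training programs
alpha = 0.05   # significance level of each individual test

# Number of distinct pairs of means that would have to be compared.
n_pairs = math.comb(k, 2)

# Chance of at least one Type I error across all tests,
# ASSUMING the tests are independent (an approximation).
overall_error = 1 - (1 - alpha) ** n_pairs

print(n_pairs, round(overall_error, 3))
```

Even as an approximation, this makes clear why a single overall test is preferred to a series of pairwise tests.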
The correct procedure for this situation is to examine the variation
of the sales values, both (1) within each of the samples (examining
the variability of each sample alone) and (2) among the five
samples (for example, are the values in sample 1 larger, or smaller,
on average, than the values in the other samples?).
Another way to consider the reasoning behind this approach is by
relating it to a situation with just two populations. When testing for
a difference between µ1 and µ2, both s1 and s2 affect the width of
our confidence interval for (µ1 - µ2). Consequently, we infer
something about the means of several populations by utilizing the
variation of the resulting samples. Hence the term analysis of
variance.
Comparing Two Population Means: One Approach
A large sample confidence interval for (µ1 - µ2) is given by:
$$(\bar{x}_1 - \bar{x}_2) \pm Z\,\sigma_{(\bar{x}_1 - \bar{x}_2)} = (\bar{x}_1 - \bar{x}_2) \pm Z\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
If we do not have large sample sizes (n1<30 or n2<30) we need to
use the t-distribution.
A small sample confidence interval for (µ1 - µ2) is given by:
$$(\bar{x}_1 - \bar{x}_2) \pm t\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$
where
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$
and t is based on (n1 + n2 – 2) degrees of freedom.
Note: sp is called the pooled standard deviation since it combines
the standard deviations of both samples.
When dealing with small samples we must make the following
assumptions:
• Both sampled populations have relative frequency
distributions that are approximately normal.
• The population variances are equal.
• The samples are randomly and independently selected from
the populations.
Liverpool Drug Company claims its aspirin tablets will relieve
headaches faster than any other aspirin on the market. To
determine whether Liverpool’s claim is valid, a random sample of
size 15 is chosen from aspirins made by Liverpool and a further
random sample of size 15 is taken from aspirins made by the
Manchester Drug Company. An aspirin is given to each of the
randomly selected persons suffering from headaches and the
number of minutes required for each to recover from the headache
is recorded. The sample results are:
                  Sample mean   Sample variance
Liverpool (L)     8.4           2.2
Manchester (M)    8.9           2.6
Assume that the two populations are normally distributed with
equal, but unknown, variances.
Q. What is the pooled standard deviation?
A. $$s_p = \sqrt{\frac{(n_L - 1)s_L^2 + (n_M - 1)s_M^2}{n_L + n_M - 2}} = \sqrt{\frac{(14)(2.2) + (14)(2.6)}{15 + 15 - 2}} = \sqrt{2.4} \approx 1.55$$
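Q. Construct a 99% confidence interval for the true mean
difference in the time taken to relieve headaches (µM - µL).
A. $$(\bar{x}_M - \bar{x}_L) \pm t\sqrt{s_p^2\left(\frac{1}{n_M} + \frac{1}{n_L}\right)}$$
$$= (8.9 - 8.4) \pm 2.763\sqrt{2.4\left(\frac{1}{15} + \frac{1}{15}\right)}$$
$$= 0.5 \pm 2.763\sqrt{0.32}$$
$$= 0.5 \pm 1.563$$
$$= (-1.063,\ 2.063)$$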
Therefore we cannot conclude that a difference exists between the
two aspirins in terms of the time taken to relieve headaches.
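The aspirin calculation can be reproduced as follows. This Python sketch treats 2.2 and 2.6 as sample variances, which is an assumption on our part based on how they enter the worked arithmetic, and takes the critical value t = 2.763 (28 degrees of freedom) as given rather than computing it. The function names are hypothetical.

```python
import math

def pooled_variance(n1, var1, n2, var2):
    """Pooled estimate of the common variance from two samples."""
    return ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)

def pooled_ci(mean1, mean2, n1, n2, sp2, t_crit):
    """Small sample confidence interval for mu1 - mu2."""
    half_width = t_crit * math.sqrt(sp2 * (1 / n1 + 1 / n2))
    diff = mean1 - mean2
    return diff - half_width, diff + half_width

# Manchester minus Liverpool, as in the worked example.
sp2 = pooled_variance(15, 2.6, 15, 2.2)
lower, upper = pooled_ci(8.9, 8.4, 15, 15, sp2, t_crit=2.763)

print(f"sp^2 = {sp2:.2f}, interval = ({lower:.3f}, {upper:.3f})")
```

Because the interval contains zero, the data do not show a difference between the two brands at this confidence level.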
A company offers an optional seminar for its managers on how to
interact with employees. A sample is taken of the job performance
ratings received by 30 managers who attended the seminar and 30
managers who did not attend the seminar. The mean and standard
deviation of ratings for the sample that did not attend was 6.1 and
5.9, respectively and the mean and standard deviation for the
sample that did attend was 9.7 and 4.5, respectively.
Q. What is the variance of the difference in the two sample means?
A. In this case our sample sizes are large, so we don’t need to pool
the standard deviations.

        Attended      Not Attended
        n = 30        n = 30
        x̄A = 9.7      x̄NA = 6.1
        sA = 4.5      sNA = 5.9

$$s^2_{(\bar{x}_1 - \bar{x}_2)} = \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} = \frac{(4.5)^2}{30} + \frac{(5.9)^2}{30} \approx 1.84$$
Q. Find a 96% confidence interval for the difference in mean
ratings for those managers who did and did not attend the seminar.
A. $$\text{Interval} = (\bar{x}_1 - \bar{x}_2) \pm Z\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
$$= (9.7 - 6.1) \pm 2.05\sqrt{1.84}$$
$$= 3.6 \pm 2.78$$
$$= (.82,\ 6.38)$$
Therefore there is a significant difference in the means of those
that did and did not attend the seminar. The mean for those that
attended is higher.
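The seminar example follows the same pattern without pooling. In this Python sketch the sample standard deviations stand in for the unknown population values, and the critical value Z = 2.05 is taken from the worked answer; the function name is our own.

```python
import math

def large_sample_ci(mean1, sd1, n1, mean2, sd2, n2, z_crit):
    """Large sample confidence interval for mu1 - mu2 (no pooling)."""
    var_diff = sd1 ** 2 / n1 + sd2 ** 2 / n2   # variance of the difference in means
    half_width = z_crit * math.sqrt(var_diff)
    diff = mean1 - mean2
    return var_diff, (diff - half_width, diff + half_width)

# Attended minus not attended.
var_diff, (lower, upper) = large_sample_ci(9.7, 4.5, 30, 6.1, 5.9, 30, z_crit=2.05)

print(f"variance = {var_diff:.2f}, interval = ({lower:.2f}, {upper:.2f})")
```

Since the whole interval lies above zero, the conclusion that attendees have the higher mean rating follows directly.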
The Analysis Of Variance Approach
We need to introduce two terms: factor and level. The previous
example examined the effect of one factor (sex), consisting of two
levels (male and female).
The purpose of Analysis of Variance is to determine whether the
factor has a significant effect on the variable being measured
(salary, in our example). If for example the factor of sex is
significant, the mean salaries for the different sexes will not be
equal. Consequently, testing for equal means among the different
sexes is the same as attempting to answer the question, is there a
significant effect on salary due to this factor.
We will begin this part of the course by examining the effect of a
single factor on the variable being measured, one-factor ANOVA.
Extensions of this technique include ANOVA procedures that
determine the effect of two or more factors operating
simultaneously.
Assumptions behind ANOVA
The following assumptions are basically the same requirements
that were necessary when testing two means using small,
independent samples and the pooled variance approach. These
assumptions are:
1. The replicates (observations) are obtained independently and
randomly from each of the populations. The value of one
observation has no effect on any other replicates within the
same sample or within the other samples.
2. The replicates from each population follow (approximately) a
normal distribution.
3. The normal populations all have a common variance, σ2. We
expect the values in each sample to vary about the same
amounts. The ANOVA procedure will be much less sensitive to
violations of this requirement when we obtain samples of equal
size from each population.
We mentioned earlier that our error when comparing means of
different populations could be split into two groups: within-sample
variation and between-sample variation. When using the ANOVA
approach, we measure these two sources of variation by calculating
sums of squares for each of them, and in a similar way to our
previous look at the ANOVA table, we also calculate a sum of
squares for the total variation.
Deriving The Sum Of Squares
When examining k populations, for example, the data will be
configured something like this:
          Level 1           Level 2           …   Level k
          X11               X12               …   X1k
          X21               X22               …   X2k
          ⋮                 ⋮                     ⋮
          (n1 replicates)   (n2 replicates)   …   (nk replicates)
Totals    T1                T2                …   Tk
Here Xij denotes the ith observation in sample j.
In our example comparing male and female salaries, we had k=2
and n1 = n2 = 21 replicates. Notice also that Ti is the total of the
observations in sample i and we will also define T as the grand
total of all the observations so T = T1 + T2 + … + Tk.
SS(factor): also known as SS(between)
SS(factor) is the sum of squares that determines whether the values
in one sample are larger or smaller on the average than the values
in another sample. It can be calculated as:
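$$\mathrm{SS(factor)} = n_1(\bar{X}_1 - \bar{X})^2 + n_2(\bar{X}_2 - \bar{X})^2 + \dots + n_k(\bar{X}_k - \bar{X})^2 = \sum_{j=1}^{k} n_j(\bar{X}_j - \bar{X})^2$$
A short cut calculation method is:
$$\mathrm{SS(factor)} = \frac{T_1^2}{n_1} + \frac{T_2^2}{n_2} + \dots + \frac{T_k^2}{n_k} - \frac{T^2}{n}$$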
where, again, k is the total number of populations we are
examining.
Sum Of Squares Total: SS(total)
SS(total) is a measure of the variation in all of the n = n1 + n2 + …
+ nk data values. You obtain this value as if you were finding the
variance of these n values, except that you do not divide by n-1. It
can be calculated as:
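$$\mathrm{SS(total)} = \sum_{j=1}^{k}\left[(X_{1j} - \bar{X})^2 + (X_{2j} - \bar{X})^2 + \dots + (X_{n_j j} - \bar{X})^2\right] = \sum_{j=1}^{k}\sum_{i=1}^{n_j}(X_{ij} - \bar{X})^2$$
A short cut calculation method is:
$$\mathrm{SS(total)} = \sum X^2 - \frac{T^2}{n}$$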
SS(error): also known as SS(within)
SS(error) is the measure of the variation within each of the
samples. It can be calculated as:
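$$\mathrm{SS(error)} = \sum_{j=1}^{k}\left[(X_{1j} - \bar{X}_j)^2 + (X_{2j} - \bar{X}_j)^2 + \dots + (X_{n_j j} - \bar{X}_j)^2\right] = \sum_{j=1}^{k}\sum_{i=1}^{n_j}(X_{ij} - \bar{X}_j)^2$$
A short cut calculation method is:
$$\mathrm{SS(error)} = \sum X^2 - \left(\frac{T_1^2}{n_1} + \frac{T_2^2}{n_2} + \dots + \frac{T_k^2}{n_k}\right) = \mathrm{SS(total)} - \mathrm{SS(factor)}$$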
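As a check on the algebra, the defining formulas and the short cut formulas can be compared numerically. The following is a minimal Python sketch; the function names and the small data set are invented for illustration and are not part of the course material.

```python
def sums_of_squares_definition(samples):
    """Return (SS(factor), SS(error), SS(total)) from the defining formulas."""
    all_values = [x for sample in samples for x in sample]
    grand_mean = sum(all_values) / len(all_values)
    ss_factor = 0.0
    ss_error = 0.0
    for sample in samples:
        mean_j = sum(sample) / len(sample)
        # between-sample part: n_j * (sample mean - grand mean)^2
        ss_factor += len(sample) * (mean_j - grand_mean) ** 2
        # within-sample part: squared deviations from the sample's own mean
        ss_error += sum((x - mean_j) ** 2 for x in sample)
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)
    return ss_factor, ss_error, ss_total


def sums_of_squares_shortcut(samples):
    """Return (SS(factor), SS(error), SS(total)) from the short cut formulas."""
    n = sum(len(s) for s in samples)
    grand_total = sum(sum(s) for s in samples)            # T
    sum_sq = sum(x * x for s in samples for x in s)       # sum of X^2
    between = sum(sum(s) ** 2 / len(s) for s in samples)  # sum of Tj^2 / nj
    correction = grand_total ** 2 / n                     # T^2 / n
    return between - correction, sum_sq - between, sum_sq - correction


# Hypothetical data: three samples of unequal size.
samples = [[4, 6, 8], [10, 12], [1, 3, 5, 7]]
by_definition = sums_of_squares_definition(samples)
by_shortcut = sums_of_squares_shortcut(samples)
print(by_definition)
print(by_shortcut)
```

Both functions should agree up to floating-point rounding, and in each case SS(factor) + SS(error) equals SS(total).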
The ANOVA Table
The format of the ANOVA table will be the same, regardless of the
number of populations (levels), k. When we move on to examining
several factors in our analysis our degrees of freedom will change
however, in a similar way to when we moved from simple linear to
multiple regression.
The ANOVA table will look as follows:
Source    df     SS           MS           F     p-value
Factor    k-1    SS(factor)   MS(factor)
Error     n-k    SS(error)    MS(error)
Total     n-1    SS(total)
Values for MS, F-ratio and p-value can be calculated from the
other values in the table as before: each MS is its SS divided by
its df, and the F-ratio is MS(factor) divided by MS(error).
Example (Transportation Costs)
Family transportation costs are usually higher than most people
believe, because they include car payments, insurance, fuel costs,
repairs, parking and public transportation. Twenty randomly
sampled families, five in each of four major cities, are asked to
use their records to estimate a monthly figure for transportation
cost. Use the data obtained and ANOVA to test, at a 5% level of
significance, whether there is a significant difference in mean
monthly transportation costs for families in these cities. The
hypotheses are H0: µ1 = µ2 = µ3 = µ4 against Ha: not all of the
means are equal.
         Atlanta   New York   Los Angeles   Chicago
           650       250          850         540
           480       525          700         450
           550       300          950         675
           600       175          780         550
           675       500          600         600
Total     2955      1750         3880        2815
So T1 = 2955, T2 = 1750, T3 = 3880, T4 = 2815 and T = 11400.
Also n1 = n2 = n3 = n4 = 5 and n = 20. Finally we can calculate ΣX²
to be 7,175,400.
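$$\begin{aligned}
\mathrm{SS(factor)} &= \frac{T_1^2}{n_1} + \frac{T_2^2}{n_2} + \dots + \frac{T_k^2}{n_k} - \frac{T^2}{n} \\
&= \frac{(2955)^2}{5} + \frac{(1750)^2}{5} + \frac{(3880)^2}{5} + \frac{(2815)^2}{5} - \frac{(11400)^2}{20} \\
&= 6{,}954{,}630 - 6{,}498{,}000 = 456{,}630
\end{aligned}$$
$$\begin{aligned}
\mathrm{SS(total)} &= \sum X^2 - \frac{T^2}{n} = 7{,}175{,}400 - \frac{(11400)^2}{20} \\
&= 7{,}175{,}400 - 6{,}498{,}000 = 677{,}400
\end{aligned}$$
$$\mathrm{SS(error)} = \mathrm{SS(total)} - \mathrm{SS(factor)} = 677{,}400 - 456{,}630 = 220{,}770$$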
df(factor) = k - 1 = 4 - 1 = 3
df(error) = n - k = 20 – 4 = 16
df(total) = n - 1 = 20 – 1 = 19
The ANOVA table for this analysis follows:
Source    df    SS         MS            F
Factor     3    456,630    152,210       11.0312
Error     16    220,770    13,798.125
Total     19    677,400
By looking in our F-table with α = .05 we see that F3,16(.05) =
3.24. Since our F-ratio of 11.03 is greater than this critical value
we can reject H0 and conclude that at least one of the cities has a
different mean at a 5% level of significance.
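The arithmetic in this example can be reproduced in a few lines. The sketch below is our own illustration in Python (the variable names are not from the course material); it uses the short cut formulas and takes the critical value 3.24 from the F-table rather than computing it.

```python
# Monthly transportation costs from the example, by city.
cities = {
    "Atlanta": [650, 480, 550, 600, 675],
    "New York": [250, 525, 300, 175, 500],
    "Los Angeles": [850, 700, 950, 780, 600],
    "Chicago": [540, 450, 675, 550, 600],
}

k = len(cities)                                    # number of levels
n = sum(len(v) for v in cities.values())           # total observations
grand_total = sum(sum(v) for v in cities.values())                # T
sum_sq = sum(x * x for v in cities.values() for x in v)           # sum of X^2
between = sum(sum(v) ** 2 / len(v) for v in cities.values())      # sum of Tj^2/nj

ss_factor = between - grand_total ** 2 / n
ss_total = sum_sq - grand_total ** 2 / n
ss_error = ss_total - ss_factor

ms_factor = ss_factor / (k - 1)
ms_error = ss_error / (n - k)
f_ratio = ms_factor / ms_error

F_CRITICAL = 3.24  # F(3, 16) at alpha = .05, read from the F-table
reject_h0 = f_ratio > F_CRITICAL

print(f"SS(factor) = {ss_factor:,.0f}, SS(error) = {ss_error:,.0f}, "
      f"SS(total) = {ss_total:,.0f}")
print(f"MS(factor) = {ms_factor:,.3f}, MS(error) = {ms_error:,.3f}, "
      f"F = {f_ratio:.4f}, reject H0: {reject_h0}")
```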
Interpretation of Mean Squares
Note that one of our assumptions in making this comparison of
means was that the variances of each of the populations were the
same. The ANOVA procedure is based on a comparison of two
separate estimates of this common variance, σ².
The first estimate is derived using the variation among the sample
means, whereas the other is determined using the variation within
each of the samples. If H0 is true, both are estimates of σ² and so
they should be approximately equal.
MS(factor) = estimate of σ² based on the variation among the
sample means
MS(error) = estimate of σ² based on the variation within each of
the samples
The closer these estimates are to each other, the closer our F-ratio
will be to 1. As MS(factor) grows relative to MS(error), our F-ratio
increases and becomes increasingly significant.
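This behaviour of the F-ratio can be illustrated by simulation. The sketch below is our own illustration, not part of the course material: it repeatedly draws k = 4 samples of 5 normal observations with a common standard deviation and averages the resulting F-ratios, once with equal population means and once with means that differ. The population mean of 500, the standard deviation of 100, the shift between groups and the seed are all arbitrary assumptions.

```python
import random


def f_ratio(samples):
    """One-factor ANOVA F-ratio computed with the short cut formulas."""
    n = sum(len(s) for s in samples)
    k = len(samples)
    grand_total = sum(sum(s) for s in samples)
    sum_sq = sum(x * x for s in samples for x in s)
    between = sum(sum(s) ** 2 / len(s) for s in samples)
    ms_factor = (between - grand_total ** 2 / n) / (k - 1)
    ms_error = (sum_sq - between) / (n - k)
    return ms_factor / ms_error


def simulate_mean_f(shift, reps=2000, k=4, n_r=5, sigma=100.0, seed=1):
    """Average F-ratio over many simulated experiments.

    Group j has population mean 500 + shift * j, so shift = 0 means
    that H0 (all means equal) is true.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        samples = [
            [rng.gauss(500.0 + shift * j, sigma) for _ in range(n_r)]
            for j in range(k)
        ]
        total += f_ratio(samples)
    return total / reps


mean_f_null = simulate_mean_f(shift=0.0)    # H0 true
mean_f_alt = simulate_mean_f(shift=100.0)   # means differ
print(f"average F when H0 is true:  {mean_f_null:.2f}")
print(f"average F when means differ: {mean_f_alt:.2f}")
```

When H0 is true the average F-ratio should come out near 1, and when the means differ it should be noticeably larger.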
If the one-factor ANOVA leads to a rejection of H0, and therefore
a conclusion that at least one of the means differs, a natural
question would be to ask which of the means differ. In other
words, rejecting the ANOVA null hypothesis informs us that the
means are not all the same but provides no clue as to which of the
population means are different.
As mentioned earlier, performing a series of t-tests to compare all
possible pairs of means is not a good idea, since the chance of
making a Type I error (concluding a difference exists between two
population means when in fact they are the same) somewhere in
such a procedure is much larger than the predetermined α used for
each of the individual tests.
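To get a feel for how quickly this error rate grows, the following Python sketch computes the chance of at least one Type I error across all pairwise tests. It assumes, purely as a simplification, that the pairwise tests are independent (they are not exactly, since they share data), so the result is only a rough guide; the function name is our own.

```python
from math import comb


def familywise_error_rate(k, alpha=0.05):
    """Approximate chance of one or more Type I errors when all pairs of
    k means are tested at level alpha, treating the tests as independent."""
    pairs = comb(k, 2)               # number of pairwise comparisons
    return 1 - (1 - alpha) ** pairs


rate = familywise_error_rate(4)      # 4 means give 6 pairwise tests
print(f"{rate:.4f}")
```

With four means and α = .05 there are six pairwise tests, and under this simplification the chance of at least one false rejection is about 0.26 rather than 0.05.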
What is needed is a technique that compares all possible pairs of
means in such a way that the probability of making one or more
Type I errors is α. This is a multiple comparisons procedure. There
are several methods available for making these comparisons, but
the most well-known is Tukey’s test, which is presented here.
Tukey’s honestly significant difference (HSD) test is somewhat
limited by the fact that it requires equal sample sizes. It takes into
account the number of populations, the value of the mean square
error and the sample size. Using these values and a table value Q
(Table A.10 in textbook – page A-29 of 2nd edition), the HSD
determines the critical difference necessary between the means of
any two treatment levels for the means to be significantly different.
1. Find Qα,k,ν using Table A.10 where α is the significance level
required, k is the number of sample means (groups) and ν is the
degrees of freedom associated with MS(error).
2. Determine
$$\mathrm{HSD} = Q_{\alpha,k,\nu}\sqrt{\frac{\mathrm{MS(error)}}{n_r}}$$
where nr is the number of replicates in each sample – remember
that sample sizes should be equal.
3. Place the sample means in order, from smallest to largest.
4. If two sample means differ by more than HSD, the conclusion
is that the corresponding population means are unequal. In
other words, if $|\bar{X}_i - \bar{X}_j| > \mathrm{HSD}$, this implies that µi ≠ µj.
Example (Transportation Costs, continued)
Recall how in the earlier example we compared the average
transportation costs in four different cities and concluded that at
least one of the means was different. We will now use the
procedure outlined above to determine which of the means differ.
1. We had an α=5% level of significance, k=4 different
populations and ν=16 degrees of freedom for error. Therefore
Qα,k,ν = Q.05,4,16 = 4.05
2. We had MS(error) = 13,798.125 and nr = 5. Therefore the HSD
is equal to:
$$\mathrm{HSD} = Q_{\alpha,k,\nu}\sqrt{\frac{\mathrm{MS(error)}}{n_r}} = 4.05\sqrt{\frac{13798.125}{5}} = 212.755$$
3. Our sample means were:
New York: $\bar{X}_1 = 350$
Chicago: $\bar{X}_2 = 563$
Atlanta: $\bar{X}_3 = 591$
Los Angeles: $\bar{X}_4 = 776$
4. $|\bar{X}_1 - \bar{X}_2| = 213 > 212.755$, so µ1 and µ2 differ
$|\bar{X}_1 - \bar{X}_3| = 241 > 212.755$, so µ1 and µ3 differ
$|\bar{X}_1 - \bar{X}_4| = 426 > 212.755$, so µ1 and µ4 differ
$|\bar{X}_2 - \bar{X}_3| = 28 < 212.755$
$|\bar{X}_2 - \bar{X}_4| = 213 > 212.755$, so µ2 and µ4 differ
$|\bar{X}_3 - \bar{X}_4| = 185 < 212.755$
The mean for New York (µ1) differs from the mean for each of the
other three cities, at a 5% level of significance. There is also a
significant difference between µ2 and µ4 (Chicago and LA).
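The four steps of the procedure can be sketched in Python as follows. This is only an illustration: the function names are our own, and Q = 4.05 is taken from the table value quoted above rather than computed.

```python
from itertools import combinations
from math import sqrt


def tukey_hsd(q, ms_error, n_r):
    """Step 2: critical difference for equal sample sizes."""
    return q * sqrt(ms_error / n_r)


def significant_pairs(means, hsd):
    """Steps 3 and 4: order the means, then keep the pairs whose
    difference exceeds HSD."""
    ordered = sorted(means.items(), key=lambda item: item[1])
    return [
        (name_a, name_b, mean_b - mean_a)
        for (name_a, mean_a), (name_b, mean_b) in combinations(ordered, 2)
        if mean_b - mean_a > hsd
    ]


# Sample means from the transportation costs example.
means = {"New York": 350, "Chicago": 563, "Atlanta": 591, "Los Angeles": 776}

hsd = tukey_hsd(q=4.05, ms_error=13798.125, n_r=5)   # Q from Table A.10
pairs = significant_pairs(means, hsd)

print(f"HSD = {hsd:.3f}")
for name_a, name_b, diff in pairs:
    print(f"{name_a} vs {name_b}: difference {diff} exceeds HSD")
```

Note how close the New York/Chicago and Chicago/LA differences (213) are to the critical difference; a slightly different Q value would change those two conclusions.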
If we asked SAS to perform the above test, its grouping of the
means would look like this:
New York     Chicago     Atlanta     Los Angeles
   C            B          A B            A
where means with the same letter are not significantly different.
As mentioned, the test above can only be performed when we have
equal sample sizes. The Tukey-Kramer procedure is a modification
of the regular Tukey’s test that will work when we have unequal
sample sizes. The necessary formula is:
$$\mathrm{HSD} = Q_{\alpha,k,n-k}\sqrt{\frac{\mathrm{MS(error)}}{2}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$$
where ni is the sample size for the ith sample and nj is the sample
size for the jth sample.
Note that a different HSD value must be computed for each
different pair, since our sample sizes will not be the same.
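A short Python sketch of the Tukey-Kramer calculation is given below. The function name is our own, and the Q and MS(error) values are simply reused from the transportation example for illustration; with genuinely unequal samples both would have to be recomputed for the data at hand.

```python
from math import sqrt


def tukey_kramer_hsd(q, ms_error, n_i, n_j):
    """Critical difference for comparing samples i and j of sizes n_i, n_j."""
    return q * sqrt((ms_error / 2) * (1 / n_i + 1 / n_j))


# With equal sample sizes the formula reduces to the regular Tukey HSD.
hsd_equal = tukey_kramer_hsd(q=4.05, ms_error=13798.125, n_i=5, n_j=5)

# Hypothetical unequal sizes: a larger second sample gives a smaller HSD.
hsd_unequal = tukey_kramer_hsd(q=4.05, ms_error=13798.125, n_i=5, n_j=8)

print(f"equal sizes:   HSD = {hsd_equal:.3f}")
print(f"unequal sizes: HSD = {hsd_unequal:.3f}")
```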
Designing An Experiment
So far in this part of the course we have introduced you to one-
factor (or one-way) ANOVA. In this type of analysis you
randomly obtain samples from each of the k populations (levels) of
a single factor – in our last example k=4 levels (cities) of a single
factor (location). Since replicates (repeat observations) are
obtained in a random manner from each population, this type of
sampling plan is called a completely randomized design.
Before we go further into Experimental Design we need to define a
few new terms:
We have already mentioned that a factor is a set of related levels
used as an explanatory variable. Factors are usually qualitative
(sex, marital status, etc.) but can be quantitative when a limited
number of levels of a quantitative variable are chosen for study. A
factor can be either a treatment variable or a classification variable.
A treatment variable is one the experimenter controls or modifies
in the experiment: for example, in a medical study, a treatment
variable may be medicine – a treatment that would consist of 2
levels, drug or placebo.
A classification variable is some characteristic of the
experimental subjects that was present prior to the experiment and
is not a result of the experimenter’s manipulation or control: for
example, in the transportation costs situation we looked at
previously, the classification variable was city – a classification
that consisted of 4 levels (Atlanta, New York, Chicago and LA).
A treatment or treatment combination is a particular
combination of the levels of one or more factors. Treatment
combinations will come into play when we start studying more
than one factor at a time.
The experimental units are materials or items on which a
measurement is made and to which treatments are applied.
Nuisance variables are other variables which influence the
response variable but which are not of interest. Systematic bias
occurs when treatments are not alike with respect to nuisance
variables. In this case, the nuisance variable becomes a
confounding variable.
Confounding variables are variables that are not being controlled
by the researcher in the experiment but can have an effect on the
outcome of the treatment being studied. One way to control for
these variables is to include them in the experimental design. The
randomized blocking design, which is a type of experimental
design we will also be examining, has the capability of adding one
of these variables into the analysis as a blocking variable.
A blocking variable is a variable that the researcher wants to
control but is not the treatment variable of interest.
Structures Of An Experimental Design
A treatment structure is the set of treatments, treatment
combinations or populations under study – the selection and
arrangement of treatment factors.
A design structure is the way in which the experimental units are
grouped together into homogeneous units (blocks).
These structures are combined with a method of randomization to
create an experimental design.
Types Of Treatment Structures
2. n-way Factorial, where two or more factors are combined so
that every possible combination occurs.
3. n-way Fractional Factorial, where a specified fraction of the
total number of possible treatment combinations occur (eg.
4. Nested (Hierarchical) Treatment Structures
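As an illustration of a factorial treatment structure, the Python sketch below lists every treatment combination for two factors. The factor names and levels are hypothetical, chosen only to echo the medical example above.

```python
from itertools import product

# Hypothetical factors and their levels.
factors = {
    "medicine": ["drug", "placebo"],
    "dose": ["low", "medium", "high"],
}

# A full factorial structure contains every combination of levels.
treatment_combinations = list(product(*factors.values()))

for combination in treatment_combinations:
    print(dict(zip(factors, combination)))
```

Two levels of one factor and three of the other give 2 × 3 = 6 treatment combinations; a fractional factorial would use only a chosen subset of these.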
Types Of Design Structures
1. Completely Randomized Designs.
All experimental units are considered as a single homogeneous
group (no blocks). Treatments are assigned completely at random
(with equal probability) to all units.
2. Randomized Complete Block Designs.
Experimental units are grouped into homogeneous blocks within
which each treatment occurs c times (usually c=1).
3. Incomplete Block Designs.
Fewer than the total number of treatments occur in each block.
4. Latin Square Designs.
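The randomization behind the first two design structures can be sketched in Python as follows. This is a minimal illustration under our own assumptions: the function names, unit labels and seeds are invented, the completely randomized version assumes equal replication of each treatment, and the block version assumes each treatment occurs once per block (c = 1).

```python
import random


def completely_randomized(units, treatments, seed=None):
    """Assign treatments to all units at random, with equal replication."""
    if len(units) % len(treatments) != 0:
        raise ValueError("units must divide evenly among treatments")
    rng = random.Random(seed)
    reps = len(units) // len(treatments)
    labels = [t for t in treatments for _ in range(reps)]
    rng.shuffle(labels)                      # random order over ALL units
    return dict(zip(units, labels))


def randomized_complete_block(blocks, treatments, seed=None):
    """Within each block, assign every treatment once in random order."""
    rng = random.Random(seed)
    plan = {}
    for block, units in blocks.items():
        if len(units) != len(treatments):
            raise ValueError("each block needs one unit per treatment")
        order = list(treatments)
        rng.shuffle(order)                   # randomize separately per block
        plan[block] = dict(zip(units, order))
    return plan


# Hypothetical experimental units.
units = [f"unit{i}" for i in range(1, 9)]
crd_plan = completely_randomized(units, ["drug", "placebo"], seed=0)

blocks = {
    "block1": ["u1", "u2", "u3"],
    "block2": ["u4", "u5", "u6"],
}
rcb_plan = randomized_complete_block(blocks, ["A", "B", "C"], seed=0)

print(crd_plan)
print(rcb_plan)
```

The difference between the two designs shows up in where the shuffle happens: once over all units in the completely randomized design, and once inside each block in the randomized complete block design.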
Considerations when Designing an Experiment
Experimental design should give unambiguous answers to
questions of interest.
Experimental design should be “optimal”. That is, it should
have more power (sensitivity) and estimate quantities of interest
more precisely than other designs.
Objectives of the experiment should be clearly defined.
- What questions are we trying to answer?
- What questions are more important?
- What populations are we interested in generalizing to?
Appropriate response and explanatory variables must be
determined and nuisance variables should be identified.
- What levels of the treatment factors will be examined?
- Should there be a control group?
- Which nuisance variables will be measured?
Statistical analysis of the experiment should be planned in detail
to meet the objectives of the experiment.
- What model will be used?
- How will nuisance variables be accounted for?
- What hypotheses will be tested?
- What “effects” will be estimated?
Experimental design should be economical