Demonstration
Prediction Error
How much can we trust a prediction made for an individual using regression
analysis? Compare the regression model with the prediction equation, and
you'll see that there are two potential sources of error in a prediction:
(1) The model coefficients might have been misestimated (due to sampling
error), or (2) the true residual of the individual for whom the prediction is
being made might be different from zero.
$$Y = \alpha + \beta_1 X_1 + \cdots + \beta_k X_k + \varepsilon$$
$$Y_{pred} = a + b_1 X_1 + \cdots + b_k X_k + 0$$
The standard error of the regression measures your exposure to errors
of the second type. But the impact of errors of the first type depends on the
values of the independent variables for which the prediction is being made.
To see this, we'll run an experiment.
We'll take a population for which we know the true model coefficients.
(Actually, we'll create such a population via simulation.) Then we'll draw a
sample from that population, and compare the estimated regression equation
to the true one.
The relationship we'll study will involve only one independent variable. That
way, we'll be able to "see" what's happening on a chart. The true regression
equation is Y = 100 + 4X. The true mean value of X is 20, and the
standard deviation of X, as well as the standard deviation of the residual
term, are given below:
2 standard deviation of X
6 standard deviation of residual
Now, imagine ten investigators, each collecting a sample of size 25 from this
population and estimating the true relationship. The results of their ten
studies are listed in the table below, and the corresponding estimated
regression lines are plotted in the chart, overlaid by the true (red) regression
line.
True coefficients:
alpha     beta
100.00    4.00

Estimates from the ten studies:
a         b
 82.71    4.93
105.51    3.68
 97.75    4.10
 99.95    4.04
113.14    3.35
119.39    2.98
 86.26    4.66
 95.88    4.20
125.61    2.69
117.73    3.08

[Chart: "The True Regression Line, and Ten Estimates". Vertical axis: Y, from 140 to 210. Horizontal axis: X, from 15 to 25.]
Some of the estimated lines are too steep. Some are too level. But the
important thing to note is that the estimated lines are usually closest to the
true line near X = 20, and move further away as the values of X move further
from 20. That is, a + bX is typically a less-reliable estimate of the true
height of the regression line for larger values of |X - 20|. Press the
"Resample" button a few times to redraw the ten separate samples and see
how the results of additional studies compare to the true situation.
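For readers without the spreadsheet at hand, the experiment can be approximated in a few lines of Python. This is only a sketch: it assumes that both X and the residual are normally distributed (the text above gives only their means and standard deviations), and the function names are ours. Rather than plotting ten lines, it runs many simulated studies and reports how much the fitted height a + bX varies from study to study at three values of X.

```python
import random
import statistics

TRUE_ALPHA, TRUE_BETA = 100.0, 4.0          # true line from the text: Y = 100 + 4X
MEAN_X, SD_X, SD_RESID = 20.0, 2.0, 6.0     # population settings from the text


def draw_sample(rng, n=25):
    """Draw n (X, Y) pairs; normal distributions are an assumption."""
    xs = [rng.gauss(MEAN_X, SD_X) for _ in range(n)]
    ys = [TRUE_ALPHA + TRUE_BETA * x + rng.gauss(0.0, SD_RESID) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Least-squares fit with one independent variable; returns (a, b)."""
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def spread_of_fitted_height(x0, studies=2000, seed=1):
    """Standard deviation, across simulated studies, of a + b*x0."""
    rng = random.Random(seed)
    heights = []
    for _ in range(studies):
        a, b = fit_line(*draw_sample(rng))
        heights.append(a + b * x0)
    return statistics.stdev(heights)


for x0 in (15, 20, 25):
    print(x0, round(spread_of_fitted_height(x0), 2))
```

The printed spread should be smallest at X = 20 and noticeably larger at X = 15 and X = 25, which is the fanning-out pattern visible in the chart.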
The prediction equation can be viewed in two somewhat-different ways. For
any specific values of the independent variables, it provides an estimate of
the mean value of the dependent variable across the subpopulation of
individuals for whom the independent variables take those values. And, of
course, it also provides a prediction for any one such individual.
The standard error of the estimated mean measures uncertainty due to
sampling error in the "mean value" estimate for the subpopulation. It is this
uncertainty we see in the chart - uncertainty about the true value of
$$\alpha + \beta_1 X_1 + \cdots + \beta_k X_k.$$
In a simple linear regression, the standard error of the estimated mean takes
the value
$$s_e \times \sqrt{\frac{1}{n} + \frac{(X - \bar{X})^2}{(n - 1)\, s_X^2}}$$
(the first factor is the standard error of the regression). The formula itself
isn't all that important, since it only applies to a simple linear regression (when
there are two or more independent variables, the corresponding formula is
quite ugly), and since any decent regression-analysis software will compute
it for us.
But the formula does serve to illustrate two important points, both of which
remain true even when there is more than one independent variable:
1. For any given sample size, the standard error of the estimated mean
grows as the independent variables take values further from the most-
typically-observed (combination of) values.
When there is more than one independent variable, it's not just a matter of
the distance from the mean values of the independent variables. Hidden
extrapolation - when each independent variable takes a not-atypical value,
but the combination of values is atypical - can also increase the standard
error of the estimated mean.
2. For any given values of the independent variables, the standard
error of the estimated mean decreases as we increase our sample
size. This is as it should be, since it measures our exposure to
sampling error in making our estimate.
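Both points can be checked numerically for the simple-regression case. The sketch below codes the formula for the standard error of the estimated mean directly. The inputs are illustrative assumptions: we plug in the population values from the demonstration (standard error of the regression 6, mean of X 20, standard deviation of X 2) where a real analysis would use the sample statistics, and the function name is ours.

```python
import math


def se_estimated_mean(s_e, n, x0, x_bar, s_x):
    """Standard error of the estimated mean in a simple linear regression.

    s_e: standard error of the regression; n: sample size;
    x0: value of X of interest; x_bar, s_x: mean and standard deviation of X.
    """
    return s_e * math.sqrt(1.0 / n + (x0 - x_bar) ** 2 / ((n - 1) * s_x ** 2))


# Assumed inputs: population values stand in for sample statistics.
S_E, X_BAR, S_X = 6.0, 20.0, 2.0

# Point 1: hold n at 25 and move x0 away from the mean of X.
for x0 in (20, 22, 25):
    print("x0 =", x0, round(se_estimated_mean(S_E, 25, x0, X_BAR, S_X), 2))

# Point 2: hold x0 at 25 and increase the sample size.
for n in (25, 100, 400):
    print("n =", n, round(se_estimated_mean(S_E, n, 25, X_BAR, S_X), 2))
```

The first loop should print increasing values and the second decreasing ones. Note that this covers only the one-variable case; it says nothing about hidden extrapolation, which needs more than one independent variable.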
In order to construct confidence intervals for the mean value of the
dependent variable, given values for the independent variables, we simply
use the prediction equation to make the estimate, and the standard error of
the estimated mean to compute the margin of error in the estimate.
When we use the prediction equation to make a prediction for an individual,
we must combine the standard error of the estimated mean with the standard
error of the regression. (The method of combination, since they are
independent sources of potential error, is to convert each to a variance by
squaring, add the variances, and then take a square root to get back to a
standard deviation again.) The result is the standard error of the prediction.
The standard error of the prediction therefore consists of two components.
One (the standard error of the estimated mean) can be reduced by
increasing the sample size. The other (the standard error of the regression)
can be reduced only by including new (and relevant) independent variables
in our model.
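The combination rule, and the fact that only one of the two components shrinks with sample size, can be sketched as follows. As before, this is an illustration under assumptions: it uses the simple-regression formula for the standard error of the estimated mean, plugs in the demonstration's population values (6, 20, and 2) in place of sample statistics, and uses function names of our own choosing.

```python
import math


def se_estimated_mean(s_e, n, x0, x_bar, s_x):
    """Standard error of the estimated mean (simple linear regression)."""
    return s_e * math.sqrt(1.0 / n + (x0 - x_bar) ** 2 / ((n - 1) * s_x ** 2))


def se_prediction(se_mean, s_e):
    """Square each component, add the variances, take the square root."""
    return math.sqrt(se_mean ** 2 + s_e ** 2)


S_E, X_BAR, S_X = 6.0, 20.0, 2.0   # assumed inputs, taken from the demonstration

for n in (25, 100, 10000):
    se_mean = se_estimated_mean(S_E, n, 25.0, X_BAR, S_X)
    print(n, round(se_mean, 3), round(se_prediction(se_mean, S_E), 3))
```

As n grows, the second column heads toward zero while the third levels off just above 6, the standard error of the regression. No amount of additional data pushes the standard error of the prediction below that floor.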
We're done! But if you'd like to test your intuition, go back up to the chart,
and ask yourself whether an increase in the standard deviation of X, or in
the standard error of the regression, would increase or decrease the spread
of the estimated regression lines around the true line. Then change either of
the two numbers in the yellow cells, and hit the resample button a few times
to see the effect. An explanation of what you see is given below (to keep
from giving away the answer, I've placed it down about 40 lines).
Increasing the standard deviation of X will tighten up the estimated lines
around the true line. By spreading out the possible values of X, we make
values near the left and right sides of the chart less atypical. (This can also
be seen directly from the formula for the standard error of the estimated
mean: We're increasing the denominator of the second term inside the
square root.)
Increasing the standard error of the regression will widen the spread of the
estimated lines. With more "noise" in the model (and the same sample
size), our estimates of the true coefficients will become less reliable. (This
can also be seen from the formula, which has the standard error of the
regression as its leading factor outside the square root.)
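If you don't have the spreadsheet's yellow cells to play with, the same two changes can be tried in a small simulation. This sketch assumes normally distributed X and residuals (our assumption, not something the demonstration states) and measures the spread of the fitted height a + bX at X = 25 across many simulated studies of size 25. The function name and default arguments are ours.

```python
import random
import statistics


def line_spread(sd_x, sd_resid, x0=25.0, n=25, studies=2000, seed=7):
    """SD, across simulated studies, of the fitted height a + b*x0."""
    rng = random.Random(seed)
    heights = []
    for _ in range(studies):
        xs = [rng.gauss(20.0, sd_x) for _ in range(n)]
        ys = [100.0 + 4.0 * x + rng.gauss(0.0, sd_resid) for x in xs]
        x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
        b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
        # a + b*x0 rewritten using a = y_bar - b*x_bar
        heights.append(y_bar + b * (x0 - x_bar))
    return statistics.stdev(heights)


baseline = line_spread(sd_x=2.0, sd_resid=6.0)    # settings from the text
wider_x = line_spread(sd_x=4.0, sd_resid=6.0)     # larger standard deviation of X
noisier = line_spread(sd_x=2.0, sd_resid=12.0)    # larger residual standard deviation
print(round(baseline, 2), round(wider_x, 2), round(noisier, 2))
```

The second number should come out smaller than the first and the third larger, matching the two explanations above.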