# Assessing the Fit of Regression Models

```
Assessing the Fit of Regression Models
Engineering Experimental Design
Valerie L. Young
In today’s lecture . . .
•   Coefficient of Determination (R²)
•   Correlation Coefficient (R)
•   Residuals and Residual Plots
•   Confidence Limits on Adjustable Parameters
•   ANOVA
Regression: A set of statistical
tools that can. . .
• define a mathematical relationship between
factors and a response (a model).
– NOT proof of any physical relationship (though
ideally terms in the model have physical significance)
• quantify the significance of each factor’s
correlation with the response.
• estimate values for the constants in a model.
• indicate how well a particular model fits the data.
Models
• Every model consists of two parts
– Terms that describe the predictable way in
which the value of the response varies with
changes in the values of the factor(s)
– The random variation in the response due to
random variations in the measurement
technique or the system being measured
Assessing Model Fit
• Statistical techniques for assessing how well
a model fits your data are based on
– Quantifying the fraction of the total variation in
the response that is accounted for by the
“predictable” terms in the model
– Assessing whether the leftover, “non-
predictable” variation is random (i.e., residuals
normally distributed around zero)
• YOU must assess physical validity
Coefficient of Determination (R²)

• R² = fraction of total variation in response
that is explained by the “predictable” part of
the model
• 0 ≤ R² ≤ 1
• R² is not sufficient to validate a model. You
must demonstrate that the leftover variation
is randomly distributed around zero.
• Can calculate R² for any type of model
(linear, nonlinear)
– Refer to a statistics text for equations
Correlation Coefficient (R)
• Mathematically, R = ±sqrt(R²)
• Conceptually, R means nothing except for
simple linear models
– Sign on R is same as sign on slope
• R < 0: Negative correlation between y and x
• R > 0: Positive correlation between y and x
• The closer |R| is to 1, the closer the data are
to a straight line.
Residuals
• A residual is the error for a given data
point: the difference between a measured
value of the response and the value the
model predicts.
• If the “predictable” part of the model is
well-chosen, the residuals will include only
random error.
– Residuals will be randomly distributed around
zero.
• One way to tell is with a residual plot.
Test Three Models

• PAI = α xAl + β,
where α and β are constants, xAl is the mass fraction of
aluminum, and PAI is the “phosphate adsorption index”.
• PAI = α xAl + β xFe + γ,
where α, β and γ are constants, xAl is the mass fraction of
aluminum, xFe is the mass fraction of iron, and PAI is the
“phosphate adsorption index”.
• PAI = α xAl + β xAl² + γ,
where α, β and γ are constants, xAl is the mass fraction of
aluminum, and PAI is the “phosphate adsorption index”.
PAI = α xAl + β
α = 0.23 ± 0.07 g soil / mg Al
β = -11 ± 13
R² = 0.825
Uncertainty from 95% confidence limits.
Simple Linear Residual Plot
[Figure: residuals (axis from -20 to 20) vs. X (Al) (axis from 0 to 400)]
I edited the plot that is automatically generated by Excel to make the
labels more meaningful. Further editing is required if you want to
include one of these plots in a report.
PAI = α xAl + β xFe + γ
α = 0.11 ± 0.07 g soil / mg Al
β = 0.35 ± 0.16 g soil / mg Fe
γ = -7 ± 8
R² = 0.948
I’m 95% sure that if I collected an infinite number of data points,
the values of the coefficients would be inside these ranges.

Multiple Linear Residual Plots
[Figure, left panel: residuals (axis from -10 to 10) vs. X (Al) (axis from 0 to 400)]
[Figure, right panel: residuals (axis from -10 to 10) vs. X (Fe) (axis from 0 to 120)]
PAI = α xAl + β xAl² + γ
α = 0.2 ± 0.3 g soil / mg Al
β = (2 ± 80) × 10⁻⁵ g² soil / mg² Al
γ = -10 ± 30
R² = 0.825
To use this model in Excel, let xAl be one independent variable
and xAl² be another, then do multiple linear regression.

Although this model explains 82.5 % of the variation
in PAI, NONE of the adjustable parameters are
significantly different from zero. This is a common
result when you have included an unnecessary factor.
Remove the least-likely factor (xAl² in this case) and
redo the regression.
ANOVA for PAI = α xAl + β xFe + γ
ANOVA
             df   SS         MS         F          Significance F
Regression    2   3529.903   1764.952   92.02558   3.63428E-07
Residual     10   191.7892   19.17892
Total        12   3721.692

• ANOVA = Analysis of Variance
• More on ANOVA later in the course
ANOVA for PAI = α xAl + β
ANOVA
             df   SS         MS         F          Significance F
Regression    1   3070.474   3070.474   51.86466   1.74873E-05
Residual     11   651.2183   59.20166
Total        12   3721.692
comparison?
Least-squares regression using 13 observations supports a simple linear dependence
of PAI (phosphate adsorption index) on xAl (extractable aluminum mass fraction)
and xFe (extractable iron mass fraction). The model
PAI = α xAl + β                                (1)
gives R² = 0.825 and a residual plot with points randomly distributed around zero.
Adding an xAl² term to equation (1) to account for any nonlinear dependence does
not improve the model; R² does not increase, and values for the adjustable
parameters are not significantly different from zero. The ability to predict PAI is
improved by adding a linear dependence on xFe to the model. The resulting
equation is
PAI = α xAl + β xFe + γ                         (2)
where α = 0.11 ± 0.07 g soil / mg Al, β = 0.35 ± 0.16 g soil / mg Fe, and γ = -7 ± 8.
Uncertainties span the 95 % confidence limits on the adjustable parameters.
Equation (2) gives R² = 0.948. The residual plot shows points randomly distributed
around zero, indicating that the predictable behavior of PAI has been described and
only random error remains. ANOVA confirms the significance of the model
(significance level < 1 × 10⁻⁶). More testing of low-mineral-content soils is
recommended to try to narrow the confidence limits on γ.
Another version:

Least-squares regression using 13 observations supports a simple linear dependence
of PAI (phosphate adsorption index) on xAl (extractable aluminum mass fraction)
and xFe (extractable iron mass fraction). The model proposed is
PAI = α xAl + β xFe + γ                        (1)
where α = 0.11 ± 0.07 g soil / mg Al, β = 0.35 ± 0.16 g soil / mg Fe, and γ = -7 ± 8.
Uncertainties span the 95 % confidence limits on the adjustable parameters.
Equation (1) gives R² = 0.948. The residual plot shows points randomly distributed
around zero, indicating that the predictable behavior of PAI has been described and
only random error remains. ANOVA confirms the significance of the model
(significance level < 1 × 10⁻⁶). More testing of low-mineral-content soils is
recommended to try to narrow the confidence limits on γ. Note that measuring
either xAl or xFe alone may allow a reasonable prediction of PAI for some
applications. For example, the model
PAI = α xAl + β                               (2)
gives R² = 0.825 and a residual plot with points randomly distributed around zero.
No higher-order terms are required in the model. For example, adding an xAl² term
to equation (2) to account for any nonlinear dependence gives no improvement
in R², and the values for the adjustable parameters are not significantly different
from zero.
Cells = α m + β
[Figure: # Cells in FOV (axis from 0 to 250) vs. Mass of Inhibitor, g (axis from 0 to 9)]
Cells = α m + β
α = -22 ± 9 cells / g inhibitor
β = 160 ± 50 cells
R² = 0.809
Simple Linear Residual Plot
[Figure: residuals (axis from -40 to 60) vs. mass of inhibitor (axis from 0 to 10)]
Cells = α m + β ?!?!
[Figure: # Cells in FOV (axis from 0 to 250) vs. Mass of Inhibitor, g (axis from 0 to 9)]

Of course, just looking at this plot, we should have
known not to use a linear function. It is always valuable
to look at a plot before you dive into regression, and to
check the plot with the regression line on it at the end.
([I] = α t² + β t + γ) or ([I] = α e^(-β t))?
[Figure: [Inhibitor], g/L (axis from -50 to 350) vs. Time, days (axis from 0 to 10); series: Data, Polynomial Fit, Linearized Exponential Fit]
Stuff to Remember:
• Plot the data before you start regression to
make sure you pick a reasonable model.
• Use more than just R² to evaluate model
quality.
• Plot the data and the model together to
make sure the model is satisfactory over the
whole region of interest.

```
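
The lecture's point that R² alone does not validate a model can be illustrated with a minimal sketch in plain Python. The data below are made up for illustration (y = x², not taken from the lecture), and the helper names `fit_line` and `assess_fit` are hypothetical. A straight line fitted to these curved data gives a high R², yet the residuals follow a clear U-shaped pattern instead of scattering randomly around zero.

```python
import math

def fit_line(x, y):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept

def assess_fit(x, y, slope, intercept):
    """Return residuals (measured minus predicted), R^2, and R."""
    residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]
    y_mean = sum(y) / len(y)
    ss_total = sum((yi - y_mean) ** 2 for yi in y)
    ss_residual = sum(r ** 2 for r in residuals)
    r_squared = 1 - ss_residual / ss_total
    # R takes the sign of the slope (meaningful for simple linear models only).
    r = math.copysign(math.sqrt(r_squared), slope)
    return residuals, r_squared, r

# Made-up, deliberately curved data: y = x^2 (an assumption for illustration).
x = [0, 1, 2, 3, 4, 5, 6]
y = [0, 1, 4, 9, 16, 25, 36]

slope, intercept = fit_line(x, y)
residuals, r_squared, r = assess_fit(x, y, slope, intercept)

print("slope, intercept:", slope, intercept)  # 6.0 -5.0
print("R^2:", round(r_squared, 3))            # 0.923
print("residuals:", residuals)                # [5.0, 0.0, -3.0, -4.0, -3.0, 0.0, 5.0]
```

The residuals are positive at both ends and negative in the middle, which is the kind of pattern a residual plot would reveal even though R² is above 0.9.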
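
The ANOVA tables in the lecture can be cross-checked against the reported R² values with a little arithmetic. The sketch below assumes the usual decomposition (R² = SS regression / SS total, F = MS regression / MS residual) and uses the sums of squares and degrees of freedom copied from the two tables. The function name `anova_summary` is hypothetical.

```python
def anova_summary(ss_regression, ss_residual, df_regression, df_residual):
    """Return (R^2, F) computed from an ANOVA sum-of-squares decomposition."""
    ss_total = ss_regression + ss_residual
    r_squared = ss_regression / ss_total
    ms_regression = ss_regression / df_regression  # mean square, regression
    ms_residual = ss_residual / df_residual        # mean square, residual
    f_statistic = ms_regression / ms_residual
    return r_squared, f_statistic

# Values copied from the lecture's ANOVA tables.
r2_two_factor, f_two_factor = anova_summary(3529.903, 191.7892, 2, 10)  # xAl and xFe
r2_one_factor, f_one_factor = anova_summary(3070.474, 651.2183, 1, 11)  # xAl only

print(round(r2_two_factor, 3), round(f_two_factor, 2))  # 0.948 92.03
print(round(r2_one_factor, 3), round(f_one_factor, 2))  # 0.825 51.86
```

Both results agree with the R² and F values reported on the slides, which is a quick way to confirm that a table has been transcribed correctly.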
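
The statement that a parameter is "not significantly different from zero" amounts to checking whether its 95 % confidence interval includes zero. A minimal sketch using the estimates and half-widths reported in the lecture follows. The function name `differs_from_zero` and the dictionary labels are hypothetical.

```python
def differs_from_zero(estimate, half_width):
    """True if the interval estimate ± half_width excludes zero."""
    return abs(estimate) > half_width

# (estimate, 95 % half-width) pairs copied from the lecture slides.
two_factor_model = {
    "xAl coefficient": (0.11, 0.07),
    "xFe coefficient": (0.35, 0.16),
    "intercept": (-7, 8),
}
quadratic_model = {
    "xAl coefficient": (0.2, 0.3),
    "xAl^2 coefficient": (2e-5, 80e-5),
    "intercept": (-10, 30),
}

significant_two_factor = {
    name: differs_from_zero(*interval) for name, interval in two_factor_model.items()
}
significant_quadratic = {
    name: differs_from_zero(*interval) for name, interval in quadratic_model.items()
}

print(significant_two_factor)  # both slopes True, intercept False
print(significant_quadratic)   # all False
```

For the quadratic model every interval spans zero, matching the lecture's observation that none of its adjustable parameters is significant.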