# Introduction to Statistics: Regression

Harry R. Erwin, PhD
School of Computing and Technology
University of Sunderland
## Resources
• Crawley, M. J. (2005) Statistics: An Introduction Using R. Wiley.
• Gonick, L., and Smith, W. (1993) The Cartoon Guide to Statistics. HarperResource (for fun).
## Regression
• Used when both the response and the explanatory variable are continuous.
• Apply when a scatter plot is the appropriate graphic.
• Four main types:
  – Linear regression (straight line)
  – Polynomial regression (non-linear)
  – Non-linear regression (in general)
  – Non-parametric regression (no obvious functional form)
## Linear Regression
• Worked example from the book (128ff):

```r
attach(reg.data)
names(reg.data)
plot(tannin, growth, pch = 16)
```

• Uses the lm() function and a simple model formula, growth ~ tannin:

```r
abline(lm(growth ~ tannin))
fitted <- predict(lm(growth ~ tannin))
```

• model… (141ff)
## Tannin Data Set

```r
reg.data <- …
attach(reg.data)
names(reg.data)
[1] "growth" "tannin"
plot(tannin, growth, pch = 16)   # solid dots
```
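The data-loading call is truncated in the slide above. For a self-contained run, the nine tannin observations can be entered directly; the values below reproduce the coefficients and sums of squares quoted on the following slides, though entering them inline rather than loading from a file is my adaptation:

```r
# Tannin worked example (Crawley 2005, 128ff): caterpillar growth
# against dietary tannin, entered inline rather than read from a file.
reg.data <- data.frame(
  tannin = 0:8,
  growth = c(12, 10, 8, 11, 6, 7, 2, 3, 3)
)
model <- lm(growth ~ tannin, data = reg.data)
coef(model)   # intercept ~ 11.756, slope ~ -1.217, matching the slides
```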
## Tannin Plot

(Figure: scatter plot of growth against tannin.)
## Linear Regression

```r
model <- lm(growth ~ tannin)
model

Call:
lm(formula = growth ~ tannin)

Coefficients:
(Intercept)       tannin
     11.756       -1.217

abline(model)
```
## Abline

(Figure: the scatter plot with the fitted regression line added by abline(model).)
## Fitting

```r
fitted <- predict(model)
fitted
        1         2         3         4         5         6         7         8         9
11.755556 10.538889  9.322222  8.105556  6.888889  5.672222  4.455556  3.238889  2.022222

for (i in 1:9) lines(c(tannin[i], tannin[i]), c(growth[i], fitted[i]))
```
## Fitted

(Figure: scatter plot with vertical lines joining each observed point to its fitted value.)
## Summary

```r
summary(model)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  11.7556     1.0408  11.295 9.54e-06 ***
tannin       -1.2167     0.2186  -5.565 0.000846 ***
---

Residual standard error: 1.693 on 7 degrees of freedom
Multiple R-squared: 0.8157,  Adjusted R-squared: 0.7893
F-statistic: 30.97 on 1 and 7 DF,  p-value: 0.000846
```
## Summary.aov

```r
summary.aov(model)
            Df Sum Sq Mean Sq F value   Pr(>F)
tannin       1 88.817  88.817  30.974 0.000846 ***
Residuals    7 20.072   2.867            <- the error variance
```

Report summary(model) and resist the temptation to include summary.aov(model). Include the p-value (previous slide) and the error variance (here) in a figure caption.
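If you want those two numbers programmatically rather than read off the printout, anova() on the fitted model returns them in a data frame (a sketch, assuming the tannin model from the earlier slides, with the data entered inline):

```r
# Pull the p-value and error variance for a figure caption.
tannin <- 0:8
growth <- c(12, 10, 8, 11, 6, 7, 2, 3, 3)
model  <- lm(growth ~ tannin)
atab   <- anova(model)
p.value   <- atab$"Pr(>F)"[1]    # ~0.000846
error.var <- atab$"Mean Sq"[2]   # ~2.867, the error variance
```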
Finally, plot(model):
• First plot (you don't want structure here)
• Second plot (qqnorm)
• Third plot (you also don't want structure here)
• Fourth plot (influence)
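plot(model) draws the four diagnostics one at a time; a common base-R idiom (my suggestion, not from the slides) is to show them as a 2×2 grid:

```r
# View all four diagnostic plots of an lm fit at once.
tannin <- 0:8
growth <- c(12, 10, 8, 11, 6, 7, 2, 3, 3)
model  <- lm(growth ~ tannin)
par(mfrow = c(2, 2))   # 2 x 2 grid of panels
plot(model)            # residuals, QQ plot, scale-location, influence
par(mfrow = c(1, 1))   # restore single-panel layout
```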
## Key Definitions
• SSE: the sum of the squares of the residuals (the error sum of squares); this is minimised for the best fit.
• SSX = ∑x² − (∑x)²/n, the corrected sum of squares of x.
• SSY = ∑y² − (∑y)²/n, the corrected sum of squares of y.
• SSXY = ∑xy − (∑x)(∑y)/n, the corrected sum of products.
• b = SSXY/SSX, the maximum likelihood estimate of the slope of the linear regression.
• SSR = SSXY²/SSX, the explained variation or regression sum of squares. Note SSY = SSR + SSE.
• r = SSXY/√(SSX·SSY), the correlation coefficient.
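These definitions can be checked directly against lm(); the sketch below uses the tannin data from the worked example (the inline data entry is my addition):

```r
# Compute the key quantities by hand and compare with lm().
x <- 0:8                               # tannin
y <- c(12, 10, 8, 11, 6, 7, 2, 3, 3)   # growth
n <- length(x)
SSX  <- sum(x^2) - sum(x)^2 / n            # corrected sum of squares of x
SSY  <- sum(y^2) - sum(y)^2 / n            # corrected sum of squares of y
SSXY <- sum(x * y) - sum(x) * sum(y) / n   # corrected sum of products
b    <- SSXY / SSX                         # slope estimate: -1.217
SSR  <- SSXY^2 / SSX                       # regression sum of squares: 88.817
SSE  <- SSY - SSR                          # error sum of squares: 20.072
r    <- SSXY / sqrt(SSX * SSY)             # correlation coefficient
all.equal(b, unname(coef(lm(y ~ x))[2]))   # TRUE: agrees with lm()
```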
## Analysis of Variance
• SSY has df = n − 1.
• SSE uses two estimated parameters (slope and intercept), so df = n − 2.
• SSR uses a single degree of freedom, since fitting the regression model to this simple data set estimated only one extra parameter (beyond the mean value of y): the slope, b.
• Remember SSY = SSR + SSE.
## Continuing
• Regression variance = SSR/1.
• Error variance s² = SSE/(n − 2).
• F = regression variance/s².
• The null hypothesis is that the slope (b) is zero, so there is no dependence of the response on the explanatory variable.
• s² then allows us to work out the standard errors of the slope and intercept:
  – s.e.(b) = √(s²/SSX)
  – s.e.(a) = √(s²∑x²/(n·SSX))
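Continuing the hand calculation for the tannin example (inline data again my addition), these formulas reproduce the standard errors and F statistic printed by summary(model):

```r
# Error variance, F statistic, and standard errors by hand.
x <- 0:8
y <- c(12, 10, 8, 11, 6, 7, 2, 3, 3)
n <- length(x)
SSX  <- sum(x^2) - sum(x)^2 / n
SSY  <- sum(y^2) - sum(y)^2 / n
SSXY <- sum(x * y) - sum(x) * sum(y) / n
SSR  <- SSXY^2 / SSX
SSE  <- SSY - SSR
s2   <- SSE / (n - 2)                      # error variance: 2.867
F    <- (SSR / 1) / s2                     # F statistic: 30.97
se.b <- sqrt(s2 / SSX)                     # s.e. of slope: 0.2186
se.a <- sqrt(s2 * sum(x^2) / (n * SSX))    # s.e. of intercept: 1.0408
```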
## Doing It in R
• model <- lm(growth ~ tannin)
• summary(model)
  – This produces all of the parameters and their standard errors.
• If you want to see the analysis of variance, use summary.aov(model).
• Report summary(model) and resist the temptation to include summary.aov(model). Include the p-value and error variance in a figure caption.
• The degree of fit, or coefficient of determination, is r² = SSR/SSY; r is the correlation coefficient.
## Critical Appraisal
• Check constancy of variance and normality of errors with plot(model):
  – Plot 1 should show no pattern.
  – Plot 2 should show a straight line.
  – Plot 3 repeats Plot 1 on a different scale; you don't want to see a triangular shape.
  – Plot 4 shows Cook's distance, identifying the points with the most influence. You may want to investigate them to look for errors or systematic effects. Remodel with those points removed, and assess whether they unduly dominate your results.
• mcheck(model)
## Be Aware!

```r
interv <- 1:100 / 100
theta <- 2 * pi * interv
x <- cos(theta)
y <- sin(theta)
plot(y, x)
```

What's the correct functional form?

```r
regress <- lm(y ~ x)
plot(regress)
```
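To spell out the warning (my addition): the points lie exactly on the unit circle x² + y² = 1, so y is not a function of x at all, yet lm() fits without complaint and returns a slope of essentially zero:

```r
# Regression applied where no functional form exists: a circle.
interv <- 1:100 / 100
theta  <- 2 * pi * interv
x <- cos(theta)
y <- sin(theta)
regress <- lm(y ~ x)
coef(regress)[2]   # slope is numerically zero: the fit is meaningless
```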
## Polynomial Regression
• A simple way to investigate non-linearity.
• Worked example (146ff).
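A minimal sketch of polynomial regression (my own synthetic example, not the book's worked one): add a quadratic term with I() and let anova() judge whether it is needed:

```r
# Polynomial regression: compare linear and quadratic fits.
set.seed(42)
x <- seq(0, 10, length.out = 50)
y <- 3 + 2 * x - 0.3 * x^2 + rnorm(50, sd = 1)   # curved truth plus noise
model1 <- lm(y ~ x)             # straight line
model2 <- lm(y ~ x + I(x^2))    # quadratic
anova(model1, model2)           # small p-value: keep the quadratic term
```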
## Non-Linear Regression
• Perhaps the science constrains the functional form of the relationship between a response variable and an explanatory variable, but the relationship cannot be linearised by transformations. What to do?
• Use nls() instead of lm(), specify the form of the model precisely, and define initial guesses for any parameters.
• summary(model) still reports the statistics, while anova(model1, model2) is used to compare models. summary.aov(model) reports the analysis of variance.
• If you see that the relationship is non-linear but you don't have a theory, use a generalised additive model (gam):

```r
library(mgcv)   # note: this is not gam() from core R
model <- gam(y ~ s(x))
```

  – s(x) is the default smoother, a thin-plate regression spline basis.
• Worked example.

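A minimal nls() sketch (my own example with an assumed exponential-decay form, not the book's worked one), showing the explicit model formula and starting values the slide calls for:

```r
# Non-linear regression with nls(): exponential decay.
set.seed(1)
x <- seq(0, 5, length.out = 40)
y <- 10 * exp(-0.7 * x) + rnorm(40, sd = 0.2)   # known truth plus noise
model <- nls(y ~ a * exp(-b * x),               # functional form from theory
             start = list(a = 8, b = 0.5))      # initial parameter guesses
summary(model)   # parameter estimates with standard errors
```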