VIEWS: 77 PAGES: 25 POSTED ON: 4/5/2010
Regression Harry R. Erwin, PhD School of Computing and Technology University of Sunderland Resources • Crawley, MJ (2005) Statistics: An Introduction Using R. Wiley. • Gonick, L., and Woollcott Smith (1993) A Cartoon Guide to Statistics. HarperResource (for fun). Regression • Used when both the response and the explanatory variable are continuous • Apply when a scatter plot is the appropriate graphic. • Four main types: – Linear regression (straight line) – Polynomial regression (non-linear) – Non-linear regression (in general) – Non-parametric regression (no obvious functional form) Linear Regression • Worked example from book (128ff) reg.data<-read.table(―tannin.txt‖,header=T) attach(reg.data) names(reg.data) plot(tannin,growth,pch=16) • Uses the lm() function and a simple model growth~tannin abline(lm(growth~tannin)) fitted<-predict(lm(growth~tannin)) • model… (141ff) Tannin Data Set reg.data<- read.table("tannin.txt",header=T) attach(reg.data) names(reg.data) [1] "growth" "tannin” plot(tannin,growth,pch=16) (dots) Tannin Plot Linear Regression model<-lm(growth~tannin) model Call: lm(formula = growth ~ tannin) Coefficients: (Intercept) tannin 11.756 -1.217 abline(model) Abline Fitting fitted<-predict(model) fitted 1 2 3 4 5 6 7 8 9 11.755556 10.538889 9.322222 8.105556 6.888889 5.672222 4.455556 3.238889 2.022222 for(i in 1:9)lines(c(tannin[i],tannin[i]),c(growth[i],fitted[i])) Fitted Summary summary(model) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 11.7556 1.0408 11.295 9.54e-06 *** tannin -1.2167 0.2186 -5.565 0.000846 *** --- Residual standard error: 1.693 on 7 degrees of freedom Multiple R-squared: 0.8157, Adjusted R-squared: 0.7893 F-statistic: 30.97 on 1 and 7 DF, p-value: 0.000846 Summary.aov summary.aov(model) Df Sum Sq Mean Sq F value Pr(>F) tannin 1 88.817 88.817 30.974 0.000846 *** Residuals 7 20.072 2.867 <- the error variance Report summary(model) and resist the temptation to include summary.aov(model). Include the p-value from (last slide) and error variance (here) in a figure caption. Finally plot(model) First Plot (don’t want structure here) Second Plot (qqnorm) Third Plot (also don’t want structure here) Fourth Plot (influence) Key Definitions • SSE—the sum of the squares of the residuals (or error sum of squares)—this is to be minimised for the best fit • SSX—∑x2-(∑x)2/n, the corrected sum of the squares of x • SSY—∑y2-(∑y)2/n, the corrected sum of the squares of y. • SSXY—∑xy-(∑x)(∑y)/n, the corrected sum of the products • b—SSXY/SSX, the maximum likelihood estimate of the slope of the linear regression. • SSR—SSXY2/SSX, the explained variation or the regression sum of squares. Note SSY = SSR + SSE. • r—the correlation coefficient, SSXY/√(SSX SSY) Analysis of Variance • Start with SSR, SSE, and SSY. • SSY has df = n-1. • SSE uses two estimated parameters (slope and intercept), so df = n-2. • SSR uses a single degree of freedom since fitting the regression model to this simple data set estimated only one extra parameter (beyond the mean value of y), the slope, b. • Remember SSY = SSR + SSE. Continuing • Regression variance = SSR/1. • Error variance s2 = SSE/(n-2) • F = Regression variance/s2 • The null hypothesis is that the slope (b) is zero, so there is no dependence of the response on the explanatory variable. • s2 then allows us to work out the standard errors of the slope and intercept. • s.e.b = √(s2/SSX) • s.e.a = √(s2∑x2/nSSX) Doing it in R • model<-lm(growth~tannin) • summary(model) – This produces all of the parameters and their standard errors • If you want to see the analysis of variance, use summary.aov(model) • Report summary(model) and resist the temptation to include summary.aov(model). Include the p-value and error variance in a figure caption. • The degree of fit or coefficient of determination (r2) is SSR/SSY. r (or ) is the correlation coefficient. Critical Appraisal • Check constancy of variance and normality of errors • plot(model) – Plot 1 should show no pattern – Plot 2 should show a straight line – Plot 3 repeats Plot 1 on a different scale. You don’t want to see a triangular shape. – Plot 4 shows Cook’s distance, showing those points with the most influence. You may want to investigate them to look for error or systematic effects. Remodel, removing those points and assess whether they dominate your results unduly. • mcheck(model) Be Aware! • interv<-1:100/100 • theta<-2*pi*interv • x<-cos(theta) • y<-sin(theta) • plot(y,x) What's the correct functional form? • regress<-lm(y~x) • plot(regress) Polynomial Regression • A simple way to investigate non-linearity. • Worked example. (146ff) Non-Linear Regression • Perhaps the science constrains the functional form of the relationship between a response variable and an explanatory variable, but the relationship cannot be linearized by transformations. What to do? • Use nls instead of lm, precisely specify the form of the model, and define initial guesses for any parameters. • summary(model) still reports the statistics, while anova(model1, model2) is used to compare models. summary.aov(model) reports the analysis of variance. Generalised Additive Models • If you see that the relationship is non-linear, but you don’t have a theory, use a generalised additive model (gam). • library(mgcv) – by the way, this is not gam() from core R. • model<-gam(y~s(x)) – s(x) is the default smoother , a thin plate regression spline basis. • Worked example.