Docstoc

Introduction to Statistics - PowerPoint - PowerPoint

Document Sample
Introduction to Statistics - PowerPoint - PowerPoint Powered By Docstoc
					          Regression

        Harry R. Erwin, PhD
School of Computing and Technology
     University of Sunderland
                   Resources
• Crawley, MJ (2005) Statistics: An Introduction
  Using R. Wiley.
• Gonick, L., and Woollcott Smith (1993) A Cartoon
  Guide to Statistics. HarperResource (for fun).
                       Regression
• Used when both the response and the explanatory
  variable are continuous
• Apply when a scatter plot is the appropriate graphic.
• Four main types:
   –   Linear regression (straight line)
   –   Polynomial regression (non-linear)
   –   Non-linear regression (in general)
   –   Non-parametric regression (no obvious functional form)
                Linear Regression
• Worked example from book (128ff)
   reg.data<-read.table(―tannin.txt‖,header=T)
   attach(reg.data)
   names(reg.data)
   plot(tannin,growth,pch=16)
• Uses the lm() function and a simple model
  growth~tannin
   abline(lm(growth~tannin))
   fitted<-predict(lm(growth~tannin))
• model… (141ff)
          Tannin Data Set
reg.data<-
  read.table("tannin.txt",header=T)
attach(reg.data)
names(reg.data)
[1] "growth" "tannin”
plot(tannin,growth,pch=16) (dots)
Tannin Plot
            Linear Regression
model<-lm(growth~tannin)
model

Call:
lm(formula = growth ~ tannin)

Coefficients:
(Intercept)       tannin
     11.756       -1.217

abline(model)
Abline
                                   Fitting
fitted<-predict(model)
fitted
        1         2         3         4         5         6       7          8          9
11.755556 10.538889 9.322222 8.105556 6.888889 5.672222 4.455556      3.238889   2.022222
for(i in 1:9)lines(c(tannin[i],tannin[i]),c(growth[i],fitted[i]))
Fitted
                   Summary
summary(model)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.7556      1.0408 11.295 9.54e-06 ***
tannin        -1.2167    0.2186 -5.565 0.000846 ***
---

Residual standard error: 1.693 on 7 degrees of freedom
Multiple R-squared: 0.8157,   Adjusted R-squared:
  0.7893
F-statistic: 30.97 on 1 and 7 DF, p-value: 0.000846
                 Summary.aov
summary.aov(model)
            Df Sum Sq Mean Sq F value         Pr(>F)
tannin       1 88.817 88.817 30.974 0.000846 ***
Residuals    7 20.072   2.867 <- the error variance


  Report summary(model) and resist the
  temptation to include summary.aov(model).
  Include the p-value from (last slide) and error
  variance (here) in a figure caption.
  Finally plot(model)
First Plot (don’t want structure here)
Second Plot (qqnorm)
Third Plot (also don’t want structure
                here)
Fourth Plot (influence)
                 Key Definitions
• SSE—the sum of the squares of the residuals (or error sum
  of squares)—this is to be minimised for the best fit
• SSX—∑x2-(∑x)2/n, the corrected sum of the squares of x
• SSY—∑y2-(∑y)2/n, the corrected sum of the squares of y.
• SSXY—∑xy-(∑x)(∑y)/n, the corrected sum of the products
• b—SSXY/SSX, the maximum likelihood estimate of the
  slope of the linear regression.
• SSR—SSXY2/SSX, the explained variation or the regression
  sum of squares. Note SSY = SSR + SSE.
• r—the correlation coefficient, SSXY/√(SSX  SSY)
            Analysis of Variance
• Start with SSR, SSE, and SSY.
• SSY has df = n-1.
• SSE uses two estimated parameters (slope and
  intercept), so df = n-2.
• SSR uses a single degree of freedom since fitting the
  regression model to this simple data set estimated only
  one extra parameter (beyond the mean value of y), the
  slope, b.
• Remember SSY = SSR + SSE.
                    Continuing
• Regression variance = SSR/1.
• Error variance s2 = SSE/(n-2)
• F = Regression variance/s2
• The null hypothesis is that the slope (b) is zero, so
  there is no dependence of the response on the
  explanatory variable.
• s2 then allows us to work out the standard errors of the
  slope and intercept.
• s.e.b = √(s2/SSX)
• s.e.a = √(s2∑x2/nSSX)
                     Doing it in R
• model<-lm(growth~tannin)
• summary(model)
   – This produces all of the parameters and their standard errors
• If you want to see the analysis of variance, use
  summary.aov(model)
• Report summary(model) and resist the temptation to
  include summary.aov(model). Include the p-value and
  error variance in a figure caption.
• The degree of fit or coefficient of determination (r2) is
  SSR/SSY. r (or ) is the correlation coefficient.
                Critical Appraisal
• Check constancy of variance and normality of errors
• plot(model)
   – Plot 1 should show no pattern
   – Plot 2 should show a straight line
   – Plot 3 repeats Plot 1 on a different scale. You don’t want to
     see a triangular shape.
   – Plot 4 shows Cook’s distance, showing those points with the
     most influence. You may want to investigate them to look
     for error or systematic effects. Remodel, removing those
     points and assess whether they dominate your results unduly.
• mcheck(model)
                    Be Aware!
• interv<-1:100/100
• theta<-2*pi*interv
• x<-cos(theta)
• y<-sin(theta)
• plot(y,x)
What's the correct functional form?
• regress<-lm(y~x)
• plot(regress)
         Polynomial Regression
• A simple way to investigate non-linearity.
• Worked example. (146ff)
          Non-Linear Regression
• Perhaps the science constrains the functional form of
  the relationship between a response variable and an
  explanatory variable, but the relationship cannot be
  linearized by transformations. What to do?
• Use nls instead of lm, precisely specify the form of the
  model, and define initial guesses for any parameters.
• summary(model) still reports the statistics, while
  anova(model1, model2) is used to compare models.
  summary.aov(model) reports the analysis of variance.
      Generalised Additive Models
• If you see that the relationship is non-linear, but you
  don’t have a theory, use a generalised additive model
  (gam).
• library(mgcv)
   – by the way, this is not gam() from core R.
• model<-gam(y~s(x))
   – s(x) is the default smoother , a thin plate regression spline
     basis.
• Worked example.