Linear regression models

Simple Linear Regression
• Developed by Sir Francis Galton (1822–1911)
  in his 1886 article “Regression towards
  mediocrity in hereditary stature”
• To describe the linear relationship between two
  continuous variables, the response variable (y-
  axis) and a single predictor variable (x-axis)
• To determine how much of the variation in Y can
  be explained by the linear relationship with X
  and how much of this variation remains
  unexplained
• To predict new values of Y from new values of X
The linear regression model is:

     $$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

 • Xi and Yi are paired observations (i = 1 to n)
 • β0 = population intercept (the expected value of Yi
   when Xi = 0)
 • β1 = population slope (measures the change in Yi
   per unit change in Xi)
 • εi = the random or unexplained error associated
   with the ith observation. The εi are assumed to be
   independent and distributed as N(0, σ²).
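One way to get a feel for the model is to simulate from it. The sketch below draws Y values for fixed X values under the stated assumptions (independent, normally distributed errors with mean 0 and constant variance). The function name, the parameter values and the seed are illustrative choices, not values from the slides.

```python
import random

def simulate_linear_model(x_values, beta0, beta1, sigma, seed=0):
    """Draw Y_i = beta0 + beta1 * X_i + eps_i, with eps_i ~ N(0, sigma^2)."""
    rng = random.Random(seed)  # local generator so the draw is reproducible
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in x_values]

# Hypothetical settings chosen only for illustration.
x = [float(i) for i in range(1, 11)]
y = simulate_linear_model(x, beta0=2.0, beta1=0.5, sigma=1.0)

# With sigma = 0 there is no error term, so every point lies on the line E(Y) = 2 + 0.5 X.
y_noise_free = simulate_linear_model(x, beta0=2.0, beta1=0.5, sigma=0.0)
```

Setting sigma to zero recovers the deterministic part of the model, which is a quick check that the intercept and slope are being applied as intended.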
Linear relationship

Linear models approximate non-linear functions
over a limited domain

[Figure: regions along the X axis labeled extrapolation, interpolation, extrapolation]
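The point about limited domains can be illustrated numerically. The sketch below fits a least-squares line to a curve sampled on a narrow range and then compares the line with the curve inside and outside that range. The choice of exp(X) as the "true" function, the sampled range and the helper names are assumptions made for illustration only.

```python
import math

def fit_line(x, y):
    """Ordinary least-squares intercept and slope for paired lists x and y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = ss_xy / ss_x
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical non-linear relationship, sampled only on 0 <= X <= 1.
x_sampled = [i / 10 for i in range(11)]
y_sampled = [math.exp(xi) for xi in x_sampled]
b0, b1 = fit_line(x_sampled, y_sampled)

def line_error(x_new):
    """Absolute gap between the fitted line and the true curve at x_new."""
    return abs((b0 + b1 * x_new) - math.exp(x_new))

error_interpolation = line_error(0.5)  # inside the sampled range
error_extrapolation = line_error(3.0)  # well outside the sampled range
```

Inside the sampled range the line tracks the curve closely; at X = 3 the gap is far larger, which is why predictions outside the observed range of X deserve caution.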
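• For a given value of X, the sampled Y
  values are independent with normally
  distributed errors:

     $$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

     $$\varepsilon_i \sim N(0, \sigma^2) \;\Rightarrow\; E(\varepsilon_i) = 0$$

     $$E(Y_i) = \beta_0 + \beta_1 X_i$$

[Figure: X axis marked at X1 and X2]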
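Fitting data to a linear model:

     $$Y_i - \hat{Y}_i = \varepsilon_i \quad \text{(residual)}$$

The residual is the vertical distance between an observed
value Yi and the value Ŷi predicted by the fitted line.

The residual sum of squares:

     $$SS_{residual} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$
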
Estimating Regression Parameters
• The “best fit” estimates for the regression
  population parameters (β0 and β1) are the
  values that minimize the residual sum of
  squares (SSresidual) between each
  observed value and the predicted value of
  the model:

     $$SS_{residual} = \sum_{i=1}^{n} \left( Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_i) \right)^2$$

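Sum of squares:

     $$SS_X = \sum_{i=1}^{n} (X_i - \bar{X})^2$$

Sum of cross products:

     $$SS_{XY} = \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$$

Least-squares parameter estimates

Sample variance of X:

     $$s_X^2 = \frac{SS_X}{n - 1}$$

Sample covariance:

     $$s_{XY} = \frac{SS_{XY}}{n - 1}$$

The slope estimate is the ratio of the two:

     $$\hat{\beta}_1 = \frac{s_{XY}}{s_X^2} = \frac{SS_{XY}}{SS_X}$$

Solving for the intercept:

     $$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$

Thus, our estimated regression equation is:

     $$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$$
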
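The least-squares estimates can be computed directly from the sum of squares of X and the sum of cross products: the slope is their ratio, and the intercept follows from the sample means. The sketch below works through this on a small made-up data set; the data values and the function name are hypothetical and serve only to show the arithmetic.

```python
def least_squares(x, y):
    """Return (intercept, slope) minimizing the residual sum of squares."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)                         # sum of squares of X
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))  # sum of cross products
    slope = ss_xy / ss_x
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical paired observations (not from the slides).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = least_squares(x, y)          # slope is about 1.99, intercept about 0.05
fitted = [b0 + b1 * xi for xi in x]   # estimated regression equation applied to each X
residuals = [yi - fi for yi, fi in zip(y, fitted)]
```

For this toy data the sum of squares of X is 10 and the sum of cross products is 19.9, so the slope is 1.99. Least-squares residuals always sum to zero (up to rounding), which makes a convenient sanity check.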
 Hypothesis Tests with Regression
• Null hypothesis is that there is no linear
  relationship between X and Y:

     $$H_0: \beta_1 = 0 \;\Rightarrow\; Y_i = \beta_0 + \varepsilon_i$$

     $$H_A: \beta_1 \neq 0 \;\Rightarrow\; Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

• We can use an F-ratio (i.e., the ratio of
  variances) to test these hypotheses
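Variance of the error of regression:

     $$\hat{\sigma}^2 = \frac{SS_{residual}}{n - 2} = \frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n - 2}$$

NOTE: this is also referred to as residual
variance, mean squared error (MSE) or
residual mean square (MSresidual)

Mean square of regression:

     $$MS_{regression} = \frac{SS_{regression}}{1} = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$$
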
The F-ratio is: (MSregression)/(MSresidual)

Under the null hypothesis, this ratio follows the
F-distribution with (1, n − 2) degrees of freedom
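As a rough illustration of how the F-ratio is assembled, the sketch below computes the regression and residual sums of squares for a small made-up data set, divides each by its degrees of freedom (1 and n − 2), and takes the ratio. The data and the function name are hypothetical.

```python
def f_ratio_for_regression(x, y):
    """Return the sums of squares, mean squares and F-ratio of a simple regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = ss_xy / ss_x
    intercept = y_bar - slope * x_bar
    fitted = [intercept + slope * xi for xi in x]

    ss_regression = sum((fi - y_bar) ** 2 for fi in fitted)          # explained variation
    ss_residual = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # unexplained variation
    ms_regression = ss_regression / 1        # 1 degree of freedom
    ms_residual = ss_residual / (n - 2)      # n - 2 degrees of freedom
    return {
        "ss_regression": ss_regression,
        "ss_residual": ss_residual,
        "ms_regression": ms_regression,
        "ms_residual": ms_residual,
        "f_ratio": ms_regression / ms_residual,
    }

# Hypothetical paired observations (not from the slides).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
result = f_ratio_for_regression(x, y)
```

Turning the F-ratio into a p-value requires the upper tail of the F-distribution with (1, n − 2) degrees of freedom, which the Python standard library does not provide; a statistics package such as SciPy (`scipy.stats.f.sf`) would be one option.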
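Variance components and
coefficient of determination

The total variation in Y splits into a part explained by
the regression and a residual part:

     $$SS_{total} = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = SS_{regression} + SS_{residual}$$

Coefficient of determination:

     $$r^2 = \frac{SS_{regression}}{SS_{total}} = \frac{SS_{regression}}{SS_{regression} + SS_{residual}}$$
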
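ANOVA table for regression

Source       Degrees      Sum of squares               Mean square             Expected mean square     F ratio
             of freedom

Regression   1            SSregression = Σ(Ŷi − Ȳ)²    SSregression / 1        σ² + β1² Σ(Xi − X̄)²      MSregression / MSresidual

Residual     n − 2        SSresidual = Σ(Yi − Ŷi)²     SSresidual / (n − 2)    σ²

Total        n − 1        SStotal = Σ(Yi − Ȳ)²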
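Product-moment correlation

     $$r = \frac{s_{XY}}{s_X s_Y} = \frac{SS_{XY}}{\sqrt{SS_X \, SS_Y}}$$

where SS_X and SS_Y are the sums of squared deviations of X
and Y from their means, and SS_XY is the sum of cross products.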
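In simple linear regression the squared product-moment correlation equals the coefficient of determination, the share of the total sum of squares accounted for by the regression. The sketch below checks this numerically on a small made-up data set; the data and function name are hypothetical.

```python
import math

def correlation_and_r_squared(x, y):
    """Return (r, r_squared) computed by two different routes."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    ss_y = sum((yi - y_bar) ** 2 for yi in y)                         # total sum of squares
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    # Route 1: product-moment correlation from the sums of squares.
    r = ss_xy / math.sqrt(ss_x * ss_y)

    # Route 2: regression sum of squares divided by the total sum of squares.
    slope = ss_xy / ss_x
    ss_regression = slope * ss_xy   # identical to sum((fitted - y_bar)^2)
    r_squared = ss_regression / ss_y
    return r, r_squared

# Hypothetical paired observations (not from the slides).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
r, r_squared = correlation_and_r_squared(x, y)
```

The sign of r matches the sign of the slope, while r squared discards the direction and reports only the proportion of variation explained.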
Parametric Confidence Intervals
•   If we assume our parameter of interest has a particular sampling
    distribution and we have estimated its expected value and variance,
    we can construct a confidence interval for a given confidence level.
•   Example: if we assume Y is a normal random variable with unknown
    mean μ and variance σ², then $(\bar{Y} - \mu)/(\sigma/\sqrt{n})$ is distributed as a
    standard normal variable. But, since we don’t know σ, we must
    divide by the estimated standard error instead: $(\bar{Y} - \mu)/(s/\sqrt{n})$,
    giving us a t-distribution with (n − 1) degrees of freedom. Here Ȳ is
    the sample mean and s the sample standard deviation of n observations.
•   The 100(1 − α)% confidence interval for μ is then given by:

        $$\bar{Y} \pm t_{\alpha/2,\, n-1} \, \frac{s}{\sqrt{n}}$$

•   IMPORTANT: this does not mean “There is a 100(1-α)% chance
    that the true population mean μ occurs inside this interval.” It
    means that if we were to repeatedly sample the population in
    the same way, 100(1-α)% of the confidence intervals would
    contain the true population mean μ.
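The repeated-sampling interpretation can be checked by simulation. The sketch below draws many samples from a normal population with a known mean, builds a 95% t-based interval from each sample, and counts how often the interval contains the true mean. The population values, sample size, number of repeats and seed are arbitrary illustrative choices; the critical value 2.262 is the two-sided 95% point of the t-distribution with 9 degrees of freedom, hard-coded because the Python standard library has no t-distribution.

```python
import random
import statistics

T_CRIT_DF9 = 2.262  # two-sided 95% critical value of t with 9 degrees of freedom

def t_interval(sample, t_crit):
    """Confidence interval for the mean: sample mean +/- t_crit * s / sqrt(n)."""
    n = len(sample)
    mean = statistics.fmean(sample)
    standard_error = statistics.stdev(sample) / n ** 0.5
    return mean - t_crit * standard_error, mean + t_crit * standard_error

def coverage(true_mean=10.0, true_sd=2.0, n=10, n_repeats=5000, seed=1):
    """Fraction of repeated-sample intervals that contain the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_repeats):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        lower, upper = t_interval(sample, T_CRIT_DF9)
        hits += lower <= true_mean <= upper
    return hits / n_repeats

observed_coverage = coverage()  # expected to land close to 0.95
```

Any single interval either contains the true mean or it does not; the 95% describes the long-run behavior of the procedure across repeated samples.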
Publication form of ANOVA table for regression

Source        Sum of Squares   df   Mean Square   F        Sig.
Regression    11.479            1   11.479        21.044   0.00035
Residual       8.182           15    0.545
Total         19.661           16
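The entries of the published table are tied together by simple arithmetic, which can be used to check a table for internal consistency. The sketch below recomputes the mean squares, the F-ratio and the coefficient of determination from the sums of squares and degrees of freedom reported in the table; only the variable names are new, and the second row is read as the residual row.

```python
# Values taken from the published table.
ss_regression, df_regression = 11.479, 1
ss_residual, df_residual = 8.182, 15

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual        # about 0.545
f_ratio = ms_regression / ms_residual          # about 21.044

ss_total = ss_regression + ss_residual         # 19.661
df_total = df_regression + df_residual         # 16
n_observations = df_total + 1                  # 17, since total df = n - 1

r_squared = ss_regression / ss_total           # about 0.584
```

The recomputed F-ratio agrees with the reported 21.044, and the sums of squares imply that the regression accounts for roughly 58% of the variation in the response.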
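Variance of estimated intercept:

     $$\hat{\sigma}^2_{\hat{\beta}_0} = \hat{\sigma}^2 \left( \frac{1}{n} + \frac{\bar{X}^2}{SS_X} \right)$$

Variance of the slope estimator:

     $$\hat{\sigma}^2_{\hat{\beta}_1} = \frac{\hat{\sigma}^2}{SS_X}$$

Variance of the fitted value:

     $$\hat{\sigma}^2_{\hat{Y} \mid X_i} = \hat{\sigma}^2 \left( \frac{1}{n} + \frac{(X_i - \bar{X})^2}{SS_X} \right)$$

Variance of the predicted value:

     $$\hat{\sigma}^2_{\tilde{Y} \mid \tilde{X}} = \hat{\sigma}^2 \left( 1 + \frac{1}{n} + \frac{(\tilde{X} - \bar{X})^2}{SS_X} \right)$$

where $\hat{\sigma}^2$ is the residual mean square, $SS_X = \sum (X_i - \bar{X})^2$, and $\tilde{X}$ is a new value of X.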
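These four variances all scale the residual mean square by a factor that depends on the sample size and on how far the X value of interest lies from the mean of X. The sketch below evaluates the standard textbook expressions on a small made-up data set; the data, the function name and the choice of new X value are hypothetical.

```python
def regression_variances(x, y, x_new):
    """Estimated variances of the intercept, slope, fitted value and predicted value."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = ss_xy / ss_x
    intercept = y_bar - slope * x_bar
    ss_residual = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sigma2_hat = ss_residual / (n - 2)   # residual mean square

    distance_term = (x_new - x_bar) ** 2 / ss_x
    return {
        "sigma2_hat": sigma2_hat,
        "intercept": sigma2_hat * (1 / n + x_bar ** 2 / ss_x),
        "slope": sigma2_hat / ss_x,
        "fitted": sigma2_hat * (1 / n + distance_term),         # mean response at x_new
        "predicted": sigma2_hat * (1 + 1 / n + distance_term),  # a single new observation
    }

# Hypothetical paired observations (not from the slides).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
variances = regression_variances(x, y, x_new=3.0)
```

The predicted-value variance exceeds the fitted-value variance by exactly the residual mean square, because a new observation carries its own error on top of the uncertainty in the fitted line. Both grow as the new X value moves away from the mean of X.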
    Assumptions of regression
• The linear model correctly describes the
  functional relationship between X and Y
• The X variable is measured without error
• For a given value of X, the sampled Y
  values are independent with normally
  distributed errors
• Variances are constant along the
  regression line
[Figure: Residual plot for species-area]
