					STAT 111 Introductory Statistics

       Lecture 3: Regression
           May 20, 2004
                   Today’s Topics
• Regression line
  –   Fitting a line
  –   Prediction
  –   Least-squares
  –   Interpretation
• Correlation and regression
• Causation
• Transforming variables (briefly)
          Review: The Scatterplot

• The scatterplot shows the relationship between
  two quantitative variables.
• It plots the observations of different individuals
  in a two-dimensional graph.
• Each point in a scatterplot corresponds to an
  observation of two variables of the same
  individual.
           The Regression Line
• A regression line is a straight line that
  summarizes the linear relationship between two
  variables.
• It describes how a response variable y changes
  as an explanatory variable x changes.
• A regression line is often used as a model to
  predict the value of the response y for a given
  value of the explanatory variable x.
       The Regression Line (cont.)
• We fit a line to data by drawing the line that
  comes as close as possible to the points.
• Once we have a regression line, we can predict
  the y for a specific value of x. Accuracy depends
  on how scattered the data are about the line.
• Using the regression line for prediction far
  outside the range of values of x used to obtain the
  line is called extrapolation. This is generally not
  advised, since such predictions are often inaccurate.
    Example: Predicting SAT Math
    Scores using SAT Verbal Scores
• Making a regression line using JMP:
  Analyze → Fit Y by X → Put the response
  variable into Y, explanatory variable into X →
  Hit OK → Double-click the red triangle above
  the scatterplot → Fit line
• Mathematically, a straight line has an equation of
  the form y = a + bx, where b is the slope and a is
  the intercept. But how do we determine the value
  of these two numbers?
  The Least-Squares Regression Line

• The least-squares regression line of y on x is the
  line that makes the sum of the squares of the
  vertical distances of the data points from the line
  as small as possible.
• Mathematically, the line is determined by
  choosing the intercept a and slope b that minimize
      $\sum_i (y_i - a - b x_i)^2$
  The Least-Squares Regression Line
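• The equation of the least-squares regression line
  of y on x is
      $\hat{y} = a + bx$
• The slope is determined using the formula
      $b = r \, s_y / s_x$
  where r is the correlation and $s_x$, $s_y$ are the
  standard deviations of x and y.
• The intercept is calculated using
      $a = \bar{y} - b\bar{x}$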
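The slope and intercept calculation can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard textbook formulas b = r s_y / s_x and a = ȳ - b x̄; the function name and the small data set are hypothetical and do not come from the lecture.

```python
from math import sqrt

def least_squares_line(xs, ys):
    """Return (a, b) for the line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (divisor n - 1).
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    # Correlation r: sum of products of standardized values, divided by n - 1.
    r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
            for x, y in zip(xs, ys)) / (n - 1)
    b = r * s_y / s_x        # slope
    a = y_bar - b * x_bar    # intercept
    return a, b

# Hypothetical data, made up for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = least_squares_line(xs, ys)
print(f"y-hat = {a:.3f} + {b:.3f} x")
```

Because the intercept is defined as ȳ - b x̄, the fitted line necessarily passes through the point of means.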
    Interpreting the Regression Line
• The slope b tells us that along the regression
  line, a change of 1 unit in x corresponds to a
  change of b units in y.
• The least-squares regression line always passes
  through the point $(\bar{x}, \bar{y})$.
• If both x and y are standardized variables, then
  the slope of the least-squares regression line will
  be r, and the line will pass through the origin.
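The claim about standardized variables can be checked numerically. The sketch below standardizes a small hypothetical data set (not from the lecture), fits the least-squares line to the standardized values, and compares the slope with r; the helper names are illustrative assumptions.

```python
from math import sqrt

def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

def slope_intercept(xs, ys):
    """Least-squares intercept and slope of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

# Hypothetical data, made up for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

zx, zy = standardize(xs), standardize(ys)
# Correlation computed from the standardized values.
r = sum(u * v for u, v in zip(zx, zy)) / (len(xs) - 1)
a_std, b_std = slope_intercept(zx, zy)
print(f"r = {r:.4f}, slope = {b_std:.4f}, intercept = {a_std:.4f}")
```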
    Interpreting the Regression Line
• Since standard deviation can never be negative,
  the signs of r and b will always be the same.
• Hence, if our slope is positive, we have a
  positive association between our explanatory
  variable and our response.
• On the other hand, if our slope is negative, then
  we have a negative association between our
  explanatory variable and our response.
      Example: SAT Scores Again
• In our SAT data, the math score is the response,
  and the verbal score is the explanatory variable.
  The least-squares regression line as reported by
  JMP is
      math = 498.00765 + 0.3167866 verbal
• Hence, in the context of the SAT, a student whose
  verbal score is 10 points higher is predicted to have
  a math score a little more than 3 points higher.
      Example: SAT Scores (cont.)

• Suppose we want to predict using our regression
  line a student’s math score given that his verbal
  score was 550.
• The predicted math score then would be
        498.00765 + 0.3167866 (550) ≈ 672
• Remember not to extrapolate when you make
  your predictions.
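The prediction is plain arithmetic on the fitted line, as the short sketch below shows. The coefficients are the ones reported by JMP in the lecture; the function and constant names are hypothetical. The range of verbal scores in the data is not given here, so the sketch cannot guard against extrapolation.

```python
# Coefficients of the fitted line math = INTERCEPT + SLOPE * verbal,
# as reported in the lecture.
INTERCEPT = 498.00765
SLOPE = 0.3167866

def predict_math(verbal):
    """Predicted SAT math score for a given SAT verbal score."""
    return INTERCEPT + SLOPE * verbal

predicted = predict_math(550)
print(round(predicted))  # 672
```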
      Example: SAT Scores (cont.)

• Now, suppose we instead wanted to use a
  regression line to predict verbal scores using
  math scores, and suppose that one student had a
  math score of 670.
• Naively, we would predict the verbal score by
  taking the inverse of our existing regression line,
  in which case we would predict a verbal score
  between 540 and 550.
• It is not quite as simple as this.
      Example: SAT Scores (cont.)
• What we would need to do is re-fit the
  regression line using math scores as our
  explanatory variable and verbal scores as our
  response.
• The new regression line is (from JMP)
      verbal = 408.37653 + 0.3901289 math
• So, our predicted verbal score given a math
  score of 670 would be
       408.37653 + 0.3901289 (670) ≈ 670
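The difference between inverting the first line and re-fitting can be made concrete with the two sets of JMP coefficients quoted in the lecture. The sketch below also multiplies the two slopes: for least-squares lines the slope of y on x is r s_y / s_x and the slope of x on y is r s_x / s_y, so their product is r². The variable names are hypothetical.

```python
# Line 1 (from the lecture): math = A_MATH + B_MATH * verbal
A_MATH, B_MATH = 498.00765, 0.3167866
# Line 2 (from the lecture): verbal = A_VERBAL + B_VERBAL * math
A_VERBAL, B_VERBAL = 408.37653, 0.3901289

math_score = 670

# Naive approach: solve line 1 for verbal.
naive_verbal = (math_score - A_MATH) / B_MATH
# Correct approach: use the line fitted with verbal as the response.
refit_verbal = A_VERBAL + B_VERBAL * math_score

# The product of the two least-squares slopes equals r squared.
slope_product = B_MATH * B_VERBAL

print(f"naive inverse: {naive_verbal:.1f}")
print(f"re-fitted line: {refit_verbal:.1f}")
print(f"product of slopes: {slope_product:.4f}")
```

The two predictions differ by well over 100 points, which is what one would expect when the correlation is weak: each least-squares line pulls its prediction toward the mean of its own response.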
        Correlation and Regression
• The square of the correlation, $r^2$, is the proportion
  of the variation in the response that is explained by
  our least-squares regression line.
• $r^2$ is always between 0 and 1.
• If r = ±0.7, then $r^2$ = 0.49, or about ½ of the
  variation.
• In our SAT data, $r^2$ = 0.1236 (it is the same for
  both regressions), so our regression line only
  captures about 12% of the response’s variation.
              Understanding $r^2$
• Let’s look at the SAT line (verbal as x, math as y)
  once again.
• The variance in our observed math values is
  $(61.262875)^2 = 3753.14$
• If the only variability in observed math scores
  was because of the linear fit, then math scores
  would lie exactly on our line.
• In other words, the math scores would be
  identical to our predicted math scores.
         Understanding $r^2$ (cont.)
• After computing the predicted math scores, we
  find that the variance of our predicted values is
  $(21.53698)^2 = 463.84$
• If we divide the variance of the predicted values by
  the variance of the actual values, we have
  463.84 / 3753.14 = 0.1236
• For least-squares regression, it is always true that
  $r^2$ equals the variance of the predicted
  responses as a fraction of the variance of the actual
  responses.
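The variance-ratio reading of r² can be verified on any data set. The sketch below uses a small hypothetical data set (not the SAT data, which is not reproduced in these slides), then repeats the lecture's own division using the two standard deviations quoted above.

```python
from statistics import mean, variance

# Hypothetical data, made up for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

x_bar, y_bar = mean(xs), mean(ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)

b = sxy / sxx            # least-squares slope
a = y_bar - b * x_bar    # least-squares intercept
predicted = [a + b * x for x in xs]

r_squared = sxy ** 2 / (sxx * syy)
variance_ratio = variance(predicted) / variance(ys)
print(f"r^2 = {r_squared:.4f}, variance ratio = {variance_ratio:.4f}")

# The same ratio using the standard deviations quoted in the lecture.
sat_ratio = 21.53698 ** 2 / 61.262875 ** 2
print(f"SAT variance ratio = {sat_ratio:.4f}")
```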
 Diagnosis (How Good is our Model?)

• Although we are most interested in the overall
  pattern as described by the regression line,
  deviations from this pattern are also important.
• In the regression setting, the deviations we
  consider are the vertical distances from the actual
  points to the least-squares regression line.
• These distances represent the variation left in the
  response after fitting the line and are called
  residuals.
• A residual is the difference between an observed
  value and the predicted value.
• Residual = observed y – predicted y

• The sum of the residuals of a regression line is
  always equal to 0.
• A residual plot is a scatterplot of regression
  residuals against the explanatory variable and is
  used to assess the fit of a regression line.
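As a quick check of the zero-sum property, the sketch below fits a least-squares line to a small hypothetical data set (not from the lecture), computes each residual as observed y minus predicted y, and sums them. The (x, residual) pairs it prints are the points one would draw in a residual plot.

```python
# Hypothetical data, made up for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
a = y_bar - b * x_bar

# Residual = observed y - predicted y.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

for x, e in zip(xs, residuals):
    print(f"x = {x}, residual = {e:+.2f}")
print(f"sum of residuals = {sum(residuals):.10f}")
```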
Simplified Patterns of Least-squares Residuals
[Figure: three residual plots against x, with panels labeled "linear relationship", "nonlinear relationship", and "nonconstant prediction error"]

Outliers and Influential Observations
• An outlier is an observation that lies outside the
  overall pattern of the other observations.
• Points that are outliers in the y direction have
  large regression residuals, but that need not be
  the case for all outliers.
• An influential observation is one that would
  significantly change the regression line if
  removed. An outlier in the x direction is often
  influential for the least-squares regression line.
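One way to see influence is to fit the line twice, with and without the suspect point, and compare the slopes. The sketch below does this for a hypothetical data set (not from the lecture) in which the last point is far out in the x direction; the function name is an illustrative assumption.

```python
def fit_line(points):
    """Least-squares (intercept, slope) for a list of (x, y) pairs."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in points)
         / sum((x - x_bar) ** 2 for x, _ in points))
    return y_bar - b * x_bar, b

# Hypothetical data; the last point is an outlier in the x direction.
points = [(1, 2), (2, 4), (3, 5), (4, 4), (5, 5), (15, 3)]

a_all, b_all = fit_line(points)
a_without, b_without = fit_line(points[:-1])

print(f"with the point:    slope = {b_all:.3f}")
print(f"without the point: slope = {b_without:.3f}")
```

In this made-up example, removing a single point changes the sign of the slope, so that point is influential in the sense described above.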
    Example: Age at First Word and
             Gesell Score
• Does the age at which a child begins to talk
  predict a later score on a test of mental ability?
• The age in months at which the first word was
  spoken and the score on an ability test taken
  much later were recorded for 21 children.
• Fitting a line to all data reveals a negative linear
  relationship: early talkers tend to have higher
  test scores than those who start talking later.
Example: First Word and Gesell
         Score (cont.)
• In the scatterplot, we see that observations 18
  and 19 are unusual.
• Observation 18 is far out in the x direction;
  observation 19 is far out in the y direction.
• The red line is the regression line we obtained by
  including 18; the green is obtained by excluding
  18.
• 18 is pulling the line towards itself; hence it is
  influential.
Extreme Example: Random Data
         Causation vs Association

• Example of causation: Increased consumption
  of alcohol causes a decrease in coordination and
• Example of association: A high SAT score in
  senior year of high school is typically associated
  with a high GPA in freshman year of college.
• In general, an association between an
  explanatory variable x and a response y is not
  sufficient evidence to prove that x causes y.
    Causation vs Association (cont.)
• Examples:
  – High SAT math scores tend to be accompanied by
    high SAT verbal scores, but does this mean a high
    math score causes a high verbal score?
  – Nations in which people have easy access to the
    Internet tend to have higher life expectancies. Does
    better access to the Internet cause people to live
    longer?
  – The divorce rate tends to be positively correlated with
    the quantity of bananas imported. Does importing
    more bananas cause more people to get divorced?
             Lurking Variables

• A lurking variable is one that is not among the
  explanatory or response variables in a study, but
  may influence the interpretation of relationships
  among those variables.
• In each of our three cases mentioned previously,
  there is likely a lurking variable at work.
• Give an example of one for each of the
  three cases.
         Lurking Variables (cont.)

• Lurking variables can create “nonsense
  correlations” in the sense that they suggest that
  changing one variable causes changes in the
  other.
• In addition, lurking variables can hide a true
  relationship between explanatory and response
  variables.
• In many cases, we wish to determine whether
  changes in an explanatory variable cause changes
  in the response variable.
• Even in the presence of strong association, it is
  difficult to decide whether this is due to a causal
  link.
• There are three main ways to explain an
  association between two variables.
           Explaining Association
• The association between an explanatory and a
  response variable may be due to
  – Causation when there is a direct cause-and-effect link
    between these two variables.
  – Common response when there is a lurking variable
    whose changes cause both the explanatory variable
    and the response variable to change.
  – Confounding when there are multiple influences at
    work that are getting mixed up.
Explaining Association (cont.)
• Officially, two variables are considered
  confounded when their effects on a response
  variable cannot be distinguished from each other.
• Confounded variables can be either explanatory
  or lurking.
• Even a very strong association between two
  variables is not sufficient evidence that there is a
  cause-and-effect link between the variables.
• The best way to establish that an association is
  due to causation is with a carefully designed
  experiment – more on this later.
    Transformations of Relationships
• In some situations, the values of quantitative
  variables are quite spread out, with some isolated
  points. The rest of the data becomes very
  compressed, making it somewhat difficult to look
  at.
• Situations like this suggest using a function of the
  original variable; for example, we might use a
  function that will shrink the distance between
  values. This is what we call transforming the
  data.
   Transformations of Relationships
• Transforming data changes the original scale of
  measurement. Our most common
  transformations are linear (°F → °C, lb → kg).
• Linear transformations cannot straighten curved
  relationships, though; to do that, we need a
  nonlinear transformation (e.g., powers,
  exponentials, logarithms).
• The most common transformations of our
  explanatory variable x are power transformations
  of the form $x^p$.
   Transformations of Relationships
• We call a function f(x) monotone if its values
  move in only one direction as x increases.
• For positive values of x, power functions with
  positive p (and the logarithm function) are
  monotonic increasing and preserve the order of
  observations.
• For negative p, the power functions are monotonic
  decreasing and reverse the order of observations.
• If we believe that there is some mathematical
  model that describes our data, then
  transformations will be quite effective.
• For example, the exponential growth model
  $y = a \cdot b^x$ can be written as a linear model if we
  take the logarithm of y ($\log y = \log a + x \log b$).
• On the other hand, a power law growth model
  $y = a \cdot x^p$ can be written as a linear model if we
  take the logarithm of both x and y
  ($\log y = \log a + p \log x$).
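The exponential case can be illustrated with a short sketch. It generates hypothetical data lying exactly on y = 2 · 1.5^x (values chosen only for illustration), fits a least-squares line to log y against x, and then recovers a and b by exponentiating the intercept and the slope.

```python
from math import exp, log

# Hypothetical exponential growth data: y = 2 * 1.5**x (no noise).
TRUE_A, TRUE_B = 2.0, 1.5
xs = [0, 1, 2, 3, 4, 5]
ys = [TRUE_A * TRUE_B ** x for x in xs]

# Transform the response: log y = log a + x * log b is linear in x.
log_ys = [log(y) for y in ys]

n = len(xs)
x_bar = sum(xs) / n
ly_bar = sum(log_ys) / n
slope = (sum((x - x_bar) * (ly - ly_bar) for x, ly in zip(xs, log_ys))
         / sum((x - x_bar) ** 2 for x in xs))
intercept = ly_bar - slope * x_bar

# Undo the logarithm to recover the parameters of the growth model.
a_hat = exp(intercept)
b_hat = exp(slope)
print(f"a = {a_hat:.4f}, b = {b_hat:.4f}")
```

For a power law model the same approach would apply with log x in place of x, in which case the slope of the fitted line estimates p directly.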
   Transformations of Relationships
• In practice, our decision to make a transformation
  is governed by what we know about the data.
• This also holds true in terms of what type of
  transformation we decide to make.
• For example, animal populations and values of
  investments are often well-described by
  an exponential growth model, though we do not
  always know the values of the parameters.