
# STAT Introductory Statistics by mikesanye

```
STAT 111 Introductory Statistics

Lecture 3: Regression
May 20, 2004
Today’s Topics
• Regression line
–   Fitting a line
–   Prediction
–   Least-squares
–   Interpretation
• Correlation and regression
• Causation
• Transforming variables (briefly)
Review: The Scatterplot

• The scatterplot shows the relationship between
two quantitative variables.
• It plots the observations of different individuals
in a two-dimensional graph.
• Each point in a scatterplot corresponds to an
observation of two variables of the same
individual.
The Regression Line
• A regression line is a straight line that
summarizes the linear relationship between two
variables.
• It describes how a response variable y changes
as an explanatory variable x changes.
• A regression line is often used as a model to
predict the value of the response y for a given
value of the explanatory variable x.
The Regression Line (cont.)
• We fit a line to data by drawing the line that
comes as close as possible to the points.
• Once we have a regression line, we can predict
the y for a specific value of x. Accuracy depends
on how scattered the data are about the line.
• Using the regression line to predict far outside
the range of values of x used to obtain the line is
called extrapolation. This is generally not advised,
since such predictions are often inaccurate.
Example: Predicting SAT Math
Scores using SAT Verbal Scores
• Making a regression line using JMP:
Analyze → Fit Y by X → Put the response
variable into Y, explanatory variable into X →
Hit OK → Double-click the red triangle above
the scatterplot → Fit line
• Mathematically, a straight line has an equation of
the form y = a + bx, where b is the slope and a is
the intercept. But how do we determine the value
of these two numbers?
The Least-Squares Regression Line

• The least-squares regression line of y on x is the
line that makes the sum of the squares of the
vertical distances of the data points from the line
as small as possible.
• Mathematically, the line is determined by
minimizing $\sum_{i=1}^{n} (y_i - a - b x_i)^2$
over the intercept a and the slope b.
The Least-Squares Regression Line
(cont.)
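• The equation of the least-squares regression line
of y on x is $\hat{y} = a + b x$.
• The slope is determined using the formula
$b = r \, s_y / s_x$, where r is the correlation and
$s_x$, $s_y$ are the standard deviations of x and y.
• The intercept is calculated using
$a = \bar{y} - b \bar{x}$.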
Interpreting the Regression Line
• The slope b tells us that along the regression
line, a change of 1 unit in x corresponds to a
change of b units in y.
• The least-squares regression line always passes
through the point $(\bar{x}, \bar{y})$.
• If both x and y are standardized variables, then
the slope of the least-squares regression line will
be r, and the line will pass through the origin
(0,0).
Interpreting the Regression Line
(cont.)
• Since standard deviation can never be negative,
the signs of r and b will always be the same.
• Hence, if our slope is positive, we have a
positive association between our explanatory
variable and our response.
• On the other hand, if our slope is negative, then
we have a negative association between our
explanatory variable and our response.
Example: SAT Scores Again
• In our SAT data, the math score is the response,
and the verbal score is the explanatory variable.
The least-squares regression line as reported by
JMP is
math = 498.00765 + 0.3167866 verbal
• Hence, in the context of the SAT, a verbal score
that is 10 points higher corresponds to a predicted
math score that is a little more than 3 points
higher.
Example: SAT Scores (cont.)

• Suppose we want to predict using our regression
line a student’s math score given that his verbal
score was 550.
• The predicted math score then would be
498.00765 + 0.3167866 (550) = 672
• Remember not to extrapolate when you make
predictions.
Example: SAT Scores (cont.)

• Now, suppose we instead wanted to use a
regression line to predict verbal scores using
math scores, and suppose that one student had a
math score of 670.
• Naively, we would predict the verbal score by
taking the inverse of our existing regression line,
in which case we would predict a verbal score
between 540 and 550.
• It is not quite as simple as this.
Example: SAT Scores (cont.)
• What we would need to do is re-fit the
regression line using math scores as our
explanatory variable and verbal scores as our
response.
• The new regression line is (from JMP)
verbal = 408.37653 + 0.3901289 math
• So, our predicted verbal score given a math
score of 670 would be
408.37653 + 0.3901289 (670) = 670
Correlation and Regression
• The square of the correlation, $r^2$, is the proportion
of the variation in the data that is explained by
our least-squares regression line.
• $r^2$ is always between 0 and 1.
• If $r = \pm 0.7$, then $r^2 = 0.49$, or about ½ of the
variation.
• In our SAT data, $r^2 = 0.1236$ (it is the same for
both regressions), so our regression line only
captures about 12% of the response’s variation.
Understanding $r^2$
• Let’s look at the SAT line (verbal as x, math as y)
once again.
• The variance in our observed math values is
$(61.262875)^2 = 3753.14$
• If the only variability in observed math scores
was because of the linear fit, then math scores
would lie exactly on our line.
• In other words, the math scores would be
identical to our predicted math scores.
Understanding $r^2$ (cont.)
• After computing the predicted math scores, we
find that the variance in our predicted values is
$(21.53698)^2 = 463.84$
• If we divide the variance of the predicted values by
the variance of the actual values, we have
463.84 / 3753.14 = 0.1236
• For least-squares regression, it is always true that
$r^2$ equals the variance of the predicted responses
as a fraction of the variance of the actual
responses.
Diagnosis (How Good is our Model?)

• Although we are most interested in the overall
pattern as described by the regression line,
deviations from this pattern are also important.
• In the regression setting, the deviations we
consider are the vertical distances from the actual
points to the least-squares regression line.
• These distances represent the variation left in the
response after fitting the line and are called
residuals.
Residuals
• A residual is the difference between an observed
value and the predicted value.
• Residual = observed y – predicted y

• The sum of the residuals of a regression line is
always equal to 0.
• A residual plot is a scatterplot of regression
residuals against the explanatory variable and is
used to assess the fit of a regression line.
Simplified Patterns of Least-squares
Residuals
[Figure: three residual plots (residual against x), labeled
"linear relationship", "nonlinear relationship", and
"nonconstant prediction error"]
Outliers and Influential Observations
• An outlier is an observation that lies outside the
overall pattern of the other observations.
• Points that are outliers in the y direction have
large regression residuals, but that need not be
the case for all outliers.
• An influential observation is one that would
significantly change the regression line if
removed. An outlier in the x direction is often
influential for the least-squares regression line.
Example: Age at First Word and
Gesell Score
• Does the age at which a child begins to talk
predict a later score on a test of mental ability?
• The age in months at which the first word was
spoken and the score on an ability test taken
much later were recorded for 21 children.
• Fitting a line to all data reveals a negative linear
relationship: early talkers tend to have higher
test scores than those who start talking later.
Example: First Word and Gesell
Score (cont.)
• In the scatterplot, we see that observations 18
and 19 are unusual.
• Observation 18 is far out in the x direction;
observation 19 is far out in the y direction.
• The red line is the regression line we obtained by
including 18; the green is obtained by excluding
18.
• 18 is pulling the line towards itself; hence it is
influential.
Extreme Example: Random Data
Causation vs Association

• Example of causation: Increased consumption
of alcohol causes a decrease in coordination and
reflexes.
• Example of association: A high SAT score in
senior year of high school is typically associated
with a high GPA in freshman year of college.
• In general, an association between an
explanatory variable x and a response y is not
sufficient evidence to prove that x causes y.
Causation vs Association (cont.)
• Examples:
– High SAT math scores tend to be accompanied by
high SAT verbal scores, but does this mean a high
math score causes a high verbal score?
– Countries with greater access to the Internet tend to
have higher life expectancies. Does Internet access
cause people to live longer?
– The divorce rate tends to be positively correlated with
the quantity of bananas imported. Does importing
more bananas cause more people to get divorced?
Lurking Variables

• A lurking variable is one that is not among the
explanatory or response variables in a study, but
may influence the interpretation of relationships
among those variables.
• In each of our three cases mentioned previously,
there is likely a lurking variable at work.
• Give an example of one for each of the
scenarios.
Lurking Variables (cont.)

• Lurking variables can create “nonsense
correlations” in the sense that they suggest that
changing one variable causes changes in the
other.
• In addition, lurking variables can hide a true
relationship between explanatory and response
variables.
Causation
• In many cases, we wish to determine whether
changes in an explanatory variable cause changes
in the response variable.
• Even in the presence of strong association, it is
difficult to decide whether this is due to a causal
link.
• There are three main ways to explain an
association between two variables.
Explaining Association
• The association between an explanatory and a
response variable may be due to
– Causation when there is a direct cause-and-effect link
between these two variables.
– Common response when there is a lurking variable
whose changes cause both the explanatory variable
and the response variable to change.
– Confounding when there are multiple influences at
work that are getting mixed up.
Explaining Association (cont.)
• Officially, two variables are considered
confounded when their effects on a response
variable cannot be distinguished from each other.
• Confounded variables can be either explanatory
or lurking.
• Even a very strong association between two
variables is not sufficient evidence that there is a
cause-and-effect link between them.
• The best way to establish that an association is
due to causation is with a carefully designed
experiment – more on this later.
Transformations of Relationships
• In some situations, the values of quantitative
variables are quite spread out, with some isolated
points. The rest of the data becomes very
compressed, making it somewhat difficult to look
at.
• Situations like this suggest using a function of the
original variable; for example, we might use a
function that will shrink the distance between
values. This is what we call transforming the
data.
Transformations of Relationships
(cont.)
• Transforming data changes the original scale of
measurement. Our most common
transformations are linear (˚F → ˚C, lb → kg).
• Linear transformations cannot straighten curved
relationships, though; to do that, we need a
nonlinear transformation (e.g., powers,
exponentials, logarithms).
• The most common transformations of our
explanatory variable x are power transformations
of the form $x^p$.
Transformations of Relationships
(cont.)
• We call a function f(x) monotone if its values
move in only one direction as x increases.
• For positive values of x, power functions with
positive p (and the logarithm function) are
monotonic increasing and preserve the order of
observations.
• For negative p, the power functions are monotonic
decreasing and reverse the order of observations.
• If we believe that there is some mathematical
model that describes our data, then
transformations will be quite effective.
• For example, the exponential growth model
$y = a \cdot b^x$ can be written as a linear model if we
take the logarithm of y ($\log y = \log a + x \log b$).
• On the other hand, a power law growth model
$y = a \cdot x^p$ can be written as a linear model if we
take the logarithm of both x and y
($\log y = \log a + p \log x$).
Transformations of Relationships
(cont.)
• In practice, our decision to make a transformation
is governed by what we know about the data.
• This also holds true in terms of what type of
transformation we decide to make.
• For example, animal populations and values of
investments are often well described by an
exponential growth model, though we do not
always know the values of the parameters.

```
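The least-squares ideas in the slides can be checked numerically. The following is a minimal Python sketch, not the lecture's JMP workflow. It assumes the standard formulas b = r·s_y/s_x for the slope and a = ȳ − b·x̄ for the intercept. The function name `least_squares` and the five data points are made up for illustration and are not the SAT data.

```python
from statistics import mean, stdev


def least_squares(x, y):
    """Return (a, b, r) for the least-squares line y-hat = a + b*x."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)  # sample standard deviations (n - 1 divisor)
    # Correlation r: sum of products of standardized values, divided by n - 1.
    r = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
            for xi, yi in zip(x, y)) / (n - 1)
    b = r * s_y / s_x        # slope
    a = y_bar - b * x_bar    # intercept, so the line passes through (x-bar, y-bar)
    return a, b, r


# Hypothetical illustrative data (NOT the SAT data from the lecture).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

a, b, r = least_squares(x, y)
predicted = [a + b * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]

# r^2 as the variance of the predicted values over the variance of the observed values.
r_squared_from_variances = stdev(predicted) ** 2 / stdev(y) ** 2

# Regressing x on y is a separate fit, not the algebraic inverse of the first line.
a2, b2, _ = least_squares(y, x)

print(f"y-hat = {a:.3f} + {b:.3f} x, r = {r:.3f}")
print(f"sum of residuals = {sum(residuals):.6f}")
print(f"r^2 = {r ** 2:.4f}, var(predicted)/var(observed) = {r_squared_from_variances:.4f}")
print(f"x-hat = {a2:.3f} + {b2:.3f} y (slope of inverted first line would be {1 / b:.3f})")
```

The residuals sum to zero up to rounding, the variance ratio matches r², and the slope of the x-on-y fit differs from 1/b, which is why the slides re-fit the line before predicting verbal scores from math scores.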
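The claim that an exponential growth model y = a·b^x becomes linear after taking the logarithm of y can also be sketched in Python. The data below are generated exactly from hypothetical parameters a = 3 and b = 2, and `fit_line` is an illustrative helper. It uses the slope formula Sxy/Sxx, which is algebraically the same as r·s_y/s_x.

```python
import math
from statistics import mean


def fit_line(x, y):
    """Least-squares intercept and slope for y-hat = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical noise-free exponential growth: y = 3 * 2^x.
x = [0, 1, 2, 3, 4, 5]
y = [3.0 * 2.0 ** xi for xi in x]

# log y = log a + x log b, so fitting a line to (x, log y) recovers log a and log b.
log_y = [math.log(yi) for yi in y]
intercept, slope = fit_line(x, log_y)

a_hat = math.exp(intercept)  # estimate of a
b_hat = math.exp(slope)      # estimate of b

print(f"estimated a = {a_hat:.4f}, estimated b = {b_hat:.4f}")
```

With real, noisy data the recovered parameters would only approximate the true ones. The same approach applies to a power law y = a·x^p if the line is fitted to (log x, log y) instead.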