STAT 111 Introductory Statistics
Lecture 3: Regression
May 20, 2004

Today’s Topics
• Regression line
  – Fitting a line
  – Prediction
  – Least-squares
  – Interpretation
• Correlation and regression
• Causation
• Transforming variables (briefly)

Review: The Scatterplot
• The scatterplot shows the relationship between two quantitative variables.
• It plots the observations of different individuals in a two-dimensional graph.
• Each point in a scatterplot corresponds to an observation of two variables on the same individual.

The Regression Line
• A regression line is a straight line that summarizes the linear relationship between two variables.
• It describes how a response variable y changes as an explanatory variable x changes.
• A regression line is often used as a model to predict the value of the response y for a given value of the explanatory variable x.

The Regression Line (cont.)
• We fit a line to data by drawing the line that comes as close as possible to the points.
• Once we have a regression line, we can predict y for a specific value of x. The accuracy of the prediction depends on how scattered the data are about the line.
• Using the regression line for prediction far outside the range of x values used to obtain the line is called extrapolation. This is generally not advised, since such predictions are often inaccurate.

Example: Predicting SAT Math Scores using SAT Verbal Scores
• Making a regression line using JMP: Analyze → Fit Y by X → Put the response variable into Y, explanatory variable into X → Hit OK → Double-click the red triangle above the scatterplot → Fit line
• Mathematically, a straight line has an equation of the form y = a + bx, where b is the slope and a is the intercept. But how do we determine the values of these two numbers?

The Least-Squares Regression Line
• The least-squares regression line of y on x is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.
• Mathematically, the line is determined by choosing a and b to minimize $\sum_{i=1}^{n} (y_i - a - b x_i)^2$.

The Least-Squares Regression Line (cont.)
• The equation of the least-squares regression line of y on x is $\hat{y} = a + bx$.
• The slope is determined using the formula $b = r \, s_y / s_x$, where r is the correlation and $s_x$, $s_y$ are the standard deviations of x and y.
• The intercept is calculated using $a = \bar{y} - b\bar{x}$.

Interpreting the Regression Line
• The slope b tells us that along the regression line, a change of 1 unit in x corresponds to a change of b units in y.
• The least-squares regression line always passes through the point $(\bar{x}, \bar{y})$.
• If both x and y are standardized variables, then the slope of the least-squares regression line will be r, and the line will pass through the origin (0, 0).

Interpreting the Regression Line (cont.)
• Since a standard deviation can never be negative, the signs of r and b will always be the same.
• Hence, if our slope is positive, we have a positive association between our explanatory variable and our response.
• On the other hand, if our slope is negative, then we have a negative association between our explanatory variable and our response.

Example: SAT Scores Again
• In our SAT data, the math score is the response, and the verbal score is the explanatory variable. The least-squares regression line as reported by JMP is
  math = 498.00765 + 0.3167866 verbal
• Hence, in the context of the SAT, a verbal score that is 10 points higher corresponds to a predicted math score that is a little more than 3 points higher.

Example: SAT Scores (cont.)
• Suppose we want to use our regression line to predict a student’s math score given that his verbal score was 550.
• The predicted math score would then be
  498.00765 + 0.3167866 (550) ≈ 672
• Remember not to extrapolate when you make your predictions.

Example: SAT Scores (cont.)
• Now, suppose we instead wanted to use a regression line to predict verbal scores using math scores, and suppose that one student had a math score of 670.
• Naively, we would predict the verbal score by inverting our existing regression line, which gives (670 − 498.00765) / 0.3167866 ≈ 543, a verbal score between 540 and 550.
• It is not quite as simple as this.

Example: SAT Scores (cont.)
• What we need to do is re-fit the regression line using math scores as our explanatory variable and verbal scores as our response.
• The new regression line is (from JMP)
  verbal = 408.37653 + 0.3901289 math
• So, our predicted verbal score given a math score of 670 would be
  408.37653 + 0.3901289 (670) ≈ 670

Correlation and Regression
• The square of the correlation, $r^2$, is the proportion of the variation in the response that is explained by our least-squares regression line.
• $r^2$ is always between 0 and 1.
• If r = ±0.7, then $r^2$ = 0.49, or about ½ of the variation.
• In our SAT data, $r^2$ = 0.1236 (it is the same for both regressions), so our regression line only captures about 12% of the response’s variation.

Understanding $r^2$
• Let’s look at the SAT line (verbal as x, math as y) once again.
• The variance of our observed math values is $(61.262875)^2 = 3753.14$.
• If the only variability in the observed math scores were due to the linear relationship with verbal scores, then the math scores would lie exactly on our line.
• In other words, the math scores would be identical to our predicted math scores.

Understanding $r^2$ (cont.)
• After computing the predicted math scores, we find that the variance of our predicted values is $(21.53698)^2 = 463.84$.
• If we divide the variance of the predicted values by the variance of the actual values, we get 463.84 / 3753.14 = 0.1236.
• For least-squares regression, it is always true that $r^2$ equals the variance of the predicted responses as a fraction of the variance of the actual responses.

Diagnosis (How Good is our Model?)
• Although we are most interested in the overall pattern as described by the regression line, deviations from this pattern are also important.
• In the regression setting, the deviations we consider are the vertical distances from the actual points to the least-squares regression line.
• These distances represent the variation left in the response after fitting the line and are called residuals.

Residuals
• A residual is the difference between an observed value and the predicted value.
• Residual = observed y − predicted y
• The sum of the residuals of a least-squares regression line is always equal to 0.
• A residual plot is a scatterplot of the regression residuals against the explanatory variable and is used to assess the fit of a regression line.

Simplified Patterns of Least-squares Residuals
[Figure: three residual plots (residual against x) showing a linear relationship, a nonlinear relationship, and nonconstant prediction error]

Outliers and Influential Observations
• An outlier is an observation that lies outside the overall pattern of the other observations.
• Points that are outliers in the y direction have large regression residuals, but that need not be the case for all outliers.
• An influential observation is one that would significantly change the regression line if removed. An outlier in the x direction is often influential for the least-squares regression line.

Example: Age at First Word and Gesell Score
• Does the age at which a child begins to talk predict a later score on a test of mental ability?
• The age in months at which the first word was spoken and the score on an ability test taken much later were recorded for 21 children.
• Fitting a line to all the data reveals a negative linear relationship: early talkers tend to have higher test scores than those who start talking later.

Example: First Word and Gesell Score (cont.)
[Figure: scatterplot with fitted regression lines]

Example: First Word and Gesell Score (cont.)
• In the scatterplot, we see that observations 18 and 19 are unusual.
• Observation 18 is far out in the x direction; observation 19 is far out in the y direction.
• The red line is the regression line we obtained by including observation 18; the green line is obtained by excluding it.
• Observation 18 pulls the line toward itself; hence it is influential.

Extreme Example: Random Data

Causation vs Association
• Example of causation: Increased consumption of alcohol causes a decrease in coordination and reflexes.
• Example of association: A high SAT score in senior year of high school is typically associated with a high GPA in freshman year of college.
• In general, an association between an explanatory variable x and a response y is not sufficient evidence to prove that x causes y.

Causation vs Association (cont.)
• Examples:
  – High SAT math scores tend to be accompanied by high SAT verbal scores, but does this mean a high math score causes a high verbal score?
  – Nations in which people have easy access to the Internet tend to have higher life expectancies. Does better access to the Internet cause people to live longer?
  – The divorce rate tends to be positively correlated with the quantity of bananas imported. Does importing more bananas cause more people to get divorced?

Lurking Variables
• A lurking variable is one that is not among the explanatory or response variables in a study, but may influence the interpretation of relationships among those variables.
• In each of the three cases mentioned previously, there is likely a lurking variable at work.
• Give an example of one for each of the scenarios.

Lurking Variables (cont.)
• Lurking variables can create “nonsense correlations” in the sense that they suggest that changing one variable causes changes in the other.
• In addition, lurking variables can hide a true relationship between explanatory and response variables.

Causation
• In many cases, we wish to determine whether changes in an explanatory variable cause changes in the response variable.
• Even in the presence of a strong association, it is difficult to decide whether this is due to a causal link.
• There are three main ways to explain an association between two variables.
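
The lurking-variable idea from the slides above can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the variable z, the noise levels, and the sample size are hypothetical and are not taken from any data in the lecture. Both x and y are generated from z alone, so neither causes the other, yet they end up strongly correlated.

```python
import math
import random


def pearson_r(x, y):
    """Sample correlation r between two equal-length lists."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def simulate(n=2000, seed=0):
    """Generate x and y that both depend on a lurking variable z, not on each other."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]    # hypothetical lurking variable
    x = [zi + rng.gauss(0, 0.5) for zi in z]   # x responds to z only
    y = [zi + rng.gauss(0, 0.5) for zi in z]   # y responds to z only
    return x, y, z


x, y, z = simulate()
r_xy = pearson_r(x, y)

# Remove the part of each variable that comes from z, then correlate what is left.
x_rest = [xi - zi for xi, zi in zip(x, z)]
y_rest = [yi - zi for yi, zi in zip(y, z)]
r_rest = pearson_r(x_rest, y_rest)

print(f"r(x, y) = {r_xy:.2f}")
print(f"r after removing z = {r_rest:.2f}")
```

With these settings the theoretical correlation between x and y is 1 / (1 + 0.25) = 0.8, and it drops to roughly 0 once the contribution of z is removed. In real studies the lurking variable is usually unmeasured, so this subtraction step is not available; the simulation only shows how such an association can arise.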
Explaining Association
• The association between an explanatory and a response variable may be due to
  – Causation when there is a direct cause-and-effect link between these two variables.
  – Common response when there is a lurking variable whose changes cause both the explanatory variable and the response variable to change.
  – Confounding when there are multiple influences at work that are getting mixed up.

Explaining Association (cont.)
• Officially, two variables are considered confounded when their effects on a response variable cannot be distinguished from each other.
• Confounded variables can be either explanatory or lurking.
• Even a very strong association between two variables is not sufficient evidence that there is a cause-and-effect link between the variables.
• The best way to establish that an association is due to causation is with a carefully designed experiment – more on this later.

Transformations of Relationships
• In some situations, the values of quantitative variables are quite spread out, with some isolated points. The rest of the data becomes very compressed, making it somewhat difficult to look at.
• Situations like this suggest using a function of the original variable; for example, we might use a function that will shrink the distance between values. This is what we call transforming the data.

Transformations of Relationships (cont.)
• Transforming data changes the original scale of measurement. Our most common transformations are linear (°F → °C, lb → kg).
• Linear transformations cannot straighten curved relationships, though; to do that, we need a nonlinear transformation (e.g., powers, exponentials, logarithms).
• The most common transformations of our explanatory variable x are power transformations of the form $x^p$.

Transformations of Relationships (cont.)
• We call a function f(x) monotone if its values move in only one direction as x increases.
• For positive values of x, power functions with positive p (and the logarithm function) are monotone increasing and preserve the order of observations.
• For negative p, the power functions are monotone decreasing and reverse the order of observations.
• If we believe that a particular mathematical model describes our data, then a transformation suggested by that model can be quite effective.
• For example, the exponential growth model $y = a \cdot b^x$ can be written as a linear model if we take the logarithm of y ($\log y = \log a + x \log b$).
• On the other hand, a power law growth model $y = a \cdot x^p$ can be written as a linear model if we take the logarithm of both x and y ($\log y = \log a + p \log x$).

Transformations of Relationships (cont.)
• In practice, our decision to make a transformation is governed by what we know about the data.
• The same holds for the type of transformation we decide to make.
• For example, animal populations and values of investments are often well described by an exponential growth model, though we do not always know the values of the parameters.
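
The two log transformations above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the lecture's JMP workflow: the data are made-up and noise-free, the function names are hypothetical, and the fit uses the standard least-squares formulas b = r·s_y/s_x and a = ȳ − b·x̄ with natural logarithms.

```python
import math


def least_squares(x, y):
    """Return (a, b) for the least-squares line y = a + b*x,
    using b = r * s_y / s_x and a = ybar - b * xbar."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    s_x = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))
    # Correlation as the average product of standardized values.
    r = sum(((xi - xbar) / s_x) * ((yi - ybar) / s_y)
            for xi, yi in zip(x, y)) / (n - 1)
    b = r * s_y / s_x
    a = ybar - b * xbar
    return a, b


def fit_exponential(x, y):
    """Fit y = a * b**x by regressing log y on x."""
    log_a, log_b = least_squares(x, [math.log(yi) for yi in y])
    return math.exp(log_a), math.exp(log_b)


def fit_power(x, y):
    """Fit y = a * x**p by regressing log y on log x."""
    log_a, p = least_squares([math.log(xi) for xi in x],
                             [math.log(yi) for yi in y])
    return math.exp(log_a), p


# Made-up, noise-free data for illustration only.
xs = list(range(1, 11))
growth = [100 * 1.05 ** xi for xi in xs]   # exponential growth: a = 100, b = 1.05
power = [3 * xi ** 2 for xi in xs]         # power law: a = 3, p = 2

a_exp, b_exp = fit_exponential(xs, growth)
a_pow, p_pow = fit_power(xs, power)
print(f"exponential fit: a = {a_exp:.3f}, b = {b_exp:.3f}")
print(f"power-law fit:   a = {a_pow:.3f}, p = {p_pow:.3f}")
```

Because the example data follow each model exactly, the fits recover the parameters used to generate them. With real data the transformed points would only be roughly linear, and a residual plot on the transformed scale would be the natural check of the fit.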