Bivariate Data --- Ch 2
(x1, y1), (x2, y2), . . . (xn, yn) ==> n subjects with 2 observations per subject, (xi, yi), both x,and yi are quantitative
Summaries for bivarite data:
scatterplot: xy graph
regression equation: y i = b0 + b1xi
r = Pearson’s correlation coefficient, a numerical measure representing the strength and direction of
the linear relationship between x and y .
= [ (xi - x )(yi - y )]/(n – 1)sxsy
Questions about the scatterplot:
What is the overall (average) pattern? It could be a straight line or a curved pattern.
What is the direction of the line? It could be positive (y increases as x increases) or negative.
How much do the points vary off of the line? It could be that it is almost perfectly linear or just barely linear.
Are there any unusual points? Outliers are points whose combination of values is unusual. They will affect
what we perceive as the pattern, so they need investigation.
If the data points fit a straight line well, we can use one of the variables to predict the other. Usually, one
variable is difficult (or impossible) to measure, and we would like to know what it is without having to always
Let xi is the explanatory or independent variable
yi is the response or dependent variable
The regression equation:
Regression analysis is the estimation of an equation that describes how, on average, the response variable
changes with respect to the explanatory variable.
b1 = [(xi - x )(yi - y )]/ [(xi - x )2, how much y for each 1 unit increase of x, the units of the slope are y/x.
b0 = y - b1 x , the value of y when x = 0, it is not always a ‘valid’ data point. NOTE: this means ( x , y ) will
always be on the line.
b1 = r (sy/sx ), again the slope is in units of y/x since the correlation, r, has no units.
FYI: y i = ( y - b1 x ) + b1xi = y + b1(xi - x ) so yi is regressed or ‘pulled back’ towards its mean compared to
where xi is relative its own mean
y i = y + r (sy/sx )(xi - x ) for xi = x + sx => y i = y + r sy, the predicted y is only ‘moves’ r percent of its
standard deviation (compared to x).
The regression line does not predict exactly what will happen for each individual, only what will happen on
average. The method of Least Squares chooses b0 and b1 so that the vertical distance between the points and the
line is minimized. This distance is called the residual or error because it is how far off our prediction is from
the data, so it makes sense that we want to minimize our errors.
ei = yi - y i = yi - (b0 - b1xi) and the Method of Least Squares min ei2
Properties of ei :
1. ei = 0 says the point falls exactly on the line, ei 0 says the points falls above the line and ei 0 says the point
falls below the line
2. ei `s = 0 since they’ve been minimized and ei = 0.
3. ei 2 = Sum of Squared Residuals = Sum of Squared Errors has been minimized by the method of Least
NOTE: Because the line is determined by the data, we cannot guarantee this relationship continues outside the
range of the data. This is often referred to as ‘the danger of extrapolation’.
Properties of r :
1. r has no units, therefore it is not affected by scale changes.
2. r(x,y) = r(y,x) , i.e., it makes no difference which is x and which is y
3. -1 r 1 , r < 0 implies a negative linear relationship between x and y
r > 0 implies a positive linear relationship between x and y
| r | 0.8 implies a strong linear relationship, | r | 0.5 implies a weak one
4. r = | 1 | implies that there is an exact linear relationship between x and y, the points
would fit on a straight line
5. r = 0 says there is no linear relationship between x and y. It is always possible some other relationship exists.
6. r2 = proportion of variation explained by x = SSTotal/SSE where SSTotal = (yi - y )2.
What affects the summary statistics:
For bivariate data, it depends on whether you change x or y:
In general, the slope, b1 = r*sy/sx and the intercept, b0 = y - b1 x = y - r*(sy/sx)*( x )
x c: b0 => y - b1( x c) = (b0 – b1)*c, b1 => r*sy/sx = b1
y c: b0 => b0 c, b1 => r*sy/sx = b1
x*c: b0 => y - ( x *c)*r*sy/(sx*c) = b0 (the c’s cancel), b1 => r*sy/(sx*c) = b1/c
y*c: b0 => b0*c, b1 => r*(sy*c)/sx = b1*c
x/c: b0 => y - ( x /c)*r*sy/(sx/c) = b0, b1 => r*sy/(sx/c) = b1*c
y/c: b0 => b0/c, b1 => r*(sy/c)/sx = b1/c
The slope is similar to a standard deviation in that it is only affected by changes in scale (y or x). The intercept is
similar to a mean, in that, it is affected by scale AND shift changes (but not scale changes of x since the change
cancels out). Notice: r, the correlation coefficient NEVER changes since x is ‘just as good’ as x*c (or x+c) at
predicting y (or y*c or y+c)!!!!
NOTE: y will only move r% of a standard deviation from its mean, y , each time x moves 1 standard deviation
from its mean, x . Let c be any constant, then if x* = x c*sx, then y* = y r*c*sy.
A point that is ‘off’ in the x direction is called an influential point because it can affect the slope, intercept and
A point that is ‘off’ in the y direction only affects the intercept (usually) and the correlation. Because it is a ‘bad’
point, it causes the correlation to decrease (get closer to 0).
Correlation does not prove causation:
It could be the explanatory variable is causing the change in the response. Only a properly designed
experiment using a large sample and a randomization scheme can statistically prove causation.
It could be there is causation, but confounding factors contribute as well, and since they weren’t controlled,
causation can’t be proved. Different groups combined into one dataset is often a confounding factor.
There is no causation. Both variables are affected by other variables.
The response variable is causing the explanatory variable to change.