# Correlation and Bivariate Regression


```
Overview
• Bivariate Regression
–   Purpose
–   Regression equation
–   Variance accounted for
–   Confidence intervals
–   Statistical inference
–   Practical exercise
Purpose
• Correlation treats X and Y as if they have equal
status
• Bivariate Regression is used when:
– One variable (X) is thought to cause the other (Y)
• Hypothesis: obesity (X) causes diabetes (Y)
– Prediction
•   Do A-level results predict University grades?
•   Does personality predict hormonal response to stress?
•   Does animal behaviour predict snowfall?
•   Even if no causal inference is implied
• Regression retains the original measurement
scales of X and Y
Bivariate Regression
• Predicting Y from X
– How much change in winning percentage (Y) is
associated with a given increase in payroll (X)?
• Regression line determined by least squares
criterion
• Visually: it’s the straight line that minimizes error, in
terms of squared deviations
– Cubs have a big squared deviation; Reds’ is small;
but the total squared deviation is minimized
[Figure: scatterplot of winning percentage (Y) against payroll (X) with the regression line; r = .54]
Bivariate Regression
• Algebraically, the regression line can be obtained
from a linear conversion of X
• In standardised form, it’s just zX multiplied by the
correlation
\hat{z}_Y = r_{XY} z_X    (“hat” indicates predicted value)

• But, we want to express the relation in raw
scores
\hat{Y} = B_0 + B_{YX} X
• The predicted value of Y (winning) associated
with a given value of X (payroll)
Regression Equation
\hat{Y} = B_0 + B_{YX} X
• BYX is the regression coefficient (“slope”)
– the amount of change in Ŷ associated with a unit
change in X
• B0 is a constant (“intercept”)
– the value of Ŷ when X = 0
– sometimes the intercept is of theoretical interest
• if an X value of 0 is meaningful
• not of much interest with our baseball example
Regression Equation
B_{YX} = r_{XY} (sd_Y / sd_X)
baseball: B_{YX} = .538 (6.22 / 32.27) = .104

B_0 = M_Y - B_{YX} M_X
baseball: B_0 = 50 - .104 (77.56) = 41.93

baseball: \hat{Y} = 41.93 + .104 X
Unlike correlations, regression coefficients will differ based on which variable
is X and which is Y
• Residual is the difference between Y and Ŷ
• Residuals sum to zero
• Regression line minimises (squared) residuals
[Figure: regression line \hat{Y} = 41.93 + .104 X]
•        Do males and females differ in Body Mass Index?
•        Bivariate regression can test differences between two independent means

BMIACT by SEX
SEX    N     Mean       Std. Deviation
M      29    28.2459    7.60284
F      12    22.7601    3.13329

Correlation of SEX with BMIACT
Pearson Correlation    -.359
Sig. (2-tailed)        .021
N                      41

Regression      B         Std. Error   Beta     t         Sig.
(Constant)      33.732    3.130                 10.778    .000
SEX             -5.486    2.284        -.359    -2.402    .021

[Figure: BMI (y-axis) against SEX (x-axis) with fitted line \hat{Y} = 33.73 - 5.49 X]
Mean difference: 28.25 - 22.76 = 5.49
r^2 as “Variance Accounted For”
Y = B_0 + B_{YX} X + e
• To what degree do X and Y share variance,
and to what degree are they independent?
sd_Y^2 = sd_{\hat{Y}}^2 + sd_{Y-\hat{Y}}^2 = sd_{\hat{Y}}^2 + sd_e^2

• So, Y variance is “partitioned” into that
accounted for by X (sd_{\hat{Y}}^2) and that which is
residual (sd_e^2)
r^2 as “Variance Accounted For”
• For standardized scores
sd_{z_Y}^2 = sd_{z_{\hat{Y}}}^2 + sd_{z_{Y-\hat{Y}}}^2

sd_{z_Y}^2 = 1

sd_{z_{\hat{Y}}}^2 = \sum z_{\hat{Y}}^2 / (n - 1) = \sum (r z_X)^2 / (n - 1) = r^2 \sum z_X^2 / (n - 1) = r^2

1 = r^2 + sd_e^2

• so, r^2 is the proportion of Y variance accounted
for by X (and vice versa)
In standard score form…

1 = r^2 + sd_{z_{Y-\hat{Y}}}^2
r^2 = “shared variance”
sd_{z_{Y-\hat{Y}}}^2 = “residual variance”

Visual Representation = “Ballantine”
[Figure: Ballantine (overlapping circles) for ZPayroll and ZWin]
Baseball Example:
r = .538
r^2 = .289
sd_{z_{Y-\hat{Y}}}^2 = 1 - r^2 = .711
Statistical Inference

• The regression coefficients (B0 and BYX) are
sample statistics (parameter estimates)
• Because these are estimates, they are subject to
sampling error
• A given estimate falls somewhere on a
hypothetical sampling distribution
Confidence Intervals
•    Place our sample statistic (estimate) within a
confidence interval to indicate its margin of
error
•    To compute CI for BYX
–  Need to know two things about the sampling
distribution of BYX…
1. Its standard deviation
•   SE of BYX
2. Its degrees of freedom
•   n-2
Baseball Example
\hat{Y} = 41.93 + .104 X
Standard deviation of the sampling distribution of BYX
with 28 df is:
SE_{B_{YX}} = (sd_Y / sd_X) \sqrt{(1 - r^2) / (n - 2)} = (6.22 / 32.27) \sqrt{.711 / 28} = .031
Sampling distribution of BYX is a t distribution with n – 2 df.
For 95% CI, find appropriate t value with 28 df

Margin of error = SE(t) = .031(2.048) = .063
95% CL = .104 ± .063
95% CI = .041 to .167
Confidence Intervals

•     To compute CI for Ŷi
–  Need to know two things about the sampling
distribution of Ŷi…
1. Its standard deviation
SE_{\hat{Y}_i} = SE_{Y-\hat{Y}} \sqrt{1/n + (X_i - M_X)^2 / ((n - 1) sd_X^2)}
•    What happens when Xi = MX? When Xi = 0 (the B0 intercept)?
•    Note that SE_{Y-\hat{Y}} = standard error of the estimate
– the estimated population σ of the residuals
2. Its degrees of freedom
•    n-2
Standard Error of Estimate…

SE_{Y-\hat{Y}} = \sqrt{\sum (Y - \hat{Y})^2 / (n - 2)} = \sqrt{(1 - r^2) \sum (Y - M_Y)^2 / (n - 2)}

The estimated population standard deviation (σ) of
the residuals
Baseball Example:
Model Summary

Model    R          R Square    Adjusted R Square    Std. Error of the Estimate
1        .538 (a)   .289        .264                 5.33958
a. Predictors: (Constant), paymil

Coefficients (a)

                Unstandardized Coefficients    Standardized Coefficients                      95% Confidence Interval for B
Model           B          Std. Error          Beta                         t         Sig.    Lower Bound    Upper Bound
1 (Constant)    41.957     2.575                                            16.296    .000    36.683         47.231
  paymil        .104       .031                .538                         3.375     .002    .041           .167
a. Dependent Variable: percent

These SPSS values are identical to those we calculated (with some rounding)
CI for Ŷi – Baseball Example
\hat{Y} = 41.93 + .104 X
54.62 = 41.93 + .104(122)
54.62 = estimated winning percentage for $122m
payroll (Red Sox)

SE_{\hat{Y}_i} = SE_{Y-\hat{Y}} \sqrt{1/n + (X_i - M_X)^2 / ((n - 1) sd_X^2)}

SE_{\hat{Y}_i} = 5.34 \sqrt{1/30 + (122 - 77.56)^2 / (29 (32.27)^2)} = 1.68

Margin of error = SE(t) = 1.68(2.048) = 3.44
95% CL = 54.62 ± 3.44
95% CI = 51.18 to 58.06
CI for Ŷi – Baseball Example
Note that the CI for Ŷi widens as Xi deviates from MX

Whatever sampling error arises from using the
sample BYX rather than the (unavailable)
population coefficient, it will have more serious
consequences for X values that are more
distant from the mean
Null Hypothesis Significance Testing

•      Is the statistic significant?
–    How likely is it that you would get a sample
statistic of that magnitude if there really was no
association between X and Y in the population?
•      NHST for BYX
t = (sample value - H0 value) / standard error
t = (B_{YX} - H_0) / SE_{B_{YX}}
Baseball Example:

t = (.104 - 0) / .031 = 3.35, p < .01
How was p determined?
Null Hypothesis Significance Testing

•     Same basic strategy for B0
(constant/intercept) if of interest
t = (sample value - H0 value) / standard error
t = (B_0 - H_0) / SE_{B_0}
Baseball Example (Meaningless):

t = (41.93 - 0) / 2.57 = 16.3, p < .001

```
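The baseball calculations in the slides can be reproduced from the summary statistics alone. The sketch below is one way to do that in Python, assuming only the values quoted in the slides (r, the two standard deviations, the two means, n = 30, the standard error of the estimate, and the critical t of 2.048 for 28 df). The function names `slope_and_intercept`, `se_slope` and `se_fitted` are illustrative choices, not part of the original material or of SPSS.

```python
import math


def slope_and_intercept(r, sd_x, sd_y, mean_x, mean_y):
    """Raw-score coefficients: B_YX = r * sd_Y / sd_X and B_0 = M_Y - B_YX * M_X."""
    b_yx = r * sd_y / sd_x
    b_0 = mean_y - b_yx * mean_x
    return b_yx, b_0


def se_slope(r, sd_x, sd_y, n):
    """Standard error of B_YX; its sampling distribution has n - 2 df."""
    return (sd_y / sd_x) * math.sqrt((1 - r ** 2) / (n - 2))


def se_fitted(se_estimate, x_i, mean_x, sd_x, n):
    """Standard error of the fitted value at X = x_i."""
    return se_estimate * math.sqrt(
        1 / n + (x_i - mean_x) ** 2 / ((n - 1) * sd_x ** 2)
    )


# Summary statistics quoted in the slides for the baseball example.
R, SD_X, SD_Y, M_X, M_Y, N = 0.538, 32.27, 6.22, 77.56, 50.0, 30
SE_ESTIMATE = 5.34  # standard error of the estimate, from the SPSS output
T_CRIT = 2.048      # two-tailed 95% critical t for 28 df, as used in the slides

b_yx, b_0 = slope_and_intercept(R, SD_X, SD_Y, M_X, M_Y)
se_b = se_slope(R, SD_X, SD_Y, N)
ci_low, ci_high = b_yx - T_CRIT * se_b, b_yx + T_CRIT * se_b
t_stat = (b_yx - 0) / se_b  # null hypothesis value of 0

print(f"B_YX = {b_yx:.3f}, B_0 = {b_0:.2f}")
print(f"SE = {se_b:.3f}, 95% CI = {ci_low:.3f} to {ci_high:.3f}, t = {t_stat:.2f}")

# The fitted-value standard error is smallest at the mean of X and grows away from it.
for x_i in (M_X, 100.0, 122.0):
    se_i = se_fitted(SE_ESTIMATE, x_i, M_X, SD_X, N)
    print(f"X = {x_i:6.2f}: SE of fitted value = {se_i:.3f}")
```

Because the slope is not rounded to .104 before the intercept is computed, the intercept here lands closer to the SPSS value than to the hand-calculated 41.93. The loop at the end illustrates the point that the interval for a fitted value widens as Xi moves away from MX.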