Correlation and Bivariate Regression

					                    Overview
• Bivariate Regression
  –   Purpose
  –   Regression equation
  –   Variance accounted for
  –   Confidence intervals
  –   Statistical inference
  –   Practical exercise
                        Purpose
• Correlation treats X and Y as if they have equal
  status
• Bivariate Regression is used when:
  – One variable (X) is thought to cause the other (Y)
     • Hypothesis: obesity (X) causes diabetes (Y)
  – Prediction
     •   Do A-level results predict University grades?
     •   Does personality predict hormonal response to stress?
     •   Does animal behaviour predict snowfall?
     •   Even if no causal inference is implied
• Regression retains the original measurement
  scales of X and Y
           Bivariate Regression
• Predicting Y from X
  – How much change in winning percentage (Y) is
    associated with a given increase in payroll (X)?
• Regression line is determined by the least squares
  criterion
• Visually, it is the straight line that minimises error, in
  terms of squared deviations
  – In the payroll scatterplot (r = .54), the Cubs have a big
    squared deviation and the Reds' is small, but the total
    squared deviation is minimised

[Figure: scatterplot of winning percentage against payroll with the least-squares line, r = .54]
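
To make the least-squares criterion concrete, here is a minimal Python sketch; the payroll and winning-percentage arrays are made-up illustrative values (not the actual MLB data), and `sse` is just a hypothetical helper name:

```python
# Sketch: the least-squares line minimises the total squared (vertical) deviation.
import numpy as np

x = np.array([36.0, 62.0, 77.0, 95.0, 120.0])   # hypothetical payrolls ($ millions)
y = np.array([44.0, 48.0, 50.0, 53.0, 57.0])    # hypothetical winning percentages

def sse(b0, b1):
    """Sum of squared vertical deviations from the line y = b0 + b1*x."""
    return float(np.sum((y - (b0 + b1 * x)) ** 2))

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)   # least-squares slope
b0 = y.mean() - b1 * x.mean()                         # least-squares intercept

# Any other line (here the slope is nudged up or down) has a larger SSE.
print(sse(b0, b1) <= sse(b0, b1 * 1.1), sse(b0, b1) <= sse(b0, b1 * 0.9))
```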
            Bivariate Regression
• Algebraically, the regression line can be obtained
  from a linear conversion of X
• In standardised form, it's just zX multiplied by the
  correlation
                  $\hat{z}_Y = r_{XY}\, z_X$        ("hat" indicates a predicted value)

• But, we want to express the relation in raw
  scores
               $\hat{Y} = B_0 + B_{YX} X$
 • The predicted value of Y (winning) associated
   with a given value of X (payroll)
           Regression Equation
               $\hat{Y} = B_0 + B_{YX} X$
• BYX is the regression coefficient (“slope”)
  – the amount of change in Ŷ associated with a unit
    change in X
• B0 is a constant (“intercept”)
  – the value of Ŷ when X = 0
  – sometimes the intercept is of theoretical interest
     • if an X value of 0 is meaningful
     • not of much interest with our baseball example
                  Regression Equation
   $B_{YX} = r_{XY}\,\dfrac{sd_Y}{sd_X}$

       baseball:   $B_{YX} = .538\left(\dfrac{6.22}{32.27}\right) = .104$

   $B_0 = M_Y - B_{YX} M_X$

       baseball:   $B_0 = 50 - .104(77.56) = 41.93$

       baseball:   $\hat{Y} = 41.93 + .104X$
Unlike correlations, regression coefficients will differ based on which variable
is X and which is Y
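
As a sketch, the slope and intercept formulas above can be reproduced directly from the summary statistics quoted on these slides (r = .538, sd_Y = 6.22, sd_X = 32.27, M_Y = 50, M_X = 77.56); the `predict` helper is an illustrative name, not part of the original material:

```python
# Sketch: baseball slope and intercept from the quoted summary statistics.
r_xy, sd_y, sd_x = 0.538, 6.22, 32.27
m_y, m_x = 50.0, 77.56

b_yx = r_xy * sd_y / sd_x      # slope: ~.104
b_0 = m_y - b_yx * m_x         # intercept: ~41.96 unrounded (the slide rounds the
                               # slope to .104 first, giving 41.93; SPSS: 41.957)

def predict(payroll_millions):
    """Predicted winning percentage for a given payroll (illustrative helper)."""
    return b_0 + b_yx * payroll_millions

print(round(b_yx, 3), round(b_0, 2), round(predict(122), 1))
```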
[Figure: payroll scatterplot with the fitted line  $\hat{Y} = 41.93 + .104X$]

   The residual is the difference between Y and Ŷ.
   Residuals sum to zero.
   The regression line minimises the (squared) residuals.
       •  Do males and females differ in Body Mass Index?
       •  Bivariate regression can test differences between two independent means

   Group statistics (BMIACT by SEX):

     SEX   N    Mean      Std. Deviation
     M     29   28.2459   7.60284
     F     12   22.7601   3.13329

   Correlation between SEX and BMIACT:  Pearson Correlation = -.359,  Sig. (2-tailed) = .021,  N = 41

   Regression:

     Predictor    B        Std. Error   Beta    t        Sig.
     (Constant)   33.732   3.130                10.778   .000
     SEX          -5.486   2.284        -.359   -2.402   .021

   [Figure: scatterplot of BMI against SEX (coded 1 = M, 2 = F) with the fitted regression line]

     $\hat{Y} = 33.73 - 5.49X$

     Mean difference = 28.25 - 22.76 = 5.49, the SEX regression coefficient
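
A short sketch of the point above: with a dummy-coded predictor, the bivariate regression slope equals the difference between the two group means. The individual BMI scores below are simulated stand-ins generated to resemble the slide's group means and SDs, and SEX is assumed to be coded 1 = M, 2 = F:

```python
# Sketch: regression on a dummy-coded predictor reproduces the group-mean difference.
import numpy as np

rng = np.random.default_rng(0)
bmi_m = rng.normal(28.2459, 7.6, 29)     # simulated stand-in male BMI scores
bmi_f = rng.normal(22.7601, 3.1, 12)     # simulated stand-in female BMI scores
y = np.concatenate([bmi_m, bmi_f])
x = np.concatenate([np.ones(29), np.full(12, 2.0)])   # SEX: 1 = M, 2 = F (assumed coding)

slope = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
intercept = y.mean() - slope * x.mean()

# Because the codes are one unit apart, the slope equals the F - M mean difference.
print(round(slope, 3), round(bmi_f.mean() - bmi_m.mean(), 3))
```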
  r2 as “Variance Accounted For”
  $Y = B_0 + B_{YX} X + e$

• To what degree do X and Y share variance,
  and to what degree are they independent?
• Variances are additive:

  $sd^2_Y = sd^2_{\hat{Y}} + sd^2_{Y-\hat{Y}} = sd^2_{\hat{Y}} + sd^2_e$

• So, Y variance is “partitioned” into that
  accounted for by X (sdŶ2) and that which is
  residual (sde2)
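
The additive variance partition above can be checked numerically; the sketch below uses simulated data (nothing from the slides) and an arbitrary slope of 0.5:

```python
# Sketch: sd_Y^2 = sd_Yhat^2 + sd_e^2 for a least-squares fit.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)        # simulated X and Y

slope = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
intercept = y.mean() - slope * x.mean()
y_hat = intercept + slope * x
e = y - y_hat

var_y = np.var(y, ddof=1)
var_yhat = np.var(y_hat, ddof=1)
var_e = np.var(e, ddof=1)
print(np.isclose(var_y, var_yhat + var_e))   # True: the variances are additive
```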
  r2 as “Variance Accounted For”
• For standardized scores:

  $sd^2_{z_Y} = sd^2_{z_{\hat{Y}}} + sd^2_{z_Y - \hat{z}_Y}$

  $sd^2_{z_Y} = 1$

  $sd^2_{z_{\hat{Y}}} = \dfrac{\sum z_{\hat{Y}}^2}{n-1} = \dfrac{\sum (r\,z_X)^2}{n-1} = \dfrac{r^2 \sum z_X^2}{n-1} = r^2$

  $1 - r^2 = sd^2_e$


• so, r2 is the proportion of Y variance accounted
  for by X (and vice versa)
   In standard score form…


  $1 - r^2 = sd^2_{z_{Y-\hat{Y}}}$

     $r^2$ = "shared variance"

     $sd^2_{z_{Y-\hat{Y}}}$ = "residual variance"

Visual Representation = “Ballantine”
[Figure: "Ballantine" Venn diagram of two overlapping circles, ZPayroll and ZWin; the overlap is the shared variance]

   Baseball example:
   r = .538
   r² = .289
   $sd^2_{z_{Y-\hat{Y}}} = 1 - r^2 = .711$
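
As a numerical check of the standardized-score results above (predicted variance = r², residual variance = 1 − r²), here is a sketch on simulated data; none of the values come from the slides:

```python
# Sketch: in z-score form, var(z_Yhat) = r^2 and var(z_Y - z_Yhat) = 1 - r^2.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = 0.6 * x + rng.normal(size=500)        # simulated X and Y

zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
r = np.corrcoef(x, y)[0, 1]

zy_hat = r * zx                            # standardized prediction: z_Yhat = r * z_X
print(np.isclose(np.var(zy_hat, ddof=1), r**2))            # predicted variance = r^2
print(np.isclose(np.var(zy - zy_hat, ddof=1), 1 - r**2))   # residual variance = 1 - r^2
```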
           Statistical Inference

• The regression coefficients (B0 and BYX) are
  sample statistics (parameter estimates)
• Because these are estimates, they are subject to
  sampling error
• A given estimate falls somewhere on a
  hypothetical sampling distribution
              Confidence Intervals
•    Place our sample statistic (estimate) within a
     confidence interval to indicate its margin of
     error
•    To compute CI for BYX
    –  Need to know two things about the sampling
       distribution of BYX…
    1. Its standard deviation
        •   SE of BYX
    2. Its degrees of freedom
        •   n-2
                   Baseball Example

  $\hat{Y} = 41.93 + .104X$

Standard deviation of the sampling distribution of BYX with 28 df is:

  $SE_{B_{YX}} = \dfrac{sd_Y\sqrt{1-r^2}}{sd_X\sqrt{n-2}} = \dfrac{6.22\sqrt{.711}}{32.27\sqrt{28}} = .031$

The sampling distribution of BYX is a t distribution with n - 2 df.
For the 95% CI, find the appropriate t value with 28 df (t = 2.048).

  Margin of error = SE(t) = .031(2.048) = .063
  95% CL = .104 ± .063
  95% CI = .041 to .167
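
The SE and 95% CI for BYX can be reproduced from the quoted summary statistics; here is a sketch that uses SciPy's t distribution for the critical value (one way to obtain the 2.048 used above):

```python
# Sketch: SE and 95% CI for B_YX from the slide's summary values.
import math
from scipy import stats

r, sd_y, sd_x, n, b_yx = 0.538, 6.22, 32.27, 30, 0.104

se_b = (sd_y * math.sqrt(1 - r**2)) / (sd_x * math.sqrt(n - 2))   # ~.031
t_crit = stats.t.ppf(0.975, df=n - 2)                              # ~2.048
margin = t_crit * se_b                                             # ~.063

print(round(se_b, 3), round(t_crit, 3),
      round(b_yx - margin, 3), round(b_yx + margin, 3))  # ~ .031, 2.048, .041, .167
```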
                    Confidence Intervals

•   To compute the CI for Ŷi
   – Need to know two things about the sampling
     distribution of Ŷi…
   1. Its standard deviation

      $SE_{\hat{Y}_i} = SE_{Y-\hat{Y}} \sqrt{\dfrac{1}{n} + \dfrac{(X_i - M_X)^2}{(n-1)\,sd_X^2}}$

      (What happens when Xi = MX?  When Xi = 0, i.e. at the B0 intercept?)

      •  Note that SE_{Y-Ŷ} is the standard error of the estimate
         – the estimated population σ of the residuals
   2. Its degrees of freedom
      •  n - 2
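
A direct transcription of the SE formula above into a small Python helper; the function and argument names are illustrative, not from the slides:

```python
# Sketch: SE of the predicted value Y-hat at X = x_i, per the formula above.
import math

def se_yhat(se_est, n, x_i, m_x, sd_x):
    """se_est * sqrt(1/n + (x_i - M_X)^2 / ((n - 1) * sd_X^2))."""
    return se_est * math.sqrt(1.0 / n + (x_i - m_x) ** 2 / ((n - 1) * sd_x ** 2))

# The 95% margin of error is then se_yhat(...) times the critical t value
# with n - 2 df (2.048 for the 28-df baseball example).
```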
    Standard Error of Estimate…

  $SE_{Y-\hat{Y}} = \sqrt{\dfrac{\sum (Y-\hat{Y})^2}{n-2}} = \sqrt{\dfrac{(1-r^2)\sum (Y-M_Y)^2}{n-2}}$

 The estimated population standard deviation (σ) of
 the residuals
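
A sketch reproducing the standard error of the estimate from the baseball summary values (sd_Y = 6.22, r = .538, n = 30), using Σ(Y − M_Y)² = (n − 1)·sd_Y²:

```python
# Sketch: standard error of the estimate from summary statistics.
import math

sd_y, r, n = 6.22, 0.538, 30
ss_y = (n - 1) * sd_y**2                         # sum of (Y - M_Y)^2
se_est = math.sqrt((1 - r**2) * ss_y / (n - 2))
print(round(se_est, 2))                          # ~5.34, in line with SPSS's 5.33958
```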
  Baseball Example:

  Model Summary

    Model   R        R Square   Adjusted R Square   Std. Error of the Estimate
    1       .538 a   .289       .264                5.33958
    a. Predictors: (Constant), paymil

  Coefficients (a)

    Model           Unstandardized B   Std. Error   Standardized Beta   t        Sig.   95% CI for B (Lower, Upper)
    1  (Constant)   41.957             2.575                            16.296   .000   36.683, 47.231
       paymil         .104              .031         .538                3.375   .002     .041,   .167
    a. Dependent Variable: percent

These SPSS values are identical to those we calculated (with some rounding)
           CI for Ŷi – Baseball Example

  $\hat{Y} = 41.93 + .104X$

  54.62 = 41.93 + .104(122)
  54.62 = the estimated winning percentage for a $122m payroll (Red Sox)

  $SE_{\hat{Y}_i} = SE_{Y-\hat{Y}} \sqrt{\dfrac{1}{n} + \dfrac{(X_i - M_X)^2}{(n-1)\,sd_X^2}}$

  $SE_{\hat{Y}_i} = 5.34 \sqrt{\dfrac{1}{30} + \dfrac{(122 - 77.56)^2}{29(32.27)^2}} = .249$

       Margin of error = SE(t) = .249(2.048) = .51
       95% CL = 54.62 ± .51
       95% CI = 54.11 to 55.13
    CI for Ŷi – Baseball Example
Note that the CI for Ŷi increases as Xi deviates from MX

   Whatever sampling error was made by using the sample BYX
   rather than the (unavailable) population coefficient will have
   more serious consequences for X values that are more
   distant from the mean
    Null Hypothesis Significance Testing

•      Is the statistic significant?
     –    How likely is it that you would get a sample
          statistic of that magnitude if there really was no
          association between X and Y in the population?
•   NHST for BYX:

    $t = \dfrac{\text{sample value} - H_0\ \text{value}}{\text{standard error}}$        $t = \dfrac{B_{YX} - H_0}{SE_{B_{YX}}}$

    Baseball example:

    $t = \dfrac{.104 - 0}{.031} = 3.35,\quad p < .01$            How was p determined?
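
To answer "How was p determined?": p is the two-tailed tail probability of the observed t in a t distribution with n − 2 = 28 df (SPSS's "Sig. (2-tailed)"). A sketch using SciPy:

```python
# Sketch: t test for B_YX using the slide's values (B_YX = .104, SE = .031, df = 28).
from scipy import stats

b_yx, se_b, df = 0.104, 0.031, 28
t = (b_yx - 0) / se_b                 # ~3.35
p = 2 * stats.t.sf(abs(t), df)        # two-tailed p, ~.002
print(round(t, 2), round(p, 3))
```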
    Null Hypothesis Significance Testing

•     Same basic strategy for B0
      (constant/intercept) if of interest
    $t = \dfrac{\text{sample value} - H_0\ \text{value}}{\text{standard error}}$        $t = \dfrac{B_0 - H_0}{SE_{B_0}}$

    Baseball example (meaningless):

    $t = \dfrac{41.93 - 0}{2.57} = 16.3,\quad p < .001$

				