Biostatistics

					Simple Linear Regression

Statistical Reasoning 2, Lecture 4
Section A

Review: The Equation of a Line
The Equation of a Line

   Recall, from algebra, there are two values which uniquely define
    any line
      y-intercept: where the line crosses the y-axis (when x = 0)
      Slope: the "rise over the run", i.e., how much y changes for each
         one-unit change in x




                                                                       3
The Equation of a Line



   y = mx + b
        b = y-intercept
        m = slope




                                                                       4
The Equation of a Line

   Of course statisticians must have their own notation!


   y = b0 + b1x
         b0 = y-intercept
         b1 = slope


   y = β0 + β1x
         β0 = y-intercept
         β1 = slope




                                                            5
The Intercept, β0

   The intercept β0 is the value of y when x is 0
      It is the point on the graph where the line crosses the y
        (vertical) axis, at the coordinate (0, β0)

   [Figure: the line y = β0 + β1x, with the intercept β0 marked on the y-axis]




                                                                   6
The Slope, β1

   The slope β1 is the change in y corresponding to a unit increase in x

   [Figure: the line y = β0 + β1x, with a rise of β1 marked over a one-unit run in x]

                                                                            8
The Slope, β1

   The slope β1 is the change in y corresponding to a unit increase in x


   Another interpretation: β1 is the difference in y-values for x+1
    compared to x


   This change/difference is the same across the entire line




                                                                            9
The Slope, β1

   The slope β1 is the change in y corresponding to a unit increase in x

   [Figure: the line y = β0 + β1x; each one-unit step in x produces the same rise β1 at every point along the line]


                                                                            10
The Slope, β1

   The slope β1 is the change in y corresponding to a unit increase in
    x: β1 is the difference in y-values for x+1 compared to x


   All information about the difference in the y-value for two differing
    values of x is contained in the slope!


   For example: two values of x three units apart will have a
    difference in y values of 3 × β1 (3β1)

   [Figure: three consecutive one-unit steps in x along the line, each with rise β1, for a total rise of 3β1]




                                                                 13
The Slope, β1

   The slope β1 is the change in y corresponding to a unit increase in
    x: β1 is the difference in y-values for x+1 compared to x



        If slope β1 = 0, there is no association
         (i.e., the values of y are the same regardless of the values of x)
        If slope β1 > 0, there is a positive association
         (i.e., the values of y increase with increasing values of x)
        If slope β1 < 0, there is a negative association
         (i.e., the values of y decrease with increasing values of x)
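
   The slope facts above can be checked numerically. The following is a
    minimal sketch, not part of the original lecture: the intercept and
    slope values (2.0 and 0.5) and the function names are made up for
    illustration only.

```python
def line(x, b0, b1):
    """Return the y-value on the line y = b0 + b1 * x."""
    return b0 + b1 * x


def describe_slope(b1):
    """Classify the direction of association implied by the slope."""
    if b1 > 0:
        return "positive association"
    if b1 < 0:
        return "negative association"
    return "no association"


# Hypothetical intercept and slope, chosen only for illustration.
b0, b1 = 2.0, 0.5

# Difference in y for x values one unit apart: equals the slope.
diff_1 = line(11, b0, b1) - line(10, b0, b1)

# Difference in y for x values three units apart: equals 3 * slope.
diff_3 = line(13, b0, b1) - line(10, b0, b1)

print(diff_1, diff_3, describe_slope(b1))
```

   The same differences result no matter which starting x is used, which
    is what "the same across the entire line" means.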




                                                                              14
The Equation of a Line

   In linear regression situations, points don't fit exactly to a line


   We estimate a line that relates the mean of an outcome y to a
    predictor x

                     $E[y] = \hat{\beta}_0 + \hat{\beta}_1 x$

        $E[y]$ = estimated "expected" (mean) value of y
        $\hat{\beta}_0$ = estimated y-intercept
        $\hat{\beta}_1$ = estimated slope




                                                                          16
The Equation of a Line

   $\hat{\beta}_0$ and $\hat{\beta}_1$ are called estimated regression coefficients

   These two quantities are estimated using the data
      The estimated line is the line that "fits the data best"

   Many times the equation is just written as:

                          $y = \hat{\beta}_0 + \hat{\beta}_1 x$

                                 or

                          $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$




                                                               17
The Equation of a Line

   $\hat{\beta}_0$ and $\hat{\beta}_1$ are called estimated regression coefficients


    We will see that in a regression context, $\hat{\beta}_1$ is nothing more than the
    estimated mean difference in y between two groups who differ by
    one unit in x
      i.e., how much the mean of y changes for a one-unit increase in x




                                                                          18
Section B

Linear Regression: Motivating Example
Example: Arm Circumference and Height

   Data on anthropometric measures from a random sample of 150
    Nepali children [0, 12) months old


   Question: what is the relationship between average arm
    circumference and height?


   Data:
      Arm circumference: mean 12.4 cm, SD 1.5 cm, range 7.3 cm –
        15.6 cm
      Height: mean 61.6 cm, SD 6.3 cm, range 40.9 cm – 73.3 cm




                                                                    20
Approach 1: Arm Circumference and Height

   Dichotomize height at median, compare mean arm circumference
    with t-test and 95% CI




                                                                   21
Approach 1: Arm Circumference and Height

   Potential Advantages:
      We know how to do it!
      Gives a single summary measure (sample mean difference) for
        quantifying the arm circumference/height association


   Potential Disadvantages:
      Throws away a lot of information in the height data that was
        originally measured as continuous
      Only allows for a single comparison between two crudely
        defined height categories




                                                                      22
Approach 2: Arm Circumference and Height

   Categorize height into 4 categories by quartile, compare mean arm
    circumference with ANOVA, 95% CIs




                                                                        23
Approach 2: Arm Circumference and Height

   Potential Advantages:
      We know how to do it!
      Uses a less crude categorization of height than the previous
        approach of dichotomizing


   Potential Disadvantages:
      Still throws away a lot of information in the height data that
        was originally measured as continuous
      Requires multiple summary measures (6 sample mean
        differences between each unique combination of height
        categories) to quantify arm circumference/height relationship
      Does not exploit the structure we see in the previous boxplot:
        as height increases so does arm circumference




                                                                        24
Approach 3: Arm Circumference and Height

   What about treating height as continuous when estimating the arm
    circumference/height relationship?


   Linear regression is a potential option: allows us to associate a
    continuous outcome with a continuous predictor via a line
      The line estimates the mean value of the outcome for each
        continuous value of height in the sample used
      Makes a lot of sense: but only if a line reasonably describes the
        outcome/predictor relationship


   Linear regression can also use binary or categorical predictors (will
    show later in this set of lectures)




                                                                            25
Visualizing Arm Circumference and Height Relationship

   A useful visual display for assessing nature of association between
    two continuous variables: a scatterplot




                                                                          26
Visualizing Arm Circumference and Height Relationship

   Question: does a line reasonably describe the general shape of the
    relationship between arm circumference and height?


   We can estimate a line, using the computer (details to come in a
    subsequent lecture section)


    The line we estimate will be of the form:

                 $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$

    Here, $\hat{y}$ is the average arm circumference for a group of children
    all of the same height, x




                                                                          27
Example: Arm Circumference and Height

   Equation of regression line relating estimated mean arm
    circumference (cm) to height (cm), from Stata:

     $\hat{y} = 2.7 + 0.16x$

       Here, $\hat{y}$ = estimated average arm circumference (like what we
        previously would call $\bar{y}$), x = height, $\hat{\beta}_0 = 2.7$ and
        $\hat{\beta}_1 = 0.16$


       This is the estimated line from the sample of 150 Nepali
        children




                                                                       28
Example: Arm Circumference and Height

   Scatterplot with regression line superimposed




                                                    $\hat{y} = 2.7 + 0.16x$




                                                                      29
Example: Arm Circumference and Height

   Estimated mean arm circumference for children 60 cm in height




                     $\hat{y} = 2.7 + 0.16x$

                     for x = 60 cm:
                     $\hat{y} = 2.7 + 0.16 \times 60 = 12.3$ cm




                                                                           30
Example: Arm Circumference and Height

   Notice, most points don't fall directly on the line: we are estimating
    the mean arm circumference of children 60 cm tall, and observed
    points vary about the estimated mean

                     for x = 60 cm:
                     $\hat{y} = 2.7 + 0.16 \times 60 = 12.3$ cm
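
   The plug-in calculation for children 60 cm tall can be written as a
    short sketch. The coefficients 2.7 and 0.16 are the estimates reported
    in the lecture; the function name is made up for illustration.

```python
# Estimated coefficients reported in the lecture (arm circumference on height).
B0_HAT = 2.7   # estimated intercept, cm
B1_HAT = 0.16  # estimated slope, cm of arm circumference per cm of height


def estimated_mean_arm_circ(height_cm):
    """Estimated MEAN arm circumference (cm) for children of a given height.

    This is a group average, not a prediction for one child: individual
    observations vary about this value.
    """
    return B0_HAT + B1_HAT * height_cm


mean_at_60 = estimated_mean_arm_circ(60)
print(round(mean_at_60, 1))
```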




                                                                              31
Example: Arm Circumference and Height

   How to interpret estimated slope?

     $\hat{y} = 2.7 + 0.16x$

        Here, $\hat{\beta}_1 = 0.16$

       Two ways to say the same thing:

             $\hat{\beta}_1$ is the average change in arm circumference for a
             one-unit (1 cm) increase in height

             $\hat{\beta}_1$ is the mean difference in arm circumference for two
             groups of children who differ by one unit (1 cm) in height,
             taller to shorter

          This result estimates that the mean difference in arm
          circumference for a one cm difference in height is 0.16 cm,
          with taller children having greater average arm
          circumference.
                                                                           32
Example: Arm Circumference and Height

   This mean difference estimate is constant across the entire height
    range in the sample: definition of a slope of a line




                                                      $\hat{y} = 2.7 + 0.16x$




                                                                         33
Example: Arm Circumference and Height

   What is estimated mean difference in arm circumference for:
      Children 60 cm tall versus children 59 cm tall?
      Children 25 cm tall versus children 24 cm tall?
      Children 72 cm tall versus children 71 cm tall?
      Etc….?

    Answer is the same for all of the above: 0.16 cm




                                                                  34
Example: Arm Circumference and Height

   What is estimated mean difference in arm circumference for:
      Children 60 cm tall versus children 50 cm tall?




     $\hat{y}_{x=60} - \hat{y}_{x=50} = 10 \times \hat{\beta}_1 = 10 \times 0.16$ cm $= 1.6$ cm




                                                                                     35
Example: Arm Circumference and Height

   What is estimated mean difference in arm circumference for:
      Children 90 cm tall versus children 89 cm tall?
      Children 34 cm tall versus children 33 cm tall?
      Children 110 cm tall versus children 109 cm tall?
      Etc….?

    This is a trick question!!!!




                                                                  36
Example: Arm Circumference and Height

   The range of observed heights in the sample is 40.9 cm – 73.3 cm:
    our regression results only apply to the relationship between arm
    circumference and height for this height range
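
   One way to make the range restriction concrete is to have the
    calculation refuse heights outside the observed range. This is a
    sketch under stated assumptions: the slope and the range come from the
    lecture, while the function names and the choice to raise an error are
    illustrative only.

```python
B1_HAT = 0.16                    # estimated slope from the lecture
HEIGHT_RANGE_CM = (40.9, 73.3)   # observed height range in the sample


def in_observed_range(height_cm):
    """True if the height lies within the range seen in the sample."""
    low, high = HEIGHT_RANGE_CM
    return low <= height_cm <= high


def mean_difference(height_a_cm, height_b_cm):
    """Estimated mean difference in arm circumference (cm), group a minus b.

    Raises ValueError when either height is outside the observed range,
    because the fitted line says nothing about such children.
    """
    for h in (height_a_cm, height_b_cm):
        if not in_observed_range(h):
            raise ValueError(f"{h} cm is outside the observed height range")
    return B1_HAT * (height_a_cm - height_b_cm)


print(round(mean_difference(60, 50), 2))   # 10 units apart: 10 * slope

try:
    mean_difference(90, 89)                # the "trick question"
except ValueError as err:
    print("not estimable:", err)
```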




                                                      $\hat{y} = 2.7 + 0.16x$




                                                                        37
Example: Arm Circumference and Height

   How to interpret estimated intercept?

     $\hat{y} = 2.7 + 0.16x$

        Here, $\hat{\beta}_0 = 2.7$ cm

       This is the estimated y when x = 0: the estimated mean arm
        circumference for children 0 cm tall

                Does this make sense given our sample?

                As we noted before, estimates of mean arm
                circumference only apply to the observed height range.

    Frequently, the intercept has no meaningful scientific
    interpretation on its own: but it is necessary to fully
    specify the equation of the line and to make estimates of mean arm
    circumference for groups of children with heights in the sample range.
                                                                            38
Example: Arm Circumference and Height

   Notice that x = 0 is not even on this graph (the vertical axis is at x = 39)

   [Figure: scatterplot of arm circumference versus height with the regression line $\hat{y} = 2.7 + 0.16x$]




                                                                              40
Section C

Simple Linear Regression : More Examples
Example: Hb and PCV

   Linear regressions performed with a single predictor (one x) are
    called simple linear regressions


   Linear regressions performed with more than one predictor (x's)
    are called multiple linear regressions


   In this set of lectures we are dealing with simple linear regression:
    in this section we will give three more examples




                                                                            42
Example: Hb and PCV

   Data on laboratory measurements on a random sample of 21 clinic
    patients 20-67 years old


   Question: what is the relationship between hemoglobin levels (g/dL)
    and packed cell volume (percent of packed cells)?


   Data:
      Hemoglobin (Hb): mean 14.1 g/dL, SD 2.3 g/dL, range 9.6 g/dL –
        17.1 g/dL
      Packed Cell Volume (PCV): mean 41.1%, SD 8.1%, range 25% to
        55%




                                                                          43
Visualizing Hb and PCV Relationship

   Scatterplot display




                                      44
Example: Hb and PCV

   Equation of regression line relating estimated mean hemoglobin
    (g/dL) to packed cell volume (%), from Stata:

     $\hat{y} = 5.77 + 0.20x$

       Here, $\hat{y}$ = estimated average hemoglobin (like what we
        previously would call $\bar{y}$), x = PCV, $\hat{\beta}_0 = 5.77$ and
        $\hat{\beta}_1 = 0.20$


       This is the estimated line from the sample of 21 subjects




                                                                     45
Example: Hb and PCV

   Equation of regression line relating estimated mean hemoglobin
    (g/dL) to packed cell volume (%), from Stata:

     $\hat{y} = 5.77 + 0.20x$

        $\hat{\beta}_1 = 0.20$: what are the units?

               Well, $\hat{y}$ is in g/dL and x is in percent, so $\hat{\beta}_1$ is in units of g/dL
                 per percent

          This result estimates that the mean difference in
          hemoglobin levels for two groups of subjects who differ by
          1% in PCV is 0.20 g/dL: subjects with greater PCV have
          greater Hb levels on average.




                                                                               46
Visualizing Hb and PCV Relationship

   Scatterplot display with regression line




                                               $\hat{y} = 5.77 + 0.20x$




                                                              47
Example: Hb and PCV

   What is the average difference in Hb levels for subjects with PCV of 40%
    compared to subjects with 32%?

        $\hat{\beta}_1 = 0.20$ compares groups of subjects who differ in PCV by 1%
    (it is positive, so those with the greater PCV have hemoglobin levels
    0.20 g/dL greater on average)


        To compare subjects with PCV of 40% versus subjects with 32%,
    which is an 8-unit difference in x, take

               $8 \times \hat{\beta}_1 = 8 \times 0.20 = 1.6$ g/dL




                                                                            48
Example: Hb and PCV

   What is the estimated mean Hb level for subjects with PCV of 41%?


                 $\hat{y} = 5.77 + 0.20x$

Plugging 41% into the equation,

          $\hat{y} = 5.77 + 0.20 \times 41 = 13.97$ g/dL
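
   Both Hb calculations (the 40% versus 32% comparison and the estimated
    mean at 41%) can be reproduced with a few lines. The coefficients are
    the lecture's estimates; the function and variable names are made up
    for illustration.

```python
# Estimated coefficients reported in the lecture (Hb on PCV).
B0_HAT = 5.77  # estimated intercept, g/dL
B1_HAT = 0.20  # estimated slope, g/dL per percentage point of PCV


def estimated_mean_hb(pcv_percent):
    """Estimated mean hemoglobin (g/dL) for subjects with the given PCV (%)."""
    return B0_HAT + B1_HAT * pcv_percent


# Mean difference for groups 8 percentage points apart: 8 * slope.
diff_40_vs_32 = B1_HAT * (40 - 32)

# Estimated mean Hb at PCV = 41%.
mean_at_41 = estimated_mean_hb(41)

print(round(diff_40_vs_32, 2), round(mean_at_41, 2))
```

   Note that the difference uses only the slope: the intercept cancels
    when one estimated mean is subtracted from the other.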




                                                                49
Example: Wages and Education Level

   Data on hourly wages from a random sample of 534 U.S. workers in
    1985


   Question: what is the relationship between hourly wage (US$) and
    years of formal education?


   Data:
      Hourly wages: mean $9.04/hr, SD $5.13/hr, range $1.00/hr –
        $44.50/hr
      Years of formal education: mean 13.0 years, SD 2.6 years, range
        2 years – 18 years




                                                                        50
Visualizing Wages and Education Level Relationship

   Scatterplot display




                                                     51
Example: Wages and Education Level

   Equation of regression line relating estimated mean hourly wages
    (US$) to years of education, from Stata:

        $\hat{y} = -0.75 + 0.75x$

       Here, $\hat{y}$ = estimated average hourly wage (like what we
        previously would call $\bar{y}$), x = years of formal education,
        $\hat{\beta}_0 = -0.75$ and $\hat{\beta}_1 = 0.75$

       This is the estimated line from the sample of 534 subjects




                                                                       52
Visualizing Wages and Education Level Relationship

   Scatterplot display with regression line




                                                     53
Wages and Education Level

   What is the interpretation of the slope estimate?




                                                    54
Example: Arm Circumference and Sex

   Data on anthropometric measures from a random sample of 150
    Nepali children [0, 12) months old


   Question: what is the relationship between average arm
    circumference and sex of a child?


   Data:
      Arm circumference: mean 12.4 cm, SD 1.5 cm, range 7.3 cm –
        15.6 cm
      Sex: 51% female




                                                                    55
Visualizing Arm Circumference and Sex Relationship

   Scatterplot display




                                                     56
Visualizing Arm Circumference and Sex Relationship

   Boxplot display




                                                     57
Example: Arm Circumference and Sex

   Here, y is arm circumference, a continuous measure: x is not
    continuous, but binary – male or female


   How to handle sex as an "x" in regression?
      One possibility: x = 0 for male children, x = 1 for female
       children


   The equation we will estimate

                          $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$

    How to interpret regression coefficients?




                                                                   58
Example: Arm Circumference and Sex

    Notice: this equation is only estimating two values: the mean arm
     circumference for male children, and the mean for female children


    For female children:

                $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 \times 1 = \hat{\beta}_0 + \hat{\beta}_1$

    For male children:

                $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 \times 0 = \hat{\beta}_0$

    So $\hat{\beta}_1$ is still a slope estimating the mean difference in y for a one-unit
     difference in x: but the only possible one-unit difference is 1 (females)
     to 0 (males)

    $\hat{\beta}_0$ actually has substantive meaning in this example: it is the
     average arm circumference for male children
                                                                             59
Example: Arm Circumference and Sex

    The resulting equation

                      $\hat{y} = 12.5 - 0.13x$

    $\hat{\beta}_1 = -0.13$: the estimated mean difference in arm circumference
     for female children compared to male children is -0.13 cm; female
     children have lower arm circumference by 0.13 cm on average

    $\hat{\beta}_0 = 12.5$: the mean arm circumference for male children is 12.5
     cm
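
    With a 0/1 predictor the fitted line reduces to two group means. A
     minimal sketch using the lecture's estimates (12.5 and -0.13) and the
     lecture's coding (0 = male, 1 = female); the names are made up for
     illustration.

```python
# Estimated coefficients reported in the lecture (arm circumference on sex).
B0_HAT = 12.5    # estimated mean arm circumference for males (x = 0), cm
B1_HAT = -0.13   # estimated mean difference, females minus males, cm

MALE, FEMALE = 0, 1   # coding used in the lecture


def estimated_mean_arm_circ(sex_code):
    """Estimated mean arm circumference (cm) for sex_code 0 (male) or 1 (female)."""
    if sex_code not in (MALE, FEMALE):
        raise ValueError("sex_code must be 0 (male) or 1 (female)")
    return B0_HAT + B1_HAT * sex_code


mean_male = estimated_mean_arm_circ(MALE)       # equals the intercept
mean_female = estimated_mean_arm_circ(FEMALE)   # intercept plus slope

print(mean_male, round(mean_female, 2), round(mean_female - mean_male, 2))
```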




                                                                          60
Visualizing Arm Circumference and Sex Relationship

   Scatterplot display with regression line




                                                     61
Section D

Simple Linear Regression Model: Estimating the
Regression Equation, Accounting for Uncertainty in the
Estimates
Example: Arm Circumference and Height

   So in the last section, we showed the results from several simple
    linear regression models


   For example, when relating arm circumference to height using a
    random sample of 150 Nepali children < 12 months old, I told you
    that the resulting regression equation was:

                      $\hat{y} = 2.7 + 0.16x$


   I told you this came from Stata, and will show you how to do
   regression with Stata shortly: but how does Stata estimate this
   equation?




                                                                        63
Example: Arm Circumference and Height

   There must be some algorithm that will always yield the same
    results for the same data set




                                                                   64
Example: Arm Circumference and Height

   The algorithm to estimate the equation of the line is called
    "least squares" estimation


   The idea is to find the line that gets "closest" to all of the points in
    the sample


    How to define closeness to multiple points?


    In regression, closeness is defined as the cumulative squared
    distance between each point's y-value and the corresponding value
    of $\hat{y}$ for that point's x: in other words, the squared distance
    between an observed y-value and the estimated y-value for all
    points with the same value of x.



                                                                               65
Example: Arm Circumference and Height

    Each distance is $y - \hat{y} = y - (\hat{\beta}_0 + \hat{\beta}_1 x)$: this is computed for each
    data point in the sample




                                                                            66
Example: Arm Circumference and Height

   The algorithm to estimate the equation of the line is called
    "least squares" estimation


    The values chosen for $\hat{\beta}_0$ and $\hat{\beta}_1$ are the values that minimize the
    cumulative squared distances, i.e.

          $\min_{\hat{\beta}_0,\,\hat{\beta}_1} \sum_{i=1}^{n} \left[ y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right]^2$
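
    For a single predictor, the minimizing values have a well-known closed
     form: the slope is the sum of cross-products of deviations divided by
     the sum of squared x-deviations, and the intercept makes the line pass
     through the point of means. The sketch below illustrates this; it is
     not Stata's internal code, and the five data points are made up for
     illustration (they are not the Nepali sample).

```python
def least_squares_line(xs, ys):
    """Return (b0, b1) minimizing the sum of squared vertical distances."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx            # estimated slope
    b0 = y_bar - b1 * x_bar   # estimated intercept: line passes through the means
    return b0, b1


def sum_squared_distances(xs, ys, b0, b1):
    """Cumulative squared distance between each y and the line's value at its x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical toy data (heights in cm, arm circumferences in cm).
heights = [50.0, 55.0, 60.0, 65.0, 70.0]
arm_circs = [10.9, 11.4, 12.4, 13.0, 13.8]

b0_hat, b1_hat = least_squares_line(heights, arm_circs)
best = sum_squared_distances(heights, arm_circs, b0_hat, b1_hat)

# Any other line gives a larger cumulative squared distance.
worse = sum_squared_distances(heights, arm_circs, b0_hat + 0.1, b1_hat)
print(b0_hat, b1_hat, best < worse)
```

    Because the calculation is deterministic, the same data set always
     yields the same estimated line.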




                                                                        67
Example: Arm Circumference and Height

    The values chosen for $\hat{\beta}_0$ and $\hat{\beta}_1$ are just estimates based on a
    single sample. If we were to have a different random sample of 150
    Nepali children from the same population of <12 month olds, the
    resulting estimates would likely be different: i.e. the values that
    minimized the cumulative squared distance from this second sample
    of points would likely be different


   As such, all regression coefficients have an associated standard
    error that can be used to make statements about the true
    relationship between mean y and x (for example, the true slope $\beta_1$)
    based on a single sample




                                                                           68
Example: Arm Circumference and Height

   The estimated regression equation relating arm circumference to
    height, using a random sample of 150 Nepali children < 12 months
    old, was:

                        $\hat{y} = 2.7 + 0.16x$

        $\hat{\beta}_1 = 0.16$ and $\widehat{SE}(\hat{\beta}_1) = 0.014$
        $\hat{\beta}_0 = 2.70$ and $\widehat{SE}(\hat{\beta}_0) = 0.88$




                                                                       69
Example: Arm Circumference and Height

   The random sampling behavior of estimated regression coefficients is
    normal for large samples (n>60), and centered at the true values




   As such, we can use the same ideas to create 95% CIs and get p-values




                                                                        70
Example: Arm Circumference and Height

   The estimated regression equation relating arm circumference to
    height, using a random sample of 150 Nepali children < 12 months
    old, was:

                          $\hat{y} = 2.7 + 0.16x$

         $\hat{\beta}_1 = 0.16$ and $\widehat{SE}(\hat{\beta}_1) = 0.014$

   95% CI for β1:

    $\hat{\beta}_1 \pm 2 \times \widehat{SE}(\hat{\beta}_1) = 0.16 \pm 2 \times 0.014 = (0.13, 0.19)$




                                                                       71
Example: Arm Circumference and Height

   p-value for testing:

                 Ho: β1 = 0
                 HA: β1 ≠ 0

    Assume the null is true, and calculate the standardized "distance" of $\hat{\beta}_1$ from
    0:

         $t = \frac{\hat{\beta}_1 - 0}{\widehat{SE}(\hat{\beta}_1)} = \frac{\hat{\beta}_1}{\widehat{SE}(\hat{\beta}_1)} = \frac{0.16}{0.014} = 11.4$

    The p-value is the probability of being 11.4 or more standard errors away
    from a mean of 0 on a normal curve: very low in this example, p <
    .001
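
    The large-sample confidence interval and p-value can be reproduced
     with the Python standard library. This is a sketch, not Stata's
     output: it uses the lecture's estimate and standard error, the
     lecture's multiplier of 2, and a two-sided normal tail area computed
     from the complementary error function. Variable names are made up for
     illustration.

```python
import math

b1_hat = 0.16    # estimated slope from the lecture
se_b1 = 0.014    # estimated standard error of the slope from the lecture

# Approximate 95% CI: estimate plus or minus 2 standard errors.
ci_low = b1_hat - 2 * se_b1
ci_high = b1_hat + 2 * se_b1

# Standardized distance of the estimate from the null value of 0.
t_stat = (b1_hat - 0) / se_b1

# Two-sided tail area on a standard normal curve:
# P(|Z| >= |t|) = erfc(|t| / sqrt(2)).
p_value = math.erfc(abs(t_stat) / math.sqrt(2))

print(round(ci_low, 2), round(ci_high, 2), round(t_stat, 1), p_value < 0.001)
```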


                                                                           72
Summarizing findings: Arm Circumference and Height

   This research used simple linear regression to estimate the
    magnitude of the association between arm circumference and
    height in Nepali children less than 12 months old, using data on a
    random sample of 150. A statistically significant positive
    association was found (p<.001). The results estimate that two
    groups of such children who differ by 1 cm in height will differ on
    average by 0.16 cm in arm circumference (95% CI 0.13 cm to 0.19
    cm).




                                                                          73
Summarizing findings: Arm Circumference and Height

   Finally: Stata!


   If you have your "y" and "x" values entered in Stata, then to do
    linear regression use the regress command:
       regress y x

   Data snippet from Stata




                                                                       74
Using Stata: Arm Circumference and Height

   regress armcirc height

   [Stata regression output not reproduced here; in the coefficient table, the estimated intercept $\hat{\beta}_0$ and the estimated slope $\hat{\beta}_1$ are highlighted in turn]

                   $\hat{y} = 2.7 + 0.16x$

                                           79
Example 2: Arm Circumference and Height
   Give an estimate and 95% CI for the mean difference in arm
    circumference for children 60 cm tall compared to children 50 cm
    tall
      From the previous section we know this estimated mean difference is

         $(60 - 50) \times \hat{\beta}_1 = 10\hat{\beta}_1 = 10 \times 0.16 = 1.6$ cm

       How to get the standard error? Well, as it turns out:

                 $\widehat{SE}(10 \times \hat{\beta}_1) = 10 \times \widehat{SE}(\hat{\beta}_1)$

                 $\widehat{SE}(10\hat{\beta}_1) = 10 \times 0.014 = 0.14$

       95% CI for the mean difference:

                 $10\hat{\beta}_1 \pm 2\,\widehat{SE}(10\hat{\beta}_1)$

                 $1.6 \pm 2 \times 0.14 = (1.32, 1.88)$ cm
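
   The scaling rule generalizes to any difference of k units in x: both
    the estimate and its standard error are multiplied by k. A minimal
    sketch using the lecture's numbers; the function name and the default
    multiplier argument are made up for illustration.

```python
def scaled_difference_ci(b1_hat, se_b1, k, multiplier=2.0):
    """Estimate and approximate 95% CI for the mean difference over k units of x.

    Uses SE(k * b1) = |k| * SE(b1) and the large-sample multiplier of 2.
    """
    estimate = k * b1_hat
    se_k = abs(k) * se_b1
    return estimate, (estimate - multiplier * se_k, estimate + multiplier * se_k)


# Children 60 cm tall versus 50 cm tall: k = 10.
estimate, (ci_low, ci_high) = scaled_difference_ci(0.16, 0.014, k=60 - 50)
print(round(estimate, 2), round(ci_low, 2), round(ci_high, 2))
```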

                                                                       80
Example 2: Hemoglobin and "Packed Cell Volume"

   Equation of regression line relating estimated mean hemoglobin
    (g/dL) to packed cell volume, from Stata:

       $\hat{y} = 5.77 + 0.20x$


   Snippet of data in Stata




                                                                     81
Example 2: Hemoglobin and "Packed Cell Volume"

   regress Hb PCV




Example 2: Hemoglobin and "Packed Cell Volume"

   Same idea with computation of 95% CI and p-value as we saw before


   However, with small (n<60) samples, a slight change analogous to
    what we did with means and differences in means before


   Sampling distributions of regression coefficients are not quite normal,
    but follow a t-distribution with n-2 degrees of freedom


   95% CI for β1:
       $\hat{\beta}_1 \pm t_{.95,\,n-2} \times \widehat{SE}(\hat{\beta}_1)$
        In this example
       $\hat{\beta}_1 \pm t_{.95,\,19} \times \widehat{SE}(\hat{\beta}_1)$, i.e. $0.20 \pm 2.09 \times 0.046$, giving (0.10, 0.30)

Example: Hemoglobin and "Packed Cell Volume"

   p-value for testing:

                 Ho: β1 = 0
                 HA: β1 ≠ 0

   Assume null true, and calculate standardized "distance" of $\hat{\beta}_1$ from
    0

       $t = \dfrac{\hat{\beta}_1 - 0}{\widehat{SE}(\hat{\beta}_1)} = \dfrac{\hat{\beta}_1}{\widehat{SE}(\hat{\beta}_1)} = \dfrac{0.20}{0.046} = 4.35$



    p-value is probability of being 4.35 or more standard errors away
    from mean of 0 on a t curve with 19 degrees of freedom: very low
    in this example, p < .001
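
As a rough check on the p-value, the tail area of the t curve can be computed numerically. This Python sketch is an illustration only: it integrates the t density with Simpson's rule instead of using Stata, and the function names `t_pdf` and `two_sided_p` are invented for the example.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with `df` degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t_stat, df, steps=2000):
    """Two-sided p-value; integrates the density from 0 to |t| (Simpson's rule).

    `steps` must be even.
    """
    b = abs(t_stat)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_half = total * h / 3  # P(0 < T < |t|)
    return 2 * (0.5 - central_half)


t_stat = (0.20 - 0) / 0.046  # estimate minus null value, divided by its SE
p_value = two_sided_p(t_stat, df=19)
print(f"t = {t_stat:.2f}, p = {p_value:.5f}")
```

The computed p-value falls below .001, in line with the slide.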


Interpreting Result of 95% CI

   So, the estimated slope is 0.2 with 95% CI 0.10 to 0.30


   How to interpret results?
      Based on a sample of 21 subjects, we estimated that PCV(%) is
       positively associated with hemoglobin levels
      We estimated that a one-percent increase in PCV is associated
       with a 0.2 g/dL increase in hemoglobin on average
      Accounting for sampling variability, this mean increase could be
       as small as 0.10 g/dL, or as large as 0.3 g/dL in the population
       of all such subjects




Interpreting Result of 95% CI

   In other words:
      We estimated the average difference in hemoglobin levels
         for two groups of subjects who differ by one percent in PCV to
         be 0.2 g/dL (higher PCV group relative to lower)
      Accounting for sampling variability, this mean difference could be
         as small as 0.10 g/dL, or as large as 0.3 g/dL in the population
         of all subjects




What about Intercepts?

   In this section, all examples have confidence intervals for the slope,
    and multiples of the slope


   We can also create confidence intervals/p-values for the intercept
    in the same manner (and Stata presents it in the output). However
    as we noted in the previous section, many times the intercept is just
    a placeholder and does not describe a useful quantity: as such, 95%
    CIs and p-values are not always relevant




Section E

Measuring the Strength of A Linear Association
Strength of Association

   The slope of a regression line estimates the magnitude and direction
    of the relationship between y and x: it encapsulates how much y
    differs on average with differences in x


   The slope estimate and standard error can be used to address the
    uncertainty in this estimate with regard to the true magnitude
    and direction of the association in the population from which the
    sample was taken


   Slopes do not impart any information about how well the regression
    line fits the data in the sample; the slope gives no indication of how
    close the points get to the estimated regression line




Strength of Association

   Another quantity that can be estimated via linear regression is the
    coefficient of determination, R2: this is a number that ranges from
    0 to 1, with larger values indicating "closer fits" of the data points
    to the regression line


   R2 measures strength of association by comparing variability of
    points around the regression line to variability in y-values ignoring x




Example: Arm Circumference and Height

   How close do the points get to the line – can we quantify?




Example: Arm Circumference and Height

   (SR1 Flashback) The sample standard deviation of the y-values
    ignoring the corresponding potential information in x is

       $s = \sqrt{\dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}}$

        this measures how far on average each of the sample y values
        falls from the overall mean of all y-values
       In this example s=1.48 cm
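
The value s = 1.48 cm can be recovered from the "Total" row of the Stata regress output shown later in this lecture, which reports the sum of squared deviations of arm circumference about its mean. A small Python sketch of that calculation (Python is used here only for illustration):

```python
import math

# Total sum of squares of arm circumference about its mean, and the
# sample size, as reported in the Stata regress output
total_ss = 326.137932
n = 150

# Sample standard deviation: sqrt of the sum of squares over n - 1
s = math.sqrt(total_ss / (n - 1))
print(f"s = {s:.2f} cm")
```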




Example: Arm Circumference and Height

   "Visualization" on the scatterplot




Example: Arm Circumference and Height

   Standard deviation of regression, referred to as root mean square
    error, is the "average" distance of points from the line: how far on
    average each y falls from its mean predicted by its
    corresponding x-value

       $s_{y|x} = \sqrt{\dfrac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-2}}$




Example: Arm Circumference and Height

   Each distance is $y - \hat{y} = y - (\hat{\beta}_0 + \hat{\beta}_1 x)$: this is computed for each
    data point in the sample
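
To make the distances concrete, the sketch below computes them for a handful of points and combines them into the root mean square error. It is written in Python for illustration. The four (height, arm circumference) pairs are hypothetical, not the study data, and the line is the one from the slides rather than a refit to these points.

```python
import math


def predict(height_cm, intercept=2.7, slope=0.16):
    """Estimated mean arm circumference (cm) from the fitted line in the slides."""
    return intercept + slope * height_cm


# Hypothetical (height cm, arm circumference cm) pairs, for illustration only
points = [(60.0, 12.0), (75.0, 15.2), (90.0, 16.5), (100.0, 19.3)]

# Distance of each observed y from its predicted mean
residuals = [y - predict(x) for x, y in points]

# Root mean square error: n - 2 in the denominator because two
# coefficients (intercept and slope) are estimated
rmse = math.sqrt(sum(r ** 2 for r in residuals) / (len(points) - 2))

for (x, y), r in zip(points, residuals):
    print(f"height {x:5.1f}: observed {y:4.1f}, predicted {predict(x):4.1f}, residual {r:+.1f}")
print(f"s_y|x = {rmse:.2f}")
```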




Using Stata: Arm Circumference and Height

   regress command in Stata gives $s_{y|x}$ (reported as Root MSE)




Example: Arm Circumference and Height

   If s = sy|x, then knowing x does not yield a better guess for the mean
    of y than using the overall mean y (flat regression line)


   The smaller sy|x is relative to s, the closer the points are to the
    regression line


   R2 functionally measures how much smaller sy|x is than s: as such it
    is an estimate of the amount of variability in y explained by taking x
    into account
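
A rough numerical version of this comparison, as a Python sketch for illustration: s = 1.48 cm comes from the earlier slide, while s_y|x = 1.09 cm (rounded Root MSE) and n = 150 are read from the Stata output shown later in the lecture. Because the inputs are rounded, the result is only approximate.

```python
n = 150          # number of children in the sample
s = 1.48         # SD of arm circumference ignoring height (cm)
s_yx = 1.09      # root mean square error from the regression (cm)

unexplained = (n - 2) * s_yx ** 2   # variability about the regression line
total = (n - 1) * s ** 2            # variability about the overall mean
r_squared = 1 - unexplained / total
print(f"R-squared is roughly {r_squared:.2f}")
```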




Using Stata: Arm Circumference and Height

   regress command in Stata gives R2: child's height explains (an
    estimated) 46% of the variation in arm circumferences




Example: Arm Circumference and Height

   R2 and r


   r = the properly signed square root of R2; the proper sign is the same
    sign as the slope in the regression


   r is called the correlation coefficient (not to be confused with the
    "regression coefficients" – great names, huh)


   Allowable values
      0 ≤ R2 ≤ 1
      If relationship between y and x is positive 0 ≤ r ≤ 1
      If relationship between y and x is negative -1 ≤ r ≤ 0

   In this example, $r = +\sqrt{R^2} = +\sqrt{0.46} = 0.68$
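
The sign rule for r can be written as a small function. This is a Python sketch for illustration; `correlation_from_r_squared` is a made-up name, and the second call uses invented inputs to show the negative case.

```python
import math


def correlation_from_r_squared(r_squared, slope):
    """Return r: the square root of R-squared, carrying the sign of the slope."""
    if not 0 <= r_squared <= 1:
        raise ValueError("R-squared must lie between 0 and 1")
    return math.copysign(math.sqrt(r_squared), slope)


# Arm circumference on height: R-squared 0.46, positive slope 0.16
print(round(correlation_from_r_squared(0.46, 0.16), 2))   # 0.68

# Hypothetical negative-slope case, for illustration only
print(round(correlation_from_r_squared(0.25, -1.5), 2))   # -0.5
```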

Example: Arm Circumference and Height

   So from the example: child height explains (an estimated) 46% of
    the variation in arm circumferences
      This is just an estimate based on the sample; a 95% CI can be
         computed but it's not easy to do, and not given readily by the
         computer; also the procedure for estimating the 95% CI is not so
         good


   So this means an estimated 54% of the variability in arm
    circumference is not explained by child's height
      Some of this unexplained variability may be explained by factors
         other than height
      Multiple linear regression will allow us to estimate the
         relationship between arm circumference, height and other child
         characteristics in one analysis



Example 2: Hemoglobin and "Packed Cell Volume"

   regress command in Stata gives R2: PCV explains (an estimated) 51%
    of the variation in hemoglobin levels




Example: Hemoglobin and PCV

   regress command in Stata gives R2 of 0.51; the slope is positive, so
    $r = +\sqrt{R^2} = +\sqrt{0.51} = 0.71$




Example 3: Wages and Years of Education

   regress command in Stata gives R2: years of education explains (an
    estimated) 15% of the variation in hourly wages




Here $r = +\sqrt{R^2} = +\sqrt{0.15} = 0.39$

Example 4: Arm Circumference and Child Sex

   regress command in Stata gives R2: sex(female=1) explains (an
    estimated) 0.2% of the variation in arm circumference




Here $r = -\sqrt{R^2} = -\sqrt{0.002} = -0.045$. In this sample of data female sex is
  negatively correlated with arm circumference.

What's a "Good" R2

   There are a couple of important things to keep in mind about R2 and
    r


    - These quantities are both estimates based on the sample of data;
    they are frequently reported without any recognition of sampling
    variability, for example a 95% confidence interval


    - Low R2 and r are not necessarily "bad"


    - Many outcomes cannot/will not be fully or close to fully
    explained, in terms of variability, by any one single predictor




What's a "Good" R2

   The higher the R2 value, the better x predicts y for individuals
    in a sample/population, as individual y-values vary less about their
    estimated means based on x


   However, there may be important overall associations between
    mean of y and x even though there is still a lot of individual
    variability in y-values about their means estimated by x
      In the wages example, years of education explained an
        estimated 15% of the variability in hourly wages
      The association was statistically significant, showing that
        average wages were greater for persons with more years of
        education
      However, for any single education level (year), there was still a
        lot of variation in wages for individual workers



Slope versus R2

   Slope estimates the magnitude and direction of the relationship
    between y and x
      Estimates a mean difference in y for two groups who differ by
        one-unit in x
      The slope will change if the units change for y and/or for x
      Larger slopes not indicative of stronger linear association:
        smaller slopes not indicative of weaker linear association


   R2 measures strength of linear association; r measures strength and
    direction
      Neither R2 nor r measures magnitude
      Neither R2 nor r changes with changes in units



Using Stata: Arm Circumference and Height

   Regression of arm circumference (cm) on height in centimeters

   $\hat{y} = 2.7 + 0.16x$

R2 = 0.46 or 46%; $\hat{\beta}_1 = 0.16$

Using Stata: Arm Circumference and Height

   Regression of arm circumference on height in inches


    . regress   armcirc height_inch

          Source |       SS       df       MS             Number of obs    =      150
    -------------+------------------------------          F( 1,     148)   =   124.30
           Model | 148.874589      1 148.874589           Prob > F         =   0.0000
        Residual | 177.263343    148 1.19772529           R-squared        =   0.4565
    -------------+------------------------------          Adj R-squared    =   0.4528
           Total | 326.137932    149 2.18884518           Root MSE         =   1.0944

    ------------------------------------------------------------------------------
         armcirc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
     height_inch |   .4008806    .035957    11.15   0.000     .3298251     .471936
           _cons |   2.695906   .8774225     3.07   0.003     .9620119    4.429801
    ------------------------------------------------------------------------------



   $\hat{y} = 2.7 + 0.40x$

R2 = 0.46 or 46%; $\hat{\beta}_1 = 0.40$
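
The effect of changing units can be demonstrated on a small example. The Python sketch below (for illustration; the lecture uses Stata) fits a least-squares line to hypothetical data twice, once with height in centimeters and once in inches. The data values and the helper name `fit_line` are invented for this example.

```python
def fit_line(xs, ys):
    """Least-squares intercept, slope, and R-squared for y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    ss_resid = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, 1 - ss_resid / ss_total


# Hypothetical heights (cm) and arm circumferences (cm), for illustration only
height_cm = [60.0, 70.0, 80.0, 90.0, 100.0]
armcirc = [12.5, 13.4, 15.9, 16.8, 19.0]
height_inch = [h / 2.54 for h in height_cm]

b0_cm, b1_cm, r2_cm = fit_line(height_cm, armcirc)
b0_in, b1_in, r2_in = fit_line(height_inch, armcirc)

print(f"cm:     slope {b1_cm:.3f}, R-squared {r2_cm:.3f}")
print(f"inches: slope {b1_in:.3f}, R-squared {r2_in:.3f}")
```

The slope per inch is 2.54 times the slope per centimeter, while the intercept and R-squared are unchanged. The same relation holds in the Stata output above: 0.4008806 per inch divided by 2.54 is about 0.158 per centimeter, which rounds to 0.16.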

Section F

Optional: Some FYIs about SLR
Standard Error of Slopes

   Just FYI: standard error of estimated slope is a combination of
    variation in y-values around regression line, and spread of x values
      Definition: standard deviation of regression, called 'root mean
         squared error', is functionally the average distance of any single
         point from estimated mean of all y-values with same x, i.e. the
         corresponding value on regression line
      For simple linear regression, d.f.=n-2 and

         $s_{y|x} = \sqrt{\dfrac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-2}}$

        Estimated standard error of slope estimate

         $\widehat{SE}(\hat{\beta}_1) = \dfrac{s_{y|x}}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$



Standard Error of Slopes

   Estimated standard error of slope estimate

         $\widehat{SE}(\hat{\beta}_1) = \dfrac{s_{y|x}}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$

   Notice this will be larger
      The more variable the y-values are around their corresponding
        mean estimates on the regression line (i.e. the greater sy|x is)
      The less variable the x-values are around the mean of x: hmm…
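
The second point can be seen numerically. The following Python sketch (illustration only; data and the function name `slope_standard_error` are hypothetical) applies the formula above to the same y-values paired with x-values that are bunched together and then spread twice as far apart.

```python
import math


def slope_standard_error(xs, ys):
    """Estimated SE of the least-squares slope: s_y|x / sqrt(sum((x - x_bar)^2))."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    ss_resid = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    s_yx = math.sqrt(ss_resid / (n - 2))
    return s_yx / math.sqrt(sxx)


# Hypothetical data: same y-values, x-values bunched together vs spread out
ys = [1.0, 3.0, 2.0, 4.0]
narrow_x = [1.0, 2.0, 3.0, 4.0]
wide_x = [2.0, 4.0, 6.0, 8.0]

se_narrow = slope_standard_error(narrow_x, ys)
se_wide = slope_standard_error(wide_x, ys)
print(f"narrow x: SE = {se_narrow:.3f}; wide x: SE = {se_wide:.3f}")
```

Doubling the spread of x leaves the scatter about the line unchanged here, so the standard error of the slope is cut in half.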




Actual Computation of R2

   How do we actually compute R2?
      Recall interpretation: percent of variability in y explained by x

   Total Variability in y?
      Actually, for the R2 computation

         total variability in y $= (n - 1) \times s^2 = \sum_{i=1}^{n} (y_i - \bar{y})^2$




Example: Arm Circumference and Height

   "Visualization" on the scatterplot: distance of each point from the
    flat line at $\bar{y}$, squared and added together




R2: Arm Circumference and Height

   Regression of arm circumference on height in centimeters: total
    variability in y




Actual Computation of R2

   Total Variability in y not explained by x?
      For the R2 computation

         total variability in y not explained by x $= (n - 2) \times s_{y|x}^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$




Example: Arm Circumference and Height

   Each distance is $y - \hat{y} = y - (\hat{\beta}_0 + \hat{\beta}_1 x)$: this is computed for each
    data point in the sample, squared and summed




R2: Arm Circumference and Height

   Regression of arm circumference on height in centimeters: total
    variability in y not explained by x




Actual Computation of R2

   Percentage of variability in y NOT explained by x

         $\dfrac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$

   R2 is percentage of variability in y explained by x

         $1 - \dfrac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = 1 - \dfrac{177.26}{326.14} = 1 - 0.54 = 0.46$
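
The same computation can be carried out with the unrounded sums of squares from the Stata output shown earlier (the "Residual" and "Total" rows). This is a Python sketch for illustration; it also recovers the Root MSE from the residual sum of squares.

```python
import math

# Sums of squares and sample size from the Stata regress output
ss_residual = 177.263343   # variability in y not explained by x
ss_total = 326.137932      # total variability in y
n = 150

fraction_unexplained = ss_residual / ss_total
r_squared = 1 - fraction_unexplained
root_mse = math.sqrt(ss_residual / (n - 2))   # s_y|x

print(f"unexplained {fraction_unexplained:.2f}, "
      f"R-squared {r_squared:.4f}, Root MSE {root_mse:.4f}")
```

Both results agree with the R-squared (0.4565) and Root MSE (1.0944) that Stata reports.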



