# Biostatistics

Simple Linear Regression

Statistical Reasoning 2, Lecture 4
Section A

Review: The Equation of a Line

Recall from algebra that two values uniquely define any line:
- y-intercept: where the line crosses the y-axis (when x = 0)
- Slope: the "rise over the run", that is, how much y changes for each one-unit change in x

In algebra, the line is usually written as

$y = mx + b$
b = y-intercept
m = slope

Of course statisticians must have their own notation!

$y = b_0 + b_1 x$
$b_0$ = y-intercept
$b_1$ = slope

$y = \beta_0 + \beta_1 x$
$\beta_0$ = y-intercept
$\beta_1$ = slope

The Intercept, $\beta_0$

The intercept $\beta_0$ is the value of y when x is 0
- It is the point on the graph where the line crosses the y (vertical) axis, at the coordinate (0, $\beta_0$)

$y = \beta_0 + \beta_1 x$

The Slope, $\beta_1$

The slope $\beta_1$ is the change in y corresponding to a unit increase in x

$y = \beta_0 + \beta_1 x$

Another interpretation: $\beta_1$ is the difference in y-values for x+1 compared to x

This change/difference is the same across the entire line

All information about the difference in the y-value for two differing values of x is contained in the slope!

For example: two values of x three units apart will have a difference in y-values of $3 \times \beta_1$ ($3\beta_1$)

If the slope $\beta_1 = 0$, there is no association (i.e., the values of y are the same regardless of the values of x)
If the slope $\beta_1 > 0$, there is a positive association (i.e., the values of y increase with increasing values of x)
If the slope $\beta_1 < 0$, there is a negative association (i.e., the values of y decrease with increasing values of x)
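
As a minimal sketch of the slope interpretation above, the following Python snippet evaluates a line at a few x-values. The function name `line` and the intercept and slope values (1.0 and 0.5) are illustrative assumptions, not numbers from the lecture.

```python
def line(x, beta0, beta1):
    """Return y on the line y = beta0 + beta1 * x."""
    return beta0 + beta1 * x

# Hypothetical intercept and slope, chosen only for illustration.
beta0, beta1 = 1.0, 0.5

# Difference in y for x + 1 compared to x: equals the slope.
one_unit_diff = line(4, beta0, beta1) - line(3, beta0, beta1)

# Difference in y for two x-values three units apart: 3 * slope.
three_unit_diff = line(7, beta0, beta1) - line(4, beta0, beta1)

print(one_unit_diff, three_unit_diff)  # 0.5 1.5
```

Whichever starting x is used, the one-unit difference stays equal to the slope, which is what "the same across the entire line" means.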

The Equation of a Line

In linear regression situations, points don't fit exactly to a line

We estimate a line that relates the mean of an outcome y to a predictor x

$E[y] = \hat{\beta}_0 + \hat{\beta}_1 x$

E[y] = estimated "expected" (mean) value of y
$\hat{\beta}_0$ = estimated y-intercept
$\hat{\beta}_1$ = estimated slope

$\hat{\beta}_0$ and $\hat{\beta}_1$ are called estimated regression coefficients

These two quantities are estimated using the data
- The line estimated is the line that "fits the data best"

Many times the equation is just written as:

$y = \hat{\beta}_0 + \hat{\beta}_1 x$

or

$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$

We will see that in a regression context, $\hat{\beta}_1$ is nothing more than the estimated mean difference in y between two groups who differ by one unit in x
- i.e., how much the mean of y changes for a one-unit increase in x

Section B

Linear Regression: Motivating Example

Example: Arm Circumference and Height

Data on anthropometric measures from a random sample of 150 Nepali children [0, 12) months old

Question: what is the relationship between average arm circumference and height?

Data:
- Arm circumference: mean 12.4 cm, SD 1.5 cm, range 7.3 cm – 15.6 cm
- Height: mean 61.6 cm, SD 6.3 cm, range 40.9 cm – 73.3 cm

Approach 1: Arm Circumference and Height

Dichotomize height at the median, and compare mean arm circumference between the two height groups with a t-test and 95% CI

Advantages:
- We know how to do it!
- Gives a single summary measure (sample mean difference) for quantifying the arm circumference/height association

Disadvantages:
- Throws away a lot of information in the height data, which was originally measured as continuous
- Only allows for a single comparison between two crudely defined height categories

Approach 2: Arm Circumference and Height

Categorize height into 4 categories by quartile, and compare mean arm circumference across the categories with ANOVA and 95% CIs

Advantages:
- We know how to do it!
- Uses a less crude categorization of height than the previous approach of dichotomizing

Disadvantages:
- Still throws away a lot of information in the height data, which was originally measured as continuous
- Requires multiple summary measures (6 sample mean differences, one for each unique pair of height categories) to quantify the arm circumference/height relationship
- Does not exploit the structure we see in the previous boxplot: as height increases, so does arm circumference

Approach 3: Arm Circumference and Height

What about treating height as continuous when estimating the arm circumference/height relationship?

Linear regression is a potential option: it allows us to associate a continuous outcome with a continuous predictor via a line
- The line estimates the mean value of the outcome for each continuous value of height in the sample used
- Makes a lot of sense, but only if a line reasonably describes the outcome/predictor relationship

Linear regression can also use binary or categorical predictors (will show later in this set of lectures)

Visualizing Arm Circumference and Height Relationship

A useful visual display for assessing the nature of the association between two continuous variables: a scatterplot

Question: does a line reasonably describe the general shape of the relationship between arm circumference and height?

We can estimate a line using the computer (details to come in a subsequent lecture section)

The line we estimate will be of the form:

$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$

Here, $\hat{y}$ is the average arm circumference for a group of children all of the same height, x

Example: Arm Circumference and Height

Equation of the regression line relating estimated mean arm circumference (cm) to height (cm), from Stata:

$\hat{y} = 2.7 + 0.16x$

Here, $\hat{y}$ = estimated average arm circumference (like what we previously would call $\bar{y}$), x = height, $\hat{\beta}_0 = 2.7$ and $\hat{\beta}_1 = 0.16$

This is the estimated line from the sample of 150 Nepali children

Scatterplot with regression line superimposed

Estimated mean arm circumference for children 60 cm in height: for x = 60 cm,

$\hat{y} = 2.7 + 0.16 \times 60 = 12.3$ cm

Notice that most points don't fall directly on the line: we are estimating the mean arm circumference of children 60 cm tall, and observed points vary about the estimated mean
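
The plug-in calculation for children 60 cm tall can be sketched in Python as follows. The coefficients are the rounded estimates reported in the lecture; the constant and function names are illustrative assumptions.

```python
# Rounded estimates reported in the lecture (arm circumference on height).
BETA0_HAT = 2.7   # cm
BETA1_HAT = 0.16  # cm of arm circumference per cm of height

def estimated_mean_arm_circumference(height_cm):
    """Estimated mean arm circumference (cm) for children of a given height (cm)."""
    return BETA0_HAT + BETA1_HAT * height_cm

estimate_at_60 = estimated_mean_arm_circumference(60)
print(round(estimate_at_60, 1))  # 12.3
```

The result is an estimated group mean, not a prediction that every child 60 cm tall has this arm circumference.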

Example: Arm Circumference and Height

How to interpret the estimated slope?

$\hat{y} = 2.7 + 0.16x$

Here, $\hat{\beta}_1 = 0.16$

Two ways to say the same thing:
- $\hat{\beta}_1$ is the average change in arm circumference for a one-unit (1 cm) increase in height
- $\hat{\beta}_1$ is the mean difference in arm circumference for two groups of children who differ by one unit (1 cm) in height, taller to shorter

This result estimates that the mean difference in arm circumference for a 1 cm difference in height is 0.16 cm, with taller children having greater average arm circumference.

This mean difference estimate is constant across the entire height range in the sample: this is the definition of the slope of a line

Example: Arm Circumference and Height

What is the estimated mean difference in arm circumference for:
- Children 60 cm tall versus children 59 cm tall?
- Children 25 cm tall versus children 24 cm tall?
- Children 72 cm tall versus children 71 cm tall?
- Etc.?

The answer is the same for all of the above: 0.16 cm

What is the estimated mean difference in arm circumference for:
- Children 60 cm tall versus children 50 cm tall?

$\hat{y}_{x=60} - \hat{y}_{x=50} = 10 \times \hat{\beta}_1 = 10 \times 0.16 \text{ cm} = 1.6 \text{ cm}$

What is the estimated mean difference in arm circumference for:
- Children 90 cm tall versus children 89 cm tall?
- Children 34 cm tall versus children 33 cm tall?
- Children 110 cm tall versus children 109 cm tall?
- Etc.?

This is a trick question!

The range of observed heights in the sample is 40.9 cm – 73.3 cm: our regression results only apply to the relationship between arm circumference and height for this height range

$\hat{y} = 2.7 + 0.16x$
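
One way to keep the "trick question" in mind is to have the calculation itself refuse heights outside the observed range. This is a sketch only: the function name and the choice to raise an error are assumptions made for illustration, while the slope and the height range come from the lecture.

```python
BETA1_HAT = 0.16                  # estimated slope from the lecture
HEIGHT_RANGE_CM = (40.9, 73.3)    # observed height range in the sample

def mean_difference(height_a_cm, height_b_cm):
    """Estimated mean arm circumference difference (cm), group a minus group b."""
    low, high = HEIGHT_RANGE_CM
    for height in (height_a_cm, height_b_cm):
        if not low <= height <= high:
            # The fitted line says nothing reliable outside the sampled heights.
            raise ValueError(f"height {height} cm is outside the observed range")
    return (height_a_cm - height_b_cm) * BETA1_HAT

diff_60_vs_50 = mean_difference(60, 50)
print(round(diff_60_vs_50, 2))  # 1.6
```

Calling `mean_difference(90, 89)` raises a `ValueError` under this design, rather than returning 0.16 cm for heights that were never observed.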

Example: Arm Circumference and Height

How to interpret the estimated intercept?

$\hat{y} = 2.7 + 0.16x$

Here, $\hat{\beta}_0 = 2.7$ cm

This is the estimated $\hat{y}$ when x = 0: the estimated mean arm circumference for children 0 cm tall

Does this make sense given our sample?

As we noted before, estimates of mean arm circumference only apply to the observed height range.

Frequently, the intercept has no meaningful scientific interpretation, but it is necessary to fully specify the equation of the line and to make estimates of mean arm circumference for groups of children with heights in the sample range.

Notice that x = 0 is not even on this graph (the vertical axis is at x = 39)

Section C

Simple Linear Regression: More Examples

Linear regressions performed with a single predictor (one x) are called simple linear regressions

Linear regressions performed with more than one predictor (multiple x's) are called multiple linear regressions

In this set of lectures we are dealing with simple linear regression: in this section we will give three more examples

Example: Hb and PCV

Data on laboratory measurements from a random sample of 21 clinic patients 20-67 years old

Question: what is the relationship between hemoglobin levels (g/dL) and packed cell volume (percent of packed cells)?

Data:
- Hemoglobin (Hb): mean 14.1 g/dL, SD 2.3 g/dL, range 9.6 g/dL – 17.1 g/dL
- Packed cell volume (PCV): mean 41.1%, SD 8.1%, range 25% – 55%

Visualizing Hb and PCV Relationship

Scatterplot display

Example: Hb and PCV

Equation of the regression line relating estimated mean hemoglobin (g/dL) to packed cell volume, from Stata:

$\hat{y} = 5.77 + 0.20x$

Here, $\hat{y}$ = estimated average hemoglobin (like what we previously would call $\bar{y}$), x = PCV (%), $\hat{\beta}_0 = 5.77$ and $\hat{\beta}_1 = 0.20$

This is the estimated line from the sample of 21 subjects

Example: Hb and PCV

$\hat{\beta}_1 = 0.20$: what are the units?

Well, $\hat{y}$ is in g/dL and x is in percent, so $\hat{\beta}_1$ is in units of g/dL per percent

This result estimates that the mean difference in hemoglobin levels for two groups of subjects who differ by 1% in PCV is 0.20 g/dL: subjects with greater PCV have greater Hb levels on average.

Visualizing Hb and PCV Relationship

Scatterplot display with regression line

$\hat{y} = 5.77 + 0.20x$

Example: Hb and PCV

What is the average difference in Hb levels for subjects with PCV of 40% compared to subjects with PCV of 32%?

$\hat{\beta}_1 = 0.20$ compares groups of subjects who differ in PCV by 1% (it is positive, so those with the greater PCV have hemoglobin levels 0.20 g/dL greater on average)

To compare subjects with PCV of 40% versus subjects with PCV of 32%, which is an 8-unit difference in x, take

$8 \times \hat{\beta}_1 = 8 \times 0.20 = 1.6$ g/dL

What is the estimated mean Hb level for subjects with PCV of 41%?

$\hat{y} = 5.77 + 0.20x$

Plugging 41% into the equation,

$\hat{y} = 5.77 + 0.20 \times 41 = 13.97$ g/dL

Example: Wages and Education Level

Data on hourly wages from a random sample of 534 U.S. workers in 1985

Question: what is the relationship between hourly wage (US\$) and years of formal education?

Data:
- Hourly wages: mean \$9.04/hr, SD \$5.13/hr, range \$1.00/hr – \$44.50/hr
- Years of formal education: mean 13.0 years, SD 2.6 years, range 2 years – 18 years

Visualizing Wages and Education Level Relationship

Scatterplot display

Example: Wages and Education Level

Equation of the regression line relating estimated mean hourly wages (US\$) to years of education, from Stata:

$\hat{y} = -0.75 + 0.75x$

Here, $\hat{y}$ = estimated average hourly wage (like what we previously would call $\bar{y}$), x = years of formal education, $\hat{\beta}_0 = -0.75$ and $\hat{\beta}_1 = 0.75$

This is the estimated line from the sample of 534 subjects

Visualizing Wages and Education Level Relationship

Scatterplot display with regression line

Wages and Education Level

What is the interpretation of the slope estimate?

Example: Arm Circumference and Sex

Data on anthropometric measures from a random sample of 150 Nepali children [0, 12) months old

Question: what is the relationship between average arm circumference and sex of a child?

Data:
- Arm circumference: mean 12.4 cm, SD 1.5 cm, range 7.3 cm – 15.6 cm
- Sex: 51% female

Visualizing Arm Circumference and Sex Relationship

Scatterplot display

Boxplot display

Example: Arm Circumference and Sex

Here, y is arm circumference, a continuous measure; x is not continuous, but binary: male or female

How to handle sex as an "x" in regression?
- One possibility: x = 0 for male children, x = 1 for female children

The equation we will estimate:

$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$

How to interpret the regression coefficients?

Notice: this equation is only estimating two values: the mean arm circumference for male children, and the mean for female children

For female children:

$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 \times 1 = \hat{\beta}_0 + \hat{\beta}_1$

For male children:

$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 \times 0 = \hat{\beta}_0$

So $\hat{\beta}_1$ is still a slope estimating the mean difference in y for a one-unit difference in x, but the only possible one-unit difference is 1 (females) to 0 (males)

$\hat{\beta}_0$ actually has substantive meaning in this example: it is the average arm circumference for male children

The resulting equation:

$\hat{y} = 12.5 - 0.13x$

$\hat{\beta}_1 = -0.13$: the estimated mean difference in arm circumference for female children compared to male children is -0.13 cm; female children have lower arm circumference by 0.13 cm on average

$\hat{\beta}_0 = 12.5$: the estimated mean arm circumference for male children is 12.5 cm

Visualizing Arm Circumference and Sex Relationship

Scatterplot display with regression line
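
The claim that a regression on a 0/1 predictor just reproduces the two group means can be checked numerically. The sketch below uses a tiny made-up data set (not the Nepali data) and the standard closed-form least squares formulas, which the lecture does not state explicitly; `fit_line` is a hypothetical helper name.

```python
def fit_line(x, y):
    """Least squares intercept and slope for a single predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Made-up illustration data: x = 0 for male, x = 1 for female.
sex = [0, 0, 0, 1, 1, 1]
arm_circ_cm = [12.0, 12.5, 13.0, 11.5, 12.0, 12.5]

beta0_hat, beta1_hat = fit_line(sex, arm_circ_cm)

male_mean = sum(arm_circ_cm[:3]) / 3      # 12.5
female_mean = sum(arm_circ_cm[3:]) / 3    # 12.0

print(round(beta0_hat, 2), round(beta1_hat, 2))  # 12.5 -0.5
```

In this toy example the intercept equals the male mean and the slope equals the female mean minus the male mean, mirroring the interpretation given above.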

Section D

Simple Linear Regression Model: Estimating the Regression Equation, Accounting for Uncertainty in the Estimates

Example: Arm Circumference and Height

So in the last section, we showed the results from several simple linear regression models

For example, when relating arm circumference to height using a random sample of 150 Nepali children < 12 months old, I told you that the resulting regression equation was:

$\hat{y} = 2.7 + 0.16x$

I told you this came from Stata, and I will show you how to do regression with Stata shortly: but how does Stata estimate this equation?

There must be some algorithm that will always yield the same results for the same data set

Example: Arm Circumference and Height

The algorithm used to estimate the equation of the line is called "least squares" estimation

The idea is to find the line that gets "closest" to all of the points in the sample

How to define closeness to multiple points?

In regression, closeness is defined as the cumulative squared distance between each point's y-value and the corresponding value of $\hat{y}$ for that point's x: in other words, the squared distance between an observed y-value and the estimated mean y-value for all points with the same value of x, summed over all points.

Each distance is $y - \hat{y} = y - (\hat{\beta}_0 + \hat{\beta}_1 x)$: this is computed for each data point in the sample
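
The "closest line" idea can be sketched numerically. The snippet below uses made-up data and the standard closed-form least squares solution (a textbook result, not shown in the lecture), then checks that nudging the coefficients in any direction increases the cumulative squared distance. All names and data values are illustrative assumptions.

```python
def sum_squared_distances(beta0, beta1, x, y):
    """Cumulative squared vertical distance of the points from the line."""
    return sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y))

def least_squares(x, y):
    """Closed-form intercept and slope minimizing the sum of squared distances."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope

# Made-up data for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

b0, b1 = least_squares(x, y)
best = sum_squared_distances(b0, b1, x, y)

# Every nearby alternative line has a larger cumulative squared distance.
nearby = [
    sum_squared_distances(b0 + d0, b1 + d1, x, y)
    for d0 in (-0.1, 0.0, 0.1)
    for d1 in (-0.1, 0.0, 0.1)
    if (d0, d1) != (0.0, 0.0)
]

print(round(b0, 2), round(b1, 2), round(best, 2))  # 2.2 0.6 2.4
```

Because the same data always give the same minimizing values, this is an algorithm "that will always yield the same results for the same data set".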

Example: Arm Circumference and Height

The values chosen for $\hat{\beta}_0$ and $\hat{\beta}_1$ are the values that minimize the cumulative squared distances, i.e.

$\min_{\hat{\beta}_0, \hat{\beta}_1} \sum_{i=1}^{n} \left[ y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right]^2$

Example: Arm Circumference and Height

The values chosen for $\hat{\beta}_0$ and $\hat{\beta}_1$ are just estimates based on a single sample. If we were to have a different random sample of 150 Nepali children from the same population of < 12 month olds, the resulting estimates would likely be different: i.e., the values that minimized the cumulative squared distance for this second sample of points would likely be different

As such, all regression coefficients have an associated standard error that can be used to make statements about the true relationship between mean y and x (for example, the true slope $\beta_1$) based on a single sample

Example: Arm Circumference and Height

For the regression relating arm circumference to height using a random sample of 150 Nepali children < 12 months old, I told you that the resulting regression equation was:

$\hat{y} = 2.7 + 0.16x$

$\hat{\beta}_1 = 0.16$ and $\widehat{SE}(\hat{\beta}_1) = 0.014$
$\hat{\beta}_0 = 2.70$ and $\widehat{SE}(\hat{\beta}_0) = 0.88$

Example: Arm Circumference and Height

The random sampling behavior of estimated regression coefficients is normal for large samples (n > 60), and centered at the true values

As such, we can use the same ideas as before to create 95% CIs and get p-values

$\hat{\beta}_1 = 0.16$ and $\widehat{SE}(\hat{\beta}_1) = 0.014$

95% CI for $\beta_1$:

$\hat{\beta}_1 \pm 2 \times \widehat{SE}(\hat{\beta}_1) = 0.16 \pm 2 \times 0.014 = (0.13, 0.19)$

p-value for testing:

$H_0$: $\beta_1 = 0$
$H_A$: $\beta_1 \neq 0$

Assume the null is true, and calculate the standardized "distance" of $\hat{\beta}_1$ from 0:

$t = \frac{\hat{\beta}_1 - 0}{\widehat{SE}(\hat{\beta}_1)} = \frac{\hat{\beta}_1}{\widehat{SE}(\hat{\beta}_1)} = \frac{0.16}{0.014} = 11.4$

The p-value is the probability of being 11.4 or more standard errors away from the mean of 0 on a normal curve: very low in this example, p < .001
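
The large-sample interval and test above can be reproduced with a few lines of Python. This is a sketch under the lecture's normal approximation; the function name is hypothetical, and the two-sided normal tail area is computed with `math.erfc` from the standard library.

```python
import math

def slope_inference(beta1_hat, se):
    """Approximate 95% CI, test statistic, and two-sided normal p-value for H0: slope = 0."""
    ci = (beta1_hat - 2 * se, beta1_hat + 2 * se)
    t = (beta1_hat - 0) / se
    # Two-sided tail area of a standard normal beyond |t|.
    p_value = math.erfc(abs(t) / math.sqrt(2))
    return ci, t, p_value

# Slope estimate and standard error reported in the lecture.
ci, t, p_value = slope_inference(0.16, 0.014)

print(tuple(round(end, 2) for end in ci))  # (0.13, 0.19)
print(round(t, 1))                         # 11.4
print(p_value < 0.001)                     # True
```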

Summarizing Findings: Arm Circumference and Height

This research used simple linear regression to estimate the magnitude of the association between arm circumference and height in Nepali children less than 12 months old, using data on a random sample of 150. A statistically significant positive association was found (p < .001). The results estimate that two groups of such children who differ by 1 cm in height will differ on average by 0.16 cm in arm circumference (95% CI 0.13 cm to 0.19 cm).

Using Stata: Arm Circumference and Height

Finally: Stata!

If you have your "y" and "x" values entered in Stata, then to do linear regression use the regress command:
- regress y x

Data snippet from Stata

For the arm circumference and height example, the command is:
- regress armcirc height

In the resulting output, the coefficient table gives the estimated intercept $\hat{\beta}_0$ and the estimated slope $\hat{\beta}_1$, which together give the equation

$\hat{y} = 2.7 + 0.16x$

Example 2: Arm Circumference and Height

Give an estimate and 95% CI for the mean difference in arm circumference for children 60 cm tall compared to children 50 cm tall
- From the previous section we know this estimated mean difference is

$(60 - 50) \times \hat{\beta}_1 = 10\hat{\beta}_1 = 10 \times 0.16 = 1.6$ cm

How to get the standard error? Well, as it turns out:

$\widehat{SE}(10\hat{\beta}_1) = 10 \times \widehat{SE}(\hat{\beta}_1)$
$\widehat{SE}(10\hat{\beta}_1) = 10 \times 0.014 = 0.14$

95% CI for the mean difference:

$10\hat{\beta}_1 \pm 2 \times \widehat{SE}(10\hat{\beta}_1)$
$1.6 \pm 2 \times 0.14 = (1.32, 1.88)$ cm
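
The scaling rule for a multiple of the slope can be sketched as a small helper. The function name and the default multiplier of 2 are illustrative assumptions that follow the large-sample convention used in the lecture.

```python
def ci_for_slope_multiple(k, beta1_hat, se, multiplier=2.0):
    """Estimate, standard error, and approximate 95% CI for k times the slope."""
    estimate = k * beta1_hat
    se_k = abs(k) * se  # the standard error scales by the same constant
    return estimate, se_k, (estimate - multiplier * se_k, estimate + multiplier * se_k)

# 60 cm versus 50 cm: a 10-unit difference in height.
estimate, se_10, ci_10 = ci_for_slope_multiple(10, 0.16, 0.014)

print(round(estimate, 2), round(se_10, 2))   # 1.6 0.14
print(tuple(round(end, 2) for end in ci_10)) # (1.32, 1.88)
```

The endpoints are simply 10 times the endpoints of the unrounded interval for the slope itself, 0.132 and 0.188.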

Example 2: Hemoglobin and "Packed Cell Volume"

Equation of the regression line relating estimated mean hemoglobin (g/dL) to packed cell volume, from Stata:

$\hat{y} = 5.77 + 0.20x$

Snippet of data in Stata

The Stata command:
- regress Hb PCV

Example 2: Hemoglobin and "Packed Cell Volume"

Same idea for the computation of the 95% CI and p-value as we saw before

However, with small (n < 60) samples, there is a slight change, analogous to what we did with means and differences in means before

The sampling distribution of the regression coefficients is not quite normal, but follows a t-distribution with n - 2 degrees of freedom

95% CI for $\beta_1$:

$\hat{\beta}_1 \pm t_{.95, n-2} \times \widehat{SE}(\hat{\beta}_1)$

In this example:

$\hat{\beta}_1 \pm t_{.95, 19} \times \widehat{SE}(\hat{\beta}_1) = 0.20 \pm 2.09 \times 0.046 = (0.10, 0.30)$

p-value for testing:

$H_0$: $\beta_1 = 0$
$H_A$: $\beta_1 \neq 0$

Assume the null is true, and calculate the standardized "distance" of $\hat{\beta}_1$ from 0:

$t = \frac{\hat{\beta}_1 - 0}{\widehat{SE}(\hat{\beta}_1)} = \frac{\hat{\beta}_1}{\widehat{SE}(\hat{\beta}_1)} = \frac{0.20}{0.046} = 4.35$

The p-value is the probability of being 4.35 or more standard errors away from the mean of 0 on a t curve with 19 degrees of freedom: very low in this example, p < .001
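
To see what the small-sample correction changes, the sketch below compares the t-based interval with the interval that the large-sample multiplier of 2 would have given. The critical value 2.09 is taken from the lecture rather than computed, and the helper name is an assumption.

```python
def interval(estimate, se, multiplier):
    """Interval of the form estimate +/- multiplier * standard error."""
    half_width = multiplier * se
    return estimate - half_width, estimate + half_width

BETA1_HAT, SE = 0.20, 0.046   # Hb/PCV slope and standard error from the lecture
T_CRIT_19_DF = 2.09           # t multiplier for 19 degrees of freedom, from the lecture

t_ci = interval(BETA1_HAT, SE, T_CRIT_19_DF)
normal_ci = interval(BETA1_HAT, SE, 2.0)
t_stat = BETA1_HAT / SE

print(tuple(round(end, 2) for end in t_ci))  # (0.1, 0.3)
print(round(t_stat, 2))                      # 4.35
```

With 19 degrees of freedom the t interval is only slightly wider than the one based on a multiplier of 2; the gap grows as the sample gets smaller.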

Interpreting Result of 95% CI

So, the estimated slope is 0.20 with 95% CI 0.10 to 0.30

How to interpret the results?
- Based on a sample of 21 subjects, we estimated that PCV (%) is positively associated with hemoglobin levels
- We estimated that a one-percent increase in PCV is associated with a 0.20 g/dL increase in hemoglobin on average
- Accounting for sampling variability, this mean increase could be as small as 0.10 g/dL, or as large as 0.30 g/dL, in the population of all such subjects

In other words:
- We estimated the average difference in hemoglobin levels for two groups of subjects who differ by one percent in PCV to be 0.20 g/dL (higher PCV group relative to lower)
- Accounting for sampling variability, this mean difference could be as small as 0.10 g/dL, or as large as 0.30 g/dL, in the population of all such subjects

In this section, all examples have confidence intervals for the slope, and multiples of the slope

We can also create confidence intervals/p-values for the intercept in the same manner (and Stata presents them in the output). However, as we noted in the previous section, many times the intercept is just a placeholder and does not describe a useful quantity: as such, 95% CIs and p-values for the intercept are not always relevant

Section E

Measuring the Strength of a Linear Association

Strength of Association

The slope of a regression line estimates the magnitude and direction of the relationship between y and x: it encapsulates how much y differs on average with differences in x

The slope estimate and standard error can be used to address the uncertainty in this estimate with regard to the true magnitude and direction of the association in the population from which the sample was taken

Slopes do not impart any information about how well the regression line fits the data in the sample; the slope gives no indication of how close the points get to the estimated regression line

Another quantity that can be estimated via linear regression is the coefficient of determination, $R^2$: this is a number that ranges from 0 to 1, with larger values indicating "closer fits" of the data points and regression line

$R^2$ measures strength of association by comparing the variability of points around the regression line to the variability in y-values ignoring x

Example: Arm Circumference and Height

How close do the points get to the line: can we quantify this?

(SR1 Flashback) The sample standard deviation of the y-values, ignoring the corresponding potential information in x, is

$s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}}$

- This measures how far on average each of the sample y-values falls from the overall mean of all y-values

In this example, s = 1.48 cm

Example: Arm Circumference and Height

"Visualization" on the scatterplot

The standard deviation of regression, referred to as root mean square error, is the "average" distance of points from the line: how far on average each y falls from its mean predicted by its corresponding x-value

$s_{y|x} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2}}$

Each distance is $y - \hat{y} = y - (\hat{\beta}_0 + \hat{\beta}_1 x)$: this is computed for each data point in the sample

Using Stata: Arm Circumference and Height

The regress command in Stata gives $s_{y|x}$ (reported as Root MSE)

Example: Arm Circumference and Height

If $s = s_{y|x}$, then knowing x does not yield a better guess for the mean of y than using the overall mean $\bar{y}$ (flat regression line)

The smaller $s_{y|x}$ is relative to s, the closer the points are to the regression line

$R^2$ functionally measures how much smaller $s_{y|x}$ is than s: as such, it is an estimate of the amount of variability in y explained by taking x into account

Using Stata: Arm Circumference and Height

The regress command in Stata gives $R^2$: child's height explains (an estimated) 46% of the variation in arm circumference
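
To make "how much smaller $s_{y|x}$ is than s" concrete, the relationship can be written out as a worked equation. This is a sketch that assumes the value $s_{y|x} = 1.0944$ cm, taken from the Root MSE line of the Stata output reproduced later in this lecture, together with s = 1.48 cm and n = 150 from the slides.

```latex
R^2 \;=\; 1 - \frac{(n-2)\, s_{y|x}^2}{(n-1)\, s^2}
    \;=\; 1 - \frac{148 \times 1.0944^2}{149 \times 1.48^2}
    \;\approx\; 1 - \frac{177.3}{326.4}
    \;\approx\; 0.46
```

If $s_{y|x}$ were as large as s, the ratio would be close to 1 and $R^2$ close to 0, which corresponds to the flat-line case described above.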

Example: Arm Circumference and Height

$R^2$ and r

r = the properly signed square root of $R^2$; the proper sign is the same sign as the slope in the regression

r is called the correlation coefficient (not to be confused with the "regression coefficients": great names, huh)

Allowable values:
- $0 \le R^2 \le 1$
- If the relationship between y and x is positive, $0 \le r \le 1$
- If the relationship between y and x is negative, $-1 \le r \le 0$

In this example, $r = +\sqrt{R^2} = +\sqrt{0.46} = 0.68$

Example: Arm Circumference and Height

So from the example: child's height explains (an estimated) 46% of the variation in arm circumference
- This is just an estimate based on the sample; a 95% CI can be computed, but it is not easy to do and is not readily given by the computer; also, the procedure for estimating the 95% CI is not very good

So this means an estimated 54% of the variability in arm circumference is not explained by child's height
- Some of this unexplained variability may be explained by factors other than height
- Multiple linear regression will allow us to estimate the relationship between arm circumference, height and other child characteristics in one analysis

Example 2: Hemoglobin and "Packed Cell Volume"

The regress command in Stata gives $R^2$ = 0.51: PCV explains (an estimated) 51% of the variation in hemoglobin levels

The slope is positive, so $r = +\sqrt{R^2} = +\sqrt{0.51} = 0.71$

Example 3: Wages and Years of Education

The regress command in Stata gives $R^2$: years of education explains (an estimated) 15% of the variation in hourly wages

Here $r = +\sqrt{R^2} = +\sqrt{0.15} = 0.39$

Example 4: Arm Circumference and Child Sex

The regress command in Stata gives $R^2$: sex (female = 1) explains (an estimated) 0.2% of the variation in arm circumference

Here $r = -\sqrt{R^2} = -\sqrt{0.002} = -0.045$. In this sample of data, female sex is negatively correlated with arm circumference.

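
The "properly signed square root" rule can be sketched as a one-line helper. The function name is hypothetical; the $R^2$ values and slopes are the rounded figures quoted in this lecture.

```python
import math

def correlation_from_r_squared(r_squared, slope):
    """Correlation r: the square root of R^2, carrying the sign of the slope."""
    return math.copysign(math.sqrt(r_squared), slope)

# (R^2, estimated slope) pairs quoted in the lecture.
r_height = correlation_from_r_squared(0.46, 0.16)    # arm circumference and height
r_pcv = correlation_from_r_squared(0.51, 0.20)       # hemoglobin and PCV
r_wages = correlation_from_r_squared(0.15, 0.75)     # wages and education
r_sex = correlation_from_r_squared(0.002, -0.13)     # arm circumference and sex

print(round(r_height, 2), round(r_pcv, 2), round(r_wages, 2), round(r_sex, 3))
# 0.68 0.71 0.39 -0.045
```
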
What's a "Good" $R^2$?

There are a couple of important things to keep in mind about $R^2$ and r

- These quantities are both estimates based on the sample of data; they are frequently reported without any recognition of sampling variability, for example a 95% confidence interval

- Low $R^2$ and r are not necessarily "bad"

- Many outcomes cannot/will not be fully or close to fully explained, in terms of variability, by any one single predictor

The higher the $R^2$ value, the better x predicts y for individuals in a sample/population, as individual y-values vary less about their estimated means based on x

However, there may be important overall associations between the mean of y and x even though there is still a lot of individual variability in y-values about their means estimated by x
- In the wages example, years of education explained an estimated 15% of the variability in hourly wages
- The association was statistically significant, showing that average wages were greater for persons with more years of education
- However, for any single education level (year), there is still a lot of variation in wages for individual workers

Slope versus $R^2$

The slope estimates the magnitude and direction of the relationship between y and x
- Estimates a mean difference in y for two groups who differ by one unit in x
- The slope will change if the units change for y and/or for x
- Larger slopes are not indicative of a stronger linear association; smaller slopes are not indicative of a weaker linear association

$R^2$ measures the strength of linear association; r measures strength and direction
- Neither $R^2$ nor r measures magnitude
- Neither $R^2$ nor r changes with changes in units

Using Stata: Arm Circumference and Height

Regression of arm circumference (cm) on height in centimeters

$\hat{y} = 2.7 + 0.16x$

$R^2$ = 0.46 or 46%; $\hat{\beta}_1 = 0.16$

Regression of arm circumference (cm) on height in inches

. regress armcirc height_inch

      Source |       SS       df       MS              Number of obs =     150
-------------+------------------------------           F(  1,   148) =  124.30
       Model |  148.874589     1  148.874589           Prob > F      =  0.0000
    Residual |  177.263343   148  1.19772529           R-squared     =  0.4565
       Total |  326.137932   149  2.18884518           Root MSE      =  1.0944

------------------------------------------------------------------------------
     armcirc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 height_inch |   .4008806    .035957    11.15   0.000     .3298251     .471936
       _cons |   2.695906   .8774225     3.07   0.003     .9620119    4.429801
------------------------------------------------------------------------------

$\hat{y} = 2.7 + 0.40x$

$R^2$ = 0.46 or 46%; $\hat{\beta}_1 = 0.40$
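
The effect of changing units can be checked on a toy data set. The sketch below uses made-up heights and arm circumferences (not the Nepali data) and a hypothetical helper `fit_and_r_squared`; it refits the line after converting height from centimeters to inches.

```python
def fit_and_r_squared(x, y):
    """Least squares intercept, slope, and R^2 for a single predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    ss_residual = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return intercept, slope, 1 - ss_residual / ss_total

# Made-up illustration data.
height_cm = [50.0, 55.0, 60.0, 65.0, 70.0]
arm_circ_cm = [10.9, 11.2, 12.6, 12.9, 13.9]
height_in = [h / 2.54 for h in height_cm]

b0_cm, b1_cm, r2_cm = fit_and_r_squared(height_cm, arm_circ_cm)
b0_in, b1_in, r2_in = fit_and_r_squared(height_in, arm_circ_cm)

print(round(b1_in / b1_cm, 2))        # 2.54
print(abs(r2_in - r2_cm) < 1e-9)      # True
```

The slope per inch is 2.54 times the slope per centimeter, while the intercept and $R^2$ are unchanged. The same pattern appears in the Stata output above: 0.40 per inch corresponds to about 0.16 per centimeter.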

Section F

Standard Error of Slopes

Just FYI: the standard error of the estimated slope is a combination of the variation in y-values around the regression line and the spread of the x-values
- Definition: the standard deviation of regression, called "root mean squared error", is functionally the average distance of any single point from the estimated mean of all y-values with the same x, i.e., the corresponding value on the regression line
- For simple linear regression, d.f. = n - 2 and

$s_{y|x} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2}}$

Estimated standard error of the slope estimate:

$\widehat{SE}(\hat{\beta}_1) = \frac{s_{y|x}}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$

Notice this will be larger:
- The more variable the y-values are around their corresponding mean estimates on the regression line (i.e., the greater $s_{y|x}$ is)
- The less variable the x-values are around the mean of x: hmm…

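
The two influences on the slope's standard error can be illustrated with the formula above. In this sketch the value of $s_{y|x}$ and both sets of x-values are made up for illustration, and the function name is an assumption.

```python
import math

def slope_standard_error(s_yx, x):
    """Estimated SE of the slope: s_yx divided by the root sum of squares of x about its mean."""
    x_bar = sum(x) / len(x)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return s_yx / math.sqrt(sxx)

s_yx = 1.0                               # hypothetical root mean squared error
wide_x = [1.0, 2.0, 3.0, 4.0, 5.0]       # x-values spread out around their mean
narrow_x = [2.0, 2.5, 3.0, 3.5, 4.0]     # same mean, half the spread

se_wide = slope_standard_error(s_yx, wide_x)
se_narrow = slope_standard_error(s_yx, narrow_x)

print(round(se_narrow / se_wide, 2))  # 2.0
```

Halving the spread of x doubles the standard error here, and doubling `s_yx` would double it as well, matching the two bullets above.
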
Actual Computation of $R^2$

How do we actually compute $R^2$?
- Recall the interpretation: percent of variability in y explained by x

Total variability in y?
- Actually, for the $R^2$ computation,

total variability in y $= (n - 1) \times s^2 = \sum_{i=1}^{n} (y_i - \bar{y})^2$

Example: Arm Circumference and Height

"Visualization" on the scatterplot: the distance of each point from the flat line at $\bar{y}$, squared and added together

$R^2$: Arm Circumference and Height

Regression of arm circumference on height in centimeters: total variability in y (326.14, the Total sum of squares in the Stata output)

Actual Computation of $R^2$

Total variability in y not explained by x?
- For the $R^2$ computation,

total variability in y not explained by x $= (n - 2) \times s_{y|x}^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

Example: Arm Circumference and Height

Each distance is $y - \hat{y} = y - (\hat{\beta}_0 + \hat{\beta}_1 x)$: this is computed for each data point in the sample, squared and summed

$R^2$: Arm Circumference and Height

Regression of arm circumference on height in centimeters: total variability in y not explained by x (177.26, the Residual sum of squares in the Stata output)

Actual Computation of $R^2$

Percentage of variability in y NOT explained by x:

$\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$

$R^2$ is the percentage of variability in y explained by x:

$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = 1 - \frac{177.26}{326.14} = 1 - 0.54 = 0.46$

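
The final calculation can be reproduced directly from the sums of squares in the Stata output shown earlier. The variable names are illustrative; the three numbers are the Model, Residual, and Total SS values from that output.

```python
# Sums of squares from the Stata output for arm circumference on height.
ss_model = 148.874589
ss_residual = 177.263343
ss_total = 326.137932

# R^2 as one minus the unexplained share of total variability.
r_squared = 1 - ss_residual / ss_total

# Equivalent form: the explained share of total variability.
r_squared_alt = ss_model / ss_total

print(round(r_squared, 4), round(r_squared_alt, 4))  # 0.4565 0.4565
```

Both forms agree with the R-squared value of 0.4565 that Stata reports, because the model and residual sums of squares add up to the total.
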
