Epi 203: Epidemiologic Methods
Problem Set 7: Understanding Measurement: Aspects of Reproducibility and Validity
ANSWER KEY
Due: 11/10/09 at 1:30PM section
Possible points: 25
The calculations are most easily performed using the "functions"
and graphing capacity of Microsoft Excel.
This Excel file already has the data entered.
1. You are asked to be a consultant for the development of a new
at-home non-invasive needleless blood sugar (glucose) monitoring device.
The company has presented you with some reproducibility data, obtained on the same machine.
They have sampled 30 patients and performed 3 replicate measurements
on each patient, one replicate right after the other, which are shown below:
(Clinical note: The normal range for blood glucose is about 70 to 110 mg/dl)
subject rep#1 rep#2 rep#3 mean within-sub sd within-sub variance
1 95 96 91 94 2.64575131 7
2 142 145 139 142 3 9
3 168 171 167 168.67 2.081666 4.3333
4 123 126 123 124 1.73205081 3
5 110 113 107 110 3 9
6 115 116 113 114.67 1.52752523 2.3333
7 122 125 118 121.67 3.51188458 12.333
8 148 148 147 147.67 0.57735027 0.3333
9 105 108 102 105 3 9
10 155 159 154 156 2.64575131 7
11 168 169 163 166.67 3.21455025 10.333
12 188 188 187 187.67 0.57735027 0.3333
13 118 120 117 118.33 1.52752523 2.3333
14 114 118 113 115 2.64575131 7
15 134 137 130 133.67 3.51188458 12.333
16 139 143 137 139.67 3.05505046 9.3333
17 149 150 147 148.67 1.52752523 2.3333
18 130 133 125 129.33 4.04145188 16.333
19 128 129 125 127.33 2.081666 4.3333
20 124 129 120 124.33 4.50924975 20.333
21 130 131 130 130.33 0.57735027 0.3333
22 140 141 136 139 2.64575131 7
23 142 144 141 142.33 1.52752523 2.3333
24 137 141 135 137.67 3.05505046 9.3333
25 141 145 140 142 2.64575131 7
26 105 107 104 105.33 1.52752523 2.3333
27 180 182 180 180.67 1.15470054 1.3333
28 178 182 177 179 2.64575131 7
29 108 108 107 107.67 0.57735027 0.3333
30 110 113 107 110 3 9
common sd = 2.5451
common sd x 1.96 = 4.9885
common sd x 1.96 x 1.41 = 7.0338
a. Determine the within-subject standard deviation for each subject. (2 pts)
see column above marked "within-sub sd"
b. Plot the within-subject standard deviation versus the mean for each person. Comment
on any relationship between the within-subject standard deviation and the mean. (2 pts)
see graph below
[Figure: scatter plot titled "within-subject mean vs standard deviation"; x axis: within-subject mean (90 to 190); y axis: within-subject sd (0 to 5)]
There appears to be no relationship between the within-subject sd and the mean.
(Visual inspection is fine, but formally the Pearson correlation coefficient = -0.18, p=0.33)
c. Determine the "common" within-subject standard deviation. (1 pt)
add the within-subject variances (column marked within-sub variance),
divide by 30 to get the mean, and take the square root to get the number back to its original units.
This equals 2.54 units. The formula for this is shown in cell I50.
Note: A frequent mistake in calculating the "common" within-subject standard deviation is to average the within-subject SD's,
rather than the variances. The result for this dataset would be 2.326. In this case, close but not correct.
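As a cross-check on part c outside of Excel, the calculation can be sketched in Python. This is a minimal sketch, assuming the replicate values are typed in from the table above; the variable names are illustrative and are not part of the problem set's Excel file.

```python
import math
import statistics

# Replicate glucose readings (mg/dl) for subjects 1..30, copied from the table above.
replicates = [
    (95, 96, 91), (142, 145, 139), (168, 171, 167), (123, 126, 123), (110, 113, 107),
    (115, 116, 113), (122, 125, 118), (148, 148, 147), (105, 108, 102), (155, 159, 154),
    (168, 169, 163), (188, 188, 187), (118, 120, 117), (114, 118, 113), (134, 137, 130),
    (139, 143, 137), (149, 150, 147), (130, 133, 125), (128, 129, 125), (124, 129, 120),
    (130, 131, 130), (140, 141, 136), (142, 144, 141), (137, 141, 135), (141, 145, 140),
    (105, 107, 104), (180, 182, 180), (178, 182, 177), (108, 108, 107), (110, 113, 107),
]

# Sample variance (n - 1 denominator) of the three replicates for each subject.
within_var = [statistics.variance(r) for r in replicates]

# "Common" within-subject SD: square root of the mean within-subject variance.
common_sd = math.sqrt(statistics.mean(within_var))

# The frequent mistake: averaging the SDs instead of the variances.
mean_of_sds = statistics.mean(math.sqrt(v) for v in within_var)

print(f"common within-subject sd = {common_sd:.4f}")
print(f"mean of the sds (incorrect) = {mean_of_sds:.3f}")
```

Averaging the variances weights the noisier subjects more heavily than averaging the SDs does, which is why the two results differ.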
d. The difference between any single measurement of glucose on a subject and the value that
would be obtained if you could take the mean of numerous replicates on a subject (i.e.,
the subject's true value as measured by the device) would be expected to be less than what
value for 95% of all individual measurements? (2 pts)
Assuming that the differences between a subject's measured and true glucose are normally
distributed with a mean of zero (this is the assumption of random error) and a standard deviation
estimated by the common within-subject standard deviation, then 95% of replicates
would be expected to be within 1.96 standard deviations of the true value.
1.96 x 2.54 = 4.99 units
95% of observed values would be expected to be within 4.99 units of the true value.
e. The difference between any two measurements on the same individual would be expected to be less
than what value for 95% of all pairs of measurements? (2 pts)
Following the derivation performed in class (taken from Bland and Altman),
1.96 x square root of 2 x 2.54 = 7.03 units (with the square root of 2 rounded to 1.41; carrying more decimals gives about 7.05)
The difference between measurements for 95% of all measurement pairs would be expected to be
less than 7.03 units. Another way to consider this is that two measurements on the same
subject could differ by as much as about 7 units.
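The arithmetic in parts d and e can be sketched as follows. This is only an illustration, assuming random, normally distributed error and independent replicates, so that the difference of two readings has variance equal to twice the within-subject variance. The variable names are ours. With the square root of 2 carried to full precision, the second value comes out near 7.05 rather than 7.03, a difference that reflects rounding only.

```python
import math

common_sd = 2.5451   # common within-subject SD from part c (mg/dl)
z = 1.96             # two-sided 95% point of the standard normal distribution

# Part d: a single reading versus the subject's true value.
measurement_error = z * common_sd

# Part e: the difference of two independent readings has variance 2 * sd^2,
# so its standard deviation is sqrt(2) * sd.
repeatability = z * math.sqrt(2) * common_sd

print(f"95% of readings lie within {measurement_error:.2f} units of the true value")
print(f"95% of replicate pairs differ by less than {repeatability:.2f} units")
```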
f. Is the coefficient of variation a useful measure of reproducibility here? (2 pts)
Not really. The within-subject standard deviation is fairly constant throughout the range of values, so
the ratio of the standard deviation to the underlying true value is not constant: it falls as the true value rises.
No one number adequately characterizes the coefficient of variation.
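To make this concrete, the coefficient of variation implied by a constant within-subject SD can be computed at a few glucose levels. This is a rough sketch: it uses the common SD from part c and three subject means picked from the table (the lowest, an intermediate one, and the highest), and the variable names are illustrative.

```python
common_sd = 2.5451  # common within-subject SD from part c (mg/dl)

# Lowest, an intermediate, and highest subject mean in the table
# (subjects 1, 22 and 12).
subject_means = [94.0, 139.0, 187.67]

# Coefficient of variation (%) implied at each glucose level.
cv_percent = [100 * common_sd / m for m in subject_means]

for m, cv in zip(subject_means, cv_percent):
    print(f"mean {m:6.2f} mg/dl -> CV {cv:.2f}%")
```

The CV at the low end is about twice the CV at the high end, so quoting a single CV would misstate reproducibility somewhere in the range.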
g. Using this new device, examine the following output from Stata. State the ICC.
State in words what the ICC means. (1 pt)
. loneway Glucose Subject
One-way Analysis of Variance for Glucose:
Number of obs = 90
Source SS df MS F Prob > F
-------------------------------------------------------------------------
Between Subject 50978.056 29 1757.864 271.37 0.0000
Within Subject 388.66667 60 6.4777778
-------------------------------------------------------------------------
Total 51366.722 89 577.15418
Intraclass Asy.
correlation S.E. [95% Conf. Interval]
------------------------------------------------
0.98903 0.00348 0.98220 0.99585
The intraclass correlation coefficient is defined as the fraction of the total variance that is due to the variability between
individuals. In this case, the ICC = 0.989. This means that about 99% of the variability in glucose measurements with the
new device is due to the intrinsic variability of underlying glucose values among the participants. Only 1% of the variability
is due to random measurement error.
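The ICC reported by Stata can be reproduced from the two mean squares in the ANOVA table. The following is a minimal sketch, assuming a balanced one-way random-effects design with three replicates per subject; the variable names are ours, not Stata's.

```python
# Values read from the Stata output above.
ms_between = 1757.864    # between-subject mean square
ms_within = 6.4777778    # within-subject mean square
k = 3                    # replicates per subject

# Variance components for a balanced one-way random-effects design.
var_within = ms_within
var_between = (ms_between - ms_within) / k

# ICC: share of total variance attributable to between-subject variability.
icc = var_between / (var_between + var_within)
print(f"ICC = {icc:.5f}")
```

Note that the within-subject mean square is the same quantity as the mean within-subject variance from question 1c, so its square root is the common within-subject SD.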
2. These same 30 subjects have blood drawn during the session when the measurements with the new
device are taken. The blood glucose level is measured at the UCSF Clinical Laboratory - with one assay - as shown below:
subject UCSF glucose new device mean (new device - ucsf)
1 96.0 94 -2.000000
2 139.0 142 3.000000
3 168.0 168.6666667 0.666667
4 126.0 124 -2.000000
5 112.0 110 -2.000000
6 116.0 114.6666667 -1.333333
7 121.0 121.6666667 0.666667
8 149.0 147.6666667 -1.333333
9 106.0 105 -1.000000
10 157.0 156 -1.000000
11 166.0 166.6666667 0.666667
12 188.0 187.6666667 -0.333333
13 117.0 118.3333333 1.333333
14 116.0 115 -1.000000
15 134.0 133.6666667 -0.333333
16 140.0 139.6666667 -0.333333
17 149.0 148.6666667 -0.333333
18 129.0 129.3333333 0.333333
19 128.0 127.3333333 -0.666667
20 124.0 124.3333333 0.333333
21 133.0 130.3333333 -2.666667
22 139.0 139 0.000000
23 142.0 142.3333333 0.333333
24 137.0 137.6666667 0.666667
25 143.0 142 -1.000000
26 104.0 105.3333333 1.333333
27 181.0 180.6666667 -0.333333
28 180.0 179 -1.000000
29 108.0 107.6666667 -0.333333
30 110.0 110 0.000000
mean difference = -0.32 (unrounded: -0.322)
sd of the differences = 1.18
a. Examine the difference between the UCSF Clin Lab value (the gold standard) and the new device value with
a Bland-Altman plot. (1 pt). Does the difference between methods appear constant over the range of values? (1 pt)
(Note: Use the mean of the three replicates of the new device as the "final" result for the new device.)
The mean of the three new device trials is shown in the table above (column marked "new device mean")
The difference is taken by subtracting the UCSF glucose value from the new device mean glucose value, in column marked
"(new device - UCSF)". Note that you could also take the difference by subtracting the device from the UCSF value; you
simply have to keep track of what you did when expressing your final result.
A Bland-Altman plot can mean one of 3 things: a) plotting within-subject standard deviation vs within-subject mean if the task
is assessment of reproducibility; b) plotting the within-subject difference vs the mean of 2 measurement methods if the task
is assessment of agreement of two methods and neither is considered a gold standard; or c) plotting the within-subject
difference vs the value of a gold standard if the comparator technique is considered a gold standard, as defined by both
its accuracy and its having very exquisite reproducibility. For this last task, if we assume the UCSF technique is
the gold standard, recent methodologic work indicates that indeed plotting the value of the gold standard (and not the mean
of the two methods) on the x axis induces less correlation and is preferred. For this particular problem, credit is also
given for plotting the mean of the two values as this nuance was not covered in class or the readings.
[Figure: Bland-Altman plot; x axis: UCSF Value (80 to 200); y axis: New device minus UCSF (-3 to 4)]
The difference between the two methods appears approximately constant across the range of values. Of course, there is not
much data in the high range to evaluate. In Stata, one can look at a "moving average" of the difference between the devices
with a lowess smoother. When doing so in the plot below, the average difference is about constant over the range of values.
[Figure: Lowess smoother of the difference between methods; x axis: mean_ucsf_device (100 to 180); y axis: -4 to 4; bandwidth = .8]
b. Determine the bias associated with the new device. (1 pt)
Answer: -0.32. This is the mean of the differences (new device minus UCSF Clinical Laboratory value); on average the new device reads 0.32 units lower.
c. In comparison with the UCSF blood glucose measurement,
determine the 95% limits of agreement for the new at-home device. (Hint: Use the mean of
the three replicates of the new device as the "final" result for the new device.) (1 pt)
The 95% limits of agreement approach is first discussed by Bland
and Altman in 1986 (Lancet 1986:307-310) and reviewed again in 2003
(Ultrasound Obstet Gynecol 2003; 22:85-93). These articles have been distributed and are on the website.
The mean difference between the device and the UCSF value is -0.32 units.
(On average, the new device is 0.32 units lower than the true value).
The standard deviation of the differences is 1.18 units.
Assuming the differences are normally distributed,
95% limits of agreement = mean +/- (1.96)(sd)
95% limits of agreement: -0.32 +/- (1.96)(1.18) = -2.63 to 1.99 units
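The same limits can be obtained directly from the column of differences. The snippet below is a sketch, assuming the differences are typed in as displayed in the table (rounded to six decimals); the function name `limits_of_agreement` is hypothetical and simply packages the mean +/- 1.96 sd rule.

```python
import statistics

# Differences (new device mean - UCSF) for subjects 1..30, from the table above.
differences = [
    -2.000000, 3.000000, 0.666667, -2.000000, -2.000000, -1.333333,
    0.666667, -1.333333, -1.000000, -1.000000, 0.666667, -0.333333,
    1.333333, -1.000000, -0.333333, -0.333333, -0.333333, 0.333333,
    -0.666667, 0.333333, -2.666667, 0.000000, 0.333333, 0.666667,
    -1.000000, 1.333333, -0.333333, -1.000000, -0.333333, 0.000000,
]

def limits_of_agreement(diffs, z=1.96):
    """Return (bias, sd, lower, upper) for a list of paired differences."""
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample SD (n - 1 denominator)
    return bias, sd, bias - z * sd, bias + z * sd

bias, sd, lower, upper = limits_of_agreement(differences)
print(f"bias = {bias:.2f}, sd = {sd:.2f}")
print(f"95% limits of agreement: {lower:.2f} to {upper:.2f}")
```

Passing a different column of differences, such as one built from a single replicate, to the same function gives the corresponding limits for that choice of measurement.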
Solving this problem using just one measurement from the device, e.g. the first measurement:
subject UCSF glucose rep#1 (device - ucsf)
1 96.0 95 -1.0
2 139.0 142 3.0
3 168.0 168 0.0
4 126.0 123 -3.0
5 112.0 110 -2.0
6 116.0 115 -1.0
7 121.0 122 1.0
8 149.0 148 -1.0
9 106.0 105 -1.0
10 157.0 155 -2.0
11 166.0 168 2.0
12 188.0 188 0.0
13 117.0 118 1.0
14 116.0 114 -2.0
15 134.0 134 0.0
16 140.0 139 -1.0
17 149.0 149 0.0
18 129.0 130 1.0
19 128.0 128 0.0
20 124.0 124 0.0
21 133.0 130 -3.0
22 139.0 140 1.0
23 142.0 142 0.0
24 137.0 137 0.0
25 143.0 141 -2.0
26 104.0 105 1.0
27 181.0 180 -1.0
28 180.0 178 -2.0
29 108.0 108 0.0
30 110.0 110 0.0
mean difference = -0.40
sd of the differences = 1.40
In this case,
95% limits of agreement = mean +/- (1.96)(sd)
95% limits of agreement = -0.4 +/- (1.96)(1.40) = -3.14 to 2.34
Whether the mean of three attempts or just one attempt is used, the mean difference between the device and
the UCSF result is very small, indicating excellent validity of the device. As expected, taking more measurements
per subject results in closer agreement (narrower limits of agreement) with the gold standard.
d. Does this new device appear promising for clinical and/or research purposes?
If you could collect additional data, what would you want to know? (1 pt)
Thus far, the device does look promising for both research and clinical practice. For research, the ICC is 0.99, and
this is when used in a group with a relatively narrow range of values compared to what would typically be used in a
research study where more diseased patients would likely be represented. Hence, thus far, use of the new device will
not unduly increase the number of subjects needed in a research study. If this holds up with further testing (especially
of patients with larger values of glucose), it is not likely that improving the reproducibility of the new device will be
necessary for research. For clinical management, the device's "repeatability" is 7 mg/dl. This is smaller than the
magnitude of blood glucose values typically felt to be clinically significant. Hence, this is quite satisfactory. The
device also demonstrates validity in comparison to a gold standard blood-based assay. There is very little evidence
of systematic error compared to the gold standard (the mean difference between the device and the gold standard is
essentially zero). The 95% limits of agreement, even using one replicate, are -3.14 to 2.34. These values are also
smaller than those typically felt to be clinically relevant. Before the new device can be recommended for either
research or clinical use, testing of more individuals is needed. The values estimated for repeatability and for the 95%
limits of agreement are just estimates, which themselves have 95% confidence intervals. Testing more participants
would allow for narrowing of these confidence intervals and a very precise knowledge of the performance of the
device. In particular, the main additional data needed is testing over a wider range of glucose values, such as
persons with low values (30 to 90) and high values (over 200). Another acceptable answer is that you would want to
know the reproducibility of the "gold standard" (UCSF glucose).
e. What if the bias was -40 units? Would the new device be useless? (1 pt)
Answer: No. It could simply be calibrated by subtracting the bias from each measurement (with a bias of -40 units,
this means adding 40 units to each reading). This illustrates how bias in an interval scale is often not a problem
in that the inventors can just add or subtract some factor. The 95% limits of agreement, however, cannot so easily be altered.
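A small numerical sketch shows why calibration removes bias but leaves the width of the limits of agreement untouched. The differences below are hypothetical values invented for illustration only (a device reading about 40 units low); they are not data from the problem set.

```python
import statistics

# Hypothetical differences (device - gold standard) for a device that reads
# about 40 units low; these values are illustrative only.
biased_diffs = [-41.0, -37.0, -40.0, -43.0, -42.0, -41.0, -39.0, -37.0]

bias = statistics.mean(biased_diffs)
sd_before = statistics.stdev(biased_diffs)

# Calibrate by subtracting the estimated bias from every reading,
# which subtracts the same constant from every difference.
calibrated = [d - bias for d in biased_diffs]

print(f"bias before = {bias:.1f}, after = {statistics.mean(calibrated):.1f}")
print(f"sd before = {sd_before:.2f}, after = {statistics.stdev(calibrated):.2f}")
```

Because the standard deviation of the differences is unchanged by a constant shift, the limits of agreement keep the same width and are simply re-centered on zero.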
3. Examine the following data from Costantini et al. J. Immunological Methods. 278:145-55, 2003.
Cryopreservation (i.e. freezing) is the most commonly used procedure to store lymphocytes for
prolonged periods of time. To determine the effect of prolonged cryopreservation on the
immunophenotype of peripheral blood lymphocytes, we performed a comparative analysis of fresh
blood versus cryopreserved blood in a group of 19 normal individuals. Each individual had fresh
cells tested prior to freezing and then had their cells thawed after 2 months of being frozen at -80
C. Cells were tested for the proportion of lymphocytes which were CD4+ and CD8+.
As shown in the table, the authors conclude that no significant differences were observed following
cryopreservation in the mean proportions of the two major lymphoid subpopulations, CD4+ and
CD8+ cells (p=0.08 and 0.27, respectively).
              Fresh cells     Thawed cells after cryopreservation   p value (Student's t test)
CD4+ cells    44% +/- 6.4*    40% +/- 7.5                           0.08
CD8+ cells    27% +/- 7.8     30% +/- 8.9                           0.27
*values are expressed as mean +/- standard deviation in 19 individuals
The authors concluded that cryopreservation of cells was equivalent to the testing of fresh cells
for the enumeration of CD4+ and CD8+ cells. Do you agree with their conclusion? (1 pt)
What other analysis would you perform? (1 pt)
This is essentially a study looking at the utility of using frozen cells for the enumeration of CD4+ and CD8+ cells.
It is also sometimes called a methods agreement study. However, the analytic methods appear to be flawed.
It appears that the authors simply compared the mean obtained from one testing technique (upon fresh cells) to the
mean obtained from the other testing technique (upon previously frozen cells). This does not take into account the paired
nature of the data. We have no idea of whether those subjects with high values using the fresh cells also had high
values using frozen cells.
A better analysis would have plotted the within-subject difference using the two techniques on the y axis
versus the mean of the two techniques on the x axis. This is the Bland-Altman approach.
One would then find the mean difference between techniques and the 95% limits of agreement.
4. Consider the following abstract:
Objective
To determine the test–retest reproducibility of a self-report questionnaire (the Adolescent Sedentary Activities
Questionnaire; ASAQ) which assesses the time spent in a comprehensive range of sedentary activities,
among school-aged young people.
Method
Two-hundred and fifty school students aged 11–15 years from four primary and four high schools in metropolitan
Sydney (New South Wales, Australia) completed the questionnaire under the same conditions on two
occasions, 2 weeks apart during Autumn, 2002.
Results
Test–retest correlations for time total spent in sedentary behavior were ≥ 0.70, except for Grade 6 boys
(Intraclass correlation coefficient (ICC) = 0.57, 95%CI: 0.25, 0.76). Reproducibility was generally higher on week days
compared with week end days. There was little difference in the reproducibility across age groups.
Conclusions
ASAQ has good to excellent reproducibility in the measurement of a broad range of sedentary behaviors
among young people. ASAQ has good face validity, but further validity testing is required to provide
a complete assessment of the instrument.
a. Describe how you would interpret the intraclass correlation coefficient for Grade 6 boys.
What concerns would you have, if any, about the inferences derived from the ASAQ for grade 6 boys? (1 pt)
The ICC for Grade 6 boys is 0.57. This means that 57% of the total variability among Grade 6 boys is explained
by their true between-subject variability and 43% is explained by their within-subject variability.
This low ICC should raise concern about measurement error in a single day's measure of sedentary activity.
A low ICC can result in a biased estimate of effect in any study in which the measure is used.
b. What do you think contributes to the ICC being what it is for grade 6 boys? (1 pt)
There are two components of the within-subject variability. The first is the lack of reproducibility in the measurement
process itself. In other words, the questions on the ASAQ may be imperfect in getting at the same answer
every time it is administered in the same person. The second is the fact that the measure was done on two
separate days. Behaviorally, it may simply be the case that Grade 6 boys have different levels of activity on
different days. This is intrinsic variability not the fault of the measurement process itself.
c. You would like to use this same measure in a study you are doing among primary and high school students
in San Francisco. Is there any other information you would like to know? (1 pt)
Answer: Intraclass correlations are relevant only to the population they are derived from. Subject matter knowledge
about how between-subject variability compares in San Francisco versus Australia would
be helpful to evaluate whether the ICC derived in Australia pertains to San Francisco.
Ideally, the ICC would be calculated for the San Francisco sample as well.
5. Assume that you work in a general primary care clinic, one of 5 such clinics in an HMO.
Patients in the HMO are free to choose among the 5 different primary care clinics.
Your clinic chief has become concerned with "patient satisfaction" and has asked you to come up
with a short questionnaire to administer to clinic patients to assess this.
You come up with the following questions: "Are you satisfied with the clinic?"; "Are you
satisfied with your primary care provider?"; "Are you satisfied with the nursing staff?";
and "Are you satisfied with the reception desk staff?".
(Response options include: very satisfied, somewhat satisfied, somewhat dissatisfied,
and very dissatisfied.)
Comment on the ways you would determine the validity of these measurements. (2 pt)
Make sure to discuss aspects of content, construct, and criterion validity as described in the McDowell/Newell article.
There is no gold standard to which these measurements can be compared. Hence, in this situation, the entire
spectrum of aspects of validity needs to be looked at to see if this question is indeed valid:
a. Content validity. Two types of content validity exist: a) face validity and b) sampling validity. Both are very
subjective. Face validity asks if the question "makes sense" as a measurement, to most observers. Our questions
appear to pass the face validity check, but this is not saying much. Sampling validity refers to whether the entire
spectrum, or a representative sample, of concepts concerning the entity in question is being asked. In our situation,
there are many different aspects of a clinic experience that need to be evaluated to assess satisfaction. Although
there are more things that could be evaluated (e.g., satisfied with the medical advice you got or satisfied with the
ease of getting an appointment), what is asked appears to be reasonably representative of what could be asked.
b. Construct validity. This refers to whether the performance of the measurement fits in with one’s theoretical
understanding of the system under study. For example, you could determine how known patients responded to these
questions. If you are aware of a group of chronically disgruntled patients and a group who always brings appreciative
gifts to clinic each time they visit, you might compare the responses to the question by these two groups. If the
responses differ by group, you would say the question had construct validity.
Construct validity is important to understand because it is referred to often in the literature. As another example, the
Course Director helped develop an assay for antibodies to human herpesvirus 8, the causative agent of Kaposi’s
sarcoma. Gold standards for the definitive presence or absence of human herpesvirus 8 infection are not available
(short of persons with Kaposi’s sarcoma). However, it is known clinically that among HIV-infected persons,
homosexual men develop Kaposi’s sarcoma frequently while other persons (e.g., heterosexual men and women)
develop it only rarely. Thus, it is theoretically suggested that homosexual men are commonly infected with the virus
but other persons are not. When the new assay was used to test homosexual men vs heterosexual men/women, it
was found that homosexual men had a high prevalence of antibody-reactivity and heterosexual men/women had a
very low prevalence of reactivity. Thus, it was determined that the new antibody assay had construct validity. Its
performance fits in with the existing theoretical understanding of the system.
c. Criterion validity. This refers to whether the measurement is correlated with an external measurement; it is also
known as empirical validity. One type of empirical validity is concurrent empirical validity, which requires a concurrent
gold standard. This is not present here. Another type of empirical validity is predictive validity, which refers to
whether the measurement is correlated with an external criterion that occurs after the measurement. To assess
predictive validity, you might assume that patient satisfaction will ultimately help determine whether a given patient
decides to continue to receive care at a particular clinic or move to another clinic. In this situation, it would be
expected that most people would continue to need primary care services in the future. Therefore, you could assess
the predictive validity of this question by seeing how responses to it correlated with patients' subsequent use of the
clinic.
The distinction between empirical validity and construct validity is often murky. Classifying validity checks as criterion
versus construct is not as important as realizing that using other information about persons (either concurrent or
future information) can be used to assess the validity of a measurement.