Epi 203: Epidemiologic Methods

Problem Set 7: Understanding Measurement: Aspects of Reproducibility and Validity

ANSWER KEY



Due: 11/10/09 at 1:30PM section


Possible points: 25



The calculations are most easily performed using the "functions"

and graphing capacity of Microsoft Excel.

This Excel file already has the data entered.



1. You are asked to be a consultant for the development of a new

at-home non-invasive needleless blood sugar (glucose) monitoring device.

The company has presented you with some reproducibility data, obtained on the same machine.

They have sampled 30 patients and performed 3 replicate measurements

on each patient, one replicate right after the other, which are shown below:

(Clinical note: The normal range for blood glucose is about 70 to 110 mg/dl)



subject  rep#1  rep#2  rep#3  mean    within-sub sd  within-sub variance
1        95     96     91     94      2.64575131     7
2        142    145    139    142     3              9
3        168    171    167    168.67  2.081666       4.3333
4        123    126    123    124     1.73205081     3
5        110    113    107    110     3              9
6        115    116    113    114.67  1.52752523     2.3333
7        122    125    118    121.67  3.51188458     12.333
8        148    148    147    147.67  0.57735027     0.3333
9        105    108    102    105     3              9
10       155    159    154    156     2.64575131     7
11       168    169    163    166.67  3.21455025     10.333
12       188    188    187    187.67  0.57735027     0.3333
13       118    120    117    118.33  1.52752523     2.3333
14       114    118    113    115     2.64575131     7
15       134    137    130    133.67  3.51188458     12.333
16       139    143    137    139.67  3.05505046     9.3333
17       149    150    147    148.67  1.52752523     2.3333
18       130    133    125    129.33  4.04145188     16.333
19       128    129    125    127.33  2.081666       4.3333
20       124    129    120    124.33  4.50924975     20.333
21       130    131    130    130.33  0.57735027     0.3333
22       140    141    136    139     2.64575131     7
23       142    144    141    142.33  1.52752523     2.3333
24       137    141    135    137.67  3.05505046     9.3333
25       141    145    140    142     2.64575131     7
26       105    107    104    105.33  1.52752523     2.3333
27       180    182    180    180.67  1.15470054     1.3333
28       178    182    177    179     2.64575131     7
29       108    108    107    107.67  0.57735027     0.3333
30       110    113    107    110     3              9

common sd = 2.5451
common sd x 1.96 = 4.9885
common sd x 1.96 x 1.41 = 7.0338

a. Determine the within-subject standard deviation for each subject. (2 pts)

see column above marked "within-sub sd"



b. Plot the within-subject standard deviation versus the mean for each person. Comment

on any relationship between the within-subject standard deviation and the mean. (2 pts)



see graph below





[Figure: within-subject mean vs standard deviation. x axis: within-subject mean (90 to 190); y axis: within-subject sd (0 to 5)]





There appears to be no relationship between the within-subject sd and the mean.

(Visual inspection is fine, but formally the Pearson correlation coefficient = -0.18, p=0.33)



c. Determine the "common" within-subject standard deviation. (1 pt)



add the within-subject variances (column marked within-sub variance),

divide by 30 to get the mean, and take the square root to get the number back to its original units.

This equals 2.54 units. The formula for this is shown in cell I50.



Note: A frequent mistake in calculating the "common" within-subject standard deviation is to average the within-subject SD's,

rather than the variances. The result for this dataset would be 2.326. In this case, close but not correct.
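
The calculation in part c can also be checked outside Excel. The following is a minimal Python sketch (it is not part of the course's Excel file, and the variable names are illustrative) that recomputes each within-subject variance from the replicate table in Question 1, pools the variances, and contrasts the result with the incorrect approach of averaging the SDs.

```python
from math import sqrt
from statistics import mean, stdev, variance

# Three replicate glucose readings (mg/dl) per subject, copied from the table in Question 1.
replicates = [
    (95, 96, 91), (142, 145, 139), (168, 171, 167), (123, 126, 123), (110, 113, 107),
    (115, 116, 113), (122, 125, 118), (148, 148, 147), (105, 108, 102), (155, 159, 154),
    (168, 169, 163), (188, 188, 187), (118, 120, 117), (114, 118, 113), (134, 137, 130),
    (139, 143, 137), (149, 150, 147), (130, 133, 125), (128, 129, 125), (124, 129, 120),
    (130, 131, 130), (140, 141, 136), (142, 144, 141), (137, 141, 135), (141, 145, 140),
    (105, 107, 104), (180, 182, 180), (178, 182, 177), (108, 108, 107), (110, 113, 107),
]

# Correct approach: average the within-subject variances, then take the square root.
within_variances = [variance(r) for r in replicates]
common_sd = sqrt(mean(within_variances))

# Frequent mistake: average the within-subject SDs directly.
mean_of_sds = mean([stdev(r) for r in replicates])

print(f"common within-subject sd    = {common_sd:.4f}")
print(f"mean of the sds (incorrect) = {mean_of_sds:.3f}")
```

Because the square root is a concave function, the average of the SDs can never exceed the square root of the average variance, which is why the mistaken value comes out smaller.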



d. The difference between any single measurement of glucose on a subject and the value that

would be obtained if you could take the mean of numerous replicates on a subject (i.e.,

the subject's true value as measured by the device) would be expected to be less than what

value for 95% of all individual measurements? (2 pts)



Assuming that the differences between a subject's measured and true glucose are normally

distributed with a mean of zero (this is the assumption of random error) and a standard deviation

estimated by the common within-subject standard deviation, then 95% of replicates

would be expected to be within 1.96 standard deviations of the true value.










1.96 x 2.54 = 4.99 units

95% of observed values would be expected to be within 4.99 units of the true value.





e. The difference between any two measurements on the same individual would be expected to be less

than what value for 95% of all pairs of measurements? (2 pts)







Following the derivation performed in class (taken from Bland and Altman),

1.96 x square root of 2 x 2.54 = 7.03 units

(computed with 1.41 for the square root of 2 and the unrounded sd of 2.5451; using 1.4142 gives about 7.05)

The difference between measurements for 95% of all measurement pairs would be expected to be

less than 7.03 units. Another way to consider this is that two measurements on the same

subject could differ by as much as 7.03 units because of measurement error alone.
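
As a rough check of parts d and e, the two limits can be computed in a few lines of Python. This is only a sketch under the same assumption stated above (normally distributed random error with standard deviation equal to the common within-subject sd); the variable names are illustrative. The factor of the square root of 2 arises because the difference of two independent measurements has twice the variance of a single measurement.

```python
from math import sqrt

common_sd = 2.5451  # common within-subject sd from part c (mg/dl)

# Part d: 95% of single measurements fall within this distance of the subject's true value.
single_limit = 1.96 * common_sd

# Part e: 95% of differences between two measurements on the same subject are smaller than this.
# Var(X1 - X2) = 2 * common_sd**2, so the sd of a difference is sqrt(2) * common_sd.
pair_limit = 1.96 * sqrt(2) * common_sd

print(f"single measurement vs true value: {single_limit:.2f} units")
print(f"difference between two measurements: {pair_limit:.2f} units")
```

With the unrounded square root of 2 the second value is slightly larger than the hand calculation, which rounds the square root of 2 to 1.41.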





f. Is the coefficient of variation a useful measure of reproducibility here? (2 pts)



Not really. The within-subject standard deviation is fairly constant across the range of values, so the ratio

of the standard deviation to the underlying true value is not constant. No single number

adequately characterizes the coefficient of variation.
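
A short Python sketch makes the point concrete. It assumes the common within-subject sd of 2.5451 applies at both ends of the observed range and uses the lowest and highest subject means from the replicate table (subjects 1 and 12); the variable names are illustrative.

```python
common_sd = 2.5451  # common within-subject sd from part c (mg/dl)

# Lowest and highest subject means in the replicate table (subjects 1 and 12).
lowest_mean, highest_mean = 94.0, 187.67

# Coefficient of variation = sd / mean, expressed here as a percentage.
cv_low = 100 * common_sd / lowest_mean
cv_high = 100 * common_sd / highest_mean

print(f"CV at a mean of {lowest_mean}: {cv_low:.2f}%")
print(f"CV at a mean of {highest_mean}: {cv_high:.2f}%")
```

The coefficient of variation roughly halves from the low end to the high end of the range, so quoting one value would misrepresent the device.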





g. Using this new device, examine the following output from Stata. State the ICC.

State in words what the ICC means. (1 pt)

. loneway Glucose Subject

                One-way Analysis of Variance for Glucose:

                                              Number of obs =        90

    Source                SS         df      MS            F     Prob > F
-------------------------------------------------------------------------
Between Subject        50978.056     29    1757.864     271.37     0.0000
Within Subject         388.66667     60   6.4777778
-------------------------------------------------------------------------
Total                  51366.722     89   577.15418

         Intraclass       Asy.
         correlation      S.E.       [95% Conf. Interval]
         ------------------------------------------------
            0.98903     0.00348       0.98220     0.99585









The intraclass correlation coefficient is defined as the fraction of the total variance that is due to the variability between

individuals. In this case, the ICC = 0.989. This means that about 99% of the variability in glucose measurements with the

new device is due to the intrinsic variability of underlying glucose values among the participants. Only 1% of the variability

is due to random measurement error.
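
The ICC reported by Stata can be reproduced from the mean squares in the ANOVA table. The sketch below uses the standard one-way random-effects variance-component formulas with 3 replicates per subject; it is an illustration with made-up variable names, not a description of Stata's internal code.

```python
ms_between = 1757.864   # Between Subject mean square from the loneway output
ms_within = 6.4777778   # Within Subject mean square from the loneway output
k = 3                   # replicates per subject

# Variance components: within-subject variance is the within mean square;
# between-subject variance is (MS_between - MS_within) / k.
var_within = ms_within
var_between = (ms_between - ms_within) / k

# ICC = between-subject variance as a fraction of the total variance.
icc = var_between / (var_between + var_within)
print(f"ICC = {icc:.5f}")
```

Note that the within-subject mean square (6.48) is the square of the common within-subject sd found in Question 1c (2.5451).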







2. These same 30 subjects have blood drawn during the session when the measurements with the new

device are taken. The blood glucose level is measured at the UCSF Clinical Laboratory - with one assay - as shown below:





subject  UCSF glucose  new device mean  (new device - UCSF)
1        96.0          94               -2.000000
2        139.0         142              3.000000
3        168.0         168.6666667      0.666667
4        126.0         124              -2.000000
5        112.0         110              -2.000000
6        116.0         114.6666667      -1.333333
7        121.0         121.6666667      0.666667
8        149.0         147.6666667      -1.333333
9        106.0         105              -1.000000
10       157.0         156              -1.000000
11       166.0         166.6666667      0.666667
12       188.0         187.6666667      -0.333333
13       117.0         118.3333333      1.333333
14       116.0         115              -1.000000
15       134.0         133.6666667      -0.333333
16       140.0         139.6666667      -0.333333
17       149.0         148.6666667      -0.333333
18       129.0         129.3333333      0.333333
19       128.0         127.3333333      -0.666667
20       124.0         124.3333333      0.333333
21       133.0         130.3333333      -2.666667
22       139.0         139              0.000000
23       142.0         142.3333333      0.333333
24       137.0         137.6666667      0.666667
25       143.0         142              -1.000000
26       104.0         105.3333333      1.333333
27       181.0         180.6666667      -0.333333
28       180.0         179              -1.000000
29       108.0         107.6666667      -0.333333
30       110.0         110              0.000000

mean difference = -0.32

sd of the differences = 1.18



a. Examine the difference between the UCSF Clin Lab value (the gold standard) and the new device value with

a Bland-Altman plot. (1 pt). Does the difference between methods appear constant over the range of values? (1 pt)

(Note: Use the mean of the three replicates of the new device as the "final" result for the new device.)



The mean of the three new device trials is shown in the table above (column marked "new device mean")

The difference is taken by subtracting the UCSF glucose value from the new device mean glucose value, in column marked

"(new device - UCSF)". Note that you could also take the difference by subtracting the device from the UCSF value; you

simply have to keep track of what you did when expressing your final result.



A Bland-Altman plot can mean one of 3 things: a) plotting within-subject standard deviation vs within-subject mean if the task

is assessment of reproducibility; b) plotting the within-subject difference vs the mean of 2 measurement methods if the task

is assessment of agreement of two methods and neither is considered a gold standard; or c) plotting the within-subject

difference vs the value of a gold standard if the comparator technique is considered a gold standard, as defined by both

its accuracy and its exquisite reproducibility. For this last task, if we assume the UCSF technique is

the gold standard, recent methodologic work indicates that plotting the value of the gold standard (and not the mean

of the two methods) on the x axis induces less correlation and is preferred. For this particular problem, credit is also

given for plotting the mean of the two values, as this nuance was not covered in class or the readings.









[Figure: Bland-Altman plot. x axis: UCSF Value (80 to 200); y axis: New device minus UCSF (-3 to 4)]









The difference between the two methods appears approximately constant across the range of values. Of course, there is not

much data in the high range to evaluate. In Stata, one can look at a "moving average" of the difference between the devices

with a lowess smoother. When doing so in the plot below, the average difference is about constant over the range of values.



[Figure: Lowess smoother of the difference between methods (y axis, -4 to 4) versus mean_ucsf_device (x axis, 100 to 180); bandwidth = .8]





b. Determine the bias associated with the new device. (1 pt)

Answer: -0.32. This is calculated as the mean of the differences (new device mean minus UCSF Clinical Laboratory value).



c. In comparison with the UCSF blood glucose measurement,

determine the 95% limits of agreement for the new at-home device. (Hint: Use the mean of

the three replicates of the new device as the "final" result for the new device.) (1 pt)

The 95% limits of agreement approach is first discussed by Bland

and Altman in 1986 (Lancet 1986:307-310) and reviewed again in 2003

(Ultrasound Obstet Gynecol 2003; 22:85-93). These articles have been distributed and are on the website.



The mean difference between the device and the UCSF value is -0.32 units.

(On average, the new device is 0.32 units lower than the true value).

The standard deviation of the differences is 1.18 units.



Assuming the differences are normally distributed,



95% limits of agreement = mean +/- (1.96)(sd)

95% limits of agreement: -0.32 +/- (1.96)(1.18) = -2.63 to 1.99 units



Solving this problem using just one measurement from the device, e.g. the first measurement:



subject  UCSF glucose  rep#1  (device - UCSF)
1        96.0          95     -1.0
2        139.0         142    3.0
3        168.0         168    0.0
4        126.0         123    -3.0
5        112.0         110    -2.0
6        116.0         115    -1.0
7        121.0         122    1.0
8        149.0         148    -1.0
9        106.0         105    -1.0
10       157.0         155    -2.0
11       166.0         168    2.0
12       188.0         188    0.0
13       117.0         118    1.0
14       116.0         114    -2.0
15       134.0         134    0.0
16       140.0         139    -1.0
17       149.0         149    0.0
18       129.0         130    1.0
19       128.0         128    0.0
20       124.0         124    0.0
21       133.0         130    -3.0
22       139.0         140    1.0
23       142.0         142    0.0
24       137.0         137    0.0
25       143.0         141    -2.0
26       104.0         105    1.0
27       181.0         180    -1.0
28       180.0         178    -2.0
29       108.0         108    0.0
30       110.0         110    0.0

mean difference = -0.40

sd of the differences = 1.40

In this case,

95% limits of agreement = mean +/- (1.96)(sd)

95% limits of agreement = -0.4 +/- (1.96)(1.40) = -3.14 to 2.34
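
Both limits-of-agreement calculations above can be reproduced with a short Python sketch. The data are copied from the two tables in this question; the function name `limits_of_agreement` and the variable names are illustrative, not from the course materials. Because the code carries the unrounded standard deviation, the single-replicate limits come out at about -3.15 to 2.35, differing in the second decimal from the hand calculation that uses the rounded sd of 1.40.

```python
from statistics import mean, stdev

ucsf = [96, 139, 168, 126, 112, 116, 121, 149, 106, 157, 166, 188, 117, 116, 134,
        140, 149, 129, 128, 124, 133, 139, 142, 137, 143, 104, 181, 180, 108, 110]

# Mean of the three device replicates per subject ("new device mean" column).
device_mean = [94, 142, 168.6666667, 124, 110, 114.6666667, 121.6666667, 147.6666667,
               105, 156, 166.6666667, 187.6666667, 118.3333333, 115, 133.6666667,
               139.6666667, 148.6666667, 129.3333333, 127.3333333, 124.3333333,
               130.3333333, 139, 142.3333333, 137.6666667, 142, 105.3333333,
               180.6666667, 179, 107.6666667, 110]

# First device replicate only ("rep#1" column).
device_rep1 = [95, 142, 168, 123, 110, 115, 122, 148, 105, 155, 168, 188, 118, 114, 134,
               139, 149, 130, 128, 124, 130, 140, 142, 137, 141, 105, 180, 178, 108, 110]


def limits_of_agreement(test, reference, z=1.96):
    """Return (bias, sd, lower, upper) for test-minus-reference differences."""
    diffs = [t - r for t, r in zip(test, reference)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    return bias, sd, bias - z * sd, bias + z * sd


bias3, sd3, lower3, upper3 = limits_of_agreement(device_mean, ucsf)
bias1, sd1, lower1, upper1 = limits_of_agreement(device_rep1, ucsf)

print(f"mean of 3 replicates: bias {bias3:.2f}, sd {sd3:.2f}, limits {lower3:.2f} to {upper3:.2f}")
print(f"first replicate only: bias {bias1:.2f}, sd {sd1:.2f}, limits {lower1:.2f} to {upper1:.2f}")
```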



Regardless of whether the mean of three attempts or just one attempt is used, the mean difference between the device and

the UCSF result is very small, indicating excellent validity of the device. As expected, taking more measurements

per subject results in closer agreement (narrower limits of agreement) with the gold standard.





d. Does this new device appear promising for clinical and/or research purposes?

If you could collect additional data, what would you want to know? (1 pt)

Thus far, the device does look promising for both research and clinical practice. For research, the ICC is 0.99, and

this is when used in a group with a relatively narrow range of values compared to what would typically be used in a

research study where more diseased patients would likely be represented. Hence, thus far, use of the new device will

not unduly increase the number of subjects needed in a research study. If this holds up with further testing (especially

of patients with larger values of glucose), it is not likely that improving the reproducibility of the new device will be

necessary for research. For clinical management, the device's "repeatability" is 7 mg/dl. This is smaller than the

magnitude of blood glucose values typically felt to be clinically significant. Hence, this is quite satisfactory. The

device also demonstrates validity in comparison to a gold standard blood-based assay. There is very little evidence

of systematic error compared to the gold standard (the mean difference between the device and the gold standard is

essentially zero). The 95% limits of agreement, even using one replicate, are -3.14 to 2.34. These values are also

smaller than those typically felt to be clinically relevant. Before the new device can be recommended for either

research or clinical use, testing of more individuals is needed. The values estimated for repeatability and for the 95%

limits of agreement are just estimates, which themselves have 95% confidence intervals. Testing more participants

would allow for narrowing of these confidence intervals and a very precise knowledge of the performance of the

device. In particular, the main additional data needed is testing over a wider range of glucose values, such as

persons with low values (30 to 90) and high values (over 200). Another acceptable answer is that you would want to

know the reproducibility of the "gold standard" (UCSF glucose).









e. What if the bias was -40 units? Would the new device be useless? (1 pt)

Answer: No. It could simply be calibrated by adding 40 to each measurement (a bias of -40 means the device reads 40 units too low). This illustrates

how bias in an interval scale is often not a problem in that the inventors can just

add or subtract some factor. The width of the 95% limits of agreement, however, cannot so easily be altered.
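
The point can be illustrated with a small Python sketch. The five differences below are hypothetical values invented for illustration (they are not data from the problem set); removing a constant offset recenters the differences on zero but leaves their spread, and hence the width of the limits of agreement, untouched.

```python
from statistics import mean, stdev

# Hypothetical device-minus-reference differences with a large constant bias.
diffs = [-42.0, -37.0, -39.5, -41.0, -40.5]

bias = mean(diffs)

# Calibration: remove the estimated bias from every difference.
calibrated = [d - bias for d in diffs]

width_before = 2 * 1.96 * stdev(diffs)
width_after = 2 * 1.96 * stdev(calibrated)

print(f"bias before calibration: {bias:.1f}, after: {mean(calibrated):.1f}")
print(f"width of 95% limits before: {width_before:.2f}, after: {width_after:.2f}")
```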





3. Examine the following data from Costantini et al. J. Immunological Methods. 278:145-55, 2003.










Cryopreservation (i.e. freezing) is the most commonly used procedure to store lymphocytes for

prolonged periods of time. To determine the effect of prolonged cryopreservation on the

immunophenotype of peripheral blood lymphocytes, we performed a comparative analysis of fresh

blood versus cryopreserved blood in a group of 19 normal individuals. Each individual had fresh

cells tested prior to freezing and then had their cells thawed after 2 months of being frozen at

-80 °C. Cells were tested for the proportion of lymphocytes which were CD4+ and CD8+.

As shown in the table, the authors conclude that no significant differences were observed following

cryopreservation in the mean proportions of the two major lymphoid subpopulations, CD4+ and

CD8+ cells (p=0.08 and 0.27, respectively).









             Fresh cells     Thawed cells after cryopreservation   p value (Student's t test)
CD4+ cells   44% +/- 6.4*    40% +/- 7.5                           0.08
CD8+ cells   27% +/- 7.8     30% +/- 8.9                           0.27

*values are expressed as mean +/- standard deviation in 19 individuals



The authors concluded that cryopreservation of cells was equivalent to the testing of fresh cells

for the enumeration of CD4+ and CD8+ cells. Do you agree with their conclusion? (1 pt)

What other analysis would you perform? (1 pt)





This is essentially a study looking at the utility of using frozen cells for the enumeration of CD4+ and CD8+ cells.

It is also sometimes called a methods agreement study. However, the analytic methods appear to be flawed.

It appears that the authors simply compared the mean obtained from one testing technique (upon fresh cells) to the

mean obtained from the other testing technique (upon previously frozen cells). This does not take into account the paired

nature of the data. We have no idea of whether those subjects with high values using the fresh cells also had high

values using frozen cells.



A better analysis would have plotted the within-subject difference using the two techniques on the y axis

versus the mean of the two techniques on the x axis. This is the Bland-Altman approach.

One would then find the mean difference between techniques and the 95% limits of agreement.
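
To see why comparing group means says little about agreement, consider the following Python sketch. The four pairs of values are hypothetical and deliberately extreme (they are not the Costantini et al. data): the two techniques have identical means, yet the paired differences show poor agreement within individuals.

```python
from statistics import mean, stdev

# Hypothetical CD4+ percentages for four individuals (made up for illustration).
fresh = [38.0, 42.0, 46.0, 50.0]
thawed = [50.0, 46.0, 42.0, 38.0]

# A comparison of group means sees no difference at all.
print(f"mean fresh = {mean(fresh):.1f}, mean thawed = {mean(thawed):.1f}")

# The paired (Bland-Altman) analysis works with within-subject differences.
diffs = [t - f for t, f in zip(thawed, fresh)]
bias = mean(diffs)
sd = stdev(diffs)
lower, upper = bias - 1.96 * sd, bias + 1.96 * sd

print(f"mean difference = {bias:.1f}")
print(f"95% limits of agreement = {lower:.1f} to {upper:.1f}")
```

Here the mean difference is zero, but an individual's thawed result could plausibly differ from the fresh result by some 20 percentage points in either direction, which a comparison of means cannot reveal.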



4. Consider the following abstract:

Objective

To determine the test–retest reproducibility of a self-report questionnaire (the Adolescent Sedentary Activities

Questionnaire; ASAQ) which assesses the time spent in a comprehensive range of sedentary activities,

among school-aged young people.

Method

Two-hundred and fifty school students aged 11–15 years from four primary and four high schools in metropolitan

Sydney (New South Wales, Australia) completed the questionnaire under the same conditions on two

occasions, 2 weeks apart during Autumn, 2002.

Results

Test–retest correlations for time total spent in sedentary behavior were ≥ 0.70, except for Grade 6 boys

(Intraclass correlation coefficient (ICC) = 0.57, 95%CI: 0.25, 0.76). Reproducibility was generally higher on week days

compared with week end days. There was little difference in the reproducibility across age groups.

Conclusions

ASAQ has good to excellent reproducibility in the measurement of a broad range of sedentary behaviors

among young people. ASAQ has good face validity, but further validity testing is required to provide

a complete assessment of the instrument.



a. Describe how you would interpret the intraclass correlation coefficient for Grade 6 boys.

What concerns would you have, if any, about the inferences derived from the ASAQ for grade 6 boys? (1 pt)

The ICC for Grade 6 boys is 0.57. This means that 57% of the total variability among Grade 6 boys is explained

by their true between-subject variability and 43% is explained by their within-subject variability.

This low ICC should raise concern about measurement error in a single administration of the questionnaire.

Such measurement error can bias estimates of effect in any study in which the ASAQ is used.








b. What do you think contributes to the ICC being what it is for grade 6 boys? (1 pt)

There are two components of the within-subject variability. The first is the lack of reproducibility in the measurement

process itself. In other words, the questions on the ASAQ may be imperfect in getting at the same answer

every time it is administered in the same person. The second is the fact that the measure was done on two

separate days. Behaviorally, it may simply be the case that Grade 6 boys have different levels of activity on

different days. This is intrinsic variability not the fault of the measurement process itself.



c. You would like to use this same measure in a study you are doing among primary and high school students

in San Francisco. Is there any other information you would like to know? (1 pt)

Answer: Intraclass correlations are relevant only to the population they are derived from. Subject matter knowledge

about how between-subject variability compares in San Francisco versus Australia would

be helpful to evaluate whether the ICC derived in Australia pertains to San Francisco.

Ideally, the ICC would be calculated for the San Francisco sample as well.



5. Assume that you work in a general primary care clinic, one of 5 such clinics in an HMO.

Patients in the HMO are free to choose among the 5 different primary care clinics.

Your clinic chief has become concerned with "patient satisfaction" and has asked you to come up

with a short questionnaire to administer to clinic patients to assess this.

You come up with the following questions: "Are you satisfied with the clinic?"; "Are you

satisfied with your primary care provider?"; "Are you satisfied with the nursing staff?";

and "Are you satisfied with the reception desk staff?".

(Response options include: very satisfied, somewhat satisfied, somewhat dissatisfied,

and very dissatisfied.)

Comment on the ways you would determine the validity of these measurements. (2 pt)

Make sure to discuss aspects of content, construct, and criterion validity as described in the McDowell/Newell article.





There is no gold standard to which these measurements can be compared. Hence, in this situation, the entire

spectrum of aspects of validity needs to be examined to see whether these questions are indeed valid:



a. Content validity. Two types of content validity exist: a) face validity and b) sampling validity. Both are very

subjective. Face validity asks if the question "makes sense" as a measurement, to most observers. Our questions

appear to pass the face validity check, but this is not saying much. Sampling validity refers to whether the entire

spectrum, or a representative sample, of concepts concerning the entity in question is being asked. In our situation,

there are many different aspects of a clinic experience that need to be evaluated to assess satisfaction. Although

there are more things that could be evaluated (e.g., satisfied with the medical advice you got or satisfied with the

ease of getting an appointment), what is asked appears to be reasonably representative of what could be asked.



b. Construct validity. This refers to whether the performance of the measurement fits in with one’s theoretical

understanding of the system under study. For example, you could determine how known patients responded to these

questions. If you are aware of a group of chronically disgruntled patients and a group who always brings appreciative

gifts to clinic each time they visit, you might compare the responses to the question by these two groups. If the

responses differ by group, you would say the question had construct validity.



Construct validity is important to understand because it is referred to often in the literature. As another example, the

Course Director helped develop an assay for antibodies to human herpesvirus 8, the causative agent of Kaposi’s

sarcoma. Gold standards for the definitive presence or absence of human herpesvirus 8 infection are not available

(short of persons with Kaposi’s sarcoma). However, it is known clinically that among HIV-infected persons,

homosexual men develop Kaposi’s sarcoma frequently while other persons (e.g., heterosexual men and women)

develop it only rarely. Thus, it is theoretically suggested that homosexual men are commonly infected with the virus

but other persons are not. When the new assay was used to test homosexual men vs heterosexual men/women, it

was found that homosexual men had a high prevalence of antibody-reactivity and heterosexual men/women had a

very low prevalence of reactivity. Thus, it was determined that the new antibody assay had construct validity. Its

performance fits in with the existing theoretical understanding of the system.



c. Criterion validity. This refers to whether the measurement is correlated with an external measurement; it is also

known as empirical validity. One type of empirical validity is concurrent empirical validity, which requires a concurrent

gold standard. This is not present here. Another type of empirical validity is predictive validity, which refers to

whether the measurement is correlated with an external criterion that occurs after the measurement. To assess

predictive validity, you might assume that patient satisfaction will ultimately help determine whether a given patient






decides to continue to receive care at a particular clinic or move to another clinic. In this situation, it would be

expected that most people would continue to need primary care services in the future. Therefore, you could assess

the predictive validity of this question by seeing how responses to it correlated with patients' subsequent use of the

clinic.



The distinction between empirical validity and construct validity is often murky. Classifying validity checks as criterion

versus construct is not as important as realizing that other information about persons (either concurrent or

future information) can be used to assess the validity of a measurement.










