University of Illinois at Chicago
School of Public Health
Epidemiology-Biostatistics Division
PRELIMINARY EXAMINATION
Ph.D. in Biostatistics
Part II
September 20, 2004
• Three questions are given.
• A good and complete answer to one part of a question is better than incomplete answers
to two parts.
• Start each question on a separate page.
• Number your pages consecutively in the upper right-hand corner.
• Put your code number (not your name) next to the page number.
Code Number: 001
1). Although tea is the world’s most widely consumed beverage after water, little is
known about its nutritional value. Folacin is the only B vitamin present in any significant
amount in tea, and recent advances in assay methods have made accurate determination of
folacin content feasible. Consider the accompanying data on folacin content for randomly
selected specimens of the four leading brands of green tea.
Brand1: 7.9, 6.2, 6.6, 8.6, 8.9, 10.1, 9.6
Brand2: 5.7, 7.5, 9.8, 6.1, 8.4
Brand3: 6.8, 7.5, 5.0, 7.4, 5.3, 6.1
Brand4: 6.4, 7.1, 7.9, 4.5, 5.0, 4.0
(a). What is the response variable here?
(b). Identify the factor studied and the factor levels.
(c). Fill in the blanks in the following one-way ANOVA table:
Analysis of Variance
Source  DF  SS     MS  F  P
Factor  ?   23.50  ?   ?  0.028
Error   ?   ?      ?
Total   ?   65.27
Level  N  Mean   StDev
B1     7  8.271  1.463
B2     5  7.500  1.681
B3     6  6.350  1.060
B4     6  5.817  1.551
(d). What are the fitted values?
(e). Obtain the residuals.
(f). Regarding the folacin content problem, suppose you are to fit a regression model without
an intercept. Set up the design matrix and find the estimated regression coefficients.
(g). Suppose you are to fit a regression model with an intercept. Set up the design matrix
and find the estimated regression coefficients.
(h). Why can’t you fit an intercept term and four indicator variables for treatments at the
same time?
(i). You probably chose the last brand as the reference in part g. Would the regression
coefficients change if you had chosen Brand2 as the reference? If yes, write down the
new coefficients.
(j). Estimate the mean folacin level for Brand3 with 95% confidence.
(k). Estimate the standard deviation of the mean folacin content difference between Brand1
and Brand4.
(l). Construct a 99% confidence interval for the difference between folacin content of Brand1
and the average folacin content of other brands. Is there a contrast here?
(m). Suppose you are interested in conducting a simultaneous testing procedure on a few
linear combinations rather than the linear combination in part l. If you use any of
the multiple comparison procedures (Scheffé, Bonferroni, Tukey, etc.), will you get narrower
or wider confidence intervals for the parameter of interest in part l? Explain.
(Assume that comparisonwise error rate is kept constant.)
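The one-way ANOVA quantities referred to in Question 1 can be recomputed directly from the listed data. The following is a minimal sketch in plain Python, not an official solution: the variable names are illustrative, and Brand4 is assumed as the reference level, as part (i) suggests. It also illustrates that the no-intercept (cell-means) coefficients equal the brand means, while the with-intercept coefficients are differences from the reference brand mean.

```python
# Folacin data exactly as listed in Question 1 (brand -> specimen values).
data = {
    "Brand1": [7.9, 6.2, 6.6, 8.6, 8.9, 10.1, 9.6],
    "Brand2": [5.7, 7.5, 9.8, 6.1, 8.4],
    "Brand3": [6.8, 7.5, 5.0, 7.4, 5.3, 6.1],
    "Brand4": [6.4, 7.1, 7.9, 4.5, 5.0, 4.0],
}


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


all_values = [y for ys in data.values() for y in ys]
n_total = len(all_values)   # total number of specimens
k = len(data)               # number of factor levels (brands)
grand_mean = mean(all_values)
group_means = {brand: mean(ys) for brand, ys in data.items()}

# Sum-of-squares decomposition: SS_total = SS_factor + SS_error.
ss_factor = sum(
    len(ys) * (group_means[brand] - grand_mean) ** 2
    for brand, ys in data.items()
)
ss_error = sum(
    (y - group_means[brand]) ** 2
    for brand, ys in data.items()
    for y in ys
)
ss_total = sum((y - grand_mean) ** 2 for y in all_values)

# Degrees of freedom, mean squares, and the F statistic.
df_factor, df_error, df_total = k - 1, n_total - k, n_total - 1
ms_factor = ss_factor / df_factor
ms_error = ss_error / df_error
f_stat = ms_factor / ms_error

# Cell-means coding (no intercept): each coefficient is a brand mean,
# so the fitted value for every specimen is its own brand mean.
cell_means_coefs = dict(group_means)

# Reference coding (intercept plus k - 1 indicators).
# ASSUMPTION: Brand4 is the reference level.
reference = "Brand4"
intercept = group_means[reference]
effects = {
    brand: m - intercept
    for brand, m in group_means.items()
    if brand != reference
}

print("df:", df_factor, df_error, df_total)
print("SS:", round(ss_factor, 2), round(ss_error, 2), round(ss_total, 2))
print("MS:", round(ms_factor, 3), round(ms_error, 3), "F:", round(f_stat, 2))
print("intercept:", round(intercept, 3), "effects:", effects)
```

Changing `reference` to `"Brand2"` shows how the intercept and indicator coefficients shift while the fitted values stay the same.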
2). 5190 subjects from the Australian Health Survey 1977-1978 were queried about the
number of doctor visits in the past 2 weeks, as well as several other variables. The dataset
DOCVISIT.DAT contains these variables in the order presented in the table below.
variable  definition                           mean   std dev  minimum  maximum
sexf      = 1 if female                        0.521  0.500    0        1.000
age       age in years divided by 100          0.406  0.205    0.190    0.720
age2      age squared                          0.207  0.186    0.036    0.518
income    annual income in tens of thousands   0.583  0.369    0        1.500
          of dollars
phealth   = 1 if private health insurance      0.443  0.497    0        1.000
ghealth   = 1 if free government health        0.043  0.202    0        1.000
          insurance due to low income
ghealtho  = 1 if free government health        0.210  0.407    0        1.000
          insurance due to old age,
          disability or veteran status
illness   number of illnesses in past 2 weeks  1.432  1.384    0        5.000
actdays   number of days of reduced activity   0.862  2.888    0        14.000
          in past 2 weeks due to illness or
          injury
hscore    general health questionnaire score   1.218  2.124    0        12.000
          using Goldberg’s method
chcond1   = 1 if chronic condition not         0.403  0.491    0        1.000
          limiting activity
chcond2   = 1 if chronic condition limiting    0.117  0.321    0        1.000
          activity
dvisits   number of doctor visits in past 2    0.302  0.798    0        9.000
          weeks
For these data, fit a reasonable model treating dvisits as the dependent variable, and all
others as potential independent variables. Defend the choice of model that you decide upon.
Write up and interpret the results in a way that a physician could understand.
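Because dvisits is a count, one quick check before choosing a model is whether its marginal variance is close to its mean, as a Poisson distribution would imply. The sketch below uses only the summary statistics printed in the table; the variable names are illustrative. A ratio well above 1 hints at overdispersion, although the marginal ratio alone does not settle whether overdispersion remains after conditioning on covariates.

```python
# Summary statistics for dvisits, copied from the table in Question 2.
mean_dvisits = 0.302
sd_dvisits = 0.798

# Under a Poisson distribution the variance equals the mean,
# so this ratio would be close to 1.
variance_dvisits = sd_dvisits ** 2
dispersion_ratio = variance_dvisits / mean_dvisits

# Consistency check on the table: E[age^2] = mean(age)^2 + var(age),
# which should be close to the reported mean of age2 (0.207).
mean_age, sd_age = 0.406, 0.205
implied_mean_age2 = mean_age ** 2 + sd_age ** 2

print("variance / mean for dvisits:", round(dispersion_ratio, 3))
print("implied mean of age2:", round(implied_mean_age2, 3))
```

If the ratio stays well above 1 once covariates are included, a negative binomial or quasi-Poisson model is a common alternative to plain Poisson regression.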
3). The data set (BoneFracture.xls) was obtained from a matched case-control study
where a subject with bone fracture (fracture=1) is matched with a subject without bone
fracture (fracture=2) on age (in years) and race (1=black, 2=white). Fifteen other potential
risk factors are measured in this data set. All of them are subject to some missing values.
Answer the following questions with regard to the analysis of this data set.
(a). In a typical matched case-control study, subjects are identified by their matched set.
If no missing values for the covariates had occurred, the conditional likelihood method
could be applied to obtain the relative risk estimates. However, the matched set identity
was lost in this study. Design an analysis plan for such a broken matched case-control
study, pretending that no missing values in the covariates were involved. Compare
your analysis plan with the conventional conditional likelihood approach in terms of
assumptions required and the results obtained.
(b). Design an analysis plan for this dataset with consideration of missing values in covariates.
State the primary assumptions that you require to make your analysis plan sound.
(c). Discuss any limitations of your analysis plan (in part b) for this data set. How can the
limitations be addressed?