Using Stata for Subpopulation
Analysis of Complex Sample
Survey Data
Brady T. West
PhD Student
Michigan Program in Survey Methodology
July 30, 2009
2009 Stata Conference
Presentation Outline
1. Introduction: Subclass Analysis Issues
2. Kish’s Taxonomy of Subclasses
3. Two Alternative Approaches to Inference
4. Variance Estimation and Methods for
‘Singletons’
5. Examples using NHANES and NHAMCS Data
6. Suggestions for Practice
7. Directions for Future Research
2009 Stata Conference: Subpop 2
Analysis of Survey Data
Subclass Analysis Issues
• Analysts of large, complex sample survey
data sets are often interested in making
inferences about subpopulations of the
original population that the sample was
selected from (e.g., Caucasian Females)
• These subpopulations are referred to
interchangeably in various literatures as
subgroups, subclasses, subpopulations,
domains, and subdomains, leading to
confusion among analysts of survey data
Subclass Analysis Issues, cont’d
• Software procedures for analysis of
complex sample survey data are becoming
more powerful, flexible, and widely
available, offering analysts several options
• Analysts need to be careful when analyzing
subclasses, and be aware of the alternative
approaches to subclass analysis that are
possible and their implications for inference
Kish’s Taxonomy of Subclasses
• Design Domains: Restricted to specific strata
according to the complex sample design (usually
geographically, e.g., Texas)
• Cross-Classes: Broadly distributed (in theory)
across the strata and primary sampling units
defining a complex sample (e.g., African-
Americans over age 50)
• Mixed Classes: Disproportionately distributed
across the complex sample design (e.g., Hispanics
in a sample including Los Angeles as a stratum)
• See Kish (1987), Statistical Design for Research
Design Domains
X = Sample Element in Subclass
Stratum  PSU 1           PSU 2
1        XXXXXXXXXXX     XXXXXXXXX
2        XXXXXXXXXX      XXXXXXXXXXXX
3
4
5
Cross-Classes
Stratum  PSU 1           PSU 2
1        XXXXXXXXXXXX    XXXXX
2        XXXX            XXXXXXX
3        XXXXXXXXXXX     XXXXXXXXX
4        XXXXXX          XXXXX
5        XXXXXXXXXX      XXXXXXXXXXXX
Mixed Classes
Stratum  PSU 1           PSU 2
1        XXXXXXXXXXXXXX  XXXXXXXXXXXXX
2        X
3        XXXXXXXXXXXXX   XXXXXXXXXX
4        XX
5        XXXXXXXXXXXXXX  XXXXXXXXXXXX
Applying Kish’s Taxonomy
• The type of subclass is critical for
determining an appropriate analysis
approach
• Two possible approaches to inference
motivated by the taxonomy:
1. Unconditional approach (cross-classes,
mixed classes)
2. Conditional approach (design domains)
The Unconditional Approach
• Appropriate for Cross-Classes, and in some
cases Mixed Classes; the subclass of interest
theoretically can appear in all design strata
and primary sampling units (PSUs)
• KEY POINT: Allow the software to process
the entire survey data set, and recognize all
possible design strata and PSUs; DO NOT
delete sample cases not in the subclass!
The Unconditional Approach
• Rationale: estimated variances for sample
estimates of subclass parameters (based on
within-stratum variance between PSUs)
need to reflect sample-to-sample variability
based on the full complex design
• In other words, if a particular subclass does
not appear in a PSU in any given sample
(although in theory it could have), that PSU
should contribute 0 to variance estimates,
rather than be ignored completely!
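The point about zero contributions can be seen in a sketch of the usual linearized variance estimator for a weighted subclass mean. The notation is an assumption made here, not taken from the slides: strata h, PSUs α = 1, ..., a_h within stratum h, elements i with weights w and subclass indicator δ, PSUs treated as sampled with replacement, and no finite population correction.

```latex
\bar{y}_d
  = \frac{\sum_h \sum_{\alpha} \sum_i w_{h\alpha i}\,\delta_{h\alpha i}\,y_{h\alpha i}}
         {\sum_h \sum_{\alpha} \sum_i w_{h\alpha i}\,\delta_{h\alpha i}}
  = \frac{\hat{Y}_d}{\hat{N}_d},
\qquad
z_{h\alpha} = \frac{1}{\hat{N}_d} \sum_i w_{h\alpha i}\,\delta_{h\alpha i}\,
              \bigl(y_{h\alpha i} - \bar{y}_d\bigr),

\widehat{\mathrm{var}}(\bar{y}_d)
  = \sum_h \frac{a_h}{a_h - 1} \sum_{\alpha = 1}^{a_h}
    \bigl(z_{h\alpha} - \bar{z}_h\bigr)^2,
\qquad
\bar{z}_h = \frac{1}{a_h} \sum_{\alpha = 1}^{a_h} z_{h\alpha}.
```

A PSU with no subclass members has every δ equal to 0, so its z is 0, but it still counts toward a_h and toward the stratum average of the z values. Deleting that PSU changes both, and if it leaves a_h = 1 the stratum term is undefined.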
The Unconditional Approach
• Further, the subclass sample size in each
stratum is going to be a random variable,
and theoretical sample-to-sample variance
in realizations of this random variable
should be incorporated into any variance
estimation procedures
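One way to see the random subclass size at work is to write the subclass mean as a ratio of two estimated totals, with the estimated subclass size in the denominator. The following is a minimal sketch on simulated data; the variable names (stratum, psu, wt, y, sub, ysub) and the design of 5 strata with 2 PSUs each are hypothetical and chosen only for illustration.

```stata
* Simulated design: 5 strata, 2 PSUs per stratum, 40 cases per PSU
clear
set seed 2009
set obs 400
generate stratum  = ceil(_n / 80)
generate psu      = ceil(_n / 40)
generate wt       = 100                 // equal weights for simplicity
generate y        = rnormal(120, 15)
generate byte sub = runiform() < 0.2    // subclass membership varies by chance
svyset psu [pweight = wt], strata(stratum)

* Unconditional subclass mean
svy, subpop(sub): mean y
matrix b_mean = e(b)
matrix V_mean = e(V)

* The same quantity as a ratio: total of y in the subclass / subclass size
generate ysub = y * sub
svy: ratio (subratio: ysub/sub)
matrix b_ratio = e(b)
matrix V_ratio = e(V)

display "estimate: " b_mean[1,1] "  vs  " b_ratio[1,1]
display "SE:       " sqrt(V_mean[1,1]) "  vs  " sqrt(V_ratio[1,1])
```

Under linearization the two commands should agree on both the estimate and the standard error, which is the sense in which the unconditional approach treats the subclass size as estimated rather than fixed.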
The Unconditional Approach
• If cross-classes (or in some cases mixed classes)
are being analyzed, and PSUs where the subclass
does not appear (by random chance) are deleted,
problems arise
• Some strata may appear to have only one PSU by
design (preventing variance estimation unless an
ad hoc approach is used)
• Entire design strata may be dropped, impacting
variance estimates and calculations of degrees of
freedom
The Unconditional Approach:
General Stata Code
• svy, subpop(indicator): command varlist, options
• indicator = an indicator variable for the subpop or
an if condition, e.g., if male == 1
• svy: mean varlist, over(groupvar)
• svy: proportion varlist, over(groupvar)
• Stata drops strata* with no subpopulation
observations from degrees of freedom calculations
* Exercise: repeat 10 times really fast
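The general forms above can be tried on simulated data. This is a sketch under stated assumptions: the data set, the variable names (stratum, psu, wt, y, male, agegrp), and the design of 5 strata with 2 PSUs each are all hypothetical.

```stata
* Simulated design: 5 strata, 2 PSUs per stratum, 40 cases per PSU
clear
set seed 2009
set obs 400
generate stratum     = ceil(_n / 80)
generate psu         = ceil(_n / 40)
generate wt          = 100
generate y           = rnormal(120, 15)
generate byte male   = runiform() < 0.5
generate byte agegrp = 1 + (runiform() < 0.3)   // hypothetical two-level group
svyset psu [pweight = wt], strata(stratum)

* subpop() with an indicator variable
svy, subpop(male): mean y
matrix V_ind = e(V)

* subpop() with an if expression
svy, subpop(if male == 1): mean y
matrix V_if = e(V)

* One estimate per level of a grouping variable, full design retained
svy: mean y, over(agegrp)
svy: proportion male, over(agegrp)
```

In every call the full data set stays in memory, so all 10 PSUs and 5 strata remain visible to the variance estimator.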
The Conditional Approach
• Appropriate for Design Domains, where a
subclass cannot appear outside of specific
design strata
• The rationale behind the unconditional
approach no longer applies
• Certain design strata should not contribute
to variance estimation or calculation of
degrees of freedom
The Conditional Approach
• Restrict the analysis to only those design
strata where the subclass of interest exists
• Variance estimates reflecting sample-to-
sample variability should only be based on
those design strata where the subclass can
appear (unlike the unconditional approach)
• Subclass sample sizes in design domains are
assumed to be fixed, by design
The Conditional Approach:
General Stata Code
• svy: command varlist if (condition), options
• (condition) might be male == 1, or a more
complex combination of conditions (e.g.,
male == 1 & age >= 50 & age <= 90)
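To see what the if qualifier does to the design information, the sketch below counts the strata and PSUs that Stata recognizes under each approach. The data are simulated and all names are hypothetical; the subclass is constructed so that it happens to be absent from one PSU.

```stata
* Simulated design: 5 strata, 2 PSUs per stratum, 20 cases per PSU
clear
set seed 2009
set obs 200
generate stratum = ceil(_n / 40)
generate psu     = ceil(_n / 20)
generate wt      = 100
generate y       = rnormal(120, 15)
* Hypothetical subclass: every 4th case, but none in PSU 10
generate byte sub = (mod(_n, 4) == 0) & (psu != 10)
svyset psu [pweight = wt], strata(stratum) singleunit(missing)

* Conditional: cases outside the subclass are removed before estimation
svy: mean y if sub == 1
scalar psu_cond = e(N_psu)
display "conditional:   strata = " e(N_strata) "  PSUs = " e(N_psu)

* Unconditional: the full design is retained
svy, subpop(sub): mean y
scalar psu_uncond = e(N_psu)
display "unconditional: strata = " e(N_strata) "  PSUs = " e(N_psu)
```

In this constructed case the conditional run sees only one PSU in the last stratum, which is the singleton situation discussed later in the talk.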
Variance Estimation Methods
• These issues are relevant only when using
Taylor Series Linearization, which is the
default variance estimation method in Stata
• Conditional analyses are OK to perform
when using replication methods, such as
Balanced Repeated Replication or Jackknife
Repeated Replication (Rust and Rao, 1996)
Ad-hoc Fixes for ‘Singleton’
Clusters in Stata 10.1
• Stata 10.1 provides users with four ad-hoc fixes
for the problem where strata are identified with
only a single ultimate cluster for variance
estimation in a subpopulation analysis:
1. Report Missing Standard Errors (not really a fix):
singleunit(missing)
2. Treat Units as Certainty Units, which contribute
nothing to the standard error: singleunit(certainty)
3. Scale Variance using Certainty Units, which substitutes
the average of the variances from strata with multiple
PSUs for each stratum with only a single PSU:
singleunit(scaled)
4. Center at the Grand Mean, where the variance
contribution comes from a deviation from the grand
mean instead of the stratum mean: singleunit(centered)
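The four options can be compared side by side by looping over them. This is a minimal sketch on simulated data with hypothetical names, in which the conditional analysis leaves one stratum with a single PSU.

```stata
* Simulated design: 5 strata, 2 PSUs per stratum, 20 cases per PSU
clear
set seed 2009
set obs 200
generate stratum = ceil(_n / 40)
generate psu     = ceil(_n / 20)
generate wt      = 100
generate y       = rnormal(120, 15)
* Hypothetical subclass that is absent from PSU 10
generate byte sub = (mod(_n, 4) == 0) & (psu != 10)

* Re-declare the design with each singleton option, then run the
* conditional analysis and keep the resulting standard error
foreach opt in missing certainty scaled centered {
    svyset psu [pweight = wt], strata(stratum) singleunit(`opt')
    svy: mean y if sub == 1
    matrix V = e(V)
    scalar se_`opt' = sqrt(V[1,1])
    display "singleunit(`opt'): SE = " se_`opt'
}
```

The point estimate is the same in all four runs; only the treatment of the singleton stratum in the variance changes.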
Example: The NHANES Data
• We first consider examples based on the
NHANES II data set, collected from a
nationally representative multistage
probability sample of the U.S. population
from 1976-1980 (oldie but a goodie)
• Briefly, a sample of the U.S. population was
given medical examinations in an effort to
assess the health of the U.S. population
Example NHANES Analysis
• Analysis Subclass: African-Americans ages
50 and above (this is a cross-class of the
U.S. population, which can theoretically
appear in all design strata and PSUs)
• Analysis Objective: Estimate the mean
systolic blood pressure of this subclass and
an appropriate standard error
• See West et al. (2007) for more details
Conditional Approach:
Stata Code for NHANES Analysis
• svyset ppsu [pweight = fwgtexam],
strata(stratum) singleunit(missing)
• svyset ppsu [pweight = fwgtexam],
strata(stratum) singleunit(centered)
• Also singleunit(certainty),
singleunit(scaled)
• gen b50subp = (race == 2 & ager >= 50)
• svy: mean bpsyst if b50subp == 1
Conditional Approach: Results
Method      Est. Mean  TSL SE  Design DF
Missing SE  144.09     .       50-29 = 21
Centered    144.09     1.66    50-29 = 21
Certainty   144.09     1.62    50-29 = 21
Scaled      144.09     1.90    50-29 = 21
Conditional Approach?
• This approach would not be appropriate for
this particular subclass
• Computed standard errors would generally
be biased downward, because additional
sources of sample-to-sample variability are
ignored when following this approach
• Same issues apply for analytic models
• Evidence that the “scaled” ad-hoc fix may
be overly conservative!
Unconditional Approach:
Stata Code for NHANES Analysis
• svyset ppsu [pweight = fwgtexam],
strata(stratum) singleunit(missing)
• Note: choice of single unit option does not
matter when following this approach!
• gen b50subp = (race == 2 & ager >= 50)
• svy, subpop(b50subp): mean bpsyst
Unconditional Approach: Results
Method      Est. Mean  TSL SE  Des. DF*
Missing SE  144.09     1.66    58-29 = 29
Centered    144.09     1.66    58-29 = 29
Certainty   144.09     1.66    58-29 = 29
Scaled      144.09     1.66    58-29 = 29
* Note: Stata dropped three strata with no sample units in the subpopulation.
Unconditional Approach?
• This approach would be the appropriate
choice for a cross-class such as African-
Americans over the age of 50
• Inferences are theoretically appropriate
• Same idea for analytic models
• Results suggest that the “centered” and
“certainty” ad-hoc fixes for conditional
analyses are reasonable
Example: The NHAMCS Data
• Analysis Subclass: Visits to Emergency
Departments (ED) by African-American men ages
60 and above (this is another cross-class of the
U.S. population, which can theoretically appear in
all NHAMCS design strata and PSUs)
• Analysis Objective: Estimate the percentage of all
ED visits by members of this subclass for
dizziness and/or vertigo in 2004
• See West et al. (2008) for more details
Stata Code for NHAMCS Analyses
• svyset cpsum [pweight = patwt],
strata(cstratm) singleunit(…)
• generate subc = (settype == 3 & sex == 2 &
agecat == 5 & race == 2)
• svy: tabulate dizzyrfv if subc == 1, se ci
percent // conditional
• svy, subpop(subc): tabulate dizzyrfv, se ci
percent // unconditional
NHAMCS Analysis Results
Method         Est. %  TSL SE  Design DF
Missing SE     4.82    1.576   106
Centered       4.82    1.576   106
Certainty      4.82    1.576   106
Scaled         4.82    1.576   106
Unconditional  4.82    1.590   286
NHAMCS Analysis Implications
• No problems with strata having only a single ultimate
cluster: ad-hoc fixes all give the same results
• Weighted point estimates are identical
• Substantially fewer design-based degrees of freedom when
following the conditional approach; the full complex
design will not be reflected in estimation of sample-to-
sample variance (many ultimate clusters are lost)
• Conditional analysis treats the subclass sample size as
fixed at n = 397 for variance estimation purposes; sample-to-
sample variability in the subclass size is ignored!
• Conditional analysis results in overly liberal inferences
Suggestions for Practice
• Consider Kish‟s Taxonomy when determining an
appropriate subclass analysis approach
• Utilize the appropriate software options for
unconditional analyses when analyzing cross-
classes
• Be careful with missing values when creating the
subpopulation indicator
• The unconditional analysis approach generally
works fine for both cases (when in doubt, use this
approach)
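On the missing-values point: Stata treats a numeric missing value as larger than any number, so a condition such as age >= 50 evaluates to true when age is missing. The sketch below uses a small hypothetical data set (race, age) to contrast a naive indicator with one that is set to missing when membership is unknown.

```stata
* Four hypothetical cases; the second has a missing age
clear
input race age
2 55
2 .
1 60
2 40
end

* Naive indicator: the case with missing age is counted as 50 or older
generate byte naive = (race == 2 & age >= 50)

* Safer indicator: left missing when race or age is missing
generate byte safe = (race == 2 & age >= 50) if !missing(race, age)

list race age naive safe
```

With the safer version the second case gets a missing indicator instead of a 1, so it is not silently counted as a subclass member.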
Directions for Future Research
• More appropriate calculation / estimation of
design-based and effective degrees of
freedom for sparse subclasses or mixed
classes
• Development of analytic theory for interval
estimation when working with small
subclasses, which does not rely on
asymptotic results
References
• Kish, L. 1987. Statistical Design for Research. New York:
Wiley.
• Rust, K. F., and J. N. K. Rao. 1996. Variance estimation
for complex surveys using replication. Statistical Methods
in Medical Research 5: 283–310.
• West, B. T., P. Berglund, and S. G. Heeringa. 2007.
Alternative approaches to subclass analysis of complex
sample survey data. Proceedings of the 2007 Joint
Statistical Meetings.
• West, B. T., P. Berglund, and S. G. Heeringa. 2008. A
closer examination of subpopulation analysis of
complex sample survey data. The Stata Journal 8(3): 1–12.
Questions / Thank You!
• For additional questions, comments, or
electronic copies of these slides or the
papers, please send an email to
bwest@umich.edu