Using Stata for Subpopulation Analysis of Complex Sample Survey Data
Brady T. West
PhD Student
Michigan Program in Survey Methodology




  July 30, 2009    2009 Stata Conference
        Presentation Outline
1. Introduction: Subclass Analysis Issues
2. Kish's Taxonomy of Subclasses
3. Two Alternative Approaches to Inference
4. Variance Estimation and Methods for
   'Singletons'
5. Examples using NHANES and NHAMCS Data
6. Suggestions for Practice
7. Directions for Future Research

              2009 Stata Conference: Subpop   2
                 Analysis of Survey Data
      Subclass Analysis Issues
• Analysts of large, complex sample survey
  data sets are often interested in making
  inferences about subpopulations of the
  original population that the sample was
  selected from (e.g., Caucasian Females)
• These subpopulations are referred to
  interchangeably in various literatures as
  subgroups, subclasses, subpopulations,
  domains, and subdomains, leading to
  confusion among analysts of survey data
 Subclass Analysis Issues, cont'd
• Software procedures for analysis of
  complex sample survey data are becoming
  more powerful, flexible, and widely
  available, offering analysts several options
• Analysts need to be careful when analyzing
  subclasses, and be aware of the alternative
  approaches to subclass analysis that are
  possible and their implications for inference
 Kish's Taxonomy of Subclasses
• Design Domains: Restricted to specific strata
  according to the complex sample design (usually
  geographically, e.g., Texas)
• Cross-Classes: Broadly distributed (in theory)
  across the strata and primary sampling units
  defining a complex sample (e.g., African-
  Americans over age 50)
• Mixed Classes: Disproportionately distributed
  across the complex sample design (e.g., Hispanics
  in a sample including Los Angeles as a stratum)
• See Kish (1987), Statistical Design for Research
       Design Domains
     X = Sample Element in Subclass
Stratum    PSU 1      PSU 2
   1       XXXXXXX    XXXXXXX
           XXXX       XX
   2       XXXXXXX    XXXXXXX
           XXX        XXXXX
   3

   4

   5
       Cross-Classes
Stratum    PSU 1      PSU 2
   1       XXXXXXX    XXXXX
           XXXXX
   2       XXXX       XXXXXXX

   3       XXXXXXX    XXXXXXX
           XXXX       XX
   4       XXXXXX     XXXXX

   5       XXXXXXX    XXXXXXX
           XXX        XXXXX
       Mixed Classes
Stratum    PSU 1      PSU 2
   1       XXXXXXX    XXXXXXX
           XXXXXXX    XXXXXX
   2                  X

   3       XXXXXXX    XXXXXXX
           XXXXXX     XXX
   4       XX

   5       XXXXXXX    XXXXXXX
           XXXXXXX    XXXXX
    Applying Kish's Taxonomy
•   The type of subclass is critical for
    determining an appropriate analysis
    approach
•   Two possible approaches to inference
    motivated by the taxonomy:
    1. Unconditional approach (cross-classes,
    mixed classes)
    2. Conditional approach (design domains)

   The Unconditional Approach
• Appropriate for Cross-Classes, and in some
  cases Mixed Classes; the subclass of interest
  theoretically can appear in all design strata
  and primary sampling units (PSUs)
• KEY POINT: Allow the software to process
  the entire survey data set, and recognize all
  possible design strata and PSUs; DO NOT
  delete sample cases not in the subclass!
   The Unconditional Approach
• Rationale: estimated variances for sample
  estimates of subclass parameters (based on
  within-stratum variance between PSUs)
  need to reflect sample-to-sample variability
  based on the full complex design
• In other words, if a particular subclass does
  not appear in a PSU in a given sample
  (although in theory it could have), that PSU
  should contribute 0 to variance estimates,
  rather than be ignored completely!
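
One way to see this is through the usual with-replacement linearization estimator for a subclass mean. The notation here is an assumption introduced for illustration and is not taken from the slides: h indexes strata, α indexes the a_h PSUs in stratum h, i indexes elements, w is the sampling weight, and δ equals 1 for elements in the subclass and 0 otherwise. Finite population corrections are ignored.

```latex
\begin{align*}
\hat{N}_S &= \sum_{h=1}^{H}\sum_{\alpha=1}^{a_h}\sum_{i} w_{h\alpha i}\,\delta_{h\alpha i},
\qquad
\hat{\bar{y}}_S = \frac{1}{\hat{N}_S}\sum_{h=1}^{H}\sum_{\alpha=1}^{a_h}\sum_{i} w_{h\alpha i}\,\delta_{h\alpha i}\,y_{h\alpha i} \\
z_{h\alpha} &= \frac{1}{\hat{N}_S}\sum_{i} w_{h\alpha i}\,\delta_{h\alpha i}\,\bigl(y_{h\alpha i}-\hat{\bar{y}}_S\bigr),
\qquad
\bar{z}_h = \frac{1}{a_h}\sum_{\alpha=1}^{a_h} z_{h\alpha} \\
\hat{v}\bigl(\hat{\bar{y}}_S\bigr) &= \sum_{h=1}^{H}\frac{a_h}{a_h-1}\sum_{\alpha=1}^{a_h}\bigl(z_{h\alpha}-\bar{z}_h\bigr)^2
\end{align*}
```

A PSU with no subclass members has a linearized total of zero, but it still counts toward a_h and toward the stratum mean of the linearized totals. Deleting that PSU changes both quantities, which is why the two approaches can give different standard errors.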
   The Unconditional Approach
• Further, the subclass sample size in each
  stratum is going to be a random variable,
  and theoretical sample-to-sample variance
  in realizations of this random variable
  should be incorporated into any variance
  estimation procedures


   The Unconditional Approach
• If cross-classes (or in some cases mixed classes)
  are being analyzed, and PSUs where the subclass
  does not appear (by random chance) are deleted,
  problems arise
• Some strata may appear to have only one PSU by
  design (preventing variance estimation unless an
  ad hoc approach is used)
• Entire design strata may be dropped, impacting
  variance estimates and calculations of degrees of
  freedom

      The Unconditional Approach:
          General Stata Code
• svy, subpop(indicator): command varlist, options
• indicator = an indicator variable for the subpop or
  an if condition, e.g., if male == 1
• svy: mean varlist, over(groupvar)
• svy: proportion varlist, over(groupvar)
• Stata drops strata* with no subpopulation
  observations from degrees of freedom calculations
  * Exercise: say "Stata drops strata" 10 times really fast
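
A minimal sketch of the general form, run on a made-up design so that it is self-contained. The data values and the variable names (stratum, psu, insub, y, w) are hypothetical and are not from the talk. The subclass happens to be absent from PSU 2 of stratum 2, which is the situation the unconditional approach is meant to handle.

```stata
* Hypothetical toy design: 3 strata x 2 PSUs x 3 elements
clear
input stratum psu insub y w
1 1 1 120 10
1 1 0 118 10
1 1 1 131 10
1 2 1 125 10
1 2 0 140 10
1 2 0 122 10
2 1 1 135 12
2 1 1 128 12
2 1 0 119 12
2 2 0 117 12
2 2 0 124 12
2 2 0 130 12
3 1 0 121 8
3 1 1 142 8
3 1 0 126 8
3 2 1 133 8
3 2 1 129 8
3 2 0 138 8
end

* Declare the design once, on the complete data set
svyset psu [pweight = w], strata(stratum)

* Unconditional analysis with an indicator variable
svy, subpop(insub): mean y

* The same subpopulation given as an if condition
svy, subpop(if insub == 1): mean y

* All 6 PSUs and 3 strata remain in the design
display "strata = " e(N_strata) ", PSUs = " e(N_psu) ", design df = " e(df_r)
```

Because the full data set is passed to svy, the empty PSU in stratum 2 stays in the design and the stratum keeps both of its PSUs for variance estimation.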

    The Conditional Approach
• Appropriate for Design Domains, where a
  subclass cannot appear outside of specific
  design strata
• The rationale behind the unconditional
  approach no longer applies
• Certain design strata should not contribute
  to variance estimation or calculation of
  degrees of freedom

    The Conditional Approach
• Restrict the analysis to only those design
  strata where the subclass of interest exists
• Variance estimates reflecting sample-to-
  sample variability should only be based on
  those design strata where the subclass can
  appear (unlike the unconditional approach)
• Subclass sample sizes in design domains are
  assumed to be fixed, by design

      The Conditional Approach:
         General Stata Code
• svy: command varlist if (condition), options
• (condition) might be male == 1, or a more
  complex combination of conditions (e.g.,
  male == 1 & age >= 50 & age <= 90)
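
A minimal sketch of the conditional form for a design domain, again on made-up data. The variable names and values are hypothetical; indom marks an assumed domain that coincides exactly with strata 1 and 2, so the subclass cannot appear in strata 3 and 4.

```stata
* Hypothetical toy design: 4 strata x 2 PSUs x 2 elements
clear
input stratum psu indom y w
1 1 1 120 10
1 1 1 131 10
1 2 1 125 10
1 2 1 140 10
2 1 1 135 12
2 1 1 128 12
2 2 1 117 12
2 2 1 124 12
3 1 0 121 8
3 1 0 142 8
3 2 0 133 8
3 2 0 129 8
4 1 0 118 9
4 1 0 127 9
4 2 0 136 9
4 2 0 122 9
end

svyset psu [pweight = w], strata(stratum)

* Conditional analysis: only the strata that make up the domain are used
svy: mean y if indom == 1

* Strata 3 and 4 contribute nothing to the variance or the degrees of freedom
display "strata = " e(N_strata) ", PSUs = " e(N_psu) ", design df = " e(df_r)
```

Here every retained stratum still has two PSUs, so no singleton problem arises. That is the property that makes the conditional form reasonable for design domains and risky for cross-classes.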




  Variance Estimation Methods
• All of these issues are only relevant when
  using Taylor Series Linearization, which is
  the default for variance estimation in Stata
• Conditional analyses are OK to perform
  when using replication methods, such as
  Balanced Repeated Replication or Jackknife
  Repeated Replication (Rust and Rao, 1996)

       Ad-hoc Fixes for 'Singleton'
          Clusters in Stata 10.1
•   Stata 10.1 provides users with four ad-hoc fixes
    for the problem where strata are identified with
    only a single ultimate cluster for variance
    estimation in a subpopulation analysis:
    1. singleunit(missing): Report Missing Standard Errors
       (not really a fix)
    2. singleunit(certainty): Treat Units as Certainty Units,
       which contribute nothing to the standard error
    3. singleunit(scaled): Scale Variance using Certainty Units,
       which uses the average variance from each stratum with
       multiple PSUs for each stratum with only a single PSU
    4. singleunit(centered): Center at the Grand Mean, where the
       variance contribution comes from a deviation from the grand
       mean instead of the stratum mean
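
The four settings are chosen through the singleunit() option of svyset. A minimal sketch on made-up data (hypothetical variable names and values) that forces a singleton stratum by running a conditional analysis on a cross-class, then loops over the four settings so the standard errors can be compared:

```stata
* Hypothetical toy design: 3 strata x 2 PSUs x 3 elements;
* the subclass is absent from PSU 2 of stratum 2
clear
input stratum psu insub y w
1 1 1 120 10
1 1 0 118 10
1 1 1 131 10
1 2 1 125 10
1 2 0 140 10
1 2 0 122 10
2 1 1 135 12
2 1 1 128 12
2 1 0 119 12
2 2 0 117 12
2 2 0 124 12
2 2 0 130 12
3 1 0 121 8
3 1 1 142 8
3 1 0 126 8
3 2 1 133 8
3 2 1 129 8
3 2 0 138 8
end

* Re-declare the design under each ad-hoc fix and rerun the
* conditional analysis, which leaves stratum 2 with one PSU
foreach opt in missing certainty scaled centered {
    svyset psu [pweight = w], strata(stratum) singleunit(`opt')
    display as text "singleunit(`opt')"
    svy: mean y if insub == 1
}
```

The point estimate does not depend on the setting; only the standard error does, and singleunit(missing) leaves it blank.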
  Example: The NHANES Data
• We first consider examples based on the
  NHANES II data set, collected from a
  nationally representative multistage
  probability sample of the U.S. population
  from 1976-1980 (oldie but a goodie)
• Briefly, a sample of the U.S. population was
  given medical examinations in an effort to
  assess the health of the U.S. population
   Example NHANES Analysis
• Analysis Subclass: African-Americans ages
  50 and above (this is a cross-class of the
  U.S. population, which can theoretically
  appear in all design strata and PSUs)
• Analysis Objective: Estimate the mean
  systolic blood pressure of this subclass and
  an appropriate standard error
• See West et al. (2007) for more details

       Conditional Approach:
 Stata Code for NHANES Analysis
• svyset ppsu [pweight = fwgtexam],
  strata(stratum) singleunit(missing)
• svyset ppsu [pweight = fwgtexam],
  strata(stratum) singleunit(centered)
• Also singleunit(certainty),
  singleunit(scaled)
• gen b50subp = (race == 2 & ager >= 50)
• svy: mean bpsyst if b50subp == 1
 Conditional Approach: Results
 Method       Est. Mean    TSL SE    Design DF
 Missing SE   144.09       .         50-29 = 21
 Centered     144.09       1.66      50-29 = 21
 Certainty    144.09       1.62      50-29 = 21
 Scaled       144.09       1.90      50-29 = 21

       Conditional Approach?
• This approach would not be appropriate for
  this particular subclass
• Computed standard errors would generally
  be biased downward, because additional
  sources of sample-to-sample variability are
  ignored when following this approach
• Same issues apply for analytic models
• Evidence that the “scaled” ad-hoc fix may
  be overly conservative!
      Unconditional Approach:
 Stata Code for NHANES Analysis
• svyset ppsu [pweight = fwgtexam],
  strata(stratum) singleunit(missing)
• Note: choice of single unit option does not
  matter when following this approach!
• gen b50subp = (race == 2 & ager >= 50)
• svy, subpop(b50subp): mean bpsyst


Unconditional Approach: Results
 Method       Est. Mean    TSL SE    Des. DF*
 Missing SE   144.09       1.66      58-29 = 29
 Centered     144.09       1.66      58-29 = 29
 Certainty    144.09       1.66      58-29 = 29
 Scaled       144.09       1.66      58-29 = 29

* Note: Stata dropped three strata with no sample units in the subpopulation.

     Unconditional Approach?
• This approach would be the appropriate
  choice for a cross-class such as African-
  Americans over the age of 50
• Inferences are theoretically appropriate
• Same idea for analytic models
• Results suggest that the “centered” and
  “certainty” ad-hoc fixes for conditional
  analyses are reasonable
  Example: The NHAMCS Data
• Analysis Subclass: Visits to Emergency
  Departments (ED) by African-American men ages
  60 and above (this is another cross-class of the
  U.S. population, which can theoretically appear in
  all NHAMCS design strata and PSUs)
• Analysis Objective: Estimate the percentage of all
  ED visits by members of this subclass for
  dizziness and/or vertigo in 2004
• See West et al. (2008) for more details
 Stata Code for NHAMCS Analyses
• svyset cpsum [pweight = patwt],
  strata(cstratm) singleunit(…)
• generate subc = (settype == 3 & sex == 2 &
  agecat == 5 & race == 2)
• svy: tabulate dizzyrfv if subc == 1, se ci
  percent // conditional
• svy, subpop(subc): tabulate dizzyrfv, se ci
  percent // unconditional

    NHAMCS Analysis Results
  Method        Est. %               TSL SE       Design DF
 Missing SE      4.82                  1.576        106
  Centered       4.82                  1.576        106
  Certainty      4.82                  1.576        106
   Scaled        4.82                  1.576        106
Unconditional    4.82                  1.590        286

NHAMCS Analysis Implications
• No problems with strata having only a single ultimate
  cluster: ad-hoc fixes all give the same results
• Weighted point estimates are identical
• Substantially fewer design-based degrees of freedom when
  following the conditional approach; the full complex
  design will not be reflected in estimation of sample-to-
  sample variance (many ultimate clusters are lost)
• Conditional analysis treats the subclass sample size as
  fixed at n = 397 for variance estimation purposes; its
  sample-to-sample variability is ignored!
• Conditional analysis results in overly liberal inferences


       Suggestions for Practice
• Consider Kish's Taxonomy when determining an
  appropriate subclass analysis approach
• Utilize the appropriate software options for
  unconditional analyses when analyzing cross-
  classes
• Be careful with missing values when creating the
  subpopulation indicator
• The unconditional analysis approach generally
  works fine for both cases (when in doubt, use this
  approach)
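
The caution about missing values can be illustrated with a small sketch on made-up data (hypothetical variable names and values). Stata treats a numeric missing value as larger than any number, so an expression such as age >= 50 is true when age is missing. As I understand Stata's documentation for subpop(), observations whose indicator is missing are dropped from the estimation sample, so leaving the indicator missing for uncertain cases is one reasonable convention.

```stata
* Hypothetical five-row example
clear
input race age
2 55
2 .
2 40
1 60
. 70
end

* Naive indicator: the row with race == 2 and missing age is coded 1,
* because a missing value compares as greater than 50
generate byte naive = (race == 2 & age >= 50)

* Safer indicator: left missing whenever race or age is missing
generate byte sub = (race == 2 & age >= 50) if !missing(race, age)

list race age naive sub
```

This version is deliberately conservative: it also sets the indicator to missing for rows that could be ruled out on one variable alone (for example, race == 1 with a missing age). Whether that is desirable depends on the analysis.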


  Directions for Future Research
• More appropriate calculation / estimation of
  design-based and effective degrees of
  freedom for sparse subclasses or mixed
  classes
• Development of analytic theory for interval
  estimation when working with small
  subclasses, which does not rely on
  asymptotic results
                    References
• Kish, L. 1987. Statistical Design for Research. New York:
  Wiley.
• Rust, K. F., and J. N. K. Rao. 1996. Variance estimation
  for complex surveys using replication. Statistical Methods
  in Medical Research 5: 283–310.
• West, B. T., P. Berglund, and S. G. Heeringa. 2007.
  Alternative approaches to subclass analysis of complex
  sample survey data. Proceedings of the 2007 Joint
  Statistical Meetings.
• West, B. T., P. Berglund, and S. G. Heeringa. 2008. A
  closer examination of subpopulation analysis of complex
  sample survey data. The Stata Journal 8(3): 1–12.
      Questions / Thank You!
• For additional questions, comments, or
  electronic copies of these slides or the
  papers, please send an email to
  bwest@umich.edu



