

									Analysis of Complex Survey Data

        Katherine M. Keyes
           Purpose of this class
• Teach you how to analyze complex survey data
  using SUDAAN
• Provide you with the tools to:
  – 1) find datasets that fit your research interests;
  – 2) download and manage those datasets;
  – 3) do your own analyses
           Structure of the class
•   1:00-2:00   Lecture
•   2:00-3:30   Guided exercise
•   3:30-3:45   Break
•   3:45-5:00   Independent research project
                 Today’s schedule
• Introduction to each other
• Key concepts in complex surveys
• Introduction to the NHANES
   – Focus on describing the complexities in sample and design
   –   Locate variables
   –   Download data files
   –   Append and merge datasets
   –   Clean and recode data
   –   Format and label variables
   –   Save datasets
Who am I?
Who are you?
    What is ‘complex survey data’
• Complex survey data usually refers to sample
  designs in which respondents have been
  sampled in a way that is multi-stage, stratified,
  unequally weighted, and/or clustered.
• Because of these design elements, the sample
  is no longer a simple random sample, which
  violates the assumptions of basic large-sample
  statistical methods.
   What is ‘complex survey data’
• Because of this, we need to take into account
  the design elements when estimating
  standard errors.
 Two types of weights commonly used
• SAMPLE WEIGHTS: adjust for oversampling of certain typically hard
  to reach groups (e.g., young people) and informative nonresponse
• DESIGN WEIGHTS: adjust the standard errors for the nonrandom
  probability of selection into the sample

• Sample weights affect the ESTIMATES and not the STANDARD
  ERRORS
• Design weights affect the STANDARD ERRORS and not the
  ESTIMATES

• We need SUDAAN to incorporate the design weights.
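
To see why sample weights change the estimates, here is a minimal Python sketch. The four respondents and their weights are made up for illustration: three come from an oversampled group (small weights) and one from a group sampled at a lower rate (large weight).

```python
# Hypothetical respondents as (outcome, sample weight) pairs.
# The first three belong to an oversampled group, so each one
# represents fewer people in the population.
respondents = [
    (1.0, 100.0),
    (1.0, 100.0),
    (1.0, 100.0),
    (0.0, 900.0),
]


def unweighted_mean(rows):
    """Plain average that ignores the sample weights."""
    return sum(y for y, _ in rows) / len(rows)


def weighted_mean(rows):
    """Average in which each respondent counts in proportion to the weight."""
    total_weight = sum(w for _, w in rows)
    return sum(y * w for y, w in rows) / total_weight


print(unweighted_mean(respondents))  # 0.75
print(weighted_mean(respondents))    # 0.25
```

Ignoring the weights lets the oversampled group dominate the estimate; applying them restores each group to its population share.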
   Design weights: what are they
• Strata: larger geographic unit
• Primary Sampling Units (PSUs): generally
  single counties or groups of small counties
• Households
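
How strata and PSUs enter the standard error can be sketched in a few lines of Python. This is only an illustration of one common approach (Taylor linearization, treating PSUs as sampled with replacement within strata), not a reproduction of what SUDAAN does internally. The function name, the tuple layout, and the example data are all assumptions made for this sketch.

```python
from collections import defaultdict
from math import sqrt


def weighted_mean_and_se(rows):
    """rows: iterable of (stratum, psu, weight, y) tuples.

    Returns the weighted mean and a linearized standard error that uses
    the variation between PSU totals within each stratum.
    """
    rows = list(rows)
    total_w = sum(w for _, _, w, _ in rows)
    mean = sum(w * y for _, _, w, y in rows) / total_w

    # Sum the linearized residuals w * (y - mean) / total_w within each PSU.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for stratum, psu, w, y in rows:
        psu_totals[stratum][psu] += w * (y - mean) / total_w

    variance = 0.0
    for stratum, totals in psu_totals.items():
        n_psu = len(totals)
        if n_psu < 2:
            raise ValueError(f"stratum {stratum!r} needs at least 2 PSUs")
        centre = sum(totals.values()) / n_psu
        spread = sum((t - centre) ** 2 for t in totals.values())
        variance += n_psu / (n_psu - 1) * spread
    return mean, sqrt(variance)


# Hypothetical data: 2 strata, 2 PSUs each, equal weights.
# In stratum 1 the outcome is identical within each PSU (strong clustering).
example_rows = [
    (1, 1, 1.0, 1.0), (1, 1, 1.0, 1.0),
    (1, 2, 1.0, 0.0), (1, 2, 1.0, 0.0),
    (2, 1, 1.0, 1.0), (2, 1, 1.0, 0.0),
    (2, 2, 1.0, 1.0), (2, 2, 1.0, 0.0),
]
print(weighted_mean_and_se(example_rows))  # (0.5, 0.25)
```

Treating the same eight observations as a simple random sample would give a standard error of roughly 0.18, so ignoring the clustering in this example understates the uncertainty.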
  Introduction to the data we will be
          using in this class
• National Health and Nutrition Examination Survey (NHANES)
• “A program of studies designed to assess the
  health and nutritional status of adults and
  children in the United States. The survey is
  unique in that it combines interviews with
  physical examinations.”
           Introduction to the data we will be
                   using in this class

Years        Name            Age range
1959-1962    NHES I          18-79
1963-1965    NHES II         12-17
1966-1970    NHES III        12-17
1971-1975    NHANES I        1-74
1976-1980    NHANES II       1-74
1989-1991    NHANES III      1-74
1991-1994    NHANES III      1-74
1999-2000    NHANES 99-00    0-75
2001-2002    NHANES 01-02    0-75
2003-2004    NHANES 03-04    0-75
2005-2006    NHANES 05-06    0-75
2007-2008    NHANES 07-08    0-75
2009-2010    NHANES 09-10    0-75
    Domains of inquiry in the NHANES
•   Demographic background    •   Dermatology
•   Housing characteristics   •   Diabetes
•   Smoking                   •   Dietary screener
•   Consumer behavior         •   Dietary behavior
•   Income                    •   Early childhood
•   Food security             •   Health insurance
•   Tracking and tracing      •   Hospital utilization and access to care
•   Acculturation
•   Arthritis                 •   Immunization
•   Audiometry                •   Kidney conditions
•   Blood pressure            •   Occupation
•   Cardiovascular disease    •   Oral health
                              •   Osteoporosis
 Domains of inquiry in the NHANES
• Physical activity and physical fitness
• Physical functioning
• Respiratory health and disease
• Sleep disorders
• Weight history
• Reproductive health
• Illegal drug use
• Depression
• Alcohol use
• Pesticide use
• Bowel health
    Physical exam includes measures of:
• Arthritis
• Audiometry
• Bone density (DXA)
• Anthropometry
• Oral Glucose Tolerance
• Oral Health
• Physician’s Exam
• Respiratory Health
    Laboratory components include
             measures of:
• Venipuncture
• Urine collection
• Bone mineral status markers
• Diabetes profile
• Infectious disease profile
• Oral HPV
• C-reactive protein
• Thyroid profile
• Standard biochemical profile
• Kidney disease profile
• Pregnancy test
• Prostate Specific Antigen
• Nutritional biochemistries and hematologies
• STD profile
• Blood lipids
• Environmental health profile
• Blood samples for DNA purification were collected
  from participants age 20 or more years in survey
  years 1999-2002 and 2007-2008.
• These are restricted access data
  Landmark findings and public health
• High blood lead levels
   – Lead out of gasoline
• Low folate levels
   – Mandatory food fortification
• Rising levels of obesity
   – Public health action plan
• Racial/ethnic disparities in Hepatitis B
   – Universal vaccination of all infants and children
           NHANES not for you?
• The concepts we will discuss apply to many other
  publicly available datasets, and you are encouraged
  to use these data for your in-class project if your
  research questions are not covered in the NHANES

• Where can I find other publicly available datasets?

   – ICPSR:
  Design weights: variable names
• Strata: SDMVSTRA
• PSUs: SDMVPSU

    Sample weights in the NHANES
• If only data from the interviewed sample is used, then the
  appropriate SAS variable is:
    – WTINT2YR

• If data from the medical examination is used, then the appropriate
  SAS variable is:
    – WTMEC2YR

• Some data are only collected on sub-samples of NHANES
  participants. These data are generally not publicly available or are
  only released a few years after the main interview data. If you are
  using data on a subsample of NHANES participants, appropriate
  subsample weights must be used and they are included on any
  data file where relevant.
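
The weight-selection rule above can be written as a small helper. This is a sketch only: the function name and its arguments are hypothetical, and the subsample weight name must be looked up on the relevant data file rather than guessed.

```python
def choose_weight(uses_exam_data, subsample_weight=None):
    """Return the name of the 2-year sample weight to use.

    uses_exam_data:   True if any variable comes from the medical examination.
    subsample_weight: name of the subsample weight found on the data file,
                      if the analysis uses subsample data (hypothetical argument).
    """
    if subsample_weight is not None:
        # Subsample analyses use the weight shipped with that data file.
        return subsample_weight
    if uses_exam_data:
        return "WTMEC2YR"
    return "WTINT2YR"


print(choose_weight(uses_exam_data=False))  # WTINT2YR
print(choose_weight(uses_exam_data=True))   # WTMEC2YR
```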
    Combining NHANES samples
• For NHANES 1999-2000, SDMVSTRA is
  numbered 1 to 13; for NHANES 2001-2002
  SDMVSTRA is numbered 14-28; for NHANES
  2003-2004 SDMVSTRA is numbered 29-43.
• Therefore, two year NHANES cycles can be
  combined without any recoding of this
  variable.
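
Because the strata numbers do not overlap between cycles, appending cycles is a simple stacking operation. The sketch below stacks cycles and checks that assumption before combining. The function name, the `cycle` field, and the sample records are hypothetical; real files would be read from the downloaded NHANES data.

```python
def combine_cycles(*cycles):
    """Stack NHANES cycles given as (label, list_of_record_dicts) pairs.

    Raises ValueError if two cycles share an SDMVSTRA value, since shared
    values would wrongly merge strata from different cycles.
    """
    seen = {}
    combined = []
    for label, records in cycles:
        strata = {r["SDMVSTRA"] for r in records}
        for other_label, other_strata in seen.items():
            overlap = strata & other_strata
            if overlap:
                raise ValueError(
                    f"{label} and {other_label} share strata {sorted(overlap)}"
                )
        seen[label] = strata
        # Tag each record with its cycle so weights can be rescaled later.
        combined.extend(dict(r, cycle=label) for r in records)
    return combined


# Hypothetical records for illustration only.
cycle_9900 = [{"SEQN": 1, "SDMVSTRA": 1}, {"SEQN": 2, "SDMVSTRA": 13}]
cycle_0102 = [{"SEQN": 3, "SDMVSTRA": 14}, {"SEQN": 4, "SDMVSTRA": 28}]

stacked = combine_cycles(("1999-2000", cycle_9900), ("2001-2002", cycle_0102))
print(len(stacked))  # 4
```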
      Combining NHANES samples:
• For the 1999-2002 and 2003-2006 survey
  periods, Mexican Americans were
  oversampled but non-Mexican American
  Hispanics were not oversampled.
• Therefore, estimates for Hispanics that are not
  Mexican Americans are generally unreliable
  and should not be analyzed
• Further, estimates for ‘all Hispanics’ should
  not be calculated
      Combining NHANES samples:
        2007-2008, 2009-2010
• The sample design of NHANES 2007-2010 is
  different from the sample designs for earlier
  cycles
• Adolescents were no longer oversampled
• Non-Mexican American Hispanics were
  oversampled, allowing for estimates of “all
  Hispanics” (but smaller subgroups remain
  unreliable)
       Summary: combining samples
• The NHANES sample designs for the periods 1999-2002 and
  2003-2006 were similar, such that combining data cycles
  within these periods does not present any analytic issues.
• When combining with the 2007-2008 data, however, data
  users should not create estimates for total Hispanics for the
  2005-2008 data period.
• For non-Hispanic white, non-Hispanic black, and Mexican
  American sample domains, rescaling the sample weights to
  create four-year weights should be sufficient
• But users should check estimates carefully to see if the four
  year estimates and sampling errors are consistent with
  each set of 2 year estimates.
Reweighting the data when combining
• When combining two or more 2-year cycles of the continuous
  NHANES, the user must calculate new sample weights before
  beginning any analysis of the data.
• A set of four year weights has already been created for the 1999-
  2002 data (e.g., for the MEC sample it’s WTMEC4YR).
• For four year estimates for 2001-2004, one can create a new
  variable for a four year weight by assigning ½ of the 2 year weight
  for 2001-2002 if the person was sampled in 2001-2002 or assigning
  ½ of the 2 year weight for 2003-2004 if the person was sampled in
  2003-2004.
• For an estimate for the 6-years of 1999-2004, a 6-year weight
  variable can be created by assigning 2/3 of the 4 year weight for
  1999-2002 if the person was sampled between 1999-2002 or
  assigning 1/3 of the 2 year weight for 2003-2004 if the person was
  sampled in 2003-2004.
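
The rescaling rules above can be sketched in Python as follows. The weight names WTMEC2YR and WTMEC4YR come from the slides; the `cycle` field, the function names, and the example records are hypothetical and used only for illustration.

```python
def four_year_weight_2001_2004(record):
    """4-year MEC weight for combined 2001-2002 and 2003-2004 data."""
    if record["cycle"] not in ("2001-2002", "2003-2004"):
        raise ValueError(f"unexpected cycle {record['cycle']!r}")
    # Half of the 2-year weight, whichever of the two cycles the person is in.
    return record["WTMEC2YR"] / 2


def six_year_weight_1999_2004(record):
    """6-year MEC weight for combined 1999-2004 data."""
    if record["cycle"] in ("1999-2000", "2001-2002"):
        # Two thirds of the 4-year weight already supplied for 1999-2002.
        return record["WTMEC4YR"] * 2 / 3
    if record["cycle"] == "2003-2004":
        # One third of the 2-year weight for 2003-2004.
        return record["WTMEC2YR"] / 3
    raise ValueError(f"unexpected cycle {record['cycle']!r}")


# Hypothetical records for illustration only.
early = {"cycle": "2001-2002", "WTMEC2YR": 6000.0, "WTMEC4YR": 3000.0}
late = {"cycle": "2003-2004", "WTMEC2YR": 6000.0}

print(four_year_weight_2001_2004(late))  # 3000.0
print(six_year_weight_1999_2004(early))  # 2000.0
print(six_year_weight_1999_2004(late))   # 2000.0
```

The fractions reflect each cycle's share of the combined period: a 2-year cycle is half of a 4-year period and one third of a 6-year period, and the 4-year block is two thirds of the 6-year period.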
           LAB #1:

Open the Word document “Lab 1:
 Preparing an analytic dataset”
