STATA Tutorial - Center for Knowledge Management

					Using the 2008 OFHS Public Use File
       A Self Guided Tutorial
          *Stata Version*

• This tutorial is intended for persons who wish to use the
  2008 OFHS Public Use File (PUF).
• The PUFs exclude any information that could either
  intentionally or unintentionally identify a respondent.
  Geographic information below the county level has been
  removed.
• The dataset is a record of the responses to the survey
  questions at the respondent level.
• The dataset is in a format that requires the use of Stata, a
  statistical analysis software package from StataCorp.
• The dataset is also available for SAS and SPSS. There
  is a separate tutorial for SAS users.
              STATA Users
• Prerequisites
  – User has STATA Release 9 or Higher.
  – User has experience writing STATA programs.
  – User has an understanding of basic statistics,
    including analysis of univariate data using
    nominal and ordinal level variables.
  – User is comfortable with statistical terms such
    as proportions, standard error, confidence
    level, and confidence interval.
        OFHS Background
• The 2008 OFHS is the largest state-sponsored
  health survey in the U.S.
• Previous surveys were completed in 1998
  and 2004.
• The survey had a sample size of 50,993.
• The survey was stratified to have enough
  respondents to do some analysis for each
  county in the state.
Documents that you may download
     before you get started.
• OFHS Questionnaire
• OFHS Codebook
  These documents are available on the
  OFHS web site.

  Look on the Downloads page.
      What you need to know about the survey.

•   Survey Design
•   Survey Questions
•   Imputation of Missing Values
•   Weighting of Responses
•   Constructed Variables
                Survey Design
• The survey is a stratified random sample of
  Ohio's non-institutional population.
  – Conducted through telephone interviews.
     • Land Lines (49,000 respondents)
     • Cell Phone (2,000 respondents)
  – Random Digit Dialing (land lines) within exchange
    numbers associated with each county.
     • Exchanges are the first 3 digits of a seven-digit phone number.
     • The last four digits within each exchange are randomly generated.
                Survey Design
– Cell Phones
   • Exchanges are at state level.
– Over Samples
   • African Americans - Some exchanges in the 6 largest urban counties
     have a higher proportion of African Americans in the population. The
     higher-proportion exchanges were sampled at a higher rate.
   • Asians and Hispanics - Supplementation of the survey with lists of
     persons with Hispanic or Asian surnames.
– Household clusters
   • Each household/family forms a cluster within the sample.
       – One adult and one child are randomly selected within the family.
       – Each response includes information on the adult, and the child (if there
         are any children).
       – The adult who is most knowledgeable about the child's health responds
         for the child.
              Survey Design
• The population of persons within each of the
  strata (State, County, telephone exchange,
  household, etc.) is already known or is collected
  as a part of the survey.
• A weight is established for each child and adult
  which reflects the inverse of the probability of
  being selected for the survey.
• Indicators of the strata and the weights are used
  in the STATA programs. We will come back to
  this later on.
          Survey Questions
• In the survey questionnaire there are
  different kinds of questions. They include:
  – Qs that help to establish the weights for the
     • How many children are in the family?
     • How many phone numbers are in the home?
        Survey Questions
– Qs that identify the demographic and
  socioeconomic characteristics of the
  individuals and the family.
  • Age, gender, race, ethnicity.
  • Family income, employment, occupation.
  • Education
         Survey Questions
– Qs that identify the insurance status of the
  adult and child respondents.
   • Source of Coverage (Job based, Medicare,
     Medicaid, etc.)
   • If no insurance, the length of time without insurance.
   • Difficulty in getting insurance.
   • Types of Coverage (dental, prescriptions, vision,
     mental health)
          Survey Questions
– Health Status of Adult and Child
  •   General health status
  •   Chronic health conditions
  •   Special Health Care needs
  •   Functional disability
  •   Height and weight
          Survey Questions
• Health Care Access, Utilization,
  Satisfaction and Unmet needs.
  – Usual source of care
  – Care coordination
  – Specialists
  – Emergency room use
  – Hospitalizations
  – Types of unmet needs.
          Survey Questions
• Questions are at multiple levels.
  – Anchor Questions are questions that are
    asked of everyone.
  – Qualifying Questions are questions that help
    to narrow down who should be responding to
    an in-depth question.
  – In-depth questions probe the dimensions of
    the respondent's experience with a particular
Example of Question levels
Anchor Question
           D43. //Have you/Has person in S1// ever been told by a doctor or any other
                health professional that //you/he// had diabetes or sugar diabetes?
           01                                        YES
           02 (Skip to D45)                          NO
           03                                        [VOLUNTEERED:] BORDERLINE
           98                                        DK
           99                                        REFUSED

           D43a //Have you/Has person in S1// ever been told by a doctor or any other
               health professional that //you/he/she// had TYPE 1 CHILD ONSET

                 'BORDERLINE' CODE AS '03']
           //Display response option 97, only if S15 = 02, 99.
           // 97 (Skip to D45)        [VOLUNTEERED:] YES, “GESTATIONAL” OR

           01                              YES - TYPE I (JUVENILE)
           02                              YES - TYPE II (ADULT ONSET)
           04   (Skip to D45)              NO, NEVER DIAGNOSED WITH DIABETES
           98   (Skip to D45)              DK
           99   (Skip to D45)              REFUSED
Example of Question levels
Qualifying Question
             D43b.      //If (s15 = 02) then ask://
                 //Was your/Was person in S1's// DIABETES only during a time
                 associated with a pregnancy? [INTERVIEWER: PROBE
                 FOR PROPER CODE]
             01 (Skip to D45)                   YES ONLY WHEN
             02                                 NO
             98 (Skip to D45)                   DK
             99 (Skip to D45)                   REFUSED

In Depth Question
             D44.       //Is your/Is person on S1's// blood sugar or glucose
                level, which affects diabetes, USUALLY under control or
                where a physician wants it, even if medication is required
                Always, Usually, Sometimes, Rarely, or Never?
             01                                 ALWAYS
             02                                 USUALLY
             03                                 SOMETIMES
             04                                 RARELY
             05                                 NEVER
             98                                 DK
             99                                 REFUSED
             Question levels
• Notice in the example that there are instructions
  to skip to another question, for instance if the answer is no.
• These skips are how the anchor questions and qualifying
  questions exclude persons from
  answering the in-depth questions.
• As a result, when a question is not asked of a
  respondent it creates a missing value for the
  respondent which is MISSING BY DESIGN.
           Missing Values
• Some data is missing in the survey
  because the respondent refused to answer
  the question, or did not know the answer.
• These kinds of missing values need to be
  treated differently than those that are
  'missing by design'.
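One way to keep the different kinds of missing values apart in Stata is to recode the 98 (DK) and 99 (REFUSED) codes to extended missing values. The sketch below uses toy data, not the OFHS file. The variable name d44 is hypothetical, and it assumes that a question which was never asked is stored as the system missing value (.); check the codebook to see how the PUF actually codes these cases.

```stata
* Toy data standing in for an in-depth item such as D44.
* d44 is a hypothetical variable name; the real name is in the codebook.
clear
input id d44
1 1
2 2
3 98
4 99
5 .
end

* Recode "don't know" and "refused" to distinct extended missing values.
* The system missing value (.) is left alone and is ASSUMED here to mean
* "not asked" (missing by design).
mvdecode d44, mv(98=.d \ 99=.r)

label define d44lbl 1 "Always" 2 "Usually" .d "DK" .r "Refused"
label values d44 d44lbl

* The missing option lists each kind of missing value on its own row.
tabulate d44, missing
```

After this recoding, Stata's estimation commands skip all three kinds of missing value, while tabulate with the missing option still shows how many respondents fall into each.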
            Missing Values
• There are some types of questions which
  are very important to the survey design or
  for public policy issues, for which it is not
  acceptable to have values missing.
• These include questions like:
  – Number of children in the family (design)
  – Family Income (public policy)
   Imputation of Missing Values
• Where it is important for the survey to not have any
  missing values, the survey statisticians have replaced
  each missing value by imputing it from the responses of
  other survey respondents who answered the other questions in the
  survey the way the respondent did.
• Survey statisticians use very sophisticated models and
  processes to do imputation, and the practice is well
  established.
• When using this survey to do analysis, it is expected that
  the user will choose the form of the variable which
  includes the imputed values.
• These variables are labeled and typically have a suffix of
       Weighting of Responses
• Weights for each adult and child response,
  which reflect the inverse of the probability
  of being selected for the survey, are
  constructed and should be used in all
  analyses.
• When the weights are used, the results
  accurately reflect the entire
  population.
• If the weights for children in the OFHS
  were summed up across all responses,
  the total would be equal to the child
  population of Ohio. The same is true of
  the adult weights.
• The variable name for the adult weight is
• The variable name for the child weight is
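The idea of a weight as the inverse of the selection probability can be illustrated with a few made-up records. This is a minimal sketch on toy data: p_select and wt are hypothetical names, and the actual OFHS weights are supplied in the file and may include further adjustments made by the survey statisticians.

```stata
* Toy data: four respondents with made-up selection probabilities.
clear
input id p_select
1 0.5
2 0.25
3 0.25
4 0.125
end

* Each respondent "stands for" 1/p_select persons in the population.
generate wt = 1 / p_select

* The sum of the weights estimates the size of the population
* (here 2 + 4 + 4 + 8 = 18 persons).
summarize wt
display "Estimated population size: " r(sum)
```

Running summarize on the adult or child weight in the OFHS file and inspecting r(sum) is a quick check that the weights add up to a plausible population total.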
       Constructed Variables
• There are many variables in the OFHS file
  that are constructed from the responses to
  the survey questions that make it easier to
  use the OFHS. These variables include:
  – BMI – Body mass index. BMI is an indicator
    of adult and child obesity constructed from
    height and weight. The formula is
    complicated, especially for children. We
    make it easier for the user to do analysis of
    obesity by pre-calculating it.
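For reference, the adult calculation is simple: BMI is weight in kilograms divided by height in meters squared, or equivalently 703 times weight in pounds divided by height in inches squared. The sketch below applies it to toy data. The names height_in, weight_lb, bmi and obese are hypothetical; the pre-calculated OFHS variables should be used for real analyses, and the child calculation (which depends on age and sex) is not shown.

```stata
* Toy data: height in inches and weight in pounds for two adults.
clear
input id height_in weight_lb
1 70 154
2 64 180
end

* Adult BMI from imperial units: 703 * pounds / inches squared.
generate bmi = 703 * weight_lb / height_in^2

* Flag adult obesity using the standard cutoff of BMI >= 30.
generate byte obese = bmi >= 30 if !missing(bmi)

list id height_in weight_lb bmi obese
```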
     Constructed Variables
– Insurance Type – In many instances,
  respondents to the survey had more than one
  source of insurance. For example, many
  seniors have insurance from their private
  pension plans and Medicare. For the purpose
  of creating an unduplicated count of the
  population by their insurance status, we have
  created a variable which imposes a hierarchy
  of insurance sources to classify the
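A hierarchy of this kind can be sketched in Stata with a series of replace statements, where later statements override earlier ones. Everything in this example is an assumption made for illustration: the indicator variables (has_medicaid, has_medicare, has_job, has_direct), the new variable ins_type, and the order of precedence, which is loosely modeled on the category numbering of the child insurance type variable. The rule actually used to build the OFHS variable is described in the survey documentation.

```stata
* Toy data: one row per person, one 0/1 indicator per insurance source.
clear
input id has_medicaid has_medicare has_job has_direct
1 1 1 0 0
2 0 1 1 0
3 0 0 1 1
4 0 0 0 0
end

* Start everyone as uninsured, then let higher-priority sources
* overwrite lower-priority ones (ASSUMED order, for illustration only).
generate byte ins_type = 8
replace ins_type = 5 if has_direct == 1
replace ins_type = 4 if has_job == 1
replace ins_type = 3 if has_medicare == 1
replace ins_type = 2 if has_medicaid == 1
replace ins_type = 1 if has_medicaid == 1 & has_medicare == 1

* Each person now appears in exactly one category.
tabulate ins_type
```

Because every person ends up with a single code, the categories add to 100% even when a respondent reported several sources of coverage.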
      Using Stata with the OFHS
•   Step 1. Download and Un-zip the Stata dataset.
•   Step 2. Open dataset in Stata.
•   Step 3. Set survey design parameters in Stata.
•   Step 4. Build and run your first OFHS Stata Program
Download and Unzip the Stata dataset.

• You will find the OFHS Public Use Dataset at:
• Right click on the file name and select 'Save
  target as'.
• Save the ZIP file to the directory where you will
  store the data (c:\statadata\ofhs2008).
• After the file has been saved, run WinZip, saving
  the unzipped file to the same directory.
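If WinZip is not available, recent Stata releases can unzip and open the file themselves. The sketch below is a self-contained round trip on a small stand-in dataset so that it can be run anywhere; ofhs_demo.dta and ofhs_demo.zip are hypothetical names, not the names of the OFHS files.

```stata
* Build a tiny stand-in dataset and zip it, to mimic the downloaded file.
clear
set obs 3
generate masterid = _n
save ofhs_demo.dta, replace
zipfile ofhs_demo.dta, saving(ofhs_demo.zip, replace)
erase ofhs_demo.dta

* The two steps you would run on the real download:
* unzip the archive into the current directory, then open the dataset.
unzipfile ofhs_demo.zip, replace
use ofhs_demo.dta, clear
describe, short
```

With the real file, you would first change to the data directory with cd "c:\statadata\ofhs2008" and substitute the actual ZIP and dataset names.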
 Setting survey design parameters
• After you open the data in Stata, you will have to set the
  survey design parameters prior to running any analyses.
  To do this, type the following command in the command
  window in Stata. (Note: You will have to do this EVERY
  time you open the data.)

• If conducting analyses on adults:
svyset masterid [pweight=wt_a], strata(stratum) singleunit(certainty) vce(linearized)

• If conducting analyses on child population:
svyset masterid [pweight=wt_c], strata(stratum) singleunit(certainty) vce(linearized)
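To confirm which design settings are active before running an analysis, svyset can be typed with no arguments, and svydescribe summarizes the strata and sampling units. The sketch below runs on four made-up records that only borrow the variable names masterid, stratum and wt_a from the commands above; the values are invented.

```stata
* Toy data using the design variable names from the svyset commands.
clear
input masterid stratum wt_a
1 1 120
2 1 95
3 2 210
4 2 180
end

* Declare the design for adult analyses.
svyset masterid [pweight=wt_a], strata(stratum) singleunit(certainty) vce(linearized)

* With no arguments, svyset reports the settings currently in effect.
svyset

* svydescribe lists the number of sampling units in each stratum.
svydescribe
```

This check is useful when switching between adult and child analyses in one session, since the most recent svyset command is the one that applies.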
   Build and run your first OFHS Stata Program

• You should only use procedures in Stata that
  support the use of complex survey designs.
  – svy: mean (estimates means)
  – svy: proportion (estimates proportions)
  – svy: tabulate (provides tables)

  – A detailed list of commands that support the use of
    complex survey designs can be found by going to the
    Help menu in Stata (found in toolbar), choosing Stata
    command, and typing “svy estimation”
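The three commands listed above can be tried on a small made-up dataset before moving to the OFHS file. In this sketch the variables age and insured and all of the values are hypothetical; only the names masterid, stratum and wt_a follow the design variables used earlier.

```stata
* Toy data: six adults in two strata with made-up weights.
clear
input masterid stratum wt_a age insured
1 1 100 34 1
2 1 150 51 0
3 1 120 45 1
4 2 200 29 1
5 2 180 62 1
6 2 160 40 0
end

svyset masterid [pweight=wt_a], strata(stratum)

* Weighted mean with a design-based standard error.
svy: mean age

* Weighted proportions of each category.
svy: proportion insured

* The same proportions as a table, with standard errors and
* confidence intervals.
svy: tabulate insured, se ci
```

Each command carries the svy: prefix, which is what tells Stata to apply the weights and strata declared with svyset.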
                   svy: tabulate
Here is a simple program which calculates the percent of children by Insurance Type.
It includes a 95% confidence interval around each proportion.
Note that you have already entered all of the sampling design parameters for children (at the
beginning of your session).
Remember that to analyze any adult variables, you will have to re-enter your design
parameters, using the svyset command for adults given under "Setting survey design parameters".

svy: tab i_type_c, ci
                                            Svy: tab results
                           (with a little cutting and pasting and formatting of values)

Child Insurance Type          Proportion   Std. Error   95% C.I. Lower Bound   95% C.I. Upper Bound
1: Medicaid & Medicare             1.94%        0.17%                  1.64%                  2.30%
2: Medicaid, No Medicare          30.92%        0.55%                 29.84%                 32.01%
3: Medicare, No Medicaid           0.64%        0.09%                  0.50%                  0.83%
4: Job-based Coverage             53.29%        0.57%                 52.16%                 54.42%
5: Directly Purchased              2.55%        0.18%                  2.22%                  2.93%
6: Other                           0.63%        0.09%                  0.47%                  0.84%
7: Insured Type Unknown            5.99%        0.29%                  5.45%                  6.57%
8: Uninsured                       4.04%        0.21%                  3.65%                  4.48%
Total                            100.00%
                               svy: tabulate
              Now you might add some domain analysis to this,
              breaking out insurance status for children by poverty level.

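* poverty200 = 1 for children below 201% FPL, 0 for those at or above it.
generate poverty200=.
replace poverty200=0 if h87_imp>4
replace poverty200=1 if h87_imp<=4
* Stata treats missing as larger than any number, so reset those cases.
replace poverty200=. if missing(h87_imp)
* One table per poverty group, with standard errors and confidence intervals.
svy: tab i_type_c if poverty200==0, se ci
svy: tab i_type_c if poverty200==1, se ci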
                           Svy: tabulate with an if statement
Child Insurance Type if FPL>=201%   Proportion   Std. Error   95% C.I. Lower Bound   95% C.I. Upper Bound
1: Medicaid & Medicare                   0.43%        0.09%                  0.29%                  0.65%
2: Medicaid, No Medicare                 7.64%        0.45%                  6.80%                  8.57%
3: Medicare, No Medicaid                 0.57%        0.11%                  0.39%                  0.84%
4: Job-based Coverage                   80.40%        0.63%                 79.14%                 81.60%
5: Directly Purchased                    3.48%        0.29%                  2.95%                  4.10%
6: Other                                 0.64%        0.13%                  0.43%                  0.94%
7: Insured Type Unknown                  4.56%        0.34%                  3.95%                  5.27%
8: Uninsured                             2.28%        0.20%                  1.91%                  2.72%
Total                                  100.00%

Child Insurance Type if FPL<201%    Proportion   Std. Error   95% C.I. Lower Bound   95% C.I. Upper Bound
1: Medicaid & Medicare                   3.77%        0.35%                  3.14%                  4.51%
2: Medicaid, No Medicare                58.93%        0.86%                 57.23%                 60.61%
3: Medicare, No Medicaid                 0.74%        0.13%                  0.52%                  1.04%
4: Job-based Coverage                   20.67%        0.68%                 19.37%                 22.03%
5: Directly Purchased                    1.42%        0.19%                  1.10%                  1.84%
6: Other                                 0.62%        0.14%                  0.40%                  0.96%
7: Insured Type Unknown                  7.69%        0.48%                     0%                  8.70%
8: Uninsured                             6.16%        0.40%                  5.43%                  6.98%
Total                                  100.00%
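The tables for the two poverty groups were produced with an if qualifier. For subpopulation (domain) estimates, Stata's survey documentation recommends the subpop() option instead, because it keeps the full sample design in the variance calculation; the proportions are the same, but the standard errors can differ. The sketch below shows the syntax on toy data. All values are made up; only the variable names follow the ones used in this tutorial.

```stata
* Toy data: eight children in two strata with made-up values.
clear
input masterid stratum wt_c i_type_c h87_imp
1 1 100 2 3
2 1 150 4 6
3 1 120 2 4
4 1  90 8 2
5 2 200 4 7
6 2 180 4 5
7 2 160 2 1
8 2 140 4 8
end

svyset masterid [pweight=wt_c], strata(stratum)

* 1 = below 201% FPL, 0 = at or above; missing income stays missing.
generate byte poverty200 = h87_imp <= 4 if !missing(h87_imp)

* Domain estimates for the lower-income group.
svy, subpop(poverty200): tabulate i_type_c, se ci

* Domain estimates for the higher-income group.
svy, subpop(if poverty200 == 0): tabulate i_type_c, se ci
```

On the OFHS file, the same two svy, subpop() lines can be run directly after the poverty200 variable has been created.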
