Analysis of Complex Survey Data - Katherine Keyes_2_

					Analysis of Complex Survey Data

          Day 5, Special topics:
  Developing weights and imputing data
Part 1: Imputation using
        HOT DECK
           What is HOT DECK?
• This procedure is designed to perform the Cox-
  Iannacchione Weighted Sequential Hot Deck
  (WSHD) imputation that is described in both Cox
  (1980) and Iannacchione (1982), a methodology
  based on a weighted sequential sample selection
  algorithm developed by Chromy (1979).
• Provisions are included in this procedure for
  multivariate imputation (several variables
  imputed at the same time) and multiple
  imputations (several imputed versions of the
  same variable).
• Donor – An item respondent selected to provide a
  value for missing item nonrespondent data
• Imputation class – user-defined group used in the
  imputation process. There are three categories: classes that
  consist of only item respondent records, classes that
  consist of only item nonrespondent records, and
  classes that contain both item respondents and item
  nonrespondents. For classes with both item
  respondents and item nonrespondents, imputation is
  performed and donors are selected for missing values.
  For classes with only item respondents or only item
  nonrespondents, imputation is not performed.
• Imputation variable – User-defined variable that contains
  some missing values on the input data file. A missing value
  for this variable will be populated with a donor value.
• Item Nonrespondent — A record for which imputation is
  performed on missing data. All records on the input data
  file are defined as either an item respondent or an item
  nonrespondent. Users define the set of item respondents.
  Item nonrespondents are the set of remaining records from
  the input file not defined as an item respondent.
• Item Respondent — A record for which values can be
  selected for imputation of missing item nonrespondent
  data. All records on the input data file are defined as either
  an item respondent or an item nonrespondent. Users define
  the set of all item respondents.
               Getting started
• Prepare a dataset that includes:
  – The variable(s) you want to impute
  – The variable(s) that will inform the imputation
  – ID and weighting variables
 How do you decide what variables to
  include to inform the imputation?
• IMPORTANT ASSUMPTION: imputation assumes that, for a given
  variable with missing data, the missing-data mechanism within each
  imputation class is ignorable (also known as missing at random).
• The validity of the imputed values depends on the quality of the
  measures you use to inform the imputation. This decision is
  theoretical rather than statistical (but we can use statistics to
  inform our decision).
• Choose variables that are strongly related to the outcome of
  interest. You want the response to be as homogeneous as possible
  within groups and as heterogeneous as possible across groups
   – E.g., if you are imputing depression score, you definitely want sex and
     age. The others depend on what you think impacts depression score:
     BMI? Smoking? Race? Education? Income?
  How much missing is too much?
• There is no rule that says when there is just
  too much missing data to use a variable
• Some people use a 10% rule
• It likely depends on how important this
  variable is to your analysis
• Just remember that the more missing data,
  the less valid the imputed variable will be.
         Getting into the details
• Sorting Matters. The way in which the input file is
  sorted WITHIN each imputation class (defined by
  the IMPBY statement) will have an effect on the
  imputation results. The assignment of a selection
  probability to a potential donor, or item
  respondent, depends both on the donor’s weight
  and on the weights of nearby item
  nonrespondents. In other words, both the
  weights and the sort order of observations play a
  role in the selection of donors for imputation in
  the WSHD algorithm.
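Because sort order affects donor selection, it is worth fixing the sort explicitly before invoking the procedure. A minimal SAS sketch (MYDATA, REGION, and AGE are hypothetical names used for illustration):

```sas
/* Sort within the imputation class (the IMPBY variables) so the WSHD   */
/* results are reproducible. MYDATA, REGION, and AGE are hypothetical.  */
proc sort data=mydata;
   by region age;   /* imputation-class variables first, then a sort key */
run;
```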
Lab: Imputation using HOT DECK
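The lab code itself is not reproduced in this handout. As a schematic sketch only: WEIGHT and IMPBY follow standard SUDAAN usage (IMPBY is named in these slides), but the other statement names and all dataset/variable names below are assumptions to verify against the SUDAAN manual.

```sas
/* Schematic SUDAAN hot-deck call -- verify statement names in the manual. */
proc hotdeck data=mydata;    /* MYDATA is a hypothetical dataset name */
   weight sampwt;            /* base sampling weight for each record */
   impby  region age;        /* imputation classes; sort order within class matters */
   impvar depscore;          /* variable to impute -- IMPVAR is an assumed name */
   idvar  recnos;            /* carry the record ID onto the output file */
run;
```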
       Why would you need to calculate
              new weights?
• Nonresponse adjustment
   – One of the key variables in your analysis has high levels of missingness,
     and you don’t want to impute
   – In this case, you can reestimate the sample weights taking into
     account factors associated with being missing on the key variable
• Post-stratification weight adjustment
   – You don’t like the referent population used for calculating the sample
     weights in your data
   – E.g., most complex surveys weight the sample to be representative of
     the U.S. based on the 2000 Census – this may not be desirable if your
     data were collected in 2009.
   – Post-stratification adjustment may also be useful to users who seek to
     create standardized weights or non-probability based sample weights
             PROC WTADJUST
• Designed to be used to compute nonresponse
  and post-stratification weight adjustments
• Created using a model-based, calibration
  approach that is somewhat similar to what is
  done with PROC LOGISTIC – a generalization of
  the classical weighting class approach for
  producing weight adjustments
               PROC WTADJUST
• In a model-based approach:
• The weight estimation allows the user to include
  more main effects and lower-order interactions of
  variables in the weight adjustment process. This
  can reduce bias in estimates computed using the
  adjusted weights
• Allows you to estimate the statistical significance
  of the variables used in the adjustment process.
• Unlike traditional methods, can incorporate
  continuous variables.
              PROC WTADJUST
• In fact, if all interaction terms are included in
  the weight adjustment model for a given set
  of categorical variables, the model-based
  approach is equivalent to the weighting class
  approach.
       The weight adjustment model

k is an index corresponding to each record in the domain of interest

D is the domain of interest (SUBPOPN statement)

α_k is the final weight adjustment for each record k in D. This is the key output
variable from this procedure

t_k is a weight trimming factor that will be computed before the B-parameters of
the exponential model (i.e., the parameters of a_k) are estimated

a_k is the nonresponse or post-stratification adjustment computed after the
weight trimming step
       The weight adjustment model

ℓ_k is the lower bound imposed on the adjustment

u_k is the upper bound imposed on the adjustment

c_k is the centering constant

A_k is a constant used to control the behavior of a_k as the upper and lower
bounds get closer to the centering constant

x_k is a vector of explanatory variables

β are the model parameters that will be estimated within the procedure
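Taken together, these pieces form a generalized exponential model (Folsom and Singh's formulation). The formula images did not survive extraction, so the following is a reconstruction, with a_k the adjustment, ℓ_k and u_k the lower and upper bounds, c_k the centering constant, A_k the shape constant, x_k the explanatory variables, and β the model parameters:

```latex
a_k(\beta) =
  \frac{\ell_k (u_k - c_k) + u_k (c_k - \ell_k)\,\exp(A_k\, x_k'\beta)}
       {(u_k - c_k) + (c_k - \ell_k)\,\exp(A_k\, x_k'\beta)},
\qquad
A_k = \frac{u_k - \ell_k}{(u_k - c_k)(c_k - \ell_k)} .
```

Note that a_k → ℓ_k as x_k'β → −∞, a_k → u_k as x_k'β → +∞, and a_k = c_k when x_k'β = 0, matching the roles of the lower bound, upper bound, and centering constant described above.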
         The weight adjustment model

w_k is the input weight for record k (whatever is on the WEIGHT statement)

R_k is the dependent variable in the modeling procedure. For nonresponse
adjustments, this variable should be set to one for records corresponding to
eligible respondents and to zero for records corresponding to eligible
nonrespondents. For post-stratification adjustments, this variable should be set
to one for all records that should receive a post-stratification adjustment (if
that’s everyone, just use the option “_ONE_”).
             Weight Trimming
• Reducing the variance in your weights will
  reduce the variance in your estimates (which
  is good!). So, you might want to ‘trim’ the
  weights to be within certain bounds.
• For example, the 99-year-old daily cocaine
  user might have a really extreme weight. We
  might want to rein that person in to have a
  weight that’s similar to a 60-year-old daily
  cocaine user.
   How do you decide the bounds on the
        weight trimming factor?
• There are many ways to do this.
• One relatively simple approach is to partition the sample into
  small subpopulations (e.g., by strata or by levels of some
  covariate of interest)
• Within each of the subpopulations, compute the interquartile
  range (IQR) of the input sample weights, and set the trimming
  bounds as a function of the IQR.
           A simple example
RECNOS   Unique record identifier
SAMPWT   Base sampling weight for each person
ELIG     Yes/No variable indicating whether or not the record is eligible
RESP     Yes/No variable indicating whether the record on file
         corresponds to a respondent
                     A simple example
To compute a nonresponse adjustment that will correct the sample weights of
respondents for those people that did not respond to the survey, we use the
following code:
The DESIGN=WR coupled with the NEST and WEIGHT statements provides the design
information for WTADJUST so that the procedure can compute appropriate design-based
variances of the model parameters B.

The variable SAMPWT is the input weight w_k

A SUBPOPN elig=1 statement is used to tell SUDAAN to only consider eligible records. In
this example, we seek weight adjustments that will correct the sample weights of
respondents for eligible nonrespondents.

The IDVAR statement is included so that the OUTPUT file, ADJUST, contains a variable that
can be used to merge the adjustments back to the original file. In this example, the merge-
by variable is RECNOS.
The WTMAX and WTMIN statements are included. These are optional statements. A fixed value can be
used in these statements – in this case, the fixed value applies to all records k. Optionally, a variable
can be used in these statements. One could use a variable in cases where a different WTMAX and/or
WTMIN is desired for different sets of respondents. In this particular example, the user would like to
truncate any weight that is less than 10 or greater than 15000 prior to computing the actual
nonresponse weight adjustment.

Similarly, the UPPERBD and LOWRBD statements are included. These are also optional statements. A
fixed value can be used in these statements – in this case, the fixed value applies to all records k.
Optionally, a variable can be used in these statements. In this particular example, the user would like to
truncate or bound the resulting weight adjustments, a_k, so that no weight adjustment falls
below 1.0 or above 3.0.
The CENTER statement is included. This is also an optional statement.
A fixed value can be used in this statement – in this case, the fixed
value applies to all records k. Optionally, a variable can
be used in this statement. In this particular example, the value of c_k is
set equal to 2.0 for each record.
The MODEL statement tells WTADJUST that RESP is the 0/1 indicator
for response status and that the user would like to use the main
effects of categorical variables GENDER and RACE in the model. If the
user also wants the interaction of GENDER and RACE, then similar to
all other SUDAAN procedures, they would add the term
GENDER*RACE to the right hand side of the MODEL statement. The
user is also specifying that AGE be included in the model as a
continuous variable.
               The output file
  – In our example, the weight trimming factor t_k is
    assigned a value that will force t_k·w_k to equal 10 for
    those records where w_k < 10 and 15000 for those records
    with w_k > 15000. For records with w_k between 10 and
    15000, the value of t_k will be equal to 1.0
• ADJFACTOR. This will hold the values of the
  weight adjustment factors
Suppose in this example that the weighted sums of the explanatory variables are as
displayed above

Then, WTADJUST is designed to yield model-based weight adjustments (a_k) that will
force the adjusted weighted sums of the model explanatory variables to equal those
control totals. In other words, if you were to compute the weighted sum of each
explanatory variable using only those records that satisfy RESP=1 and using the
adjusted sample weight WTFINAL, the totals you would obtain would equal those
control totals.
    Now a post-stratification example
Suppose instead that we were interested in obtaining a post-stratification adjustment
that would force the nonresponse-adjusted respondent weights to equal a specified
set of control totals.

 Let’s say we merged our nonresponse-adjusted respondent weights back into the
 dataset and named them WTNONADJ. Then the post-stratification totals are supplied
 directly to the procedure.
    Now a post-stratification example

We no longer need weight trimming or upper and lower bounds.

The POSTWGT statement contains the control totals for the post-stratification
adjustment. These numbers should correspond, in order, to the B model parameters.
Unless the NOINT option is specified, SUDAAN always includes an intercept in the
model. Consequently, the first POSTWGT value corresponds to the overall control total –
in this case, that would be 116900+39100=156000. The next eight numbers in the
POSTWGT statement are control totals corresponding to the GENDER*AGEGRP*RACE
interaction. Note that control totals should be supplied for reference levels associated
with any explanatory variable or interaction term.
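Assembling this description gives a sketch of the post-stratification call. As before, STRATUM, PSU, the ADJUST= option spelling, and the OUTPUT syntax are assumptions to verify against the SUDAAN manual; the eight cell totals appeared only on the original slide and are left as a placeholder.

```sas
/* Sketch of the post-stratification call: no trimming or bounds needed. */
proc wtadjust data=mydata design=wr adjust=post;
   nest    stratum psu;        /* hypothetical design variables */
   weight  wtnonadj;           /* nonresponse-adjusted weights from the prior step */
   idvar   recnos;
   class   gender agegrp race;
   model   _one_ = gender*agegrp*race;  /* every record gets an adjustment */
   postwgt 156000              /* intercept: overall control total */
           /* ...followed by the eight GENDER*AGEGRP*RACE cell totals
              from the original slide */ ;
   output idvar wtfinal adjfactor / filename=padjust replace;  /* assumed syntax */
run;
```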
Lab 5: Calculating sample weights
