

Research on Improvements to Current SIPP Imputation Methods
ASA-SRM SIPP Working Group
September 16, 2008
Martha Stinson

Census Imputation Research Plan
• Few changes made to actual production imputation methods in many years
• With redesign of the SIPP, this is an opportunity to consider what changes might be made
• New committee formed with members from content, data processing, sampling, and statistical methodology divisions
• Incremental approach: test new methods and consider short list of variables that might be substantially improved

Proposed Improvements
1. Model-based approach
2. Use administrative data to mitigate problems caused when survey data are not “missing at random”
3. Multiple imputation


Model-based Approach
• Hot-deck depends on a donor matrix with reasonable cell sizes
• Small cells must sometimes be collapsed
• Collapsing cells creates a more heterogeneous group of donors
• Hot-deck can’t take account of variables that are dropped in order to combine cells

Model-based Approach: Research
• Consider an imputation method that uses a linear regression to impute missing values
• Stratify sample by set of characteristics, run regressions for each sub-group that is large enough
• Sub-groups that are too small are combined
• Variables that are dropped from stratification list are added as explanatory variables in the regression

• Earnings imputation
– Stratify by age, gender, race, education, industry, and disability
– Including disability may cause some small cells
– Perhaps combine sub-groups of disabled and non-disabled white women in their fifties
– For this sub-group, include disability status as explanatory variable in regression of earnings on SIPP characteristics
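The collapse-then-regress idea can be sketched as follows. This is a minimal illustration, not the Census production method: the function name `impute_earnings`, the `min_cell` threshold of 30, and the array layout are all assumptions made for the example. It imputes the fitted conditional mean only; a real procedure would likely add a residual draw so that imputed values are not too smooth.

```python
import numpy as np

def impute_earnings(y, X, base_cell, disabled, min_cell=30):
    """Regression imputation with cell collapsing (illustrative sketch).

    y         : earnings, np.nan where missing
    X         : (n, k) array of other SIPP characteristics
    base_cell : integer cell id built from age, gender, race, education, industry
    disabled  : 0/1 disability indicator (the stratifier that may be dropped)
    min_cell  : hypothetical minimum number of observed cases per regression
    """
    y = np.asarray(y, dtype=float).copy()
    X = np.asarray(X, dtype=float)
    disabled = np.asarray(disabled)
    for cell in np.unique(base_cell):
        in_cell = base_cell == cell
        observed = in_cell & ~np.isnan(y)
        n_by_status = [np.sum(observed & (disabled == d)) for d in (0, 1)]
        if min(n_by_status) >= min_cell:
            # Both sub-groups are large enough: one regression per sub-group.
            groups = [(in_cell & (disabled == d), X) for d in (0, 1)]
        else:
            # Collapse the two sub-groups and move disability status
            # from the stratification list into the regression.
            groups = [(in_cell, np.column_stack([X, disabled]))]
        for mask, Z in groups:
            Z1 = np.column_stack([np.ones(len(y)), Z])  # add intercept
            fit = mask & ~np.isnan(y)
            miss = mask & np.isnan(y)
            if fit.sum() == 0 or miss.sum() == 0:
                continue
            beta, *_ = np.linalg.lstsq(Z1[fit], y[fit], rcond=None)
            y[miss] = Z1[miss] @ beta  # conditional-mean imputation
    return y
```

Unlike a collapsed hot-deck cell, the collapsed regression still uses the dropped variable, so disabled and non-disabled recipients in the same cell can receive different imputed values.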

Data Not “Missing At Random”
• All imputation methods that use survey data exclusively are built on the assumption that the relationships between survey variables are the same for everyone, regardless of missing data
• Assume relationship between X1, X2, X3 and Y can be estimated
• Assume if Y is missing, X1, X2, and X3 are good predictors
• However if the relationship between Y and X1, X2, X3 is different when Y is missing, the imputation will be flawed
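One way to write this assumption down, using notation that is not in the original slides: let R = 1 when Y is reported and R = 0 when Y is missing. "Missing at random" given X1, X2, X3 then says that response does not depend on Y once the X's are known, which implies the regression of Y on the X's is the same for respondents and nonrespondents:

```latex
\Pr(R = 1 \mid Y, X_1, X_2, X_3) = \Pr(R = 1 \mid X_1, X_2, X_3)
\quad\Longrightarrow\quad
E[\,Y \mid X_1, X_2, X_3, R = 0\,] = E[\,Y \mid X_1, X_2, X_3, R = 1\,]
```

Survey-only imputation estimates the right-hand side and uses it in place of the left-hand side; if the two differ, the imputed values are biased.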

Data Not “Missing At Random”: Research

• We can evaluate the magnitude of this problem and mitigate the impact on imputation using administrative data
• Information from an outside source can help account for unobservable (in the survey) differences between people


Example: 2004 SIPP panel
• 2004 Annual earnings at two main jobs
– Earnings at each job are imputed on a monthly basis
– Sum across jobs and then across months to get annual earnings
– Create count of number of imputed months in the year (range from 0-12)
– If either job has imputed earnings, count the full month as imputed
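The aggregation rules above can be sketched as below. The array layout (people by 12 months by 2 jobs) and the function name are assumptions made for illustration; they do not reflect the actual SIPP file structure.

```python
import numpy as np

def annual_earnings_and_imputed_months(earn, imputed):
    """earn, imputed: arrays of shape (n_people, 12, 2).

    earn[i, m, j]    : monthly earnings of person i in month m at job j
    imputed[i, m, j] : True if that monthly value was imputed
    """
    earn = np.asarray(earn, dtype=float)
    imputed = np.asarray(imputed, dtype=bool)
    # Sum across the two jobs, then across the twelve months.
    annual = earn.sum(axis=2).sum(axis=1)
    # A month counts as imputed if either job has imputed earnings.
    month_imputed = imputed.any(axis=2)
    n_imputed_months = month_imputed.sum(axis=1)  # ranges from 0 to 12
    return annual, n_imputed_months
```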

Example: 2004 SIPP panel (cont.)
• Split SIPP respondents into groups
1. No months of imputed or missing data
2. 1-4 months of imputed data (no missing)
3. 5-8 months of imputed data (no missing)
4. 9-12 months of imputed data (no missing)

• Match earnings report from W-2 records summed for all employers
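A small sketch of the grouping and the W-2 aggregation, assuming each respondent already has a count of imputed months and no missing months, and assuming W-2 records arrive as (person_id, employer_id, wages) tuples. Both helper names and the record layout are hypothetical.

```python
from collections import defaultdict

def imputation_group(n_imputed_months):
    """Map a count of imputed months (0-12) to groups 1-4 from the slide."""
    if not 0 <= n_imputed_months <= 12:
        raise ValueError("count of imputed months must be between 0 and 12")
    if n_imputed_months == 0:
        return 1
    # 1-4 months -> group 2, 5-8 -> group 3, 9-12 -> group 4
    return 2 + (n_imputed_months - 1) // 4

def total_w2_earnings(w2_records):
    """Sum W-2 wages over all employers for each person."""
    totals = defaultdict(float)
    for person_id, _employer_id, wages in w2_records:
        totals[person_id] += wages
    return dict(totals)
```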

Example: 2004 SIPP panel (cont.)
• If earnings are missing at random, relationship between admin. earnings and other SIPP variables should be the same for all four groups
• Test
– regress admin. earnings on SIPP demographic variables separately for each group
– predict earnings for each group using each set of coefficients (four predicted values per group)
– compare each prediction to actual admin. earnings
– if coefficients are good predictors, difference should be zero on average
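The test just described can be sketched with ordinary least squares. This is an illustration under assumptions: the function name, the use of plain OLS with an intercept, and the choice of the mean difference as the summary are not taken from the slides.

```python
import numpy as np

def cross_group_prediction_errors(y, X, group):
    """Return matrix D with D[i, j] = mean(actual_i - pred_j).

    y     : administrative (W-2) earnings
    X     : (n, k) SIPP demographic variables
    group : imputation group label for each person
    Rows index the group being predicted; columns index the group whose
    coefficients are used. Groups are ordered by sorted label.
    """
    y = np.asarray(y, dtype=float)
    group = np.asarray(group)
    X1 = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    labels = np.unique(group)
    # One set of coefficients per group.
    betas = [
        np.linalg.lstsq(X1[group == g], y[group == g], rcond=None)[0]
        for g in labels
    ]
    D = np.empty((len(labels), len(labels)))
    for i, gi in enumerate(labels):
        rows = group == gi
        for j, beta in enumerate(betas):
            D[i, j] = np.mean(y[rows] - X1[rows] @ beta)
    return D
```

With an intercept in the model, the diagonal of `D` is zero by construction, so the informative entries are off the diagonal: for example, how far group 1's coefficients miss actual earnings for people with 9-12 imputed months.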

Example: Results
Coeff1             Coeff2             Coeff3             Coeff4
Actual1 - pred1    Actual1 - pred2    Actual1 - pred3    Actual1 - pred4
Actual2 - pred1    Actual2 - pred2    Actual2 - pred3    Actual2 - pred4
Actual3 - pred1    Actual3 - pred2    Actual3 - pred3    Actual3 - pred4
Actual4 - pred1    Actual4 - pred2    Actual4 - pred3    Actual4 - pred4





Example: Results
[Figure: results using Coeff1, shown by group: No imputes, 1-4 months, 5-8 months, 9-12 months. Labels: "Coeff1", "Obs". Annotations: "Might impute too low", "Might impute too high".]











Multiple Imputation
• Since the 1970s, Donald Rubin has argued that imputation adds variability to user-calculated statistics
• Traditional methods impute only once
• User has no way to account for variability
• Multiple imputation allows the user to calculate variance that includes a piece due to imputation

Multiple Imputation: Example
• How might variance estimates change when switching from single to multiple imputation?
• Consider random variable X with mean of 0.5
• Generate 1000 random samples by taking draws for 80 people
• 20 people have missing data for X

Multiple Imputation: Example (cont.)
• Impute missing data using 2 methods:
– single implicate/hot deck – every observed value has equal prob. of being donor
– multiple imputation/Bayesian Bootstrap – prob. of being donor changes across implicates but centered around 1/n; create 32 implicates

• Calculate mean and 95% confidence interval for all 1000 random samples
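The two donor-selection schemes can be sketched as below. This is a minimal illustration, not the code behind the slides: the function names are made up, and the Bayesian bootstrap is implemented in its standard form, drawing donor probabilities from a flat Dirichlet distribution so that each probability has mean 1/n but varies from implicate to implicate.

```python
import numpy as np

def hot_deck_impute(observed, n_missing, rng):
    """Single implicate: every observed value is equally likely to be the donor."""
    observed = np.asarray(observed)
    return rng.choice(observed, size=n_missing, replace=True)

def bayesian_bootstrap_implicates(observed, n_missing, m, rng):
    """Create m implicates; donor probabilities are redrawn for each one."""
    observed = np.asarray(observed)
    n = len(observed)
    implicates = np.empty((m, n_missing), dtype=observed.dtype)
    for k in range(m):
        # Dirichlet(1, ..., 1) weights: each has expectation 1/n.
        probs = rng.dirichlet(np.ones(n))
        implicates[k] = rng.choice(observed, size=n_missing, replace=True, p=probs)
    return implicates
```

A call such as `bayesian_bootstrap_implicates(observed, 20, 32, np.random.default_rng(0))` would produce the 32 implicates for 20 missing cases; the distribution of X and the number of observed donors are left to the caller because the slides do not fully specify them.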

Multiple Imputation: Example (cont.)
– Case of 1 implicate
• 95% confidence interval contains the true value 88% of the time

– Case of multiple implicates
• Calculate variance of mean using Rubin formula
• 95% confidence interval contains the true value 96.5% of the time

– What does this mean?
• Statistical hypotheses will be rejected too often using single imputation methods because variance estimates are too small
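For reference, the Rubin formula combines the m implicates as follows: the point estimate is the average of the per-implicate estimates, and the total variance is the average within-implicate variance plus (1 + 1/m) times the between-implicate variance. A minimal sketch, with a hypothetical function name:

```python
import numpy as np

def rubin_combine(estimates, variances):
    """Combine per-implicate estimates and their variances.

    estimates : the statistic (e.g. the mean of X) computed on each implicate
    variances : the usual variance estimate of that statistic on each implicate
    Returns (combined estimate, total variance).
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    if m < 2:
        raise ValueError("need at least two implicates")
    q_bar = q.mean()            # combined point estimate
    within = u.mean()           # average within-implicate variance
    between = q.var(ddof=1)     # variance of the estimates across implicates
    total = within + (1.0 + 1.0 / m) * between
    return q_bar, total
```

With a single implicate the between-implicate term cannot be estimated, which is why single-imputation variance estimates come out too small.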

Examples of Census Research on Imputation Methods
• Generalized Additive Model (GAM)
• Predictive Mean Matching
• Bayesian Bootstrap
• Sequential Regression Multiple Imputation (SRMI)


Questions for Panel Discussion
1. General thoughts and suggestions on model-based imputation?
2. Suggest specific models?
3. Which variables should we prioritize?
4. Would SIPP user community be willing/able to handle multiple implicates?
