VIEWS: 1 PAGES: 24 POSTED ON: 2/5/2012
Example of Working with Missing Values Alan C. Acock February, 2007 Presented at the Department of Family and Child Studies Florida State University Supporting material is available at www.oregonstate.edu/~acock/missing Based on the Power Point Presentation, the following is an example of working with missing values. These are notes to facilitate the presentation and are not intended to be in a format appropriate for publication. There are many packages and commands for working with missing values. I will illustrate the process using a command written by Royston (2004) for Stata. It is useful to see this process even if you do not have access to Stata because it is one of the best approaches currently available and it allows us to illustrates several important issues that are problematic with various packages. Model: We will estimate the hours a person works. We think this depends on their gender, race (white, black, other), age, education, number of children and an interaction of the number of children and gender. hrs1 a B1female B2other B3black B4age B5educ B6childs B7(female•childs) We have data for 1,262 adults with no missing data using the 2004 General Social Survey. Working with Missing Data—Presented at Florida State University, February, 2007 1 . sum Variable | Obs Mean Std. Dev. Min Max -------------+-------------------------------------------------------- id | 1262 1351.342 801.18 4 2812 hrs1 | 1262 42.31616 14.89532 1 89 childs | 1262 1.602219 1.436003 0 8 age | 1262 41.94295 12.85587 18 86 educ | 1262 14.36846 2.753653 0 20 -------------+-------------------------------------------------------- paeduc | 1262 12.25832 3.742694 0 20 maeduc | 1262 12.22583 3.284868 0 20 income98 | 1262 18.49128 4.418744 1 24 attend | 1262 3.861331 2.666016 0 8 -------------+-------------------------------------------------------- other | 1262 .0847861 .2786735 0 1 black | 1262 .1030111 .3040939 0 1 female | 1262 .4896989 .500092 0 1 femkid | 1262 .8312203 1.297545 0 8 -------------+-------------------------------------------------------- If we do a regression (listwise deletion) we obtain: . regress hrs1 female other black age educ childs femkid, beta Source | SS df MS Number of obs = 1262 -------------+------------------------------ F( 7, 1254) = 12.16 Model | 17784.362 7 2540.62314 Prob > F = 0.0000 Residual | 261994.488 1254 208.927024 R-squared = 0.0636 -------------+------------------------------ Adj R-squared = 0.0583 Total | 279778.85 1261 221.870619 Root MSE = 14.454 ------------------------------------------------------------------------------ hrs1 | Coef. Std. Err. t P>|t| Beta -------------+---------------------------------------------------------------- female | -4.09762 1.233346 -3.32 0.001 -.1375725 other | -2.9565 1.478854 -2.00 0.046 -.0553126 black | .2853287 1.360049 0.21 0.834 .0058251 age | -.0603985 .0348098 -1.74 0.083 -.0521288 educ | .3840182 .1500599 2.56 0.011 .0709923 childs | 1.010823 .4098343 2.47 0.014 .0974497 femkid | -1.562961 .5720398 -2.73 0.006 -.136151 _cons | 41.23919 2.59819 15.87 0.000 . ------------------------------------------------------------------------------ I created a new dataset that has missing values that violate the MAR assumption. Working with Missing Data—Presented at Florida State University, February, 2007 2 I deleted values deliberately so that the resulting dataset using listwise deletion has only 680 observations. We are missing between 2% and 19% of the values for each variable, but with listwise deletion almost half the observations are dropped because they have a missing value on at least one variable. If I had deleted these randomly then the multiple imputation would approximate the results for the full sample. I’ve deliberately deleted the observations to violate assumptions of data missing randomly. The results for our new dataset, using listwise deletion, are quite different: sum Variable | Obs Mean Std. Dev. Min Max -------------+-------------------------------------------------------- id | 1262 1351.342 801.18 4 2812 hrs1 | 1125 42.272 14.87459 1 89 childs | 1115 1.583857 1.43404 0 8 age | 1017 42.09636 12.93885 18 86 educ | 1236 14.36246 2.755579 0 20 -------------+-------------------------------------------------------- paeduc | 1232 12.25812 3.70872 0 20 maeduc | 1223 12.21259 3.302125 0 20 race | 1209 1.267163 .6012586 1 3 income98 | 1208 18.40977 4.432703 1 24 attend | 1212 3.861386 2.669322 0 8 -------------+-------------------------------------------------------- other | 1213 .0824402 .2751477 0 1 black | 1219 .1033634 .3045579 0 1 female | 1125 .4897778 .5001178 0 1 femkid | 1024 .8476563 1.305764 0 8 -------------+-------------------------------------------------------- . regress hrs1 female other black age educ childs femkid, beta Source | SS df MS Number of obs = 680 -------------+------------------------------ F( 7, 672) = 8.47 Model | 11131.7958 7 1590.25655 Prob > F = 0.0000 Residual | 126184.756 672 187.774934 R-squared = 0.0811 Working with Missing Data—Presented at Florida State University, February, 2007 3 -------------+------------------------------ Adj R-squared = 0.0715 Total | 137316.551 679 202.233507 Root MSE = 13.703 ------------------------------------------------------------------------------ hrs1 | Coef. Std. Err. t P>|t| Beta -------------+---------------------------------------------------------------- female | -4.7175 1.630608 -2.89 0.004 -.1659522 other | -3.564597 2.055072 -1.73 0.083 -.0648645 black | .9089816 1.7347 0.52 0.600 .0198014 age | -.0786796 .0443309 -1.77 0.076 -.0719363 educ | .6051706 .1961412 3.09 0.002 .1161483 childs | .9381454 .5071599 1.85 0.065 .0975426 femkid | -1.194012 .7266066 -1.64 0.101 -.1132623 _cons | 38.65546 3.375819 11.45 0.000 . ------------------------------------------------------------------------------ The results using mean substitution are also quite different with the explanatory power of the model attenuated: . regress hrs1m femalem otherm blackm agem educm childsm femkidm, beta Source | SS df MS Number of obs = 1262 -------------+------------------------------ F( 7, 1254) = 9.22 Model | 12168.8102 7 1738.40145 Prob > F = 0.0000 Residual | 236519.958 1254 188.612407 R-squared = 0.0489 -------------+------------------------------ Adj R-squared = 0.0436 Total | 248688.768 1261 197.215518 Root MSE = 13.734 ------------------------------------------------------------------------------ hrs1m | Coef. Std. Err. t P>|t| Beta -------------+---------------------------------------------------------------- femalem | -4.990874 1.15559 -4.32 0.000 -.1678047 otherm | -2.767918 1.448562 -1.91 0.056 -.053167 blackm | .5248508 1.312423 0.40 0.689 .0111867 agem | -.0360429 .0357436 -1.01 0.313 -.0298081 educm | .3560231 .1431787 2.49 0.013 .0691348 childsm | .2233433 .3823879 0.58 0.559 .0214362 femkidm | -.6406052 .5417609 -1.18 0.237 -.0536494 _cons | 41.48353 2.552413 16.25 0.000 . ------------------------------------------------------------------------------ Checking for missing values . misschk hrs1 female other black age educ childs femkid, gen(miss) replace dummy Working with Missing Data—Presented at Florida State University, February, 2007 4 Variables examined for missing values # Variable # Missing % Missing -------------------------------------------- 1 hrs1 137 10.9 2 female 137 10.9 3 other 49 3.9 4 black 43 3.4 5 age 245 19.4 6 educ 26 2.1 7 childs 147 11.6 8 femkid 238 18.9 The columns in the table below correspond to the # in the table above. If a column is _, there were no missing cases for that variable. Missing for | which | variables? | Freq. Percent Cum. ------------+----------------------------------- 1___5 _7_ | 11 0.87 0.87 1___5 ___ | 102 8.08 8.95 1____ ___ | 24 1.90 10.86 _2__5 __8 | 1 0.08 10.94 _2___ 6_8 | 25 1.98 12.92 _2___ __8 | 111 8.80 21.71 __34_ ___ | 43 3.41 25.12 __3__ ___ | 6 0.48 25.59 ____5 _7_ | 9 0.71 26.31 ____5 ___ | 122 9.67 35.97 _____ 6__ | 1 0.08 36.05 _____ _78 | 101 8.00 44.06 _____ _7_ | 26 2.06 46.12 _____ ___ | 680 53.88 100.00 ------------+----------------------------------- Total | 1,262 100.00 The first pattern has missing values on three of the variables, number 1, 5, and 7 (hrs, age, childs). Working with Missing Data—Presented at Florida State University, February, 2007 5 This table can tell us if there is a variable or, more usefully, a combination of variables that have a lot of missing values. The next table tells us how many people have missing values on 0, 1, 2, … of the variables. Notice that all but 37 of the observations are missing values for 2 or fewer variables and no observation is missing a value for more than 3 variables. Missing for | how many | variables? | Freq. Percent Cum. ------------+----------------------------------- 0 | 680 53.88 53.88 1 | 179 14.18 68.07 2 | 366 29.00 97.07 3 | 37 2.93 100.00 ------------+----------------------------------- Total | 1,262 100.00 Variables created: miss<varnm> is a binary variable indicating missing data for <varnm>. This command creates a dummy variable (miss<varnm> for each variable to represent the missingness. These are coded 0 if not missing and 1 if missing. We can use these new variables to see if there are variables in the dataset that predict them. A variable that is correlated with one of these variables is known as an auxiliary variable. It is a mechanism that explains the missingness. Here is one example: . tab misshrs1 -> tabulation of misshrs1 Missing | Working with Missing Data—Presented at Florida State University, February, 2007 6 value for | hrs1? | Freq. Percent Cum. ------------+----------------------------------- NotMissing | 1,125 89.14 89.14 Missing | 137 10.86 100.00 ------------+----------------------------------- Total | 1,262 100.00 What are our auxiliary variables? paeduc maeduc income98 attend black other Normally, we would pick more. These are variables that predict whether there is a missing value or not—they may or may not predict the score for the missing value. Think of other auxiliary variables that are mechanisms explaining why a value is missing—race? Depression? What are our covariates? paeduc maeduc income98 attend black other Normally, we would pick more candidates to be used as either auxiliary variables or covariates. These happen to be the same variables but are selected because we think they might be related to the score on our primary variables (hrs1 childs age educ interact). Working with Missing Data—Presented at Florida State University, February, 2007 7 Think of other covariates. Often these are the same as the auxilary variables. For example, minorities may work fewer hours because of discrimination and knowing minority status would help us predict the value when it is missing. Finding auxiliary variables and covariates To evaluate auxiliary variables we do the following: We explore for variables that are correlated with missingness. Normally, we would include far more variables: (edited output follows) pwcorr misshrs1-missfemkid hrs1-female | misshrs1 missfe~e missot~r missbl~k missage misseduc missch~s -------------+--------------------------------------------------------------- misshrs1 | 1.0000 missfemale | -0.1218 1.0000 missother | -0.0701 -0.0701 1.0000 missblack | -0.0655 -0.0655 0.9345 1.0000 missage | 0.5564 -0.1648 -0.0986 -0.0922 1.0000 misseduc | -0.0506 0.3977 -0.0292 -0.0272 -0.0712 1.0000 misschilds | -0.0394 -0.1267 -0.0730 -0.0682 -0.0533 -0.0527 1.0000 missfemkid | -0.1682 0.7238 -0.0969 -0.0905 -0.2315 0.2866 0.4627 hrs1 | . 0.0007 -0.0423 -0.0442 -0.0202 -0.0040 0.0767 childs | 0.0424 -0.0857 -0.0507 -0.0426 -0.0099 -0.0505 . age | -0.0202 -0.0691 0.0342 0.0362 . -0.0354 0.0439 educ | 0.0172 0.0444 -0.0132 -0.0106 0.0215 . -0.0765 paeduc | 0.0102 0.0547 0.0116 0.0273 0.0054 0.0250 -0.0864 maeduc | -0.0378 0.0761 0.0027 0.0189 -0.0248 0.0317 -0.0985 income98 | 0.0407 -0.0189 0.1013 0.0861 -0.0222 0.0120 -0.0221 attend | 0.0003 0.0049 -0.0506 -0.0552 0.0614 -0.0371 0.0472 other | -0.0123 0.0540 . . -0.0313 0.1833 0.0265 black | 0.0328 -0.0526 -0.0239 . 0.0449 -0.0315 -0.0595 female | -0.0114 . 0.0000 -0.0006 0.0151 0.0304 -0.0105 | missfe~d hrs1 childs age educ paeduc maeduc -------------+--------------------------------------------------------------- missfemkid | 1.0000 hrs1 | 0.0624 1.0000 childs | -0.0857 -0.0392 1.0000 age | -0.0223 -0.0322 0.4021 1.0000 educ | -0.0748 0.0739 -0.0840 0.0151 1.0000 paeduc | -0.0434 0.0208 -0.1637 -0.2146 0.3701 1.0000 maeduc | -0.0196 0.0632 -0.1602 -0.1877 0.3754 0.6151 1.0000 income98 | -0.0364 0.1914 0.1103 0.2140 0.2925 0.1274 0.1283 Working with Missing Data—Presented at Florida State University, February, 2007 8 attend | 0.0498 -0.0533 0.2204 0.0606 0.0669 -0.0269 -0.0104 other | -0.0047 -0.0484 -0.0335 -0.1016 0.0094 -0.0156 -0.1387 black | -0.0993 -0.0055 0.0200 -0.0747 -0.0699 -0.0732 -0.0583 female | 0.0095 -0.2277 0.0882 0.0296 0.0063 -0.0029 0.0035 | income98 attend other black female -------------+------------------------------------------------------ income98 | 1.0000 attend | 0.0619 1.0000 other | -0.0529 0.0004 1.0000 black | -0.1349 0.1443 -0.1021 1.0000 female | -0.0823 0.1253 -0.0324 0.1020 1.0000 The correlation between income98 and age is .21. income98 is a covariate that predicts the value of age where age is missing. The correlation between other (neither Black nor white) and misseduc is .18. People who are neither black nor white are more likely to have a missing value on education (not necessarily a higher or lower value, just a missing value. Thus other is an important auxiliary variable as a mechanism for missingness on education. MAR assumes we have included relevant auxiliary variables. We need to include any auxiliary variables and any covariates we identify. Often users of the full information maximum likelihood solutions to missing values include no additional variables even when they are available. I show how to do that in my JMF article (Acock, 2005). Multiple Imputation So much for preliminary analysis. We are now ready to do multiple imputation. The command we will use is one of the best that is currently available and if you use another command you should be aware of the issues we handle. We will do this with a command called ice that was written by Patrick Royston and is an implementation for Stata of S. van Buren and C. G. M. Oudshoorn’s program for MICE that is available in R and S-Plus (www.multiple-imputation.com). Working with Missing Data—Presented at Florida State University, February, 2007 9 First, do a dry run. This does nothing but tell us how Stata thinks we should do it. We will need to modify this as explained below: . ice hrs1-femkid using imputed.dta, dryrun m(20) #missing | values | Freq. Percent Cum. ------------+----------------------------------- 0 | 649 51.43 51.43 1 | 121 9.59 61.01 2 | 381 30.19 91.20 3 | 80 6.34 97.54 4 | 30 2.38 99.92 5 | 1 0.08 100.00 ------------+----------------------------------- Total | 1,262 100.00 Variable | Command | Prediction equation ------------+---------+------------------------------------------------------- hrs1 | regress | childs age educ paeduc maeduc race income98 attend | | other black female femkid childs | regress | hrs1 age educ paeduc maeduc race income98 attend other | | black female femkid age | regress | hrs1 childs educ paeduc maeduc race income98 attend | | other black female femkid educ | regress | hrs1 childs age paeduc maeduc race income98 attend | | other black female femkid paeduc | regress | hrs1 childs age educ maeduc race income98 attend other | | black female femkid maeduc | regress | hrs1 childs age educ paeduc race income98 attend other | | black female femkid race | mlogit | hrs1 childs age educ paeduc maeduc income98 attend | | other black female femkid income98 | regress | hrs1 childs age educ paeduc maeduc race attend other | | black female femkid attend | regress | hrs1 childs age educ paeduc maeduc race income98 other | | black female femkid other | logit | hrs1 childs age educ paeduc maeduc race income98 | | attend black female femkid black | logit | hrs1 childs age educ paeduc maeduc race income98 | | attend other female femkid female | logit | hrs1 childs age educ paeduc maeduc race income98 | | attend other black femkid femkid | regress | hrs1 childs age educ paeduc maeduc race income98 | | attend other black female End of dry run. No imputations were done, no files were created. Working with Missing Data—Presented at Florida State University, February, 2007 10 This shows you the defaults Stata would use if we made no further specifications. If there were no missing values for a variable this would show that it will not do anything with that variable. Notice we are using OLS regression (regress) for every variable except of female, other, and black where it wants to do logistic regression and race for which Stata wants to use multinomial logistic regression (mlogit). Because female, other, and black have just two values, Stata figured out we should do a logistic regression. Because race has three values (white, black, and other), Stata figured out we should do a multinomial logistic regression. Sometimes a variable with three values should be treated using ordinal logistic regression (ologit) and sometimes it should be treated using OLS regression (regress), but with three categories, Stata always guesses that we want multinomial logistic regression. Problems found with the dryrun There is a problem with femkid. It does not make sense to impute childs and female, and also their interaction since their interaction term because femkid, by definition, is the product childs times female. Therefore we need to impute childs and female but let the imputed interaction, femkid = childs female. This means we impute the interaction passively. The option is o passive(femkid:childs*female). This option also will make sure that femkid is not used as a predictor when we are imputing either childs or female. We have race which is coded as white, back, or other. With 3 categories we need to create 2 dummy variables, black and other. We let white be our reference group. We should not impute black or other using logistic regression like we do with female above because black and other are Working with Missing Data—Presented at Florida State University, February, 2007 11 interdependent (a person should not have an imputed value of 1 on both variables). Therefore, we need to impute race using multinomial regression but not impute black or other using logistic regression. This will mean that each missing value will be assigned to one and only one race. This approach will give an imputed value for each missing value on race. Then, it will go back to translated these to the dummy variables black and other. This needs to be passively imputed following the active imputation of race. We use the option o passive(black:race==2\other:race==3). (Double equal signs, ==, are pronounced “is” in Stata.) To guarantee that multinomial logit was used to actively impute race we could use the option o cmd(race:mlogit). We don’t have to include this option because it is the default for race, but do so to illustrate how to specify an estimator. Currently the only available estimators are: regress, logit, mlogit, and ologit. This gets more complicated because we cannot use race with 3 nominal levels as a predictor and must use black and other as predictors when imputing other variables. So we need to add an option to make this substitution happen: o subtitute(race:other black). Here is what we do for our situation. I realize this is a complex command. Stata commands are rarely even remotely this long except for some complex graphs. The three slashes are used at the end of each line to indicate that the following line is still part of the same command. The m(20) will impute 20 datasets. Other programs are much more difficult to implement and often just impute 5 datasets to make them manageable. . ice hrs1-femkid using impute.dta, m(20) /// passive(femkid:childs*female\black:race==2\other:race==3) /// substitute(race:other black) cmd(race:mlogit) Working with Missing Data—Presented at Florida State University, February, 2007 12 #missing | values | Freq. Percent Cum. ------------+----------------------------------- 0 | 649 51.43 51.43 1 | 121 9.59 61.01 2 | 381 30.19 91.20 3 | 80 6.34 97.54 4 | 30 2.38 99.92 5 | 1 0.08 100.00 ------------+----------------------------------- Total | 1,262 100.00 Variable | Command | Prediction equation ------------+---------+------------------------------------------------------- hrs1 | regress | childs age educ paeduc maeduc income98 attend other | | black female femkid childs | regress | hrs1 age educ paeduc maeduc income98 attend other | | black female age | regress | hrs1 childs educ paeduc maeduc income98 attend other | | black female femkid educ | regress | hrs1 childs age paeduc maeduc income98 attend other | | black female femkid paeduc | regress | hrs1 childs age educ maeduc income98 attend other | | black female femkid maeduc | regress | hrs1 childs age educ paeduc income98 attend other | | black female femkid race | mlogit | hrs1 childs age educ paeduc maeduc income98 attend | | female femkid income98 | regress | hrs1 childs age educ paeduc maeduc attend other black | | female femkid attend | regress | hrs1 childs age educ paeduc maeduc income98 other | | black female femkid other | | [Passively imputed from race==3] black | | [Passively imputed from race==2] female | logit | hrs1 childs age educ paeduc maeduc income98 attend | | other black femkid | | [Passively imputed from childs*female] Imputing 1..2..3..4..5..6..7..8..9..10..11..12..13..14..15..16..17..18..19..20..file impute.dta saved Working with Missing Data—Presented at Florida State University, February, 2007 13 Stata is extremely fast. On some programs this would take a long time, but it is all done in a few seconds. With a large number of covariates and auxiliary variables, however, even Stata can take a very long time. What did this command accomplish? We created 20 datasets, m(20), and put them into a single file in our default directory with the name impute.dta. The 20 datasets are stacked in one big dataset where the first 1,262 observations are the first complete dataset; the second 1,262 are the second dataset, and so on. This file has all of our variables; there are no missing values. It has 25,248 observations altogether (1262 20). These 20 stacked datasets are ready to go with all of our variable names, variable labels, and value labels. Some other programs produce the multiple datasets as separate text files that need to be transformed into datasets by adding variable names, value labels, etc. Here is a summary of the stacked file, impute.dta: . sum Variable | Obs Mean Std. Dev. Min Max -------------+-------------------------------------------------------- id | 25240 1351.342 800.8783 4 2812 hrs1 | 25240 42.29011 14.87211 -10.76075 94.99245 childs | 25240 1.599714 1.437436 -4.524939 8 age | 25240 42.05228 12.99656 -3.952193 91.75469 educ | 25240 14.37851 2.75387 0 22.91912 -------------+-------------------------------------------------------- paeduc | 25240 12.27622 3.703488 0 23.45145 maeduc | 25240 12.21553 3.295468 0 22.31159 race | 25240 1.265491 .5999308 1 3 income98 | 25240 18.39968 4.431575 1 32.87609 attend | 25240 3.867593 2.666426 -5.25985 13.53112 -------------+-------------------------------------------------------- other | 25240 .0824485 .2750522 0 1 Working with Missing Data—Presented at Florida State University, February, 2007 14 black | 25240 .1005943 .3007967 0 1 female | 25240 .4880745 .4998677 0 1 femkid | 25240 .8467585 1.312306 -4.524939 8 -------------+-------------------------------------------------------- missfemale | 25240 .1085578 .3110898 0 1 This summary of the 25,240 observations shows some minimum values that might seem problematic such as -10.76 hours a week or -4.52 children. We could change all “impossible” values to a possible value replace hrs1 = 0 if hrs1 < 0, replace childs = 0 if childs < 0, replace age = 18 if age < 18, replace attend = 0 if attend < 0, and replace femkid = 0 if childs ==0 & female == 0. We would run the last command last. Notice that the categorical variables (race, other, black, female) only take on discrete values. Alternatively, we can leave all of the imputed values alone. There are very few out of bounds values in any of our 20 samples. For example, there are no observations below zero on hrs1 in the first dataset of 1,262 observations. Estimating our model 20 times and combining the 20 separate results. You will recall that our model is hrs1 a B1female B2other B3black B4age B5educ B6childs B7(female•childs) To run this on a single dataset (single imputation like done with SPSS) our command would be regress hrs1 male age educ childs in 1/1262, beta The problem with single imputation is that it does not give us as much information as multiple imputation and will tend to under estimate standard errors. To get unbiased standard errors and better estimates of Working with Missing Data—Presented at Florida State University, February, 2007 15 parameters we need to run the regression on multiple datasets and then pool the results into a combined solution. To do the regression twenty times on twenty datasets and pool the twenty sets of estimates we run this command (this command will be superseded by mim within two months): micombine regress hrs1 female other black age educ childs femkid Multiple imputation parameter estimates (20 imputations) ------------------------------------------------------------------------------ hrs1 | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- female | -4.54755 1.433964 -3.17 0.002 -7.360783 -1.734317 other | -3.07658 1.62503 -1.89 0.059 -6.264658 .1114975 black | .7583257 1.536302 0.49 0.622 -2.25568 3.772331 age | -.0431244 .0397376 -1.09 0.278 -.1210839 .034835 educ | .3490039 .1591362 2.19 0.028 .0368013 .6612065 childs | .7887713 .4992205 1.58 0.114 -.1906281 1.768171 femkid | -1.477353 .6848294 -2.16 0.031 -2.820891 -.1338153 _cons | 41.47106 2.795495 14.83 0.000 35.98669 46.95542 ------------------------------------------------------------------------------ 1262 observations. Because ice and micombine were written by a biostatistician who has little interest in standardized measures like R2 and β’s, we only get an unstandardized solution. The parameter estimates are better than with single imputation because each is the average of all 20 values for the given parameter estimate. Any one imputation might get a coefficient, i. e., an unstandardized B, that is too big or two small, but averaged over 20 repetitions we mitigate this likely error. The standard errors for multiple imputation tend to be larger than for the single imputation and the t-ratios tend to be smaller. Although we normally want to minimize standard errors and hence maximize t-ratios and power, those from the single imputation are incorrect because they ignore the variability across our 20 datasets. This variability is inherent in data imputation and should not be ignored. If the first dataset gets very different results than the second Working with Missing Data—Presented at Florida State University, February, 2007 16 dataset and the second dataset gets very different results than the third, etc., the pooled standard errors for the multiple imputations will be much larger reflecting the uncertainty of the imputation process (See Appendix). If the results are virtually the same for each of the 20 datasets, then the pooled standard errors will not be much larger than those obtained from a single imputation. We want R2 and the β’s! To get them we need to do something that is a bit tedious. Currently, the micombine command does not provide the pooled values for the R2 or the β weights. I’ve advocated for the replacement command, mim, adding this capability and the authors are considering my request. We need to estimate the regression for each of the 20 datasets, write down each of the 20 R2’s and β’s, and then average them by computing the mean R2 and the mean for each of the β weights. Although a bit tedious and admittedly something that could be incorporated into the program to be automatic, here is what we would do. 1. First, we estimate the regression equation in each dataset. Because the data is stacked we need to know where to break the dataset into the 20 pieces. Using the calculator in Stata, the first dataset is for observations 1 to 1,262, the second is for observations 1, 263 to 2,562 . . . and the twentieth data set is for observations 23,978 to 25,240. di 1262 1262 . di 1262+1262 2524 . di 2524+1262 3786 . di 3786+1262 Working with Missing Data—Presented at Florida State University, February, 2007 17 5048 . . . . di 22716+1262 23978 . di 23978+1262 25240 Here are the results for the first five of these solutions First Five R2 Betas Imputation female other black age educ childs femkid 1 .0678 -.1707 -.0494 .0105 -.0586 .0729 .0668 -.1069 2 .0513 -.1237 -.0448 .0108 -.0170 .0533 .0783 -.1327 3 .0714 -.1287 -.0632 .0050 -.0511 .0606 .1121 -.1711 4 .0614 -.1519 -.0600 .0143 -.0250 .0680 .0799 -.1186 5 .0586 -.1267 -.0558 .0364 -.0562 .0589 .0871 -.1385 Mean .06 -.14 -.05 .02 -.04 .06 .08 -.13 Here we only show the first five solutions, but for publication we would include all 20 of them. The R2 we report is simply the mean of the R2’s and the β’s are simply the mean of the β’s for that variable. Thus, R2 = .06 (It was also .06 when there as no missing values) and the β = -.04 for age (it was -.05 when there was no missing values). The significance of the β is the same as the significance of the B. Strategies that rely on single imputation miss the underlying uncertainty that we capture with multiple imputations. For the full set of 20 solutions, the β’s for female vary from .1216 to .1933. The mean of the 20 β’s is a far better value than arbitrarily using a single solution and using that value. This is a good reason to do multiple imputation rather than single imputation. Working with Missing Data—Presented at Florida State University, February, 2007 18 How did we do? This is a single application of multiple imputation and we should avoid making too much of how well we did or didn’t do. When we meet the assumption of missing at random (MAR), the multiple imputation based solution is an unbiased estimate of the solution we would obtain from complete data. The following table provides a comparison of our three solutions. The first is our “gold standard.” It is the solution we obtain for a complete dataset, one with no missing values. The solution in the middle, labeled listwise, is the solution we obtain when we lose almost half of our observations to missing values. The solution on the right, labeled multiple imputation, is what we obtain from doing multiple imputation and pooling the results. In this particular example, the listwise solution does not do a terrible job. It slightly overestimates the R2. Some of the t-ratios are too big and some are too small. The B values are too big, substantially so for most of the variables and too small for number of children and the interaction term. We lose significance for childs, and the interaction term compared to the gold standard. The multiple imputation solution produces The identical R2 even though we are missing almost half of our observations and the missingness is far from random. The B’s are closer to those for the complete dataset than the corresponding B’s for the listwise solution. They are closer for female, other, black, educ, and the interaction term. The listwise solution is closer for childs although it is not significant. With the multiple imputation we lose significance for one of the variables, childs, but the interaction is still significant. Param Complete Data Listwise Mean Substitution Multiple Imputation eter B t β B t β B t β B t β female -4.10 -3.32 -.14 -4.78 -2.89 -.17 -4.99 -4.32 -.17 -4.55 -3.17 -.14 Working with Missing Data—Presented at Florida State University, February, 2007 19 other -2.96 -2.00 -.06 -3.56 -1.73 -.06 - -1.91 -.05 -3.08 -1.89 -.05 32.77 Black .29 .21 .01 .91 .52 .02 .52 .40 .01 .76 .49 .02 age -.06 -1.74 -.05 -.08 -1.77 -.07 -.04 -1.01 -.03 -.04 -1.09 -.04 educ .38 2.56 .07 .61 3.09 .12 .36 2.49 .07 .35 2.19 .06 childs 1.01 2.47 .10 .94 1.85 .10 .22 0.58 .02 .79 1.58 .08 femkid -1.56 -2.73 -.14 -1.19 -1.64 -.11 =/te -1.18 -.05 -1.48 -2.16 -.13 inter- 41.24 15.87 38.66 11.45 41.47 16.25 41.47 14.83 cept 2 R .06 .08 .05 .06 Is this cheating? A lot of people, upon their first exposure to multiple imputation, say it is cheating. There are some myths about multiple imputation. Myth 1—We are making up data. Actually, we are not making up any data. We are simply using all of the data available in the dataset and making the reasonable assumption (if MAR is appropriate) that the missing values would be similarly distributed. There is nothing added that is not there when we use all of the available data. Myth 2—We are getting significant results by having more observations than we really have. This may be true of single imputation, but multiple imputation incorporates the uncertainty of the implementation process in how it estimates the pooled standard errors and hence the t-ratios (see the Appendix). Because of this, multiple imputation usually has smaller t- ratios than single implementation that ignores this variance between solutions. If there is a lot of missing values, the multiple imputation will likely have larger t-values than listwise deletion Myth 3—MAR is a ridiculous assumption and we can never justify this. The fact is that we do not have a test of significance for this assumption. However, it is not as unreasonable as it first seems. MCAR is probably unreasonable unless there is planned missingness as part of the research design—say each participant answers a random sample of 50% of the items to keep an instrument from being too long. MAR is reasonable if you do enough to find appropriate auxiliary variables that explain the Working with Missing Data—Presented at Florida State University, February, 2007 20 pattern of missingness. We only had a few of these in this example and you should have many more. Fortunately, we know a great deal about who is more likely to skip or refuse to answer items. As long as we include a reasonable set of auxiliary variables the MAR assumption is reasonable. Precautionary note, many applications using full information maximum likelihood ignore this and only include variables that are used explicitly in the model. So you don’t have access or Stata or don’t want to learn it The freeware program, Amelia, is reasonably easy to use. Norm is reasonably easy to use. Both of these have some limitations in what they can do, e.g., Norm assumes all variables are continuous. An additional limitation is that you need to create the m datasets, run the program on each of them, and then bring these results back to Norm. This is not hard, but it is quite tedious and most users limited themselves to imputing 5 datasets. The value of additional datasets has rapidly decreasing marginal utilities. Some experts say 5 is enough and some say you should have more. SAS has a command for multiple imputation (MI) that is not as flexible as Stata’s but is fairly easy to use. SPSS has nothing on multiple imputation. It has an expectation maximization (EM) procedure for single imputation that an article in the American Statistician reported was not properly implemented. Single implementation under the best of circumstances (Norm, ice, etc.) can produce a single implementation quite easily), still will provide biased results. Some researchers say they believe the bias is not so great that they want to learn Stata, SAS, NORM, or AMELIA. Although there is no statistical basis for such a belief, my experience is that with small amounts of missing values, only continuous variables being used, no interactions, and the MAR assumption being reasonable that the SPSS solution is okay. The SPSS solution is an-add on module to SPSS. Working with Missing Data—Presented at Florida State University, February, 2007 21 Appendix A: Rubin’s Rules Rubin (1987) developed the following rules for pooling the solutions from the multiple datasets: Estimates of individual parameters (B, R2, β) 1 m M M j m j 1 where M j is the value of the parameter estimate in the jth dataset M is the pooled estimate m is the number of imputed datasets There are two components of the pooled standard error. First we compute the mean error variance (square of the standard error) 1 m Var Varj m j 1 where Varj is the square of the standard error for the jth dataset Var is the mean of the squares of the standard errors and the variance of the parameter estimates: 1 m B ( M j M )2 m 1 j 1 where B is the between imputation variance of the estimated parameter The pooled standard error is: 1 Pooled(SE) Var 1 B m Working with Missing Data—Presented at Florida State University, February, 2007 22 Some researchers use the ordinary degrees of freedom, N – k, where k is the number of predictors. A more conservative estimate of the degrees of freedom is provided by Schafer (1997). df (m 1) 1 (m+1)B 2 mVar where M j is the value of the parameter estimate in the jth dataset M is the pooled estimate m is the number of imputed datasets Appendix B: New capabilities for ice and mim (replacing micombine) may or may not include the incorporation of standardized values. The help menu’s for the beta versions of the new ice and mim are now on the web page for this workshop (www.oregonstate.edu/~acock/missing ) The mim command greatly extends the capability of micombine. It is possible to use far more Stata commands this way including the multilevel commands and the complex survey commands. Working with Missing Data—Presented at Florida State University, February, 2007 23 References Acock, A. C. (2005). Working with missing values. Journal of Marriage and the Family, 67, 1012-1028. Honaker, J., King, G., & Blackwell, M. (2007). Amelia II: A Program for Missing Data. Cao, H. (2001). IMPUTE: A SAS application system for missing value imputation. Ann Arbor: Survey Research Center, Institute for Social Research. King, G., Honaker, J. Joseph, A., Scheve, K. (2001). Analyzing incomplete political science data: An alternaqtive algorithm for multiple imputation. American Political Science Review 95:1: 49-69 McKnight, P. E., McKnight, K. M., Sidani, S., & Figueredo, A. J. (2007). Missing Data: A Gentle Introduction. New York: Guilford. Royston P. (2004). Multiple imputation of missing values. Stata Journal 4(3): 227-241. Rubin, D.B. (1987). Multiple Imputation for Nonresponse in Surveys. J. Wiley & Sons, New York. Schafer, J.L. (1997). Analysis of Incomplete Multivariate Data. Chapman & Hall, London. Yuan, Yang C. Multiple imputation for missing data: Concepts and new developments. Rockville, MD: SAS Institute. Working with Missing Data—Presented at Florida State University, February, 2007 24