                   Example of Working with Missing Values
                                  Alan C. Acock
                                  February, 2007
            Presented at the Department of Family and Child Studies
                             Florida State University
                       Supporting material is available at
                    www.oregonstate.edu/~acock/missing

Based on the Power Point Presentation, the following is an example of working
with missing values. These are notes to facilitate the presentation and are not
intended to be in a format appropriate for publication.

There are many packages and commands for working with missing
values. I will illustrate the process using a command written by Royston
(2004) for Stata. It is useful to see this process even if you do not have
access to Stata, because it is one of the best approaches currently
available and it allows us to illustrate several important issues that are
problematic in various packages.

Model:

We will estimate the hours a person works. We think this depends on their
gender, race (white, black, other), age, education, number of children
and an interaction of the number of children and gender.

hrs1 = a + B1(female) + B2(other) + B3(black) + B4(age) + B5(educ) + B6(childs) + B7(female × childs)

We have data for 1,262 adults with no missing data using the 2004
General Social Survey.




Working with Missing Data—Presented at Florida State University, February, 2007           1
. sum
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
          id |      1262    1351.342      801.18          4       2812
        hrs1 |      1262    42.31616    14.89532          1         89
      childs |      1262    1.602219    1.436003          0          8
         age |      1262    41.94295    12.85587         18         86
        educ |      1262    14.36846    2.753653          0         20
-------------+--------------------------------------------------------
      paeduc |      1262    12.25832    3.742694          0         20
      maeduc |      1262    12.22583    3.284868          0         20
    income98 |      1262    18.49128    4.418744          1         24
      attend |      1262    3.861331    2.666016          0          8
-------------+--------------------------------------------------------
       other |      1262    .0847861    .2786735          0          1
       black |      1262    .1030111    .3040939          0          1
      female |      1262    .4896989     .500092          0          1
      femkid |      1262    .8312203    1.297545          0          8
-------------+--------------------------------------------------------

If we do a regression (listwise deletion) we obtain:
. regress hrs1 female other black age educ childs femkid, beta

      Source |       SS       df       MS                        Number of obs    =     1262
-------------+------------------------------                     F( 7, 1254)      =    12.16
       Model |   17784.362     7 2540.62314                      Prob > F         =   0.0000
    Residual | 261994.488 1254 208.927024                        R-squared        =   0.0636
-------------+------------------------------                     Adj R-squared    =   0.0583
       Total |   279778.85 1261 221.870619                       Root MSE         =   14.454

------------------------------------------------------------------------------
        hrs1 |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
      female |   -4.09762   1.233346    -3.32   0.001                -.1375725
       other |    -2.9565   1.478854    -2.00   0.046                -.0553126
       black |   .2853287   1.360049     0.21   0.834                 .0058251
         age | -.0603985    .0348098    -1.74   0.083                -.0521288
        educ |   .3840182   .1500599     2.56   0.011                 .0709923
      childs |   1.010823   .4098343     2.47   0.014                 .0974497
      femkid | -1.562961    .5720398    -2.73   0.006                 -.136151
       _cons |   41.23919    2.59819    15.87   0.000                        .
------------------------------------------------------------------------------


I created a new dataset that has missing values that violate the MAR
assumption.

  - I deleted values deliberately so that the resulting dataset using
    listwise deletion has only 680 observations.
  - We are missing between 2% and 19% of the values for each
    variable, but with listwise deletion almost half the observations are
    dropped because they have a missing value on at least one
    variable.
  - If I had deleted these randomly, then multiple imputation would
    approximate the results for the full sample. I deliberately deleted
    the values to violate the assumption that data are missing at
    random.
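A small simulation can make this concrete. The sketch below (plain Python on made-up numbers, not the GSS extract) deletes values two ways, completely at random (MCAR) and dependent on the value itself (MNAR), and shows that dropping incomplete cases stays close to the full-sample mean only in the first case:

```python
import random

random.seed(1)

# Hypothetical "hours worked" values; not the GSS data.
hours = [random.gauss(42, 15) for _ in range(5000)]

# MCAR: every value has the same 20% chance of being deleted.
mcar = [h for h in hours if random.random() > 0.20]

# MNAR: long hours are far more likely to go missing (60% if h > 50).
mnar = [h for h in hours if not (random.random() < (0.60 if h > 50 else 0.05))]

mean = lambda xs: sum(xs) / len(xs)
full_m, mcar_m, mnar_m = mean(hours), mean(mcar), mean(mnar)

# Deletion is roughly harmless under MCAR but biased downward under MNAR.
print(round(full_m, 1), round(mcar_m, 1), round(mnar_m, 1))
```

Under MNAR the surviving cases systematically under-represent the long-hours workers, which is exactly the kind of violation built into this example dataset.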

The results for our new dataset, using listwise deletion, are quite different:

. sum
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
          id |      1262    1351.342      801.18          4       2812
        hrs1 |      1125      42.272    14.87459          1         89
      childs |      1115    1.583857     1.43404          0          8
         age |      1017    42.09636    12.93885         18         86
        educ |      1236    14.36246    2.755579          0         20
-------------+--------------------------------------------------------
      paeduc |      1232    12.25812     3.70872          0         20
      maeduc |      1223    12.21259    3.302125          0         20
        race |      1209    1.267163    .6012586          1          3
    income98 |      1208    18.40977    4.432703          1         24
      attend |      1212    3.861386    2.669322          0          8
-------------+--------------------------------------------------------
       other |      1213    .0824402    .2751477          0          1
       black |      1219    .1033634    .3045579          0          1
      female |      1125    .4897778    .5001178          0          1
      femkid |      1024    .8476563    1.305764          0          8
-------------+--------------------------------------------------------

. regress hrs1 female other black age educ childs femkid, beta

      Source |       SS       df       MS                        Number of obs    =      680
-------------+------------------------------                     F( 7,     672)   =     8.47
       Model | 11131.7958      7 1590.25655                      Prob > F         =   0.0000
    Residual | 126184.756    672 187.774934                      R-squared        =   0.0811
-------------+------------------------------                     Adj R-squared =      0.0715
       Total | 137316.551    679 202.233507                      Root MSE      =      13.703

------------------------------------------------------------------------------
        hrs1 |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
      female |    -4.7175   1.630608    -2.89   0.004                -.1659522
       other | -3.564597    2.055072    -1.73   0.083                -.0648645
       black |   .9089816     1.7347     0.52   0.600                 .0198014
         age | -.0786796    .0443309    -1.77   0.076                -.0719363
        educ |   .6051706   .1961412     3.09   0.002                 .1161483
      childs |   .9381454   .5071599     1.85   0.065                 .0975426
      femkid | -1.194012    .7266066    -1.64   0.101                -.1132623
       _cons |   38.65546   3.375819    11.45   0.000                        .
------------------------------------------------------------------------------


The results using mean substitution are also quite different with the explanatory
power of the model attenuated:


. regress hrs1m femalem otherm blackm agem educm childsm femkidm, beta

      Source |       SS       df       MS                        Number of obs    =     1262
-------------+------------------------------                     F( 7, 1254)      =     9.22
       Model | 12168.8102      7 1738.40145                      Prob > F         =   0.0000
    Residual | 236519.958 1254 188.612407                        R-squared        =   0.0489
-------------+------------------------------                     Adj R-squared    =   0.0436
       Total | 248688.768 1261 197.215518                        Root MSE         =   13.734

------------------------------------------------------------------------------
       hrs1m |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
     femalem | -4.990874     1.15559    -4.32   0.000                -.1678047
      otherm | -2.767918    1.448562    -1.91   0.056                 -.053167
      blackm |   .5248508   1.312423     0.40   0.689                 .0111867
        agem | -.0360429    .0357436    -1.01   0.313                -.0298081
       educm |   .3560231   .1431787     2.49   0.013                 .0691348
     childsm |   .2233433   .3823879     0.58   0.559                 .0214362
     femkidm | -.6406052    .5417609    -1.18   0.237                -.0536494
       _cons |   41.48353   2.552413    16.25   0.000                        .
------------------------------------------------------------------------------
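The attenuation is easy to reproduce. This sketch (plain Python on made-up data, not the GSS file) deletes 30% of a predictor at random and fills the holes with the observed mean; the filled-in constants contribute nothing to the covariance with the outcome, so R-squared drops:

```python
import random

random.seed(2)

# Hypothetical data: y depends linearly on x.
n = 4000
x = [random.gauss(0, 1) for _ in range(n)]
y = [2.0 * xi + random.gauss(0, 1) for xi in x]

# Delete 30% of x completely at random, then fill with the observed mean.
missing = set(random.sample(range(n), int(0.3 * n)))
obs = [x[i] for i in range(n) if i not in missing]
xbar = sum(obs) / len(obs)
filled = [xbar if i in missing else x[i] for i in range(n)]

def r_squared(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy * sxy / (sxx * syy)

r2_full, r2_mean = r_squared(x, y), r_squared(filled, y)
print(round(r2_full, 2), round(r2_mean, 2))  # R-squared is attenuated
```

The same mechanism is at work in the mean-substitution regression above: the R-squared of 0.0489 is well below the 0.0636 of the complete data.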


Checking for missing values

. misschk hrs1 female other black age educ childs femkid, gen(miss)
replace dummy
Variables examined for missing values

   # Variable         # Missing   % Missing
--------------------------------------------
   1 hrs1                 137        10.9
   2 female               137        10.9
   3 other                 49         3.9
   4 black                 43         3.4
   5 age                  245        19.4
   6 educ                  26         2.1
   7 childs               147        11.6
   8 femkid               238        18.9

The columns in the table below correspond to the # in the table
above.
If a column is _, there were no missing cases for that variable.

Missing for |
      which |
 variables? |      Freq.     Percent        Cum.
------------+-----------------------------------
  1___5 _7_ |         11        0.87        0.87
  1___5 ___ |        102        8.08        8.95
  1____ ___ |         24        1.90       10.86
  _2__5 __8 |          1        0.08       10.94
  _2___ 6_8 |         25        1.98       12.92
  _2___ __8 |        111        8.80       21.71
  __34_ ___ |         43        3.41       25.12
  __3__ ___ |          6        0.48       25.59
  ____5 _7_ |          9        0.71       26.31
  ____5 ___ |        122        9.67       35.97
  _____ 6__ |          1        0.08       36.05
  _____ _78 |        101        8.00       44.06
  _____ _7_ |         26        2.06       46.12
  _____ ___ |        680       53.88      100.00
------------+-----------------------------------
      Total |      1,262      100.00

  - The first pattern has missing values on three of the variables:
    numbers 1, 5, and 7 (hrs1, age, childs).
  - This table can tell us if there is a variable or, more usefully, a
    combination of variables that has a lot of missing values.

The next table tells us how many people have missing values on 0, 1, 2, …
of the variables. Notice that all but 37 of the observations are missing
values for 2 or fewer variables and no observation is missing a value for
more than 3 variables.

Missing for |
   how many |
 variables? |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        680       53.88       53.88
          1 |        179       14.18       68.07
          2 |        366       29.00       97.07
          3 |         37        2.93      100.00
------------+-----------------------------------
      Total |      1,262      100.00

Variables created:

    miss<varnm> is a binary variable indicating missing data for <varnm>.

  - This command creates a dummy variable (miss<varnm>) for each
    variable to represent the missingness.
  - These are coded 0 if not missing and 1 if missing.
  - We can use these new variables to see if there are variables in the
    dataset that predict them.
  - A variable that is correlated with one of these indicators is known as
    an auxiliary variable. It is a mechanism that explains the missingness.
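A rough Python analogue of what the gen(miss) option produces (toy data with hypothetical values, not the GSS extract) is:

```python
# Build a 0/1 missingness indicator per variable, then tabulate,
# mimicking misschk's two tables on a tiny made-up dataset.
data = {
    "hrs1": [40, None, 35, 50, None, 45],
    "age":  [25, 31, None, 44, 52, None],
    "educ": [12, 16, 14, None, 12, 18],
}

# misshrs1, missage, misseduc: 1 if missing, 0 if observed
miss = {v: [1 if x is None else 0 for x in col] for v, col in data.items()}

for v, flags in miss.items():
    n_miss = sum(flags)
    print(f"miss{v}: {n_miss} missing ({100 * n_miss / len(flags):.1f}%)")

# How many variables is each observation missing? (the second misschk table)
per_obs = [sum(miss[v][i] for v in miss) for i in range(6)]
```

Any case with per_obs greater than zero would be dropped by listwise deletion, which is why the losses compound across variables.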

Here is one example:

. tab misshrs1

-> tabulation of misshrs1

   Missing |
 value for |
     hrs1? |      Freq.     Percent        Cum.
------------+-----------------------------------
 NotMissing |      1,125       89.14       89.14
    Missing |        137       10.86      100.00
------------+-----------------------------------
      Total |      1,262      100.00

What are our auxiliary variables?

paeduc
maeduc
income98
attend
black
other

  - Normally, we would pick more.
  - These are variables that predict whether there is a missing value or
    not—they may or may not predict the score for the missing value.
  - Think of other auxiliary variables that are mechanisms explaining why
    a value is missing—race? Depression?

What are our covariates?

paeduc
maeduc
income98
attend
black
other

  - Normally, we would pick more candidates to be used as either
    auxiliary variables or covariates.
  - These happen to be the same variables, but they are selected because
    we think they might be related to the score on our primary variables
    (hrs1, childs, age, educ, and the interaction).
  - Think of other covariates. Often these are the same as the auxiliary
    variables. For example, minorities may work fewer hours because of
    discrimination, and knowing minority status would help us predict the
    value when it is missing.

Finding auxiliary variables and covariates

To evaluate auxiliary variables, we look for variables that are
correlated with the missingness indicators. Normally, we would
include far more variables (edited output follows):

. pwcorr misshrs1-missfemkid hrs1-female

             | misshrs1 missfe~e missot~r missbl~k  missage misseduc missch~s
-------------+---------------------------------------------------------------
    misshrs1 |   1.0000
  missfemale |  -0.1218   1.0000
   missother |  -0.0701  -0.0701   1.0000
   missblack |  -0.0655  -0.0655   0.9345   1.0000
     missage |   0.5564  -0.1648  -0.0986  -0.0922   1.0000
    misseduc |  -0.0506   0.3977  -0.0292  -0.0272  -0.0712   1.0000
  misschilds |  -0.0394  -0.1267  -0.0730  -0.0682  -0.0533  -0.0527   1.0000
  missfemkid |  -0.1682   0.7238  -0.0969  -0.0905  -0.2315   0.2866   0.4627
        hrs1 |        .   0.0007  -0.0423  -0.0442  -0.0202  -0.0040   0.0767
      childs |   0.0424  -0.0857  -0.0507  -0.0426  -0.0099  -0.0505        .
         age |  -0.0202  -0.0691   0.0342   0.0362        .  -0.0354   0.0439
        educ |   0.0172   0.0444  -0.0132  -0.0106   0.0215        .  -0.0765
      paeduc |   0.0102   0.0547   0.0116   0.0273   0.0054   0.0250  -0.0864
      maeduc |  -0.0378   0.0761   0.0027   0.0189  -0.0248   0.0317  -0.0985
    income98 |   0.0407  -0.0189   0.1013   0.0861  -0.0222   0.0120  -0.0221
      attend |   0.0003   0.0049  -0.0506  -0.0552   0.0614  -0.0371   0.0472
       other |  -0.0123   0.0540        .        .  -0.0313   0.1833   0.0265
       black |   0.0328  -0.0526  -0.0239        .   0.0449  -0.0315  -0.0595
      female |  -0.0114        .   0.0000  -0.0006   0.0151   0.0304  -0.0105

             | missfe~d     hrs1   childs      age     educ   paeduc   maeduc
-------------+---------------------------------------------------------------
  missfemkid |   1.0000
        hrs1 |   0.0624   1.0000
      childs |  -0.0857  -0.0392   1.0000
         age |  -0.0223  -0.0322   0.4021   1.0000
        educ |  -0.0748   0.0739  -0.0840   0.0151   1.0000
      paeduc |  -0.0434   0.0208  -0.1637  -0.2146   0.3701   1.0000
      maeduc |  -0.0196   0.0632  -0.1602  -0.1877   0.3754   0.6151   1.0000
    income98 |  -0.0364   0.1914   0.1103   0.2140   0.2925   0.1274   0.1283
      attend |   0.0498  -0.0533   0.2204   0.0606   0.0669  -0.0269  -0.0104
       other |  -0.0047  -0.0484  -0.0335  -0.1016   0.0094  -0.0156  -0.1387
       black |  -0.0993  -0.0055   0.0200  -0.0747  -0.0699  -0.0732  -0.0583
      female |   0.0095  -0.2277   0.0882   0.0296   0.0063  -0.0029   0.0035

             |   income98   attend    other    black   female
-------------+------------------------------------------------------
    income98 |     1.0000
      attend |     0.0619   1.0000
       other |    -0.0529   0.0004   1.0000
       black |    -0.1349   0.1443 -0.1021    1.0000
      female |    -0.0823   0.1253 -0.0324    0.1020   1.0000

The correlation between income98 and age is .21; income98 is a covariate
that predicts the value of age where age is missing. The correlation
between other (neither black nor white) and misseduc is .18. People who
are neither black nor white are more likely to have a missing value on
education (not necessarily a higher or lower value, just a missing value).
Thus other is an important auxiliary variable as a mechanism for
missingness on education. MAR assumes we have included the relevant
auxiliary variables.
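The screen itself is nothing more than a correlation between a candidate variable and a missingness indicator. A minimal sketch with made-up values (not the GSS numbers):

```python
# Pearson correlation between a candidate auxiliary variable and a
# 0/1 missingness indicator, as in the pwcorr screen above.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / (sxx * syy) ** 0.5

# Toy version of the other/misseduc relationship: membership in the
# "other" race category vs. an indicator for missing education.
other    = [1, 1, 1, 0, 0, 0, 0, 0]
misseduc = [1, 1, 0, 0, 1, 0, 0, 0]

r = pearson(other, misseduc)
```

A clearly positive r here would flag the candidate as an auxiliary variable worth carrying into the imputation model.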

  - We need to include any auxiliary variables and any covariates we
    identify.
  - Often users of the full information maximum likelihood solutions to
    missing values include no additional variables even when they are
    available. I show how to do that in my JMF article (Acock, 2005).

Multiple Imputation

So much for preliminary analysis. We are now ready to do multiple
imputation. The command we will use is one of the best currently
available, and if you use another command you should be aware of the
issues we handle here. We will do this with a command called ice,
written by Patrick Royston, which is a Stata implementation of S. van
Buuren and C. G. M. Oudshoorn's MICE program that is available in R
and S-Plus (www.multiple-imputation.com).


First, do a dry run. This does nothing but tell us how Stata thinks we should
do it. We will need to modify this as explained below:

. ice hrs1-femkid using imputed.dta, dryrun m(20)

   #missing |
     values |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        649       51.43       51.43
          1 |        121        9.59       61.01
          2 |        381       30.19       91.20
          3 |         80        6.34       97.54
          4 |         30        2.38       99.92
          5 |          1        0.08      100.00
------------+-----------------------------------
      Total |      1,262      100.00

   Variable | Command | Prediction equation
------------+---------+-------------------------------------------------------
       hrs1 | regress | childs age educ paeduc maeduc race income98 attend
            |         | other black female femkid
     childs | regress | hrs1 age educ paeduc maeduc race income98 attend other
            |         | black female femkid
        age | regress | hrs1 childs educ paeduc maeduc race income98 attend
            |         | other black female femkid
       educ | regress | hrs1 childs age paeduc maeduc race income98 attend
            |         | other black female femkid
     paeduc | regress | hrs1 childs age educ maeduc race income98 attend other
            |         | black female femkid
     maeduc | regress | hrs1 childs age educ paeduc race income98 attend other
            |         | black female femkid
       race | mlogit | hrs1 childs age educ paeduc maeduc income98 attend
            |         | other black female femkid
   income98 | regress | hrs1 childs age educ paeduc maeduc race attend other
            |         | black female femkid
     attend | regress | hrs1 childs age educ paeduc maeduc race income98 other
            |         | black female femkid
      other | logit   | hrs1 childs age educ paeduc maeduc race income98
            |         | attend black female femkid
      black | logit   | hrs1 childs age educ paeduc maeduc race income98
            |         | attend other female femkid
     female | logit   | hrs1 childs age educ paeduc maeduc race income98
            |         | attend other black femkid
     femkid | regress | hrs1 childs age educ paeduc maeduc race income98
            |         | attend other black female

End of dry run. No imputations were done, no files were created.

  - This shows the defaults Stata would use if we made no further
    specifications. If there were no missing values for a variable, this
    would show that it will not do anything with that variable.
  - Notice we are using OLS regression (regress) for every variable
    except female, other, and black, where it wants to do logistic
    regression, and race, for which Stata wants to use multinomial
    logistic regression (mlogit).
  - Because female, other, and black have just two values, Stata
    figured out we should do a logistic regression.
  - Because race has three values (white, black, and other), Stata
    figured out we should do a multinomial logistic regression.
  - Sometimes a variable with three values should be treated using
    ordinal logistic regression (ologit) and sometimes it should be
    treated using OLS regression (regress), but with three categories,
    Stata always guesses that we want multinomial logistic regression.

Problems found with the dry run

  - There is a problem with femkid. It does not make sense to impute
    childs, female, and also their interaction separately, because
    femkid, by definition, is the product of childs and female.
  - Therefore we need to impute childs and female but compute the
    imputed interaction as femkid = childs × female. This means we
    impute the interaction passively. The option is
       o passive(femkid:childs*female).
  - This option also will make sure that femkid is not used as a predictor
    when we are imputing either childs or female.
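Passive imputation in miniature, as a Python sketch (toy rows; the impute step below is only a stand-in for a real draw from an imputation model):

```python
# Impute childs and female however the model dictates, then *derive*
# femkid rather than imputing it as a variable in its own right.
rows = [
    {"childs": 2,    "female": 1},
    {"childs": None, "female": 0},
    {"childs": 1,    "female": None},
]

def impute(row):
    # stand-in for a real draw from an imputation model
    if row["childs"] is None:
        row["childs"] = 2
    if row["female"] is None:
        row["female"] = 1
    return row

for row in rows:
    impute(row)
    # passive step: femkid is recomputed, never imputed directly
    row["femkid"] = row["childs"] * row["female"]

femkids = [r["femkid"] for r in rows]
```

Because femkid is always rebuilt from its components, it can never disagree with them, which is the whole point of the passive() option.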

  - We have race, which is coded as white, black, or other. With 3
    categories we need to create 2 dummy variables, black and other.
  - We let white be our reference group.
  - We should not impute black or other using logistic regression like we
    do with female above, because black and other are
    interdependent (a person should not have an imputed value of 1 on
    both variables).
  - Therefore, we need to impute race using multinomial logistic
    regression and not impute black or other using logistic regression.
  - This will mean that each missing value will be assigned to one and
    only one race. This approach will give an imputed value for each
    missing value on race.
  - Then, it will go back and translate these to the dummy variables
    black and other. These need to be passively imputed following the
    active imputation of race. We use the option
       o passive(black:race==2\other:race==3). (Double equal signs,
         ==, are pronounced "is" in Stata.)
  - To guarantee that multinomial logit is used to actively impute
    race, we could use the option
       o cmd(race:mlogit).
  - We don't have to include this option because it is the default for
    race, but we do so to illustrate how to specify an estimator. Currently
    the only available estimators are: regress, logit, mlogit, and ologit.

This gets more complicated because we cannot use race with 3 nominal
levels as a predictor; we must use black and other as predictors when
imputing other variables. So we need to add an option to make this
substitution happen:
       o substitute(race:other black).

Here is what we do for our situation. I realize this is a complex command.
Stata commands are rarely even remotely this long except for some
complex graphs. The three slashes are used at the end of each line to
indicate that the following line is still part of the same command. The
m(20) will impute 20 datasets. Other programs are much more difficult to
implement and often just impute 5 datasets to make them manageable.

. ice hrs1-femkid using impute.dta, m(20) ///
     passive(femkid:childs*female\black:race==2\other:race==3) ///
     substitute(race:other black) cmd(race:mlogit)
   #missing |
     values |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        649       51.43       51.43
          1 |        121        9.59       61.01
          2 |        381       30.19       91.20
          3 |         80        6.34       97.54
          4 |         30        2.38       99.92
          5 |          1        0.08      100.00
------------+-----------------------------------
      Total |      1,262      100.00

   Variable | Command | Prediction equation
------------+---------+-------------------------------------------------------
       hrs1 | regress | childs age educ paeduc maeduc income98 attend other
            |         | black female femkid
     childs | regress | hrs1 age educ paeduc maeduc income98 attend other
            |         | black female
        age | regress | hrs1 childs educ paeduc maeduc income98 attend other
            |         | black female femkid
       educ | regress | hrs1 childs age paeduc maeduc income98 attend other
            |         | black female femkid
     paeduc | regress | hrs1 childs age educ maeduc income98 attend other
            |         | black female femkid
     maeduc | regress | hrs1 childs age educ paeduc income98 attend other
            |         | black female femkid
       race | mlogit | hrs1 childs age educ paeduc maeduc income98 attend
            |         | female femkid
   income98 | regress | hrs1 childs age educ paeduc maeduc attend other black
            |         | female femkid
     attend | regress | hrs1 childs age educ paeduc maeduc income98 other
            |         | black female femkid
      other |         | [Passively imputed from race==3]
      black |         | [Passively imputed from race==2]
     female | logit   | hrs1 childs age educ paeduc maeduc income98 attend
            |         | other black
     femkid |         | [Passively imputed from childs*female]

Imputing 1..2..3..4..5..6..7..8..9..10..11..12..13..14..15..16..17..18..19..20..file
impute.dta saved




Stata is extremely fast. On some programs this would take a long time,
but it is all done in a few seconds. With a large number of covariates and
auxiliary variables, however, even Stata can take a very long time.
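The chained-equations idea behind ice can be sketched in a few lines. This toy Python version (made-up data, not the GSS) cycles between two variables, regressing each on the other and filling its missing entries with predictions; a real MICE implementation adds random draws, many more cycles, and the right model per variable type:

```python
import random

random.seed(3)

# Two correlated variables with holes punched in each.
n = 400
age  = [random.gauss(42, 12) for _ in range(n)]
educ = [0.1 * a + random.gauss(14, 2) for a in age]

miss_age  = set(random.sample(range(n), 60))
miss_educ = set(random.sample(range(n), 50)) - miss_age  # disjoint, for simplicity
a_obs = [None if i in miss_age else age[i] for i in range(n)]
e_obs = [None if i in miss_educ else educ[i] for i in range(n)]

def fit(xs, ys):
    # simple-regression slope and intercept on the complete pairs
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    mx = sum(p[0] for p in pairs) / len(pairs)
    my = sum(p[1] for p in pairs) / len(pairs)
    b = sum((x - mx) * (y - my) for x, y in pairs) / sum((x - mx) ** 2 for x, _ in pairs)
    return b, my - b * mx

for _ in range(3):                      # a few cycles of the chain
    b, a0 = fit(e_obs, a_obs)           # age regressed on educ
    a_obs = [a0 + b * e_obs[i] if i in miss_age else a_obs[i] for i in range(n)]
    b, a0 = fit(a_obs, e_obs)           # educ regressed on age
    e_obs = [a0 + b * a_obs[i] if i in miss_educ else e_obs[i] for i in range(n)]
```

After the cycles both variables are complete; repeating the whole procedure m times with random draws yields the m stacked datasets that ice writes out.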

What did this command accomplish?

  - We created 20 datasets, m(20), and put them into a single file in our
    default directory with the name impute.dta.
  - The 20 datasets are stacked in one big dataset where the first 1,262
    observations are the first complete dataset, the second 1,262 are
    the second dataset, and so on. This file has all of our variables; there
    are no missing values.
  - It has 25,240 observations altogether (1,262 × 20). These 20 stacked
    datasets are ready to go with all of our variable names, variable
    labels, and value labels.
  - Some other programs produce the multiple datasets as separate
    text files that need to be transformed into datasets by adding
    variable names, value labels, etc.

Here is a summary of the stacked file, impute.dta:

. sum

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
          id |     25240    1351.342    800.8783          4       2812
        hrs1 |     25240    42.29011    14.87211 -10.76075    94.99245
      childs |     25240    1.599714    1.437436 -4.524939           8
         age |     25240    42.05228    12.99656 -3.952193    91.75469
        educ |     25240    14.37851     2.75387          0   22.91912
-------------+--------------------------------------------------------
      paeduc |     25240    12.27622    3.703488          0   23.45145
      maeduc |     25240    12.21553    3.295468          0   22.31159
        race |     25240    1.265491    .5999308          1          3
    income98 |     25240    18.39968    4.431575          1   32.87609
      attend |     25240    3.867593    2.666426   -5.25985   13.53112
-------------+--------------------------------------------------------
       other |     25240    .0824485    .2750522          0          1
       black |     25240    .1005943    .3007967          0          1
      female |     25240    .4880745    .4998677          0          1
      femkid |     25240    .8467585    1.312306 -4.524939           8
-------------+--------------------------------------------------------
  missfemale |     25240    .1085578    .3110898          0          1

  - This summary of the 25,240 observations shows some minimum values
    that might seem problematic, such as -10.76 hours a week or -4.52
    children.
  - We could change all "impossible" values to possible values:
    replace hrs1 = 0 if hrs1 < 0; replace childs = 0 if childs < 0;
    replace age = 18 if age < 18; replace attend = 0 if attend < 0;
    and replace femkid = 0 if childs == 0 & female == 0. The femkid
    command must be run last.
  - Notice that the categorical variables (race, other, black,
    female) only take on discrete values.

Alternatively, we can leave all of the imputed values alone. There are
very few out of bounds values in any of our 20 samples. For example,
there are no observations below zero on hrs1 in the first dataset of 1,262
observations.
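In miniature, the clean-up looks like this (toy Python lists standing in for the stacked variables; one reasonable variant rebuilds the interaction outright, and in any case it must come after childs is cleaned):

```python
# Hypothetical imputed values, including out-of-bounds draws.
hrs1   = [42.0, -10.8, 89.0]
childs = [2.0, -4.5, 1.0]
female = [1, 0, 1]

hrs1   = [max(h, 0.0) for h in hrs1]    # replace hrs1 = 0 if hrs1 < 0
childs = [max(c, 0.0) for c in childs]  # replace childs = 0 if childs < 0

# Fix the interaction last, after childs has been cleaned; here we simply
# recompute it from its components.
femkid = [c * f for c, f in zip(childs, female)]
```

Ordering matters: if femkid were fixed before childs, the negative childs draw would leak back into the interaction.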

Estimating our model 20 times and combining the 20 separate results.

You will recall that our model is

hrs1 = a + B1(female) + B2(other) + B3(black) + B4(age) + B5(educ) + B6(childs) + B7(female × childs)

To run this on a single dataset (single imputation, as is done with
SPSS), our command would be

regress hrs1 female other black age educ childs femkid in 1/1262, beta

The problem with single imputation is that it does not give us as much
information as multiple imputation and will tend to underestimate
standard errors. To get unbiased standard errors and better estimates of
parameters, we need to run the regression on multiple datasets and then
pool the results into a combined solution. To do the regression twenty
times on twenty datasets and pool the twenty sets of estimates, we run
this command (this command will be superseded by mim within two
months):

micombine regress hrs1 female other black age educ childs femkid

Multiple imputation parameter estimates (20 imputations)
------------------------------------------------------------------------------
        hrs1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   -4.54755   1.433964    -3.17   0.002    -7.360783   -1.734317
       other |   -3.07658    1.62503    -1.89   0.059    -6.264658    .1114975
       black |   .7583257   1.536302     0.49   0.622     -2.25568    3.772331
         age | -.0431244    .0397376    -1.09   0.278    -.1210839     .034835
        educ |   .3490039   .1591362     2.19   0.028     .0368013    .6612065
      childs |   .7887713   .4992205     1.58   0.114    -.1906281    1.768171
      femkid | -1.477353    .6848294    -2.16   0.031    -2.820891   -.1338153
       _cons |   41.47106   2.795495    14.83   0.000     35.98669    46.95542
------------------------------------------------------------------------------
1262 observations.


Because ice and micombine were written by a biostatistician who has
little interest in standardized measures like R2 and β’s, we only get an
unstandardized solution.
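If you need a standardized coefficient anyway, recall that β = B × sd(x) / sd(y). As a rough Python check, using the pooled B for educ (.349) and the standard deviations from the complete-data summary at the start of the handout (educ ≈ 2.754, hrs1 ≈ 14.895), so this is only an approximation:

```python
def standardized_beta(b, sd_x, sd_y):
    """Convert an unstandardized regression coefficient B into a
    standardized beta: beta = B * sd(x) / sd(y)."""
    return b * sd_x / sd_y

# pooled B for educ (.349); sds of educ (2.754) and hrs1 (14.895)
beta_educ = standardized_beta(0.349, 2.754, 14.895)
```

This gives roughly .06, consistent with the pooled β reported for educ later in the handout.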
•	The parameter estimates are better than with single imputation
	because each is the average of all 20 values for the given
	parameter estimate.
•	Any one imputation might get a coefficient, i.e., an unstandardized
	B, that is too big or too small, but averaged over 20 repetitions we
	mitigate this likely error.
•	The standard errors for multiple imputation tend to be larger than for
	the single imputation and the t-ratios tend to be smaller.
•	Although we normally want to minimize standard errors and hence
	maximize t-ratios and power, those from the single imputation are
	incorrect because they ignore the variability across our 20 datasets.
•	This variability is inherent in data imputation and should not be
	ignored. If the first dataset gets very different results than the second
     dataset and the second dataset gets very different results than the
     third, etc., the pooled standard errors for the multiple imputations will
     be much larger reflecting the uncertainty of the imputation process
     (See Appendix).
•	If the results are virtually the same for each of the 20 datasets, then
	the pooled standard errors will not be much larger than those
	obtained from a single imputation.

We want R2 and the β’s!

To get them we need to do something that is a bit tedious. Currently, the
micombine command does not provide the pooled values for the R2 or
the β weights. I’ve advocated for the replacement command, mim,
adding this capability and the authors are considering my request.

•	We need to estimate the regression for each of the 20 datasets,
	write down each of the 20 R2's and β's, and then average them by
	computing the mean R2 and the mean for each of the β weights.

Although a bit tedious and admittedly something that could be
incorporated into the program to be automatic, here is what we would
do.
1. First, we estimate the regression equation in each dataset. Because
the data are stacked we need to know where to break the dataset into
the 20 pieces. Using the calculator in Stata, the first dataset is for
observations 1 to 1,262, the second is for observations 1,263 to 2,524 . . .
and the twentieth dataset is for observations 23,979 to 25,240.

. di 1262
1262
. di 1262+1262
2524
. di 2524+1262
3786
. di 3786+1262
5048
. . .
. di 22716+1262
23978
. di 23978+1262
25240
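Rather than working through the calculator, the block boundaries can be generated with a few lines of Python (block_bounds is a hypothetical helper; the block size of 1,262 comes from this example):

```python
def block_bounds(j, n=1262):
    """Return the 1-based (first, last) observation numbers for the j-th
    imputed dataset in a stacked file of blocks of n observations each."""
    first = (j - 1) * n + 1
    last = j * n
    return first, last

# the 20 datasets cover 1..1262, 1263..2524, ..., 23979..25240
bounds = [block_bounds(j) for j in range(1, 21)]
```

Note that the twentieth block begins at observation 23,979 (one past 19 × 1,262 = 23,978) and ends at 25,240.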

Here are the results for the first five of these solutions

 First Five                                 Betas
 Imputation    R2     female    other    black      age     educ   childs   femkid
      1      .0678    -.1707   -.0494    .0105   -.0586    .0729    .0668   -.1069
      2      .0513    -.1237   -.0448    .0108   -.0170    .0533    .0783   -.1327
      3      .0714    -.1287   -.0632    .0050   -.0511    .0606    .1121   -.1711
      4      .0614    -.1519   -.0600    .0143   -.0250    .0680    .0799   -.1186
      5      .0586    -.1267   -.0558    .0364   -.0562    .0589    .0871   -.1385
    Mean     .06      -.14     -.05      .02     -.04      .06      .08     -.13


Here we only show the first five solutions, but for publication we would
include all 20 of them. The R2 we report is simply the mean of the R2's and
the β's are simply the mean of the β's for that variable. Thus, R2 = .06 (it
was also .06 when there were no missing values) and the β = -.04 for age
(it was -.05 when there were no missing values). The significance of
the β is the same as the significance of the B.
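The averaging is trivial to script. A Python sketch using the R2's and the age β's from the five imputations shown above:

```python
# values copied from the table of the first five imputations
r2 = [0.0678, 0.0513, 0.0714, 0.0614, 0.0586]
beta_age = [-0.0586, -0.0170, -0.0511, -0.0250, -0.0562]

mean_r2 = sum(r2) / len(r2)                    # pooled R2
mean_beta_age = sum(beta_age) / len(beta_age)  # pooled beta for age
```

Both round to the reported values of .06 and -.04.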

Strategies that rely on single imputation miss the underlying uncertainty
that we capture with multiple imputations. For the full set of 20 solutions,
the β's for female vary from -.1216 to -.1933. The mean of the 20 β's is a
far better value than arbitrarily picking the β from any single solution.
This is a good reason to do multiple imputation rather than single
imputation.




                                            How did we do?

This is a single application of multiple imputation and we should avoid
making too much of how well we did or didn’t do. When we meet the
assumption of missing at random (MAR), the multiple imputation based
solution is an unbiased estimate of the solution we would obtain from
complete data. The following table provides a comparison of our three
solutions. The first is our “gold standard.” It is the solution we obtain for a
complete dataset, one with no missing values. The solution in the middle,
labeled listwise, is the solution we obtain when we lose almost half of our
observations to missing values. The solution on the right, labeled multiple
imputation, is what we obtain from doing multiple imputation and
pooling the results.

In this particular example, the listwise solution does not do a terrible job.
•	It slightly overestimates the R2.
•	Some of the t-ratios are too big and some are too small.
•	Most of the B values are too big, substantially so for several
	variables, while those for number of children and the interaction
	term are too small.
•	We lose significance for childs and the interaction term compared
	to the gold standard.

The multiple imputation solution produces
•	The identical R2 even though we are missing almost half of our
	observations and the missingness is far from random.
•	B's that are closer to those for the complete dataset than the
	corresponding B's for the listwise solution. They are closer for
	female, other, black, educ, and the interaction term; the listwise
	solution is closer for childs, although it is not significant.
•	One variable, childs, that loses significance, but the interaction is
	still significant.

Parameter     Complete Data         Listwise         Mean Substitution   Multiple Imputation
                B      t     β      B      t     β       B      t     β      B      t     β
female       -4.10  -3.32  -.14  -4.78  -2.89  -.17   -4.99  -4.32  -.17  -4.55  -3.17  -.14
other        -2.96  -2.00  -.06  -3.56  -1.73  -.06  -32.77  -1.91  -.05  -3.08  -1.89  -.05
black          .29    .21   .01    .91    .52   .02     .52    .40   .01    .76    .49   .02
age           -.06  -1.74  -.05   -.08  -1.77  -.07    -.04  -1.01  -.03   -.04  -1.09  -.04
educ           .38   2.56   .07    .61   3.09   .12     .36   2.49   .07    .35   2.19   .06
childs        1.01   2.47   .10    .94   1.85   .10     .22    .58   .02    .79   1.58   .08
femkid       -1.56  -2.73  -.14  -1.19  -1.64  -.11           -1.18  -.05  -1.48  -2.16  -.13
intercept    41.24  15.87        38.66  11.45          41.47  16.25        41.47  14.83
R2             .06                 .08                   .05                 .06




                                                Is this cheating?

A lot of people, upon their first exposure to multiple imputation, say it is
cheating. There are some myths about multiple imputation.

Myth 1—We are making up data. Actually, we are not making up any
data. We are simply using all of the data available in the dataset and
making the reasonable assumption (if MAR is appropriate) that the
missing values would be similarly distributed. There is nothing added that
is not there when we use all of the available data.

Myth 2—We are getting significant results by having more observations
than we really have. This may be true of single imputation, but multiple
imputation incorporates the uncertainty of the imputation process in
how it estimates the pooled standard errors and hence the t-ratios (see
the Appendix). Because of this, multiple imputation usually has smaller t-
ratios than single imputation, which ignores this variance between
solutions. If there are a lot of missing values, the multiple imputation will
still likely have larger t-values than listwise deletion.

Myth 3—MAR is a ridiculous assumption and we can never justify this. The
fact is that we do not have a test of significance for this assumption.
However, it is not as unreasonable as it first seems. MCAR is probably
unreasonable unless there is planned missingness as part of the research
design—say each participant answers a random sample of 50% of the
items to keep an instrument from being too long. MAR is reasonable if
you do enough to find appropriate auxiliary variables that explain the
pattern of missingness. We only had a few of these in this example and
you should have many more. Fortunately, we know a great deal about
who is more likely to skip or refuse to answer items. As long as we include
a reasonable set of auxiliary variables, the MAR assumption is reasonable.
A precautionary note: many applications using full information maximum
likelihood ignore this and only include variables that are used explicitly in
the model.

So you don’t have access to Stata or don’t want to learn it

The freeware program, Amelia, is reasonably easy to use. Norm is also
reasonably easy to use. Both of these have some limitations in what they
can do, e.g., Norm assumes all variables are continuous. An additional
limitation is that you need to create the m datasets, run the program on
each of them, and then bring these results back to Norm. This is not hard,
but it is quite tedious, and most users limit themselves to imputing 5
datasets. The value of additional datasets has rapidly decreasing
marginal utility. Some experts say 5 is enough and some say you should
have more.

SAS has a command for multiple imputation (MI) that is not as flexible as
Stata’s but is fairly easy to use.

SPSS has nothing on multiple imputation. It has an expectation
maximization (EM) procedure for single imputation that an article in the
American Statistician reported was not properly implemented. Single
imputation, even under the best of circumstances (Norm, ice, etc. can
produce a single imputation quite easily), will still provide biased results.
Some researchers believe the bias is not great enough to justify learning
Stata, SAS, NORM, or AMELIA. Although there is no statistical basis for
such a belief, my experience is that with small amounts of missing values,
only continuous variables being used, no interactions, and the MAR
assumption being reasonable, the SPSS solution is okay. The SPSS solution
is an add-on module to SPSS.

                                                    Appendix A:
                                                   Rubin’s Rules


Rubin (1987) developed the following rules for pooling the solutions from the
multiple datasets:

Estimates of individual parameters (B, R2, β)

\bar{M} = \frac{1}{m} \sum_{j=1}^{m} M_j

where   M_j is the value of the parameter estimate in the jth dataset
        \bar{M} is the pooled estimate
        m is the number of imputed datasets


There are two components of the pooled standard error. First we compute the
mean error variance (square of the standard error)

\overline{Var} = \frac{1}{m} \sum_{j=1}^{m} Var_j

where   Var_j is the square of the standard error for the jth dataset
        \overline{Var} is the mean of the squares of the standard errors


and the variance of the parameter estimates:

B = \frac{1}{m-1} \sum_{j=1}^{m} (M_j - \bar{M})^2

where   B is the between-imputation variance of the estimated parameter


The pooled standard error is:

                       1
Pooled(SE)  Var   1   B
                    m




Some researchers use the ordinary degrees of freedom, N – k, where k is the
number of predictors. A more conservative estimate of the degrees of freedom
is provided by Schafer (1997).

df  (m  1) 1  (m+1)B 
                             2
                   mVar


where   m, \overline{Var}, and B are as defined above
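Rubin's rules are simple enough to implement directly. A hedged Python sketch (rubin_pool is a hypothetical helper; the numbers in the usage line are illustrative, not from the example):

```python
import math

def rubin_pool(estimates, variances):
    """Pool m imputation-specific estimates and their squared standard
    errors using Rubin's rules; returns (pooled estimate, pooled SE,
    Schafer's conservative degrees of freedom). Assumes the estimates
    are not all identical (between-imputation variance B > 0)."""
    m = len(estimates)
    m_bar = sum(estimates) / m                    # pooled parameter estimate
    var_bar = sum(variances) / m                  # mean within-imputation variance
    b = sum((e - m_bar) ** 2 for e in estimates) / (m - 1)  # between-imputation variance
    pooled_se = math.sqrt(var_bar + (1 + 1 / m) * b)
    df = (m - 1) * (1 + m * var_bar / ((m + 1) * b)) ** 2
    return m_bar, pooled_se, df

# three imputations with estimates 1.0, 1.2, 0.8 and squared SEs of .25
est, se, df = rubin_pool([1.0, 1.2, 0.8], [0.25, 0.25, 0.25])
```

Note how identical per-imputation results would shrink B toward zero, leaving the pooled SE close to the single-imputation SE, exactly as described in the text.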


                                                    Appendix B:

New capabilities for ice and mim (replacing micombine) may or may not
include standardized values. The help menus for the beta versions of the
new ice and mim are now on the web page for this workshop
(www.oregonstate.edu/~acock/missing).

The mim command greatly extends the capability of micombine. It is possible to
use far more Stata commands this way including the multilevel commands and
the complex survey commands.




                                           References

Acock, A. C. (2005). Working with missing values. Journal of Marriage and
   the Family, 67, 1012-1028.

Honaker, J., King, G., & Blackwell, M. (2007). Amelia II: A Program for
   Missing Data.

Cao, H. (2001). IMPUTE: A SAS application system for missing value
   imputation. Ann Arbor: Survey Research Center, Institute for Social
   Research.

King, G., Honaker, J., Joseph, A., & Scheve, K. (2001). Analyzing
     incomplete political science data: An alternative algorithm for
     multiple imputation. American Political Science Review, 95(1), 49-69.

McKnight, P. E., McKnight, K. M., Sidani, S., & Figueredo, A. J. (2007).
   Missing Data: A Gentle Introduction. New York: Guilford.

Royston P. (2004). Multiple imputation of missing values. Stata Journal
    4(3): 227-241.

Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. New
    York: J. Wiley & Sons.

Schafer, J. L. (1997). Analysis of Incomplete Multivariate Data. London:
    Chapman & Hall.

Yuan, Yang C. Multiple imputation for missing data: Concepts and new
    developments. Rockville, MD: SAS Institute.





				