Performing Sensitivity Analyses
of Imputed Missing Values
Jenny H. Qin and Mike Singleton
Kentucky CODES
Kentucky Injury Prevention & Research Center
University of Kentucky
July 14th, 2003 www.kiprc.uky.edu 29th TRF 2003, Denver
Multiple Imputation Publications
• Multiple Imputation in Public Health Research
• Handling Missing Data in Nursing Research with Multiple Imputation
• Application of Multiple Imputation in Medical Studies: from AIDS to NHANES
Questions???
• May I use MI to deal with missing
data problems for my data sets?
• How can I believe that the MI will
give me better analysis results?
• What should I do to get good results
from MI?
Sensitivity Analyses on Imputed Values
Questions (???) -> Sensitivity Analyses on Imputed Values -> Answers
A sensitivity analysis tests whether our study results are sensitive to the
assumptions (missing data mechanism), data conditions (missing data rate),
and choices (imputation model or number of imputations) made in obtaining
the results.
MI Process
Data Set of Interest -> Proc MI (imputation model) -> imputed data sets
(Set 1, Set 2, Set 3, ..., Set n) -> analysis model -> Results 1, Results 2,
Results 3, ..., Results n -> Proc MIANALYZE -> Results
Factors examined in the sensitivity analyses:
1 Missing Data Mechanism
2 Missing Data Rate
3 Imputation Model
4 Proc MI Options
CODES Application
Research Question:
What was the relationship between driving under the influence of
drugs and/or alcohol, and being killed or hospitalized in a crash,
for motorcycle riders in Kentucky in 2001?
Outcome (Dependent Variable):
Killed or Hospitalized (K/H)
Risk Factor Candidates (Independent Variables):
Age, gender, suspected DUI, posted speed limit, helmet use,
fixed object, head-on collision, collision time, rural vs. urban
Analysis Model
Logistic Regression Model:
logit[P(K/H)] = β0 + β1*DUI + β2*Speed + β3*Fixed + β4*Head-On
Total records in our study Data set:
1,226
Records with missing values:
14 (1.1%)
Results for the Gold Standard
Parameter   OR (95% CI)          Estimate   SE       P
DUI         2.51 (1.58, 3.98)    0.9189     0.2364   0.0001
Speed       1.58 (1.18, 2.10)    0.4546     0.1456   0.0018
Fixed       1.70 (1.24, 2.33)    0.5311     0.1599   0.0009
Head-on     1.70 (1.04, 2.77)    0.5316     0.2486   0.0380
This Gold Standard result is used to compare with all other results.
Conclusion: for motorcyclists suspected of DUI, the odds of being killed or
hospitalized are 2.5 times the odds for motorcyclists not suspected of DUI,
when other factors are controlled.
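The odds ratios and confidence limits in the Gold Standard table can be recovered from the Estimate and SE columns. The following is a minimal sketch in Python (the slides use SAS), assuming a normal-approximation 95% interval with z = 1.96; the function name `odds_ratio_ci` is hypothetical. It reproduces the reported values to within rounding (the Head-on lower limit may differ in the second decimal).

```python
import math

# Estimate and SE columns from the Gold Standard table (log-odds scale).
gold_standard = {
    "DUI": (0.9189, 0.2364),
    "Speed": (0.4546, 0.1456),
    "Fixed": (0.5311, 0.1599),
    "Head-on": (0.5316, 0.2486),
}

Z_95 = 1.96  # assumption: normal quantile for a two-sided 95% interval


def odds_ratio_ci(estimate, se, z=Z_95):
    """Return (odds ratio, lower limit, upper limit) from a log-odds estimate and its SE."""
    return (
        math.exp(estimate),
        math.exp(estimate - z * se),
        math.exp(estimate + z * se),
    )


for name, (estimate, se) in gold_standard.items():
    point, lower, upper = odds_ratio_ci(estimate, se)
    print(f"{name:8s} OR = {point:.2f} ({lower:.2f}, {upper:.2f})")
```

Exponentiating the estimate gives the odds ratio because the logistic model is linear on the log-odds scale.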
Imputation Model
Analysis Model:
logit[P(K/H)] = β0 + β1*DUI + β2*Speed + β3*Fixed + β4*Head-On
Imputation Model:
K/H DUI Speed Fixed Head-On
Note: The imputation model does not have to be identical to the analysis
model, but it should include at least all of the variables in the analysis
model. You can add any additional variables that are correlated with the
variables that have missing values.
SA: 1 Missing Data Mechanism
[Diagram: the MI process with factor 1, Missing Data Mechanism, highlighted. Mechanisms compared: MCAR, MAR, NMAR.]
SA: 1 Missing Data Mechanism
• Missing Completely At Random (MCAR)
– Definition: the missing data values are a simple random sample of all data values.
– We simulated this condition by using SAS Proc SurveySelect to pick a
random sample from the study data set, then setting DUI = missing for those
selected cases.
• Missing At Random (MAR)
– Definition: the probability of missing values on one variable is unrelated to the
values of this variable, after controlling for other variables in the analysis.
– We simulated this condition by setting DUI = missing for riders aged 46 or
older.
• Not Missing At Random (NMAR)
– Definition: the probability of missing values on one variable is related to the
values of this variable even if we control for other variables in the analysis.
– We simulated this condition by setting DUI = missing for uninjured riders
who were not suspected of DUI (DUI=‘NO’).
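The three simulated mechanisms above can be sketched as follows. This is an illustrative Python sketch, not the authors' SAS code (they used Proc SurveySelect for the MCAR sample). The synthetic records, field names (`age`, `dui`, `kh`), and prevalence values are all assumptions made for the example.

```python
import random


def make_riders(n, seed=0):
    """Generate synthetic rider records (hypothetical data, not the CODES file)."""
    rng = random.Random(seed)
    return [
        {
            "age": rng.randint(16, 80),
            "dui": rng.random() < 0.10,  # True = suspected of DUI (assumed prevalence)
            "kh": rng.random() < 0.20,   # True = killed or hospitalized (assumed prevalence)
        }
        for _ in range(n)
    ]


def blank_dui(riders, should_blank):
    """Return copies of the records with dui=None wherever should_blank(record) is true."""
    return [dict(r, dui=None) if should_blank(r) else dict(r) for r in riders]


def mcar(riders, rate, seed=1):
    """MCAR: set DUI to missing for a simple random sample of records."""
    rng = random.Random(seed)
    chosen = set(rng.sample(range(len(riders)), round(rate * len(riders))))
    return [dict(r, dui=None) if i in chosen else dict(r) for i, r in enumerate(riders)]


def mar(riders, min_age=46):
    """MAR: missingness depends only on an observed variable (age), not on DUI itself."""
    return blank_dui(riders, lambda r: r["age"] >= min_age)


def nmar(riders):
    """NMAR: missingness depends on the DUI value itself (uninjured riders with DUI = no)."""
    return blank_dui(riders, lambda r: (not r["kh"]) and (not r["dui"]))


def missing_rate(riders):
    """Fraction of records whose DUI value is missing."""
    return sum(r["dui"] is None for r in riders) / len(riders)


riders = make_riders(1000)
for label, data in (("MCAR", mcar(riders, 0.25)), ("MAR", mar(riders)), ("NMAR", nmar(riders))):
    print(label, f"{missing_rate(data):.2f}")
```

In this sketch only the MCAR rate is controlled directly. Matching a target rate such as 25% under MAR or NMAR would need an extra sampling step within the eligible records, which is omitted here.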
Created 3 data sets from the study data set with different missing data
mechanisms, but with the same percentage of missing values for DUI (25%).
E = estimate, SE = standard error, P = p-value.
            MCAR, 25% missing on DUI    MAR, 25% missing on DUI     NMAR, 25% missing on DUI
Parameter   E        SE      P          E        SE      P          E        SE      P
Intercept   -1.7336  0.1096  0.0001     -1.7259  0.1092  0.0001     -1.7204  0.1092  0.0001
DUI          0.8544  0.2664  0.0016      0.8286  0.2623  0.0018      0.5791  0.2223  0.0092
Speed        0.5018  0.1449  0.0005      0.4843  0.1448  0.0008      0.4812  0.1443  0.0009
Fixed        0.4927  0.1610  0.0022      0.5079  0.1597  0.0015      0.5400  0.1578  0.0006
Head-on      0.5133  0.2485  0.0388      0.5133  0.2486  0.0389      0.5103  0.2475  0.0393
Sensitivity analysis on missing data mechanism
1 Missing Data Mechanism: Different
2 Missing Data Rate (25%): Same
3 Imputation Model: Same
4 Proc MI Options: Same
[Figure: Estimates for Parameters with Different Missing Data Mechanisms. Y axis: Estimate (0 to 1); X axis: DUI, Speed, Fixed, Head-on; series: GoldStd, MCAR, MAR, NMAR.]
What is the result?
Conclusions of SA on Missing Data Mechanism
• Even though we used the simplest imputation model, MI was able to produce
results that are consistent with the Gold Standard when the missing data
mechanism was MCAR or MAR, but not NMAR.
• With NMAR, we would estimate the odds ratio of death or hospitalization
for riders suspected of DUI to be 1.78 (1.15, 2.76), while our Gold Standard
gives 2.51 (1.58, 3.98).
[Figure: Point Estimate and 95% CI for DUI with Different Missing Data Mechanisms. Y axis: Odds Ratio (0 to 4.5); X axis: GoldStd, MCAR, MAR, NMAR.]
SA: 2 Missing Data Rate
[Diagram: the MI process with factor 2, Missing Data Rate, highlighted. Rates compared: 6%, 25%, 50%.]
SA: 2 Missing Data Rate
• Data sets with MCAR: tested with 6%, 25%, and 50% of DUI values missing
• Data sets with MAR: tested with 6%, 25%, and 50% of DUI values missing
Created 3 data sets with MCAR from the study data set, with 6%, 25%, and
50% of DUI values missing respectively.
E = estimate, SE = standard error, P = p-value.
            MCAR, 6% missing on DUI     MCAR, 25% missing on DUI    MCAR, 50% missing on DUI
Parameter   E        SE      P          E        SE      P          E        SE      P
Intercept   -1.7361  0.1094  0.0001     -1.7336  0.1096  0.0001     -1.7377  0.1119  0.0001
DUI          0.9447  0.2429  0.0001      0.8544  0.2664  0.0016      0.8457  0.2973  0.0065
Speed        0.4812  0.1446  0.0009      0.5018  0.1449  0.0005      0.4831  0.1460  0.0009
Fixed        0.5213  0.1584  0.0010      0.4927  0.1610  0.0022      0.5200  0.1617  0.0013
Head-on      0.5245  0.2489  0.0351      0.5133  0.2485  0.0388      0.4936  0.2508  0.0490
Created 3 data sets with MAR from the study data set, with 6%, 25%, and
50% of DUI values missing respectively.
E = estimate, SE = standard error, P = p-value.
            MAR, 6% missing on DUI      MAR, 25% missing on DUI     MAR, 50% missing on DUI
Parameter   E        SE      P          E        SE      P          E        SE      P
Intercept   -1.7382  0.1095  0.0001     -1.7259  0.1092  0.0001     -1.7502  0.1109  0.0001
DUI          0.9191  0.2334  0.0001      0.8286  0.2623  0.0018      1.2722  0.3298  0.0002
Speed        0.4836  0.1449  0.0008      0.4843  0.1448  0.0008      0.5063  0.1473  0.0006
Fixed        0.5076  0.1590  0.0014      0.5079  0.1597  0.0015      0.5234  0.1597  0.0010
Head-on      0.5174  0.2486  0.0374      0.5133  0.2486  0.0389      0.5371  0.2487  0.0308
Sensitivity analysis on Missing Data Rate
1 Missing Data Mechanism (MCAR or MAR): Same
2 Missing Data Rate: Different
3 Imputation Model: Same
4 Proc MI Options: Same
[Figure: Estimates for Parameters with Different Missing Data Rates. Y axis: Estimate (0 to 1.4); X axis: DUI, Speed, Fixed, Head-on; series: GoldStd, MAR6%, MAR25%, MAR50%, MCAR6%, MCAR25%, MCAR50%.]
What is the result?
Conclusions of SA on Missing Data Rate
• For both missing data mechanisms, the 50% missing case produced the DUI
parameter estimate farthest from the Gold Standard estimate, as well as the
widest 95% CI. However, for MCAR the difference from the Gold Standard
estimate was -7%, whereas for MAR it was 42%. In addition, the 95% CI for
50% MCAR was 19% wider than the Gold Standard 95% CI, whereas for 50% MAR
it was 106% wider.
• This shows that the simplest imputation model is not sufficient to handle
very high missing data rates.
[Figure: Point Estimate and 95% CI for DUI with Different Missing Data Rates. Y axis: Odds Ratio (0 to 8); categories: GoldStd, MCAR6%, MCAR25%, MCAR50%, MAR6%, MAR25%, MAR50%.]
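The percentages quoted for the 50% missing cases can be approximated from the tabulated estimates and standard errors. The sketch below is in Python and assumes the comparisons were made on the odds-ratio scale with normal-approximation intervals (z = 1.96); the slides do not state either point, and the helper names are hypothetical. Under those assumptions the results come out close to the reported -7%, 42%, and 19%, and near 105% for the MAR interval width (reported as 106%), so small discrepancies from rounding or from the interval method should be expected.

```python
import math

Z_95 = 1.96  # assumption: normal-approximation 95% interval


def or_point_and_width(estimate, se, z=Z_95):
    """Return (odds ratio, width of its 95% CI) from a log-odds estimate and SE."""
    lower = math.exp(estimate - z * se)
    upper = math.exp(estimate + z * se)
    return math.exp(estimate), upper - lower


def percent_change(value, reference):
    """Percent difference of value relative to reference."""
    return 100.0 * (value - reference) / reference


# DUI rows from the Gold Standard table and the 50% missing columns.
gold_or, gold_width = or_point_and_width(0.9189, 0.2364)
cases = {
    "MCAR 50%": (0.8457, 0.2973),
    "MAR 50%": (1.2722, 0.3298),
}

for label, (estimate, se) in cases.items():
    point, width = or_point_and_width(estimate, se)
    print(
        f"{label}: OR differs by {percent_change(point, gold_or):+.0f}%, "
        f"CI width differs by {percent_change(width, gold_width):+.0f}%"
    )
```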
SA: 3 Imputation Model
[Diagram: the MI process with factor 3, Imputation Model, highlighted. Models compared: Model1, Model2, Model3, Model4.]
SA: 3 Imputation Model
• Data set with MAR and 50% of DUI values missing
• Tests on the following 4 imputation models
– Model1: K/H DUI Speed Fixed Head-on
Model1 uses the same variables as the analysis model; it is the simplest imputation model
– Model2: Model1 + age_group + colltime (Categorical)
– Model3: Model1 + age_group + hour (Continuous)
– Model4: Model1 + age_group + hour_normal (Continuous)
We add age and collision time to help predict DUI in
Model2, Model3, and Model4
Used 4 different imputation models to do MI on the same data set with
MAR, 50% missing on DUI.
E = estimate, SE = standard error, P = p-value.
            Model 2, 50% missing on DUI Model 3, 50% missing on DUI Model 4, 50% missing on DUI
Parameter   E        SE      P          E        SE      P          E        SE      P
Intercept   -1.8110  0.1222  0.0001     -1.8081  0.1235  0.0001     -1.8034  0.1238  0.0001
DUI          1.0127  0.2948  0.0016      0.9814  0.2966  0.0024      0.9563  0.2813  0.0015
Speed        0.5079  0.1466  0.0005      0.5021  0.1463  0.0006      0.5081  0.1469  0.0005
Fixed        0.5370  0.1604  0.0008      0.5404  0.1601  0.0007      0.5371  0.1598  0.0008
Head-on      0.5554  0.2537  0.0286      0.5477  0.2552  0.0320      0.5561  0.2521  0.0274
Sensitivity analysis on Imputation Model
1 Missing Data Mechanism (MAR): Same
2 Missing Data Rate (50%): Same
3 Imputation Models: Different
4 Proc MI Options: Same
[Figure: Estimates for Parameters with Different Imputation Models. Y axis: Estimates (0.2 to 1.5); X axis: DUI, Speed, Fixed, Head-on; series: GoldStd, NoMI, Model1, Model2, Model3, Model4.]
What is the result?
Conclusions of SA on Imputation Models
• Models 2, 3, and 4 are all improvements over Model 1, and produced DUI
parameter estimates and 95% CI widths close to those of the Gold Standard.
• So even with 50% missing values (MAR), we are able to get a good result by
using a richer imputation model.
• The higher the percentage of missing values (MAR) in your data set, the
more you need additional predictors in the imputation model.
[Figure: Point Estimate and 95% CI for DUI with Different Imputation Models. Y axis: Odds Ratio (0 to 9); X axis: NoMI, Model1, Model2, Model3, Model4, GoldStd.]
Comparison of No MI and Model 4 to the Gold Standard
[Figure: Estimates for Parameters (data set with 50% MAR on DUI). Y axis: Estimates (0 to 1.6); X axis: DUI, Speed, Fixed, Head-on; series: GoldStd, NoMI, Model4.]
[Figure, four panels: Point Estimate and 95% CI for DUI, for Speed, for Fixed, and for Head-on. Y axis: Odds Ratio; each panel compares No MI, G.S. (Gold Standard), and MI.]
SA: 4 Proc MI: Number of Imputations
[Diagram: the MI process with factor 4, Proc MI: number of imputations, highlighted. Values compared: N=0, N=2, N=5, N=10, N=20.]
SA: 4 Proc MI: Number of Imputations
• Data set with MAR and 50% of DUI values missing; Model4 used to do MI
• Tests on different numbers of imputations
– N=0 (No MI, complete cases only)
– N=2
– N=5
– N=10
– N=20
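One common way to reason about the number of imputations is Rubin's relative efficiency, RE = 1 / (1 + γ/m), where γ is the fraction of missing information and m is the number of imputations. The Python sketch below is illustrative only: the slides do not report γ, so the value 0.5 is an assumption, and it is not the same quantity as the 50% missing rate on DUI.

```python
def relative_efficiency(fraction_missing_info, m):
    """Rubin's relative efficiency of m imputations versus infinitely many."""
    if m < 1:
        raise ValueError("m must be at least 1")
    return 1.0 / (1.0 + fraction_missing_info / m)


GAMMA = 0.5  # assumed fraction of missing information (not given in the slides)

for m in (2, 5, 10, 20):
    print(f"m = {m:2d}: relative efficiency = {relative_efficiency(GAMMA, m):.3f}")
```

Under this assumption the efficiency is already above 0.9 at m = 5, which is in line with the finding that a small number of imputations can be adequate.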
Used the same imputation model (Model4), but different numbers of imputations,
to do MI on the same data set with MAR, 50% missing on DUI.
E = estimate, SE = standard error, P = p-value.
            N=5, 50% missing on DUI     N=10, 50% missing on DUI    N=20, 50% missing on DUI
Parameter   E        SE      P          E        SE      P          E        SE      P
Intercept   -1.7975  0.1177  0.0001     -1.8034  0.1238  0.0001     -1.7898  0.1204  0.0001
DUI          0.8658  0.2537  0.0023      0.9563  0.2813  0.0015      0.9942  0.3176  0.0026
Speed        0.4971  0.1457  0.0006      0.5081  0.1469  0.0005      0.5016  0.1465  0.0006
Fixed        0.5448  0.1610  0.0007      0.5371  0.1598  0.0008      0.5286  0.1599  0.0010
Head-on      0.5652  0.2522  0.0251      0.5561  0.2521  0.0274      0.5506  0.2509  0.0282
Sensitivity analysis on Number of Imputations
1 Missing Data Mechanism (MAR): Same
2 Missing Data Rate (50%): Same
3 Imputation Model: Same
4 Number of Imputations: Different
[Figure: Estimates for Parameters with Different Number of Imputations. Y axis: Estimates (0 to 1.6); X axis: DUI, Speed, Fixed, Head-on; series: GoldStd, NoMI, MI N2, MI N5, MI N10, MI N20.]
What is the result?
Conclusions of SA on Number of Imputations
• In our example, n = 5 to 10 imputations are enough to get good results for
a data set with 50% MAR on DUI.
• With No MI (complete cases only), we would conclude that motorcyclists
with DUI had 4.2 (2.1, 8.4) times the odds of being killed or hospitalized
compared with motorcyclists without DUI. But from the Gold Standard, the OR
is 2.51 (1.58, 3.98).
[Figure: Point Estimate and 95% CI for DUI with Different Imputation Numbers. Y axis: Odds Ratio (0 to 9); X axis: n=0, n=2, n=5, GoldStd, n=10, n=20.]
Summary: Answers?
• May I use MI to deal with missing data problems for
my data sets?
It seems a good idea to try MI. How well it works depends on the missing
data mechanisms of the variables with missing values in your data sets
(however, even our results with MI for NMAR were better than No MI).
• How can I believe that MI will give me better
analysis results?
We found that using MI on our example gave us much better analysis
results than No MI (complete cases only).
• How can I get better analysis results by using MI?
Understand the relationships between variables in your data sets;
Know the missing data mechanisms of the variables;
Determine the percentage of missing information;
Build a reasonable imputation model;
Use Proc MI options wisely.
Poll Results: Missing Data Problems Everywhere
Q1. I like Denver.
Q2. I like TRF.
Q3. I liked the talk.
Q4. I will use MI.
[Table: illustrative poll responses under the columns Like Denver, Like TRF, Liked the Talk, and Use MI. Entries are Y, N, or Missing, with reasons for missing answers such as "left session early", "too nice to say NO", "not sure yet", "daydreaming", and "fell asleep".]
Acknowledgment
Special thanks to Dr. Mike McGlincy,
who gave us helpful suggestions during
our study of sensitivity analyses on
imputed values and insightful comments
on the analysis results.
Thank You
Questions?
Can We Improve Analysis Results for NMAR
by Using a More Complex Imputation Model?
No MI = Complete cases only
Model1 = K/H + DUI + Speed + Fixed + Head-on
Model4 = Model1 + age + hour
Model5 = Model1 + age + hour + gender + safety
[Figure: Estimates for Parameters on 25% NMAR with Different Models. Y axis: Estimates (0.2 to 1); X axis: DUI, Speed, Fixed, Head-on; series: GoldStd, NoMI, Model1, Model4, Model5.]
Multiple Imputation inference involves
three distinct phases:
1. The missing data are filled in m times to generate m
complete data sets
(using imputation model)
2. The m complete data sets are analyzed by using standard
procedures
(using analysis model)
3. The results from the m complete data sets are combined
for the inference
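The third phase, combining the m sets of results, is usually done with Rubin's rules. The sketch below is a minimal Python illustration of those rules for a single parameter; it is not the Proc MIANALYZE implementation, the function name `pool_rubin` is hypothetical, and the per-imputation numbers are made up for the example.

```python
import math


def pool_rubin(estimates, standard_errors):
    """Combine per-imputation estimates and SEs for one parameter using Rubin's rules.

    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    if m < 2 or m != len(standard_errors):
        raise ValueError("need matching lists from at least 2 imputations")
    pooled = sum(estimates) / m
    # Within-imputation variance: average of the squared standard errors.
    within = sum(se ** 2 for se in standard_errors) / m
    # Between-imputation variance: sample variance of the m estimates.
    between = sum((q - pooled) ** 2 for q in estimates) / (m - 1)
    # Total variance adds a finite-m correction to the between component.
    total = within + (1.0 + 1.0 / m) * between
    return pooled, math.sqrt(total)


# Hypothetical DUI estimates and SEs from m = 3 imputed data sets.
example_estimates = [0.8, 1.0, 1.2]
example_ses = [0.3, 0.3, 0.3]
estimate, se = pool_rubin(example_estimates, example_ses)
print(f"pooled estimate = {estimate:.4f}, pooled SE = {se:.4f}")
```

The pooled SE is larger than any single-imputation SE here because the between-imputation spread reflects the extra uncertainty due to the missing values.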
Statistical Assumptions for
Multiple Imputation
1. The MI procedure assumes that the data are from a continuous
multivariate distribution. It also assumes that the data are from a
multivariate normal distribution when the MCMC method is used.
According to Schafer’s MI FAQ page, MI tends to be quite
forgiving of departures from the normality assumption. For example, when
working with binary or ordered categorical variables, it is often
acceptable to impute under a normality assumption and then round
off the continuous imputed values to the nearest category.
Variables whose distributions are heavily skewed may be
transformed to approximate normality and then transformed back
to their original scale after imputation.
2. Proc MI and Proc MIANALYZE assume that the missing data are
Missing At Random (MAR).
MCAR is unlikely for real-world crash data sets.
NMAR may be shifted to MAR by using a richer imputation model to
help predict missing values, because crash data sets include many
related variables that can help predict each other.
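The two workarounds mentioned under assumption 1 can be sketched in a few lines of Python. This is an illustration only: the helper names are hypothetical, the 0/1 coding of a binary variable such as DUI is an assumption, and a log transform is just one possible choice for a skewed, strictly positive variable.

```python
import math


def round_to_category(imputed_value, categories=(0, 1)):
    """Map a continuous imputed value to the nearest allowed category.

    Ties go to the category listed first.
    """
    return min(categories, key=lambda c: abs(c - imputed_value))


def to_log_scale(x):
    """Transform a skewed, strictly positive variable before imputation."""
    return math.log(x)


def from_log_scale(y):
    """Transform an imputed value back to the original scale."""
    return math.exp(y)


# Continuous values that a normal-model imputation might return for a 0/1 variable.
for value in (-0.2, 0.31, 0.73, 1.4):
    print(value, "->", round_to_category(value))

# Round trip for a positive variable imputed on the log scale.
print(from_log_scale(to_log_scale(55.0)))
```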