									What are we really reporting?
                     Michael Regier BAR, BSc, MSc
                      Department of Statistics, UBC
Outline
•   Motivation
•   Implications of current methods
•   Mechanics of the study
•   Results
•   Conclusions




Motivation
Research context:
Observational Studies
• Using large (covariates and observations) health
  information and administrative databases

• Efficient to obtain, but can be very complex and
  costly to understand, which may offset the initial cost
  savings

• Need an efficient way to sift through the information


Current statistical methodology
• Define the outcome
   – Typically a Bernoulli outcome (Yes/No, 0/1)
   – E.g. location of death (in/out of hospital)

• Data reduction (covariate screening)
   – Univariate test (t-test, Chi-squared test of homogeneity) is used to
     determine which covariates are associated with the outcome.

• Initial fit of the model
   – Covariates that have a p-value < 0.05 in the data screening step are
     included in the model

• Parsimonious model
   – Backwards (stepwise) elimination based on a chi-square
     approximation to the deviance with predetermined p-values for
     inclusion/exclusion
Concerns with current practice
• Disregarding the underlying covariate joint probability
  distribution
   – Screening disregards the joint probability model and treats the
     covariates as independent marginal probability distributions

• Imposition of highly restrictive assumptions on the unknown
  data generating mechanism (model specification)
   –   linear systematic component ($\eta = \mathbf{x}^T \boldsymbol{\beta}$)
   –   logit link function ($\eta = \log(p/(1-p))$)
   –   the random component is a Bernoulli distribution
   –   main effects assumption

• The chi-square approximation to the deviance is poor when
  using Bernoulli outcomes
   – Binomial outcomes are generally better, but this depends on the
     structure of the data
Concerns continued ...
• No distinction between adjustment and prediction covariates
   – Assuming the equivalence between significance and adjustment is
     tenuous.
   – Assumption has no known supporting literature.

• Model is incorrectly defined
   – Multivariable regression models are defined by each observation
     having a single outcome and many covariates.
   – Multivariate regression models are defined by each observation
     having a vector of outcomes that will be modelled simultaneously.
     These models typically have many covariates.

• No known evidence-based or theoretical literature supporting
  covariate screening

Implications of current methods
• Covariate screening is a questionable practice when the underlying data
  mechanism is unknown.

• The GLM regression coefficients are biased when screening is used.

• Hypotheses generated using covariate screening have little if any
  empirical evidence.

• Covariate screening is frequently used to supplant subject area
  knowledge about the problem.

• Congruence with published research does not validate findings, since
  published research itself employs covariate screening, often
  with a main-effects-only model.


Mechanics of the study
Study Design
• Monte Carlo simulation (M=10,000; n=50, 100, 500)
• Two covariates were sampled from a multivariate normal
  distribution
• A Bernoulli response was constructed using a fully specified
  systematic component and logit link function
• Regression models were fit, using identical data sets, for the
  covariate screening method and the non-covariate screening
  method
• Parsimonious models were found using BIC selection
  (Bayesian Information Criterion)
• Bias, variance and odds ratio were obtained for covariate
  screened and non-covariate screened estimators
Data
• Covariates were sampled from a multivariate
  normal distribution

  $$\boldsymbol{\mu} = (0, 2)^T, \qquad \Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$$

• n=50, 100, 500
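The covariate draw described on this slide can be sketched in a few lines. This is only an illustration, assuming the mean vector (0, 2) and identity covariance shown above; the function name `sample_covariates` and the seed are hypothetical. Because the covariance is the identity, the two components are independent unit-variance normals and can be drawn separately.

```python
import random

def sample_covariates(n, mean=(0.0, 2.0), seed=None):
    """Draw n covariate pairs (x1, x2) from N(mean, I).

    With an identity covariance matrix the components are independent,
    so each one is drawn from its own univariate normal.
    """
    rng = random.Random(seed)
    return [(rng.gauss(mean[0], 1.0), rng.gauss(mean[1], 1.0)) for _ in range(n)]

# The three sample sizes used in the study design.
for n in (50, 100, 500):
    xs = sample_covariates(n, seed=1)
    mean_x1 = sum(x1 for x1, _ in xs) / n
    mean_x2 = sum(x2 for _, x2 in xs) / n
    print(n, round(mean_x1, 2), round(mean_x2, 2))
```

A correlated design would need a Cholesky factor of the covariance matrix; that is not required here.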
Model
• The logit link with a linear systematic component
  was used to construct the Bernoulli outcome.

  $$\log\left(\frac{p_i}{1 - p_i}\right) = \mathbf{x}_i^T \boldsymbol{\beta} = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_{12} x_{i1} x_{i2}$$

• The ith patient has a probability of success, $p_i$.
• The outcome was simulated using Bernoulli($p_i$)
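As a rough sketch of how the outcome could be constructed: invert the logit link to obtain each success probability, then draw a Bernoulli variate. The function names are hypothetical, and the coefficient tuple in the usage lines is only an example of a model with main effects and an interaction.

```python
import math
import random

def success_probability(x1, x2, beta):
    """Inverse logit of the linear predictor with an interaction term."""
    b0, b1, b2, b12 = beta
    eta = b0 + b1 * x1 + b2 * x2 + b12 * x1 * x2
    return 1.0 / (1.0 + math.exp(-eta))

def simulate_outcomes(covariates, beta, seed=None):
    """Draw one Bernoulli(p_i) outcome for each (x1, x2) pair."""
    rng = random.Random(seed)
    return [1 if rng.random() < success_probability(x1, x2, beta) else 0
            for x1, x2 in covariates]

# Example: intercept -1, main effects 0.25 and 0.5, interaction 0.75.
beta = (-1.0, 0.25, 0.5, 0.75)
covariates = [(0.0, 2.0), (1.0, 1.5), (-0.5, 2.5)]
print(simulate_outcomes(covariates, beta, seed=1))
```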
Experimental design
•   x1 is retained in all models – adjustor covariate
•   x2 is the predictor covariate
•   Models 1 and 2 are commonly used main effects models
•   Models 3 and 4 are of primary interest

                  Coefficients
Model    $\beta_0$    $\beta_1$    $\beta_2$    $\beta_{12}$
1        -1           0.25         0.5          0
2        -1           0.25         0.005        0
3        -1           0.25         0.5          0.75
4        -1           0.25         0.005        0.75
Covariate screening
• x2 was screened using $\alpha = 0.05$
• The use of screening has conceptual
  implications for the treatment of data
  – Covariates with p-value > 0.05 are treated as if
    they were not collected.
  – This was integrated into the modelling procedure.




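A minimal sketch of a univariate screening step of this kind, assuming a two-sample comparison of the covariate between the two outcome groups. To stay within the standard library it uses a Welch-type statistic with a normal approximation to the reference distribution, not the exact t-test of the study; the function name `screen_covariate` is hypothetical.

```python
import math
from statistics import NormalDist, mean, variance

def screen_covariate(x, y, alpha=0.05):
    """Return (keep, p_value) for a two-sample comparison of x by outcome y.

    Assumption: large-sample normal approximation to the Welch statistic.
    A covariate with p_value >= alpha is dropped, i.e. treated as if it
    had never been collected.
    """
    g1 = [xi for xi, yi in zip(x, y) if yi == 1]
    g0 = [xi for xi, yi in zip(x, y) if yi == 0]
    if len(g1) < 2 or len(g0) < 2:
        raise ValueError("each outcome group needs at least two observations")
    se = math.sqrt(variance(g1) / len(g1) + variance(g0) / len(g0))
    z = (mean(g1) - mean(g0)) / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return p_value < alpha, p_value

# Toy data: the covariate is clearly shifted in the y = 1 group.
x = [0.0, 0.1, -0.1, 0.05, 5.0, 5.1, 4.9, 5.05]
y = [0, 0, 0, 0, 1, 1, 1, 1]
print(screen_covariate(x, y))
```

The decision depends only on the marginal association between x2 and the outcome, which is why a covariate that matters mainly through an interaction can be screened out.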
Model selection
• BIC (Bayesian Information Criterion) was chosen for model
  selection
• BIC minimizes

  $$-2\,\ell(\hat{\boldsymbol{\beta}} \mid \mathbf{y}, \mathbf{x}) + |\boldsymbol{\beta}| \log n$$

   – where $\boldsymbol{\beta}$ is the vector of coefficients in the fitted model M,
     $\ell(\hat{\boldsymbol{\beta}} \mid \mathbf{y}, \mathbf{x})$ is the log likelihood evaluated at the
     maximum likelihood estimator, $|\boldsymbol{\beta}|$ is the number of parameters in the
     fitted model and n is the sample size.
• BIC tends to select smaller models
• BIC bases model selection on the minimization of a function
  over a localized search on the set of all possible models
  rather than a poor distributional approximation.
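The criterion is simple to compute once each candidate model has been fit. The sketch below assumes the maximized log-likelihood and parameter count of each candidate are already available; the helper names and the numbers in the usage lines are invented purely for illustration.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """BIC = -2 * maximized log-likelihood + (number of parameters) * log(n)."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)

def select_by_bic(candidates, n_obs):
    """Return the name of the candidate model with the smallest BIC.

    candidates maps a model name to (maximized log-likelihood, parameter count).
    """
    return min(candidates, key=lambda name: bic(*candidates[name], n_obs))

# Hypothetical fits: the interaction barely improves the likelihood,
# so the log(n) penalty favours the smaller model.
candidates = {"main effects": (-60.0, 3), "with interaction": (-59.5, 4)}
print(select_by_bic(candidates, n_obs=100))
```

Each extra parameter must raise the log-likelihood by more than log(n)/2 to be selected, which is why BIC tends toward smaller models as n grows.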
Measuring the bias
• Bias
  – The bias of an estimator is the difference
    between the expected value of the estimator and
    the value of the parameter it is estimating.

  $$\mathrm{Bias}(\hat{\beta}_i) = E(\hat{\beta}_i) - \beta_i$$

• Monte Carlo estimation of the expectation

  $$\hat{E}(\hat{\beta}_i) = \frac{1}{M} \sum_{m=1}^{M} \hat{\beta}_{i,m}$$
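In code, the Monte Carlo bias estimate is the average of the M replicate estimates minus the true coefficient. The sketch below also reports the Monte Carlo standard error of that average, which is an addition not shown on the slide; the function names and the five replicate values are hypothetical.

```python
import math
from statistics import mean, stdev

def monte_carlo_bias(estimates, true_value):
    """Average of the replicate estimates minus the true parameter value."""
    return mean(estimates) - true_value

def monte_carlo_standard_error(estimates):
    """Standard error of the Monte Carlo average (assumes independent replicates)."""
    return stdev(estimates) / math.sqrt(len(estimates))

# Hypothetical replicate estimates of a coefficient whose true value is 0.5.
replicates = [0.42, 0.55, 0.61, 0.48, 0.53]
print(monte_carlo_bias(replicates, 0.5), monte_carlo_standard_error(replicates))
```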
Odds ratio with an interaction term
•   The odds ratio is a function of x1 and x2, and can be written as

    $$\begin{aligned}
    \widehat{OR}(x_1, x_2) &= \exp(\hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} + \hat{\beta}_{12} x_{i1} x_{i2}) \\
    &= \exp(\hat{\beta}_0 + \hat{\beta}_1 x_{i1} + (\hat{\beta}_2 + \hat{\beta}_{12} x_{i1}) x_{i2}) \\
    &= \exp(\hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \tilde{\beta}_{2|x_1} x_{i2})
    \end{aligned}$$

•   The odds ratio is now a function of x2 given x1, thus the estimated adjusted odds
    ratio for a one unit increase in x2 is

    $$\widehat{OR}_{x_2} = \exp(\tilde{\beta}_{2|x_1})$$

     – notice that it is a function of x1, thus the odds ratio is a curve, not a point
•   The variance on the log-odds scale is

    $$\hat{V}(\tilde{\beta}_{2|x_1}) = \hat{V}(\hat{\beta}_2) + x_1^2\, \hat{V}(\hat{\beta}_{12}) + 2 x_1\, \widehat{\mathrm{Cov}}(\hat{\beta}_2, \hat{\beta}_{12})$$

     – Notice that the variance is also a function of x1
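The formulas above translate directly into code. The sketch below evaluates the conditional odds ratio for x2 and its log-scale variance at a given x1. The Wald-type interval is an assumption added for illustration, and every function name and numeric input in the usage lines is hypothetical.

```python
import math

def conditional_log_or(b2, b12, x1):
    """Log odds ratio for a one unit increase in x2, given x1."""
    return b2 + b12 * x1

def conditional_or(b2, b12, x1):
    """Odds ratio for a one unit increase in x2, given x1."""
    return math.exp(conditional_log_or(b2, b12, x1))

def conditional_log_or_variance(var_b2, var_b12, cov_b2_b12, x1):
    """Variance of the conditional log odds ratio at x1."""
    return var_b2 + x1 ** 2 * var_b12 + 2.0 * x1 * cov_b2_b12

def wald_interval(b2, b12, x1, var_b2, var_b12, cov_b2_b12, z=1.96):
    """Approximate interval for the odds ratio (assumed Wald construction)."""
    centre = conditional_log_or(b2, b12, x1)
    half = z * math.sqrt(conditional_log_or_variance(var_b2, var_b12, cov_b2_b12, x1))
    return math.exp(centre - half), math.exp(centre + half)

# Trace the odds ratio curve over a few values of x1 (illustrative inputs).
for x1 in (-1.0, 0.0, 1.0):
    print(x1, conditional_or(0.5, 0.75, x1),
          wald_interval(0.5, 0.75, x1, 0.04, 0.09, -0.01))
```

Evaluating the function over a grid of x1 values gives the odds ratio curve that the slide describes, in place of a single reported number.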
Results and conclusions
MC error and theoretical error
[Figure panels: Model 3, n=50; Model 4, n=50]

• One way to check the quality of the Monte Carlo simulation is to verify the
  results against known large sample theoretical results.
• The congruence between the theoretical and the simulation based errors
  attests to the quality of the simulation
• The choice of error will have little impact on the confidence intervals 19
Non-significant t-tests
• Recall that all the models had x2 as a predictor variable. What changed
  over the models was
    – the magnitude of $\beta_2$
    – the inclusion of an interaction
• x2 should have been retained in all the models, which would show up as a
  small percentage of non-significant t-tests
• As the sample size increases the proportion of non-significant t-tests
  decreases
                                   Proportion of non-significant t-tests
Model/Sample size       n=50              n=100                n=500
1                       68.5%             44.1%                0.3%
2                       95.1%             94.9%                94.7%
3                       72.8%             50.3%                0.6%
4                       94.7%             94.3%                92.7%
Bias
• Both methods have biased estimators, but the
  bias, in general, is much larger for the
  screening methodology
• The non-screening methodology bias is
  negligible or similar to that of the screening
  method


See pdf for details




Odds ratio for model 1: n=100
[Figure: True OR and Estimated OR]
Odds ratio for model 2: n=100




[Figure: True OR and Estimated OR]
Odds ratio for model 3: n=100
[Figure: True OR and Estimated OR]
Odds ratio for model 4: n=100
[Figure: True OR and Estimated OR]
Conclusions
• Covariate screening is a questionable practice when the
  underlying data mechanism is unknown.
   – The bias can be very large for the screening methodology (Model 4).
   – Non-screening and screening methods produce similar results when
     screening bias is small (Models 1 and 2).
• Screening is an ad hoc practice which simplifies subsequent
  analysis but has no known empirical or theoretical support.
• Even when biased, the non-screening method tends to a
  functional form that is similar in shape to the true functional
  form (Models 3 and 4).
   – Non-screening odds ratio is reasonable over a small domain
• Subject area knowledge and expertise cannot be replaced
  by univariate screening procedures and model fitting
  algorithms.
Thank you

								