					Gene-Environment Case-Control

         Raymond J. Carroll
      Department of Statistics
 Faculty of Nutrition and Toxicology
        Texas A&M University

  Image of rat colon displaying
  apoptosis (green) and cell
  differentiation (red)

• Problem: Does Environment/Treatment Affect
  Disease Prediction via Gene Expression?
• Two Thought Experiments
• Case-control studies: Background
• Gene-Environment independence
• Profile likelihood approach
• Efficiency gains (Large ones!)
• Limitations
• Conclusions

• This work is joint with Nilanjan Chatterjee,
  National Cancer Institute
               Experiment #1

• Two strata: corn oil and fish oil fed rats
• All animals exposed to a carcinogen (200 per stratum)
• Within each stratum, 50% randomized to radiation
• At initial stage, fecal (not mucosal) material collected
• At proliferation stage (8 weeks), animals
  sacrificed and assayed for aberrant crypt foci
• ACF are precursors to colon cancer
                Experiment #1

• The finger-like structures are colonic crypts
• They house the stem, proliferating and
  differentiating cells
               Experiment #1

• The dark spots are ACF: aberrant crypt foci
                Experiment #1

• We want to know whether microarray on fecal
  material is predictive of ACF status
• We also want to know whether gene expression
  differs in predictive ability for the environment
• With 400 animals, cannot do microarray on all
• We will construct an index of ACF, and sample
  cases (those with high # of ACF) and controls
• Microarray done on these animals
                Experiment #2

• Idea: patients with tumors have their tumor
  tissues stored
• They are then randomized to treatments
• Question 1: does gene expression predict
  recurrence (say)?
• Question 2: is predictive ability different for the
  different treatments?
• Again, for cost purposes, only some cases and
  controls will have gene expressions
        Basic Problem Formalized

• Case control sample: D = disease
• Gene expression: G
• Environment: X
• Strata: S
• We are interested in main effects for G and (X,S),
  and especially in their interaction
• In both experiments, G and X are independent
  in the population by design, given S
               Prospective Models

• Simplest logistic model

   $\mathrm{pr}(D=1 \mid G,X) = H(\beta_0 + \beta_1 G + \beta_2 X + \beta_3 G X)$

• General logistic model

       $\mathrm{pr}(D=1 \mid G,X) = H\{\beta_0 + m(G,X,\beta_1)\}$

• The function m(G,X,β1) is completely general
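
The two prospective models can be sketched in a few lines. This is a minimal illustration, assuming H is the logistic distribution function (consistent with the Logistic(·) notation used in the Second Simulation slide); the names `m_linear` and `prospective_prob` and the numeric values are hypothetical, not from the talk.

```python
import math


def H(t: float) -> float:
    """Logistic distribution function H(t) = 1 / (1 + exp(-t)) (assumed form of H)."""
    return 1.0 / (1.0 + math.exp(-t))


def m_linear(g: float, x: float, beta1) -> float:
    """Simplest choice of m: main effects of G and X plus their interaction."""
    b_g, b_x, b_gx = beta1
    return b_g * g + b_x * x + b_gx * g * x


def prospective_prob(g: float, x: float, beta0: float, beta1, m=m_linear) -> float:
    """pr(D=1 | G=g, X=x) = H{beta0 + m(g, x, beta1)}; any function m can be passed in."""
    return H(beta0 + m(g, x, beta1))


if __name__ == "__main__":
    # Hypothetical parameter values, for illustration only.
    print(prospective_prob(g=1.0, x=2.0, beta0=-3.0, beta1=(0.26, 0.10, 0.30)))
```

Passing a different `m` (for example one with nonlinear terms in G) gives the general logistic model without changing the rest of the code.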
              Case-Control Data

• The data as we envision it are case-control data
• The % of cases (D=1) is often much higher in
  the case-control study than in the general population
• For example, one might get gene expression for
  all cases, but only some controls
• Many microarray studies use a plan like this
               Case-Control Data

• Case-control data are not a random sample
• We observe (G,X) given D, i.e., we observe the
  covariates given the response, not vice-versa
• If we had a random sample, linear logistic
  regression would be used to fit the model
  • Essentially, Fisher LDA with no interactions
• Obvious idea: ignore the sampling plan and
  pretend you have a random sample
              Case-Control Data

        pr(D=1|G,X)=H{β 0 +m(G,X,β1 )}

• Cool Fact: all parameters except the intercept
  can be estimated consistently while ignoring
  the sampling plan
• The intercept is determined by pr(D=1) in the
  population, hence not identified from these data
              Case-Control Data

• Standard errors are also asymptotically correct
• Well known fact for linear logistic (Prentice and
  Pyke, 1979), not so well known for general
  nonlinear models
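
One way to see why only the intercept is affected: if cases and controls are selected with probabilities s1 and s0 that do not depend on (G,X), Bayes' rule gives the same logistic model among the selected subjects, with the intercept shifted by log(s1/s0). The sketch below checks this exactly at a few covariate values. The selection probabilities, parameter values and function names are hypothetical, and H is assumed to be the logistic function.

```python
import math


def H(t):
    """Logistic distribution function (assumed form of H)."""
    return 1.0 / (1.0 + math.exp(-t))


def logit(p):
    return math.log(p / (1.0 - p))


def prob_given_selected(p, s1, s0):
    """pr(D=1 | covariates, selected) by Bayes' rule, when cases are kept
    with probability s1 and controls with probability s0."""
    return s1 * p / (s1 * p + s0 * (1.0 - p))


def intercept_shifts(beta0, beta1, s1, s0, points):
    """Logit of the sampled-data model minus logit of the population model,
    evaluated at each (g, x) in points."""
    b_g, b_x, b_gx = beta1
    shifts = []
    for g, x in points:
        p = H(beta0 + b_g * g + b_x * x + b_gx * g * x)
        shifts.append(logit(prob_given_selected(p, s1, s0)) - logit(p))
    return shifts


if __name__ == "__main__":
    pts = [(g, x) for g in (0, 1) for x in (0.0, 1.0, 2.5)]
    # Hypothetical plan: keep all cases and 5% of controls.
    print(intercept_shifts(-3.0, (0.26, 0.10, 0.30), s1=1.0, s0=0.05, points=pts))
```

Every entry of the returned list is the same constant, so the coefficients in m are untouched by the sampling plan and only the intercept absorbs it.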
   Environment and Gene Expression

• In my two examples, gene expression (G) and
  environment (X) are independent by design.
• Can we exploit this to get more efficient estimates?
• Should be possible: this is akin to a missing data
  problem, with outcomes missing at random (MAR).
• We do this via a semiparametric profile likelihood
   Environment and Gene Expression

• Methodology: Start with the retrospective
  $$
  \mathrm{pr}(G=g, X=x \mid D=d) =
  \frac{\mathrm{pr}(X=x)\,\mathrm{pr}(G=g)\,\exp[d\{\beta_0 + m(g,x,\beta_1)\}]\,[1 - H\{\beta_0 + m(g,x,\beta_1)\}]}
       {\sum_{x',g'} \mathrm{pr}(X=x')\,\mathrm{pr}(G=g')\,\exp[d\{\beta_0 + m(g',x',\beta_1)\}]\,[1 - H\{\beta_0 + m(g',x',\beta_1)\}]}
  $$

• Note how independence of G and X is used here: the joint
  distribution of (G,X) factors as pr(X=x) pr(G=g)
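
For discrete G and X the retrospective probability can be computed directly. The sketch below is an illustration under stated assumptions: H is the logistic function, m is the simple main-effects-plus-interaction form, and the distributions passed in as `pg` and `px` are hypothetical. It uses the fact that exp(d·η){1 − H(η)} equals pr(D=d | g, x) when η = β0 + m(g,x,β1).

```python
import math
from itertools import product


def H(t):
    """Logistic distribution function (assumed form of H)."""
    return 1.0 / (1.0 + math.exp(-t))


def m_linear(g, x, beta1):
    b_g, b_x, b_gx = beta1
    return b_g * g + b_x * x + b_gx * g * x


def retrospective_pmf(d, pg, px, beta0, beta1):
    """Return {(g, x): pr(G=g, X=x | D=d)} assuming G and X are independent.

    pg maps g -> pr(G=g) and px maps x -> pr(X=x).
    """
    weights = {}
    for (g, p_g), (x, p_x) in product(pg.items(), px.items()):
        eta = beta0 + m_linear(g, x, beta1)
        # Independence enters through the product p_x * p_g.
        weights[(g, x)] = p_x * p_g * math.exp(d * eta) * (1.0 - H(eta))
    total = sum(weights.values())
    return {key: w / total for key, w in weights.items()}


if __name__ == "__main__":
    pg = {0: 0.8, 1: 0.2}                 # hypothetical distribution of G
    px = {0.0: 0.5, 1.0: 0.3, 2.0: 0.2}   # hypothetical distribution of X
    print(retrospective_pmf(1, pg, px, beta0=-3.0, beta1=(0.26, 0.10, 0.30)))
```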
   Environment and Gene Expression

• Treat the observed environment values (x1,…,xn) as distinct
  support points, with masses λi = pr(X=xi) as parameters
• Let G have pr(G=g) = f(g,θ)
• Construct the profile likelihood, having
  estimated the λ i as functions of data and
  other parameters
              Profile Likelihood

• My approach runs into what is often called the Neyman-Scott
  problem
• With a single gene expression, n samples, we
  have more than n parameters
• Treating so many quantities as parameters often does not
  produce workable estimators, for example:
  • Non-constant variances treated as n parameters
  • Latent variables in measurement error problems
    treated as parameters
                     Profile Likelihood

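• Result:
    $$
    \kappa = \beta_0 + \log(n_1/n_0) - \log\{\mathrm{pr}(D=1)/\mathrm{pr}(D=0)\};
    $$
    $$
    S(d,g,x,\Omega) = \frac{f(g,\theta)\,\exp[d\{\kappa + m(g,x,\beta_1)\}]}{1 + \exp\{\beta_0 + m(g,x,\beta_1)\}}
    $$

    Profile Likelihood = L(β0,β1,κ,θ) = L(Ω)
    $$
    L(\Omega) = \prod_{i=1}^{n} \frac{S(D_i, G_i, X_i, \Omega)}{\sum_{d} \int S(d, g, X_i, \Omega)\, d\mu(g)}
    $$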
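
A minimal sketch of this profile likelihood for the special case of a binary gene variable, so that the integral over g becomes a sum over {0, 1}. The Bernoulli form of f(g,θ), the linear form of m, the toy data and all function names are assumptions made for illustration; this is not the authors' implementation.

```python
import math


def m_linear(g, x, beta1):
    b_g, b_x, b_gx = beta1
    return b_g * g + b_x * x + b_gx * g * x


def S(d, g, x, beta0, beta1, kappa, theta):
    """S(d,g,x,Omega) = f(g,theta) exp[d{kappa + m}] / [1 + exp{beta0 + m}]."""
    f_g = theta if g == 1 else 1.0 - theta  # assumed Bernoulli model for G
    lin = m_linear(g, x, beta1)
    return f_g * math.exp(d * (kappa + lin)) / (1.0 + math.exp(beta0 + lin))


def cell_prob(d, g, x, beta0, beta1, kappa, theta):
    """One factor of the profile likelihood: S normalised over all (d, g) at fixed x."""
    denom = sum(S(dd, gg, x, beta0, beta1, kappa, theta)
                for dd in (0, 1) for gg in (0, 1))
    return S(d, g, x, beta0, beta1, kappa, theta) / denom


def profile_loglik(data, beta0, beta1, kappa, theta):
    """Log profile likelihood for an iterable of (d, g, x) triples."""
    return sum(math.log(cell_prob(d, g, x, beta0, beta1, kappa, theta))
               for d, g, x in data)


if __name__ == "__main__":
    toy = [(1, 1, 0.5), (0, 0, 1.2), (1, 0, 2.0)]  # hypothetical (d, g, x) data
    print(profile_loglik(toy, beta0=-3.0, beta1=(0.26, 0.10, 0.30),
                         kappa=0.0, theta=0.2))
```

In practice this function would be handed to a numerical optimiser over (β0, β1, κ, θ); that step is omitted here.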
              Profile Likelihood

• The form of the profiled likelihood makes it
  appear that κ and θ are identified, and hence so too
  are pr(D=1) and β0.

     $\kappa = \beta_0 + \log(n_1/n_0) - \log\{\mathrm{pr}(D=1)/\mathrm{pr}(D=0)\}$;
    Profile Likelihood = L(β0,β1,κ,θ) = L(Ω)

• This does not happen with regular case-control
  data, remember
               Profile Likelihood

• In light of the Neyman-Scott phenomenon, it
  would be a surprise if pr(D=1) and β0 are identified
• Happy days: sometimes surprises are happy
• Both are identified theoretically from case-control data
• So too is the distribution of gene expression
  • Even more interesting with alleles, haplotypes, etc.
   Environment and Gene Expression

• Summary of Assumptions:
  • G and X are independent (possibly after stratification)
  • Parametric form for the distribution of G
• Summary of Result:
  • Intercept and marginal pr(D=1) are identified
• Loss of robustness versus the usual analysis that
  assumes nothing
• Identification of pr(D=1) hardly seems worth
  the risk (but wait!)
            Alternative Derivation

• Consider a prospective study
• Let Δ=1 mean selection into the study
• Pretend
            pr(Δ=1|D=d,G,X) ∝ n_d/pr(D=d);
            n_d = # of observations with D = d

• Then compute     pr(D=d,G=g|Δ=1,X)

• This is exactly our profile pseudo-likelihood
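
The equivalence can be checked numerically for a binary gene variable. The sketch below computes pr(D=d, G=g | Δ=1, X=x) from the pretend selection probabilities and compares it with the normalised form f(g,θ) exp[d{κ + m}] / [1 + exp{β0 + m}], where κ = β0 + log(n1/n0) − log{pr(D=1)/pr(D=0)}. The Bernoulli model for G, the linear m, the parameter values and the function names are all assumptions for illustration.

```python
import math


def H(t):
    """Logistic distribution function (assumed form of H)."""
    return 1.0 / (1.0 + math.exp(-t))


def m_linear(g, x, beta1):
    b_g, b_x, b_gx = beta1
    return b_g * g + b_x * x + b_gx * g * x


def f_bernoulli(g, theta):
    return theta if g == 1 else 1.0 - theta


def selection_form(d, g, x, beta0, beta1, theta, pi1, n1, n0):
    """pr(D=d, G=g | Delta=1, X=x) with pr(Delta=1 | D=d) proportional to n_d / pr(D=d)."""
    def weight(dd, gg):
        p1 = H(beta0 + m_linear(gg, x, beta1))
        p_d = p1 if dd == 1 else 1.0 - p1
        sel = n1 / pi1 if dd == 1 else n0 / (1.0 - pi1)
        return sel * p_d * f_bernoulli(gg, theta)
    total = sum(weight(dd, gg) for dd in (0, 1) for gg in (0, 1))
    return weight(d, g) / total


def profile_form(d, g, x, beta0, beta1, theta, pi1, n1, n0):
    """The same probability written through kappa and S(d, g, x, Omega)."""
    kappa = beta0 + math.log(n1 / n0) - math.log(pi1 / (1.0 - pi1))

    def S(dd, gg):
        lin = m_linear(gg, x, beta1)
        return (f_bernoulli(gg, theta) * math.exp(dd * (kappa + lin))
                / (1.0 + math.exp(beta0 + lin)))
    total = sum(S(dd, gg) for dd in (0, 1) for gg in (0, 1))
    return S(d, g) / total


if __name__ == "__main__":
    pars = dict(beta0=-3.0, beta1=(0.26, 0.10, 0.30), theta=0.2,
                pi1=0.05, n1=500, n0=500)  # hypothetical values
    print(selection_form(1, 1, 1.5, **pars), profile_form(1, 1, 1.5, **pars))
```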
        Interesting Technical Point

• The profile pseudo-likelihood acts like a real likelihood
• Information asymptotics are (almost) exact
• Missing data handled seamlessly
• Measurement error/Misclassification in
  environment handled seamlessly
• Because it appears to be some sort of
  likelihood, hope for efficiency gains versus
  standard approach
                First Simulation

• Settings
  • 500 cases, 500 controls
  • Gene expression dichotomized into low and high,
    pr(high) = 0.05 and 0.2 (in the population)
  • X = min{10,Lognormal(0,1)}
  • Pr(D=1) = 0.05 (in the population)
• Standard multiplicative model:
  • Main effect parameter for G = 0.26
  • Main effect parameter for X = 0.10
  • Interaction parameter = 0.30
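
These settings can be turned into a data generator along the following lines. This is a sketch under assumptions, not the code used for the talk: the three parameters are read as logistic regression coefficients, the intercept is found by bisection so that the simulated population has pr(D=1) = 0.05, and the case-control sample is drawn from a large simulated population. The function name and the `pop_size` and `seed` arguments are hypothetical.

```python
import numpy as np


def simulate_case_control(n_cases=500, n_controls=500, p_high=0.05, prev=0.05,
                          beta=(0.26, 0.10, 0.30), pop_size=200_000, seed=0):
    """Draw a case-control sample from a simulated population.

    Returns (beta0, d, g, x) where beta0 is the calibrated intercept.
    """
    rng = np.random.default_rng(seed)
    g = rng.binomial(1, p_high, size=pop_size)                 # low/high expression
    x = np.minimum(10.0, rng.lognormal(0.0, 1.0, size=pop_size))
    lin = beta[0] * g + beta[1] * x + beta[2] * g * x

    # Bisection: choose beta0 so that the average disease probability equals prev.
    lo, hi = -20.0, 20.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if np.mean(1.0 / (1.0 + np.exp(-(mid + lin)))) > prev:
            hi = mid
        else:
            lo = mid
    beta0 = 0.5 * (lo + hi)

    # Disease status in the population, then retrospective sampling.
    d = rng.binomial(1, 1.0 / (1.0 + np.exp(-(beta0 + lin))))
    cases = rng.choice(np.flatnonzero(d == 1), size=n_cases, replace=False)
    controls = rng.choice(np.flatnonzero(d == 0), size=n_controls, replace=False)
    idx = np.concatenate([cases, controls])
    return beta0, d[idx], g[idx], x[idx]


if __name__ == "__main__":
    beta0, d, g, x = simulate_case_control()
    print(beta0, d.mean(), g.mean(), x.max())
```

Independence of G and X holds in the simulated population because the two are drawn separately; it need not hold among the sampled cases.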
               First Simulation

• MSE Efficiency of Profile method

  [Figure: MSE efficiency of the profile method for the G, X, and G times X parameters; legend: pr(G)=.05 and pr(G)=.20]
             Second Simulation

• Gene expression G = Normal(0,1)
• Environment X = Binary, pr(X=1) = 0.5
  • Randomized treatment assignment
• pr(D=1) = 0.05 in the population
• 500 cases and 500 controls

            pr(D=1|G,X=0)=Logistic(β0 + 0.3 G)
            pr(D=1|G,X=1)=Logistic(β0 +1.1 G)

            Treatment Relative Risk = exp(0.8G)
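
The stated treatment effect follows from subtracting the two linear predictors. A short derivation, assuming Logistic(·) denotes the logistic function H; strictly the ratio below is an odds ratio, which approximates the relative risk when disease is rare (here pr(D=1) = 0.05):

```latex
\begin{align*}
\frac{\mathrm{odds}(D=1 \mid G, X=1)}{\mathrm{odds}(D=1 \mid G, X=0)}
  &= \frac{\exp(\beta_0 + 1.1\,G)}{\exp(\beta_0 + 0.3\,G)} \\
  &= \exp\{(1.1 - 0.3)\,G\} = \exp(0.8\,G).
\end{align*}
```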
                Second Simulation

• MSE Efficiency of Profile method

  [Figure: MSE efficiency of the profile method for the G, X, and G*X parameters]
                 Second Simulation

pr(D=1) = 0.05
         The G-model Assumption

• We have to specify a model for G: f(g,θ)
• To achieve model robustness, we have used
  • Skew-normal distribution (Azzalini, 2002, JRSSB)
  • SNP-Family (Davidian & Zhang, 2002, Biometrics)
• Both have the Gaussian embedded as a special case
  • Skew-normal is unimodal
  • SNP can allow for some bimodality
  • Both allow for heavier tails
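
As an illustration of how the Gaussian sits inside the skew-normal family, here is a sketch of the skew-normal density in its usual location-scale-shape form, 2/ω · φ(z) · Φ(αz) with z = (g − ξ)/ω. The parameter names `xi`, `omega` and `alpha` are the conventional ones and are not taken from the slides; setting `alpha = 0` recovers the normal density.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def skew_normal_pdf(g, xi=0.0, omega=1.0, alpha=0.0):
    """Skew-normal density with location xi, scale omega > 0 and shape alpha."""
    z = (g - xi) / omega
    return 2.0 / omega * normal_pdf(z) * normal_cdf(alpha * z)


if __name__ == "__main__":
    # alpha = 0 gives the Gaussian; alpha > 0 skews the density to the right.
    for alpha in (0.0, 3.0):
        print(alpha, [round(skew_normal_pdf(g, alpha=alpha), 4) for g in (-1.0, 0.0, 1.0)])
```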
          The G-model Assumption

• We have to specify a model for G: f(g,θ)
• Our simulation experiments with Skew-Normal
  and SNP families show
   • Little loss of efficiency when G is Gaussian
   • Protection against bias when G is skew
• Both are straightforward to fit.
     The Independence Assumption

• Gains in efficiency come from assuming gene
  expression (G) and environment (X) are
  independent in the population
• I have given two potential cases where this is
  satisfied by design
• Generally implausible, since gene expression is
  affected by environment
          A Nutrition Experiment

• 5 rats each fed a fish oil diet, corn oil diet and
  olive oil diet for 3 weeks: 15 rats
• 5 rats each fed a fish oil diet, corn oil diet and
  olive oil diet for 12 weeks: 15 rats
• No treatments applied to animals
• Colon tissue assayed by Amersham CodeLink
  oligo microarray
• Diet by time interactions in gene expression?
          A Nutrition Experiment

• Approximately 50% of the rats had replicated arrays
• Allows assessment of intraclass correlation
• 3 by 2 experiment fit via a linear mixed model for
  each gene
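
For a single gene, the intraclass correlation from duplicated arrays can be approximated with the classical one-way ANOVA estimator. Note that this is a simpler stand-in for the linear mixed model fit described above, it ignores the diet and time factors, and the function name and toy numbers are hypothetical.

```python
def icc_oneway(groups):
    """One-way ANOVA estimate of the intraclass correlation.

    groups: one list per animal, each holding k replicate measurements
    (same k for every animal).
    """
    n = len(groups)
    k = len(groups[0])
    grand = sum(sum(grp) for grp in groups) / (n * k)
    means = [sum(grp) / k for grp in groups]
    # Between-animal and within-animal mean squares.
    msb = k * sum((mu - grand) ** 2 for mu in means) / (n - 1)
    msw = sum((v - mu) ** 2 for grp, mu in zip(groups, means) for v in grp) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)


if __name__ == "__main__":
    # Hypothetical log-expression values: two arrays for each of four animals.
    print(icc_oneway([[7.1, 7.4], [6.8, 7.3], [7.9, 7.6], [7.2, 6.9]]))
```

An ICC near 1/3 would correspond to roughly two thirds of the variability lying within animal rather than between animals.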
Robust Parameter Design: Microarrays

• Experiment (oligo-arrays):
  • 30 rats given different diets (corn oil, fish oil and
    olive oil enhanced)
  • 15 rats have duplicated arrays
  • How much of the variability in gene expression is
    due to the array?
• We have consistently found that 2/3 of the
  variability is noise
  • Within animal rather than between animal
       Intraclass Correlations

 Simulated ICC for
 genes with
 common r = 0.35

 Estimated ICC for
 8,038 genes from
 mixed models

Clearly, more control of noise via robust parameter
design has the potential to impact power for analyses
             A Nutrition Experiment

• 93 of 8,038
  genes passed
  the FDR test at
  level 0.05 for
  diet main effects
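
The slides do not say which FDR procedure was used. As one common possibility, the Benjamini-Hochberg step-up procedure is sketched below; treating it as the test applied here is an assumption, and the function name and p-values are hypothetical.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return a list of booleans marking which hypotheses are rejected
    by the Benjamini-Hochberg step-up procedure at FDR level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Largest rank k whose ordered p-value is at most k * q / m.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * q / m:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            rejected[i] = True
    return rejected


if __name__ == "__main__":
    pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print(sum(benjamini_hochberg(pvals, q=0.05)), "of", len(pvals), "rejected")
```

With one p-value per gene from the mixed-model fits, the count of `True` entries would play the role of the number of genes passing at level 0.05.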
             A Nutrition Experiment

• 2,718 of 8,038
  genes in saline
  data passed
  FDR with level
               Basic Summary

• Gene expression (G) and environment (X) are
  independent in the population, usually by design
• Assume flexible distribution for gene expression
• Semiparametric profile method
• Large gains in efficiency for: does
  environment/treatment affect disease prediction
  via gene expression?
                   Basic Summary

• The methodology easily works for
  • Missing data
  • Misclassification/Measurement error
• There is a version of this for family-based
  matched case-control studies
  • Irrelevant for gene expression studies?
• The methodology can be extended easily to low-
  order multivariate gene expression sets.