
Gene-Environment Case-Control Studies

Raymond J. Carroll
Department of Statistics
Faculty of Nutrition and Toxicology
Texas A&M University
http://stat.tamu.edu/~carroll

Image of rat colon displaying apoptosis (green) and cell differentiation (red)

Outline
• Problem: Does Environment/Treatment Affect Disease Prediction via Gene Expression?
• Two Thought Experiments
• Case-control studies: Background
• Gene-Environment independence
• Profile likelihood approach
• Efficiency gains (Large ones!)
• Limitations
• Conclusions

Acknowledgment
• This work is joint with Nilanjan Chatterjee, National Cancer Institute
  http://dceg.cancer.gov/people/ChatterjeeNilanjan.html

Experiment #1
• Two strata: corn oil and fish oil fed rats
• All animals exposed to a carcinogen (200 per stratum)
• Within each stratum, 50% randomized to radiation
• At initial stage, fecal (not mucosal) material gathered
• At proliferation stage (8 weeks), animals sacrificed and assayed for aberrant crypt foci (ACF)
• ACF are precursors to colon cancer

Experiment #1
• The finger-like structures are colonic crypts
• They house the stem, proliferating and differentiating cells

Experiment #1
• The dark spots are ACF: aberrant crypt foci

Experiment #1
• We want to know whether microarray on fecal material is predictive of ACF status
• We also want to know whether gene expression differs in predictive ability for the environment (radiation)
• With 400 animals, cannot do microarray on all
• We will construct an index of ACF, and sample cases (those with high # of ACF) and controls
• Microarray done on these animals

Experiment #2
• Idea: patients with tumors have their tumor tissues stored
• They are then randomized to treatments (environment)
• Question 1: does gene expression predict recurrence (say)?
• Question 2: is predictive ability different for the different treatments?
• Again, for cost purposes, only some cases and controls will have gene expression measured

Basic Problem Formalized
• Case-control sample: D = disease
• Gene expression: G
• Environment: X
• Strata: S
• We are interested in main effects for G and (X,S), and especially their interaction
• In both experiments, G and X are independent in the population by design, given S

Prospective Models
• Simplest logistic model
$$\mathrm{pr}(D=1 \mid G, X) = H(\beta_0 + \beta_1 G + \beta_2 X + \beta_3 G X)$$
• General logistic model
$$\mathrm{pr}(D=1 \mid G, X) = H\{\beta_0 + m(G, X, \beta_1)\}$$
• The function m(G,X,β1) is completely general

Case-Control Data
• The data as we envision it are case-control data
• The % of cases (D=1) is often much higher in the case-control study than in the general population
• For example, one might get gene expression for all cases, but only some controls
• Many microarray studies use a plan like this

Case-Control Data
• Case-control data are not a random sample
• We observe (G,X) given D, i.e., we observe the covariates given the response, not vice-versa
• If we had a random sample, linear logistic regression would be used to fit the model
• Essentially, Fisher LDA with no interactions
• Obvious idea: ignore the sampling plan and pretend you have a random sample

Case-Control Data
$$\mathrm{pr}(D=1 \mid G, X) = H\{\beta_0 + m(G, X, \beta_1)\}$$
• Cool Fact: all parameters except the intercept can be estimated consistently while ignoring the sampling plan
• The intercept is determined by pr(D=1) in the population, hence not identified from these data
• Standard errors are also asymptotically correct
• Well known fact for linear logistic (Prentice and Pyke, 1979), not so well known for general nonlinear models

Environment and Gene Expression
• In my two examples, gene expression (G) and environment (X) are independent by design.
• Can we exploit this to get more efficient estimates?
• Should be possible: this is akin to a missing data problem, with outcomes missing at random (MAR).
• We do this via a semiparametric profile likelihood approach

Environment and Gene Expression
• Methodology: Start with the retrospective likelihood
$$\mathrm{pr}(G=g, X=x \mid D=d) = \frac{\mathrm{pr}(X=x)\,\mathrm{pr}(G=g)\,\exp[d\{\beta_0 + m(g,x,\beta_1)\}]\,[1 - H\{\beta_0 + m(g,x,\beta_1)\}]}{\sum_{x',g'} \mathrm{pr}(X=x')\,\mathrm{pr}(G=g')\,\exp[d\{\beta_0 + m(g',x',\beta_1)\}]\,[1 - H\{\beta_0 + m(g',x',\beta_1)\}]}$$
• Note how independence of G and X is used here, through the product pr(X=x)pr(G=g) (the red expressions on the slide)
• Treat the environment (X_1,…,X_n) as distinct parameters, and λ_i = pr(X=x_i) as their distribution
• Let G have pr(G=g) = f(g,θ)
• Construct the profile likelihood, having estimated the λ_i as functions of data and other parameters

Profile Likelihood
• My approach is often called the Neyman-Scott formulation
• With a single gene expression, n samples, we have more than n parameters
• Often does not work to produce workable methods
• Non-constant variances treated as n parameters
• Latent variables in measurement error problems treated as parameters

Profile Likelihood
• Result:
$$\kappa = \beta_0 + \log(n_1/n_0) - \log\{\mathrm{pr}(D=1)/\mathrm{pr}(D=0)\}$$
$$S(d,g,x,\Omega) = \frac{f(g,\theta)\,\exp[d\{\kappa + m(g,x,\beta_1)\}]}{1 + \exp\{\beta_0 + m(g,x,\beta_1)\}}$$
• Profile likelihood L(β0, β1, κ, θ) = L(Ω); each observation (D, G, X) contributes the factor
$$\frac{S(D,G,X,\Omega)}{\sum_{d=0}^{1} \int S(d,g,X,\Omega)\,d\mu(g)}$$

Profile Likelihood
• The form of the profiled likelihood L(β0, β1, κ, θ) = L(Ω) makes it appear that every parameter in Ω is identified, and hence so too are pr(D=1) and β0, since
$$\kappa = \beta_0 + \log(n_1/n_0) - \log\{\mathrm{pr}(D=1)/\mathrm{pr}(D=0)\}$$
• This does not happen with regular case-control data, remember

Profile Likelihood
• In light of the Neyman-Scott phenomenon, it would be a surprise if pr(D=1) and β0 are identified
• Happy days: sometimes surprises are happy ones!
• Both are identified theoretically from case-control data
• So too is the distribution of gene expression
• Even more interesting with alleles, haplotypes, etc.
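
To make the Profile Likelihood slides concrete, here is a minimal sketch of how the profile pseudo-likelihood could be evaluated, written under several assumptions that are not in the slides. G is taken to be binary, so the integral over g becomes a sum over g in {0, 1}, with f(g,θ) a Bernoulli model on the logit scale. The risk function m(g,x,β1) is taken to be main effects plus an interaction, using the parameter values of the first simulation. The intercept β0 = -3.5 is a hypothetical value, not calibrated to pr(D=1) = 0.05, and every function name is illustrative. It uses S(d,g,x,Ω) = f(g,θ) exp[d{κ + m(g,x,β1)}] / [1 + exp{β0 + m(g,x,β1)}] with κ = β0 + log(n1/n0) − log{pr(D=1)/pr(D=0)}.

```python
import math
import random


def expit(t):
    """Logistic function H(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))


def m(g, x, beta1):
    """Assumed risk function: main effects of G and X plus their interaction."""
    b_g, b_x, b_gx = beta1
    return b_g * g + b_x * x + b_gx * g * x


def f(g, theta):
    """Assumed model for binary G: pr(G=1) = expit(theta)."""
    p = expit(theta)
    return p if g == 1 else 1.0 - p


def S(d, g, x, beta0, beta1, kappa, theta):
    """S(d,g,x,Omega) = f(g,theta) exp[d{kappa + m}] / [1 + exp{beta0 + m}]."""
    eta = m(g, x, beta1)
    return f(g, theta) * math.exp(d * (kappa + eta)) / (1.0 + math.exp(beta0 + eta))


def profile_loglik(data, beta0, beta1, kappa, theta):
    """Sum over subjects of log[ S(D,G,X) / sum over d and g of S(d,g,X) ]."""
    total = 0.0
    for d, g, x in data:
        num = S(d, g, x, beta0, beta1, kappa, theta)
        den = sum(S(dd, gg, x, beta0, beta1, kappa, theta)
                  for dd in (0, 1) for gg in (0, 1))
        total += math.log(num / den)
    return total


def draw_gx(theta, rng):
    """Draw G and X independently, as the method assumes."""
    g = 1 if rng.random() < expit(theta) else 0
    x = min(10.0, rng.lognormvariate(0.0, 1.0))
    return g, x


def simulate_case_control(n_cases, n_controls, beta0, beta1, theta, rng):
    """Draw subjects from the population model until both quotas are filled."""
    cases, controls = [], []
    while len(cases) < n_cases or len(controls) < n_controls:
        g, x = draw_gx(theta, rng)
        d = 1 if rng.random() < expit(beta0 + m(g, x, beta1)) else 0
        if d == 1 and len(cases) < n_cases:
            cases.append((d, g, x))
        elif d == 0 and len(controls) < n_controls:
            controls.append((d, g, x))
    return cases + controls


def population_prevalence(beta0, beta1, theta, rng, n_draws=20000):
    """Monte Carlo approximation of pr(D=1) in the population."""
    total = 0.0
    for _ in range(n_draws):
        g, x = draw_gx(theta, rng)
        total += expit(beta0 + m(g, x, beta1))
    return total / n_draws


rng = random.Random(0)
beta0, beta1 = -3.5, (0.26, 0.10, 0.30)   # beta0 is a hypothetical value
theta = math.log(0.2 / 0.8)               # pr(G=1) = 0.2
n1 = n0 = 200
data = simulate_case_control(n1, n0, beta0, beta1, theta, rng)
p1 = population_prevalence(beta0, beta1, theta, rng)
kappa = beta0 + math.log(n1 / n0) - math.log(p1 / (1.0 - p1))

ll_true = profile_loglik(data, beta0, beta1, kappa, theta)
ll_null = profile_loglik(data, beta0, (0.0, 0.0, 0.0), kappa, theta)
print("log-likelihood at simulation values:", ll_true)
print("log-likelihood with beta1 set to zero:", ll_null)
```

In practice the function would be handed to a numerical optimizer over (β0, β1, κ, θ); the sketch only evaluates it at fixed values, and a continuous G would need the sum over g replaced by numerical integration.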
Environment and Gene Expression
• Summary of Assumptions:
  • G and X are independent (possibly after stratification)
  • Parametric form for the distribution of G
• Summary of Result:
  • Intercept and marginal pr(D=1) are identified
• Loss of robustness versus the usual analysis that assumes nothing
• Identification of pr(D=1) hardly seems worth the risk (but wait!)

Alternative Derivation
• Consider a prospective study
• Let Δ=1 mean selection into the study
• Pretend pr(Δ=1|D=d,G,X) ∝ n_d/pr(D=d); n_d = # of observations with D=d
• Then compute pr(D=d,G=g|Δ=1,X)
• This is exactly our profile pseudo-likelihood!

Interesting Technical Point
• The profile pseudo-likelihood acts like a real likelihood
• Information asymptotics are (almost) exact
• Missing data handled seamlessly
• Measurement error/misclassification in environment handled seamlessly
• Because it appears to be some sort of likelihood, hope for efficiency gains versus standard approach

First Simulation
• Settings
  • 500 cases, 500 controls
  • Gene expression dichotomized into low and high, pr(high) = 0.05 and 0.2 (in the population)
  • X = min{10, Lognormal(0,1)}
  • pr(D=1) = 0.05 (in the population)
• Standard multiplicative model:
  • Main effect parameter for G = 0.26
  • Main effect parameter for X = 0.10
  • Interaction parameter = 0.30

First Simulation
• MSE Efficiency of Profile method
[Figure: MSE efficiency of the profile method for the G, X, and G times X parameters; series pr(G)=.05 and pr(G)=.20; vertical axis 0 to 4]

Second Simulation
• Gene expression G = Normal(0,1)
• Environment X = Binary, pr(X=1) = 0.5
• Randomized treatment assignment
• pr(D=1) = 0.05 in the population
• 500 cases and 500 controls
pr(D=1|G,X=0) = Logistic(β0 + 0.3 G)
pr(D=1|G,X=1) = Logistic(β0 + 1.1 G)
Treatment Relative Risk = exp(0.8G)

Second Simulation
• MSE Efficiency of Profile method
[Figure: MSE efficiency of the profile method for the G, X, and G*X parameters; vertical axis 0 to 1.8]

Second Simulation
pr(D=1) = 0.05

The G-model Assumption
• We have to specify a model for G: f(g,θ)
• To achieve model robustness, we have used
  • Skew-normal distribution (Azzalini, 2002, JRSSB)
  • SNP-Family (Davidian & Zhang, 2002, Biometrics)
• Both have the Gaussian embedded as a special case
• Skew-normal is unimodal
• SNP can allow for some bimodality
• Both allow for heavier tails

The G-model Assumption
• Our simulation experiments with Skew-Normal and SNP families show
  • Little loss of efficiency when G is Gaussian
  • Protection against bias when G is skew
• Both are straightforward to fit.

The Independence Assumption
• Gains in efficiency come from assuming gene expression (G) and environment (X) are independent in the population
• I have given two potential cases where this is satisfied by design
• Generally implausible, since gene expression is affected by environment

A Nutrition Experiment
• 5 rats each fed a fish oil diet, corn oil diet and olive oil diet for 3 weeks: 15 rats
• 5 rats each fed a fish oil diet, corn oil diet and olive oil diet for 12 weeks: 15 rats
• No treatments applied to animals
• Colon tissue assayed by Amersham CodeLink oligo microarray
• Diet by time interactions in gene expression?

A Nutrition Experiment
• Approximately 50% of the rats had replicated microarray
• Allows assessment of intraclass correlation
• 3 by 2 experiment fit via a linear mixed model for each gene

Robust Parameter Design: Microarrays
Experiment (oligo-arrays): 30 rats given different diets (corn oil, fish oil and olive oil enhanced)
15 rats have duplicated arrays
How much of the variability in gene expression is due to the array?
We have consistently found that 2/3 of the variability is noise within animal rather than between animal

Intraclass Correlations
Simulated ICC for 8,038 independent genes with common ρ = 0.35
Estimated ICC for 8,038 genes from mixed models
Clearly, more control of noise via robust parameter design has the potential to impact power for analyses

A Nutrition Experiment
• 93 of 8,038 genes passed the FDR test at level 0.05 for diet main effects

A Nutrition Experiment
• 2,718 of 8,038 genes in saline data passed FDR with level 0.05

Basic Summary
• Gene expression (G) and environment (X) are independent in the population, usually by design
• Assume flexible distribution for gene expression (G)
• Semiparametric profile method
• Large gains in efficiency for: does environment/treatment affect disease prediction via gene expression?

Basic Summary
• The methodology easily works for
  • Missing data
  • Misclassification/Measurement error
• There is a version of this for family-based matched case-control studies
  • Irrelevant for gene expression studies?
• The methodology can be extended easily to low-order multivariate gene expression sets.

Basic Summary
• The
