Document Sample

Linear Modelling III Richard Mott Wellcome Trust Centre for Human Genetics Synopsis • Logistic Regression • Mixed Models Logistic Regression and Statistical Genetics • Applicable to Association Studies • Data: – Binary outcomes (eg disease status) – Dependent on genotypes [+ sex, environment] • Aim is to identify which factors influence the outcome • Rigorous tests of statistical significance • Flexible modelling language • Generalisation of Chi-Squared Test Data • Data for individual labelled i=1…n: – Binary Response yi – Unphased Genotypes gij for markers j=1..m Coding Unphased Genotypes • Several possibilities: – AA, AG, GG original genotypes – 12, 21, 22 – 1, 2, 3 – 0, 1, 2 # of G alleles • Missing Data – NA default in R Using R • Load genetic logistic regression tools • > source(‘logistic.R’) • Read data table from file – > t <- read.table(‘geno.dat’, header=TRUE) • Column names – names(t) – t$y response (0,1) – t$m1, t$m2, …. Genotypes for each marker Contigency Tables in R • ftable(t$y,t$m31) prints the contingency table > ftable(t$y,t$m31) 11 12 22 0 515 387 75 1 28 11 2 > Chi-Squared Test in R > chisq.test(t$y,t$m31) Pearson's Chi-squared test data: t$y and t$m31 X-squared = 3.8424, df = 2, p-value = 0.1464 Warning message: Chi-squared approximation may be incorrect in: chisq.test(t$y, t$m31) > The Logistic Model • Effect of SNP k on phenotype • Null model The Logistic Model • Prob(Yi=1) = exp(hi)/(1+exp(hi)) • hi = Sj xij bj - Linear Predictor • xij – Design Matrix (genotypes etc) • bj – Model Parameters (to be estimated) • Model is investigated by – estimating the bj’s by maximum likelihood – testing if the estimates are different from 0 The Logistic Function Prob(Yi=1) = exp(hi)/(1+exp(hi)) Prob(Y=1) h Types of genetic effect at a single locus AA AG GG Recessive 0 0 1 Dominant 1 1 0 Additive 0 1 2 Genotype 0 a b Additive Genotype Model • Code genotypes as – AA x=0, – AG x=1, – GG x=2 • Linear Predictor – h = b0 + xb1 • P(Y=1|x) = exp(b0 + xb1)/(1+exp(b0 + xb1)) • PAA = P(Y=1|x=0) = exp(b0)/(1+exp(b0)) • PAG = P(Y=1|x=1) = exp(b0 + b1)/(1+exp(b0 + b1)) • PGG = P(Y=1|x=2) = exp(b0 + 2b1)/(1+exp(b0 + 2b1)) Additive Model: b0 = -2 b1 = 2 PAA = 0.12 PAG = 0.50 PGG = 0.88 Prob(Y=1) h Additive Model: b0 = 0 b1 = 2 PAA = 0.50 PAG = 0.88 PGG = 0.98 Prob(Y=1) h Recessive Model • Code genotypes as – AA x=0, – AG x=0, – GG x=1 • Linear Predictor – h = b0 + xb1 • P(Y=1|x) = exp(b0 + xb1)/(1+exp(b0 + xb1)) • PAA = PAG = P(Y=1|x=0) = exp(b0)/(1+exp(b0)) • PGG = P(Y=1|x=1) = exp(b0 + b1)/(1+exp(b0 + b1)) Recessive Model: b0 = 0 b1 = 2 PAA = PAG = 0.50 PGG = 0.88 Prob(Y=1) h Genotype Model • Each genotype has an independent probability • Code genotypes as (for example) – AA x=0, z=0 – AG x=1, z=0 – GG x=0, z=1 • Linear Predictor – h = b0 + xb1+zb2 two parameters • P(Y=1|x) = exp(b0 + xb1+zb2)/(1+exp(b0 + xb1+zb2)) • PAA = P(Y=1|x=0,z=0) = exp(b0)/(1+exp(b0)) • PAG = P(Y=1|x=1,z=0) = exp(b0 + b1)/(1+exp(b0 + b1)) • PGG = P(Y=1|x=0,z=1) = exp(b0 + b2)/(1+exp(b0 + b2)) Genotype Model: b0 = 0 b1 = 2 b2 = -1 PAA = 0.5 PAG = 0.88 PGG = 0.27 Prob(Y=1) h Models in R response y genotype g AA AG GG model DF Recessive 0 0 1 y ~ recessive(g) 1 Dominant 0 1 1 y ~ dominant(g) 1 Additive 0 1 2 y ~ additive(g) 1 Genotype 0 a b y ~ genotype(g) 2 Data Transformation • g <- t$m1 • use these functions to treat a genotype vector in a certain way: –a <- additive(g) –r <- recessive(g) –d <- dominant(g) –g <- genotype(g) Fitting the Model • afit <- glm( t$y ~ additive(g),family=‘binomial’) • rfit <- glm( t$y ~ recessive(g),family=‘binomial’) • dfit <- glm( t$y ~ dominant(g),family=‘binomial’) • gfit <- glm( t$y ~ genotype(g),family=‘binomial’) • Equivalent models: – genotype = dominant + recessive – genotype = additive + recessive – genotype = additive + dominant – genotype ~ standard chi-squared test of genotype association Parameter Estimates > summary(glm( t$y ~ genotype(t$m31), family='binomial')) Coefficients: Estimate Std. Error z value Pr(>|z|) b0 (Intercept) -2.9120 0.1941 -15.006 <2e-16 *** b1 genotype(t$m31)12 -0.6486 0.3621 -1.791 0.0733 . b2 genotype(t$m31)22 -0.7124 0.7423 -0.960 0.3372 --- Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 > Analysis of Deviance Chi-Squared Test > anova(glm( t$y ~ genotype(t$m31), family='binomial'), test="Chisq") Analysis of Deviance Table Model: binomial, link: logit Response: t$y Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev NULL 1017 343.71 genotype(t$m31) 2 3.96 1015 339.76 Model Comparison • Compare general model with additive, dominant or recessive models: > afit <- glm(t$y ~ additive(t$m20)) > gfit <- glm(t$y ~ genotype(t$m20)) > anova(afit,gfit) Analysis of Deviance Table Model 1: t$y ~ additive(t$m20) Model 2: t$y ~ genotype(t$m20) Resid. Df Resid. Dev Df Deviance 1 1016 38.301 2 1015 38.124 1 0.177 > Scanning all Markers > logscan(t,model=‘additive’) Deviance DF Pval LogPval m1 8.604197e+00 1 3.353893e-03 2.474450800 m2 7.037336e+00 1 7.982767e-03 2.097846522 m3 6.603882e-01 1 4.164229e-01 0.380465360 m4 3.812860e+00 1 5.086054e-02 1.293619014 m5 7.194936e+00 1 7.310960e-03 2.136025588 m6 2.449127e+00 1 1.175903e-01 0.929628598 m7 2.185613e+00 1 1.393056e-01 0.856031566 m8 1.227191e+00 1 2.679539e-01 0.571939852 m9 2.532562e+01 1 4.842353e-07 6.314943565 m10 5.729634e+01 1 3.748518e-14 13.426140380 m11 3.107441e+01 1 2.483233e-08 7.604982503 … … … Multilocus Models • Can test the effects of fitting two or more markers simultaneously • Several multilocus models are possible • Interaction Model assumes that each combination of genotypes has a different effect • eg t$y ~ t$m10 * t$m15 Multi-Locus Models > f <- glm( t$y ~ genotype(t$m13) * genotype(t$m26) , family='binomial’, test="Chisq") > anova(f) Analysis of Deviance Table Model: binomial, link: logit Response: t$y Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev NULL 1017 343.71 genotype(t$m13) 2 108.68 1015 235.03 genotype(t$m26) 2 1.14 1013 233.89 genotype(t$m13):genotype(t$m26) 3 6.03 1010 227.86 > pchisq(6.03,3,lower.tail=F) calculate p-value [1] 0.1101597 Adding the effects of Sex and other Covariates • Read in sex and other covariate data, eg. age from a file into variables, say a$sex, a$age • Fit models of the form • fit1 <- glm(t$y ~ additive(t$m10) + a$sex + a$age, family=‘binomial’) • fit2 <- glm(t$y ~ a$sex + a$age, family=‘binomial’) Adding the effects of Sex and other Covariates • Compare models using anova – test if the effect of the marker m10 is significant after taking into account sex and age • anova(fit1,fit2) Random Effects Mixed Models • So far our models have had fixed effects – each parameter can take any value independently of the others – each parameter estimate uses up a degree of freedom – models with large numbers of parameters have problems • saturation - overfitting • poor predictive properties • numerically unstable and difficult to fit • in some cases we can treat parameters as being sampled from a distribution – random effects • only estimate the parameters needed to specify that distribution • can result in a more robust model Example of a Mixed Model • Testing genetic association across a large number of of families – yi = trait value of i’th individual – bi = genotype of individual at SNP of interest – fi = family identifier (factor) – ei = error, variance s2 – y=m+b+f+e – If we treat family effects f as fixed then we must estimate a large number of parameters – Better to think of these effects as having a distribution N(0,t2) for some variance t which must be estimated • individuals from the same family have the same f and are correlated – cov = t2 • individuals from different families are uncorrelated – genotype parameters b still treated as fixed effects • mixed model Mixed Models • y=m+b+f+e • Fixed effects model – E(y) = m + b + f – Var(y) = Is2 • I is identity matrix • Mixed model – E(y) = m + b – Var(y) = Is2 + Ft2 • F is a matrix, Fij = 1 if i, j in same family • Need to estimate both s2 and t2 Generalised Least Squares • y = Xb + e • Var(y) = V (symmetric covariance matrix) • V = Is2 (uncorrelated errors) • V = Is2 + Ft2 (single grouping random effect) • V = Is2 + Gt2 (G = genotype identity by state) • GLS solution, if V is known ˆ = (X V X) XT V-1y b T -1 -1 • unbiased, • efficient (minimum variance) • consistent (tends to true value in large samples) • asymptotically normally distributed Multivariate Normal Distribution • Joint distribution of a vector of correlated observations • Another way of thinking about the data • Estimation of parameters in a mixed model is special case of likelihood analysis of a MVN • y ~ MVN(m, V) – m is vector of expected values – V is variance-covariance matrix – If V = Is2 then elements of y are uncorrelated and equivalent to n independent Normal variables – probability density/Likelihood is 1 L(y | m,V) = exp(- (y - m)' V-1 (y - m)) 2 2 n2 (2p V ) Henderson’s Mixed Model Equations • General linear mixed model. y = Xb + Zu + e • b are fixed effect • u are random effects • X, Z design matrices u » MVN(0,G) (G = It 2 ) e » MVN(0,R) (R = Is 2 ) y » MVN(Xb,V) V = ZGZ T + R æ XT R-1X XT R-1Z öæb ö æ XT R-1yö ˆ ç T -1 T -1 -1 ÷ ç ÷ = ç T -1 ÷ è Z R X Z R Z + G øè uø è Z R y ø ˆ Mixed Models • Fixed effects models are special cases of mixed models • Mixed models sometimes are more powerful (as fewer parameters) – But check that the assumption effects are sampled from a Normal distribution is reasonable – Differences between estimated parameters and statistical significance in fixed vs mixed models • Shrinkage: often random effect estimates from a mixed model have smaller magnitudes than from a fixed effects model. Mixed Models in R • Several R packages available, e.g. – lme4 general purpose mixed models package – emma for genetic association in structured populations • Formulas in lme4, eg test for association between genotype and trait • H0: E(y) = m • vs • H1: E(y) = m + b • Var(y) = Is2 + Ft2 – h0 = lmer(y ~ 1 +(1|family), data=data) – h1 = lmer(y ~ genotype +(1|family), data=data) – anova(h0,h1) Formulas in lmer() • y ~ fixed.effects + (1|Group1) + (1|Group2) etc – random intercept models • y ~ fixed.effects +(1|Group1) + (x|Group1) – random slope model Example: HDL Data > library(lme4) > b=read.delim("Biochemistry.txt”) cc=complete.cases(b$Biochem.HDL) > f0=lm(Biochem.HDL ~ Family , data=b,subset=cc) > f1=lmer(Biochem.HDL ~ (1|Family) , data=b,subset=cc) > anova(f0) Analysis of Variance Table Response: Biochem.HDL Df Sum Sq Mean Sq F value Pr(>F) Family 175 147.11 0.84064 5.2098 < 2.2e-16 *** Residuals 1681 271.24 0.16136 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 > f1 Linear mixed model fit by REML Formula: Biochem.HDL ~ (1 | Family) Data: b Subset: cc AIC BIC logLik deviance REMLdev 2161 2178 -1078 2149 2155 Random effects: Groups Name Variance Std.Dev. Family (Intercept) 0.066633 0.25813 Residual 0.161487 0.40185 Number of obs: 1857, groups: Family, 176 Fixed effects: Estimate Std. Error t value (Intercept) 1.60615 0.02263 70.97 > plot(resid(f0),resid(f1))

DOCUMENT INFO

Shared By:

Categories:

Tags:

Stats:

views: | 18 |

posted: | 12/11/2012 |

language: | English |

pages: | 40 |

How are you planning on using Docstoc?
BUSINESS
PERSONAL

By registering with docstoc.com you agree to our
privacy policy and
terms of service, and to receive content and offer notifications.

Docstoc is the premier online destination to start and grow small businesses. It hosts the best quality and widest selection of professional documents (over 20 million) and resources including expert videos, articles and productivity tools to make every small business better.

Search or Browse for any specific document or resource you need for your business. Or explore our curated resources for Starting a Business, Growing a Business or for Professional Development.

Feel free to Contact Us with any questions you might have.