# Linear Modelling III - Wellcome Trust Centre for Human Genetics

Document Sample

```					   Linear Modelling III

Richard Mott
Wellcome Trust Centre for Human
Genetics
Synopsis
• Logistic Regression
• Mixed Models
Logistic Regression and Statistical Genetics

• Applicable to Association Studies
• Data:
– Binary outcomes (eg disease status)
– Dependent on genotypes [+ sex, environment]
•   Aim is to identify which factors influence the outcome
•   Rigorous tests of statistical significance
•   Flexible modelling language
•   Generalisation of Chi-Squared Test
Data
• Data for individual labelled i=1…n:
– Binary Response yi
– Unphased Genotypes gij for markers j=1..m
Coding Unphased Genotypes
• Several possibilities:
– AA, AG, GG original genotypes
– 12, 21, 22
– 1, 2, 3
– 0, 1, 2 # of G alleles
• Missing Data
– NA default in R
Using R
• Load genetic logistic regression tools
• >   source(‘logistic.R’)

• Read data table from file

• Column names
– names(t)
– t\$y response (0,1)
– t\$m1, t\$m2, …. Genotypes for each marker
Contigency Tables in R
• ftable(t\$y,t\$m31) prints the contingency table

> ftable(t\$y,t\$m31)
11 12 22

0   515 387   75
1    28 11     2
>
Chi-Squared Test in R

> chisq.test(t\$y,t\$m31)

Pearson's Chi-squared test

data: t\$y and t\$m31
X-squared = 3.8424, df = 2, p-value = 0.1464

Warning message:
Chi-squared approximation may be incorrect in:
chisq.test(t\$y, t\$m31)
>
The Logistic Model
• Effect of SNP k on phenotype

• Null model
The Logistic Model
• Prob(Yi=1) = exp(hi)/(1+exp(hi))
• hi = Sj xij bj - Linear Predictor

• xij – Design Matrix (genotypes etc)
• bj – Model Parameters (to be estimated)

• Model is investigated by
– estimating the bj’s by maximum likelihood
– testing if the estimates are different from 0
The Logistic Function
Prob(Yi=1) = exp(hi)/(1+exp(hi))

Prob(Y=1)

h
Types of genetic effect at a single locus

AA   AG   GG

Recessive   0    0    1

Dominant    1    1    0

Genotype    0    a    b
• Code genotypes as
– AA    x=0,
– AG    x=1,
– GG    x=2

• Linear Predictor
– h = b0 + xb1

•   P(Y=1|x) = exp(b0 + xb1)/(1+exp(b0 + xb1))
•   PAA = P(Y=1|x=0) = exp(b0)/(1+exp(b0))
•   PAG = P(Y=1|x=1) = exp(b0 + b1)/(1+exp(b0 + b1))
•   PGG = P(Y=1|x=2) = exp(b0 + 2b1)/(1+exp(b0 + 2b1))
Additive Model: b0 = -2 b1 = 2
PAA = 0.12 PAG = 0.50 PGG = 0.88

Prob(Y=1)

h
Additive Model: b0 = 0 b1 = 2
PAA = 0.50 PAG = 0.88 PGG = 0.98

Prob(Y=1)

h
Recessive Model
• Code genotypes as
– AA    x=0,
– AG    x=0,
– GG    x=1

• Linear Predictor
– h = b0 + xb1

• P(Y=1|x) = exp(b0 + xb1)/(1+exp(b0 + xb1))
• PAA = PAG = P(Y=1|x=0) = exp(b0)/(1+exp(b0))
• PGG = P(Y=1|x=1) = exp(b0 + b1)/(1+exp(b0 + b1))
Recessive Model: b0 = 0 b1 = 2
PAA = PAG = 0.50 PGG = 0.88

Prob(Y=1)

h
Genotype Model
• Each genotype has an independent probability
• Code genotypes as (for example)
– AA     x=0, z=0
– AG     x=1, z=0
– GG     x=0, z=1

• Linear Predictor
– h = b0 + xb1+zb2 two parameters

•   P(Y=1|x) = exp(b0 + xb1+zb2)/(1+exp(b0 + xb1+zb2))
•   PAA = P(Y=1|x=0,z=0) = exp(b0)/(1+exp(b0))
•   PAG = P(Y=1|x=1,z=0) = exp(b0 + b1)/(1+exp(b0 + b1))
•   PGG = P(Y=1|x=0,z=1) = exp(b0 + b2)/(1+exp(b0 + b2))
Genotype Model: b0 = 0 b1 = 2 b2 = -1
PAA = 0.5 PAG = 0.88 PGG = 0.27

Prob(Y=1)

h
Models in R
response y
genotype g

AA   AG      GG          model              DF

Recessive   0    0       1           y ~ recessive(g)   1

Dominant    0    1       1           y ~ dominant(g)    1

Genotype    0    a       b           y ~ genotype(g)    2
Data Transformation
• g <- t\$m1
• use these functions to treat a genotype vector
in a certain way:
–r   <-   recessive(g)
–d   <-   dominant(g)
–g   <-   genotype(g)
Fitting the Model
•   afit   <-   glm(   t\$y   ~   additive(g),family=‘binomial’)
•   rfit   <-   glm(   t\$y   ~   recessive(g),family=‘binomial’)
•   dfit   <-   glm(   t\$y   ~   dominant(g),family=‘binomial’)
•   gfit   <-   glm(   t\$y   ~   genotype(g),family=‘binomial’)

• Equivalent models:
– genotype = dominant + recessive
– genotype = additive + recessive
– genotype = additive + dominant
– genotype ~ standard chi-squared test of genotype
association
Parameter Estimates

> summary(glm( t\$y ~ genotype(t\$m31), family='binomial'))

Coefficients:
Estimate Std. Error z value Pr(>|z|)
b0 (Intercept)        -2.9120     0.1941 -15.006   <2e-16 ***
b1 genotype(t\$m31)12 -0.6486      0.3621 -1.791    0.0733 .
b2 genotype(t\$m31)22 -0.7124      0.7423 -0.960    0.3372
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
>
Analysis of Deviance
Chi-Squared Test

> anova(glm( t\$y ~ genotype(t\$m31), family='binomial'),
test="Chisq")
Analysis of Deviance Table

Response: t\$y

Terms added sequentially (first to last)

Df Deviance Resid. Df Resid. Dev
NULL                               1017     343.71
genotype(t\$m31)    2     3.96      1015     339.76
Model Comparison
• Compare general model with additive,
dominant or recessive models:
> afit <- glm(t\$y ~ additive(t\$m20))
> gfit <- glm(t\$y ~ genotype(t\$m20))
> anova(afit,gfit)
Analysis of Deviance Table

Model 2: t\$y ~ genotype(t\$m20)
Resid. Df Resid. Dev   Df Deviance
1      1016     38.301
2      1015     38.124    1    0.177
>
Scanning all Markers
Deviance DF         Pval     LogPval
m1 8.604197e+00 1 3.353893e-03 2.474450800
m2 7.037336e+00 1 7.982767e-03 2.097846522
m3 6.603882e-01 1 4.164229e-01 0.380465360
m4 3.812860e+00 1 5.086054e-02 1.293619014
m5 7.194936e+00 1 7.310960e-03 2.136025588
m6 2.449127e+00 1 1.175903e-01 0.929628598
m7 2.185613e+00 1 1.393056e-01 0.856031566
m8 1.227191e+00 1 2.679539e-01 0.571939852
m9 2.532562e+01 1 4.842353e-07 6.314943565
m10 5.729634e+01 1 3.748518e-14 13.426140380
m11 3.107441e+01 1 2.483233e-08 7.604982503
…
…
…
Multilocus Models
• Can test the effects of fitting two or more
markers simultaneously
• Several multilocus models are possible
• Interaction Model assumes that each
combination of genotypes has a different
effect
• eg t\$y ~ t\$m10 * t\$m15
Multi-Locus Models
> f <- glm( t\$y ~ genotype(t\$m13) * genotype(t\$m26) , family='binomial’,
test="Chisq")
> anova(f)
Analysis of Deviance Table

Response: t\$y

Terms added sequentially (first to last)

Df Deviance Resid. Df Resid. Dev
NULL                                                   1017     343.71
genotype(t\$m13)                        2   108.68      1015     235.03
genotype(t\$m26)                        2     1.14      1013     233.89
genotype(t\$m13):genotype(t\$m26)        3     6.03      1010     227.86

> pchisq(6.03,3,lower.tail=F) calculate p-value
[1] 0.1101597
Adding the effects of Sex and other
Covariates
• Read in sex and other covariate data, eg. age
from a file into variables, say a\$sex, a\$age
• Fit models of the form
•   fit1 <- glm(t\$y ~ additive(t\$m10) + a\$sex + a\$age, family=‘binomial’)
•   fit2 <- glm(t\$y ~ a\$sex + a\$age, family=‘binomial’)
Adding the effects of Sex and other
Covariates
• Compare models using anova – test if the effect of
the marker m10 is significant after taking into
account sex and age
• anova(fit1,fit2)
Random Effects
Mixed Models
• So far our models have had fixed effects
– each parameter can take any value independently of the
others
– each parameter estimate uses up a degree of freedom
– models with large numbers of parameters have problems
• saturation - overfitting
• poor predictive properties
• numerically unstable and difficult to fit
• in some cases we can treat parameters as being sampled from a
distribution – random effects
• only estimate the parameters needed to specify that distribution
• can result in a more robust model
Example of a Mixed Model
•   Testing genetic association across a large number of of families
–   yi = trait value of i’th individual
–   bi = genotype of individual at SNP of interest
–   fi = family identifier (factor)
–   ei = error, variance s2

– y=m+b+f+e

– If we treat family effects f as fixed then we must estimate a large number of parameters
– Better to think of these effects as having a distribution N(0,t2) for some variance t which must
be estimated
•   individuals from the same family have the same f and are correlated
–   cov = t2
•   individuals from different families are uncorrelated

– genotype parameters b still treated as fixed effects
•   mixed model
Mixed Models
• y=m+b+f+e

• Fixed effects model
– E(y) = m + b + f
– Var(y) = Is2
• I is identity matrix

• Mixed model
– E(y) = m + b
– Var(y) = Is2 + Ft2
• F is a matrix, Fij = 1 if i, j in same family
• Need to estimate both s2 and t2
Generalised Least Squares
• y = Xb + e
• Var(y) = V (symmetric covariance matrix)
• V = Is2 (uncorrelated errors)
• V = Is2 + Ft2 (single grouping random effect)
• V = Is2 + Gt2 (G = genotype identity by state)
• GLS solution, if V is known
ˆ = (X V X) XT V-1y
b        T -1        -1

•   unbiased,
•   efficient (minimum variance)
•   consistent (tends to true value in large samples)
•   asymptotically normally distributed
Multivariate Normal Distribution
• Joint distribution of a vector of correlated observations
• Another way of thinking about the data
• Estimation of parameters in a mixed model is special case of likelihood
analysis of a MVN
• y ~ MVN(m, V)
– m is vector of expected values
– V is variance-covariance matrix
– If V = Is2 then elements of y are uncorrelated and equivalent to n
independent Normal variables
– probability density/Likelihood is
1
L(y | m,V) = exp(- (y - m)' V-1 (y - m))
2
2 n2
(2p V )
Henderson’s Mixed Model Equations
•   General linear mixed model.
y = Xb + Zu + e
•   b are fixed effect
•   u are random effects
•   X, Z design matrices          u » MVN(0,G)    (G = It 2 )
e » MVN(0,R)    (R = Is 2 )
y » MVN(Xb,V)
V = ZGZ T + R

æ XT R-1X  XT R-1Z öæb ö æ XT R-1yö
ˆ
ç T -1    T -1    -1 ÷
ç ÷ = ç T -1 ÷
è Z R X Z R Z + G øè uø è Z R y ø
ˆ
Mixed Models

• Fixed effects models are special cases of mixed models

• Mixed models sometimes are more powerful (as fewer
parameters)
– But check that the assumption effects are sampled from a
Normal distribution is reasonable
– Differences between estimated parameters and statistical
significance in fixed vs mixed models

• Shrinkage: often random effect estimates from a mixed
model have smaller magnitudes than from a fixed effects
model.
Mixed Models in R
• Several R packages available, e.g.
– lme4 general purpose mixed models package
– emma     for genetic association in structured populations

• Formulas in lme4, eg test for association between genotype and trait
•    H0: E(y) = m
•    vs
•    H1: E(y) = m + b

•    Var(y) = Is2 + Ft2

–   h0 = lmer(y ~ 1 +(1|family), data=data)
–   h1 = lmer(y ~ genotype +(1|family), data=data)
–   anova(h0,h1)
Formulas in lmer()
• y ~ fixed.effects + (1|Group1) + (1|Group2)
etc
– random intercept models
• y ~ fixed.effects +(1|Group1) + (x|Group1)
– random slope model
Example: HDL Data
> library(lme4)
cc=complete.cases(b\$Biochem.HDL)

> f0=lm(Biochem.HDL ~ Family , data=b,subset=cc)
> f1=lmer(Biochem.HDL ~ (1|Family) , data=b,subset=cc)

> anova(f0)
Analysis of Variance Table

Response: Biochem.HDL
Df Sum Sq Mean Sq F value   Pr(>F)
Family     175 147.11 0.84064 5.2098 < 2.2e-16 ***
Residuals 1681 271.24 0.16136
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> f1
Linear mixed model fit by REML
Formula: Biochem.HDL ~ (1 | Family)
Data: b
Subset: cc
AIC BIC logLik deviance REMLdev
2161 2178 -1078       2149    2155
Random effects:
Groups    Name        Variance Std.Dev.
Family    (Intercept) 0.066633 0.25813
Residual              0.161487 0.40185
Number of obs: 1857, groups: Family, 176

Fixed effects:
Estimate Std. Error t value
(Intercept) 1.60615     0.02263   70.97
> plot(resid(f0),resid(f1))

```
DOCUMENT INFO
Shared By:
Categories:
Tags:
Stats:
 views: 18 posted: 12/11/2012 language: English pages: 40
Lingjuan Ma