                   Linear Modelling III

                      Richard Mott
         Wellcome Trust Centre for Human Genetics
                  Synopsis
• Logistic Regression
• Mixed Models
     Logistic Regression and Statistical Genetics

• Applicable to Association Studies
• Data:
    – Binary outcomes (e.g. disease status)
    – Dependent on genotypes [+ sex, environment]
•   Aim is to identify which factors influence the outcome
•   Rigorous tests of statistical significance
•   Flexible modelling language
•   Generalisation of Chi-Squared Test
                      Data
• Data for individual labelled i=1…n:
  – Binary Response yi
  – Unphased Genotypes gij for markers j=1..m
    Coding Unphased Genotypes
• Several possibilities (see the sketch below):
   – AA, AG, GG original genotypes
   – 11, 12, 22
   – 1, 2, 3
   – 0, 1, 2 (number of G alleles)
• Missing Data
   – NA default in R
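• A minimal sketch of producing these codings in R, assuming the genotypes
  are stored as character strings AA/AG/GG (not part of the course's logistic.R):

  g <- c("AA", "AG", "GG", NA, "AG")

  # 0, 1, 2 coding: number of G alleles (additive coding); NAs stay NA
  g.additive <- ifelse(g == "AA", 0, ifelse(g == "AG", 1, 2))

  # 1, 2, 3 coding
  g.numeric <- ifelse(g == "AA", 1, ifelse(g == "AG", 2, 3))

  # one level per genotype (used later for the general genotype model)
  g.factor <- factor(g, levels = c("AA", "AG", "GG"))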
                         Using R
• Load genetic logistic regression tools
• > source('logistic.R')

• Read data table from file
  – > t <- read.table('geno.dat', header=TRUE)

• Column names
  – names(t)
  – t$y: the response (0/1)
  – t$m1, t$m2, …: genotypes for each marker
           Contingency Tables in R
• ftable(t$y,t$m31) prints the contingency table

  > ftable(t$y, t$m31)
         11  12  22
  0     515 387  75
  1      28  11   2
  >
       Chi-Squared Test in R

> chisq.test(t$y,t$m31)

        Pearson's Chi-squared test

data: t$y and t$m31
X-squared = 3.8424, df = 2, p-value = 0.1464

Warning message:
Chi-squared approximation may be incorrect in:
chisq.test(t$y, t$m31)
>
           The Logistic Model
• Effect of SNP k on phenotype
   – Prob(Yi=1) depends on the genotype at SNP k through the linear predictor
• Null model
   – Prob(Yi=1) does not depend on genotype (intercept only)
               The Logistic Model
• Prob(Yi=1) = exp(hi)/(1+exp(hi))
• hi = Σj xij bj – the Linear Predictor

• xij – Design Matrix (genotypes etc)
• bj – Model Parameters (to be estimated)

• Model is investigated by
   – estimating the bj’s by maximum likelihood
   – testing if the estimates are different from 0
          The Logistic Function
     Prob(Yi=1) = exp(hi)/(1+exp(hi))

[Plot: the logistic curve, Prob(Y=1) plotted against h]
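• A quick base-R sketch reproducing this curve:

  # plot the logistic function Prob(Y=1) = exp(h)/(1+exp(h))
  h <- seq(-6, 6, by = 0.1)
  p <- exp(h) / (1 + exp(h))        # equivalently plogis(h)
  plot(h, p, type = "l", xlab = "h", ylab = "Prob(Y=1)")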
Types of genetic effect at a single locus

                AA   AG   GG

   Recessive     0    0    1
   Dominant      0    1    1
   Additive      0    1    2
   Genotype      0    a    b
           Additive Genotype Model
• Code genotypes as
    – AA    x=0,
    – AG    x=1,
    – GG    x=2

• Linear Predictor
    – h = b0 + xb1

•   P(Y=1|x) = exp(b0 + xb1)/(1+exp(b0 + xb1))
•   PAA = P(Y=1|x=0) = exp(b0)/(1+exp(b0))
•   PAG = P(Y=1|x=1) = exp(b0 + b1)/(1+exp(b0 + b1))
•   PGG = P(Y=1|x=2) = exp(b0 + 2b1)/(1+exp(b0 + 2b1))
            Additive Model: b0 = -2, b1 = 2
            PAA = 0.12, PAG = 0.50, PGG = 0.88

[Plot: Prob(Y=1) against h under this model]
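• The quoted probabilities can be checked directly; a minimal sketch for
  b0 = -2, b1 = 2 under the additive coding x = 0, 1, 2 (AA, AG, GG):

  logistic <- function(h) exp(h) / (1 + exp(h))
  b0 <- -2; b1 <- 2
  x <- c(AA = 0, AG = 1, GG = 2)
  round(logistic(b0 + x * b1), 2)
  #   AA   AG   GG
  # 0.12 0.50 0.88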
            Additive Model: b0 = 0, b1 = 2
            PAA = 0.50, PAG = 0.88, PGG = 0.98

[Plot: Prob(Y=1) against h under this model]
                    Recessive Model
• Code genotypes as
   – AA    x=0,
   – AG    x=0,
   – GG    x=1


• Linear Predictor
   – h = b0 + xb1


• P(Y=1|x) = exp(b0 + xb1)/(1+exp(b0 + xb1))
• PAA = PAG = P(Y=1|x=0) = exp(b0)/(1+exp(b0))
• PGG = P(Y=1|x=1) = exp(b0 + b1)/(1+exp(b0 + b1))
            Recessive Model: b0 = 0, b1 = 2
            PAA = PAG = 0.50, PGG = 0.88

[Plot: Prob(Y=1) against h under this model]
                     Genotype Model
• Each genotype has an independent probability
• Code genotypes as (for example)
     – AA     x=0, z=0
     – AG     x=1, z=0
     – GG     x=0, z=1

• Linear Predictor
     – h = b0 + xb1 + zb2 (two genotype parameters)

•   P(Y=1|x) = exp(b0 + xb1+zb2)/(1+exp(b0 + xb1+zb2))
•   PAA = P(Y=1|x=0,z=0) = exp(b0)/(1+exp(b0))
•   PAG = P(Y=1|x=1,z=0) = exp(b0 + b1)/(1+exp(b0 + b1))
•   PGG = P(Y=1|x=0,z=1) = exp(b0 + b2)/(1+exp(b0 + b2))
            Genotype Model: b0 = 0, b1 = 2, b2 = -1
            PAA = 0.50, PAG = 0.88, PGG = 0.27

[Plot: Prob(Y=1) against h under this model]
                      Models in R
                (response y, genotype g)

             AA   AG   GG    model               DF

 Recessive    0    0    1    y ~ recessive(g)     1
 Dominant     0    1    1    y ~ dominant(g)      1
 Additive     0    1    2    y ~ additive(g)      1
 Genotype     0    a    b    y ~ genotype(g)      2
            Data Transformation
• g <- t$m1
• Use these functions to recode a genotype vector according to a genetic
  model (a sketch of possible definitions follows below):
  – a <- additive(g)
  – r <- recessive(g)
  – d <- dominant(g)
  – g <- genotype(g)
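• A guess at how these helpers might be defined (the real definitions are in
  the course's logistic.R and may differ); this sketch assumes g holds the
  number of G alleles (0, 1 or 2) per individual:

  additive  <- function(g) as.numeric(g)        # AA=0, AG=1, GG=2
  dominant  <- function(g) as.numeric(g >= 1)   # AA=0, AG=1, GG=1
  recessive <- function(g) as.numeric(g == 2)   # AA=0, AG=0, GG=1
  genotype  <- function(g) factor(g)            # one level per genotype (2 df)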
                   Fitting the Model
•   afit <- glm( t$y ~ additive(g),  family='binomial')
•   rfit <- glm( t$y ~ recessive(g), family='binomial')
•   dfit <- glm( t$y ~ dominant(g),  family='binomial')
•   gfit <- glm( t$y ~ genotype(g),  family='binomial')


• Equivalent models:
   – genotype = dominant + recessive
   – genotype = additive + recessive
   – genotype = additive + dominant
   – the genotype model is equivalent to the standard chi-squared test of
     genotype association (the sketch below checks the additive + dominant
     equivalence)
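• A sketch checking the additive + dominant equivalence above (assumes
  logistic.R is loaded and t is the data frame read earlier):

  g     <- t$m31
  gfit  <- glm(t$y ~ genotype(g),               family = 'binomial')
  adfit <- glm(t$y ~ additive(g) + dominant(g), family = 'binomial')
  c(deviance(gfit), deviance(adfit))   # residual deviances should agree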
                Parameter Estimates

> summary(glm( t$y ~ genotype(t$m31), family='binomial'))

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)
b0 (Intercept)         -2.9120     0.1941 -15.006   <2e-16 ***
b1 genotype(t$m31)12   -0.6486     0.3621  -1.791   0.0733 .
b2 genotype(t$m31)22   -0.7124     0.7423  -0.960   0.3372
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
>
                Analysis of Deviance
                 Chi-Squared Test

> anova(glm( t$y ~ genotype(t$m31), family='binomial'),
test="Chisq")
Analysis of Deviance Table

Model: binomial, link: logit

Response: t$y

Terms added sequentially (first to last)


                  Df Deviance Resid. Df Resid. Dev
NULL                               1017     343.71
genotype(t$m31)    2     3.96      1015     339.76
            Model Comparison
• Compare general model with additive,
  dominant or recessive models:
 > afit <- glm(t$y ~ additive(t$m20))
 > gfit <- glm(t$y ~ genotype(t$m20))
 > anova(afit,gfit)
 Analysis of Deviance Table

 Model 1: t$y ~ additive(t$m20)
 Model 2: t$y ~ genotype(t$m20)
   Resid. Df Resid. Dev   Df Deviance
 1      1016     38.301
 2      1015     38.124    1    0.177
 >
     Scanning all Markers
> logscan(t, model='additive')
        Deviance DF         Pval     LogPval
m1 8.604197e+00 1 3.353893e-03 2.474450800
m2 7.037336e+00 1 7.982767e-03 2.097846522
m3 6.603882e-01 1 4.164229e-01 0.380465360
m4 3.812860e+00 1 5.086054e-02 1.293619014
m5 7.194936e+00 1 7.310960e-03 2.136025588
m6 2.449127e+00 1 1.175903e-01 0.929628598
m7 2.185613e+00 1 1.393056e-01 0.856031566
m8 1.227191e+00 1 2.679539e-01 0.571939852
m9 2.532562e+01 1 4.842353e-07 6.314943565
m10 5.729634e+01 1 3.748518e-14 13.426140380
m11 3.107441e+01 1 2.483233e-08 7.604982503
…
…
…
            Multilocus Models
• Can test the effects of fitting two or more
  markers simultaneously
• Several multilocus models are possible
• Interaction Model assumes that each
  combination of genotypes has a different
  effect
• e.g. t$y ~ t$m10 * t$m15
                 Multi-Locus Models
> f <- glm( t$y ~ genotype(t$m13) * genotype(t$m26), family='binomial')
> anova(f)
Analysis of Deviance Table

Model: binomial, link: logit

Response: t$y

Terms added sequentially (first to last)


                                      Df Deviance Resid. Df Resid. Dev
NULL                                                   1017     343.71
genotype(t$m13)                        2   108.68      1015     235.03
genotype(t$m26)                        2     1.14      1013     233.89
genotype(t$m13):genotype(t$m26)        3     6.03      1010     227.86

> pchisq(6.03, 3, lower.tail=F)   # calculate p-value for the interaction term
[1] 0.1101597
        Adding the effects of Sex and other
                    Covariates
• Read in sex and other covariate data (e.g. age) from a file into
  variables, say a$sex, a$age
• Fit models of the form
   – fit1 <- glm(t$y ~ additive(t$m10) + a$sex + a$age, family='binomial')
   – fit2 <- glm(t$y ~ a$sex + a$age, family='binomial')
     Adding the effects of Sex and other
                 Covariates
• Compare models using anova – test if the effect of
  the marker m10 is significant after taking into
  account sex and age
• anova(fit1,fit2)
                    Random Effects
                     Mixed Models
• So far our models have had fixed effects
   – each parameter can take any value independently of the others
   – each parameter estimate uses up a degree of freedom
   – models with large numbers of parameters have problems
      • saturation (overfitting)
      • poor predictive properties
      • numerically unstable and difficult to fit
• In some cases we can instead treat parameters as being sampled from a
  distribution – random effects
   – only estimate the parameters needed to specify that distribution
   – can result in a more robust model
               Example of a Mixed Model
•   Testing genetic association across a large number of families
     –   yi = trait value of the i'th individual
     –   bi = effect of individual i's genotype at the SNP of interest
     –   fi = family identifier (factor)
     –   ei = error, variance s2

     –   y = m + b + f + e

     –   If we treat the family effects f as fixed then we must estimate a large number of parameters
     –   Better to think of these effects as being sampled from a distribution N(0, t2), where the
         variance t2 must be estimated
           •   individuals from the same family share the same f and so are correlated (cov = t2)
           •   individuals from different families are uncorrelated

     –   The genotype parameters b are still treated as fixed effects
           •   hence a mixed model
                      Mixed Models
• y=m+b+f+e

• Fixed effects model
   – E(y) = m + b + f
   – Var(y) = Is2
      • I is identity matrix


• Mixed model
   – E(y) = m + b
   – Var(y) = Is2 + Ft2
      • F is a matrix with Fij = 1 if i and j are in the same family, 0 otherwise
      • Need to estimate both s2 and t2
         Generalised Least Squares
• y = Xb + e
• Var(y) = V (symmetric covariance matrix)
        • V = Is2 (uncorrelated errors)
        • V = Is2 + Ft2 (single grouping random effect)
        • V = Is2 + Gt2 (G = genotype identity by state)
• GLS solution, if V is known (sketched in R below):

       b̂ = (XT V-1 X)-1 XT V-1 y

•   unbiased,
•   efficient (minimum variance)
•   consistent (tends to true value in large samples)
•   asymptotically normally distributed
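• A minimal sketch of the GLS estimate in R with a known V = Is2 + Ft2
  (toy simulated data, not from the course):

  set.seed(1)
  n     <- 20
  fam   <- rep(1:4, each = 5)               # grouping factor, e.g. families
  Fmat  <- outer(fam, fam, "==") * 1        # Fij = 1 if i and j share a family
  V     <- diag(n) * 1.0 + Fmat * 0.5       # V = I s2 + F t2, with s2 = 1, t2 = 0.5
  X     <- cbind(1, rnorm(n))               # design matrix: intercept + one covariate
  y     <- X %*% c(1, 2) + rnorm(n)         # toy response
  Vinv  <- solve(V)
  b.hat <- solve(t(X) %*% Vinv %*% X, t(X) %*% Vinv %*% y)
  b.hat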
   Multivariate Normal Distribution
• Joint distribution of a vector of correlated observations
• Another way of thinking about the data
• Estimation of parameters in a mixed model is special case of likelihood
  analysis of a MVN
• y ~ MVN(m, V)
    – m is vector of expected values
    – V is variance-covariance matrix
    – If V = Is2 then elements of y are uncorrelated and equivalent to n
      independent Normal variables
    – probability density/likelihood is

         L(y | m, V) = exp( -(1/2) (y - m)' V^-1 (y - m) ) / ( (2π)^(n/2) |V|^(1/2) )
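• A sketch computing this log-likelihood directly in base R (the density
  above, on the log scale):

  mvn.loglik <- function(y, m, V) {
    n <- length(y)
    r <- y - m
    as.numeric(-0.5 * (n * log(2 * pi) +
                       determinant(V, logarithm = TRUE)$modulus +
                       t(r) %*% solve(V) %*% r))
  }

  # when V = I the result equals the sum of independent Normal log-densities
  y <- rnorm(5)
  mvn.loglik(y, rep(0, 5), diag(5))
  sum(dnorm(y, log = TRUE))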
    Henderson’s Mixed Model Equations
•   General linear mixed model.
                                     y = Xb + Zu + e
•   b are fixed effects
•   u are random effects
•   X, Z design matrices

       u ~ MVN(0, G)    (G = It2)
       e ~ MVN(0, R)    (R = Is2)
       y ~ MVN(Xb, V)
       V = ZGZT + R

•   The estimates b̂ and û satisfy

       | XT R-1 X    XT R-1 Z       |  | b̂ |     | XT R-1 y |
       | ZT R-1 X    ZT R-1 Z + G-1 |  | û |  =  | ZT R-1 y |
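• A sketch solving these equations directly for b̂ and û with toy dimensions,
  assuming R = Is2 and G = It2 are known:

  set.seed(2)
  n <- 12; q <- 3                              # n observations, q random-effect levels
  grp <- factor(rep(1:q, each = n / q))
  X <- cbind(1, rnorm(n))                      # fixed-effects design matrix
  Z <- model.matrix(~ 0 + grp)                 # random-effects design matrix
  R <- diag(n) * 1.0                           # residual covariance, s2 = 1
  G <- diag(q) * 0.5                           # random-effect covariance, t2 = 0.5
  y <- rnorm(n)                                # toy response

  Rinv <- solve(R); Ginv <- solve(G)
  LHS <- rbind(cbind(t(X) %*% Rinv %*% X, t(X) %*% Rinv %*% Z),
               cbind(t(Z) %*% Rinv %*% X, t(Z) %*% Rinv %*% Z + Ginv))
  RHS <- rbind(t(X) %*% Rinv %*% y, t(Z) %*% Rinv %*% y)
  solve(LHS, RHS)                              # first 2 rows: b.hat, last q rows: u.hat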
                   Mixed Models

• Fixed effects models are special cases of mixed models

• Mixed models are sometimes more powerful (fewer parameters to estimate)
   – But check that the assumption that the effects are sampled from a
     Normal distribution is reasonable
   – Parameter estimates and statistical significance can differ between
     fixed and mixed models

• Shrinkage: random effect estimates from a mixed model often have smaller
  magnitudes than the corresponding fixed effects estimates (see the sketch
  below).
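• A rough illustration of shrinkage on simulated data (a sketch using lme4,
  not from the course materials):

  library(lme4)
  set.seed(3)
  fam <- factor(rep(1:10, each = 5))
  y   <- rnorm(10)[fam] + rnorm(50)                  # true family effects + noise
  fixed <- coef(lm(y ~ 0 + fam))                     # one fixed effect per family
  mixed <- ranef(lmer(y ~ 1 + (1 | fam)))$fam[, 1]   # shrunken random effects (BLUPs)
  cbind(fixed = fixed - mean(fixed), random = mixed) # random column has smaller spread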
                        Mixed Models in R
• Several R packages are available, e.g.
    – lme4: a general purpose mixed models package
    – emma: for genetic association in structured populations

• Formulas in lme4, e.g. a test for association between genotype and trait:
   •    H0: E(y) = m
   •    vs
   •    H1: E(y) = m + b


   •    Var(y) = Is2 + Ft2


    –   h0 = lmer(y ~ 1 +(1|family), data=data)
    –   h1 = lmer(y ~ genotype +(1|family), data=data)
    –   anova(h0,h1)
           Formulas in lmer()
• y ~ fixed.effects + (1|Group1) + (1|Group2)
  etc
  – random intercept models
• y ~ fixed.effects + (1 + x|Group1)
  – random slope (and intercept) model
                             Example: HDL Data
> library(lme4)
> b=read.delim("Biochemistry.txt")
> cc=complete.cases(b$Biochem.HDL)

> f0=lm(Biochem.HDL ~ Family , data=b,subset=cc)
> f1=lmer(Biochem.HDL ~ (1|Family) , data=b,subset=cc)

> anova(f0)
Analysis of Variance Table

Response: Biochem.HDL
            Df Sum Sq Mean Sq F value   Pr(>F)
Family     175 147.11 0.84064 5.2098 < 2.2e-16 ***
Residuals 1681 271.24 0.16136
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> f1
Linear mixed model fit by REML
Formula: Biochem.HDL ~ (1 | Family)
   Data: b
 Subset: cc
  AIC BIC logLik deviance REMLdev
 2161 2178 -1078       2149    2155
Random effects:
 Groups    Name        Variance Std.Dev.
 Family    (Intercept) 0.066633 0.25813
 Residual              0.161487 0.40185
Number of obs: 1857, groups: Family, 176

Fixed effects:
            Estimate Std. Error t value
(Intercept) 1.60615     0.02263   70.97
> plot(resid(f0),resid(f1))

				