                  Linear Modelling III

                     Richard Mott
        Wellcome Trust Centre for Human Genetics
                Synopsis
• Comparing non-nested models
• Building Models Automatically
• Mixed Models
     Comparing Non-nested Models
•   There is no equivalent of the partial F test for comparing non-nested models.
•   The main issue is how to compensate for the fact that a model with more
    parameters will tend to fit the data better (in the sense of explaining more
    variance) than a model with fewer parameters.
•   But a model with many parameters tends to be a poorer predictor of new
    observations, because of over-fitting.
•   Several criteria have therefore been proposed for model comparison.
•   AIC (Akaike Information Criterion)

         AIC = 2p - 2 log L

    where L is the maximized value of the likelihood function for the estimated
    model and p is the number of parameters.
    For linear models, up to an additive constant, this simplifies to

         AIC = n log(RSS/n) + 2p

    where n is the number of observations and RSS is the residual sum of squares.
    Among models fitted to the same n observations, choose the model which
    minimises this simplified AIC.
• BIC (Bayesian Information Criterion)

       BIC = p log n - 2 log L

  or, for linear models up to an additive constant, BIC = n log(RSS/n) + p log n.

• The BIC penalizes p more strongly than the AIC
   - ie it prefers models with fewer parameters
• In R:
   f <- lm(formula, data)
   aic <- AIC(f)                       # AIC, with the default penalty k = 2
   bic <- AIC(f, k = log(nrow(data)))  # BIC: set the penalty k to log(n)
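For example, a minimal sketch of comparing two non-nested models of the same response; the data frame d and the variables x1, x2 and x3 are hypothetical:

   f1 <- lm(y ~ x1 + x2, data = d)     # one candidate model
   f2 <- lm(y ~ x3, data = d)          # a non-nested alternative
   c(AIC(f1), AIC(f2))                 # prefer the model with the smaller AIC
   n <- nrow(d)
   c(AIC(f1, k = log(n)), AIC(f2, k = log(n)))   # the same comparison with BIC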
         Building Models Automatically
•   Example:
     –   We wish to find a set of SNPs that jointly explain variation in a quantitative trait
     –   There will usually be far more SNPs s than data points n
     –   There is a vast number, 2^s, of possible models
     –   No model can contain more than n-1 parameters (model saturation)
     –   Forward Selection:
           •   Start with a small model and augment it step by step, so long as the improvement in fit
               satisfies a criterion (eg AIC or BIC, or a partial F test)
           •   At each step, add the variable which maximises the improvement in fit
     –   Backwards Elimination:
           •   Start with a very large model and remove terms from it step by step
           •   At each step, delete the variable that minimises the decrease in fit
     –   Forward-Backward:
           •   At each step, either add or delete a term, depending on which optimises the criterion
           •   In R, stepAIC (see the sketch after this list)
     –   Model Averaging:
           •   Rather than trying to find a single model, integrate over many plausible models
                 –   Bootstrapping
                 –   Bayesian Model Averaging
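A minimal sketch of forward-backward selection with stepAIC() from the MASS package; the data frame snps and the variables trait and snp1-snp3 are hypothetical:

   library(MASS)

   null <- lm(trait ~ 1, data = snps)                   # intercept-only model
   full <- lm(trait ~ snp1 + snp2 + snp3, data = snps)  # largest model considered

   # Search between null and full, adding or deleting one term per step;
   # k = 2 penalises by AIC, k = log(nrow(snps)) would penalise by BIC.
   best <- stepAIC(null, scope = list(lower = null, upper = full),
                   direction = "both", k = 2)
   summary(best)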
          Random Effects and Mixed Models
• So far our models have had fixed effects
   – each parameter can take any value independently of the others
   – each parameter estimate uses up a degree of freedom
   – models with large numbers of parameters have problems
      • saturation - overfitting
      • poor predictive properties
      • numerically unstable and difficult to fit
• In some cases we can treat parameters as being sampled from a
  distribution – random effects
   – we then only estimate the parameters needed to specify that distribution
   – this can result in a more robust model
               Example of a Mixed Model
•   Testing genetic association across a large number of families
     –   yi = trait value of the i'th individual
     –   bi = genotype of the individual at the SNP of interest
     –   fi = family identifier (factor)
     –   ei = error, with variance s2

     – y = m + b + f + e

     – If we treat the family effects f as fixed then we must estimate a large number of parameters
     – Better to think of these effects as having a distribution N(0, t2), for some variance t2 which
       must be estimated
           •   individuals from the same family have the same f and are correlated
                  –   cov = t2
           •   individuals from different families are uncorrelated

     – the genotype parameters b are still treated as fixed effects
           •   a mixed model
                      Mixed Models
• y=m+b+f+e

• Fixed effects model
   – E(y) = m + b + f
   – Var(y) = Is2
      • I is identity matrix


• Mixed model
   – E(y) = m + b
   – Var(y) = Is2 + Ft2
      • F is a matrix with Fij = 1 if i and j are in the same family, 0 otherwise
      • Need to estimate both s2 and t2
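A minimal sketch of this covariance structure in R; the family factor and the variance components are made up for illustration:

   family <- factor(c("A", "A", "B", "B", "B"))
   F <- 1 * outer(family, family, "==")     # Fij = 1 if i, j in the same family
   s2 <- 0.16                               # residual variance (made up)
   t2 <- 0.07                               # family variance (made up)
   V <- diag(length(family)) * s2 + F * t2  # Var(y) = I s2 + F t2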
         Generalised Least Squares
• y = Xb + e
• Var(y) = V (symmetric covariance matrix)
        • V = Is2 (uncorrelated errors)
        • V = Is2 + Ft2 (single grouping random effect)
        • V = Is2 + Gt2 (G = genotype identity by state)
• GLS solution, if V is known:

       b̂ = (X' V⁻¹ X)⁻¹ X' V⁻¹ y

• The GLS estimate is
   – unbiased
   – efficient (minimum variance)
   – consistent (tends to the true value in large samples)
   – asymptotically normally distributed
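A minimal sketch of this estimate in R, for hypothetical X, y and a known covariance matrix V:

   gls.beta <- function(X, y, V) {
      Vi <- solve(V)                               # V^-1
      solve(t(X) %*% Vi %*% X, t(X) %*% Vi %*% y)  # (X'V^-1X)^-1 X'V^-1y
   }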
   Multivariate Normal Distribution
• Joint distribution of a vector of correlated observations
• Another way of thinking about the data
• Estimation of parameters in a mixed model is special case of likelihood
  analysis of a MVN
• y ~ MVN(m, V)
    – m is vector of expected values
    – V is variance-covariance matrix
    – If V = Is2 then elements of y are uncorrelated and equivalent to n
      independent Normal variables
    – the probability density/likelihood is

           L(y | m, V) = exp( -(1/2) (y - m)' V⁻¹ (y - m) ) / ( (2π)^(n/2) |V|^(1/2) )
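A minimal sketch of the corresponding log-likelihood in R, for hypothetical y, m and V:

   mvn.loglik <- function(y, m, V) {
      n <- length(y)
      r <- y - m
      ld <- as.numeric(determinant(V, logarithm = TRUE)$modulus)   # log |V|
      -0.5 * (n * log(2 * pi) + ld + drop(t(r) %*% solve(V, r)))
   }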
    Henderson’s Mixed Model Equations
•   General linear mixed model:

         y = Xb + Zu + e

•   b are fixed effects
•   u are random effects
•   X, Z are design matrices

         u ~ MVN(0, G)   (eg G = I t2)
         e ~ MVN(0, R)   (eg R = I s2)
         y ~ MVN(Xb, V),  V = Z G Z' + R

•   Henderson's equations for the estimates b̂ and û:

         | X'R⁻¹X     X'R⁻¹Z        | | b̂ |   | X'R⁻¹y |
         | Z'R⁻¹X     Z'R⁻¹Z + G⁻¹  | | û | = | Z'R⁻¹y |
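A minimal sketch solving these equations in R for hypothetical X, Z and y, with R = I s2 and G = I t2 and the variance components assumed known:

   henderson <- function(X, Z, y, s2, t2) {
      Ri <- diag(nrow(X)) / s2    # R^-1
      Gi <- diag(ncol(Z)) / t2    # G^-1
      lhs <- rbind(cbind(t(X) %*% Ri %*% X, t(X) %*% Ri %*% Z),
                   cbind(t(Z) %*% Ri %*% X, t(Z) %*% Ri %*% Z + Gi))
      rhs <- rbind(t(X) %*% Ri %*% y, t(Z) %*% Ri %*% y)
      solve(lhs, rhs)             # the stacked estimates (b̂, û)
   }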
                   Mixed Models

• Fixed effects models are special cases of mixed models

• Mixed models are sometimes more powerful (as they have fewer
  parameters)
   – But check that the assumption that the effects are sampled from
     a Normal distribution is reasonable
   – Parameter estimates and statistical significance can differ
     between fixed and mixed model fits

• Shrinkage: random effect estimates from a mixed model often have
  smaller magnitudes than the corresponding fixed effects estimates.
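A hedged sketch of one way to see this shrinkage, assuming a hypothetical data frame d with columns y and family:

   library(lme4)
   fixed <- lm(y ~ family - 1, data = d)          # one fixed effect per family
   mixed <- lmer(y ~ 1 + (1 | family), data = d)  # family as a random effect
   blup <- fixef(mixed)[1] + ranef(mixed)$family[, 1]  # predicted family means
   plot(coef(fixed), blup)    # the BLUPs are shrunk towards the overall mean
   abline(0, 1)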
                        Mixed Models in R
• Several R packages are available, eg
    – lme4: a general purpose mixed models package
    – emma: for genetic association in structured populations

• Formulas in lme4, eg a test for association between genotype and trait:
   •    H0: E(y) = m
   •    vs
   •    H1: E(y) = m + b

   •    Var(y) = Is2 + Ft2

    –   h0 = lmer(y ~ 1 + (1|family), data=data)
    –   h1 = lmer(y ~ genotype + (1|family), data=data)
    –   anova(h0, h1)
           Formulas in lmer()
• y ~ fixed.effects + (1|Group1) + (1|Group2) etc
  – random intercept models
• y ~ fixed.effects + (1 + x|Group1)
  – random slope model (the intercept and the slope in x both vary by group)
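For instance, a hypothetical random-slope fit:

   library(lme4)
   m <- lmer(y ~ x + (1 + x | Group1), data = d)  # d, x, Group1 hypothetical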
                             Example: HDL Data
> library(lme4)
> b=read.delim("Biochemistry.txt")
> cc=complete.cases(b$Biochem.HDL)

> f0=lm(Biochem.HDL ~ Family , data=b,subset=cc)
> f1=lmer(Biochem.HDL ~ (1|Family) , data=b,subset=cc)

> anova(f0)
Analysis of Variance Table

Response: Biochem.HDL
            Df Sum Sq Mean Sq F value   Pr(>F)
Family     175 147.11 0.84064 5.2098 < 2.2e-16 ***
Residuals 1681 271.24 0.16136
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> f1
Linear mixed model fit by REML
Formula: Biochem.HDL ~ (1 | Family)
   Data: b
 Subset: cc
  AIC BIC logLik deviance REMLdev
 2161 2178 -1078       2149    2155
Random effects:
 Groups    Name        Variance Std.Dev.
 Family    (Intercept) 0.066633 0.25813
 Residual              0.161487 0.40185
Number of obs: 1857, groups: Family, 176

Fixed effects:
            Estimate Std. Error t value
(Intercept) 1.60615     0.02263   70.97
> plot(resid(f0),resid(f1))
Some Essential R tips
Comparing Models containing Missing Data
  – R will silently omit rows of data containing missing elements, and adjust the df
    accordingly

  – R will only compare models using anova() if the models have been fitted to
    identical observations.

  – Sporadic missing values in the explanatory variables can cause problems,
    because the models may have different numbers of complete cases

  – The solution is to use the R function complete.cases() to identify the rows with
    complete data in the most general model, and to specify these rows when
    modelling:

           cc <- complete.cases(data.frame)
           f <- lm(formula, data, subset=cc)
Joining two data frames on a common column
 – Frequently data to be analysed are in several data frames, with a column of unique subject ids
   used to match the rows.
 – Unfortunately, often the subjects differ slightly between data frames, eg frame1 might contain
   phenotype data and frame2 contain genotypes. Usually these two sets don’t agree perfectly
   because there will be subjects with genotypes but no phenotypes and vice versa

 – Solution 1: Make a combined data frame containing the intersection of the subjects

      intersected <- sort(intersect(frame2$id, frame1$id))

      combined <- cbind(frame1[match(intersected, frame1$id), cols1],
         frame2[match(intersected, frame2$id), cols2])

 – Solution 2: Make a combined data frame containing the union of the subjects

      all.ids <- unique(sort(c(as.character(frame1$id), as.character(frame2$id))))

      combined <- cbind(frame1[match(all.ids, frame1$id), cols1],
         frame2[match(all.ids, frame2$id), cols2])

 – cols1 is a vector of the column names to include from frame1
 – cols2 is a vector of the column names to include from frame2
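The same joins can also be written with R's built-in merge(), assuming the shared column is named id in both frames:

      combined.intersection <- merge(frame1, frame2, by = "id")
      combined.union <- merge(frame1, frame2, by = "id", all = TRUE)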
