# Linear Modelling III

Richard Mott
Wellcome Trust Centre for Human Genetics

## Synopsis
• Comparing non-nested models
• Building models automatically
• Mixed models

## Comparing Non-Nested Models
• There is no equivalent of the partial F test for comparing non-nested models.
• The main issue is how to compensate for the fact that a model with more parameters will tend to fit the data better (in the sense of explaining more variance) than a model with fewer parameters.
• But a model with many parameters tends to be a poorer predictor of new observations, because of over-fitting.
• Nevertheless, several criteria have been proposed for model comparison.
• AIC (Akaike Information Criterion):

  AIC = 2p − 2 log(L)

  where L is the maximised value of the likelihood function for the estimated model and p is the number of parameters.
• For linear models this is equivalent (up to an additive constant) to the simplified form

  AIC = n log(RSS/n) + 2p

  where n is the number of observations and RSS is the residual sum of squares. Among models fitted to the same n observations, choose the model which minimises the AIC.
• BIC (Bayesian Information Criterion):

  BIC = p log(n) − 2 log(L)

• The BIC penalises p more strongly than the AIC, i.e. it prefers models with fewer parameters.
• In R:

  f <- lm(formula, data)
  aic <- AIC(f)
  bic <- AIC(f, k = log(nrow(data)))
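As a concrete check of these formulas, here is a short base-R sketch comparing `AIC()` with the hand-computed value; the data are simulated, so everything is purely illustrative. Note that R's `AIC()` uses the full Normal likelihood (including the 2π constant) and counts the error variance as a parameter:

```r
# Sketch: computing AIC/BIC by hand for a linear model and comparing
# with R's AIC(); data are simulated, so the numbers are illustrative.
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
f <- lm(y ~ x)

rss <- sum(resid(f)^2)
p <- length(coef(f)) + 1   # +1 because the error variance is also estimated

# Full-likelihood AIC, as computed by R's AIC()
aic.manual <- n * (log(2 * pi) + log(rss / n) + 1) + 2 * p
aic.r <- AIC(f)

# BIC: same likelihood, but a penalty of log(n) per parameter
bic.manual <- n * (log(2 * pi) + log(rss / n) + 1) + log(n) * p
bic.r <- AIC(f, k = log(n))
```

The two hand-computed values should agree with `AIC()` exactly; only differences between models matter, so the constant terms cancel in any comparison.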
## Building Models Automatically
• Example:
  – We wish to find a set of SNPs that jointly explain variation in a quantitative trait.
  – There will usually be far more SNPs s than data points n.
  – There are a vast number, 2^s, of possible models.
  – No model can contain more than n−1 parameters (model saturation).
• Forward selection:
  – Start with a small model and augment it step by step, so long as the improvement in fit satisfies a criterion (e.g. AIC, BIC, or the partial F test).
  – At each step, add the variable which maximises the improvement in fit.
• Backward elimination:
  – Start with a very large model and remove terms step by step.
  – At each step, delete the variable that minimises the decrease in fit.
• Forward-backward:
  – At each step, either add or delete a term, depending on which optimises the criterion.
  – In R: stepAIC (package MASS).
• Model averaging:
  – Rather than try to find a single model, integrate over many plausible models:
    • Bootstrapping
    • Bayesian model averaging
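The forward-selection loop can be sketched in a few lines of base R. This is a minimal illustration with simulated SNPs and invented variable names; `stepAIC` in MASS is the proper tool for real analyses:

```r
# Sketch: forward selection by AIC. Simulated genotypes; snp3 and snp7
# carry real effects, the rest are noise.
set.seed(2)
n <- 200
snps <- as.data.frame(matrix(rbinom(n * 10, 2, 0.3), n, 10))
names(snps) <- paste0("snp", 1:10)
y <- 1 + 1.5 * snps$snp3 - snps$snp7 + rnorm(n)
data <- cbind(y = y, snps)

current <- lm(y ~ 1, data = data)    # start from the null model
candidates <- names(snps)
repeat {
  # AIC of each one-variable extension of the current model
  aics <- sapply(candidates, function(v)
    AIC(update(current, as.formula(paste(". ~ . +", v)))))
  if (length(aics) == 0 || min(aics) >= AIC(current)) break  # no improvement
  best <- candidates[which.min(aics)]
  current <- update(current, as.formula(paste(". ~ . +", best)))
  candidates <- setdiff(candidates, best)
}
selected <- attr(terms(current), "term.labels")
```

With effects this strong, the loop should recover snp3 and snp7, though greedy selection in general offers no guarantee of finding the best model.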
## Random Effects and Mixed Models
• So far our models have had only fixed effects:
  – each parameter can take any value independently of the others
  – each parameter estimate uses up a degree of freedom
  – models with large numbers of parameters have problems:
    • saturation and over-fitting
    • poor predictive properties
    • numerical instability, making them difficult to fit
• In some cases we can treat parameters as being sampled from a distribution – random effects:
  – we only estimate the parameters needed to specify that distribution
  – this can result in a more robust model
## Example of a Mixed Model
• Testing genetic association across a large number of families:
  – yi = trait value of the i'th individual
  – bi = genotype of the individual at the SNP of interest
  – fi = family identifier (factor)
  – ei = error, with variance s²

  y = m + b + f + e

• If we treat the family effects f as fixed, we must estimate a large number of parameters.
• It is better to think of these effects as being drawn from a distribution N(0, t²), for some variance t² which must be estimated:
  – individuals from the same family share the same f and are therefore correlated: cov = t²
  – individuals from different families are uncorrelated
• The genotype parameters b are still treated as fixed effects – a mixed model.
## Mixed Models
• y = m + b + f + e
• Fixed effects model:
  – E(y) = m + b + f
  – Var(y) = Is², where I is the identity matrix
• Mixed model:
  – E(y) = m + b
  – Var(y) = Is² + Ft², where F is the matrix with Fij = 1 if individuals i and j are in the same family, and 0 otherwise
  – We need to estimate both s² and t²
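One way to see this covariance structure is to build F and Var(y) explicitly for a toy grouping; the family labels and variance values below are invented for illustration:

```r
# Sketch: the family indicator matrix F and Var(y) = I*s2 + F*t2
# for a toy grouping of 6 individuals in 3 families.
family <- c("A", "A", "B", "B", "B", "C")
n <- length(family)
Fmat <- outer(family, family, "==") * 1  # Fij = 1 if i, j share a family
s2 <- 0.16                               # residual variance (invented)
t2 <- 0.07                               # between-family variance (invented)
V <- diag(n) * s2 + Fmat * t2
```

The diagonal of V is s² + t², off-diagonal entries are t² within a family and 0 between families, exactly the correlation structure described above.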
## Generalised Least Squares
• y = Xb + e
• Var(y) = V (symmetric covariance matrix):
  – V = Is² (uncorrelated errors)
  – V = Is² + Ft² (single grouping random effect)
  – V = Is² + Gt² (G = genotype identity-by-state matrix)
• GLS solution, if V is known:

  b̂ = (XᵀV⁻¹X)⁻¹ XᵀV⁻¹y

• The GLS estimator is:
  – unbiased
  – efficient (minimum variance)
  – consistent (tends to the true value in large samples)
  – asymptotically normally distributed
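The GLS formula is easy to check directly in base R: with V = I it must reduce to ordinary least squares. The data here are simulated purely for the comparison:

```r
# Sketch: GLS estimate (X'V^-1 X)^-1 X'V^-1 y; with V = I it reproduces OLS.
set.seed(3)
n <- 50
X <- cbind(1, rnorm(n))        # design matrix: intercept + one covariate
y <- X %*% c(1, 2) + rnorm(n)
V <- diag(n)                   # uncorrelated errors with unit variance

Vinv <- solve(V)
b.gls <- solve(t(X) %*% Vinv %*% X, t(X) %*% Vinv %*% y)
b.ols <- coef(lm(y ~ X[, 2]))
```

For a non-identity V (e.g. the family structure above) the two estimates differ, and GLS is the efficient one.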
## Multivariate Normal Distribution
• The joint distribution of a vector of correlated observations – another way of thinking about the data.
• Estimation of the parameters of a mixed model is a special case of likelihood analysis of an MVN.
• y ~ MVN(m, V):
  – m is the vector of expected values
  – V is the variance-covariance matrix
  – If V = Is², the elements of y are uncorrelated and equivalent to n independent Normal variables.
• The probability density / likelihood is

  L(y | m, V) = (2π)^(−n/2) |V|^(−1/2) exp( −½ (y − m)ᵀ V⁻¹ (y − m) )
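The MVN log-likelihood can be coded directly; with V = Is² it must agree with the sum of independent Normal log-densities, which makes a convenient sanity check (toy values throughout):

```r
# Sketch: MVN log-likelihood; for diagonal V it matches sum of dnorm(log=TRUE).
mvn.loglik <- function(y, m, V) {
  n <- length(y)
  r <- y - m
  as.numeric(-0.5 * (n * log(2 * pi) +
                     determinant(V, logarithm = TRUE)$modulus +
                     t(r) %*% solve(V) %*% r))
}

set.seed(4)
y <- rnorm(5, mean = 2)
s2 <- 0.5
ll.mvn <- mvn.loglik(y, rep(2, 5), diag(5) * s2)
ll.indep <- sum(dnorm(y, mean = 2, sd = sqrt(s2), log = TRUE))
```

Mixed-model fitting amounts to maximising this likelihood over both the mean parameters in m and the variance components inside V.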
## Henderson's Mixed Model Equations
• The general linear mixed model:

  y = Xb + Zu + e

  – b are the fixed effects
  – u are the random effects
  – X, Z are design matrices

  u ~ MVN(0, G)   (e.g. G = It²)
  e ~ MVN(0, R)   (e.g. R = Is²)
  y ~ MVN(Xb, V), where V = ZGZᵀ + R

• The estimates b̂, û solve Henderson's mixed model equations:

  [ XᵀR⁻¹X   XᵀR⁻¹Z        ] [ b̂ ]   [ XᵀR⁻¹y ]
  [ ZᵀR⁻¹X   ZᵀR⁻¹Z + G⁻¹  ] [ û ] = [ ZᵀR⁻¹y ]
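Henderson's equations can be solved with base linear algebra for a toy one-way family layout. As a check, the fixed-effect solutions should match GLS with V = ZGZᵀ + R. The variances are assumed known here, which a real analysis would instead estimate (e.g. by REML):

```r
# Sketch: Henderson's MME for a toy design; fixed effects checked against GLS.
set.seed(5)
fam <- factor(rep(1:4, each = 5))
n <- length(fam)
X <- cbind(1, rnorm(n))            # fixed-effects design
Z <- model.matrix(~ fam - 1)       # random-effects design: one column per family
s2 <- 1; t2 <- 0.5                 # variances assumed known for this sketch
u <- rnorm(4, sd = sqrt(t2))
y <- X %*% c(1, 2) + Z %*% u + rnorm(n, sd = sqrt(s2))

R <- diag(n) * s2
G <- diag(4) * t2
Rinv <- solve(R)

# Henderson's coefficient matrix and right-hand side
LHS <- rbind(cbind(t(X) %*% Rinv %*% X, t(X) %*% Rinv %*% Z),
             cbind(t(Z) %*% Rinv %*% X, t(Z) %*% Rinv %*% Z + solve(G)))
RHS <- rbind(t(X) %*% Rinv %*% y, t(Z) %*% Rinv %*% y)
sol <- solve(LHS, RHS)
b.hat <- sol[1:2]                  # fixed-effect estimates
u.hat <- sol[3:6]                  # BLUPs of the family effects

# GLS on V = ZGZ' + R must give the same fixed effects
V <- Z %*% G %*% t(Z) + R
b.gls <- solve(t(X) %*% solve(V) %*% X, t(X) %*% solve(V) %*% y)
```

The û here are the shrunken (BLUP) family effects discussed below; they are pulled towards zero relative to treating family as a fixed factor.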
## Mixed Models: Remarks

• Fixed effects models are special cases of mixed models.

• Mixed models are sometimes more powerful, as they have fewer parameters:
  – but check that the assumption that the effects are sampled from a Normal distribution is reasonable
  – expect differences, both in the estimated parameters and in statistical significance, between fixed and mixed models

• Shrinkage: random effect estimates from a mixed model often have smaller magnitudes than the corresponding estimates from a fixed effects model.
## Mixed Models in R
• Several R packages are available, e.g.:
  – lme4: a general-purpose mixed models package
  – emma: for genetic association in structured populations

• Formulas in lme4, e.g. a test for association between genotype and trait:
  – H0: E(y) = m
  – H1: E(y) = m + b
  – Var(y) = Is² + Ft²

  h0 = lmer(y ~ 1 + (1|family), data=data)
  h1 = lmer(y ~ genotype + (1|family), data=data)
  anova(h0, h1)
## Formulas in lmer()
• y ~ fixed.effects + (1|Group1) + (1|Group2) etc.
  – random intercept models
• y ~ fixed.effects + (1|Group1) + (x|Group1)
  – random slope model
## Example: HDL Data
> library(lme4)
> cc = complete.cases(b$Biochem.HDL)

> f0 = lm(Biochem.HDL ~ Family, data=b, subset=cc)
> f1 = lmer(Biochem.HDL ~ (1|Family), data=b, subset=cc)
> anova(f0)
Analysis of Variance Table

Response: Biochem.HDL
            Df Sum Sq Mean Sq F value    Pr(>F)
Family     175 147.11 0.84064  5.2098 < 2.2e-16 ***
Residuals 1681 271.24 0.16136
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> f1
Linear mixed model fit by REML
Formula: Biochem.HDL ~ (1 | Family)
   Data: b
 Subset: cc
  AIC  BIC logLik deviance REMLdev
 2161 2178  -1078     2149    2155
Random effects:
 Groups   Name        Variance Std.Dev.
 Family   (Intercept) 0.066633 0.25813
 Residual             0.161487 0.40185
Number of obs: 1857, groups: Family, 176

Fixed effects:
            Estimate Std. Error t value
(Intercept)  1.60615    0.02263   70.97
> plot(resid(f0),resid(f1))
## Some Essential R Tips

## Comparing Models Containing Missing Data
– R will silently omit rows of data containing missing elements, and adjust the degrees of freedom accordingly.
– anova() will only compare models that have been fitted to identical observations.
– Sporadic missing values in the explanatory variables can cause problems, because different models may then have different numbers of complete cases.
– The solution is to use the R function complete.cases() to identify the rows with complete data in the most general model, and restrict all models to those rows:

  cc <- complete.cases(data.frame)
  f <- lm(formula, data, subset=cc)
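A small self-contained illustration of the pattern, using a toy data frame with sporadic NAs (all names and values invented):

```r
# Sketch: complete.cases() keeps only rows with no missing values in any
# column, so nested models are fitted to exactly the same observations.
d <- data.frame(y  = c(1.2, 2.3, NA, 4.1, 5.0, 3.3, 2.8, 4.7),
                x1 = c(0.5, NA, 1.1, 0.9, 1.4, 1.0, 0.7, 1.2),
                x2 = c(2, 1, 3, 2, 1, 2, 3, 1))
cc <- complete.cases(d)   # TRUE only where every column is non-missing

# Both fits now use the same 6 rows, so anova(f0, f1) is a valid comparison
f0 <- lm(y ~ x1,      data = d, subset = cc)
f1 <- lm(y ~ x1 + x2, data = d, subset = cc)
```

Without the `subset=cc` restriction, f0 would use the rows where only y and x1 are complete, f1 a smaller set, and anova() would refuse to compare them.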
## Joining Two Data Frames on a Common Column
– Frequently the data to be analysed are spread across several data frames, with a column of unique subject ids used to match the rows.
– Unfortunately, the subjects often differ slightly between data frames: e.g. frame1 might contain phenotype data and frame2 genotypes. The two sets rarely agree perfectly, because there will be subjects with genotypes but no phenotypes, and vice versa.

– Solution 1: make a combined data frame containing the intersection of the subjects:

  ids <- sort(intersect(frame2$id, frame1$id))
  combined <- cbind(frame1[match(ids, frame1$id), cols1],
                    frame2[match(ids, frame2$id), cols2])

– Solution 2: make a combined data frame containing the union of the subjects:

  ids <- unique(sort(c(as.character(frame1$id), as.character(frame2$id))))
  combined <- cbind(frame1[match(ids, frame1$id), cols1],
                    frame2[match(ids, frame2$id), cols2])

– cols1 is a vector of the column names to include from frame1, and cols2 the column names to include from frame2.

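Base R's merge() performs the same joins directly, and is often less error-prone than hand-rolled match() indexing. A toy example with invented ids:

```r
# Sketch: merge() on a common id column. all=FALSE (default) keeps the
# intersection of subjects; all=TRUE keeps the union, padding with NAs.
frame1 <- data.frame(id = c("s1", "s2", "s3"), pheno = c(1.1, 2.2, 3.3))
frame2 <- data.frame(id = c("s2", "s3", "s4"), geno  = c(0, 1, 2))

both   <- merge(frame1, frame2, by = "id")              # intersection: s2, s3
either <- merge(frame1, frame2, by = "id", all = TRUE)  # union: s1..s4, NAs filled in
```

merge() also handles duplicated ids and differing column orders, which the match()/cbind() approach silently gets wrong if the id columns are not unique.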