
                    Linear Modelling I

         Richard Mott
Wellcome Trust Centre for Human
           Genetics
                    Synopsis
•   Linear Regression
•   Correlation
•   Analysis of Variance
•   Principle of Least Squares
Correlation
    Correlation and linear regression

•   Is there a relationship?
•   How do we summarise it?
•   Can we predict new observations?
•   What about outliers?
          Correlation Coefficient r
• -1 ≤ r ≤ 1
• r = 1   perfect positive linear relationship
• r = -1  perfect negative linear relationship
• r = 0   no linear relationship
• (slide shows example scatterplots, including data with r = 0.6)
Examples of Correlation
  (taken from Wikipedia)
         Calculation of r
• Data: pairs (x_1, y_1), ..., (x_n, y_n)
• r = \sum_i (x_i - \bar x)(y_i - \bar y) / \sqrt{ \sum_i (x_i - \bar x)^2 \sum_i (y_i - \bar y)^2 }
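A minimal sketch computing r from this formula on simulated data (hypothetical variables u and v) and checking it against the built-in cor():

# simulate two weakly related variables
u <- rnorm(100)
v <- 0.5 * u + rnorm(100)

# r from the definition: covariance over the product of the SDs
r.manual <- sum((u - mean(u)) * (v - mean(v))) /
    sqrt(sum((u - mean(u))^2) * sum((v - mean(v))^2))

r.manual
cor(u, v)   # should agree with the manual calculation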
                     Correlation in R

> cor(bioch$Biochem.Tot.Cholesterol,bioch$Biochem.HDL,use="complete")
[1] 0.2577617

> cor.test(bioch$Biochem.Tot.Cholesterol,bioch$Biochem.HDL)   # cor.test() drops incomplete pairs itself

        Pearson's product-moment correlation

data: bioch$Biochem.Tot.Cholesterol and bioch$Biochem.HDL
t = 11.1473, df = 1746, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.2134566 0.3010088
sample estimates:
      cor
0.2577617

> pt(11.1473,df=1746,lower.tail=FALSE)   # one-sided tail of the T distribution on 1746 degrees of freedom
[1] 3.154319e-28
                  Linear Regression
Fit a straight line to the data:

    y_i = a + b x_i + e_i

• a     intercept
• b     slope
• e_i   error
    – Normally distributed
    – E(e_i) = 0
    – Var(e_i) = s^2
                  Example: simulated data
R code
>   # simulate 30 data points with intercept 1, slope 3, error SD 5
>   x <- 1:30
>   e <- rnorm(30, 0, 5)
>   y <- 1 + 3*x + e

> # fit the linear model
> f <- lm(y ~ x)

> # plot the data and the predicted line
> plot(x,y)
> abline(reg=f)

> print(f)

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)                x
   -0.08634          3.04747
                  Least Squares

• Estimate a, b by least squares
• Minimise the sum of squared residuals
  between y and the prediction a + bx
• Minimise \sum_i (y_i - a - b x_i)^2
                     Why least squares?

•   LS gives simple formulae for the estimates of a, b:

       \hat b = \sum_i (x_i - \bar x)(y_i - \bar y) / \sum_i (x_i - \bar x)^2
       \hat a = \bar y - \hat b \bar x

•   If the errors are Normally distributed then the LS estimates are
    “optimal”:
     – In large samples the estimates converge to the true values
     – No other estimates have smaller expected errors
     – LS = maximum likelihood

•   Even if the errors are not Normal, LS estimates are often useful
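A short sketch (reusing the simulated x and y from the previous slide) checking these formulae against lm():

# closed-form LS estimates
b.hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
a.hat <- mean(y) - b.hat * mean(x)

c(a.hat, b.hat)
coef(lm(y ~ x))   # should match the manual estimates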
            Analysis of Variance (ANOVA)
LS estimates have an important property:
they partition the sum of squares (SS) into fitted and error components

    \sum_i (y_i - \bar y)^2 = \sum_i (\hat b(x_i - \bar x))^2 + \sum_i (y_i - \hat b x_i - \hat a)^2

• total SS = fitting SS + residual SS
• only the LS estimates do this
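The partition is easy to verify numerically; a sketch using the model f fitted earlier:

tss <- sum((y - mean(y))^2)           # total SS
fss <- sum((fitted(f) - mean(y))^2)   # fitting SS
rss <- sum(residuals(f)^2)            # residual SS

all.equal(tss, fss + rss)             # TRUE: the sums of squares add up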
Component     SS                                     Degrees of   Mean Square                                   F-ratio
                                                     freedom      (ratio of SS to df)                           (ratio of FMS/RMS)

Fitting SS    \sum_i (\hat b(x_i - \bar x))^2        1            \sum_i (\hat b(x_i - \bar x))^2               (n-2) \sum_i (\hat b(x_i - \bar x))^2 / \sum_i (y_i - \hat b x_i - \hat a)^2

Residual SS   \sum_i (y_i - \hat b x_i - \hat a)^2   n-2          \sum_i (y_i - \hat b x_i - \hat a)^2 / (n-2)

Total SS      \sum_i (y_i - \bar y)^2                n-1
                             ANOVA in R


Component     SS        Degrees of   Mean Square   F-ratio
                        freedom
Fitting SS    20872.7    1           20872.7       965
Residual SS     605.6   28              21.6
Total SS      21478.3   29


> anova(f)
Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value    Pr(>F)
x          1 20872.7 20872.7    965 < 2.2e-16 ***
Residuals 28   605.6    21.6

> pf(965,1,28,lower.tail=FALSE)
[1] 3.042279e-23
                  Hypothesis testing
• H0: b = 0, i.e. there is no relationship between y and x

• Assume the errors e_i are independent and normally distributed N(0, s^2)
• If H0 is true then the expected values of the sums of squares in the ANOVA
  are:

  Sum of squares    \sum_i (y_i - \bar y)^2   \sum_i (\hat b(x_i - \bar x))^2   \sum_i (y_i - \hat b x_i - \hat a)^2
  degrees freedom   n-1                       1                                 n-2
  Expectation       (n-1) s^2                 s^2                               (n-2) s^2

• F ratio = (fitting MS)/(residual MS) ~ 1 under H0
• F >> 1 implies we reject H0
• Under H0, F is distributed as F(1, n-2)
                  Degrees of Freedom
•   Suppose e_1, ..., e_n are iid N(0,1)

•   Then \sum_i e_i^2 ~ \chi^2_n, i.e. a sum of n independent variables

•   What about \sum_i (e_i - \bar e)^2 ?

•   These values are constrained to sum to 0: \sum_i (e_i - \bar e) = 0

•   Therefore the sum is distributed as if it comprised one fewer observation,
    hence it has n-1 df (for example, its expectation is n-1)

•   In particular, if p parameters are estimated from a data set, then the residuals
    have p constraints on them, so they behave like n-p independent variables
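A quick simulation sketch of this argument: the average of \sum_i (e_i - \bar e)^2 over many replicates is close to n-1, not n.

n <- 10
# 10000 replicates of the centred sum of squares for e ~ N(0,1)
ss <- replicate(10000, { e <- rnorm(n); sum((e - mean(e))^2) })
mean(ss)   # close to n - 1 = 9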
                  The F distribution
• If e_1, ..., e_n are independent and identically distributed (iid)
  random variables with distribution N(0, s^2), then:
• e_1^2/s^2, ..., e_n^2/s^2 are each chi-squared random variables on 1
  degree of freedom, \chi^2_1
• The sum S_n = \sum_i e_i^2/s^2 is distributed as chi-squared \chi^2_n
• If T_m is a similar sum distributed as chi-squared \chi^2_m, but
  independent of S_n, then (S_n/n)/(T_m/m) is distributed as an F
  random variable F(n,m)
• Special cases:
   – F(1,m) is the same as the square of a T-distribution on m df
   – for large m, F(n,m) tends to \chi^2_n / n
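A sketch checking the F(1,m) = T^2 special case with the t statistic from the earlier cor.test output (t = 11.1473 on 1746 df):

# the two-sided t P-value and the F(1,1746) P-value agree
2 * pt(11.1473, df=1746, lower.tail=FALSE)
pf(11.1473^2, 1, 1746, lower.tail=FALSE)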
                     ANOVA – HDL example
> ff <- lm(bioch$Biochem.HDL ~ bioch$Biochem.Tot.Cholesterol)
> ff                              # fitted model: HDL = 0.2308 + 0.4456 * Cholesterol
Call:
lm(formula = bioch$Biochem.HDL ~
      bioch$Biochem.Tot.Cholesterol)

Coefficients:
                  (Intercept)   bioch$Biochem.Tot.Cholesterol
                       0.2308                          0.4456

> anova(ff)
Analysis of Variance Table

Response: bioch$Biochem.HDL
                                Df  Sum Sq Mean Sq F value    Pr(>F)
bioch$Biochem.Tot.Cholesterol    1 149.660 149.660    1044 < 2.2e-16 ***
Residuals                     1849 265.057   0.143

> pf(1044, 1, 1849, lower.tail=FALSE)   # F on 1 and 1849 df, the residual df above
                  Correlation and ANOVA

•   r^2 = FSS/TSS = fraction of variance explained by the model

•   r^2 = F/(F + n - 2), since F = (n-2) r^2/(1 - r^2)
     –   correlation and ANOVA are equivalent
     –   the test of r = 0 is equivalent to the test of b = 0
     –   the T statistic in R's cor.test is the square root of the ANOVA F statistic

     –   r does not tell us anything about the magnitudes of the estimates of a, b
     –   r is dimensionless
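These identities can be checked against the earlier cor.test output (r = 0.2577617, t = 11.1473, n-2 = 1746); a sketch:

Fstat <- 11.1473^2       # F = t^2 in simple linear regression
Fstat / (Fstat + 1746)   # r^2 = F/(F + n - 2)
0.2577617^2              # agrees with the squared correlation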
                          Effect of sample size
Total Cholesterol vs HDL data
Example R session to sample random subsets of the data and compute correlations

idx <- 1:nrow(bioch)                  # row indices to sample from (idx was undefined on the slide)
seqq <- seq(10, 300, 5)               # sample sizes to try
corr <- matrix(0, nrow=length(seqq), ncol=2)
colnames(corr) <- c( "sample size", "P-value")
n <- 1
for(i in seqq) {
    res <- rep(0,100)
    for(j in 1:100) {                 # 100 random subsets of size i
           s <- sample(idx, i)
           data <- bioch[s,]
           co <- cor.test(data$Biochem.Tot.Cholesterol,
                          data$Biochem.HDL)   # incomplete pairs are dropped automatically
           res[j] <- co$p.value
    }
    m <- exp(mean(log(res)))          # geometric mean P-value at sample size i
    cat(i, m, "\n")
    corr[n,] <- c(i, m)
    n <- n+1
}
 Calculating the right sample size n
• The R library “pwr” contains functions to
  compute the sample size for many problems,
  including correlation pwr.r.test() and ANOVA
  pwr.anova.test()
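For example, a minimal sketch (assuming the pwr package is installed) of the sample size needed to detect a correlation of r = 0.25 with 80% power at the 5% significance level:

library(pwr)
# n is left unspecified, so pwr.r.test() solves for it
pwr.r.test(r = 0.25, sig.level = 0.05, power = 0.8)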
 Problems with non-linearity
All plots have r=0.8 (taken from Wikipedia)
          Non-Parametric Correlation
          Spearman Rank Correlation Coefficient

• Replace the observations by their ranks
• eg x = ( 5, 1, 4, 7 ) -> rank(x) = (3, 1, 2, 4)
• Compute the sum of squared differences between the
  ranks, d_i = rank(x_i) - rank(y_i); with no ties,

      rho = 1 - 6 \sum_i d_i^2 / (n(n^2 - 1))

• in R (see the sketch below):
  – cor( x, y, method="spearman")
  – cor.test(x, y, method="spearman")
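A quick sketch checking the rank formula against the built-in calculation (hypothetical variables u and v; the formula is exact when there are no ties):

set.seed(1)
u <- rnorm(20)
v <- u^2 + rnorm(20)                 # a non-linear relationship

d <- rank(u) - rank(v)
n <- length(u)
1 - 6 * sum(d^2) / (n * (n^2 - 1))   # Spearman's rho by hand
cor(u, v, method="spearman")         # should agree (no ties here)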
                  Spearman Correlation
> cor.test(xx,y, method="pearson")

       Pearson's product-moment correlation

data: xx and y
t = 0.0221, df = 28, p-value = 0.9825
alternative hypothesis: true correlation is not
    equal to 0
95 percent confidence interval:
 -0.3566213 0.3639062
sample estimates:
        cor
0.004185729

> cor.test(xx,y,method="spearman")

       Spearman's rank correlation rho

data: xx and y
S = 2473.775, p-value = 0.01267
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho
0.4496607
            Multiple Correlation
• The R cor function can be used to compute pairwise
  correlations between many variables at once,
  producing a correlation matrix.
• This is useful, for example, when comparing expression
  of genes across subjects.
• Gene coexpression networks are often based on the
  correlation matrix.
• in R (see the sketch below):
     mat <- cor(df, use="pairwise.complete.obs")
   – computes the correlation between every pair of columns
     in df, removing missing values in a pairwise manner
   – the output is a square matrix of correlation coefficients
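A minimal sketch on a hypothetical data frame df with a few numeric columns and some missing values:

# three variables, two of them correlated
df <- data.frame(g1 = rnorm(50))
df$g2 <- df$g1 + rnorm(50)
df$g3 <- rnorm(50)
df$g2[sample(50, 5)] <- NA            # introduce missing values

mat <- cor(df, use="pairwise.complete.obs")
round(mat, 2)                         # 3 x 3 symmetric correlation matrix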
             One-Way ANOVA
• Model y as a function of a categorical variable
  taking p values
  – eg subjects are classified into p families
  – we want to estimate the effect due to each family and
    test whether these effects differ
  – we want to estimate the fraction of variance
    explained by differences between families (an
    estimate of heritability)
             One-Way ANOVA
Model: y_ij = m + b_i + e_ij, for observation j in group i

LS estimators:

    \hat m   = \bar y    (the overall mean)
    \hat b_i = \bar y_i - \bar y,  where \bar y_i = average over group i
                  One-Way ANOVA
• The variance is partitioned into fitting and
  residual SS

      total SS  =  fitting SS     +   residual SS
                   between groups     within groups

      n-1          p-1                n-p            degrees of freedom
                   One-Way ANOVA

Component     SS                                 Degrees of   Mean Square           F-ratio
                                                 freedom      (ratio of SS to df)   (ratio of FMS/RMS)

Fitting SS    \sum_i n_i (\bar y_i - \bar y)^2   p-1          FSS/(p-1)             FMS/RMS
Residual SS   \sum_{ij} (y_ij - \bar y_i)^2      n-p          RSS/(n-p)
Total SS      \sum_{ij} (y_ij - \bar y)^2        n-1

Under H0: no differences between groups,
F ~ F(p-1, n-p)
                 One-Way ANOVA in R
fam <- lm( bioch$Biochem.HDL ~ bioch$Family )
> anova(fam)
Analysis of Variance Table

Response: bioch$Biochem.HDL
               Df Sum Sq Mean Sq F value    Pr(>F)
bioch$Family 173 6.3870 0.0369 3.4375 < 2.2e-16 ***
Residuals    1727 18.5478 0.0107
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
 Component     SS        Degrees of   Mean Square           F-ratio
                         freedom      (ratio of SS to df)   (ratio of FMS/RMS)

 Fitting SS     6.3870    173         0.0369                3.4375
 Residual SS   18.5478   1727         0.0107
 Total SS      24.9348   1900
             Non-Parametric
             One-Way ANOVA
• Kruskal-Wallis Test
• Useful if the data are highly non-Normal
  – Replace the data by their ranks
  – Compute the average rank within each group
  – Compare the averages
  – in R: kruskal.test( formula, data ), as sketched below
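A hedged sketch using the same HDL and Family variables as in the ANOVA example above:

# rank-based analogue of the one-way ANOVA of HDL on Family
kruskal.test(Biochem.HDL ~ Family, data = bioch)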
                   Factors in R
• Grouping variables in R are called factors
• When a data frame is read with read.table():
  – a column is treated as numeric if all non-missing
    entries are numbers
  – a column is logical if all non-missing entries are T or
    F (or TRUE or FALSE)
  – a column is treated as a factor otherwise
  – the levels of the factor are the set of distinct values
  – a column can be forced to be treated as a factor using
    the function as.factor(), or as a numeric vector
    using as.numeric(), as sketched below
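A small sketch of factor coercion on a hypothetical grouping column:

group <- c("A", "B", "A", "C")       # character vector
f <- as.factor(group)                # factor with levels A, B, C
levels(f)                            # "A" "B" "C"
as.numeric(f)                        # underlying integer codes: 1 2 1 3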
           Linear Modelling in R
• The R function lm() fits linear models
• It has two principal arguments (and some
  optional ones)
• f <- lm( formula, data )
  – formula is an R formula
  – data is the name of the data frame containing the
    data (it can be omitted if the variables in the
    formula carry the data frame prefix, eg bioch$Biochem.HDL)
                formulae in R
• Biochem.HDL ~ Biochem.Tot.Cholesterol
  – linear regression of HDL on Cholesterol
  – 1 df
• Biochem.HDL ~ Family
  – one-way analysis of variance of HDL on Family
  – 173 df
• The degrees of freedom are the number of
  independent parameters to be estimated
                       ANOVA in R
•   f <- lm(Biochem.HDL ~ Biochem.Tot.Cholesterol, data=bioch)
•   [OR f <- lm(bioch$Biochem.HDL ~ bioch$Biochem.Tot.Cholesterol)]

•   a <- anova(f)

•   f <- lm(Biochem.HDL ~ Family, data=bioch)
•   a <- anova(f)