royal stat soc 05 by SZ4v6I


Propagating Measurement Uncertainty in Microarray Data Analysis

Magnus Rattray
School of Computer Science
University of Manchester




Combining the strengths of UMIST and
The Victoria University of Manchester
Talk Outline
• Part 1: Affymetrix probe-level analysis
      • Probabilistic model for oligonucleotide arrays
      • Estimating credibility intervals
      • Evaluation on real and spike-in data
• Part 2: Propagating uncertainties
      • A general framework for propagating uncertainties
      • Example 1: Identifying differentially expressed genes
      • Example 2: Modified Principal Component Analysis

Part 1: Affy probe-level analysis
PM: perfect match DNA probe, designed to measure signal
MM: mismatch DNA probe, designed to measure background
Probes for the same gene differ greatly in their binding affinities, e.g.

PM    83    77    70   982   530   1013   340   1832   464   1111
MM    86    65    79   489   172   1224   181    985   191    313


~10000-50000 probe-sets with 11-20 PM/MM probe-pairs
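
As a small illustration of the probe-to-probe variation in the example probe-set above, the following sketch summarises the ten PM/MM pairs with NumPy. The variable names are illustrative only, and the summaries chosen (fold range, log-scale correlation, count of MM > PM) are one possible way to look at the numbers, not part of any published pipeline.

```python
import numpy as np

# PM/MM intensities for the ten probe-pairs shown on the slide
pm = np.array([83, 77, 70, 982, 530, 1013, 340, 1832, 464, 1111], dtype=float)
mm = np.array([86, 65, 79, 489, 172, 1224, 181, 985, 191, 313], dtype=float)

# Ratio of the brightest to the dimmest PM probe within this one probe-set
pm_fold_range = pm.max() / pm.min()

# Correlation of PM and MM on the log scale (shared probe-pair affinity)
log_corr = np.corrcoef(np.log(pm), np.log(mm))[0, 1]

# Number of probe-pairs where the mismatch probe is brighter than the perfect match
n_mm_exceeds_pm = int(np.sum(mm > pm))

print(pm_fold_range, log_corr, n_mm_exceeds_pm)
```

The PM intensities span more than a 26-fold range within a single probe-set, and in three of the ten pairs MM exceeds PM, which is why a simple PM minus MM difference is an unreliable signal estimate.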
Are mismatch probes useful?
• In practice there is specific binding to MM, so some methods ignore MM probes altogether. But if the fraction of specific binding to MM is the same for each chip, this term cancels when computing expression ratios.
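
The cancellation argument can be written out under a simple assumed model of expected intensities. Here $B_c$ (background), $S_c$ (specific signal) and $\phi$ (fraction of specific binding to MM) are illustrative symbols, not notation recovered from the slide, and $c = 1, 2$ labels two chips.

```latex
\begin{align*}
\mathrm{PM}_c &= B_c + S_c, & \mathrm{MM}_c &= B_c + \phi\, S_c, \\
\mathrm{PM}_c - \mathrm{MM}_c &= (1 - \phi)\, S_c, \\
\frac{\mathrm{PM}_1 - \mathrm{MM}_1}{\mathrm{PM}_2 - \mathrm{MM}_2}
  &= \frac{(1 - \phi)\, S_1}{(1 - \phi)\, S_2} = \frac{S_1}{S_2}.
\end{align*}
```

The factor $(1 - \phi)$ only drops out if $\phi$ is the same on both chips, which is the condition stated above.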

Probabilistic probe-level analysis
•     Most methods return a single expression level estimate
•     Probabilistic models provide confidence intervals
•     Useful for propagating through higher-level analysis
•     Hopefully, this approach will also improve accuracy


      A hierarchical Bayesian model (Hein et al. 2005) uses
      MCMC for Bayesian parameter estimation, but this can
      be prohibitively slow – a more efficient approach is
      required.

Gamma model for oligo signal: gMOS

Models the (PM, MM) distribution for each probe-set, with components for:
      - PM (background + signal)
      - MM (background)
      - signal
The signal component gives the mean log-signal.
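
The equations for this slide did not survive extraction. One form consistent with a gamma model in which PM is background plus signal is sketched below; the symbols $y$ (PM), $m$ (MM), $s$ (signal), $a$, $\alpha$ and $b$ are assumptions made for illustration, and $\mathrm{Ga}(k, b)$ denotes a gamma distribution with shape $k$ and rate $b$.

```latex
\begin{align*}
y &\sim \mathrm{Ga}(a + \alpha,\, b) && \text{PM (background + signal)} \\
m &\sim \mathrm{Ga}(a,\, b)          && \text{MM (background)} \\
s &\sim \mathrm{Ga}(\alpha,\, b)     && \text{signal} \\
\mathrm{E}[\log s] &= \psi(\alpha) - \log b,
  && \text{where } \psi(\alpha) = \frac{d}{d\alpha} \log \Gamma(\alpha).
\end{align*}
```

The first line follows from the other two because the sum of independent gamma variables with a common rate is again gamma, with the shapes added.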

Milo et al., Biochemical Transactions 31, 6 (2003)
Modelling probe affinity: mgMOS
• PM and MM probes have correlated binding affinities
• Use a shared scale parameter for each probe-pair
• Treat the scale parameter as a latent variable
• The resulting joint distribution of PM and MM improves the fit to data
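
To make the latent-variable step concrete, the following derivation assumes gamma-distributed PM ($y$) and MM ($m$) intensities sharing a rate parameter $b$, with a gamma prior $b \sim \mathrm{Ga}(c, d)$. The prior and the symbols $a$, $\alpha$, $c$, $d$ are assumptions for illustration; the slide's own equation was not recoverable.

```latex
\begin{align*}
p(y, m) &= \int_0^\infty \mathrm{Ga}(y \mid a + \alpha, b)\,
           \mathrm{Ga}(m \mid a, b)\, \mathrm{Ga}(b \mid c, d)\, db \\
        &= \frac{d^{\,c}\, y^{a + \alpha - 1}\, m^{a - 1}}
                {\Gamma(a + \alpha)\, \Gamma(a)\, \Gamma(c)}
           \int_0^\infty b^{\,2a + \alpha + c - 1}\, e^{-b (y + m + d)}\, db \\
        &= \frac{\Gamma(2a + \alpha + c)}
                {\Gamma(a + \alpha)\, \Gamma(a)\, \Gamma(c)}\,
           \frac{d^{\,c}\, y^{a + \alpha - 1}\, m^{a - 1}}
                {(y + m + d)^{2a + \alpha + c}}.
\end{align*}
```

Integrating out the shared rate couples $y$ and $m$, so the marginal model captures the correlation between PM and MM within a probe-pair.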

Further extensions of the model
• Share the binding affinity parameter across multiple chips
• Include the fraction of specific binding to the MM probe
• The model is indexed by probe, probe-set and chip

The specific-binding fraction is unidentifiable, so we estimate an empirical prior for it from spike-in data.


Liu et al., Bioinformatics 21, 3637 (2005).
Posterior signal distribution
• We estimate the mean signal over a probe-set as a sum of terms
• Only the first term is chip & condition specific
• The distribution of the parameter in that term gives the posterior signal distribution
• We assume a uniform positive prior on this parameter
• Approximate the posterior of this parameter as a truncated Gaussian or using a histogram approach (very similar in practice)
• Percentiles of this parameter provide percentiles of the signal
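
A histogram-style approach to the percentiles can be sketched as follows. This is a minimal illustration, not the published implementation: the log-posterior is a toy Gaussian shape restricted to positive values, the function name `grid_percentiles` is hypothetical, and `np.log` stands in for whatever increasing function maps the parameter to the signal.

```python
import numpy as np

def grid_percentiles(log_post, grid, probs):
    """Percentiles of a 1-D posterior evaluated on an increasing grid."""
    weights = np.exp(log_post - np.max(log_post))  # unnormalised density, stabilised
    cdf = np.cumsum(weights)
    cdf /= cdf[-1]                                 # normalise so the CDF ends at 1
    return np.interp(probs, cdf, grid)             # invert the CDF by interpolation

# Toy posterior: Gaussian shape (mean 0.5, sd 1) under a uniform positive prior,
# so the grid only covers positive parameter values.
grid = np.linspace(1e-3, 8.0, 4000)
log_post = -0.5 * ((grid - 0.5) / 1.0) ** 2

p5, p50, p95 = grid_percentiles(log_post, grid, [0.05, 0.5, 0.95])

# An increasing transformation maps parameter percentiles to signal percentiles.
signal_percentiles = np.log([p5, p50, p95])
print(p5, p50, p95, signal_percentiles)
```

Because the mapping is monotonic, no resampling is needed: the 5th, 50th and 95th percentiles of the parameter map directly to the same percentiles of the transformed quantity.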
Posterior signal distribution
• Posterior becomes more peaked as signal increases




• Normal provides good fit for large signals
• For low signal there is a long left-hand tail because we are measuring the signal on a log scale
• Posterior distribution can be used to put credibility intervals on the estimated expression level
Results: Accuracy on real data
• 5 time-points, 3 replicates & qr-PCR for 14 genes

RMS error to PCR results:

Method                   Error
GC-RMA                   0.69
MAS 5.0                  0.66
mgMOS (post. median)     0.60
multi-mgMOS              0.60
Hierarchical Bayesian    0.72

Figure panels: mgMOS; multi-mgMOS
Mouse hair-follicle morphogenesis data from Lin et al., PNAS 101, 15955 (2004).
Importance of credibility intervals
Red boxes show truly differentially expressed genes.
Left: log-ratios used to rank genes.
Right: credibility intervals used to rank genes.

1331 up-regulated genes (1.2 to 4-fold), 12679 invariant
Spike-in data from Choe et al., Genome Biology 6, R16 (2005).
Part 2: Propagating uncertainties
• Uncertainties can be propagated as noise added to the true expression levels,
      where the noise has a diagonal covariance matrix specific to each gene
• Use your favourite probabilistic model for the true expression levels
• The data are then not i.i.d., which makes parameter estimation tricky

      We consider two popular tasks as examples:
      (i) Combining replicates and identifying differential expression
      (ii) Principal Component Analysis (PCA)
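
A sketch of the noise-propagation idea, in assumed notation: $x_n$ is the true expression profile of gene $n$, $y_n$ its measured profile, and $\sigma_{ni}^2$ the measurement variance in experiment $i$ from the probe-level analysis. The Gaussian model $N(m, C)$ for $x_n$ is just one example of a "favourite" model.

```latex
\begin{align*}
y_n &= x_n + \eta_n, \qquad
\eta_n \sim N(0, \Sigma_n), \qquad
\Sigma_n = \mathrm{diag}(\sigma_{n1}^2, \ldots, \sigma_{nd}^2), \\
x_n &\sim N(m, C) \;\;\Longrightarrow\;\; y_n \sim N(m,\, C + \Sigma_n).
\end{align*}
```

Each gene has its own covariance $C + \Sigma_n$, so the measurements are independent but not identically distributed, and closed-form estimators that rely on i.i.d. data no longer apply.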
(i) Combining replicates
 The simplest model of log-expression is a Gaussian for each replicate in each condition, with priors on the parameters.

  • The priors on the parameters are controlled by hyper-parameters
  • We can then calculate the probability of the sign of the change in expression level between two conditions
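
The following sketch shows the flavour of this calculation under strong simplifying assumptions that are not taken from the slides: a flat prior on each condition mean, a known between-replicate variance `rep_var`, and known per-chip measurement variances. All function and variable names are hypothetical, and the numbers are made up for illustration.

```python
import math
import numpy as np

def combine_replicates(values, meas_var, rep_var):
    """Posterior mean and variance of a condition mean under a flat prior.

    Each replicate is modelled as N(mu, rep_var + meas_var[r]), so noisy
    replicates receive smaller inverse-variance weights.
    """
    values = np.asarray(values, dtype=float)
    total_var = rep_var + np.asarray(meas_var, dtype=float)
    weights = 1.0 / total_var
    post_var = 1.0 / weights.sum()
    post_mean = post_var * np.sum(weights * values)
    return post_mean, post_var

def prob_increase(mean_a, var_a, mean_b, var_b):
    """P(mu_b > mu_a) for independent Gaussian posteriors."""
    z = (mean_b - mean_a) / math.sqrt(var_a + var_b)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Made-up log-expression values; the third control replicate is very noisy.
control = combine_replicates([7.1, 7.3, 6.9], [0.04, 0.05, 0.90], rep_var=0.02)
treated = combine_replicates([8.0, 8.2, 7.9], [0.05, 0.04, 0.06], rep_var=0.02)

p_up = prob_increase(control[0], control[1], treated[0], treated[1])
print(control, treated, p_up)
```

Ranking genes by a probability of this kind, rather than by the raw log-ratio, lets a large but poorly measured change rank below a smaller, well measured one.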

Hyper-parameter estimation
 The likelihood and the prior together define the log marginal likelihood, which we wish to optimise with respect to the hyper-parameters.

    The integral over the parameters is intractable, so we use a variational
    approximation (popular approach in machine learning).

   The resulting optimisation resembles an EM-algorithm.
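
In generic notation (assumed here, since the slide's equations were not preserved), let $D$ be the data, $\theta$ the parameters, $\phi$ the hyper-parameters and $q(\theta)$ any approximating distribution. Jensen's inequality gives the usual variational lower bound on the log marginal likelihood:

```latex
\begin{align*}
\log p(D \mid \phi)
  &= \log \int p(D \mid \theta)\, p(\theta \mid \phi)\, d\theta \\
  &\ge \int q(\theta)\,
       \log \frac{p(D \mid \theta)\, p(\theta \mid \phi)}{q(\theta)}\, d\theta
   \;=\; \mathcal{F}(q, \phi).
\end{align*}
```

Alternately maximising $\mathcal{F}$ over $q$ and over $\phi$ gives the EM-like iteration, and the bound is tight when $q$ equals the true posterior $p(\theta \mid D, \phi)$.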
Variational approximation



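     E-step: update the approximate posterior with the hyper-parameters held fixed
     M-step: update the hyper-parameters with the approximate posterior held fixed

      We use a factorised approximation to the posterior over the parameters.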

Results: credibility intervals




Data from Lin et al., PNAS 101, 15955 (2004)
Identifying differential expression




Figure panels: one chip per condition; 3 replicates per condition

1331 up-regulated genes (1.2 to 4-fold), 12679 invariant
Spike-in data from Choe et al., Genome Biology 6, R16 (2005).
(ii) Principal Component Analysis
• Popular dimensionality reduction technique
• Project data onto directions of greatest variation
• Useful tool for visualising patterns and clusters within the data set
• Usually requires an ad-hoc method for removing genes with low signal/noise

This example from Pomeroy et al., Nature 415, 436 (2002): embryonic tumours of the central nervous system.
Probabilistic PCA
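• PCA can be cast as a probabilistic model
      $x = W z + \mu + \epsilon$, with $\epsilon \sim N(0, \sigma^2 I)$
      and $q$-dimensional latent variables $z \sim N(0, I)$
• The resulting data distribution is
      $x \sim N(\mu,\, W W^T + \sigma^2 I)$
• Maximum likelihood solution is equivalent to PCA
      $W = U_q (\Lambda_q - \sigma^2 I)^{1/2} R$
      Diagonal $\Lambda_q$ contains the top $q$ sample covariance
      eigenvalues and $U_q$ contains the associated eigenvectors ($R$ is an arbitrary rotation)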
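
The closed-form maximum likelihood solution of probabilistic PCA can be sketched in a few lines of NumPy. The function name `ppca_ml` and the synthetic data are illustrative; the arbitrary rotation is taken to be the identity, and the noise variance is set to the average of the discarded eigenvalues, as in the standard result of Tipping and Bishop.

```python
import numpy as np

def ppca_ml(X, q):
    """Closed-form ML fit of probabilistic PCA with q latent dimensions."""
    mu = X.mean(axis=0)
    S = np.cov(X, rowvar=False, bias=True)       # ML (1/N) sample covariance
    eigvals, eigvecs = np.linalg.eigh(S)         # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]            # re-sort to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    sigma2 = eigvals[q:].mean()                  # average discarded variance
    W = eigvecs[:, :q] * np.sqrt(eigvals[:q] - sigma2)
    return mu, W, sigma2

# Synthetic data: two high-variance directions and three low-variance ones
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ np.diag([3.0, 2.0, 0.5, 0.5, 0.5])

mu, W, sigma2 = ppca_ml(X, q=2)
C = W @ W.T + sigma2 * np.eye(5)                 # model covariance
print(W.shape, sigma2)
```

The model covariance `C` reproduces the top two sample eigenvalues exactly and replaces the remaining ones by their average, which is the sense in which the maximum likelihood solution is equivalent to PCA.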
Tipping and Bishop, J. Royal Stat. Soc. B 61, 611 (1999).
Relationship to Factor Analysis
• Probabilistic PCA is equivalent to factor analysis with
      equal noise for every dimension
• In factor analysis the noise is $N(0, \Psi)$ for a diagonal
      covariance matrix $\Psi$
• An iterative algorithm (e.g. EM) is required to find
      parameters if precisions are not known in advance


      In our case we want the precision to be gene and
      experiment specific, so we need a more flexible model


PCA with measurement uncertainty
• If we let the noise covariance matrix be gene specific, the probabilistic PCA model is extended by a corrupted data model for the measurements
• The log-likelihood then involves a different covariance matrix for each gene
• The maximum likelihood solution for the mean is no longer the sample mean
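
To see why the sample mean is no longer optimal, consider measurements $y_n \sim N(\mu, C_n)$ with a known, gene-specific covariance $C_n$. Setting the gradient of the Gaussian log-likelihood to zero gives the precision-weighted mean $\mu = (\sum_n C_n^{-1})^{-1} \sum_n C_n^{-1} y_n$. The sketch below uses this generic result with made-up data; the names `ml_mean`, `base` and `noise_var` are hypothetical, and the covariances are treated as fixed rather than estimated.

```python
import numpy as np

def ml_mean(Y, covariances):
    """ML estimate of mu when row y_n of Y is drawn from N(mu, C_n)."""
    d = Y.shape[1]
    precision_sum = np.zeros((d, d))
    weighted_sum = np.zeros(d)
    for y, C in zip(Y, covariances):
        P = np.linalg.inv(C)          # precision of this gene's measurement
        precision_sum += P
        weighted_sum += P @ y
    return np.linalg.solve(precision_sum, weighted_sum)

rng = np.random.default_rng(1)
d, n = 3, 50
W = rng.normal(size=(d, 1))
base = W @ W.T + 0.1 * np.eye(d)                  # shared model covariance
noise_var = rng.uniform(0.01, 2.0, size=(n, d))   # gene/experiment-specific noise
Y = rng.normal(size=(n, d)) + 1.0                 # toy measurements

mu_weighted = ml_mean(Y, [base + np.diag(v) for v in noise_var])
mu_equal = ml_mean(Y, [base] * n)
print(mu_weighted, mu_equal, Y.mean(axis=0))
```

With identical covariances the estimate reduces to the sample mean, while gene-specific noise shifts it towards the better measured genes.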
Likelihood optimisation
• The optimal parameters are solutions to a coupled
      non-linear set of equations (e.g. the optimum for one parameter depends on the others)
• Gradients require inversion of large matrices
• An EM-algorithm provides more efficient optimisation
• M-step still requires non-linear optimisation
• Redundant parameterisation of the model gives us a
      significant speed-up




Advantages over standard PCA
• Automatically eliminates the influence of consistently
      noisy genes, e.g. genes that are noisy in all experiments
• Automatically chooses the number of principal components
      because noise “explains away” some of the variation
• Down-weights the influence of noisy measurements in an
      experiment-specific way
• Provides error-bars on the reduced dimension
      representation of the data
• Can be used to “denoise” expression profiles

Results: Improved visualisation
Under standard PCA 43% of samples are closest to a sample of the same tumour type.
For modified PCA this percentage increases to 71%.

Data from Pomeroy et al., Nature 415, 436 (2002).
Denoising a data set
• We can estimate the uncorrupted data from the noisy measurements
• Denoised profile approaches original as noise is reduced
• Denoised data improves performance of clustering
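
One natural estimator, assumed here for illustration rather than taken from the slide, is the posterior mean of the uncorrupted profile under a Gaussian model: if $x \sim N(\mu, C)$ and $y = x + \eta$ with $\eta \sim N(0, \Sigma)$, then $\mathrm{E}[x \mid y] = \mu + C (C + \Sigma)^{-1} (y - \mu)$. The function name `denoise` and all numbers below are hypothetical.

```python
import numpy as np

def denoise(y, mu, C, noise_var):
    """Posterior mean of x given y = x + eta, with x ~ N(mu, C) and
    eta ~ N(0, diag(noise_var))."""
    return mu + C @ np.linalg.solve(C + np.diag(noise_var), y - mu)

mu = np.array([1.0, 2.0, 3.0])
W = np.array([[1.0], [0.5], [-0.5]])
C = W @ W.T + 0.2 * np.eye(3)          # PPCA-style model covariance
y = np.array([2.0, 2.5, 2.0])          # one noisy expression profile

nearly_clean = denoise(y, mu, C, np.full(3, 1e-8))   # almost no measurement noise
very_noisy = denoise(y, mu, C, np.full(3, 1e8))      # measurements uninformative
mixed = denoise(y, mu, C, np.array([1e-8, 1e-8, 10.0]))
print(nearly_clean, very_noisy, mixed)
```

As the measurement noise shrinks the estimate approaches the measured profile, and as it grows the estimate falls back to the model mean. In the mixed case the noisy third entry is largely inferred from the two well measured ones.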
Sanguinetti et al., Bioinformatics 21, 3748 (2005).
Conclusions
• We have developed a computationally efficient
      probabilistic model for Affymetrix probe-level analysis.
• The model provides good accuracy and confidence
      intervals for gene expression level estimates.
• Measurement uncertainties can be propagated
      through an appropriate probabilistic model.
• Example applications to Bayesian t-test and PCA.
• Parameter estimation becomes much more difficult,
      so approximate methods are needed.
• The same principle can be applied to other models.
Acknowledgments
Rest of the team:
Xuejun Liu, School of Computer Science, University of Manchester.
Guido Sanguinetti, Marta Milo & Neil Lawrence, Department of Computer
  Science, University of Sheffield.
Software: www.bioinf.man.ac.uk/resources/puma
Papers:
Liu et al. “A tractable probabilistic model for Affymetrix probe-level analysis
    across multiple chips” Bioinformatics 21, 3637 (2005).
Sanguinetti et al. “Accounting for probe-level noise in principal component
   analysis of microarray data” Bioinformatics 21, 3748 (2005).


Supported by a BBSRC award “Improved processing of microarray data with
   probabilistic models”