   Model-based clustering of gene expression data from microarrays
          Debashis Ghosh, Arul M. Chinnaiyan
    Department of Biostatistics, University of Michigan
     Department of Pathology, University of Michigan


   Introduction
   Systems and Methods
   Algorithm
   Implementation
   Results
   Discussion

             Introduction

   Large collections of genes are rapidly becoming
    available for parallel genomic studies
   Two platforms currently dominate the
    microarray field: oligonucleotide arrays
    and spotted cDNA arrays
          Introduction (Cont’d)

   Clustering methods are useful when the
    goal is to discover groupings in the gene
    expression data and no external
    information exists.
   An attractive feature of the model-based
    approach is that it provides a statistical
    criterion for assessing the number of true
    clusters in the dataset of interest.
           Systems and Methods

   Data Preprocessing
   Model Specification
       Density function

       Multivariate normal density

   Two steps in fitting the model to the data
       Initialization by model-based hierarchical
        agglomerative clustering
       Maximum likelihood estimation using EM
       A criterion for determining the number of
        clusters in the data
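
The density formulas on this slide did not survive extraction. As a sketch only, the standard Gaussian mixture formulation that model-based clustering usually assumes can be written as below. The symbols G (number of clusters), d (dimension of each observation), \tau_k (mixing proportions), \mu_k and \Sigma_k (cluster means and covariances) are notation chosen here, not taken from the slides.

```latex
% Mixture density for observation x_i
f(x_i \mid \theta) = \sum_{k=1}^{G} \tau_k \, \phi(x_i \mid \mu_k, \Sigma_k),
\qquad \tau_k \ge 0, \quad \sum_{k=1}^{G} \tau_k = 1

% Multivariate normal density for one component
\phi(x \mid \mu, \Sigma) = (2\pi)^{-d/2} \, |\Sigma|^{-1/2}
\exp\left\{ -\tfrac{1}{2} (x - \mu)^{\top} \Sigma^{-1} (x - \mu) \right\}
```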
        Hierarchical Agglomerative Clustering
   Classification log-likelihood

   Goal: find the maximizer of the classification log-likelihood
   Based on a combination of the dissimilarity
    matrix and a method of defining distance
    between clusters
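
The classification log-likelihood itself is missing from the extracted slide. Under the usual Gaussian model it takes the form below, where \gamma_i is the (hypothetical notation) cluster label of observation x_i. In the model-based variant of agglomerative clustering, each merge is commonly chosen as the pair of clusters whose union gives the largest value of this criterion, rather than the smallest distance.

```latex
\ell_C(\mu_1, \dots, \mu_G, \Sigma_1, \dots, \Sigma_G ; \gamma_1, \dots, \gamma_n)
  = \sum_{i=1}^{n} \log \phi\left(x_i \mid \mu_{\gamma_i}, \Sigma_{\gamma_i}\right),
\qquad \gamma_i \in \{1, \dots, G\}
```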
Expectation-Maximization Algorithm

   Complete-data likelihood

   Log-likelihood
Expectation-Maximization Algorithm (Cont’d)
   Estimator
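
The EM formulas on these slides were lost in extraction. The following is a minimal sketch of EM for a Gaussian mixture with unconstrained covariances, started from a hard labelling such as one produced by a hierarchical clustering step. It is not the MCLUST implementation: the function names, the `ridge` stabilizer, and the stopping rule are all assumptions made for illustration.

```python
import numpy as np

def log_gaussian(X, mean, cov):
    """Log density of N(mean, cov) evaluated at each row of X."""
    d = X.shape[1]
    chol = np.linalg.cholesky(cov)
    z = np.linalg.solve(chol, (X - mean).T)        # whitened residuals, shape (d, n)
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    return -0.5 * (d * np.log(2.0 * np.pi) + log_det + np.sum(z ** 2, axis=0))

def em_gmm(X, labels, n_iter=100, tol=1e-8, ridge=1e-6):
    """EM for a Gaussian mixture, started from integer labels 0..G-1.

    Assumes every label occurs at least once and no cluster empties out.
    `ridge` is a small diagonal term added for numerical stability (an assumption).
    """
    n, d = X.shape
    G = labels.max() + 1
    resp = np.eye(G)[labels]                       # one-hot responsibilities, shape (n, G)
    prev = -np.inf
    for _ in range(n_iter):
        # M-step: weighted proportions, means and covariances
        nk = resp.sum(axis=0)
        tau = nk / n
        means = (resp.T @ X) / nk[:, None]
        covs = []
        for k in range(G):
            diff = X - means[k]
            covs.append((resp[:, k, None] * diff).T @ diff / nk[k] + ridge * np.eye(d))
        # E-step: posterior probability of each cluster for each observation
        log_p = np.column_stack(
            [np.log(tau[k]) + log_gaussian(X, means[k], covs[k]) for k in range(G)]
        )
        log_norm = np.logaddexp.reduce(log_p, axis=1)
        resp = np.exp(log_p - log_norm[:, None])
        loglik = log_norm.sum()                    # observed-data log-likelihood
        if loglik - prev < tol:
            break
        prev = loglik
    return tau, means, covs, resp, loglik
```

Each row of `resp` holds the estimated cluster membership probabilities for one observation; assigning each observation to its largest entry gives a hard clustering.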
Selecting the number of clusters

   Determining the number of clusters based
    on the Bayes factor
   Bayes Factor

   Bayesian Information Criterion (BIC)
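
The BIC and Bayes factor formulas are missing from the extracted slide. As a sketch, the usual approximation treats a difference in BIC between two fitted models as roughly twice the log Bayes factor. The sign convention below (larger BIC is better) and the parameter count for unconstrained covariances are assumptions; other texts use the negated form, and constrained covariance models have fewer parameters.

```python
import numpy as np

def n_free_params(G, d):
    """Free parameters of a G-component Gaussian mixture in d dimensions
    with unconstrained covariance matrices."""
    return (G - 1) + G * d + G * d * (d + 1) // 2

def bic(loglik, n_params, n_obs):
    """BIC as 2 * log-likelihood minus a complexity penalty (larger is better)."""
    return 2.0 * loglik - n_params * np.log(n_obs)

def approx_two_log_bayes_factor(bic_a, bic_b):
    """Approximate 2 * log(Bayes factor) of model A against model B."""
    return bic_a - bic_b
```

In practice one would fit the mixture for a range of cluster counts, compute the BIC for each fit, and compare the values pairwise.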

             Implementation

   Model-based clustering in microarray
       Analysis of genes and ESTs (expressed
        sequence tags)
       Number of samples profiled (n) << the
        number of genes on a microarray (p), so it is
        impossible to fit the model to these data
          Implementation (Cont’d)
   Model-based clustering of genes
       Using k-means clustering as a preprocessing step
   Model-based clustering of samples
       Using principal components analysis for dimension reduction
       Using Bayes factors to determine the number of clusters
   Software
       Using MCLUST
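
The dimension reduction step for clustering samples can be sketched as follows. This assumes a matrix with samples in rows (n) and genes in columns (p), and computes principal component scores through a singular value decomposition, which works even when n << p. The function name and the choice of NumPy are illustrative; the slides name MCLUST as the software, and this is not its code.

```python
import numpy as np

def pca_scores(X, n_components):
    """Project the rows of X (samples) onto the leading principal components.

    Returns the score matrix (n_samples, n_components) and the fraction of
    total variance explained by each retained component.
    """
    Xc = X - X.mean(axis=0)                        # center each gene (column)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    explained = s ** 2 / np.sum(s ** 2)
    return scores, explained[:n_components]
```

The low-dimensional scores, rather than the full expression profiles, would then be passed to the mixture model so that the number of parameters stays small relative to the number of samples.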

             Results

   Cutaneous melanoma data
       Data: cDNA microarray experiments
        performed by Bittner et al.
       31 melanoma samples
       Microarray contains 8150 human cDNAs, of
        which 6912 were sequence verified.
                 Results (Cont’d)

   Prostate cancer data
       Data: 3955 genes across the 26 samples
       Original data: 9984 elements (5000 known genes from
        Research Genetics human cDNA clone set,
        4400 ESTs, 500 control elements)
Cluster Dendrogram (Bittner)
PCA of Melanoma Data
PCA of Prostate Cancer Data

             Discussion

   An attractive feature of the model-based
    clustering methodology is that the
    strength of evidence measure for the
    number of true clusters in the data is
   Bayes factor for model specification

   Principal Components Analysis (PCA)
   Bayes Factor