Data mining _ Machine Learning Methods for Micro-Array Analysis

					Methods for Micro-Array Analysis
Data Mining & Machine Learning
What is Micro-Array Analysis?
       Analysis of Micro-Array Data
• Challenges posed
   A typical characteristic of micro-array data is the large
    number of variables relative to the number of samples
   Hidden knowledge in these data has to be discovered
      E.g., gene expression data from 72 leukemia patients
       (samples) with 7,070 genes (variables)
   The study of the variability of gene expression patterns

• Problems
   How can micro-array data be analyzed so that the following
    requirements are met simultaneously?
      Efficiency
      Accuracy
      Automation
                   Typical Micro-Array data set
Suppose that the identical micro-array experiment is repeated p times (e.g. colon
cancer cells from p patients compared with p wild types). Then we obtain a data
set (mij; i=1,…,G, j=1,…,p), in which mij is the expression ratio of gene i in the jth experiment.

             mij      1       2        3      4       5         6       7        …         p

             1        1.04    1.17     1.08   1.06    1.14      1.09    1.07     …         1.11

             2        1.02    1.01     1.15   1.15    1.01      1.06    1.08     …         1.34

             ……       …       …        …      …       …         …       …        …         …

             G        1.45    1.08     1.02   1.06    1.12      1.57    1.11               1.06

Micro-array experiments usually generate large data sets with expression values for thousands of genes
(2000~20000) but no more than a few dozen samples
For example:

       Dataset                                Number         Samples                 References
                                              of genes

       Leukemia                               7129           47:25 bone marrow       Golub et al.
       ALL versus AML                                        samples                 (1999)

       Lung cancer (malignant pleural         12533          31:150 tissue           Gordon et al.
       mesothelioma (MPM) versus                             samples                 (2002)
       adenocarcinoma of the lung (ADCA))

       Prostate Cancer                        12600          77:59 prostate          Singh et al.
       (Tumor versus Normal classification)                  tumor samples and       (2002)
                                                             normal samples
   Main objectives in micro-array data analyses

   1. To find the genes that are differentially expressed (DF) in the two samples (e.g. the given
   colon cancer sample / the wild type; cells that underwent a given treatment / no treatment).
   Although biologists can discover DF genes even with p=1, it has been realized lately that
   making independent replications is good practice.

   Questions that could be asked:
 - Which genes' expression is modified by the condition? (it has been reported that many diseases, especially tumors,
    are not caused by a single gene mutation but are the result of a series of gene changes)
- Has the treatment changed the expression level of specific (target) genes / gene sequences to noticeably different
    levels? If so, is it important (i.e. is the patient’s condition improved due to this change in expression levels)?

   2. To find genes that behave similarly in different conditions (i.e. clustering the row vectors)
   and to find subgroups of samples (or patients’ tissues), that are similar to each other (i.e.
   clustering the column vectors).
  - novel discovery of genes in related biological pathways or having related functions
 - clinically important subgroups of patients

   3. Classification
 - For example: Golub et al. (1999) – 2 types of leukemia / based on gene expression profile of each sample

   4. Validation of the models, assessment of the robustness / predictive power of the classifiers
Main objectives in micro-array data analyses
     1. Finding differentially expressed genes.
             Parametric methods: t-test

  Standard t-test
    H0 - no difference between the treated
  and the control samples
    H1 - the treatment has an influence.
Knowing the probability distribution of the T
  variable under H0 (Student's t distribution with
  p-1 degrees of freedom), the actual T is computed
  and compared to this distribution.
The smaller the p-value, the less likely it is that a
  difference this extreme arose by chance.
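
As a minimal sketch of the test described on this slide, the snippet below computes T for a single gene from p replicate expression ratios, assuming the test is run on log2 ratios against H0 "mean log-ratio = 0" (a one-sample form, which is what gives p-1 degrees of freedom). The ratios are illustrative values, the function name is hypothetical, and the decision uses the tabulated two-sided 5% critical value for 7 degrees of freedom rather than an exact p-value.

```python
import math
import statistics

def one_sample_t(log_ratios):
    """Return (T, degrees of freedom) for H0: mean log-ratio = 0."""
    p = len(log_ratios)
    mean = statistics.fmean(log_ratios)
    sd = statistics.stdev(log_ratios)      # sample standard deviation (p-1 denominator)
    t_stat = mean / (sd / math.sqrt(p))
    return t_stat, p - 1

# Expression ratios of one gene over p = 8 replicates (illustrative values).
ratios = [1.04, 1.17, 1.08, 1.06, 1.14, 1.09, 1.07, 1.11]
log_ratios = [math.log2(r) for r in ratios]

t_stat, dof = one_sample_t(log_ratios)
T_CRIT_7DF = 2.365  # two-sided 5% critical value of Student's t with 7 d.o.f.
print(f"T = {t_stat:.2f}, dof = {dof}, reject H0: {abs(t_stat) > T_CRIT_7DF}")
```

With only a handful of replicates the estimate of the standard deviation is itself noisy, which is one reason the distributional assumptions matter.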
1. Finding differentially expressed genes. t-test

Advantages – simple and implemented in all
commercial microarray analysis packages

Disadvantages – distributional assumptions and the
problem of multiple testing (with so few samples, we
cannot assume that the sample mean is normally
distributed). This raises the question: what is the
“false discovery rate”?

Alternatives – empirical Bayes and parametric Bayes
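
The slide raises the false discovery rate without naming a procedure. One widely used way to control it is the Benjamini-Hochberg step-up procedure; the sketch below is that swapped-in technique, not something the slides prescribe. The p-values and the level q = 0.05 are illustrative assumptions.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return the indices of hypotheses rejected at FDR level q (BH step-up)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) whose p-value satisfies p_(k) <= (k / m) * q.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    return sorted(order[:k_max])

# One p-value per gene (illustrative values).
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

naive = [i for i, p in enumerate(p_values) if p <= 0.05]   # per-gene 5% threshold
rejected = benjamini_hochberg(p_values, q=0.05)
print("naive:", naive, "BH:", rejected)
```

On these numbers the per-gene threshold flags five genes while the corrected procedure keeps two, which illustrates why testing thousands of genes one by one inflates the number of false positives.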
   1. Finding differentially expressed genes.
               Fold approach

The average expression level of each gene is examined

If it changed by a certain number of folds, the gene is
declared changed (on or off)

Disadvantage: does not reveal the desired correlation
between the gene and its function. Does not find related
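
The fold approach can be sketched as follows. The two-fold threshold, the gene names and the intensity values are assumptions made for illustration; the slides do not fix a threshold.

```python
from statistics import fmean

def fold_change_call(treated, control, threshold=2.0):
    """Classify a gene by the ratio of its average expression levels."""
    ratio = fmean(treated) / fmean(control)
    if ratio >= threshold:
        return "up"
    if ratio <= 1.0 / threshold:
        return "down"
    return "unchanged"

# Hypothetical genes: (treated replicates, control replicates).
genes = {
    "g1": ([200, 220, 210], [100, 95, 105]),
    "g2": ([50, 55, 45], [100, 110, 90]),
    "g3": ([100, 105, 98], [101, 99, 100]),
}

calls = {name: fold_change_call(t, c) for name, (t, c) in genes.items()}
print(calls)
```

Note that the call ignores the spread of the replicates, so a gene with noisy measurements can cross the threshold by chance.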
[Diagram: Data Mining and Knowledge Discovery; labels: Observational Data, Background Knowledge]
Gene Expression grouping and classification. Overview of
                 existing approaches

• Micro-Array analysis employs machine learning algorithms and techniques to
mine useful data.
• Unsupervised data analysis
      Principal Component Analysis (PCA)
      Hierarchical Clustering
      Non-Hierarchical Clustering
      K – means
      Self organizing maps (a type of neural networks)
•   Supervised data analysis
      Decision Trees - C5.0 implementation
      Artificial Neural Networks – Back-propagation algorithm
•   Two complementary techniques
      Cross-validation
      Multi-model approaches (boosting, bagging, stacking)
            Principal Component Analysis (PCA)
    This is a technique for finding major combinations of data (i.e. genes that are
    regularly up- and down-regulated together)
•   Objectives
         Graphically summarize a large rectangular table of numbers, R, to simplify its comprehension
         and find pertinent features.
        Reduce the dimensionality of the data set (e.g. co-regulated genes)
        Graphically summarize:
               - The correlations between the variables.
                - Find new meaningful underlying variables (dimensions) that summarize the
         initial variables.

               - The proximities and the principal oppositions between the individuals
•   Simple example:
        Imagine a micro-array data set consisting of only 2 experiments (2 samples)
        Graphically represent the data.
    Principal Component Analysis(PCA)
•    Principal component analysis of a two-dimensional data cloud.
    The line shown is the direction of the first principal component,
    which gives an optimal (in the mean-square sense) linear
    reduction of dimension from 2 to 1 dimensions.
    Principal Component Analysis(PCA)
•   Principal component analysis (PCA) involves a mathematical procedure that transforms a
    number of (possibly) correlated variables into a (smaller) number of uncorrelated variables
    called principal components.
•   Illustration for the case of 2 samples. The centered data matrix is:

    $$X = \begin{pmatrix} x_1 - m(x) & y_1 - m(y) \\ \vdots & \vdots \\ x_n - m(x) & y_n - m(y) \end{pmatrix}$$

•   The variance of the sample x is given by: $v(x) = \frac{1}{n}\sum_{i=1}^{n} (x_i - m(x))^2$

•   The variance of the sample y is given by: $v(y) = \frac{1}{n}\sum_{i=1}^{n} (y_i - m(y))^2$

•   The covariance between x and y: $c(x, y) = \frac{1}{n}\sum_{i=1}^{n} (x_i - m(x))(y_i - m(y))$

•   Then we can write:

    $$\frac{1}{n} X^t X = \begin{pmatrix} v(x) & c(x, y) \\ c(x, y) & v(y) \end{pmatrix}$$

•   This matrix is square and symmetric, so it admits a characteristic polynomial, is diagonalizable, and
    admits a basis of orthogonal eigenvectors.
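
A small numerical check of the definitions on this slide, written as a sketch on hypothetical toy data (two samples x and y, four measurements each). It follows the slide's convention of dividing by n, not n-1.

```python
def covariance_matrix(x, y):
    """Return (1/n) X^t X for the centered two-column matrix X built from x and y."""
    n = len(x)
    m_x, m_y = sum(x) / n, sum(y) / n
    centered = [(xi - m_x, yi - m_y) for xi, yi in zip(x, y)]   # rows of X
    v_x = sum(a * a for a, _ in centered) / n                   # variance of x
    v_y = sum(b * b for _, b in centered) / n                   # variance of y
    c_xy = sum(a * b for a, b in centered) / n                  # covariance of x and y
    return [[v_x, c_xy], [c_xy, v_y]]

# Hypothetical toy data.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 1.0, 4.0, 3.0]

C = covariance_matrix(x, y)
print(C)
```

The result is symmetric by construction, since both off-diagonal entries are the same covariance.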
     Principal Component Analysis(PCA)
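     Then there exists a matrix U such that:

     $$C = \frac{1}{n} X^t X = U \Lambda U^t = \begin{pmatrix} u_{11} & u_{21} \\ u_{12} & u_{22} \end{pmatrix} \begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix} \begin{pmatrix} u_{11} & u_{12} \\ u_{21} & u_{22} \end{pmatrix}$$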

1.   The 2 eigenvectors are orthogonal. They represent a new system of INDEPENDENT coordinates.
     The quantities u11 and u12 are the coordinates of the first new axis expressed as a
     vector. The same holds for u21 and u22 and the second axis.
2.   Each coefficient indicates the weight of a particular experiment within this component (how
     much this experiment contributes to generating this pattern).
3.   Geometrically, this is a translation and a rotation of the coordinate system.
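
For the 2 x 2 case the diagonalization can be written in closed form. The sketch below assumes a symmetric matrix C = [[a, b], [b, d]]; the function name is hypothetical and the input is the covariance matrix of a made-up toy data set, not data from the slides.

```python
import math

def eig_sym_2x2(C):
    """Eigenvalues (largest first) and unit eigenvectors of a symmetric 2x2 matrix."""
    a, b, d = C[0][0], C[0][1], C[1][1]
    half_trace = (a + d) / 2
    radius = math.hypot((a - d) / 2, b)
    lam1, lam2 = half_trace + radius, half_trace - radius

    if b == 0:
        # Already diagonal: the original axes are the eigenvectors.
        u1, u2 = ((1.0, 0.0), (0.0, 1.0)) if a >= d else ((0.0, 1.0), (1.0, 0.0))
    else:
        def unit(v):
            norm = math.hypot(v[0], v[1])
            return (v[0] / norm, v[1] / norm)
        # (b, lam - a) solves (C - lam * I) v = 0 for each eigenvalue lam.
        u1 = unit((b, lam1 - a))
        u2 = unit((b, lam2 - a))
    return (lam1, lam2), (u1, u2)

C = [[1.25, 0.75], [0.75, 1.25]]   # covariance matrix of hypothetical toy data
(lam1, lam2), (u1, u2) = eig_sym_2x2(C)
print(lam1, lam2, u1, u2)
```

Here u1 plays the role of (u11, u12) and u2 the role of (u21, u22): two orthogonal unit vectors that define the rotated coordinate system.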
      Principal Component Analysis(PCA)

•   The first principal component accounts for as much of the variability in the data as possible.
•   Each succeeding component accounts for as much of the remaining variability as possible.
•   Imagine a cloud of data in many dimensions → benefits!
•   The projection of a point A (x, y) on an axis u (u1, u2) is obtained by taking the scalar product of
    the coordinates of this point and the vector coordinates of the axis: projection = x*u1 + y*u2.
•    Now, our points are the genes.

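    $$L = XU = \begin{pmatrix} x_1 - m(x) & y_1 - m(y) \\ \vdots & \vdots \\ x_n - m(x) & y_n - m(y) \end{pmatrix} \begin{pmatrix} u_{11} & u_{21} \\ u_{12} & u_{22} \end{pmatrix} = \begin{pmatrix} \text{proj. of gene 1 on axis 1} & \text{proj. of gene 1 on axis 2} \\ \vdots & \vdots \\ \text{proj. of gene } n \text{ on axis 1} & \text{proj. of gene } n \text{ on axis 2} \end{pmatrix}$$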

•   It is interesting to plot the eigenvalues, which express how the variability of the data is
    distributed in the new coordinate system. The relative sizes of the major and minor axes in the
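
The projection L = XU and the share of variability carried by each axis can be sketched on hypothetical toy data. The two axes below are assumed to be the unit eigenvectors of this toy data's covariance matrix, hard-coded to keep the example short.

```python
import math

# Hypothetical toy data: two samples (x, y) measured on four genes.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 1.0, 4.0, 3.0]
n = len(x)
m_x, m_y = sum(x) / n, sum(y) / n
X = [(xi - m_x, yi - m_y) for xi, yi in zip(x, y)]   # centered rows, one per gene

s = 1 / math.sqrt(2)
u1 = (s, s)      # first principal axis (assumed precomputed for this toy data)
u2 = (s, -s)     # second principal axis, orthogonal to the first

def project(point, axis):
    """Scalar product of a point with a unit axis: x*u1 + y*u2."""
    return point[0] * axis[0] + point[1] * axis[1]

# Row i of L holds the coordinates of gene i in the new system.
L = [(project(g, u1), project(g, u2)) for g in X]

# The variance along each new axis equals the corresponding eigenvalue.
var1 = sum(row[0] ** 2 for row in L) / n
var2 = sum(row[1] ** 2 for row in L) / n
share1 = var1 / (var1 + var2)
print(f"variance on axis 1: {var1:.2f}, axis 2: {var2:.2f}, share of axis 1: {share1:.0%}")
```

When the first axis carries most of the variance, dropping the second coordinate loses little information, which is the dimensionality reduction PCA is used for.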
         Principal Component Analysis(PCA)

•   Application to sporulation time-series: observations of differential expression for
    thousands of genes across multiple conditions

             mij   t1     t2     t3     t4     t5     t6     t7     …   t10

             1     1.04   1.17   1.08   1.06   1.14   1.09   1.07   …   1.11
             2     1.02   1.01   1.15   1.15   1.01   1.06   1.08   …   1.34
             ……    …      …      …      …      …      …      …      …   …
             G     1.45   1.08   1.02   1.06   1.12   1.57   1.11       1.06

         Usually, the first
          component has all positive
          coefficients, indicating a
          weighted average of all

         The second principal
          component has negative
          values at early time points
          and positive values for the
          later time points, indicating
          a measure of change in
 Machine Learning for Micro-Array Analysis:
Cluster analysis:
      Identification of new subgroups or classes of some
       biological entity (e.g., tumors)
             Hierarchical Clustering
Hierarchical cluster methods differ in:
   the distance measure selected
   the manner in which the distances are computed
  between the growing clusters and the remaining
  members of the data set
      Single Linkage. Disadvantage - loose clusters
      Complete Linkage. Disadvantage – clusters that are too compact and of very similar size.
      Average Linkage

   Unweighted pair-group method average (UPGMA): the two groups with the lowest average distance are
      joined to form a new cluster.
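
The three linkage rules differ only in how the distance between two clusters is derived from the pairwise distances of their members. The sketch below is a naive agglomerative loop under that reading, with Euclidean distance and hypothetical two-dimensional expression vectors; function and parameter names are illustrative.

```python
from itertools import combinations

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def cluster_distance(c1, c2, points, linkage):
    """Distance between two clusters given as lists of point indices."""
    dists = [euclidean(points[i], points[j]) for i in c1 for j in c2]
    if linkage == "single":
        return min(dists)               # closest pair of members
    if linkage == "complete":
        return max(dists)               # farthest pair of members
    return sum(dists) / len(dists)      # "average": UPGMA

def agglomerate(points, n_clusters, linkage="average"):
    """Repeatedly join the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: cluster_distance(clusters[ab[0]], clusters[ab[1]], points, linkage),
        )
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters

# Hypothetical expression vectors for five genes.
points = [(1.0, 1.1), (1.1, 1.0), (5.0, 5.2), (5.1, 5.0), (9.0, 1.0)]
clusters = agglomerate(points, n_clusters=3, linkage="average")
print(sorted(sorted(c) for c in clusters))
```

Recording the order of the merges, rather than stopping at a fixed number of clusters, would give the tree (dendrogram) usually drawn for hierarchical clustering.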
            Hierarchical Clustering
Euclidean and Manhattan: sensitive to absolute expression levels.
Reveal genes that have similar expression levels.
         A and B – have approximately the same expression levels
Correlation coefficient with centering: sensitive to expression profiles.
Reveal genes that have similar expression profiles.
          D and E – enhanced
          A and C – repressed
Absolute correlation coefficient:
          A, C, D, E – may be involved in the same biological pathway
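
The difference between these measures shows up on a small example. The three profiles below are hypothetical (they are not the genes A to E of the slide), and turning a correlation r into a distance as 1 - r or 1 - |r| is one common convention, assumed here.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def pearson(a, b):
    """Centered correlation coefficient between two profiles."""
    m_a, m_b = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - m_a) * (y - m_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - m_a) ** 2 for x in a) * sum((y - m_b) ** 2 for y in b))
    return num / den

def correlation_distance(a, b):
    return 1.0 - pearson(a, b)          # 0 for identical profile shapes

def abs_correlation_distance(a, b):
    return 1.0 - abs(pearson(a, b))     # 0 for identical or mirrored shapes

low = [1.0, 2.0, 3.0, 4.0]              # rising profile, low level
high = [10.0, 20.0, 30.0, 40.0]         # same shape, ten times the level
mirrored = [4.0, 3.0, 2.0, 1.0]         # opposite shape, same level

print(euclidean(low, high), manhattan(low, high))     # large: the levels differ
print(correlation_distance(low, high))                # close to 0: same profile
print(correlation_distance(low, mirrored))            # close to 2: opposite profile
print(abs_correlation_distance(low, mirrored))        # close to 0: related either way
```

The absolute correlation groups genes that move together or in opposition, which matches the idea that both may belong to the same pathway.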
                K-means Clustering
 1. Randomly assign data to the clusters.
   Suppose there are m genes per cluster.
 2. Calculate an average expression vector for each cluster i.
    This corresponds to the centroid of the cluster.

 3. Calculate a mean intra-class distance between each point and the centroid, for
      each cluster.

4. Move the data from one class to another,
  with the aim of minimizing the overall intra-class distance measure.

ADVANTAGES: easy to implement.
DISADVANTAGES: computationally intensive.
               outcome depends on factors such as the distance metric chosen.
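
The steps above can be sketched as the usual iterative loop. This is a minimal version assuming squared Euclidean distance and a move rule that sends each gene to its nearest centroid; re-seeding an empty cluster with a random data point is an added assumption the slide does not cover, and the data are hypothetical.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(vectors):
    """Average expression vector of a cluster."""
    return tuple(sum(col) / len(vectors) for col in zip(*vectors))

def k_means(data, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: randomly assign each gene to a cluster.
    labels = [rng.randrange(k) for _ in data]
    centroids = []
    for _ in range(max_iter):
        # Step 2: compute the centroid of each cluster.
        centroids = []
        for c in range(k):
            members = [v for v, lab in zip(data, labels) if lab == c]
            # Assumption: an empty cluster is re-seeded with a random data point.
            centroids.append(centroid(members) if members else rng.choice(data))
        # Steps 3-4: move each gene to the cluster with the nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c])) for v in data]
        if new_labels == labels:        # nothing moved: converged
            break
        labels = new_labels
    return labels, centroids

# Hypothetical expression vectors forming two well-separated groups.
data = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1),
        (8.0, 8.0), (8.1, 7.9), (7.9, 8.2), (8.2, 8.1)]
labels, centroids = k_means(data, k=2)
print(labels)
```

Because the starting assignment is random, different seeds can give different partitions on less clearly separated data, so the algorithm is often run several times.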
                 Non-parametric models
• Models that rely heavily on the empirical analysis of large data sets
rather than on prior domain knowledge

• Non-parametric Approaches:
    Decision trees, Neural networks, Genetic algorithms, and Nearest
neighbor methods.

• Fundamental assumption:
   Consistently observed relationships or patterns in large data sets will
recur in future observations.

• Advantages:
     Do not require a thorough understanding of the underlying system
    or problem
     Can be used to build arbitrarily complex models that are highly
    non-linear and not restricted by human comprehension.
                       Decision Tree

   Clearly indicates which attributes are most important for
   prediction or classification.

   Limited ability to handle estimation or regression tasks
   where the goal is to predict the value of a continuous variable
   Error-prone when the number of training examples per class
   is small
                   Neural Networks
  Ability to handle a range of problem tasks including
  classification (discrete outputs) and estimation or regression
  tasks (continuous outputs)
  Provision of an indication (through sensitivity analysis) of
  which attributes are most important for prediction or classification

  The risk of premature convergence to an inferior solution
  (this is normally addressed by performing a sensible cross-
  validation procedure)
                Multi-Model Approaches
•Problem with the regular models
  Instability of Prediction Method
     Sensitivity of the final model to small changes in the training
•Unstable machine learning methods
       Decision trees
•Stable methods
       k-nearest neighbor
       Neural models

Now, let us see an approach to address the instability problem….
 Machine Learning for Micro-Array Analysis
• Cross validation
    To test the robustness of the classifier
• Algorithm choice depends on
    Attributes
    Ratio of the training data
    [ TP,TN;if TP is small- over-fitting occurs]
• Combined approaches
    With a limited amount of training data, an individual classifier
     may not represent the true hypothesis.
    A combined classifier may produce a good approximation
     to the true hypothesis.
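
Cross-validation, mentioned at the top of this slide, can be sketched as a k-fold loop in which every sample is held out exactly once. The nearest-mean classifier, the single-feature data and the class labels are hypothetical stand-ins for whichever classifier is being assessed.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(n_samples) if i not in held_out]
        yield train_idx, test_idx

def nearest_mean_classifier(train_x, train_y):
    """Toy classifier: predict the class whose training mean is closest."""
    means = {}
    for label in set(train_y):
        values = [x for x, y in zip(train_x, train_y) if y == label]
        means[label] = sum(values) / len(values)
    return lambda x: min(means, key=lambda label: abs(x - means[label]))

def cross_validate(xs, ys, k):
    """Fraction of held-out samples classified correctly over k folds."""
    correct = 0
    for train_idx, test_idx in k_fold_splits(len(xs), k):
        model = nearest_mean_classifier([xs[i] for i in train_idx],
                                        [ys[i] for i in train_idx])
        correct += sum(model(xs[i]) == ys[i] for i in test_idx)
    return correct / len(xs)

# Hypothetical expression level of one gene in eight samples of two classes.
xs = [0.9, 1.1, 1.0, 1.2, 3.0, 3.2, 2.9, 3.1]
ys = ["ALL"] * 4 + ["AML"] * 4
accuracy = cross_validate(xs, ys, k=4)
print(f"cross-validated accuracy: {accuracy:.2f}")
```

The classifier never sees the held-out samples during training, so the accuracy estimates how it would behave on new patients rather than how well it memorized the training set.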
               Multi-Model Approaches
•Common methods for constructing multi-model systems
        Boosting, Bagging, and Stacking

•What do they do?
       Create and combine multiple classifiers
•How are they different from each other?
       They differ in how the classifiers are trained and in how their
outputs are combined.
•How do they improve accuracy?
      They improve accuracy by focusing the learning process on
examples in the data that are harder to model than others.
                    Boosting Algorithm
•Step 1: Form the learning set and the validation set (uniform sampling
without replacement).
•Step 2: N different training set replicas are sampled adaptively (with
non-uniform sampling probabilities and with replacement).
•Step 3: Build each classifier, f'i (x), based on its training set.
•Step 4: Establish each classifier’s performance by testing it against
the learning set.
•Step 5: Calculate a weight for each classifier based on its performance.
•Step 6: Combine the models by means of a weighted voting scheme,
where each individual prediction model carries a different weight.
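
A compact sketch of the idea, written as AdaBoost-style boosting. Two simplifications are assumed and are not taken from the slides: the adaptive resampling of Step 2 is replaced by keeping an explicit weight per example (misclassified examples get heavier), and the base classifiers are one-threshold "stumps" on a single feature. The data set is a standard toy example that no single stump can classify correctly.

```python
import math

def stump_predict(x, thr, polarity):
    """Predict `polarity` for x <= thr and the opposite class otherwise."""
    return polarity if x <= thr else -polarity

def best_stump(xs, ys, weights):
    """Pick the threshold and polarity with the lowest weighted error."""
    best = None
    for thr in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for w, x, y in zip(weights, xs, ys)
                      if stump_predict(x, thr, polarity) != y)
            if best is None or err < best[0]:
                best = (err, thr, polarity)
    return best

def boost(xs, ys, n_rounds):
    n = len(xs)
    weights = [1.0 / n] * n                     # start with uniform example weights
    ensemble = []
    for _ in range(n_rounds):
        err, thr, pol = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)   # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err) # classifier weight from its error
        ensemble.append((alpha, thr, pol))
        # Raise the weight of misclassified examples, lower the others.
        weights = [w * math.exp(-alpha * y * stump_predict(x, thr, pol))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(alpha * stump_predict(x, thr, pol) for alpha, thr, pol in ensemble)
    return 1 if score >= 0 else -1

xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

ensemble = boost(xs, ys, n_rounds=3)
predictions = [predict(ensemble, x) for x in xs]
print(predictions == ys)
```

Each stump alone misclassifies at least three of the ten examples, while the weighted vote of three stumps separates them all, which illustrates how focusing on hard examples improves accuracy.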