					Supervised gene expression data analysis
        using SVMs and MLPs

             Giorgio Valentini
A real problem: Lymphoma gene expression
 data analysis by machine learning methods:
  • Diagnosis of tumors using a supervised approach
  • Discovering groups of genes related to
    carcinogenic processes
  • Discovering subgroups of diseases using gene
    expression data.
                    DNA microarray

DNA hybridization microarrays supply information about
  gene expression by measuring the mRNA levels of
  large numbers of genes in a cell.
They offer a snapshot of the overall functional status of a cell:
  virtually all differences in cell type or state are related to
  changes in the mRNA levels of many genes.
DNA microarrays have been used in mutational analyses,
  genetic mapping studies, in genome monitoring of gene
  expression, in pharmacogenomics, in metabolic pathway
A DNA microarray image (E. coli)

  • Each spot corresponds to the expression level of a particular gene
  • Red spots correspond to over-expressed genes
  • Green spots correspond to under-expressed genes
  • Yellow spots correspond to intermediate levels of gene expression
         Analyzing microarray data by machine
                   learning methods
   The large amount of gene expression data requires machine learning methods
   to analyze and extract significant knowledge from DNA microarray data

Unsupervised approach:
No or limited a priori knowledge.
Clustering algorithms are used to group together similar expression patterns:
    • grouping sets of genes
    • grouping different cells or different functional states of the cell.
Examples: hierarchical clustering, fuzzy or possibilistic clustering,
self-organizing maps.

Supervised approach:
“A priori” biological and medical knowledge of the problem domain.
Learning algorithms with labeled examples are used to associate gene
expression data with classes:
    • separating normal from cancerous tissues
    • classifying different classes of cells on a functional basis
    • predicting the functional class of unknown genes.
Examples: multi-layer perceptrons, support vector machines, decision
trees, ensembles of classifiers.
                 A real problem:
     A gene expression analysis of lymphoma

Biological problems and the machine learning methods applied to them:

1. Separating cancerous and normal tissues using the overall
   information available.
   Methods:
   - Support Vector Machines (SVM): linear, RBF and polynomial kernels
   - Multi-Layer Perceptron (MLP)
   - Linear Perceptron (LP)

2. Identifying groups of genes specifically related to the
   expression of two different tumour phenotypes through expression
   signatures.
   Two-step method:
   - A priori knowledge and unsupervised methods select
     “candidate” subgroups
   - SVM or MLP identify the most correlated subgroups
                              The data
• Data from a specialized DNA microarray, named "Lymphochip", developed
at the Stanford University School of Medicine:

  • High dimensional data: 4026 different genes preferentially
    expressed in lymphoid cells or with known roles in processes
    important in immunology or cancer
  • Small sample size: 96 tissue samples from normal and cancerous
    populations of human lymphocytes

               A challenging machine learning problem
                Types of lymphoma
Three main classes of lymphoma:
    • Diffuse Large B-Cell Lymphoma (DLBCL)
    • Follicular Lymphoma (FL)
    • Chronic Lymphocytic Leukemia (CLL)
plus Transformed Cell Lines (TCL) and normal lymphoid tissues.

       Type of tissue             Number of samples
       Normal lymphoid cells             24
       DLBCL                             46
       FL                                 9
       CLL                               11
       TCL                                6
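
As a quick check on the table above, the counts can be tallied and collapsed into two classes, normal and cancerous. This is a minimal Python sketch: the dictionary and variable names are illustrative, and it assumes that every tissue type other than normal lymphoid cells is treated as cancerous.

```python
# Sample counts per tissue type, copied from the table above.
samples_per_type = {
    "Normal lymphoid cells": 24,
    "DLBCL": 46,
    "FL": 9,
    "CLL": 11,
    "TCL": 6,
}

total = sum(samples_per_type.values())

# Assumption: every type other than normal lymphoid cells counts as cancerous.
normal = samples_per_type["Normal lymphoid cells"]
cancerous = total - normal

print(f"total samples: {total}")
print(f"normal: {normal}, cancerous: {cancerous}")
print(f"cancerous fraction: {cancerous / total:.2%}")
```

Under that assumption the two classes are unbalanced (three cancerous samples for each normal one), which is worth keeping in mind when reading error rates on this data.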
Figure: the data displayed with Tree View.
                  The first problem:
      Separating normal from cancerous tissues.
Our first task consists in distinguishing cancerous from
normal tissues using the overall information available, i.e. all
the gene expression data.

From a machine learning standpoint it is a dichotomic (two-class)
classification problem.

Data characteristics:
       • Small sample size
       • High dimension
       • Missing values
       • Noise

Main application goal: supporting functional-molecular diagnosis of
tumors and polygenic diseases.
  Supervised approaches to molecular
       classification of diseases
Several supervised methods have been applied to the analysis of
cDNA microarrays and high density oligonucleotide chips:
• Decision trees
• Fisher linear discriminant
• Multi-Layer Perceptrons
• Nearest-Neighbours classifiers
• Linear discriminant analysis
• Parzen windows
• Support Vector Machines

Proposed by different authors:
Golub et al. (1999), Pavlidis et al. (2001), Khan et al. (2001),
Furey et al. (2000), Ramaswamy et al. (2001), Yeang et al. (2001),
Dudoit et al. (2002).
     Why use Support Vector Machines?
“General” motivations:
• SVMs are two-class classifiers theoretically founded on Vapnik's
  Statistical Learning Theory.
• They act as linear classifiers in a high dimensional feature space
  originated by a projection of the original input space.
• The resulting classifier is in general non-linear in the input space.
• SVMs achieve good generalization performance by maximizing the
  margin between the classes.
• The SVM learning algorithm has no local minima.

“Specific” motivations:
• Kernels are well-suited to working with high dimensional data.
• Small sample sizes require algorithms with good generalization
  capabilities.
• Automatic diagnosis of tumors requires high sensitivity and very
  effective classifiers.
• SVMs can identify mislabeled data (i.e. incorrect diagnoses).
• We could design specific kernels to incorporate “a priori”
  knowledge about the problem.
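
The kernels named in these slides (linear or dot-product, polynomial, and RBF or Gaussian) can be written down in a few lines. The sketch below uses common textbook parameterizations; the parameter names `degree`, `coef0` and `sigma` and their default values are assumptions for illustration, not settings taken from the original experiments.

```python
import math

def dot_kernel(x, z):
    """Dot-product (linear) kernel: K(x, z) = <x, z>."""
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, degree=2, coef0=1.0):
    """Polynomial kernel: K(x, z) = (<x, z> + coef0) ** degree."""
    return (dot_kernel(x, z) + coef0) ** degree

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel: K(x, z) = exp(-||x - z||^2 / (2 * sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))

# Two made-up expression vectors with three genes each.
x = [1.0, 2.0, 0.5]
z = [0.5, -1.0, 2.0]

print(dot_kernel(x, z))
print(polynomial_kernel(x, z))
print(gaussian_kernel(x, x))
```

Each function returns the inner product of the two inputs in some feature space without building that space explicitly, which is why the cost depends on the number of samples rather than on the dimension of the feature space.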
   SVM to classify cancerous and normal tissues
We consider 3 standard SVM kernels:
• Gaussian
• Polynomial
• Dot-product
Varying:
• Values of the kernel parameters
• The regularization factor C

Comparing them with:
• MLP
• LP
Varying:
• Number of hidden units
• Backpropagation parameters

Estimation of the generalization error through:
• 10-fold cross-validation
• leave-one-out
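
The two error-estimation schemes differ only in how the samples are partitioned into held-out folds. The following sketch shows one way to generate the splits for a data set of 96 samples; the function names and the fixed random seed are illustrative assumptions, and the classifier training that would run on each split is omitted.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Split range(n_samples) into k disjoint test folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def train_test_splits(n_samples, k, seed=0):
    """Yield (train, test) index lists, one pair per fold."""
    folds = k_fold_indices(n_samples, k, seed)
    for test in folds:
        held_out = set(test)
        train = [i for i in range(n_samples) if i not in held_out]
        yield train, test

n_samples = 96  # number of tissue samples in the data set

ten_fold = list(train_test_splits(n_samples, k=10))
# Leave-one-out is the special case with as many folds as samples.
leave_one_out = list(train_test_splits(n_samples, k=n_samples))

print(len(ten_fold), sorted(len(test) for _, test in ten_fold))
print(len(leave_one_out), len(leave_one_out[0][1]))
```

In a full experiment, each kernel and each value of C would be trained on `train` and scored on `test`, and the per-fold errors averaged to estimate the generalization error.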
   Learning machine model   Gen. error (%)   St. dev.   Prec. (%)   Sens. (%)
   SVM-linear                     1.04          3.16       98.63      100.0
   SVM-poly                       4.17          5.46       94.74      100.0
   SVM-RBF                       25.00          4.48       75.00      100.0
   MLP                            2.08          4.45       98.61      98.61
   LP                             9.38         10.24       95.65      91.66

• 10-fold cross-validation and leave-one-out give similar estimates of the error.
• SVM-linear achieves the best results.
• High sensitivity, no matter what type of kernel function is used.
• The radial basis SVM shows a high misclassification rate and a high estimated VC dimension.
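
To make the table columns concrete, the sketch below computes generalization error, precision and sensitivity from confusion counts. The counts are hypothetical: they are not given in the source, and were chosen to be consistent with the reported percentages. It assumes that cancerous is the positive class, that the data contain 72 cancerous and 24 normal samples, and that the figures are pooled over all 96 samples.

```python
def summarize(tp, fp, tn, fn):
    """Return error, precision and sensitivity as percentages."""
    total = tp + fp + tn + fn
    return {
        "error": 100.0 * (fp + fn) / total,
        "precision": 100.0 * tp / (tp + fp),
        "sensitivity": 100.0 * tp / (tp + fn),
    }

# Hypothetical confusion counts (cancerous = positive), chosen to be
# consistent with the reported percentages; not taken from the source.
confusions = {
    "SVM-linear": dict(tp=72, fp=1, tn=23, fn=0),
    "MLP": dict(tp=71, fp=1, tn=23, fn=1),
}

for model, counts in confusions.items():
    s = summarize(**counts)
    print(f"{model}: error {s['error']:.2f}%  "
          f"precision {s['precision']:.2f}%  sensitivity {s['sensitivity']:.2f}%")
```

Read this way, an error of 1.04% on 96 samples corresponds to a single misclassified sample, which illustrates how coarse the estimates are at this sample size.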
                         ROC analysis
• The ROC curve of the SVM-linear is nearly optimal.
• The polynomial SVM also achieves a reasonably good ROC curve.
• The SVM-RBF shows a diagonal ROC curve: the highest sensitivity is
  achieved only when it completely fails to correctly detect normal cells.
• The ROC curve of the MLP is also nearly optimal.
• The linear perceptron shows a worse ROC curve, but with reasonable
  values lying on the highest and leftmost part of the ROC plane.
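
A ROC curve is obtained by sweeping a decision threshold over the classifier outputs and recording the false positive rate and the sensitivity at each threshold. The sketch below does this for two sets of made-up scores; the function name and the toy data are illustrative assumptions and do not reproduce the curves from the slides.

```python
def roc_points(scores, labels):
    """Return (false positive rate, sensitivity) pairs, one per threshold.

    labels: 1 for cancerous (positive), 0 for normal (negative).
    A sample is predicted positive when its score is >= the threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    # One threshold above every score, then one at each distinct score.
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

labels = [1, 1, 1, 0, 0]

# Toy scores that rank every positive above every negative.
well_separated = [0.9, 0.8, 0.7, 0.3, 0.1]
# Toy scores that carry no information: every sample gets the same value.
uninformative = [0.5, 0.5, 0.5, 0.5, 0.5]

print(roc_points(well_separated, labels))
print(roc_points(uninformative, labels))
```

The first set of scores reaches the top-left corner of the ROC plane (full sensitivity with no false positives). The second yields only the two end points of the diagonal, so full sensitivity is reached only when every normal sample is misclassified.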
      Summary of the results on the first problem
Using hierarchical clustering, 14.6% of the examples are misclassified
   (Alizadeh, 2000), against 1.04% for the SVM, 2.08% for the
   MLP and 9.38% for the LP.
Supervised methods exploit a priori biological knowledge (i.e. labeled
   data), while clustering methods use only gene expression data to group
   together different tissues, without any labeled data.
The linear SVM achieves the best results, but the MLP and the 2nd degree
   polynomial SVM also show a relatively low generalization error.
The linear SVM and the MLP can be used to build classifiers with high
   sensitivity and a low rate of false positives.
These results must be considered with caution because the
   available data set is too small to support general statements about the
   performance of the proposed learning machines.
                The second problem:
           Identifying DLBCL subgroups
    It starts from a hypothesis of Alizadeh et al. about the existence
    of two distinct functional types of lymphoma inside DLBCL.

     Actually, we consider two problems:

1. Validation of Alizadeh’s hypothesis
• They identified two subgroups of molecularly distinct DLBCL:
  germinal centre B-like (GCB-like) and activated B-like cells (AB-like).
• These two classes correspond to patients with very different

2. Finding the groups of genes most related to this separation
  Different subsets of genes could be responsible for the
  distinction of these two DLBCL subgroups: the expression
  signatures Proliferation, T-cell, Lymphnode and GCB (Lossos, 2000).
   A feature selection approach based on
           “a priori” knowledge
Finding the most correlated genes involves searching an exponential
number of gene combinations ($2^n - 1$), where n is usually of the order
of thousands.

     We need greedy algorithms and heuristic methods.
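
To give a sense of scale, the sketch below counts the non-empty subsets of n genes, checks the formula by brute force on three genes, and then reports how many decimal digits the count has for the 4026 genes on the Lymphochip. The function and variable names are illustrative.

```python
from itertools import combinations

def count_nonempty_subsets(n):
    """Number of non-empty subsets of n genes: 2**n - 1."""
    return 2 ** n - 1

# Brute-force check on a tiny example with 3 genes.
genes = ["g1", "g2", "g3"]
enumerated = [c for r in range(1, len(genes) + 1)
              for c in combinations(genes, r)]
print(len(enumerated), count_nonempty_subsets(len(genes)))

# For the 4026 genes on the Lymphochip, print only the size of the number.
n_subsets = count_nonempty_subsets(4026)
print(len(str(n_subsets)), "decimal digits")
```

A count with more than a thousand digits rules out exhaustive search, which is the motivation for restricting attention to a handful of candidate groups.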

            Can we exploit “a priori” biological
            knowledge about the problem ?
           A heuristic method (1)
A two-stage approach:
   I. Select groups of coordinately expressed genes.
   II. Identify among them the ones most correlated to
       the disease.

  • We do not consider single genes.
  • We consider only groups of coordinately expressed genes.
                A heuristic method (2)
I. Selecting groups of coordinately expressed genes:
•   Use “a priori” biological and medical knowledge about
    groups of genes with known or suspected roles in carcinogenic
    processes
And/or
•   Use unsupervised methods such as clustering algorithms to
    identify coordinately expressed sets of genes

II. Identifying subgroups of genes most related to the disease:
  1.   Train a set of classifiers using only the subgroups
       of genes selected in the first stage.
  2.   Evaluate and rank the performance of the trained classifiers.
  3.   Select the subgroups for which the corresponding
       classifiers achieve the best ranking.
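
The second stage can be sketched as a loop over candidate gene groups: train a classifier on each group alone, estimate its error, and rank the groups. In the sketch below a nearest-centroid classifier with leave-one-out error stands in for the SVM, MLP and LP used in the slides; that substitution, the toy expression matrix and the group names are all assumptions made for illustration.

```python
def nearest_centroid_loo_error(rows, labels, columns):
    """Leave-one-out error of a nearest-centroid classifier on chosen columns."""
    errors = 0
    for held_out in range(len(rows)):
        centroids = {}
        for label in set(labels):
            members = [rows[i] for i in range(len(rows))
                       if i != held_out and labels[i] == label]
            centroids[label] = [sum(r[c] for r in members) / len(members)
                                for c in columns]
        sample = [rows[held_out][c] for c in columns]
        predicted = min(
            centroids,
            key=lambda lab: sum((a - b) ** 2
                                for a, b in zip(sample, centroids[lab])),
        )
        errors += predicted != labels[held_out]
    return errors / len(rows)

def rank_gene_groups(rows, labels, groups):
    """Stage II: score each candidate group and sort by estimated error."""
    scored = [(nearest_centroid_loo_error(rows, labels, cols), name)
              for name, cols in groups.items()]
    return sorted(scored)

# Toy expression matrix: 6 samples x 4 genes (made-up values).
rows = [
    [2.1, 1.9, 0.3, 0.9],
    [1.8, 2.2, 0.8, 0.2],
    [2.0, 2.1, 0.1, 0.6],
    [0.2, 0.1, 0.7, 0.4],
    [0.1, 0.3, 0.2, 0.8],
    [0.3, 0.2, 0.9, 0.1],
]
labels = ["A", "A", "A", "B", "B", "B"]

# Stage I output: hypothetical candidate groups as lists of column indices.
groups = {"group_1": [0, 1], "group_2": [2, 3]}

for error, name in rank_gene_groups(rows, labels, groups):
    print(f"{name}: leave-one-out error {error:.2f}")
```

Because only a few groups are evaluated, the cost grows with the number of candidate groups rather than with the number of possible gene subsets.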
            Applying the heuristic method
1. Selecting “candidate” subgroups of genes:
We used biological knowledge and hierarchical clustering
   algorithms to select four subgroups:
•   Proliferation: sets of genes involved in the biological process
    of proliferation
•   T-cell: genes preferentially expressed in T-cells
•   Lymphnode: sets of genes normally expressed in lymph nodes
•   GCB: genes that distinguish germinal centre B-cells from
    other stages in B-cell ontogeny

2. Identifying the subgroups of genes most related to the
   GCB-like / AB-like separation:
•   Training of SVM, MLP and LP as classifiers using each
    subgroup of genes and all the subgroups together (All):
    5 classification tasks
•   Leave-one-out methods used with gaussian, polynomial and
    linear SVM
•   10-fold cross-validation with gaussian, polynomial and linear
    SVM, MLP and LP.
                   GCB signature
Learn. machine model   Gen. error   St. dev.   Prec.    Sens.
SVM-linear               10.50       11.16     90.00    90.00
SVM-poly                  8.70       14.54     96.67    88.33
SVM-RBF                   4.50        9.55    100.0     90.00
MLP                       8.70       10.50     90.90    90.90
LP                        8.70       10.50     90.90    90.90
                    All signatures
Learn. machine model   Gen. error   St. dev.   Prec.    Sens.
SVM-linear               15.00       11.16     85.00    85.00
SVM-poly                 14.00       18.97     93.33    76.67
SVM-RBF                  10.00       10.54    100.00    76.67
MLP                       8.70       13.28     95.00    86.36
LP                       10.87       14.28     86.96    90.90
       The second problem: summary

• The results support the hypothesis of Alizadeh about the
existence of two distinct subgroups in DLBCL.

• The heuristic method identifies the GCB signature as a
cluster of coordinately expressed genes related to the
separation between the GCB-like and AB-like DLBCL.
I. Methods to discover subclasses of tumors on a molecular basis.
Integrating “a priori” biological knowledge, supervised machine
learning methods and unsupervised clustering methods.

Stratifying patients into molecularly relevant categories, enhancing the
discrimination power and precision of clinical trials.

New perspectives on the development of new cancer therapeutics based on a
molecular understanding of the cancer

II. Methods to identify small subsets of genes correlated to tumors
- Refinements of the proposed heuristic method using clustering
  algorithms with semi-automatic selection of the number of
  significant subgroups of genes.
- Greedy algorithms based on mutual information measures.

• Discovery of new subclasses of tumors
• Enhancing biological knowledge about
• Automatic diagnosis of tumors using DNA microchips
