Supervised gene expression data analysis
using SVMs and MLPs
A real problem: Lymphoma gene expression
data analysis by machine learning methods:
• Diagnosis of tumors using a supervised approach
• Discovering groups of genes related to
• Discovering subgroups of diseases using gene
DNA hybridization microarrays supply information about
gene expression through measurements of the mRNA levels of
large numbers of genes in a cell.
They offer a snapshot of the overall functional status of a cell:
virtually all differences in cell type or state are related to
changes in the mRNA levels of many genes.
DNA microarrays have been used in mutational analyses,
genetic mapping studies, in genome monitoring of gene
expression, in pharmacogenomics, in metabolic pathway
A DNA microarray image (E. coli)
• Each spot
corresponds to the
expression level of
a particular gene
• Red spots
correspond to over
• Green spots to
• Yellow spots
of gene expression
Analyzing microarray data by machine learning
The large amount of gene expression data requires machine learning methods
to analyze and extract significant knowledge from DNA microarray data
Unsupervised approach
No or limited a priori knowledge.
Clustering algorithms are used to group together similar expression patterns:
• grouping sets of genes
• grouping different cells or tissues [...] different functional status of the cell.
Example: hierarchical clustering, fuzzy or possibilistic clustering, self-organizing maps.
Supervised approach
“A priori” biological and medical knowledge on the problem domain.
Learning algorithms with labeled examples are used to associate gene expression data with classes:
• separating normal from cancerous tissues
• classifying different classes of cells on functional basis
• Prediction of the functional class of
Example: multi-layer perceptrons, support vector machines, decision trees, ensembles of classifiers.
A real problem:
A gene expression analysis of lymphoma
Biological
1. Separating cancerous and normal tissues using the overall information available.
2. Identifying groups of genes specifically related to the expression of two different tumour phenotypes through expression signatures.
Machine learning
1. - Support Vector Machines (SVM): linear, RBF and polynomial kernels
   - Multi Layer Perceptron (MLP)
   - Linear Perceptron (LP)
2. Two step method:
   - A priori knowledge and unsupervised methods to select “candidate” subgroups
   - SVM or MLP identify the most
• Data of a specialized DNA microarray, named "Lymphochip", developed
at the Stanford University School of Medicine:
High dimensional data: 4026 different genes preferentially expressed in lymphoid cells or with known roles in processes important in immunology or cancer
Small sample size: 96 tissue samples from normal and cancerous populations of
A challenging machine learning problem
Types of lymphoma
Three main classes of lymphoma:
• Diffuse Large B-Cell Lymphoma (DLBCL),
• Follicular Lymphoma (FL)
• Chronic Lymphocytic Leukemia (CLL)
• Transformed Cell Lines (TCL)
and normal lymphoid tissues
Type of tissue Number of samples
Normal lymphoid cells 24
The first problem:
Separating normal from cancerous tissues.
Our first task consists in distinguishing cancerous from
normal tissues using the overall information available, i.e. all
the gene expression data.
From a machine learning standpoint it is a dichotomic (two-class) classification problem.
Data characteristics:
• Small sample size
• High dimension
• Missing values
• Noise
Main applicative goal:
Supporting functional-molecular diagnosis of tumors and polygenic diseases
Supervised approaches to molecular
classification of diseases
Several supervised methods have been applied to the analysis of
cDNA microarrays and high density oligonucleotide chips:
• Decision trees
• Fisher linear discriminant
• Linear discriminant analysis
• Multi-Layer Perceptrons
• Parzen windows
• Nearest-Neighbours classifiers
• Support Vector Machines
Proposed by different authors:
Golub et al. (1999), Pavlidis et al. (2001), Khan et al. (2001),
Furey et al. (2000), Ramaswamy et al. (2001), Yeang et al. (2001),
Dudoit et al. (2002).
Why use Support Vector Machines?
“General” motivations
• SVMs are two-class classifiers theoretically founded on Vapnik's Statistical Learning Theory.
• They act as linear classifiers in a high dimensional feature space originated by a projection of the original input space.
• The resulting classifier is in general non linear in the input space.
• SVMs achieve good generalization performance by maximizing the margin between the classes.
• The SVM learning algorithm has no local minima.
“Specific” motivations
• Kernels are well-suited to working with high dimensional data.
• Small sample sizes require algorithms with good generalization capabilities.
• Automatic diagnosis of tumors requires high sensitivity and very effective classifiers.
• SVMs can identify mis-labeled data (i.e. incorrect diagnosis).
• We could design specific kernels to incorporate “a priori” knowledge about the problem.
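The three kernel families used with the SVMs in these slides (dot-product, polynomial, Gaussian) can be sketched in a few lines. This is only an illustration: the slides do not give the exact formulas or parameter values, so the parameterization below (`degree`, `coef0`, `sigma`) and the toy expression profiles are assumptions.

```python
import math

def dot_kernel(x, y):
    """Dot-product (linear) kernel: K(x, y) = <x, y>."""
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """Polynomial kernel: K(x, y) = (<x, y> + coef0) ** degree.
    degree and coef0 are assumed defaults, not values from the slides."""
    return (dot_kernel(x, y) + coef0) ** degree

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel: K(x, y) = exp(-||x - y||^2 / (2 * sigma^2)).
    sigma is an assumed default, not a value from the slides."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

# Two toy expression profiles (made-up values, three genes each).
x = [1.0, 0.0, 2.0]
y = [0.5, 1.0, 1.0]

k_dot = dot_kernel(x, y)
k_poly = polynomial_kernel(x, y)
k_rbf = gaussian_kernel(x, y)
print(k_dot, k_poly, k_rbf)
```

Each function only needs inner products or distances between samples, which is why the cost of evaluating a kernel grows with the number of genes but the SVM itself works on a samples-by-samples matrix.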
SVM to classify cancerous and normal tissues
We consider 3 kernels:
• Gaussian
• Polynomial
• Dot-product
Varying:
• Values of the
• The regularization factor C
Estimation of the generalization error through:
• 10-fold cross-validation
• leave-one-out
them with:
• MLP
• LP
• Number of hidden units
• Backpropagation
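The two error-estimation schemes named above differ only in how the 96 samples are partitioned into test folds. The sketch below is a minimal illustration, not the authors' procedure: the helper names `k_fold_indices` and `cv_error` are hypothetical, and no shuffling or class stratification is applied, which a real experiment would probably want.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k disjoint test folds of near-equal size."""
    folds = [[] for _ in range(k)]
    for i in range(n_samples):
        folds[i % k].append(i)
    return folds

def cv_error(n_samples, k, train_and_test):
    """Estimate the error rate by k-fold cross-validation.

    train_and_test(train_idx, test_idx) is a user-supplied callable that
    trains on train_idx and returns the number of errors made on test_idx.
    """
    errors = 0
    for test_idx in k_fold_indices(n_samples, k):
        held_out = set(test_idx)
        train_idx = [i for i in range(n_samples) if i not in held_out]
        errors += train_and_test(train_idx, test_idx)
    return errors / n_samples

ten_fold = k_fold_indices(96, 10)       # 10-fold cross-validation
leave_one_out = k_fold_indices(96, 96)  # leave-one-out is the k = n case
print([len(fold) for fold in ten_fold])
```

Leave-one-out trains 96 classifiers on 95 samples each, while 10-fold trains 10 classifiers on 86 or 87 samples each.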
Learning machine model   Gen. error (%)   St. dev.   Prec. (%)   Sens. (%)
SVM-linear                1.04             3.16       98.63       100.0
SVM-poly                  4.17             5.46       94.74       100.0
SVM-RBF                  25.00             4.48       75.00       100.0
MLP                       2.08             4.45       98.61       98.61
LP                        9.38            10.24       95.65       91.66
• 10-fold cross-validation and leave-one-out give similar estimates of the error.
• SVM-linear achieves the best results.
• High sensitivity, no matter what type of kernel function is used.
• The radial basis SVM shows a high misclassification rate and a high estimated VC dimension.
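The three reported figures (generalization error, precision, sensitivity) all follow from a confusion matrix. The sketch below shows the arithmetic. The confusion counts are not given in the slides: they are inferred here from the reported percentages, assuming 96 samples of which 24 are normal (so 72 cancerous) and taking "cancerous" as the positive class.

```python
def summarize(tp, fp, fn, tn):
    """Return (error %, precision %, sensitivity %) from confusion counts."""
    total = tp + fp + fn + tn
    error = 100.0 * (fp + fn) / total
    precision = 100.0 * tp / (tp + fp)
    sensitivity = 100.0 * tp / (tp + fn)
    return round(error, 2), round(precision, 2), round(sensitivity, 2)

# Inferred counts (an assumption, see the lead-in): one normal tissue
# misclassified as cancerous by the linear SVM; one error of each kind
# for the MLP.
svm_linear = summarize(tp=72, fp=1, fn=0, tn=23)
mlp = summarize(tp=71, fp=1, fn=1, tn=23)
print(svm_linear)
print(mlp)
```

Under these assumed counts the function reproduces the SVM-linear and MLP rows of the table, which suggests that a single misclassified sample moves the error estimate by about one percentage point on a data set of this size.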
• The ROC curve of the SVM-linear is
• The polynomial SVM also achieves a
• The SVM-RBF shows a diagonal ROC curve: the highest sensitivity is achieved only when it completely fails to
• The ROC curve of the MLP is also
[...] a worse ROC curve, but with [...] lying on the leftmost part of the ROC plane.
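For readers unfamiliar with ROC curves, the points of such a curve can be obtained by sweeping a decision threshold over the classifier outputs. The following is a generic sketch with made-up scores and labels, not the data or the plotting procedure behind the curves discussed above. `roc_points` is a hypothetical helper name.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs obtained by
    sweeping a decision threshold over the classifier scores.

    labels: 1 = positive (e.g. cancerous), 0 = negative (e.g. normal).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

# Made-up classifier outputs for six samples.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 1, 0, 1, 0]
curve = roc_points(scores, labels)
print(curve)
```

A curve hugging the left and top edges of the ROC plane indicates high sensitivity at a low false positive rate, while a curve along the diagonal indicates a classifier that does no better than chance.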
Summary of the results on the first problem
Using hierarchical clustering, 14.6% of the examples are misclassified
(Alizadeh, 2000), against 1.04% for the linear SVM, 2.08% for the
MLP and 9.38% for the LP.
Supervised methods exploit a priori biological knowledge (i.e. labeled
data), while clustering methods use only gene expression data to group
together different tissues, without any labeled data.
Linear SVMs achieve the best results, but the MLP and the 2nd degree
polynomial SVM also show a relatively low generalization error.
Linear SVMs and MLPs can be used to build classifiers with high
sensitivity and a low rate of false positives.
These results must be considered with caution because the size of the
available data set is too small to infer general statements about the
performances of the proposed learning machines.
The second problem:
Identifying DLBCL subgroups
It starts from a hypothesis of Alizadeh et al. about the existence
of two distinct functional types of lymphoma inside DLBCL.
Actually, we consider two problems:
1. Validation of Alizadeh’s hypothesis
• They identified two subgroups of molecularly distinct DLBCL: germinal centre B-like (GCB-like) and activated B-like cells (AB-like).
• These two classes correspond to patients with very different
2. Finding groups of genes mostly related to this separation
Different subsets of genes could be responsible for the distinction of these two DLBCL subgroups: the expression signatures Proliferation, T-cell, Lymphnode and GCB (Lossos, 2000).
A feature selection approach based on
“a priori” knowledge
Finding the most correlated genes involves searching an exponential
number of gene combinations (2^n - 1), where n is usually of the order
We need greedy algorithms and heuristic methods.
Can we exploit “a priori” biological
knowledge about the problem ?
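To give a sense of scale for the 2^n - 1 candidate gene subsets, the short calculation below counts the non-empty subsets for a tiny gene set and for the 4026 genes of the Lymphochip. `n_nonempty_subsets` is a hypothetical helper name introduced only for this illustration.

```python
def n_nonempty_subsets(n):
    """Number of non-empty subsets of a set of n genes: 2^n - 1."""
    return 2 ** n - 1

small = n_nonempty_subsets(4)  # 4 genes -> 15 candidate subsets
n_genes = 4026                 # number of genes on the Lymphochip
n_digits = len(str(n_nonempty_subsets(n_genes)))
print(small, n_digits)
```

With 4026 genes the count is a number with more than a thousand decimal digits, so exhaustive search over gene subsets is out of the question.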
A heuristic method (1)
A two-stage approach:
I. Select groups of coordinately expressed genes.
II. Identify among them the ones mostly correlated to the disease.
• We do not consider single genes.
• We consider only groups of coordinately expressed genes.
A heuristic method (2)
I. Selecting groups of coordinately expressed genes:
• Use “a priori” biological and medical knowledge about groups of genes with known or suspected roles in carcinogenic
And/or
• Use unsupervised methods such as clustering algorithms to identify coordinately expressed sets of genes
II. Identify subgroups of genes mostly related to the disease:
1. Train a set of classifiers using only the subgroups of genes selected in the first stage.
2. Evaluate and rank the performance of the trained
3. Select the subgroups by which the corresponding classifiers achieve the best
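The second stage of the method (train one classifier per candidate subgroup, estimate its error, rank the subgroups) can be sketched as follows. This is a toy illustration under several assumptions. A nearest-centroid rule stands in for the SVM and MLP classifiers used in the slides. The data, class names and subgroup names are made up. The first stage is skipped, so the candidate subgroups are simply given as lists of gene indices.

```python
def nearest_centroid_predict(train_x, train_y, x):
    """Stand-in classifier (not an SVM or MLP): assign x to the class
    whose mean expression profile is closest in squared distance."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, y in zip(train_x, train_y) if y == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def loo_error(data, labels, gene_idx):
    """Leave-one-out error rate using only the genes in gene_idx."""
    sub = [[row[g] for g in gene_idx] for row in data]
    errors = 0
    for i in range(len(sub)):
        train_x = sub[:i] + sub[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if nearest_centroid_predict(train_x, train_y, sub[i]) != labels[i]:
            errors += 1
    return errors / len(sub)

def rank_subgroups(data, labels, subgroups):
    """Evaluate each candidate subgroup and sort by estimated error."""
    scored = [(loo_error(data, labels, idx), name)
              for name, idx in subgroups.items()]
    return sorted(scored)

# Made-up data: 6 samples x 4 genes. Genes 0-1 separate the two classes,
# genes 2-3 do not.
data = [
    [2.0, 1.8, 0.1, 0.9],
    [1.9, 2.1, 0.8, 0.2],
    [2.2, 2.0, 0.5, 0.5],
    [0.1, 0.2, 0.2, 0.8],
    [0.0, 0.1, 0.9, 0.1],
    [0.2, 0.0, 0.5, 0.6],
]
labels = ["class_1"] * 3 + ["class_2"] * 3
subgroups = {"informative": [0, 1], "uninformative": [2, 3]}  # hypothetical
ranking = rank_subgroups(data, labels, subgroups)
print(ranking)
```

The subgroup at the top of the ranking is the one whose genes, taken together, best separate the two classes under the chosen classifier and error estimate.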
Applying the heuristic method
1. Selecting “candidate” subgroups of genes:
We used biological knowledge and hierarchical clustering algorithms to select four subgroups:
• Proliferation: sets of genes involved in the biological process of proliferation
• T-cell: genes preferentially expressed in T-cells
• Lymphnode: sets of genes normally expressed in lymphnodes
• GCB: genes that distinguish germinal centre B-cells from other stages in B-cell ontogeny
2. Identify subgroups of genes most related to the separation GCB-like / AB-like
• Training of SVM, MLP and LP as classifiers using each subgroup of genes and all the subgroups together (All): 5 classification tasks
• Leave-one-out methods used with gaussian, polynomial and linear SVM
• 10-fold cross-validation with gaussian, polynomial and linear SVM, MLP and LP.
Learn. machine model   Gen. error (%)   St. dev.   Prec. (%)   Sens. (%)
SVM-linear             10.50            11.16       90.00       90.00
SVM-poly                8.70            14.54       96.67       88.33
SVM-RBF                 4.50             9.55      100.0        90.00
MLP                     8.70            10.50       90.90       90.90
LP                      8.70            10.50       90.90       90.90

Learn. machine model   Gen. error (%)   St. dev.   Prec. (%)   Sens. (%)
SVM-linear             15.00            11.16       85.00       85.00
SVM-poly               14.00            18.97       93.33       76.67
SVM-RBF                10.00            10.54      100.00       76.67
MLP                     8.70            13.28       95.00       86.36
LP                     10.87            14.28       86.96       90.90
The second problem: summary
• The results support the hypothesis of Alizadeh about the
existence of two distinct subgroups in DLBCL.
• The heuristic method identifies the GCB signature as a
cluster of coordinately expressed genes related to the
separation between the GCB-like and AB-like DLBCL
I. Methods to discover subclasses of tumors on molecular basis.
Integrating “a priori” biological knowledge, supervised machine learning methods and unsupervised clustering methods
Stratifying patients into molecularly relevant categories, enhancing the discrimination power and precision of
II. Methods to identify small subsets of genes correlated to tumors
- Refinements of the proposed heuristic method using clustering algorithms with semi-automatic selection of the number of the significant subgroups of genes.
- Greedy algorithms based on mutual information measures.
• Discovery of new [...] of tumors
• Enhancing biological [...] about
• Automatic diagnosis of tumors using DNA microchips
New perspectives on the development of new cancer therapeutics based on a molecular understanding of the cancer