On Some Problems in the Statistical Analysis of Microarray Gene-Expression Data
Geoff McLachlan
Department of Mathematics & Institute for Molecular Bioscience
University of Queensland
ARC Centre of Excellence in Bioinformatics

http://www.maths.uq.edu.au/~gjm
Outline of Talk

1. Supervised Classification of Tissue Samples (Expression Signatures)

2. Error Rate Estimation

3. Selection Bias (various types)

4. Applications to van 't Veer and van de Vijver data sets

5. Links with Detection of Differential Expression
Microarray Data represented as an N x M Matrix

[Schematic: a matrix with N rows (genes, N ~ 10^4) and M columns (tissue samples, M ~ 10^2); each row is an expression profile, each column an expression signature.]

With the tissue samples considered as the feature vectors and the genes as the variables, if we put n = M and p = N, we have the "big p, small n" problem.
There has been increasing interest in the use of
gene signatures in the diagnosis and prognosis
of diseases, and as a guide to adjuvant therapy.

e.g. Microarray-based studies of breast cancer in:
• van’t Veer et al. (2002, Nature)
• van de Vijver et al. (2002, NEJM)
van’t Veer et al. (2002)
Netherlands Cancer Institute (NKI) and Rosetta
Inpharmatics looked at 25,000-gene microarrays
on n=78 patients, all aged under 55 years.
Tumours <0.5 cm diameter, and no patient had
lymph node involvement. Patients were all
treated at the NKI, and outcome over 5 or more
years was known (34 with distant metastases
< 5 years; 44 with no recurrence in this period).
Breast tumours have a genetic signature. The expression
pattern of a set of 70 genes can predict whether a tumour
is going to prove lethal, despite treatment, or not.

“This gene expression profile will outperform all currently
used clinical parameters in predicting disease outcome.”
van ’t Veer et al. (2002), van de Vijver et al. (2002)
On 7/2/07 the FDA approved a new genetic test called MammaPrint®, "that determines the likelihood of breast cancer returning within five to 10 years after a woman's initial cancer." According to the FDA, the test, a product of Agendia (Amsterdam, the Netherlands), is the first cleared product that profiles genetic activity.

MINDACT (Microarray for Node Negative Disease may
Avoid Chemotherapy) prospective project, run by the
EORTC (European Organisation for Research and Treatment
of Cancer) and TRANSBIG, the translational research
network of the Breast International Group
Neven et al. (2008, Lancet Oncology 9, Are gene … factors?).

They conclude that:
“For now, whether microarray gene-expression
profiling for breast-cancer prognosis is better
than an optimised panel of clinical, objectively
measured, prognostic markers remains an open
question.”
The gene-expression profiling of primary breast tumours has resulted in the identification of many apparently different prognostic profiles/gene sets, which show little overlap in gene identity.

A comparison of the different prognostic profiles shows that, even though different gene sets and algorithms are being used with breast cancer patients, there is significant agreement in outcome predictions for individual patients.
Further work by van 't Veer and/or van de Vijver
1) Gene-expression profiling and the future of adjuvant therapy (2005, The Oncologist).

2) Validation and clinical utility of a 70-gene prognostic signature….(2006, Journal of the National Cancer Institute).

3) Use of 70-gene signature to predict prognosis of patients …. (2007, Oncology).

4) A functional genetic approach … in breast cancer (2007, Cancer Cell).

5) The macrophage-stimulating protein pathway … in a mouse model
for breast cancer …(2007, PNAS).

6) Revealing targeted therapy for human cancer …(2008, Cancer Research).

7) Pooling breast cancer datasets….(2008, BMC Genomics).

8) Comparison of prognostic gene expression signatures for breast cancer (2008, BMC Genomics).

9) Using microarray analysis as a prognostic and predictive tool in oncology: focus on breast cancer
We ignore here the question of whether one should combine the gene signatures with clinical factors, like oestrogen-receptor (ER) and progesterone-receptor (PR) status, ERBB2 status of the lesion, and other surrogate markers of proliferation, such as Ki-67, etc.
Supervised Classification of Tissue Samples

We OBSERVE the CLASS LABELS z1, …, zn, where z_ij = (z_j)_i = 1 if the jth tissue sample comes from the ith class, and zero otherwise (i = 1, …, g; j = 1, …, n), in the training set t.

AIM: TO CONSTRUCT A RULE r(y; t) FOR PREDICTING THE UNKNOWN CLASS LABEL z OF A TISSUE SAMPLE y.

e.g. g = 2 classes: G1 DISEASE-FREE, G2 METASTASES.
Supervised Classification (Two Classes)

[Schematic: a p-gene x n-sample expression matrix, with the samples divided into Class 1 (good prognosis) and Class 2 (poor prognosis).]
SUPERVISED CLASSIFICATION

In supervised classification, one seeks to construct a rule
that will allow one to assign objects to one of a prespecified
set of classes based solely on a vector of measurements
taken on those objects.

Construction of the rule is based on a “design set” or
“training set” of objects with known measurement vectors
and for which the true class is also known.
Such problems are ubiquitous and, as a consequence, have
been tackled in several different research areas.

As a result, a tremendous variety of algorithms and models
have been developed for the construction of such rules.

"Prediction is difficult, especially when it's about the future."
(Friedman (2006), quoting Yogi Berra)
The impact of population drift on supervised classification rules
is nicely described by the American philosopher Eric Hoffer,
who said, “In times of change, learners inherit the Earth, while
the learned find themselves beautifully equipped to deal
with a world that no longer exists.”

Hand (2006, Stat. Sci.)
Kostka and Spang (2008, PLoS Computational
Biology, Microarray based diagnosis profits from
better documentation of gene expression
signatures)

In forming the diagnostic rule r(y;t) for the
classification of a new patient with gene
expression signature y collected subsequent to the
training data t, we need to ensure that y and t are
on a consistent signature scale.
Overfitting – the antidote is to limit reliance
on the training data by not fitting it too
closely. This is the basic principle underlying
regularization.
Selection Bias

Bias that occurs when a subset of the
variables is selected (dimension
reduction) in some “optimal” way, and
then the predictive capability of this
subset is assessed in the usual way; i.e.
using an ordinary measure for a set of
variables.
Selection Bias

Discriminant Analysis:
McLachlan (1992 & 2004, Wiley, Chapter 12)

Regression:
Breiman (1992, JASA)

“This usage (i.e. use of residual of SS’s etc.) has
long been a quiet scandal in the statistical
community.”
Nature Reviews Cancer, Feb. 2005
LINEAR CLASSIFIER (g = 2 classes)

r(y; t) = 1 if c(y; t) > 0,
r(y; t) = 2 if c(y; t) < 0,

where

c(y; t) = β0 + βᵀy = β0 + β1 y1 + … + βp yp,

for the prediction of the class label z of a future entity with feature vector y. We denote this prediction by r(y; t), where t contains the training data.
FISHER'S LINEAR DISCRIMINANT FUNCTION

c(y; t) = β0 + βᵀy,

where

β = S⁻¹(ȳ1 − ȳ2)   and   β0 = −½ (ȳ1 + ȳ2)ᵀ S⁻¹ (ȳ1 − ȳ2),

and ȳ1, ȳ2, and S are the sample means and pooled sample covariance matrix found from the training data.
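The two formulas above translate directly into code. Below is a minimal numpy sketch; the function name and the simulated two-class data are illustrative, not from the talk:

```python
import numpy as np

def fisher_ldf(X1, X2):
    """Fisher's linear discriminant: returns (beta0, beta) so that
    c(y) = beta0 + beta @ y is positive for class 1, negative for class 2."""
    n1, n2 = len(X1), len(X2)
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # pooled sample covariance matrix S
    S = ((n1 - 1) * np.cov(X1, rowvar=False) +
         (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    beta = np.linalg.solve(S, m1 - m2)          # beta = S^{-1}(m1 - m2)
    beta0 = -0.5 * (m1 + m2) @ beta             # beta0 = -(1/2)(m1 + m2)^T beta
    return beta0, beta

# two well-separated Gaussian classes in p = 2 dimensions
rng = np.random.default_rng(0)
X1 = rng.normal(loc=[2.0, 0.0], size=(50, 2))
X2 = rng.normal(loc=[-2.0, 0.0], size=(50, 2))
beta0, beta = fisher_ldf(X1, X2)
# c(y) should be positive at the class-1 mean and negative at the class-2 mean
c1 = beta0 + beta @ X1.mean(axis=0)
c2 = beta0 + beta @ X2.mean(axis=0)
```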
SUPPORT VECTOR MACHINE
Vapnik (1995)

c(y; t) = β0 + βᵀy,

where β0 and β are obtained by solving

min over β0, β of   ½‖β‖² + γ Σⱼ ξⱼ

subject to   ξⱼ ≥ 0,   zⱼ c(yⱼ; t) ≥ 1 − ξⱼ   (j = 1, …, n),

where ξ1, …, ξn are the slack variables (all zero in the separable case).
From Hastie et al. (2001) The Elements of Statistical Learning
Error Rate Estimation

Suppose there are g groups G1, …, Gg, and r(y; t) is a rule formed from the classified training data

t = (y1, z1, y2, z2, y3, z3, …, yn, zn),

where z_ij = (z_j)_i = 1 if yⱼ belongs to Gᵢ, and zero otherwise (i = 1, …, g; j = 1, …, n).

The apparent error A is the proportion of the data set t misallocated by r(y; t).
Let r{y; t, s_d(t)} be a rule formed from the training data t, using the subset s_d(t) of d variables (genes) selected on the basis of t according to some criterion, and let Q[u, v] = 0 if u = v, and 1 otherwise.

The apparent error rate is then

A = (1/n) Σⱼ Q[zⱼ, r{yⱼ; t, s_d(t)}].
Cross-Validation
From the original training data set, remove y1
to give the reduced set
t(1)=(y2,z2,y3,z3,…, yn,zn)

Then form the rule r{y;t(1), sd(t(1))}
from this reduced set t(1).
Use r{y1;t(1), sd(t(1))} to allocate y1 to either G1
or G2.
Repeat this process for the second data point, y2, so that this point is assigned to either G1 or G2 on the basis of the rule r{y2; t(2), s_d(t(2))}, and so on up to yn.
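The leave-one-out procedure above, with the gene selection s_d(·) redone inside every fold, can be sketched as follows. This is an illustrative numpy implementation: the pooled two-sample t-statistic is used for selection (as in the van 't Veer example later in the talk), and a simple nearest-mean classifier stands in for whatever rule r is being assessed:

```python
import numpy as np

def select_top_genes(X, z, d):
    """Rank genes by absolute pooled two-sample t-statistic; keep the top d."""
    X1, X2 = X[z == 1], X[z == 2]
    n1, n2 = len(X1), len(X2)
    s2 = ((n1 - 1) * X1.var(axis=0, ddof=1) +
          (n2 - 1) * X2.var(axis=0, ddof=1)) / (n1 + n2 - 2)
    t = (X1.mean(axis=0) - X2.mean(axis=0)) / np.sqrt(s2 * (1/n1 + 1/n2) + 1e-12)
    return np.argsort(-np.abs(t))[:d]

def nearest_mean_rule(Xtr, ztr, y):
    """Assign y to the class with the nearer training mean (a stand-in rule)."""
    m1 = Xtr[ztr == 1].mean(axis=0)
    m2 = Xtr[ztr == 2].mean(axis=0)
    return 1 if np.linalg.norm(y - m1) <= np.linalg.norm(y - m2) else 2

def loocv_error(X, z, d):
    """External LOOCV: the gene selection is repeated inside every fold,
    so the deleted point never influences which genes are chosen."""
    n = len(z)
    errors = 0
    for j in range(n):
        keep = np.ones(n, dtype=bool)
        keep[j] = False
        genes = select_top_genes(X[keep], z[keep], d)   # selection on t(j) only
        pred = nearest_mean_rule(X[keep][:, genes], z[keep], X[j, genes])
        errors += int(pred != z[j])
    return errors / n

# simulated data: real signal confined to the first five genes
rng = np.random.default_rng(1)
n, p, d = 30, 200, 5
z = np.repeat([1, 2], n // 2)
X = rng.normal(size=(n, p))
X[z == 2, :5] += 3.0
err = loocv_error(X, z, d)
```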
n-FOLD CV

10-FOLD CV

Selection bias is increased if we use ten-fold cross-validation instead.

Ten-Fold Cross-Validation

[Schematic: the data are split into 10 parts; each part is used in turn as the test set, with the remaining nine parts as the training set.]
SUPPORT VECTOR MACHINE

c(y; t) = β0 + β1 y1 + … + βp yp.

For a given p, one can select the variables yᵥ (the genes) on the basis of the magnitudes of the fitted coefficients βᵥ.

Using this approach, GUYON, WESTON, BARNHILL & VAPNIK (2002, Machine Learning) proposed RECURSIVE FEATURE ELIMINATION (RFE).
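A hedged sketch of RFE: for self-containment this uses a ridge-regularised linear classifier in place of the SVM of Guyon et al.; the recursive loop (fit, rank genes by |βᵥ|, drop the lowest-ranked half) is the part being illustrated.

```python
import numpy as np

def ridge_linear_coefs(X, z, lam=1.0):
    """Fit a ridge-regularised linear score c(y) = b0 + b^T y to labels +/-1
    (an illustrative stand-in for the SVM used in the RFE of Guyon et al.)."""
    y = np.where(z == 1, 1.0, -1.0)
    Xc = np.hstack([np.ones((len(X), 1)), X])
    A = Xc.T @ Xc + lam * np.eye(Xc.shape[1])
    A[0, 0] -= lam                      # leave the intercept unpenalised
    b = np.linalg.solve(A, Xc.T @ y)
    return b[0], b[1:]

def rfe(X, z, n_keep, drop_frac=0.5):
    """Recursive feature elimination: refit, then drop the genes with the
    smallest |coefficient|, until n_keep genes remain."""
    active = np.arange(X.shape[1])
    while len(active) > n_keep:
        _, beta = ridge_linear_coefs(X[:, active], z)
        n_next = max(n_keep, int(len(active) * drop_frac))
        order = np.argsort(-np.abs(beta))          # largest |beta_v| first
        active = active[order[:n_next]]
    return np.sort(active)

rng = np.random.default_rng(2)
n, p = 40, 50
z = np.repeat([1, 2], n // 2)
X = rng.normal(size=(n, p))
X[z == 2, :2] += 4.0                   # only genes 0 and 1 carry signal
kept = rfe(X, z, n_keep=5)             # the two signal genes should survive
```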
Figure 1: Error rates of the SVM rule with RFE procedure
averaged over 50 random splits of colon tissue samples
Figure 2: Error rates of the SVM rule with RFE procedure
averaged over 50 random splits of leukemia tissue samples
Isaksson et al. (2008, Pattern Recognition Letters,
Cross-validation and bootstrapping are unreliable in
small sample classification)

CV estimate for small sample sizes is “practically
useless for prediction of the true performance.”

They focussed on the relative error of the CV
estimate,
|CV estimate – true error rate |/true error rate
BOOTSTRAP APPROACH

Efron's (1983, JASA) .632 estimator:

B.632 = .368 A + .632 B1,

where B1 is the bootstrap error when the rule R*ₖ (formed from the kth bootstrap sample) is applied to the points not in that sample. A Monte Carlo estimate of B1 is

B1 = Σⱼ Eⱼ / n,

where

Eⱼ = Σₖ Iⱼₖ Qⱼₖ / Σₖ Iⱼₖ   (k = 1, …, K),

with

Iⱼₖ = 1 if xⱼ ∉ kth bootstrap sample, and 0 otherwise;
Qⱼₖ = 1 if R*ₖ misallocates xⱼ, and 0 otherwise.
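A Monte Carlo sketch of B1 and the .632 estimator, again with a nearest-mean rule standing in for R*ₖ (the rule, data, and settings are illustrative choices):

```python
import numpy as np

def nearest_mean_rule(Xtr, ztr, Y):
    """Classify each row of Y to the class with the nearer training mean."""
    m1 = Xtr[ztr == 1].mean(axis=0)
    m2 = Xtr[ztr == 2].mean(axis=0)
    d1 = np.linalg.norm(Y - m1, axis=1)
    d2 = np.linalg.norm(Y - m2, axis=1)
    return np.where(d1 <= d2, 1, 2)

def b632_estimate(X, z, K=50, seed=0):
    """Efron's .632 estimator: B.632 = .368*A + .632*B1, where B1 averages the
    error of each bootstrap rule over the points left out of its sample."""
    rng = np.random.default_rng(seed)
    n = len(z)
    A = np.mean(nearest_mean_rule(X, z, X) != z)   # apparent error A
    err_sum = np.zeros(n)
    cnt = np.zeros(n)
    for _ in range(K):
        idx = rng.integers(0, n, size=n)           # kth bootstrap sample
        if len(np.unique(z[idx])) < 2:
            continue                               # need both classes to fit
        out = np.setdiff1d(np.arange(n), idx)      # points with I_jk = 1
        pred = nearest_mean_rule(X[idx], z[idx], X[out])
        err_sum[out] += (pred != z[out])
        cnt[out] += 1
    B1 = np.mean(err_sum[cnt > 0] / cnt[cnt > 0])  # average of the E_j
    return 0.368 * A + 0.632 * B1

rng = np.random.default_rng(3)
n = 40
z = np.repeat([1, 2], n // 2)
X = rng.normal(size=(n, 5))
X[z == 2, 0] += 3.0                    # one informative feature
est = b632_estimate(X, z)
```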

Toussaint & Sharpe (1975) proposed the ERROR RATE ESTIMATOR

A(w) = (1 − w) A + w A(2CV),

where w = 0.5.

McLachlan (1977) proposed w = w0, where w0 is chosen to minimize the asymptotic bias of A(w) in the case of two homoscedastic normal groups. The value of w0 was found to range between 0.6 and 0.7, depending on the values of p, Δ, and n1/n2.
The 0.632 estimator fails for methods that fit
the data heavily, such as the one-nearest neighbour
rule for which the apparent error A is zero.

This led Efron and Tibshirani (1997) to propose the
0.632+ estimator.
.632+ estimate of Efron & Tibshirani (1997, JASA):

B.632+ = (1 − w) A + w B1,

where

w = .632 / (1 − .368 r),

r = (B1 − A) / (γ̂ − A)   (relative overfitting rate),

γ̂ = Σᵢ pᵢ (1 − qᵢ)   (estimate of the no-information error rate),

with pᵢ the observed proportion of cases in class Gᵢ and qᵢ the proportion of cases assigned to Gᵢ by the rule (i = 1, …, g).

If r = 0, then w = .632, and so B.632+ = B.632; if r = 1, then w = 1, and so B.632+ = B1.
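The .632+ formulas can be packaged as a small function of A, B1, and the class proportions pᵢ and prediction proportions qᵢ (the clipping of r to [0, 1] follows Efron & Tibshirani; the function name is illustrative):

```python
def b632_plus(A, B1, p, q):
    """Efron & Tibshirani's .632+ estimator from the apparent error A, the
    leave-out bootstrap error B1, the observed class proportions p, and the
    proportions q of cases predicted into each class."""
    gamma = sum(pi * (1.0 - qi) for pi, qi in zip(p, q))  # no-information rate
    if B1 > A and gamma > A:
        r = (B1 - A) / (gamma - A)     # relative overfitting rate
    else:
        r = 0.0
    r = min(max(r, 0.0), 1.0)          # clip r to [0, 1]
    w = 0.632 / (1.0 - 0.368 * r)
    return (1.0 - w) * A + w * B1

# r = 0 reduces to the ordinary .632 estimator; r = 1 returns B1 itself
e0 = b632_plus(0.10, 0.10, [0.5, 0.5], [0.5, 0.5])
e1 = b632_plus(0.00, 0.50, [0.5, 0.5], [0.5, 0.5])
```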
Figure 1: Error rates of the SVM rule with RFE procedure
averaged over 50 random splits of colon tissue samples
• Fu, Carroll, and Wang (2005, Bioinformatics,
Estimating misclassification error with small
samples via bootstrap cross-validation)

• Jiang and Simon (2007, Statistics in Medicine,
A comparison of bootstrap methods and an
adjusted bootstrap approach for estimating the
prediction error in microarray classification)
1. We have just seen the selection bias that occurs if CV is not performed correctly.

2. There is also a selection bias in not having "all" the genes available to assess the performance of one's prediction rule.
Example: van 't Veer et al. (2002) selected the top 70 genes (according to the pooled two-sample t-statistic) from a screened set of 5,422 genes (filtered from a total of 24,481 genes) in 78 tissue samples on breast cancer patients categorized into 2 classes (corresponding to good and bad prognosis).

Only the expression values of these top 70 genes were initially made available for these 78 tissue samples. It is not possible to assess the performance of a rule formed from these top 70 genes on new patients if the expression values of no more genes are available.
| Number of Genes | Error Rate for Top 70 Genes (without correction for Selection Bias as Top 70) | Error Rate for Top 70 Genes (with correction for Selection Bias as Top 70) | Error Rate for 5422 Genes |
|---|---|---|---|
| 1 | 0.40 | 0.40 | 0.40 |
| 2 | 0.31 | 0.37 | 0.40 |
| 4 | 0.24 | 0.40 | 0.42 |
| 8 | 0.23 | 0.32 | 0.29 |
| 16 | 0.22 | 0.33 | 0.38 |
| 32 | 0.19 | 0.35 | 0.38 |
| 64 | 0.23 | 0.32 | 0.33 |
| 70 | 0.23 | 0.32 | - |
| 128 | - | - | 0.32 |
| 256 | - | - | 0.29 |
| 512 | - | - | 0.31 |
| 1024 | - | - | 0.32 |
| 2048 | - | - | 0.35 |
| 4096 | - | - | 0.37 |
| 5422 | - | - | 0.37 |
70 
Here t Bk k denotes the training data on only the top 70
genes selected using t  B 
k
1. We have just seen the selection bias that occurs for an optimal subset of fixed size if CV is not performed correctly,

2. and in not using "all" the genes available to assess the performance of one's prediction rule.

3. Selection bias can also occur if the optimal subset is of unrestricted size.
Table 2: Ten-fold cross-validated error rates for SVM based on the subset with minimum error rate in the RFE process, with and without bias correction for optimization over subsets of varying sizes

| Dataset | Error rate without bias correction for varying size of optimal subset | Error rate with bias correction for varying size of optimal subset |
|---|---|---|
| van 't Veer (5422 genes) | 0.29 | 0.37 |
Let r{y; t, s*(t)} denote the rule found using the subset of genes s*(t) that minimizes A(10CV){s_d(t)} over all values of d considered; i.e.

s*(t) = arg min over d of A(10CV){s_d(t)}.
Ten-Fold Cross-Validation

[Schematic: the data are split into 10 parts; each part is used in turn as the test set, with the remaining nine parts as the training set.]
Nearest-Shrunken Centroids
(Tibshirani et al., 2002)

The usual estimates of the class means ȳᵢ are shrunk toward the overall mean ȳ of the data, where

ȳᵢ = Σⱼ z_ij yⱼ / nᵢ   and   ȳ = Σⱼ yⱼ / n.

The nearest-centroid rule assigns y to the class i minimizing

Σᵥ (yᵥ − ȳᵢᵥ)² / sᵥ²,

where yᵥ is the vth element of the feature vector y, ȳᵢᵥ = (ȳᵢ)ᵥ, and sᵥ is the pooled within-class standard deviation of the vth gene.

In this definition, we replace the sample mean ȳᵢᵥ of the vth gene by its shrunken estimate

ȳ*ᵢᵥ = ȳᵥ + mᵢ sᵥ d*ᵢᵥ   (i = 1, …, g; v = 1, …, p),

where

dᵢᵥ = (ȳᵢᵥ − ȳᵥ) / (mᵢ sᵥ),

d*ᵢᵥ = sign(dᵢᵥ) (|dᵢᵥ| − Δ)₊   (soft thresholding by Δ),

and

mᵢ = (nᵢ⁻¹ − n⁻¹)^(1/2).
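A sketch of the shrunken-centroid computation and the resulting rule. The small positive constant s0 that Tibshirani et al. add to sᵥ is omitted here for simplicity, and the data and threshold are illustrative:

```python
import numpy as np

def shrunken_centroids(X, z, delta):
    """Per-class shrunken centroids: ybar_i* = ybar + m_i * s * d_i*, with
    d_i* the soft-thresholded standardised deviation (threshold delta)."""
    classes = np.unique(z)
    n, p = X.shape
    ybar = X.mean(axis=0)
    # pooled within-class standard deviation s_v of each gene
    within = sum(((X[z == i] - X[z == i].mean(axis=0)) ** 2).sum(axis=0)
                 for i in classes)
    s = np.sqrt(within / (n - len(classes)))
    centroids = {}
    for i in classes:
        ni = int((z == i).sum())
        mi = np.sqrt(1.0 / ni - 1.0 / n)
        d = (X[z == i].mean(axis=0) - ybar) / (mi * s + 1e-12)
        d_star = np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)  # soft threshold
        centroids[i] = ybar + mi * s * d_star
    return centroids, s

def classify(y, centroids, s):
    """Assign y to the class whose shrunken centroid is nearest in the
    s_v-standardised distance."""
    scores = {i: (((y - c) / (s + 1e-12)) ** 2).sum() for i, c in centroids.items()}
    return min(scores, key=scores.get)

rng = np.random.default_rng(4)
n, p = 40, 100
z = np.repeat([1, 2], n // 2)
X = rng.normal(size=(n, p))
X[z == 2, :3] += 3.0                   # signal confined to three genes
centroids, s = shrunken_centroids(X, z, delta=2.0)
# with strong shrinkage, most noise genes collapse to the overall mean
equal = np.isclose(centroids[1], centroids[2]).sum()
pred = classify(X[z == 1].mean(axis=0), centroids, s)
```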
Comparison of Nearest-Shrunken Centroids with SVM

Apply (i) nearest-shrunken centroids and
(ii) the SVM with RFE
to colon data set of Alon et al. (1999), with
N = 2000 genes and M = 62 tissues
(40 tumours, 22 normals)
Nearest-Shrunken Centroids applied to Alon data: (a) overall error rates; (b) class-specific error rates.

SVM with RFE applied to Alon data: (a) overall error rates; (b) class-specific error rates.
Guo, Hastie, and Tibshirani (2007, Biostatistics)
generalize the idea of nearest shrunken centroids
(NSC) with SCRDA (shrunken centroids
regularized discriminant analysis).

Efron (2008, Stanford Technical Report) shows
the close connection between an empirical Bayes
approach to large-scale prediction and NSC (a
frequentist regularization approach).
van’t Veer et al. (2002)
Netherlands Cancer Institute (NKI) and Rosetta
Inpharmatics looked at 25,000-gene microarrays
on n=78 patients, all aged under 55 years.
Tumours <0.5 cm diameter, and no patient had
lymph node involvement. Patients were all
treated at the NKI, and outcome over 5 or more
years was known (34 with distant metastases
< 5 years; 44 with no recurrence in this period).
van de Vijver et al. (2002) considered a further 234 breast cancer tumours, but initially only made available the data for the top 70 genes based on the previous study of van 't Veer et al. (2002).
Table 3: Cross-validated error rates starting with 70 genes versus all available genes, with missing values in the latter handled by k-NN estimate, for predicted and true labels

| Number of genes | External CV error rate using 70 genes (without correction for bias from using just these 70 genes) | External CV error rate using original 24,481 genes (using predicted labels) | External CV error rate using original 24,481 genes (using true labels) |
|---|---|---|---|
| 1 | 0.29 | 0.42 | 0.31 |
| 2 | 0.17 | 0.38 | 0.31 |
| 4 | 0.20 | 0.38 | 0.34 |
| 8 | 0.13 | 0.31 | 0.35 |
| 16 | 0.11 | 0.23 | 0.33 |
| 32 | 0.09 | 0.22 | 0.30 |
| 64 | 0.10 | 0.19 | 0.30 |
| 70 | 0.10 | - | - |
| 128 | - | 0.16 | 0.26 |
| 256 | - | 0.15 | 0.29 |
| 512 | - | 0.14 | 0.29 |
| 1024 | - | 0.15 | 0.28 |
| 2048 | - | 0.14 | 0.27 |
| 4096 | - | 0.14 | 0.27 |
| 8192 | - | 0.16 | 0.27 |
| 16384 | - | 0.17 | 0.29 |
| 24481 | - | 0.18 | 0.27 |
Other Associated Topics
• Selection of biomarkers

• Detection of differential expression

•   Multiple hypothesis testing and control of FDR

Use of z-scores, and then fitting of normal mixture models to them, to select genes on the basis of the local FDR (McLachlan et al., 2006, Bioinformatics); the concept of the empirical null was proposed by Efron (2006).

One can also use this approach to select useful features, e.g. the approach of Donoho and Jin (PNAS, Sept. 2008) using higher-criticism thresholding.
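A toy sketch of the mixture-model idea: fit a g = 2 normal mixture to the gene z-scores by EM and read off the local FDR as the posterior probability of the null component. This is a simplified illustration, with no empirical-null refinement as in Efron (2006); the function name, initialisation, and simulated data are all illustrative:

```python
import numpy as np

def local_fdr(z, n_iter=300):
    """EM fit of a two-component normal mixture to z-scores; the local FDR of
    gene j is the posterior probability that z_j came from the null component."""
    def npdf(x, mu, sd):
        return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
    pi = np.array([0.9, 0.1])                     # mixing proportions
    mu = np.array([0.0, np.quantile(z, 0.95)])    # null component centred at 0
    sd = np.array([1.0, 1.0])
    for _ in range(n_iter):
        # E-step: posterior probabilities tau[j, i] of component i for gene j
        dens = np.stack([pi[0] * npdf(z, mu[0], sd[0]),
                         pi[1] * npdf(z, mu[1], sd[1])], axis=1)
        tau = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update proportions, means, and standard deviations
        nk = tau.sum(axis=0)
        pi = nk / len(z)
        mu = (tau * z[:, None]).sum(axis=0) / nk
        sd = np.sqrt((tau * (z[:, None] - mu) ** 2).sum(axis=0) / nk)
    return tau[:, 0]

# simulated z-scores: 900 null genes ~ N(0,1), 100 differentially expressed ~ N(4,1)
rng = np.random.default_rng(5)
z = np.concatenate([rng.normal(0.0, 1.0, 900), rng.normal(4.0, 1.0, 100)])
fdr = local_fdr(z)   # small for extreme z-scores, near 1 for z-scores near 0
```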
• Zhu, J.X., McLachlan, G.J., Ben-Tovim Jones, L., and Wood, I.
(2008). On selection biases with prediction rules formed from
gene expression data. Journal of Statistical Planning and
Inference 138, 374-386.

• McLachlan, G.J., Chevelu, J., and Zhu, J. (2008). Correcting for
selection bias via cross-validation in the classification of
microarray data. In Beyond Parametrics in Interdisciplinary
Research: Festschrift in Honour of Professor Pranab K. Sen, N.
Balakrishnan, E. Pena, and M.J. Silvapulle (Eds.). Hayward, California: IMS
Collections, pp. 383-395.
• Wood, I.A., Visscher, P.M., and Mengersen, K.L. (2007).
Classification based upon gene expression data: bias and precision
of error rates. Bioinformatics 23, 1363-1370.
http://www.maths.uq.edu.au/bioinformatics/
RMagpie
RMagpie is an R package designed for both biologists and statisticians. It offers
the ability to train a classifier on a labelled microarray dataset and to then use
that classifier to predict the class of new observations. A range of modern
classifiers are available, including support vector machines (SVMs) and nearest
shrunken centroids (NSCs). Advanced methods are provided to estimate the
predictive error rate and to report the subset of genes which appear effective in
discriminating between classes.
Carrak

Carrak is software with a web interface, designed for both biologists and statisticians. It uses the RMagpie R package to offer the same capabilities through the browser: training a classifier on a labelled microarray dataset, using that classifier to predict the class of new observations, estimating the predictive error rate, and reporting the subset of genes which appear effective in discriminating between classes.
