On Some Problems
in the Statistical Analysis of
Microarray Gene-Expression Data

Geoff McLachlan
Department of Mathematics & Institute for Molecular Bioscience
University of Queensland
ARC Centre of Excellence in Bioinformatics

http://www.maths.uq.edu.au/~gjm
                     Outline of Talk


1. Supervised Classification of Tissue Samples (Expression
   Signatures)

2. Error Rate Estimation

3. Selection Bias (various types)

4. Applications to the van 't Veer and van de Vijver data sets

5. Links with Detection of Differential Expression
Microarray Data Represented as an N x M Matrix

[Schematic: rows are genes (N ~ 10^4), columns are tissue samples (M ~ 10^2).
A column is the expression signature of a sample; a row is the expression
profile of a gene across the samples.]
With the tissue samples considered as the
feature vectors and the genes as the variables,
 if we put n=M and p=N, we have the

           big p, small n problem.
 There has been increasing interest in the use of
  gene signatures in the diagnosis and prognosis
  of diseases, and as a guide to adjuvant therapy.

e.g. Microarray-based studies of breast cancer in:
• van’t Veer et al. (2002, Nature)
• van de Vijver et al. (2002, NEJM)
       van’t Veer et al. (2002)
Netherlands Cancer Institute (NKI) and Rosetta
Inpharmatics looked at 25,000-gene microarrays
on n = 78 patients, all aged under 55 years.
Tumours were <5 cm in diameter, and no patient had
lymph node involvement. Patients were all
treated at the NKI, and outcome over 5 or more
years was known (34 with distant metastases
within 5 years; 44 with no recurrence in this period).
Breast tumours have a genetic signature. The expression
pattern of a set of 70 genes can predict whether a tumour
is going to prove lethal, despite treatment, or not.

“This gene expression profile will outperform all currently
used clinical parameters in predicting disease outcome.”
                      van ’t Veer et al. (2002), van de Vijver et al. (2002)
On 7/2/07 the FDA approved a new genetic test called
MammaPrint®, "that determines the likelihood of breast
cancer returning within five to 10 years after a woman's
initial cancer." According to the FDA, the test, a product of
Agendia (Amsterdam, the Netherlands), is the first cleared
product that profiles genetic activity.

MINDACT (Microarray In Node-negative Disease may
Avoid ChemoTherapy), a prospective project run by the
EORTC (European Organisation for Research and Treatment
of Cancer) and TRANSBIG, the translational research
network of the Breast International Group.
Neven et al. (2008, Lancet 9, Are gene
signatures better than traditional clinical
factors?).

 They conclude that:
“For now, whether microarray gene-expression
 profiling for breast-cancer prognosis is better
 than an optimised panel of clinical, objectively
 measured, prognostic markers remains an open
 question.”
The gene-expression profiling of primary breast
tumours has resulted in the identification of many
apparently different prognostic profiles/gene sets,
which show little overlap in gene identity.

A comparison of the different prognostic profiles
shows that, even though different gene sets and
algorithms are being used with breast cancer
patients, there is significant agreement in
outcome predictions for individual patients.
Further work by van 't Veer and/or van de Vijver
1) Gene-expression profiling and the future of adjuvant therapy (2005, The Oncologist).

2) Validation and clinical utility of a 70-gene prognostic signature….(2006, Journal of the National
    Cancer Institute).

3) Use of 70-gene signature to predict prognosis of patients …. (2007, Oncology).

4) A functional genetic approach … in breast cancer (2007, Cancer Cell).

5) The macrophage-stimulating protein pathway … in a mouse model
   for breast cancer …(2007, PNAS).

6) Revealing targeted therapy for human cancer …(2008, Cancer Research).

7) Pooling breast cancer datasets….(2008, BMC Genomics).

8) Comparison of prognostic gene expression signatures for breast cancer (2008, BMC Genomics).

9) Using microarray analysis as a prognostic and predictive tool in oncology: focus on breast cancer
    (2008, Seminars in Radiation Oncology).
Ignoring here whether one should combine the
gene signatures with clinical factors, such as oestrogen-
receptor (ER) and progesterone-receptor (PR) status,
ERBB2 status of the lesion, and other surrogate markers
of proliferation, such as Ki-67, etc.
 Supervised Classification of Tissue Samples


We OBSERVE the CLASS LABELS z1, …, zn, where
zij = (zj)i = 1 if the jth tissue sample comes from the ith
class, and zero otherwise (i = 1,…,g; j = 1,…,n), in the
training set t.
AIM: TO CONSTRUCT A RULE r(y; t) FOR
PREDICTING THE UNKNOWN CLASS LABEL
z OF A TISSUE SAMPLE y.
 e.g.    g = 2 classes   G1 - DISEASE-FREE
                         G2 - METASTASES
Supervised Classification (Two Classes)

[Schematic: expression matrix with Gene 1, ..., Gene p as rows and
Sample 1, ..., Sample n as columns; the samples are labelled as
Class 1 (good prognosis) or Class 2 (poor prognosis).]
       SUPERVISED CLASSIFICATION

In supervised classification, one seeks to construct a rule
that will allow one to assign objects to one of a prespecified
set of classes based solely on a vector of measurements
taken on those objects.

Construction of the rule is based on a “design set” or
“training set” of objects with known measurement vectors
and for which the true class is also known.
Such problems are ubiquitous and, as a consequence, have
been tackled in several different research areas.

As a result, a tremendous variety of algorithms and models
have been developed for the construction of such rules.

"Prediction is difficult, especially when it's about the future."
(Friedman, 2006, quoting Yogi Berra)
The impact of population drift on supervised classification rules
is nicely described by the American philosopher Eric Hoffer,
who said, “In times of change, learners inherit the Earth, while
the learned find themselves beautifully equipped to deal
with a world that no longer exists.”

Hand (2006, Stat. Sci.)
Kostka and Spang (2008, PLoS Computational
Biology, Microarray based diagnosis profits from
better documentation of gene expression
signatures)

In forming the diagnostic rule r(y;t) for the
classification of a new patient with gene
expression signature y collected subsequent to the
training data t, we need to ensure that y and t are
on a consistent signature scale.
Overfitting – the antidote is to limit reliance
on the training data by not fitting it too
closely. This is the basic principle underlying
regularization.
Selection Bias

Bias that occurs when a subset of the
variables is selected (dimension
reduction) in some "optimal" way, and
the predictive capability of this subset
is then assessed in the usual way; i.e.
using an ordinary error-rate estimate, as if
the subset had been specified in advance.
            Selection Bias

Discriminant Analysis:
McLachlan (1992 & 2004, Wiley, Chapter 12)

Regression:
Breiman (1992, JASA)

"This usage (i.e. use of residual SS's, etc.) has
 long been a quiet scandal in the statistical
 community."
Nature Reviews Cancer, Feb. 2005
LINEAR CLASSIFIER (g = 2 classes)

r(y; t) = 1  if c(y; t) > 0,
        = 2  if c(y; t) < 0,

where

c(y; t) = \beta_0 + \beta^T y = \beta_0 + \beta_1 y_1 + \cdots + \beta_p y_p

for the prediction of the class label z of a future entity with feature
vector y.

We denote this prediction by r(y; t), where t contains the training data.
FISHER'S LINEAR DISCRIMINANT FUNCTION

c(y; t) = \beta_0 + \beta^T y,

where

\beta = S^{-1}(\bar{y}_1 - \bar{y}_2),

\beta_0 = -\tfrac{1}{2}(\bar{y}_1 + \bar{y}_2)^T S^{-1}(\bar{y}_1 - \bar{y}_2),

and \bar{y}_1, \bar{y}_2, and S are the sample means and pooled sample
covariance matrix found from the training data.
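As a concrete illustration, here is a minimal numpy sketch of Fisher's rule. The function name and interface are my own; it assumes the two class samples arrive as arrays and that the pooled covariance matrix is nonsingular, which in the big p, small n setting requires a prior reduction of the genes or some form of regularization.

    import numpy as np

    def fisher_ldf(X1, X2):
        """Fisher's linear discriminant from two training samples.

        X1, X2 : arrays of shape (n1, p) and (n2, p) holding the feature
        vectors of classes G1 and G2.  Returns (beta0, beta) so that a new
        feature vector y is assigned to G1 if beta0 + beta @ y > 0, else G2.
        """
        n1, n2 = len(X1), len(X2)
        m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
        # Pooled sample covariance matrix (singular when p >= n1 + n2 - 2,
        # so in practice the genes are first reduced or S is regularized).
        S = ((n1 - 1) * np.cov(X1, rowvar=False)
             + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
        beta = np.linalg.solve(S, m1 - m2)   # S^{-1}(ybar1 - ybar2)
        beta0 = -0.5 * (m1 + m2) @ beta      # -(1/2)(ybar1 + ybar2)^T S^{-1}(ybar1 - ybar2)
        return beta0, beta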
SUPPORT VECTOR MACHINE
Vapnik (1995)

c(y; t) = \beta_0 + \beta^T y,

where \beta_0 and \beta are obtained as follows:

\min_{\beta, \beta_0} \; \tfrac{1}{2}\|\beta\|^2 + \gamma \sum_{j=1}^{n} \xi_j

subject to \xi_j \ge 0 and z_j c(y_j) \ge 1 - \xi_j  (j = 1, ..., n),

where \xi_1, ..., \xi_n are the slack variables (\gamma = \infty gives the
separable case).

[Figure: margin and slack variables, from Hastie et al. (2001), The Elements
of Statistical Learning.]
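A hedged sketch of fitting such a soft-margin linear SVM with scikit-learn, whose C parameter plays the role of \gamma above; the toy expression matrix and labels are invented purely for illustration.

    import numpy as np
    from sklearn.svm import SVC

    # Toy stand-ins for a training set t: X is n x p (samples by genes),
    # z holds the class labels; both are assumptions for illustration only.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 200))
    z = np.repeat([1, 2], 20)

    # C is the cost attached to the slack variables (the gamma above);
    # a linear kernel gives the classifier c(y; t) = beta0 + beta^T y.
    svm = SVC(kernel="linear", C=1.0).fit(X, z)
    beta, beta0 = svm.coef_.ravel(), svm.intercept_[0]
    print(svm.predict(X[:5]))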
              Error Rate Estimation
 Suppose there are g groups G1,…,Gg,
 r(y;t) is a rule formed from the
 classified training data t,
          (y1,z1, y2,z2, y3,z3,…, yn,zn),
where zij= (zj)i =1, if yj belongs to Gi, and zero otherwise
(i=1,…,g; j=1,…,n).
   The apparent error A is the proportion
   of the data set t misallocated by r(y; t).
Let r{yj; t, sd(t)} be the rule formed from the
training data t, using the subset sd(t) of variables
(genes) selected on the basis of t according to some
criterion.

Let Q[u, v] = 0 if u = v, and 1 otherwise.

Apparent Error Rate

A\{s_d(t)\} = n^{-1} \sum_{j=1}^{n} \sum_{i=1}^{g} z_{ij} \, Q[i, r\{y_j; t, s_d(t)\}],

the proportion of the training data t misallocated by the rule.
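In code, the apparent error is simply the resubstitution error of the fitted rule on its own training data. A minimal sketch, assuming `rule` is any fitted classifier with a scikit-learn-style predict method (the names are illustrative):

    import numpy as np

    def apparent_error(rule, X_train, z_train):
        """Apparent error A: the proportion of the training data t that the
        rule r(y; t) misallocates when reapplied to t itself."""
        return np.mean(rule.predict(X_train) != z_train)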
            Cross-Validation
From the original training data set, remove y1
to give the reduced set
         t(1)=(y2,z2,y3,z3,…, yn,zn)

Then form the rule r{y;t(1), sd(t(1))}
from this reduced set t(1).
Use r{y1;t(1), sd(t(1))} to allocate y1 to either G1
or G2.
Repeat this process for the second data point, y2, so that this
point is assigned to either G1 or G2 on the basis of the rule
r{y2; t(2), sd(t(2))}.

And so on up to yn.
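The essential point is that the gene selection sd(·) is redone inside every fold. A sketch of this external cross-validation with scikit-learn, on toy noise data (so the true error rate is 0.5); the contrast with the biased "internal" CV, where the genes are selected once from all of t, illustrates the selection bias. All data and parameter choices here are illustrative assumptions.

    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.model_selection import LeaveOneOut, cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC

    rng = np.random.default_rng(1)
    X = rng.normal(size=(50, 2000))     # n = 50 samples, p = 2000 genes of pure noise
    z = np.repeat([1, 2], 25)

    # Correct (external) CV: the top-k gene selection is redone inside every
    # leave-one-out fold, before the held-out sample is allocated.
    pipe = make_pipeline(SelectKBest(f_classif, k=20), SVC(kernel="linear"))
    external_cv = 1 - cross_val_score(pipe, X, z, cv=LeaveOneOut()).mean()

    # Biased (internal) CV: genes selected once from ALL of t, then cross-validated.
    keep = SelectKBest(f_classif, k=20).fit(X, z).get_support()
    internal_cv = 1 - cross_val_score(SVC(kernel="linear"),
                                      X[:, keep], z, cv=LeaveOneOut()).mean()

    print(f"external CV error: {external_cv:.2f}")   # close to the true 0.5
    print(f"internal CV error: {internal_cv:.2f}")   # typically well below 0.5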
[Diagrams: n-fold (leave-one-out) CV and 10-fold CV.]
The selection bias is increased if we instead use ten-fold cross-validation.

Ten-Fold Cross-Validation

[Diagram: the data are split into 10 blocks; one block is held out as the
test set and the remaining nine form the training set, with each block
taking its turn as the test set.]
SUPPORT VECTOR MACHINE

c(y; t) = \beta_0 + \beta_1 y_1 + \cdots + \beta_p y_p

For a given p, we can select the variables y_v (the genes) on the basis of
the magnitude of the fitted coefficients \hat{\beta}_v.

Using this approach, Guyon, Weston, Barnhill & Vapnik (2002, Machine
Learning) proposed:

RECURSIVE FEATURE ELIMINATION (RFE)
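A sketch of RFE as implemented in scikit-learn (not the authors' original code): a linear SVM is refitted repeatedly and the genes with the smallest |\hat{\beta}_v| are dropped at each pass. The toy data, step size, and target subset size are illustrative assumptions.

    import numpy as np
    from sklearn.feature_selection import RFE
    from sklearn.svm import SVC

    rng = np.random.default_rng(2)
    X = rng.normal(size=(50, 2000))   # toy expression matrix, 50 samples x 2000 genes
    z = np.repeat([1, 2], 25)

    selector = RFE(estimator=SVC(kernel="linear"),
                   n_features_to_select=64,   # keep the top 64 genes
                   step=0.5)                  # drop half of the remaining genes each pass
    selector.fit(X, z)
    top_genes = selector.get_support(indices=True)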
Figure 1: Error rates of the SVM rule with RFE procedure
 averaged over 50 random splits of colon tissue samples
Figure 2: Error rates of the SVM rule with RFE procedure
averaged over 50 random splits of leukemia tissue samples
Isaksson et al. (2008, Pattern Recognition Letters,
 Cross-validation and bootstrapping are unreliable in
 small sample classification)

CV estimate for small sample sizes is “practically
useless for prediction of the true performance.”

They focussed on the relative error of the CV estimate,

|CV estimate - true error rate| / true error rate.
BOOTSTRAP APPROACH
Efron's (1983, JASA) .632 estimator

B^{.632} = 0.368 A + 0.632 B1,

where B1 is the bootstrap error when the rule R_k^* (formed from the kth
bootstrap sample) is applied to a point not in that bootstrap sample.

A Monte Carlo estimate of B1 is

B1 = \sum_{j=1}^{n} E_j / n,

where

E_j = \sum_{k=1}^{K} I_{jk} Q_{jk} \Big/ \sum_{k=1}^{K} I_{jk},

with

I_{jk} = 1 if x_j is not in the kth bootstrap sample, and 0 otherwise;
Q_{jk} = 1 if R_k^* misallocates x_j, and 0 otherwise.
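A Monte Carlo sketch of B1 and the .632 estimator, assuming `rule` is a scikit-learn-style classifier and X, z are numpy arrays; points that happen to appear in every bootstrap sample are simply skipped in the average. The function name and defaults are my own.

    import numpy as np
    from sklearn.base import clone

    def b632(rule, X, z, K=200, random_state=0):
        """Monte Carlo version of Efron's .632 estimator.

        For each of K bootstrap samples, R*_k is trained on the bootstrap
        sample and errors are accumulated only over points x_j left out of
        that sample (I_jk = 1).  Returns (B.632, A, B1)."""
        rng = np.random.default_rng(random_state)
        n = len(z)
        miss = np.zeros(n)    # running sum of I_jk * Q_jk for each point
        out = np.zeros(n)     # running sum of I_jk for each point
        for _ in range(K):
            idx = rng.integers(0, n, size=n)             # kth bootstrap sample
            left_out = np.setdiff1d(np.arange(n), idx)   # points with I_jk = 1
            if left_out.size == 0:
                continue
            r_k = clone(rule).fit(X[idx], z[idx])
            miss[left_out] += (r_k.predict(X[left_out]) != z[left_out])
            out[left_out] += 1
        ok = out > 0
        B1 = np.mean(miss[ok] / out[ok])                        # B1 = (1/n) sum_j E_j
        A = np.mean(clone(rule).fit(X, z).predict(X) != z)      # apparent error
        return 0.368 * A + 0.632 * B1, A, B1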
Toussaint & Sharpe (1975) proposed the ERROR RATE ESTIMATOR

A(w) = (1 - w) A + w A^{(2CV)},

where w = 0.5.

McLachlan (1977) proposed w = w_0, where w_0 is chosen to minimize the
asymptotic bias of A(w) in the case of two homoscedastic normal groups.

The value of w_0 was found to range between 0.6 and 0.7, depending on the
values of p, \Delta (the Mahalanobis distance between the groups), and n_1/n_2.
 The 0.632 estimator fails for methods that fit
 the data heavily, such as the one-nearest neighbour
  rule for which the apparent error A is zero.

This led Efron and Tibshirani (1997) to propose the
 0.632+ estimator.
.632+ estimate of Efron & Tibshirani (1997, JASA)

B^{.632+} = (1 - w) A + w B1,

where

w = .632 / (1 - .368 r),

r = (B1 - A) / (\hat{\gamma} - A)    (relative overfitting rate),

\hat{\gamma} = \sum_{i=1}^{g} \hat{p}_i (1 - \hat{q}_i)    (estimate of the no-information error rate).

If r = 0, then w = .632, and so B^{.632+} = B^{.632};
if r = 1, then w = 1, and so B^{.632+} = B1.
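A short sketch of the corresponding .632+ calculation from A, B1, and the estimated no-information rate. The clipping of r to [0, 1] follows the usual convention; this is a simplified version under those assumptions, not Efron and Tibshirani's exact code.

    import numpy as np

    def b632_plus(A, B1, z, z_pred):
        """Efron & Tibshirani's .632+ estimator, given the apparent error A,
        the leave-out bootstrap error B1, the training labels z, and the
        rule's predictions z_pred on the training data."""
        classes = np.unique(z)
        p = np.array([np.mean(z == c) for c in classes])        # observed class proportions p_i
        q = np.array([np.mean(z_pred == c) for c in classes])   # predicted class proportions q_i
        gamma = np.sum(p * (1 - q))                              # no-information error rate
        r = (B1 - A) / (gamma - A) if gamma > A else 0.0         # relative overfitting rate
        r = np.clip(r, 0.0, 1.0)
        w = 0.632 / (1 - 0.368 * r)
        return (1 - w) * A + w * B1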
Figure 1: Error rates of the SVM rule with RFE procedure
 averaged over 50 random splits of colon tissue samples
• Fu, Carroll, and Wang (2005, Bioinformatics,
  Estimating misclassification error with small
  samples via bootstrap cross-validation)

• Jiang and Simon (2007, Statistics in Medicine,
  A comparison of bootstrap methods and an
  adjusted bootstrap approach for estimating the
  prediction error in microarray classification)
1. We have just seen the selection bias that
   occurs if CV is not performed correctly.

2. Selection bias also arises from not having "all" the
   genes available to assess the performance
   of one's prediction rule.
Example:
van 't Veer et al. (2002) selected
the top 70 genes (according to the pooled two-sample
t-statistic) from a screened set of 5,422 genes
(filtered from a total of 24,481 genes) in 78 tissue
samples from breast cancer patients categorized
into 2 classes (corresponding to good and bad
prognosis).

Suppose now that one has access only to these top 70
genes for these 78 tissue samples. It is then not possible
to assess the performance on new patients of a rule
formed from these top 70 genes if the expression values
of no further genes are available.
Number of Genes   Error Rate for Top   Error Rate for Top   Error Rate for
                  70 Genes (without    70 Genes (with       5422 Genes
                  correction for       correction for
                  Selection Bias as    Selection Bias as
                  Top 70)              Top 70)


1                 0.40                 0.40                 0.40
2                 0.31                 0.37                 0.40
4                 0.24                 0.40                 0.42
8                 0.23                 0.32                 0.29
16                0.22                 0.33                 0.38
32                0.19                 0.35                 0.38
64                0.23                 0.32                 0.33
70                0.23                 0.32                 -
128               -                    -                    0.32
256               -                    -                    0.29
512               -                    -                    0.31
1024              -                    -                    0.32
2048              -                    -                    0.35
4096              -                    -                    0.37
5422              -                    -                    0.37
       70 
Here t Bk k denotes the training data on only the top 70
genes selected using t  B 
                         k
1. We have just seen the selection bias that
   occurs for an optimal subset of fixed size
   if CV is not performed correctly,

2. and the bias from not using "all" the genes available to
   assess the performance of one's prediction rule.

3. Selection bias can also occur if the optimal subset
   is of unrestricted size.
Table 2: Ten-fold Cross-validated error rates for SVM
based on subset with minimum error rate in RFE process
with and without bias correction for optimization over
subsets of varying sizes

Dataset                    Error rate without bias   Error rate with bias
                           correction for varying    correction for varying
                           size of optimal subset    size of optimal subset


van ’t Veer (5422 genes)   0.29                      0.37
Let r{y; t, s*(t)} denote the rule found using the
subset of genes s*(t) that minimizes A^{10CV}{s_d(t)}
over all values of d considered; i.e.

s^*(t) = \arg\min_d A^{10CV}\{s_d(t)\}.
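In scikit-learn terms this amounts to nested cross-validation: an inner ten-fold CV chooses the subset size, and an outer ten-fold CV assesses the resulting rule, so the optimization over d is not allowed to bias the reported rate. A sketch on toy data; the grid of subset sizes and the data are illustrative assumptions.

    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.model_selection import GridSearchCV, cross_val_score
    from sklearn.pipeline import Pipeline
    from sklearn.svm import SVC

    rng = np.random.default_rng(4)
    X = rng.normal(size=(50, 2000))
    z = np.repeat([1, 2], 25)

    # Inner ten-fold CV picks the number of genes k minimising A10CV{s_d(t)};
    # the outer ten-fold CV then estimates the error of the whole procedure.
    pipe = Pipeline([("select", SelectKBest(f_classif)),
                     ("svm", SVC(kernel="linear"))])
    inner = GridSearchCV(pipe,
                         param_grid={"select__k": [1, 2, 4, 8, 16, 32, 64]},
                         cv=10)
    outer_error = 1 - cross_val_score(inner, X, z, cv=10).mean()
    print(f"bias-corrected error for optimised subset size: {outer_error:.2f}")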
Ten-Fold Cross-Validation

[Diagram: the data are split into 10 blocks; one block is held out as the
test set and the remaining nine form the training set.]
Nearest-Shrunken Centroids
(Tibshirani et al., 2002)

The usual estimates of the class means \bar{y}_i are shrunk toward the
overall mean \bar{y} of the data, where

\bar{y}_i = \sum_{j=1}^{n} z_{ij} y_j / n_i    and    \bar{y} = \sum_{j=1}^{n} y_j / n.

The nearest-centroid rule assigns y to the class i that minimizes

\sum_{v=1}^{p} (y_v - \bar{y}_{iv})^2 / s_v^2,

where y_v is the vth element of the feature vector y, \bar{y}_{iv} = (\bar{y}_i)_v,
and s_v is the pooled within-class standard deviation of the vth gene.

In the previous definition, we replace the sample mean \bar{y}_{iv} of
the vth gene by its shrunken estimate

\bar{y}_{iv}^{*} = \bar{y}_v + m_i s_v d_{iv}^{*}    (i = 1, ..., g;  v = 1, ..., p),

where

d_{iv} = (\bar{y}_{iv} - \bar{y}_v) / (m_i s_v),

d_{iv}^{*} = \mathrm{sign}(d_{iv}) (|d_{iv}| - \Delta)_{+},

m_i = (n_i^{-1} - n^{-1})^{1/2},

and \Delta \ge 0 is the shrinkage threshold.
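scikit-learn's NearestCentroid offers a shrink_threshold option that soft-thresholds the class centroids toward the overall centroid in the spirit of the shrinkage above; it is a simplified variant, not the exact pamr implementation. A minimal sketch on toy data (data and threshold are illustrative; in practice the threshold is chosen by cross-validation):

    import numpy as np
    from sklearn.neighbors import NearestCentroid

    rng = np.random.default_rng(3)
    X = rng.normal(size=(60, 500))    # toy expression matrix
    z = np.repeat([1, 2], 30)

    # shrink_threshold plays the role of the threshold Delta above.
    nsc = NearestCentroid(shrink_threshold=0.5).fit(X, z)
    print(nsc.predict(X[:5]))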
Comparison of Nearest-Shrunken Centroids with SVM

Apply (i) nearest-shrunken centroids and
      (ii) the SVM with RFE
to the colon data set of Alon et al. (1999), with
N = 2000 genes and M = 62 tissues
                      (40 tumours, 22 normals).
Nearest-Shrunken Centroids applied to the Alon data

[Figure: (a) overall error rates; (b) class-specific error rates.]

SVM with RFE applied to the Alon data

[Figure: (a) overall error rates; (b) class-specific error rates.]
Guo, Hastie, and Tibshirani (2007, Biostatistics)
generalize the idea of nearest shrunken centroids
(NSC) with SCRDA (shrunken centroids
regularized discriminant analysis).

Efron (2008, Stanford Technical Report) shows
the close connection between an empirical Bayes
approach to large-scale prediction and NSC (a
frequentist regularization approach).
       van’t Veer et al. (2002)
Netherlands Cancer Institute (NKI) and Rosetta
Inpharmatics looked at 25,000-gene microarrays
on n = 78 patients, all aged under 55 years.
Tumours were <5 cm in diameter, and no patient had
lymph node involvement. Patients were all
treated at the NKI, and outcome over 5 or more
years was known (34 with distant metastases
within 5 years; 44 with no recurrence in this period).
van de Vijver et al. (2002) considered a further 234 breast
cancer tumours but initially only made available the data for
the top 70 genes, based on the previous study of van 't Veer
et al. (2002).
Table 3: Cross-validated error rates starting with 70 genes versus all available genes,
with missing values in the latter handled by a k-NN estimate, for predicted and true labels
      Number of genes   External CV error         External CV error     External CV error rate
                        rate using 70 genes       rate using original   using original 24,481
                        (without correction for   24,481 genes (using   genes (using true
                        bias from using just      predicted labels)     labels)
                        these 70 genes)
      1                 0.29                      0.42                  0.31
      2                 0.17                      0.38                  0.31
      4                 0.20                      0.38                  0.34
      8                 0.13                      0.31                  0.35
      16                0.11                      0.23                  0.33
      32                0.09                      0.22                  0.30
      64                0.10                      0.19                  0.30
      70                0.10


      128                                         0.16                  0.26
      256                                         0.15                  0.29
      512                                         0.14                  0.29
      1024                                        0.15                  0.28
      2048                                        0.14                  0.27
      4096                                        0.14                  0.27
      8192                                        0.16                  0.27
      16384                                       0.17                  0.29
      24481                                       0.18                  0.27
                Other Associated Topics
• Selection of biomarkers

• Detection of differential expression

•   Multiple hypothesis testing and control of FDR

Use of z-scores and then fitting of normal mixture models to them to select genes
  on the basis of the local FDR (McLachlan et al., 2006, Bioinformatics); concept of
  the empirical null proposed by Efron (2006).

Can also use this approach to select useful features, e.g. the approach of
 Donoho and Jin (PNAS, Sept. 2008) using higher-criticism
 thresholding.
• Zhu, J.X., McLachlan, G.J., Ben-Tovim Jones, L., and Wood, I.
  (2008). On selection biases with prediction rules formed from
  gene expression data. Journal of Statistical Planning and
  Inference 138, 374-386.

• McLachlan, G.J., Chevelu, J., and Zhu, J. (2008). Correcting for
  selection bias via cross-validation in the classification of
  microarray data. In Beyond Parametrics in Interdisciplinary
  Research: Festschrift in Honour of Professor Pranab K. Sen, N.
  Balakrishnan, E. Pena, and M.J. Silvapulle (Eds.). Hayward, California: IMS
  Collections, pp. 383-395.
• Wood, I.A., Visscher, P.M., and Mengersen, K.L. (2007).
  Classification based upon gene expression data: bias and precision
  of error rates. Bioinformatics 23, 1363-1370.
http://www.maths.uq.edu.au/bioinformatics/
RMagpie
RMagpie is an R package designed for both biologists and statisticians. It offers
the ability to train a classifier on a labelled microarray dataset and to then use
that classifier to predict the class of new observations. A range of modern
classifiers are available, including support vector machines (SVMs) and nearest
shrunken centroids (NSCs). Advanced methods are provided to estimate the
predictive error rate and to report the subset of genes which appear effective in
discriminating between classes.
Carrak
Carrak is software with a web interface designed for both biologists and
statisticians. It uses the RMagpie R package to offer the ability to train a
classifier on a labelled microarray dataset and to then use that classifier to
predict the class of new observations. A range of modern classifiers are
available, including support vector machines (SVMs) and nearest shrunken
centroids (NSCs). Advanced methods are provided to estimate the predictive
error rate and to report the subset of genes which appear effective in
discriminating between classes.
