On Some Problems in the Statistical Analysis of Microarray Gene-Expression Data

Geoff McLachlan
Department of Mathematics & Institute for Molecular Bioscience, University of Queensland
ARC Centre of Excellence in Bioinformatics
http://www.maths.uq.edu.au/~gjm

Outline of Talk
1. Supervised Classification of Tissue Samples (Expression Signatures)
2. Error Rate Estimation
3. Selection Bias (various types)
4. Applications to the van 't Veer and van de Vijver data sets
5. Links with Detection of Differential Expression

Microarray Data represented as an N x M Matrix
[Figure: expression matrix with N rows (Gene 1, ..., Gene N; N ~ 10^4), each row an expression profile, and M columns (Sample 1, ..., Sample M; M ~ 10^2), each column an expression signature.]

With the tissue samples considered as the feature vectors and the genes as the variables, if we put n = M and p = N, we have the big p, small n problem.

There has been increasing interest in the use of gene signatures in the diagnosis and prognosis of diseases, and as a guide to adjuvant therapy; e.g. microarray-based studies of breast cancer in:
• van 't Veer et al. (2002, Nature)
• van de Vijver et al. (2002, NEJM)

van 't Veer et al. (2002)
The Netherlands Cancer Institute (NKI) and Rosetta Inpharmatics looked at 25,000-gene microarrays on n = 78 patients, all aged under 55 years. Tumours were < 0.5 cm in diameter, and no patient had lymph node involvement. Patients were all treated at the NKI, and outcome over 5 or more years was known (34 with distant metastases within 5 years; 44 with no recurrence in this period).

Breast tumours have a genetic signature. The expression pattern of a set of 70 genes can predict whether a tumour is going to prove lethal, despite treatment, or not.

"This gene expression profile will outperform all currently used clinical parameters in predicting disease outcome."
van 't Veer et al. (2002), van de Vijver et al.
(2002)

MammaPrint
On 7/2/07 the FDA approved a new genetic test called MammaPrint®, "that determines the likelihood of breast cancer returning within five to 10 years after a woman's initial cancer." According to the FDA, the test, a product of Agendia (Amsterdam, the Netherlands), is the first cleared product that profiles genetic activity.

MINDACT (Microarray In Node-negative Disease may Avoid ChemoTherapy): a prospective trial run by the EORTC (European Organisation for Research and Treatment of Cancer) and TRANSBIG, the translational research network of the Breast International Group.

Neven et al. (2008, Lancet Oncology 9: Are gene signatures better than traditional clinical factors?) conclude that: "For now, whether microarray gene-expression profiling for breast-cancer prognosis is better than an optimised panel of clinical, objectively measured, prognostic markers remains an open question."

The gene-expression profiling of primary breast tumours has resulted in the identification of many apparently different prognostic profiles/gene sets, which show little overlap in gene identity. A comparison of the different prognostic profiles shows that, even though different gene sets and algorithms are being used with breast cancer patients, there is significant agreement in outcome predictions for individual patients.

Further work by van 't Veer and/or van de Vijver:
1) Gene-expression profiling and the future of adjuvant therapy (2005, The Oncologist).
2) Validation and clinical utility of a 70-gene prognostic signature … (2006, Journal of the National Cancer Institute).
3) Use of the 70-gene signature to predict prognosis of patients … (2007, Oncology).
4) A functional genetic approach … in breast cancer (2007, Cancer Cell).
5) The macrophage-stimulating protein pathway … in a mouse model for breast cancer … (2007, PNAS).
6) Revealing targeted therapy for human cancer … (2008, Cancer Research).
7) Pooling breast cancer datasets … (2008, BMC Genomics).
8) Comparison of prognostic gene expression signatures for breast cancer (2008, BMC Genomics).
9) Using microarray analysis as a prognostic and predictive tool in oncology: focus on breast cancer (2008, Seminars in Radiation Oncology).

We ignore here the question of whether one should combine the gene signatures with clinical factors, such as oestrogen-receptor (ER) and progesterone-receptor (PR) status, ERBB2 status of the lesion, and other surrogate markers of proliferation, such as Ki-67.

Supervised Classification of Tissue Samples
We OBSERVE the CLASS LABELS z1, ..., zn, where zij = (zj)i = 1 if the jth tissue sample comes from the ith class, and zero otherwise (i = 1, ..., g; j = 1, ..., n), in the training set t.

AIM: TO CONSTRUCT A RULE r(y; t) FOR PREDICTING THE UNKNOWN CLASS LABEL z OF A TISSUE SAMPLE y.

e.g. g = 2 classes:
G1: DISEASE-FREE (good prognosis)
G2: METASTASES (poor prognosis)

[Figure: training data for two classes; rows Gene 1, ..., Gene p; columns Sample 1, ..., Sample n, partitioned into Class 1 (good prognosis) and Class 2 (poor prognosis).]

SUPERVISED CLASSIFICATION
In supervised classification, one seeks to construct a rule that will allow one to assign objects to one of a prespecified set of classes based solely on a vector of measurements taken on those objects. Construction of the rule is based on a "design set" or "training set" of objects with known measurement vectors and for which the true class is also known. Such problems are ubiquitous and, as a consequence, have been tackled in several different research areas; a tremendous variety of algorithms and models have been developed for the construction of such rules.
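A minimal numpy sketch of the conventions above, with toy dimensions rather than real microarray data: the N x M expression matrix is transposed so that the n = M tissue samples become the feature vectors, and hypothetical class labels are stored as the indicator quantities zij.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (real arrays have N ~ 10**4 genes and M ~ 10**2 samples).
N, M, g = 100, 8, 2
X = rng.normal(size=(N, M))        # expression matrix: rows = genes, columns = samples

# Tissue samples as feature vectors: n = M rows, p = N columns (big p, small n).
Y = X.T
n, p = Y.shape

# Hypothetical class labels (1 = good prognosis, 2 = poor prognosis).
labels = np.array([1, 1, 2, 1, 2, 2, 2, 1])
Z = np.zeros((n, g), dtype=int)
Z[np.arange(n), labels - 1] = 1    # entry is 1 iff sample j comes from class i
```

Here `Z` stores the indicators sample-by-row; each row sums to one, since every training sample belongs to exactly one class.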
• "Prediction is difficult, especially when it's about the future."
(Friedman (2006), quoting Yogi Berra)

The impact of population drift on supervised classification rules is nicely described by the American philosopher Eric Hoffer, who said, "In times of change, learners inherit the Earth, while the learned find themselves beautifully equipped to deal with a world that no longer exists." (Hand, 2006, Stat. Sci.)

Kostka and Spang (2008, PLoS Computational Biology: Microarray based diagnosis profits from better documentation of gene expression signatures): in forming the diagnostic rule r(y; t) for the classification of a new patient with gene expression signature y collected subsequent to the training data t, we need to ensure that y and t are on a consistent signature scale.

Overfitting: the antidote is to limit reliance on the training data by not fitting it too closely. This is the basic principle underlying regularization.

Selection Bias
Bias that occurs when a subset of the variables is selected (dimension reduction) in some "optimal" way, and the predictive capability of this subset is then assessed in the usual way, i.e. using an ordinary measure for a set of variables.

Selection bias in:
Discriminant analysis: McLachlan (1992 & 2004, Wiley, Chapter 12).
Regression: Breiman (1992, JASA): "This usage (i.e. use of residual SS's, etc.) has long been a quiet scandal in the statistical community."
See also Nature Reviews Cancer, Feb. 2005.

LINEAR CLASSIFIER (g = 2 classes)
r(y; t) = 1 if c(y; t) > 0,
r(y; t) = 2 if c(y; t) < 0,
where
c(y; t) = β0 + β^T y = β0 + β1 y1 + ... + βp yp,
for the prediction of the class label z of a future entity with feature vector y. We denote this prediction by r(y; t), where t contains the training data.
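As a minimal sketch of the linear classifier above, with hypothetical coefficients rather than values fitted from any training set:

```python
import numpy as np

# Hypothetical fitted values for beta0 and beta (p = 3 variables).
beta0 = -0.5
beta = np.array([1.0, -2.0, 0.5])          # one coefficient per variable

def r(y):
    """Predicted class label: 1 if c(y; t) > 0, else 2."""
    c = beta0 + beta @ y                   # c(y; t) = beta0 + beta^T y
    return 1 if c > 0 else 2

print(r(np.array([1.0, 0.0, 1.0])))        # c = -0.5 + 1.0 + 0.5 = 1.0 > 0
```

Everything about the rule is carried by (β0, β); the estimators that follow (Fisher's linear discriminant function, the SVM) differ only in how those coefficients are obtained from t.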
FISHER'S LINEAR DISCRIMINANT FUNCTION
c(y; t) = β0 + β^T y,
where
β = S^{-1}(ȳ1 − ȳ2),
β0 = −(1/2)(ȳ1 + ȳ2)^T S^{-1}(ȳ1 − ȳ2),
and ȳ1, ȳ2, and S are the sample means and pooled sample covariance matrix found from the training data.

SUPPORT VECTOR MACHINE (Vapnik, 1995)
c(y; t) = β0 + β^T y,
where β0 and β are obtained as the solution of
min over (β0, β) of (1/2)||β||^2 + C Σ_{j=1}^n ξj,
subject to
ξj ≥ 0 and zj c(yj; t) ≥ 1 − ξj (j = 1, ..., n),
where ξ1, ..., ξn are the slack variables (all zero in the separable case).
[Figure from Hastie et al. (2001), The Elements of Statistical Learning.]

Error Rate Estimation
Suppose there are g groups G1, ..., Gg, and that r(y; t) is a rule formed from the classified training data
t = (y1, z1, y2, z2, ..., yn, zn),
where zij = (zj)i = 1 if yj belongs to Gi, and zero otherwise (i = 1, ..., g; j = 1, ..., n).

The apparent error rate A is the proportion of the data set t misallocated by r(y; t).

Let r{y; t, sd(t)} be a rule formed from the training data t using the subset sd(t) of d variables (genes) selected on the basis of t according to some criterion. Let Q[u, v] = 0 if u = v, and 1 otherwise.

Cross-Validation
From the original training data set, remove y1 to give the reduced set
t(1) = (y2, z2, y3, z3, ..., yn, zn).
Then form the rule r{y; t(1), sd(t(1))} from this reduced set, and use r{y1; t(1), sd(t(1))} to allocate y1 to either G1 or G2. Repeat this process for the second data point, y2, so that it is assigned to either G1 or G2 on the basis of the rule r{y2; t(2), sd(t(2))}; and so on up to yn. Note that the gene selection sd(·) is performed afresh on each reduced training set.
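The leave-one-out procedure above, with the gene selection sd(·) redone inside every fold, can be sketched as follows. The data set, the top-d selection by pooled two-sample t-statistic, and the use of Fisher's linear discriminant as the rule are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: n = 20 samples, p = 50 genes, two classes (0/1);
# only the first two genes carry a real class difference.
n, p, d = 20, 50, 2
z = np.repeat([0, 1], n // 2)
Y = rng.normal(size=(n, p))
Y[z == 1, :2] += 2.0                              # shift informative genes in class 2

def select_genes(Y, z, d):
    """Top-d genes ranked by absolute pooled two-sample t-statistic."""
    m0, m1 = Y[z == 0].mean(0), Y[z == 1].mean(0)
    v0, v1 = Y[z == 0].var(0, ddof=1), Y[z == 1].var(0, ddof=1)
    n0, n1 = int((z == 0).sum()), int((z == 1).sum())
    sp2 = ((n0 - 1) * v0 + (n1 - 1) * v1) / (n0 + n1 - 2)
    t = (m0 - m1) / np.sqrt(sp2 * (1 / n0 + 1 / n1))
    return np.argsort(-np.abs(t))[:d]

def fisher_rule(Y, z):
    """Fisher's linear discriminant: returns (beta0, beta)."""
    y1, y2 = Y[z == 0].mean(0), Y[z == 1].mean(0)
    S = ((Y[z == 0] - y1).T @ (Y[z == 0] - y1) +
         (Y[z == 1] - y2).T @ (Y[z == 1] - y2)) / (len(z) - 2)
    beta = np.linalg.solve(S, y1 - y2)
    beta0 = -0.5 * (y1 + y2) @ beta
    return beta0, beta

# External leave-one-out CV: selection is redone inside every fold, so the
# deleted sample never influences which genes are used to classify it.
errors = 0
for j in range(n):
    keep = np.arange(n) != j
    genes = select_genes(Y[keep], z[keep], d)     # selection on reduced data only
    b0, b = fisher_rule(Y[keep][:, genes], z[keep])
    pred = 0 if b0 + Y[j, genes] @ b > 0 else 1   # c(y) > 0 -> first class
    errors += (pred != z[j])
print(errors / n)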
n-FOLD CV vs 10-FOLD CV
The selection bias is increased if we use instead ten-fold cross-validation.

Ten-Fold Cross-Validation
[Figure: data split into 10 blocks, labelled 1 to 10; one block is the test set, the remaining nine form the training set.]

SUPPORT VECTOR MACHINE
c(y; t) = β0 + β1 y1 + ... + βp yp.
For a given p, we can select the variables yv (the genes) on the basis of the magnitude of the fitted coefficients βv. Using this approach, Guyon, Weston, Barnhill & Vapnik (2002, Machine Learning) proposed RECURSIVE FEATURE ELIMINATION (RFE).

Figure 1: Error rates of the SVM rule with the RFE procedure, averaged over 50 random splits of the colon tissue samples.
Figure 2: Error rates of the SVM rule with the RFE procedure, averaged over 50 random splits of the leukemia tissue samples.

Isaksson et al. (2008, Pattern Recognition Letters: Cross-validation and bootstrapping are unreliable in small sample classification) found that the CV estimate for small sample sizes is "practically useless for prediction of the true performance." They focussed on the relative error of the CV estimate,
|CV estimate − true error rate| / true error rate.

BOOTSTRAP APPROACH
Efron's (1983, JASA) .632 estimator:
B.632 = .368 A + .632 B1,
where B1 is the bootstrap error rate obtained when the rule R*k formed from the kth bootstrap sample is applied only to points not in that sample. A Monte Carlo estimate of B1 is
B1 = (1/n) Σ_{j=1}^n Ej, where Ej = Σ_{k=1}^K Ijk Qjk / Σ_{k=1}^K Ijk,
Ijk = 1 if xj is not in the kth bootstrap sample, and 0 otherwise,
Qjk = 1 if R*k misallocates xj, and 0 otherwise.

Toussaint & Sharpe (1975) proposed the error rate estimator
A(w) = (1 − w)A + wA(2CV), where w = 0.5.
McLachlan (1977) proposed w = w0, where w0 is chosen to minimize the asymptotic bias of A(w) in the case of two homoscedastic normal groups. The value of w0 was found to range between 0.6 and 0.7, depending on the values of p, the Mahalanobis distance, and the sample sizes n1 and n2.

The .632 estimator fails for methods that fit the data heavily, such as the one-nearest-neighbour rule, for which the apparent error A is zero. This led Efron and Tibshirani (1997) to propose the .632+ estimator.
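A Monte Carlo sketch of the .632 estimator above, on hypothetical two-class data, with a simple nearest-centroid rule standing in for R*k:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy two-class data; the nearest-centroid rule below is an assumed stand-in.
n = 30
z = np.repeat([0, 1], n // 2)
Y = rng.normal(size=(n, 2)) + 1.5 * z[:, None]

def rule(train_Y, train_z, y):
    """Allocate y to the class with the nearer training centroid."""
    c0 = train_Y[train_z == 0].mean(0)
    c1 = train_Y[train_z == 1].mean(0)
    return 0 if np.linalg.norm(y - c0) <= np.linalg.norm(y - c1) else 1

# Apparent error A: the rule is trained and tested on the same data t.
A = np.mean([rule(Y, z, Y[j]) != z[j] for j in range(n)])

# Monte Carlo estimate of B1: each point is judged only by rules R*_k whose
# bootstrap training sample did not contain it (I_jk = 1).
K = 50
miss, cnt = np.zeros(n), np.zeros(n)
for k in range(K):
    idx = rng.integers(0, n, n)                  # indices of the kth bootstrap sample
    for j in np.setdiff1d(np.arange(n), idx):    # points left out of this sample
        cnt[j] += 1
        miss[j] += (rule(Y[idx], z[idx], Y[j]) != z[j])   # Q_jk
B1 = np.mean(miss[cnt > 0] / cnt[cnt > 0])

B632 = 0.368 * A + 0.632 * B1                    # Efron's .632 estimator
print(A, B1, B632)
```

The optimistic resubstitution estimate A and the pessimistic leave-one-out bootstrap B1 are blended with fixed weights, in the same spirit as the A(w) estimator of Toussaint & Sharpe.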
The .632+ estimator of Efron & Tibshirani (1997, JASA):
B.632+ = (1 − w)A + wB1,
where
w = .632 / (1 − .368 r),
r = (B1 − A) / (γ − A) (the relative overfitting rate),
γ = Σ_{i=1}^g pi(1 − qi) (an estimate of the no-information error rate).
If r = 0, then w = .632 and B.632+ = B.632; if r = 1, then w = 1 and B.632+ = B1.

See also:
• Fu, Carroll, and Wang (2005, Bioinformatics: Estimating misclassification error with small samples via bootstrap cross-validation).
• Jiang and Simon (2007, Statistics in Medicine: A comparison of bootstrap methods and an adjusted bootstrap approach for estimating the prediction error in microarray classification).

Selection bias:
1. We have just seen the selection bias that occurs if CV is not performed correctly.
2. There is also a selection bias in not having "all" the genes available to assess the performance of one's prediction rule.

Example: van 't Veer et al. (2002) selected the top 70 genes (according to the pooled two-sample t-statistic) from a screened set of 5,422 genes (filtered from a total of 24,481 genes) in 78 tissue samples on breast cancer patients categorized into 2 classes (corresponding to good and bad prognosis). Suppose now that one has access only to these top 70 genes for these 78 tissue samples. It is not possible to assess the performance of a rule formed from these top 70 genes on new patients if the expression values of no more genes are available.

Number of  Error rate, top 70 genes   Error rate, top 70 genes    Error rate,
genes      (no correction for         (with correction for        5422 genes
           selection bias as top 70)  selection bias as top 70)
1          0.40                       0.40                        0.40
2          0.31                       0.37                        0.40
4          0.24                       0.40                        0.42
8          0.23                       0.32                        0.29
16         0.22                       0.33                        0.38
32         0.19                       0.35                        0.38
64         0.23                       0.32                        0.33
70         0.23                       0.32                        -
128        -                          -                           0.32
256        -                          -                           0.29
512        -                          -                           0.31
1024       -                          -                           0.32
2048       -                          -                           0.35
4096       -                          -                           0.37
5422       -                          -                           0.37

Here t_k^B denotes the training data restricted to the top 70 genes selected using t_{k-1}^B.
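A sketch of recursive feature elimination under simplifying assumptions: the data are hypothetical, and a diagonal-covariance linear rule is used in place of the SVM, since RFE only needs the fitted coefficients βv to rank the genes.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data: 24 samples, 64 genes, of which the first 4 are informative.
n, p = 24, 64
z = np.repeat([0, 1], n // 2)
Y = rng.normal(size=(n, p))
Y[z == 1, :4] += 1.5

def fit_beta(Y, z):
    """Linear coefficients: mean difference scaled by within-class variance
    (a stand-in for the SVM weight vector)."""
    m0, m1 = Y[z == 0].mean(0), Y[z == 1].mean(0)
    s2 = 0.5 * (Y[z == 0].var(0, ddof=1) + Y[z == 1].var(0, ddof=1))
    return (m0 - m1) / s2

# RFE: repeatedly refit and drop the half of the surviving genes with the
# smallest |beta_v|, until the target subset size is reached.
genes = np.arange(p)
while genes.size > 4:
    beta = fit_beta(Y[:, genes], z)
    order = np.argsort(np.abs(beta))             # smallest coefficients first
    genes = genes[order[genes.size // 2:]]       # keep the larger half
print(sorted(genes))
```

Note that when the error rate of the retained subset is to be estimated by cross-validation, this whole elimination loop belongs inside each fold; running it once on the full data and then cross-validating the survivors reproduces exactly the selection bias shown in the table above.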
Selection bias (continued):
1. We have just seen the selection bias that occurs, for an optimal subset of fixed size, if CV is not performed correctly,
2. and the bias from not using "all" the genes available to assess the performance of one's prediction rule.
3. Selection bias can also occur if the optimal subset is of unrestricted size.

Table 2: Ten-fold cross-validated error rates for the SVM, based on the subset with minimum error rate in the RFE process, with and without bias correction for optimization over subsets of varying sizes.

Dataset                   Error rate without bias     Error rate with bias
                          correction for varying      correction for varying
                          size of optimal subset      size of optimal subset
van 't Veer (5422 genes)  0.29                        0.37

Let r{y; t, s*(t)} denote the rule found using the subset of genes s*(t) that minimizes A10CV{sd(t)} over all values of d considered.

Ten-Fold Cross-Validation
[Figure: data split into 10 blocks, labelled 1 to 10; one block is the test set, the remaining nine form the training set.]

Nearest-Shrunken Centroids (Tibshirani et al., 2002)
The usual estimates of the class means ȳi are shrunk toward the overall mean ȳ of the data, where
ȳi = Σ_{j=1}^n zij yj / ni and ȳ = Σ_{j=1}^n yj / n.
The nearest-centroid rule assigns an entity with feature vector y to the class with the nearest centroid, where yv is the vth element of y and ȳiv = (ȳi)v.
In the previous definition, we replace the sample mean ȳiv of the vth gene by its shrunken estimate
ȳ*iv = ȳv + mi sv d*iv (i = 1, ..., g; v = 1, ..., p),
where
div = (ȳiv − ȳv) / (mi sv),
d*iv = sign(div)(|div| − Δ)+,
mi = (1/ni − 1/n)^{1/2},
sv is a pooled within-class standard deviation for the vth gene, and Δ ≥ 0 is the shrinkage threshold.

Comparison of Nearest-Shrunken Centroids with SVM
Apply (i) nearest-shrunken centroids and (ii) the SVM with RFE to the colon data set of Alon et al.
(1999), with N = 2000 genes and M = 62 tissues (40 tumours, 22 normals).

Nearest-shrunken centroids applied to the Alon data:
[Figure: (a) overall error rates; (b) class-specific error rates.]
SVM with RFE applied to the Alon data:
[Figure: (a) overall error rates; (b) class-specific error rates.]

Guo, Hastie, and Tibshirani (2007, Biostatistics) generalize the idea of nearest shrunken centroids (NSC) with SCRDA (shrunken centroids regularized discriminant analysis). Efron (2008, Stanford Technical Report) shows the close connection between an empirical Bayes approach to large-scale prediction and NSC (a frequentist regularization approach).

Recall the van 't Veer et al. (2002) study: the Netherlands Cancer Institute (NKI) and Rosetta Inpharmatics looked at 25,000-gene microarrays on n = 78 patients, all aged under 55 years; tumours were < 0.5 cm in diameter, no patient had lymph node involvement, all were treated at the NKI, and outcome over 5 or more years was known (34 with distant metastases within 5 years; 44 with no recurrence in this period).

van de Vijver et al. (2002) considered a further 234 breast cancer tumours, but initially made available only the data for the top 70 genes based on the previous study of van 't Veer et al.
(2002)

Table 3: Cross-validated error rates starting with the 70 genes versus all available genes, with missing values in the latter handled by k-NN imputation, for predicted and true labels.

Number of  External CV error rate      External CV error rate    External CV error rate
genes      using 70 genes (without     using original 24,481     using original 24,481
           correction for bias from    genes (using predicted    genes (using true
           using just these 70 genes)  labels)                   labels)
1          0.29                        0.42                      0.31
2          0.17                        0.38                      0.31
4          0.20                        0.38                      0.34
8          0.13                        0.31                      0.35
16         0.11                        0.23                      0.33
32         0.09                        0.22                      0.30
64         0.10                        0.19                      0.30
70         0.10                        -                         -
128        -                           0.16                      0.26
256        -                           0.15                      0.29
512        -                           0.14                      0.29
1024       -                           0.15                      0.28
2048       -                           0.14                      0.27
4096       -                           0.14                      0.27
8192       -                           0.16                      0.27
16384      -                           0.17                      0.29
24481      -                           0.18                      0.27

Other Associated Topics
• Selection of biomarkers
• Detection of differential expression
• Multiple hypothesis testing and control of the FDR

Use of z-scores, and the fitting of normal mixture models to them, to select genes on the basis of the local FDR: McLachlan et al. (2006, Bioinformatics); the concept of the empirical null was proposed by Efron (2006). One can also use this approach to select useful features, e.g. the approach of Donoho and Jin (PNAS, Sept. 2008) using higher-criticism thresholding.

References
• Zhu, J.X., McLachlan, G.J., Ben-Tovim Jones, L., and Wood, I. (2008). On selection biases with prediction rules formed from gene expression data. Journal of Statistical Planning and Inference 138, 374-386.
• McLachlan, G.J., Chevelu, J., and Zhu, J. (2008). Correcting for selection bias via cross-validation in the classification of microarray data. In Beyond Parametrics in Interdisciplinary Research: Festschrift in Honour of Professor Pranab K. Sen, N. Balakrishnan, E. Pena, and M.J. Silvapulle (Eds.). Hayward, California: IMS Collections, pp. 383-395.
• Wood, I.A., Visscher, P.M., and Mengersen, K.L. (2007). Classification based upon gene expression data: bias and precision of error rates. Bioinformatics 23, 1363-1370.
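The nearest-shrunken-centroids update described earlier can be sketched as follows. This is a simplified illustration on hypothetical data: the threshold Δ is set arbitrarily, and the overall per-gene standard deviation stands in for the pooled within-class scale sv (and the s0 offset) of Tibshirani et al. (2002).

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy data: 20 samples, 10 genes, 2 classes; first 3 genes informative.
n, p, g = 20, 10, 2
z = np.repeat([0, 1], n // 2)
Y = rng.normal(size=(n, p))
Y[z == 1, :3] += 2.0

ybar = Y.mean(0)                       # overall mean
s = Y.std(0, ddof=1)                   # per-gene scale (simplified s_v)
Delta = 1.0                            # shrinkage threshold (tuning parameter)

shrunken = np.empty((g, p))
for i in range(g):
    ni = int((z == i).sum())
    yi = Y[z == i].mean(0)             # class mean ybar_iv
    mi = np.sqrt(1.0 / ni - 1.0 / n)
    d = (yi - ybar) / (mi * s)         # standardized difference d_iv
    d_star = np.sign(d) * np.maximum(np.abs(d) - Delta, 0)   # soft threshold
    shrunken[i] = ybar + mi * s * d_star                     # shrunken centroid

# Genes whose d*_iv is zero for every class no longer influence the rule.
active = np.any(shrunken != ybar, axis=0)
print(int(active.sum()), "genes survive shrinkage")
```

This is how the method performs gene selection automatically: as Δ grows, more per-gene differences are thresholded to zero and the corresponding genes drop out of the nearest-centroid rule.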
http://www.maths.uq.edu.au/bioinformatics/

RMagpie
RMagpie is an R package designed for both biologists and statisticians. It offers the ability to train a classifier on a labelled microarray dataset and then to use that classifier to predict the class of new observations. A range of modern classifiers is available, including support vector machines (SVMs) and nearest shrunken centroids (NSCs). Advanced methods are provided to estimate the predictive error rate and to report the subset of genes that appear effective in discriminating between the classes.

Carrak
Carrak is software with a web interface, built on the RMagpie R package, that offers the same facilities through a browser: training a classifier on a labelled microarray dataset, predicting the class of new observations, estimating the predictive error rate, and reporting the discriminating genes.