Multivariate methods in microarray research
Yee Hwa Yang & David Lin
Health Canada Workshop on Microarray Data Analysis Lecture 4 Ottawa, Canada April 24-25, 2003
1
Life Cycle
Failed
Biological question
Experimental design Microarray experiment
Quality Measurement
Pass
Image analysis Normalization
Analysis Estimation Testing
Clustering
Discrimination
Biological verification and interpretation
2
Motivation I The problem of finding patterns
In many situations we have a set of hybridizations with temporal or spatial aspects to them, and seek genes exhibiting patterns. E.g. yeast cell cycle data, in this case, the nature of the pattern of interest was clear. Harder examples include developmental time series, or searching for spatial patterns in brain. Here genes are “on” or “off”, or “come on” and “go off” separately and jointly, according to their own programs. In a sense every gene has its own pattern, though we expect some sharing of patterns. But most importantly, we usually don’t know the patterns a priori.
Our problem seems to be distinguishing interesting or real patterns from meaningless variation, at the level of the gene. Two principal approaches 1a. Find the genes whose expression fits specific, predefined patterns. 1b. Find the genes whose expression follows the pattern of predefined gene or set of genes. 2. Carry out some kind of exploratory analysis to see what expression patterns emerge; cluster analysis (usually on genes).
3
Motivation II Tumor Classification
A reliable and precise classification of tumors is essential for successful diagnosis and treatment of cancer. Current methods for classifying human malignancies rely on a variety of morphological, clinical, and molecular variables. In spite of recent progress, there are still uncertainties in diagnosis. Also, it is likely that the existing classes are heterogeneous.
There are three main types of statistical problems associated with tumor classification:
1. The identification of new/unknown tumor classes using gene expression profiles; cluster analysis (usually on samples). 2. The classification of malignancies into known classes; Discrimination. 3. The identification of “marker” genes that characterize the different tumor classes; variable selection. These issues are relevant to other questions we meet , e.g. characterising/classifying neurons or the toxicity of chemicals administered to cells or model animals.
4
Cluster and discriminant analysis
• These techniques group, or equivalently classify, observational units on the basis of measurements. • They differ according to their aims, which in turn depend on the availability of a pre-existing basis for the grouping. • In cluster analysis, there are no predefined groups or labels for the observations, while discriminant analysis is based on the existence of such groups or labels.
• Alternative terminology – Computer science: unsupervised and supervised learning. – Microarray literature: class discovery and class prediction.
5
Gene expression data
Gene expression data on p genes for n samples
mRNA samples
sample1 sample2 sample3 sample4 sample5 …
Genes
1 2 3 4
5
0.46 -0.10 0.15 -0.45
-0.06
0.30 0.49 0.74 -1.03
1.06
0.80 0.24 0.04 -0.79
1.35
1.51 0.06 0.10 -0.56
1.09
0.90 0.46 0.20 -0.32
-1.09
... ... ... ...
...
Gene expression level of gene i in mRNA sample j
=
Log( Red intensity / Green intensity) Log intensity using (MAS, RMA or Li-Wong)
6
Strategy 1a: identifying genes with specific, predefined patterns
7
Finding early and late response genes in a short times series
M
576
(u)
16
1 1/4 1/2 1 4 24
Time
Which genes increase or decrease like the function x2?
8
Doing this in the vector world
• For each gene, we have a vector, y, of expression estimates at the different time points • Project the vector onto the space spanned by the vector u (the values of x2 at our time points).
y u
C
y u
• C is the scalar product
y30 1/ 4 y1 1 y 16 4 C y u y 576 9 24
C
Use a qq-plot of C to identify a genes with “significant” projections along u.
10
Search for the genes with the most similar pattern to 15228 (black line) base on Euclidean distance.
11
Strategy 2: pattern discovery
12
Clustering microarray data
We can cluster cell samples (cols), e.g. 1) neuron cells, for identification (profiles). Also, we might want to estimate the number of different neuron cell types in a set of samples, based on gene expression l evels. 2) the identification of new/unknown tumor classes using gene expresison profiles. We can cluster genes (rows) , e.g. using large numbers of yeast experiments, to identify groups of co-regulated genes. We can cluster genes (rows) to reduce redundancy (cf. variable selection) in predictive models.
13
Advantages of Clustering
We can cluster genes (rows), mRNA samples (cols), or both at once. • Clustering leads to readily interpretable figures. • Clustering strengthens the signal when averages are taken within clusters of genes (Eisen). • Clustering can be helpful for identifying patterns in time or space. • Clustering is useful, perhaps essential, when seeking new subclasses of cell samples (tumors, etc).
14
Clusters
Aim: To group observations that are “similar” based on predefined criteria.
Row and columns clusters
Taken from Nature February, 2000 Paper by A Alizadeh et al Distinct types of diffuse large B-cell lymphoma identified by Gene expression profiling,
15
Discovering sub-groups
16
Basic principles of clustering
Aim: to group observations that are “similar” based on predefined criteria.
Issues: Which genes / arrays to use? Which similarity or dissimilarity measure? Which clustering algorithm?
It is advisable to reduce the number of genes from the full set to some more manageable number, before clustering. The basis for this reduction is usually quite context specific, see later example.
17
Which similarity or dissimilarity measure?
– – – – – – Euclidean distance; Mahalanobis distance; Manhattan metric; Minkowski metric; Canberra metric; one minus correlation and many others
18
Partitioning methods
Partition the data into a prespecified number k of mutually exclusive and exhaustive groups. Iteratively reallocate the observations to clusters until some criterion is met, e.g. minimize within cluster sums of squares.
Examples: – k-means, self-organizing maps (SOM), PAM, etc.;
– Fuzzy: needs stochastic model, e.g. Gaussian mixtures.
19
Hierarchical methods
Hierarchical clustering methods produce a tree or dendrogram. They avoid specifying how many clusters are appropriate by providing a partition for each k obtained from cutting the tree at some level.
The tree can be built in two distinct ways – bottom-up: agglomerative clustering; – top-down: divisive clustering.
20
Agglomerative methods
• Start with n clusters. At each step, merge the two closest clusters using a measure of between-cluster disssimilarity, which reflects the shape of the clusters. • Between-cluster dissimilarity measures – Unweighted Pair Group Method with Arithmetic mean (UPGMA): average of pairwise dissimilarities. – Single-link: minimum of pairwise dissimilarities. – Complete-link: maximum of pairwise dissimilarities.
21
Divisive methods
• Start with only one cluster. At each step, split clusters into two parts. • Advantages. Obtain the main structure of the data, i.e. focus on upper levels of dendogram. • Disadvantages. Computational difficulties when considering all possible divisions into two groups.
22
Partitioning vs. hierarchical
Partitioning: Advantages • Optimal for certain criteria. Disadvantages • Need initial k; • Often require long computation times. Hierarchical Advantages • Faster computation. Disadvantages • Rigid; • Cannot correct later for erroneous decisions made earlier.
23
Three generic clustering problems
Three important tasks (which are generic) are: 1. Estimating the number of clusters; 2. Assigning each observation to a cluster; 3. Assessing the strength/confidence of cluster assignments for individual observations. Not equally important in every problem.
24
Comparison of clustering and single gene approaches (presented in Lecture 3) for microarray data analysis
Cluster analyses: 1) Usually outside the normal framework of statistical inference; 2) less appropriate when only a few genes are likely to change. 3) Needs lots of experiments Single gene approaches 1) may be too noisy in general to show much 2) may not reveal coordinated effects of positively correlated genes. 3) harder to relate to pathways.
25
Discrimination
26
Discrimination
• A predictor or classifier for K tumor classes partitions the space X of gene expression profiles into K disjoint subsets, A1, ..., AK, such that for a sample with expression profile x=(x1, ...,xp) Ak the predicted class is k.
• Predictors are built from past experience, i.e., from observations which are known to belong to certain classes. Such observations comprise the learning set
L = (x1, y1), ..., (xn,yn).
• A classifier built from a learning set L is denoted by
C( . ,L): X {1,2, ... ,K}, with the predicted class for observation x being C(x,L).
27
Taken from Nature Jan, 2002 Paper by L van’t Veer et al Gene expression profiling predicts clinical outcome of breast cancer. Further validated in the New England Journal of Medicine on 295 women.
28
• Prediction methods
– – – – – – – Fisher linear discriminant analysis (FLDA) Nearest Neighbor Classification Trees Support vector machines (SVMs) Neural networks Bayesian regression methods Genetic algorithm
• How good is your classifier?
– Test error rate
29
Fisher linear discriminant analysis
First applied in 1935 by M. Barnard at the suggestion of R. A. Fisher (1936), Fisher linear discriminant analysis (FLDA) consists of
i. finding linear combinations x a of the gene expression profiles x=(x1,...,xp) with large ratios of between-groups to within-groups sums of squares - discriminant variables;
ii. predicting the class of an observation x by the class whose mean vector is closest to x in terms of the discriminant variables.
30
FLDA
31
Maximum likelihood discriminant rules
When the class conditional densities pr(x|y=k) are known, the maximum likelihood (ML) discriminant rule predicts the class of an observation x by
C(x) = maxk pr(x|y=k).
For multivariate normal class densities, i.e., for x|y= k ~ N(k, k), this is (in general) a quadratic rule.
In practice, population mean vectors and covariance matrices are estimated by the corresponding sample quantities.
32
ML discriminant rules - special cases
1. Linear discriminant analysis, LDA. When the class densities have the same covariance matrix, k , the discriminant rule is based on the square of the Mahalanobis distance and is linear and given by
C(x) = arg min k (x-)’ -1(x - ) 2. Diagonal linear discriminant analysis, DLDA. In this simplest case, when the class densities have the same diagonal covariance matrix = diag(s12, …, sp2)
2 p ( x j kj ) C ( x) arg min k s2 j 1 j
Note. Weighted gene voting of Golub et al. (1999) is a minor variant of DLDA for two classes (wrong variance calculation).
33
Nearest neighbor rule
Nearest neighbor methods are based on a measure of distance between observations, such as the Euclidean distance or one minus the correlation between two gene expression profiles. The k-nearest neighbor rule, due to Fix and Hodges (1951), classifies an observation x as follows: i. find the k observations in the learning set that are closest to x; ii. predict the class of x by majority vote, i.e., choose the class that is most common among those k observations. The number of neighbors k can be chosen by crossvalidation.
34
Nearest neighbor rule
35
Classification trees
Binary tree structured classifiers are constructed by repeated splits of subsets (nodes) of the measurement space X into two descendant subsets, starting with X itself. Each terminal subset is assigned a class label and the resulting partition of X corresponds to the classifier. Three main aspects of tree construction are:
i. the selection of the splits; ii. the decision to declare a node terminal or to continue splitting; iii. the assignment of each terminal node to a class.
Different tree classifiers use different approaches to these three issues. Here, we use CART: Classification And Regression Trees, Breiman et al. (1984).
36
Classification trees
37
Aggregating predictors
Breiman (1996, 1998) found that gains in accuracy could be obtained by aggregating predictors built from perturbed versions of the learning set. In classification, the multiple versions of the predictor are aggregated by voting. Let C(., Lb) denote the classifier built from the bth perturbed learning set Lb , and let wb denote the weight given to predictions made by this classifier. The predicted class for an observation x is given by argmaxk sb wb I(C(x,Lb) = k).
38
Aggregating predictors
1. Bagging. Bootstrap samples of the same size as the original learning set. - non-parametric bootstrap, Breiman (1996); - convex pseudo-data, Breiman (1998). 2. Boosting. Freund and Schapire (1997), Breiman (1998). The data are resampled adaptively so that the weights in the resampling are increased for those cases most often misclassified.
The aggregation of predictors is done by weighted voting.
39
Prediction votes
For aggregated classifiers, prediction votes assessing the strength of a prediction may be defined for each observation. The prediction vote (PV) for an observation x is defined to be PV(x) = maxk ∑b wb I(C(x,Lb) = k) / ∑ bwb . When the perturbed learning sets are given equal weights, i.e., wb =1, the prediction vote is simply the proportion of votes for the “winning'' class, regardless of whether it is correct or not. Prediction votes belong to [0,1].
40
Other prediction methods
• Support vector machines (SVMs)
• Neural networks
• Bayesian regression methods
41
Issues in discrimination
• Missing values and imputation.
• Variable selection. • Standardization or transformation. • Accessing performance on the classifier based on – Leave one out cross validation.
• Independent testing on future dataset.
42
Example Datasets
• Leukemia data, Golub et al. (1999). • n=72 tumor mRNA samples; • 3 known classes (B-cell ALL, T-cell ALL, AML); • p=6,817 genes.
• Lymphoma data, Alizadeh et al. (2000). • n=81 tumor mRNA samples;
• 3 known classes (CLL, FL, DLBCL); • p=4,682 genes. • n=64 cell line mRNA samples; • 9 known classes; • p=5,244 genes.
• NCI 60 data, Ross et al. (2000).
43
Data pre-processing
• Image analysis and normalization: beyond our control for these data. • Imputation: k-nearest neighbor imputation, where genes are``neighbors'' and the similarity measure between two genes is the correlation in their expression profiles.
• Standardization: Standardize observations (arrays) to have mean 0 and variance 1 across variables (genes).
44
Comparison of predictors: Study design
The original datasets are repeatedly randomly divided into a learning set and a test set, comprising respectively 2/3 and 1/3 of the data.
For each of N=150 runs: • Select a subset of p genes from the learning set based on their ratio of between to within-groups sums of squares, BSS/WSS. p=50 for lymphoma, p=40 for leukemia, p=30 for NCI 60. • Build the different predictors using the learning sets with p genes. • Apply the predictors to the observations in the test set to obtain test set error rates. 45
Lymphoma data, 3 classes: Test set error rates; N=150 LS/TS 46 runs.
47 Leukemia data, 3 classes: Test set error rates; 150 LS/TS runs.
NCI 60 data: Test set error rates; 150 LS/TS runs.
48
Leukemia data: Image of correlation matrix for 72 mRNA 49 samples based on the top 40 genes for the 3-class BSS/WSS.
Results
• In the main comparison, the nearest neighbor classifier and DLDA had the smallest error rates, while FLDA had the highest error rates.
• Aggregation improved the performance of CART classifiers, the largest gains being with boosting and bagging with convex pseudo-data.
• For the lymphoma and leukemia datasets, increasing the number of variables to p=200 didn't greatly affect the performance of the various classifiers. There was an improvement for the NCI 60 dataset. • A more careful selection of a small number of genes p=10 improved the performance of FLDA dramatically.
50
Discussion
• ``Diagonal'' LDA: ignoring correlation between genes helped here. • Unlike classification trees and nearest neighbors, LDA is unable to take into account gene interactions. • Although nearest neighbors are simple and intuitive classifiers, their main limitation is that they give very little insight into mechanisms underlying the class distinctions. • Classification trees are capable of handling and revealing interactions between variables. • Useful by-product of aggregated classifiers: prediction votes. • Variable selection: A crude criterion such as BSS/WSS may not identify the genes that discriminate between all the classes and 51 may not reveal interactions between genes.
Summary Experiment: horses for courses
• mRNA levels compared in many different contexts – Different tissues, same organism. – Same tissue, different organism (wt, KO, tg). – Same tissue, same organism (trt vs ctl). – Tumour cell lines. – Time course. Typical questions – Detect genes are differentially expressed between two or more samples. [t-test, F-test and many others] – Identification of groups of genes with characterizing a particular class of tumors. [Discrimination] – Discover at molecular level, potential sub-classes of tumors / disease. [Clustering] – Detection of gene regulatory mechanisms. [Network and meta data]
•
Gene expression data has stimulated many areas of research. No single method of analysis can be appropriate for all. Rather, each type of 52 experiment requires its own analysis.
Acknowledgments
Clustering section based on Lecture 3 in Temple short course
Discriminant analysis section: based on S. Dudoit and J. Fridlyand (Bioconductor short course lecture 5) http://www.bioconductor.org
53