Some problems in the statistical analysis of microarray data

Some problems in the statistical analysis of microarray data Terry Speed Department of Statistics, UC Berkeley Division of Genetics & Bioinformatics, WEHI Workshop on Microarrays Mathematical Biosciences Institute, OSU October 15, 2004 1 Approach to be adopted in this talk I’m going to attempt to describe classes of problems in microarray data analysis that are (more or less) open, as far as I am aware, at least not as developed as (say) problems such as that of identifying differentially expressed genes. After each description, I will answer any questions seeking clarification, and then hope for comments from you on relevant literature (problem attempted here, problem solved there, etc). I may know the works mentioned, and want to explain why I don’t see it as providing a solution. I have not given many references to literature here due to lack of time. Apologies to those whose work I should have cited. 2 1. Multiple testing We still do not have ways of dealing with the multiple testing problems arising from a traditional treatment structure on the arrays (Example: 4 replicate samples from each of 5 different treatments on Affymetrix arrays), at the same time as dealing with many responses, say 22,000 gene expression measurements. In the Example, we might like to determine, for each gene, which treatment means are significantly different from one another, controlling the overall family-wise error across pairwise treatment comparisons and across genes. In other words, we need a unification of the original multiple testing procedures due to Tukey, Scheffé, Newman and Keuls, Duncan, etc., as dealt with in RG Miller’s book, developed for treatment structure, with the more recent ones due to Westfall and Young, Benjamini, etc. developed for 3 many responses. Multiple testing, cont. Think of the problem just described as a 2-dimensional multiple testing problem, there treatments  genes. A similar problem arose yesterday in Christina Kendziorski’s discussion of ETL (also called eQTL). In her experiment, 60 F2 intercross mice were genotyped at 145 markers along the genome, while mRNA from the pancreatic islet cells of each mouse was also arrayed on an Affymetrix chip, giving 45,265 mRNA transcript abundance measurements per mouse. For each measurement and each marker locus, the data on the 60 mice can be summarized in a LOD score or a t- or Fstatistic. (Hers was a single locus approach. Multi-locus approaches also exist, but are definitely more complicated in this context, though not uninteresting.) 4 Multiple testing, completed. These ~40,000 LOD scores are all highly spatially correlated along the markers along a chromosome, while the LOD scores for different measurements at the same marker can also be highly correlated, due to their correlation in the individual mice. (If Christina had taken the further step of doing interval mapping, she would have a LOD score for every measurement at every cM along every chromosome). Either way, the 2dimensional multiple testing problem here is location  genes. The multiplicity here apears massive (145 or 3,000  40,000), but any proper solution to the problem must deal with it, one way or another. It does not vanich by using Bayes rule. In reality, the 145 or 3,000 is effectively not much more than O(20), i.e. a factor times the number of chromosomes. I venture to suggest that there will soon be entirely satisfactory classical inference procedures which control family wise type 1 error rates, or estimate FDR or related quantities for these 2-dimensional problems. I further predict that they will be permutation- or bootstrap-based procedures, as such procedures already exist for multiple testing problems for each dimension separately. Maybe Susmita Datta can deal with it 5 already! 2. Gene set analyses The great strength of cluster analyses is their ability (frequently) to identify regulated sets of genes which have common biological features. Such analyses are a posteriori, and their great weakness is that they are currently outside the framework of standard statistical inference. Can we change this? More recently, people have been studying sets of genes defined a priori, e.g. categories of genes (GO), metabolic, signal transduction or other pathways, or simply sets of genes found previously to be co-ordinately regulated (e.g. from previous cluster analyses). In a given experiment, interest focusses on whether the set as a whole is perturbed. You should all know about standard GO analyses, using exact p-values, but perhaps not about the other kind. 6 PGC-1-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes Mootha et al, Nature Genetics July 2003 Data: Affymetrix microarray data on 22,000 genes in skeletal muscle biopsy samples from 43 males, 17 with normal glucose tolerance (NGT), 8 with impaired glucose tolerance and 18 with Type 2 diabetes (DM2). In their single gene analysis, a t-statistic was calculated for each gene. No significant difference found between NGT and DM2 after adjusting for multiple testing. Their idea: test 149 a priori defined gene sets for association with disease phenotypes. 7 Their 149 gene sets Sets of metabolic pathways: manually curated pathways (standard textbook literature reviews, and LocusLink) Netaffx annotations using GenMAPP. Sets of coregulated genes: SOM clustering of the mouse expression atlas. 8 Their method ES=enrichment score for each gene = scaled K-S dist A set called OXPHOS got the largest ES score, with p=0.029 on 1,000 permutations. 9 Their result OXPHOS Other (A small difference for many genes) All genes OXPHOS 10 Simplification Mootha et al did a two sample K-S test to compare genes in a specific gene set with genes not in that set. Instead of doing this, why don’t we simply do a one sample test, comparing each gene set to the whole (population) directly? Each gene set is small w.r.t. the entire set of genes, so all other genes ≈ all genes. If we have approximate normality, a z-test should work for shift alternatives. A chi-squared test for scale changes also works. (“works” here means works better!) 11 Mootha’s 22,000 t’s are approx N(0,1) 12 One sample z-test Assumption: the population of t-statistics of all genes follows a normal distribution. Denote its mean by  and SD by . If this is the case, the best test of the null hypothesis that a random sample t1 , t2, …..,tn is from this distribution, with alternative a shift of the original distribution, is based on t . Specifically, it would reject for large or small z = ( t - )/ / n . In general, we’d expect =0 and =1, and this is the case for Mootha’s t’s. We don’t expect iid, but let’s try anyway. Thus we test the null hypothesis that our sample (i.e. the t’s from a specified set) comes from a N(0,1) using z =  n t . For just one set (or for all if we wish) we could get a permutation-based 13 value, as did Mootha. For 149 sets, we can do a normal qq-plot. p- Normal qq-plot of n t for 149 sets Mootha’s data OXPHOS 14 A Bayesian approach Recalling Kim-Anh Do’s discussion yesterday, note that Bayesian approaches to identifying differentially expressed (d.e.) genes usually lead to posterior probabilities for d.e. for each gene. Instead of summing and scaling t-statistics to identify d.e. sets as I did just before, one could perhaps sum (and normalize?) posterior probabilities, or calculate (estimate from the MCMC) a posterior probability for the set, as Kim described. What one does next depends on one’s philosophy. As with the classical approach, one might wish to do this for many sets, and deal with the multiple testing issue, set. Over to Ms Bayes. 15 A “continuous” GO analysis Harmen Bussemaker has just told me about his Quontology algorithm. I quote from one of his papers. “Several publications have used genome-wide expression patterns to score functional gene categories, by considering the overlap between each category and the set of genes induced or repressed above a certain threshold. We take a very different approach that is essentially identical to the REDUCE algorithm for scoring promoter elements [4] but with manually defined gene categories from an ontology replacing the set of genes whose promoter contains a particular DNA motif. Again, a Z-score can be calculated for each category that measures the deviation of the average log-ratio for genes in the category from the genome-wide average, in units of the standard deviation. A similar method has recently been discussed by..” Pavlidis, Lewis and Noble, PSB 2002: Exploring gene expression data with class scores. . 16 Discovery of sets of co-regulated genes Just as we might search for single genes which are regulated in an experiment, e.g. by a 2-sample t-test or something similar, so we might seek sets of genes using multivariate analogues of our univariate statistics, e.g. 2sample Hotelling T2 for two or more genes’ expression values. The problems we have found with this approach are two-fold: a) even pairs known to be co-regulated, do not necessarily rank as highly as we would wish among all pairs, and b) the number of pairs (not to mention triples, etc) of genes exacts a huge price when we seek some kind of type 1 error control. Bonferroni: multiplies p-values by (22,000)2/2. Nevertheless, some form of multivariate analysis must work. Ideas?! 17 What is needed? There is a great need for research further to develop both types of gene set analysis, that is, ways of discovering sets not previously known to be co-ordinately regulated, and of determining which known sets of genes are coregulated in a given experiment. Required are novel procedures for doing the discovering or determining, and others for controlling false positive error rates in appropriate ways, paying particular attention to the logical nesting relationships associated with set inclusion. Kostka &Spang’s differentially co-expressed sets are one such discovery method. Let there be more! My modification z of Mootha’s ES can be improved by incorporating correlation among the t’s. Much needs to be 18 done on dealing with nesting relationships. 3. Microarray time series There are many open problem in this area. Here are four. a) Flexible modelling of longitudinal data, including moderation and measures to rank genes. b) Clustering methods suited to time-course profiles. In comparative experiments, should these work on individual curves or differences? c) Robust fitting procedures to deal with aberrant curves. d) Suitable permutation-based or bootstrap analyses to assign p-values and deal with multiple testing. 19 Wild type and mutant strains of Arabidopsis thaliana infected with Erysiphe orontii Find genes whose expression profiles differ between genotypes? wt log2intensity Profiles differ mutant days Extra dots at day 7 from uninfected samples. 20 Comment on the literature Christina Kendziorski has a very nice paper on this topic (Yuan et al, see her web site), as does Hongzhe Li of UC Davis, while many others have made nice contributions to this literature, see the references in Yuan et al. Why do I think there are still open problems? First, these authors rarely give solutions which reduce to what we would do if we had a single gene’s expression measurements. This worries me. I believe our methods for dealing with microarray data should be consistent with “traditional” statistical methods, and smoothly compatible with them, whether we are doing survival analysis, analysing longitudinal data etc. When we go from 1 gene to a few genes, we usually know what to do. The further step to many genes makes new demands, not the least being the gene set questions I mentioned earlier, plus the extra difficulties of dealing with n << p. But I like reasonably transparent continuity of methods. I’d also like to be able to rank genes, to deal with replication naturally, and deal with large numbers of genes, all in a seamless manner. Not all these things are available in existing methods. 21 4. Observational studies using microarray data (plus the usual covariates) We are seeing more and more observational (e.g. cross-sectional, or case/control) studies, in which each subject has had a gene expression profile done on some cells (e.g. white blood cells, or a tumor), and these are related to exposure or outcome, together with other, more standard covariates (age, sex, smoking status, etc). Typically sample sizes are not large (10s-100s), while the number of variables (genes+covariates) can exceed 40,000. How do we do the usual epidemiological things to avoid confounding, to identify interactions, and so on? Ray Carroll’s talk was a fine contribution, in an unusually well thought out context. We are not all so lucky (ask me later). 22 Signatures of environmental exposure using peripheral blood leukocyte gene expression: tobacco smoke, Lampe et al, Cancer Epidemiol Biomarkers Prev 13(3) 2004. An age-balanced sample of male and female smokers and non-smokers 23 5. Joint analysis of expression and sequence data (the next wave) This is an important area needing the attention of statisticians. Already there is a lot of work, but the problem is far from solved. Given: microarray data from an experiment, e.g. a replicated treatment-control experiment, or a time-course experiment whose gene responses are clustered. Also: access to the DNA sequence likely to include cis-regulatory motifs or, more properly, sets of motifs (modules) “near” the transcription start sites of the genes on your array Wanted: to identify novel (cis-regulatory) motifs or modules associated with the genes which are differentially expressed in the experiment. Here motif means: sequence indicator of a transcription factor 24 binding site or of a protein acting as a transcription cofactor. Approaches Linear models, with a gene’s log expression or log fold change as responses, and counts or scores of possible motifs in the gene’s regulatory sequence as independent variables, these possible motifs coming from a) a “complete list” or exhaustive search through possible motif sequences b) highly ranked output from a motif discovery program (MotifRegressor) c) known TFBS information, positive controls, and validation Variant: include ChIP data as well. Harman Bussemaker and Hongyu Zhao have done my work for me here. 25 What important areas have I omitted? Your favourite topic? Low level analysis (Earl Hubbell) will always be with us! Comparative analyses (Katie Kerr) will always be with us! Ways of dealing with batch and other effects (Evan Johnson) will always be needed! The need for good design (Jason Hsu) will never go away! Not having David Kreil’s or David Allison’s or Eric Schadt’s talks, I can’t even make a passing reference to their contents, though don’t doubt they all leave important problems unsolved! 26

Related docs
Microarray Data Analysis
Views: 13  |  Downloads: 3
An Introduction to Microarray Data Analysis
Views: 37  |  Downloads: 9
Introduction to Microarray and Data Analysis
Views: 68  |  Downloads: 11
Introduction to analysis of microarray data
Views: 84  |  Downloads: 5
Tutorial - Analysis of Microarray Data
Views: 49  |  Downloads: 7
premium docs
Other docs by alextt
antiwiki
Views: 251  |  Downloads: 3
th140
Views: 40  |  Downloads: 0
pld050_001
Views: 50  |  Downloads: 1
angel_investing_oct2004
Views: 343  |  Downloads: 9
Broken genetics
Views: 401  |  Downloads: 4
Cuentos coloniales de Terror
Views: 3050  |  Downloads: 30
drakula
Views: 94  |  Downloads: 1
Cayman Economic Report for 2006[2]
Views: 122  |  Downloads: 0
tr120
Views: 36  |  Downloads: 0
wg004_001
Views: 40  |  Downloads: 0
sc110
Views: 56  |  Downloads: 1
e-signature backgrounder
Views: 758  |  Downloads: 23
sc140
Views: 74  |  Downloads: 2
Farley v. Champs Fine Foods
Views: 68  |  Downloads: 0
Brief Baby M[1]
Views: 544  |  Downloads: 10