Document Sample

Microarray Analysis of Variance (MANAVA) – the search for a general recipe to detect differentiations in gene expression Robin D.K. Christensen Copenhagen, July 2001 The Royal Veterinary and Agricultural University, Denmark ABSTRACT Microarrays is a new class of biotechnologies, logarithmic mean of the samples: ½[loge(Heart) + which allow the monitoring of expression levels loge(IBEC)]. Using the Log-Ratio as response for thousands of genes simultaneously. The variable and the Log-Mean as a covariate in a response from these types of analysis are mostly linear regression, we adjust (normalize) the non parametric, which allow no distribution to the response variable by subtracting the product of the following statistical analysis. There are many slope and Log-Mean, by Slides (equals block in a sources of systematic variation in microarray block design), in an iterative process. When experiments, which affect the measured gene steady state is obtained, the final two way expression levels. To ensure a valid statistical ANOVA is expressed as: Ygs -jsfj(xgs) = g + s analysis we have to adjust for these contributions, + gs followed by calculation of confidence limits the process of removing and describing such to the Log-Ratios ( t1-/2(MSe)). The level of variations is called normalization. In the present significance was evaluated by Sidáks formula; study I have evaluated and analysed the data from p<0.000023. From these procedures I found 115 a partly visual method, plotting the logarithm of genes with a significant different degree of the ratio, between two samples; Heart mRNA expression, between Heart mRNA and IBEC (sample of interest) and IBEC mRNA (a mRNA. reference): loge[(Heart/IBEC)] versus the 1. Introduction The purpose of this "project-thesis" is to study and practice normalization of a large matrix of data, from investigation of multiple genes, and to determine which genes that show differences in expression between an unknown and a "control" sample. In this case I have used a set of data with 2208 genes in a block design, with all the 2208 genes represented in each of the 4 blocks (referred to as slides), with one channel as the "active" (unknown) sample and the other as a control. The data is supplied from a study of Scheidl et al. (2001) via Mats Rudemo, Professor in Bio Statistics at the Royal Danish University of Veterinary- and Agricultural Sciences. The goal of the study is to define a method by which data (gene material) from microarray analysis can be analysed and discussed despite the residuals, hopefully without making any Type II 1 Error misinterpretations (= approving the hypothesis of no differences within any of the genes, on a false assumption). But, do to the large number of multiple comparisons, a Type I Error misinterpretation (= rejecting the hypothesis on a false assumption), is probably an even bigger hazard, on the other hand. One specific distribution for these types of microarray analysis is not likely to be found, which is why I can't make any unambiguous distribution assumptions, but via iterative methods I will try to normalize and reduce the different contributions from e.g. the 4 blocks (slides). 2.1 Background concerning Microarray analysis The relevance for valid statistical/objective analysing tools in molecular biology, when measuring e.g. DNA sequences, is a subject of increasing interest. In 1975, Southern developed a technique that made the separation and identification of nucleotides possible, by separating a biological sample into a mobile and immobile phase. If one of these phases contains a known DNA sequence, the other can be identified by a relevant interpolation. These methods can be used in either a qualitative or a quantitative analysis, if a well-known DNA (cDNA) is available as a standard. In 1995, Schena et al. developed a technique for quantitative monitoring of gene expression patterns, by using cDNA “Microarray”, a technique, which typically results in data consisting of two-dimensional arrays, with genes in one dimension and experimental conditions in the other dimension (Rudemo, 2001). The basics of this technique: The immobile phase consist of single stranded pieces of known DNA ( cDNA), which is attached to a nylon filter, glass slide or silicon chip. The unknown mobile phase is a mixture of labelled copies of mRNA. When the hybridisation is over, the label is localised on the spots where the unknown phase matches the known phase. It is a big advantage of this method, that an enormous amount of test sequences can be arrayed in a single experiment. Though the velocity at which results can be obtained with this technique is a major advantage, there are a few drawbacks: The error of quantitation can reach levels of 30 to 2 50%, whereas this is 15 to 20% in the traditional methods (e.g. Northern blot or quantitative PCR) (Dutilh, 1999). It is important for a potential user of the DNA-chip method, to remember that only known genes can be analysed using this Microarray technique, because a reference is needed. Because of the complexity of the topic, and the ease by which a large numbers of genes can be monitored with this “new” method, a need for many liable databases is inevitable. See e.g. http://www-binf.bio.uu.nl/~dutilh/gene-networks these kinds of analysis, makes a computational evaluation of results (conclusions etc.) recommendable. Because of the many sources of systematic variation in Microarray experiments, normalization is used to “remove” such variation, which is in analogy to parametric statistics (Gauss’/Normal- distribution), where transformation of the response variables are preferable, instead of using Non-Parametric statistics. For more information about normalization see Yang et al. (www.berkeley.EDU/users/terry) The way the expression level is obtained, is typically by the logarithm of the quotient between the two intensities in the colours e.g. RED (target-sample) and GREEN (reference), and this is often finally multiplied by e.g. 2, and given in the “final” matrix of results, without any decimals (Rudemo, 2001). In the present thesis I have used the results from a study by Scheidl et al. (2001), in which they presented and evaluated an experimental procedure for ‘global gene expression analysis of slender embryonic structures using laser microbeam microdissection and laser pressure catapulting’. As described earlier, the purpose of this study haven’t been to examine the problem in a biological sense, but to investigate the possibilities to determine which genes that are expressed differently, from a large gene sample (2208 different genes) one “unknown” (the sample of interest) compared to the “known” (the sample used as a reference). In the following: The sample of interest is referred to as Heart (from mRNA) and the “reference” is referred to as IBEC (also descended from mRNA). 3 2.2 How do we normalize and analyse our response’ variables, to obtain a “true” picture of the differences between the genes that are expressed In Microarray experiments there are many sources of systematic variation, which cannot be, neglected because of the huge possibility of making either a result without proper meaning, or extracting a false conclusion, due to a Type I Error. The process of reducing (and possibly removing) these variations is termed normalization. We describe a within/between -slide normalization approach, which makes a stepwise reduction of the response variable. When given data that needs normalization, I speculate a method for the identifying of single differentially expressed genes, by using a two way analysis of variance (ANOVA), from which we adjust (normalize) our response variables by using linear regression to calculate the coefficient of slope from an assumed linear distribution. These two steps can then be used repeatable, to adjust and normalize the response variables by iteration. The approach of using proper normalization should make it possible to consider and analyse data by univariate testing for each gene (Dudoit et al., 2000), indicating the use of methods known from e.g. the Gauss’ distribution N(0, 2) - assuming varians homogeneity and E{ij}= 0. A model for Microarray data with two treatments: Suppose that we have Microarray data from S slides, denoted s = 1,….., S. For each slide there are two treatments t = 1, 2 Let Zgts , g = 1,…….., G, t = 1, 2 s = 1,……., S Denote the observed intensity value for the spot corresponding to gene g and treatment t in slide s, where G is the number of genes. We assume here that each spot corresponds to one gene. Put Ygs = loge(Zg1s) - loge(Zg2s) (referred to as M) And Xgs = ½ (loge(Zg1s) + loge(Zg2s)) (referred to as A) We will regard the following model: J (Model A ) Ygs = g + s + js fj(xgs) + gs 4 where gs, g = 1,…….., G s = 1,…….., S are independent with E(gs) = 0 and var(gs) = gs2 We assume that: S s =0 The functions fj, j = 1, …….., J, are assumed to be known and to satisfy: S G fj(xgs) =0 j = 1, ………, J As an example we could let fj(x) be a polynomial of degree: j In the following we assume j = 1 linear, and (maybe an extreme case) that we have only one variance. From the linear regression (fj(xgs)) we want a slope (s) for each of the slides (s = 1, …, S) by which we can normalize by subtracting this estimate from Ygs J (Model A’ ) Ygs - js fj(xgs) = g + s + gs Because of the way all the response variables are calculated (Ygs = loge(Zg1s) - loge(Zg2s)), a significant difference from 0, could (‘would’) be an indication of a difference in the expression of the Geneg After a suitable number of iterations, we calculate confidence limits to every each of the genes (1,…..,2208) on the basis of their means, and the MSe We want to test the hypothesis H0 : loge[ZgHeart/ZgIBEC ] 0 - CL1-/2 < Yg t1-/2 (MSe) < CL1-/2 Because of the large amount of comparisons between Heart (mRNA) and IBEC (mRNA), it is necessary to adjust the p-values (because of the multiple testing), to reduce the risk of making a Type I Error. 5 There are two simple and slightly different methods to adjust the p-values: - The Bonferroni adjustment: Padjusted= kPcalculated - The Sidák adjustment: Padjusted= 1 – (1 – Pcalculated)k Where the letter k = the number of genes tested. I have chosen to use Sidák’s adjusted P-value, for multiple comparisons. 3. Results from the statistical analysis of the Microarray data, using SAS After sorting of data from the study by Scheidl et al. (2001), I made a program in SAS to display the outputs of all the 4 slides (249, 286, 287 and 346) without any adjustments. I made scatter plots of: M = loge(Zg1s) - loge(Zg2s) versus A = ½ (loge(Zg1s) + loge(Zg2s)) - note that g indicates the number of one specific gene (g = 1, ….., 2208) and the treatment 1 or 2 (1 = Heart mRNA & 2 = IBEC mRNA). As shown in figure 1 there is four scatter plots, one for each slides Fig. 1 Scatter plots of M versus A, for unadjusted data. The figure show (from left to the right): s = 249, 286, 287 and 346. (n = 2208 Genes). To define the estimates of the Log-Ratio’s from a two way ANOVA (as mentioned in section 2.2) I made the following SAS-editor. Because of the huge amounts of data (if build as a Matrix) I chose to use an ANOVA statement, which is a valid method when the design is balanced and complete, instead of the more obvious ‘Generalised Linear model’ (proc GLM). /*The following is the SAS program used:*/ data data1 ; input Slide Gene MeanHe MeanIB RatioHI M A ; cards; 249 39 295.91 495.01 0.597785903 -0.51452261 5.947316659 249 40 59.9 78.03 0.767653467 -0.264416863 4.224884937 6 (There are 8832 lines in the data editor, one for each gene, times 4) 346 635679 2566.06 2595.57 0.988630628 -0.011434497 7.855844176 346 635726 503.88 1426.14 0.353317346 -1.040388628 6.742532459 ; proc anova data=data1; class slide gene ; model M = slide gene ; means slide gene ; run; proc reg; by slide ; model M = A ; run; /*The following is parts of the SAS Output:*/ Analysis of Variance Procedure Dependent Variable: M Source DF Sum of Squares Mean Square F Value Pr > F Model 2210 7426.21031233 3.36027616 19.58 0.0001 Error 6621 1136.07842842 0.17158714 Corrected Total 8831 8562.28874075 R-Square C.V. Root MSE M Mean 0.867316 -313.1447 0.41423078 -0.13228093 Source DF Anova SS Mean Square F Value Pr > F SLIDE 3 4696.80875487 1565.60291829 9124.24 0.0001 GENE 2207 2729.40155747 1.23670211 7.21 0.0001 ------------------------------------------ SLIDE=249 ------------------------------------------- Model: MODEL1 Dependent Variable: M Analysis of Variance Sum of Mean Source DF Squares Square F Value Prob>F Model 1 0.84394 0.84394 1.662 0.1974 Error 2206 1119.87844 0.50765 C Total 2207 1120.72238 Root MSE 0.71250 R-square 0.0008 Dep Mean -0.25918 Adj R-sq 0.0003 C.V. -274.90481 Parameter Estimates Parameter Standard T for H0: Variable DF Estimate Error Parameter=0 Prob > |T| INTERCEP 1 -0.333268 0.05942840 -5.608 0.0001 A 1 0.014724 0.01141940 1.289 0.1974 7 ------------------------------------------ SLIDE=286 ------------------------------------------- Model: MODEL1 Dependent Variable: M Analysis of Variance Sum of Mean Source DF Squares Square F Value Prob>F Model 1 217.38230 217.38230 645.239 0.0001 Error 2206 743.20587 0.33690 C Total 2207 960.58817 Root MSE 0.58043 R-square 0.2263 Dep Mean -0.32170 Adj R-sq 0.2260 C.V. -180.42542 Parameter Estimates Parameter Standard T for H0: Variable DF Estimate Error Parameter=0 Prob > |T| INTERCEP 1 -1.619827 0.05257581 -30.809 0.0001 A 1 0.220763 0.00869093 25.402 0.0001 ------------------------------------------ SLIDE=287 ------------------------------------------- Model: MODEL1 Dependent Variable: M Analysis of Variance Sum of Mean Source DF Squares Square F Value Prob>F Model 1 233.84467 233.84467 562.351 0.0001 Error 2206 917.32905 0.41583 C Total 2207 1151.17372 Root MSE 0.64485 R-square 0.2031 Dep Mean 1.03215 Adj R-sq 0.2028 C.V. 62.47651 Parameter Estimates Parameter Standard T for H0: Variable DF Estimate Error Parameter=0 Prob > |T| INTERCEP 1 2.417441 0.06000700 40.286 0.0001 A 1 -0.225042 0.00948987 -23.714 0.0001 ------------------------------------------ SLIDE=346 ------------------------------------------- Model: MODEL1 Dependent Variable: M Analysis of Variance Sum of Mean Source DF Squares Square F Value Prob>F 8 Model 1 70.00958 70.00958 274.325 0.0001 Error 2206 562.98614 0.25521 C Total 2207 632.99572 Root MSE 0.50518 R-square 0.1106 Dep Mean -0.98039 Adj R-sq 0.1102 C.V. -51.52833 Parameter Estimates Parameter Standard T for H0: Variable DF Estimate Error Parameter=0 Prob > |T| INTERCEP 1 -2.401480 0.08647113 -27.772 0.0001 A 1 0.195344 0.01179415 16.563 0.0001 These output statements for the 4 slides (249, 286, 287 and 346) are then used to normalize our response variables, which is done by generating a new response variable, M’ by using (Model A’ ) Slide 249: M’ = M – 0.014724*A ; Slide 286: M’ = M – 0.220763*A ; Slide 287: M’ = M + 0.225042*A ; Slide 346: M’ = M – 0.195344*A ; From the new variable M’, the same procedure repeats (iterates) until steady state is reached. This step is followed by a new cycle (with two way ANOVA etc.), from which it is possible to examine the MSe to generate confidence limits. Note: In the present case I have only run the procedure a single time, although it isn’t optimal conditions! /*The following is parts of the 2nd SAS Output:*/ Analysis of Variance Procedure Dependent Variable: M’ Source DF Sum of Squares Mean Square F Value Pr > F Model 2210 32292.58289147 14.61202846 147.59 0.0001 Error 6621 655.49326130 0.09900215 Corrected Total 8831 32948.07615277 R-Square C.V. Root MSE M’ Mean 9 0.980105 -64.97132 0.31464608 -0.48428460 Source DF Anova SS Mean Square F Value Pr > F SLIDE 3 29604.67664758 9868.22554919 99676.88 0.0001 GENE 2207 2687.90624388 1.21790043 12.30 0.0001 From this SAS output, we are able to create a confidence interval on both sides (absolute value) of the mean value of every single gene. We recall that: We want to test the hypothesis, H0 : loge[ZgHeart/ZgIBEC ] 0 From Sidák’s adjustment (P < 0.05) we find that Padjusted < 0.000023 which is equal to a t-value of 4.235; DFe= 6621 - CL1-/2 < Yg t1-/2 (MSe) < CL1-/2 4.235 (0.099) 4.2350.3146 Loge(Zg(Heart)/Zg(IBEC)) 1.33 - are considered significant, which indicate that the gene (g = 1, ….., 2208) is expressed different for Heart and IBEC ! I found 115 genes that are expressed significantly different between Heart- and IBEC mRNA. The codes of those genes are listed below. Level of --------------M’------------- GENE N Mean SD 96 4 -1.77706561 1.73504983 118 4 -1.42050163 1.86470806 120 4 -2.26124846 1.71616866 334 4 2.45132825 2.31903167 359 4 -1.71602782 2.04587237 388 4 -1.45170947 1.93259376 444 4 -1.71665568 2.02282346 454 4 -1.40136929 2.38519972 476 4 -2.50204893 1.72009338 483 4 2.38481770 2.53358889 577 4 -1.42956460 1.79847840 578 4 -1.40729397 2.39270125 615 4 -2.04795280 1.68799200 633 4 -1.50352218 2.25108857 678 4 -2.37132493 2.14877464 679 4 -1.52520470 2.25786694 678 4 -2.37132493 2.14877464 10 679 4 -1.52520470 2.25786694 686 4 -1.44179038 2.42870911 709 4 -1.58019562 2.28518023 711 4 -1.51502202 2.12457047 723 4 -1.34480726 2.41626528 739 4 -1.35949234 1.92563179 759 4 -1.57488816 2.01243322 770 4 -2.32499648 1.96242649 794 4 -1.43985042 2.45410611 799 4 -1.48797712 2.42778473 809 4 -1.33434804 1.70056162 855 4 -2.10995666 1.73030837 871 4 -1.61398086 2.28944950 885 4 2.41969034 2.32293005 887 4 -2.50112000 1.84981440 933 4 2.26010378 2.24350151 958 4 1.92778691 2.39551563 959 4 -1.62086664 2.26731640 967 4 -1.74608039 2.07221323 987 4 -1.47042080 2.46540126 1031 4 -2.39563869 1.82980338 1034 4 -1.46248592 2.40007732 1067 4 2.08096289 2.67522106 1099 4 2.44351166 2.28541313 1104 4 -1.65684651 2.09215447 1108 4 1.70954915 2.53606826 1125 4 -2.04283789 1.65232604 1128 4 -1.40847924 2.01101305 1147 4 1.72347634 2.44664088 1151 4 -1.34697217 1.90574357 1179 4 -1.91065343 2.17633976 1183 4 2.17431987 2.58022198 1187 4 -1.45402919 2.03550297 1188 4 -1.55378624 2.48806810 1189 4 -1.37586213 2.12068623 1307 4 -1.68733341 2.19122401 1339 4 -1.42647236 2.22604891 1352 4 -2.38261710 2.20000304 1413 4 -1.88232954 1.86929083 1433 4 -2.19352001 1.73566700 1463 4 -2.18860179 1.99317671 1515 4 -1.70990896 2.33093347 1517 4 -1.41773078 2.52510968 1524 4 -1.47733504 2.52347329 1529 4 -1.49298790 2.17291139 1553 4 2.50489576 2.32930453 1710 4 2.14336801 2.60169161 1728 4 2.37921286 2.58530001 1760 4 2.11736703 2.44116018 1779 4 -1.62847674 2.44949063 1807 4 2.30083460 2.28785772 1811 4 -1.39042424 1.90765922 1833 4 -1.69608750 1.68554939 1841 4 -1.57271217 2.43594157 1851 4 -2.59788936 1.94429717 1861 4 -1.98104773 2.45878339 1869 4 2.08103173 2.57000343 1882 4 -1.34556324 1.91221477 1902 4 2.30358993 2.33281576 1909 4 2.33876502 2.57121868 1911 4 2.34263121 2.45076489 1940 4 2.38394178 2.45381613 2055 4 -1.51223441 2.33356717 2063 4 1.53859244 2.13995288 2066 4 -1.39350661 2.42007881 2072 4 2.18627622 2.60723914 2087 4 -1.34288358 1.69906235 2094 4 1.98553889 2.77823313 2103 4 -1.69560531 1.96589747 2105 4 -1.56933925 1.98501220 2198 4 2.07518916 2.75892279 11 2200 4 -1.51264190 2.39964722 2202 4 -1.50012546 2.08198123 2205 4 -1.53663403 2.40720351 2348 4 2.31418773 2.45148774 2361 4 -1.46131749 2.37670218 2366 4 2.33003389 2.72712673 2370 4 -1.36629538 2.27619187 2371 4 1.85604930 2.58134105 2373 4 -1.59193834 1.87400389 2398 4 -2.39354930 2.07290769 313981 4 2.54179171 2.43782196 316082 4 2.19755988 2.46187709 333633 4 -1.40968211 2.50381096 334438 4 -1.52660832 2.20936398 367062 4 -1.69800726 1.89878641 481408 4 2.49329173 2.41310003 482270 4 2.11229665 2.48069470 483232 4 1.80929860 2.40772366 517625 4 1.38981251 2.13102859 533961 4 -2.24291503 2.13020547 534143 4 1.50963798 2.31931527 535688 4 -2.09953625 2.30368923 537504 4 -1.54040353 1.73944632 618431 4 1.85063707 2.20984390 634838 4 -1.88573912 2.12259336 635173 4 -1.67097214 1.78623670 635508 4 -2.04690098 1.91969363 4. Discussion The normalization process for microarray analysis is a complex affair because of the large diversity between the genes and the lack of distribution. Yang et al. (www.berkeley.EDU/users/terry) discussed several methods to describe and remove systematic variation in microarray experiments. Variation which otherwise can affect the measured gene expression levels. Like Yang et al., Dudoit et al. (2000) and Scheidl et al. (2001) I found that presenting data in a scatter plot, provides a good information if the graph is presented as Log-Ratio versus the “Log-Mean”, because by using these plots a stepwise reduction of the systematic variation seems less abstract. Unlike the most other authors contributing to this kind of research, I haven’t used the adjusted p- values from permutation or any other non-parametric methods (e.g. p-values generated from ranks) instead I have used an iterative method in which the systematic variation is attempted removed, while improving the variance homogeneity and thereby hopefully the power of the design. Due to the multiple comparisons of genes, an “ordinary p-value” (e.g. at 5%) is obviously not valid, because of the probability of making type I Errors. Instead an adjustment has to be done to the level of significance. In this study I have used the Sidák adjustment due to the simplicity, although “The Westfall and Young step-down adjusted p-values” (Dudoit et al., 2000) probably are 12 more correct. I found the level of significance to be P < 0.000023 and the degrees of freedom, DF e= 6621. Whether I have found the same genes to be differentially expressed as Scheidl et al. (2001) – I don’t know, because the data I got was given in coded names and I couldn’t find any “nomenclature key”. One thing is for certain; the discipline of using statistics and other calculating tools in microarray analysis, needs further work e.g. via integration to some advanced biochemistry courses, so that the complexity of the analysis becomes more “hidden” before clinicians (e.g. oncologists) and other non-statisticians “dare” using these methods on their own. Maybe a total MANAVA (Microarray Analysis of Variance) package can be generated to software like SAS, SPSS, Sigma Stat etc. in the future !!! References From the World Wide Web: www-stat.Stanford.edu/~hastie/Papers/ www.stat.Berkeley.EDU/users/terry/zarray/HTML/index.html Dudoit, S. Yang, YH. Callow, MJ. and Speed, TP. “Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments”, Technical report #578, 2000. B. Dutilh. “Analysis of data from Microarray experiments, the state of the art in gene network reconstruction”. From http://www-binf.bio.uu.nl/~dutilh/gene-networks, 1999. M. Rudemo. “Excerpts from Image Analysis”. Unpublished; Department of Mathematical Statistics, Chalmers University of Technology, 1-3, 2001. Scheidel, SJ. Nilsson, S. Kalén, M. Hellström, M. Takemoto, M. Håkansson, J. and Lindahl, P. “mRNA Expression Profiling of Laser Microbeam Microdissected Cells from embryonic structures”. Submitted, 2001. M. Schena, D. Shalon, R.W. Davis, and P.O. Brown. “Quantitative monitoring of gene expression patterns with a complementary DNA Microarray”. Science, 270:467-470, 1995. E. Southern. “Detection of specific sequences among DNA fragments separated by gel electrophoresis.” J. Mol. Biol., 98:503-517, 1975. Yang, YH. Dudoit, S. Luu, P. and Speed, TP. “Normalization for cDNA Microarray Data”. 13 Submitted, 2001. 14 15

DOCUMENT INFO

Shared By:

Categories:

Tags:

Stats:

views: | 5 |

posted: | 7/24/2012 |

language: | |

pages: | 15 |

OTHER DOCS BY jennyyingdi

How are you planning on using Docstoc?
BUSINESS
PERSONAL

By registering with docstoc.com you agree to our
privacy policy and
terms of service, and to receive content and offer notifications.

Docstoc is the premier online destination to start and grow small businesses. It hosts the best quality and widest selection of professional documents (over 20 million) and resources including expert videos, articles and productivity tools to make every small business better.

Search or Browse for any specific document or resource you need for your business. Or explore our curated resources for Starting a Business, Growing a Business or for Professional Development.

Feel free to Contact Us with any questions you might have.