Storage and analysis of microarray data

Chih-hung Jen, Ioannis Michalopoulos, Archana Sharma-Oates, Iain Manfield, Phil Gilmartin, Noel Buckley, Phil Quirke and David Westhead

Introduction

The group has a growing interest in data from post-genomic research, including microarray-based measurements of gene expression and, more recently, tissue microarrays. The work is collaborative with local experimental groups who generate the data, and we are responsible for three aspects of these projects: (i) appropriate storage and archiving of data according to international standards, and efforts to advance these standards; (ii) data analysis using methods of multivariate statistics; and (iii) the use of private and public data to make predictions and motivate experimental verification or refutation.

Plant post genomics

Microarray analyses are being carried out to identify in vivo targets of plant GATA transcription factors in Arabidopsis thaliana. GATA transcription factors are Type IV zinc finger proteins, found in all eukaryotes. Twenty-nine Arabidopsis thaliana GATA genes have been cloned, all sharing a CX2CX18CX2C zinc finger domain. Their mammalian orthologs show specificity for the conserved promoter element GATAAGG. Arabidopsis thaliana GATA-2 and GATA-4 are regulated by phytochrome A, whereas GATA-1 is not.

Identifying coexpressed genes from microarray data can be used to assign potential functions to new genes and to help discover transcriptional regulation networks. Currently, coexpressed genes are usually identified with clustering algorithms, e.g. self-organising maps (SOM), hierarchical clustering and k-means clustering. However, these clustering approaches usually depend on a distance cut-off value or an arbitrary value of k to group the genes, and these criteria do not indicate the significance of the similarity within the clusters.
Moreover, they assign each gene to only one cluster, which may cause loss of information where genes have multiple biological roles or respond to different transcription factors. In order to identify potential in vivo targets of the Arabidopsis GATA family of transcription factors using microarray data, and to avoid the drawbacks of clustering algorithms, we propose a novel, robust approach for assessing the significance of relationships in expression.

We developed a new WWW-based Arabidopsis Co-Expression Tool (ACT) for plant gene analysis, based on a large public Arabidopsis thaliana microarray data set consisting of 322 Affymetrix arrays (ATH1) from 51 different experiments, obtained from the Nottingham Arabidopsis Stock Centre (NASC). The co-expression analysis tool allows users to identify genes whose expression patterns are correlated across selected experiments or across the complete data set. The output is the Pearson correlation coefficient, or r-value, which is a scale-invariant measure of expression similarity, and this is accompanied by probability (p) and expect (E) values reflecting statistical significance against a background of random chance correlations. The E value is calculated as the product of the number of genes on the array and the p value. The correlation coefficient (r) is used to rank the genes in descending order of correlation with the driver (the gene used as the query). In addition to the r, p and E values, the output includes the Affymetrix probe ID, AGI code and current annotation for each gene. Genes with strongly correlated expression patterns are likely to be under similar transcriptional regulatory mechanisms, or involved in related biological processes. We illustrate the applications of the software by analysing genes encoding functionally related proteins, as well as pathways involved in plant responses to environmental stimuli. The resource is freely available at http://www.arabidopsis.leeds.ac.uk/.
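The ranking described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the ACT implementation: the function names (`pearson_r`, `approx_p_value`, `coexpressed`), the gene labels and the toy expression profiles are all hypothetical. The text does not say how ACT derives p, so the sketch substitutes the Fisher z-transform normal approximation. The E value follows the definition given above, with the number of profiles in the dictionary standing in for the number of genes on the array.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    denom = math.sqrt(sxx * syy)
    # A constant profile has no defined correlation; treat it as 0 here.
    return sxy / denom if denom > 0 else 0.0


def approx_p_value(r, n):
    """Two-sided p for r over n arrays (n > 3).

    Assumption: Fisher z-transform with a normal approximation,
    used as a stand-in for whatever test ACT applies.
    """
    r = max(min(r, 0.999999999), -0.999999999)  # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def coexpressed(driver_id, expression, top=50):
    """Rank all other genes by r against the driver; return (id, r, p, E)."""
    driver = expression[driver_id]
    n_arrays = len(driver)
    n_genes = len(expression)  # stands in for the gene count on the array
    rows = []
    for gene_id, profile in expression.items():
        if gene_id == driver_id:
            continue
        r = pearson_r(driver, profile)
        p = approx_p_value(r, n_arrays)
        rows.append((gene_id, r, p, n_genes * p))  # E = genes x p
    rows.sort(key=lambda row: row[1], reverse=True)  # descending r
    return rows[:top]


# Hypothetical toy data: one expression value per array for each gene.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.8, 12.1],  # tracks geneA closely
    "geneC": [6.0, 5.0, 4.0, 3.0, 2.0, 1.0],   # anti-correlated with geneA
    "geneD": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0],
}

results = coexpressed("geneA", expression)
for gene_id, r, p, e in results:
    print(f"{gene_id}\tr={r:.3f}\tp={p:.3g}\tE={e:.3g}")
```

With real data each profile would hold one value per array (322 for the complete data set). Because every gene is scored against the driver independently, a gene can appear in the result lists of several drivers, which is the property the hard cluster assignments discussed above lack.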
Based on the r-value derived from the correlation analysis tool, we can identify the genes coexpressed with GATA genes with confidence. An example result of the top fifty genes coexpressed with GATA-1 is shown in Table 1, and a co-correlation plot of GATA-2 and GATA-4 is shown in Fig. 1.

Table 1: Top fifty genes correlated with the Arabidopsis GATA-1 gene.

Fig. 1: Co-correlation scatter plot of GATA-2 against GATA-4.

Human cancer pathology

Human tissue samples are obtained from individuals in a clinical trial (both cancerous and normal tissues from the same person) for biochemical and histopathological analyses. Depending on the nature of the trial, these tissue samples undergo extensive characterisation using a number of high-throughput molecular biology techniques. The high-throughput techniques most commonly used in cancer research are cDNA microarrays, comparative genomic hybridisation arrays and tissue microarrays (TMAs). The purpose of the cDNA microarray approach is to gain an insight into the expression levels of all the predicted genes in the human genome, with the aim of identifying a set of genes related to a clinical outcome that may be either up- or down-regulated in tumour versus normal tissue. Comparative genomic hybridisation arrays (CGH-arrays) are used to study chromosomal instability at the genome level in tumour versus normal tissue. TMA is a technique that enables the analysis of a large cohort of clinical specimens in a single experiment, so that molecular alterations (at the DNA, RNA or protein level) can be studied in thousands of tissue specimens in parallel. The aim of the cDNA microarray and CGH-array techniques is to identify biomarkers that can be verified by TMA.
Our role in this research comprises analysis of the cDNA microarray and CGH-array data using statistical approaches, and the development of storage and analysis software for tissue microarray experiments, an area where we are contributing to the development of international standards and to the integration of these data with MIAME-compliant microarray databases. TMAs are used in the laboratory to assess, on a large scale, the diagnostic and therapeutic significance of various genes and proteins in colorectal tumour samples. A relational database has been designed and implemented in MySQL. The information stored in the database includes TMA design constructs, tissue staining protocols, the results (including images scanned from digital slide scanners) and the pathology reports associated with each tumour sample. Additional information includes experiment authors, dates of each experiment, quality of cores on each TMA slide and the storage location of each TMA within the laboratory. The database is interfaced with the World Wide Web (WWW), enabling users to query the database and add their own data to it.

Collaborators

Profs. P.M. Gilmartin and N. Buckley, Dr. P. Devlin (plant project), Prof. P. Quirke (human cancer pathology).

Funding

This work is funded by the BBSRC.
