Understanding microarray results: Annotation of reporters on gene
expression microarrays using Perl scripted SRS calls.

Stan Gaj, Edwin G.W. ter Voert, Willem P.A. Ligtenberg, Rachel I.M. van Haaften and
Chris T.A. Evelo

Gene expression microarrays are important tools in modern biomedical research. A
typical microarray experiment yields a huge amount of data about the expression of
genes under different conditions. Different types of experiments can be conducted, e.g.
diseased versus healthy, time-course and (drug-)treated versus untreated studies. Nowadays
microarrays typically contain well over 10,000 spots, each carrying a reporter sequence
for a gene. The expression level of each reported transcript is estimated from
the intensity of the binding (hybridisation) of the sample cDNA to its specific
reporter, which can be measured using fluorescent dyes or radioactive labelling. The data
derived in this way give us, at the mRNA level, an indication of how much protein
synthesis is taking place at a specific moment. Interpretation of such data is
however not always trivial. Apart from all kinds of data treatment and statistical issues,
reporter annotation is still an important problem. For improved biological understanding
we would like to visualize the expression in relation to functional pathways, for instance
by using the GenMAPP program. To enable this, individual reporters have to be
annotated with the identity of the corresponding protein. This typically means that we
need Swiss-Prot IDs, TrEMBL IDs or GenBank protein IDs. However, the reporters on the
array are usually annotated with the ID of the gene cluster (UniGene ID) to which the
reporting sequence belongs, obtained through sequence alignment (BLASTN) with the
UniGene database. Since mRNA sequences and their reporters on the arrays often
contain large stretches of non-coding sequence, a direct BLAST search against protein
databases is not the best option to solve this problem. We developed a tool written as a
Perl script that makes use of command-line SRS (Sequence Retrieval System) to obtain
the corresponding Swiss-Prot or TrEMBL IDs for a UniGene cluster. If no such cross-link is
present in the UniGene entry but a PIR cross-link is, the script follows the latter to obtain a
GenBank protein entry from the cross-link present in PIR. This tool was used to improve
the annotation of both commercial (Incyte and Agilent) and home-spotted microarrays and
also proved useful for data derived from suppression subtractive hybridisation studies.
The resulting annotation was used successfully to identify biological pathways
in which changes in gene expression occur.
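
The lookup order described above (Swiss-Prot or TrEMBL first, then GenBank protein via PIR) can be illustrated with a minimal sketch. This is not the authors' Perl script and it makes no SRS calls: it is written in Python, the two dictionaries are hypothetical stand-ins for UniGene and PIR entries, all identifiers in them are invented placeholders, and the function name `annotate` is chosen here for illustration only. Trying Swiss-Prot before TrEMBL is likewise an assumption, since the abstract does not state an order between the two.

```python
from typing import Optional, Tuple

# Hypothetical cross-link tables; every ID below is an invented placeholder.
UNIGENE_XREFS = {
    "Hs.0001": {"swissprot": ["P00001"], "trembl": [], "pir": []},
    "Hs.0002": {"swissprot": [], "trembl": ["Q00002"], "pir": []},
    "Hs.0003": {"swissprot": [], "trembl": [], "pir": ["A00003"]},
    "Hs.0004": {"swissprot": [], "trembl": [], "pir": []},
}
PIR_XREFS = {
    "A00003": {"genbank_protein": ["AAA00003"]},
}


def annotate(unigene_id: str) -> Optional[Tuple[str, str]]:
    """Return (database, protein ID) for a UniGene cluster, or None."""
    entry = UNIGENE_XREFS.get(unigene_id)
    if entry is None:
        return None  # unknown cluster

    # Step 1: direct Swiss-Prot or TrEMBL cross-link in the UniGene entry.
    # (Swiss-Prot is tried first here; that order is an assumption.)
    for database in ("swissprot", "trembl"):
        if entry[database]:
            return (database, entry[database][0])

    # Step 2: fall back to a PIR cross-link and follow it to GenBank protein.
    for pir_id in entry["pir"]:
        genbank_ids = PIR_XREFS.get(pir_id, {}).get("genbank_protein", [])
        if genbank_ids:
            return ("genbank_protein", genbank_ids[0])

    return None  # no usable cross-link; the reporter stays unannotated


if __name__ == "__main__":
    for cluster in sorted(UNIGENE_XREFS):
        print(cluster, annotate(cluster))
```

In a real setting the two dictionaries would be replaced by queries to the databases themselves; the point of the sketch is only the fallback order and the fact that some reporters may remain without a protein identifier.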