Sorghum Functional Genomics by hcj


                                   Sorghum Functional Genomics

             An Instruction Manual for Java Interfaces at

      Http:// provides access to information about, and derived from,
sorghum ESTs. In addition to informational pages, this web site provides interfaces that directly
access our Oracle 9i database (MAGIC DB), thereby allowing unrestricted query access to it.

    Our objective is to provide a resource that biologists will find intelligible and useful.
Towards that end we welcome your comments and suggestions.

      Whether or not you have an interest in sorghum microarrays, this document provides
detailed instructions on how to retrieve information from MAGIC DB. We realize it is an
extensive document, but its size reflects the level of detail that anyone can retrieve
from the database.

      Some of the functionalities include the following:

1.  BLAST against sorghum ESTs and/or sorghum EST clusters.
2.  View sequences and download those of interest in fasta format.
3.  View phred quality scores (a measure of the probability that a base call is correct) for every base called.
4.  View and download translations in all six reading frames (coming very soon).
5.  Link from our sequences to GenBank accessions or to locations on the rice genome at
6. Identify all genes expressed at any desired level in any cDNA library or subgroup of similar
    cDNA libraries.
7. Explore and search annotations of any or all ESTs.
8. Identify all genes that contain specified text in their provisional BLAST annotation.
9. Identify full coding length genes.
10. Identify instances of potential alternative splicing.
11. View alignments of EST clusters.
12. Download in fasta format from one to all consensus sequences for 3' EST clusters.

    And, much, much more. If what you want to retrieve from the database cannot be
accomplished with the present interfaces, we would be pleased to receive your suggestion(s) for

     As has always been the case, clones can be obtained as described at Clones so far have been delivered
without the use of any material transfer agreement, which is to say without restrictions of any

     Our convention for naming DNA sequences is available at

                                   MAGIC Gene Discovery

      The interface that provides the most information about ESTs is known as MAGIC Gene
Discovery. It is available at To use this Java GUI you will
need first to install Java Web Start (free from Sun) as explained on the page that launches this
program and as described at the end of this document.

      The program begins with a small window that provides a brief list of 'cluster runs.'
Sorghum Milestone V1.0 was used for the first generation sorghum microarray, with 16,801
unique 3' EST clusters. Sorghum Milestone V2.0 (selected here) was used for the second
generation microarray, which contains 22,392 unique 3' EST clusters.

      We cluster ESTs using a newly developed algorithm (Olympiad) that we are currently
describing for publication. It initially produces what we call “SuperScripts.” SuperScripts are
clusters assembled with an emphasis on 3' UTRs, thereby approximating clusters of genes as
closely as one can. We do not call them UniGenes because they have not been mapped to a single
locus in the sorghum genome. Olympiad further subdivides SuperScripts into two or more
UniScripts when it detects sufficient variation to suggest alternative splicing or some other
variation. Hence, UniScripts are equivalent to unique transcripts, two or more of which can come
from the same gene. For sorghum, however, a SuperScript gives rise to only one UniScript in the
large majority of cases.

       Selection and submission of a Cluster Run opens a new window. Clicking the 'submit'
button in this window processes the data for view (right). Initially, only singletons are in
view. Moving the slider bar to the right brings into the window UniScripts with increasing
numbers of 3' ESTs in them. Moving it almost all the way to the right, and changing the scale of
the Y axis to 50, provides the view seen on the next page.

       In this histogram view, each bar represents a UniScript and its height is proportional to
the number of 3' ESTs it contains. Within a bar, each color represents a different cDNA library.
Clicking on a bar (in this example the bar immediately above the red arrowhead) opens a new
window (below right). The Pie Chart in this new window not only displays more detailed
information about the distribution of ESTs within this cluster, but also provides access to far
more information via the buttons at the bottom of this window.

      The 'Group and Library Definition' button opens a new window giving further information
about each cDNA library (below).

       The 'Expression by SubGroup' button presents data in similar fashion, but with the data
grouped into biological categories (right). The 'Subgroup Definition' button in this window
specifies how different libraries were combined into the subgroups shown here. While not shown
here, the window with this information is similar to that providing Group and Library
Definitions on the preceding page.

      Going back to the Pie Chart window on the preceding page, selecting the 'Contig
Alignment' button displays the alignment with discrepancies from the consensus sequence
highlighted in red, and low-confidence base calls shown as an X and highlighted in light blue.
The alignment can be viewed sorted by offset from the consensus (default), by sequence name, or
by genotype, and at three levels of resolution. The following two illustrations show sequences
sorted by genotype and presented at the lowest and highest levels of resolution, respectively.

      The 'Sequences in Contig and Consensus' button opens a window that displays the name of
every sequence in the cluster and displays the consensus sequence (not illustrated). In this new
window, a 'Contig NCBI Blast' button opens a browser window at NCBI Blast, where you can
paste in the consensus sequence and execute a BLAST against a database of your choice. A
'Make Fasta' button provides the ability to download in fasta format either the selected consensus
sequence or all of them in the selected clustering (in this case >25,000 consensus sequences).
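
      Once consensus sequences have been saved in fasta format, they can be read back into
other programs with very little code. The following is a minimal Python sketch, not part of
MAGIC Gene Discovery; the sequence names and bases in the example are invented for
illustration, and the exact header layout written by 'Make Fasta' is an assumption (only the
first word after '>' is used as the name).

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each sequence name (first word of the header) to its full sequence."""
    records: Dict[str, str] = {}
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        elif name is not None:
            chunks.append(line)  # sequence lines may be wrapped
    if name is not None:
        records[name] = "".join(chunks)
    return records


# Hypothetical two-record file; real names and sequences will differ.
example = """>consensus_1 hypothetical header text
ACGTACGT
ACGT
>consensus_2
TTGGCCAA
""".splitlines()

records = parse_fasta(example)
for seq_name, seq in records.items():
    print(seq_name, len(seq))
```

To read a downloaded file instead, pass an open file handle to parse_fasta in place of the
example lines.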

      Selecting the 'Annotation' button from the Pie Chart window returns the results of a
BLAST against PIR_NREF (next page). In the near future, these data will be replaced with those
returned from BLAST against UniProt. The data can be sorted in this window by clicking on any
column header. The same is true for all other tabular views in this program. Column headers
from left to right are:

Num = arbitrary number
Seq Name = our name for the sequence
Alias Name = not relevant here
User AKA Name = not relevant here
UniScriptId = our identifier for the UniScript
HSP = high scoring pair
Score = value returned by BLAST
Expect = BLAST expect value (this view has been sorted by Expect value)
% Identity = value returned by BLAST
% Positive = value returned by BLAST

OffSet = location of first nt in the indicated EST with reference to the first nt of the consensus
Nref ID = PIR_NREF accession id
Protein Name = target name at PIR_NREF
Species = name of organism providing the match at PIR_NREF
Protein Len = length of target protein
Protein Function = reserved for future use with GO numbers
Q1 = first base in the query sequence found in the match
Q2 = last base in the query sequence found in the match
H1 = first amino acid in the target protein found in the match
H2 = last amino acid in the target protein found in the match
Blast Run Id = identification of the BLAST run that produced the data
Query Id = arbitrary Oracle number corresponding to the query EST in that row
Target Id = arbitrary Oracle number corresponding to the target protein in that row
Frame = reading frame of the match, which here should always be negative as these are all 3' ESTs

     Selecting a row (highlighted at the top in this illustration) provides the match itself at the
bottom of the window. 'View PIR Database' takes a user to the selected accession at PIR.

       Selected data within a table, either this one or any other similar MAGIC Gene Discovery
table, can be entered into a new sub table with the 'Sub Table' button. Thus, one can sort data on
any column, select those rows that meet desired criteria, create a sub table, then repeat the
process with another column as often as one likes in order to obtain a final table with the desired
data. Data can be copied from any table by selecting all data with Ctrl-A (or a subset can be
selected in conventional fashion with a mouse), copying selected data with Ctrl-C, and then
pasting it into an Excel spreadsheet using Ctrl-V or the paste function in Excel.

      The relationship between a selected UniScript and other similar UniScripts can be viewed
via the 'Blat Analysis' button. In the example shown below, the BLAT comparison indicates the
possibility of alternative splicing.

       Other views on the Cluster Overview page permit monitoring the rate of UniScript
discovery as a function of the number of 3' ESTs accumulated (below) and viewing the length
distribution of UniScripts (not shown here). In this screen shot, the red line describes the
observed rate of gene discovery; the green line presents a theoretical fit that permits extrapolation
to any number of 3' ESTs. The data can be monitored either with respect to UniScripts or, as
here, SuperScripts. This example indicates that 38,697 SuperScripts are anticipated when all
libraries have been sampled to a collective depth of 5,000,000,000,000 3' ESTs.

       The Search UniScript page (below) queries the database to retrieve ESTs meeting user-
specified criteria on a library-by-library basis. Typically the first step is to select a library of
interest, although multiple libraries – including all – can be selected. If multiple libraries are
selected, the query is nonetheless executed one library at a time. If one wants to query subgroups
of libraries, the Search Subgroup page performs this function. Because this latter page otherwise
functions identically to the Search UniScript page, it will not be described separately here.

      After selecting a library or libraries, all data can be returned using the 'Search' button.
Data can, however, be filtered in numerous ways. If a UniScript is represented within the
selected library(ies), 3' ESTs in that UniScript can be retrieved by entering its id (e.g., 2_9153) in
the appropriate box. Similarly, the UniScript that contains a specific sequence can be retrieved.
Only part of a sequence name need be entered (e.g., CCC1_13_A06).

      A typical query to find UniScripts expressed preferentially within a given library (or
subgroup if Search Subgroup is used) is illustrated in the figure below. Here we have asked for
all UniScripts that have four or more members and that have been detected at least 75% of the
time only in library CCC1. A list of 42 UniScripts matching these criteria is returned. It has first
been sorted by total number of 3' ESTs and then by ratio. As in the example of the annotation
window above, data can be sorted by any column simply by clicking on the header of that
column. Again, selected data can be moved into a new sub table with the 'Sub Table' button.
There it can be sorted further in iterative fashion. Please note that after selecting one or more
UniScripts all previously described functions can also be accessed directly from this window via
the buttons along the bottom of the window.
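
      The logic of such a query can be sketched outside the program as follows. This is only
an illustration in Python of the filter described above (a minimum number of members and a
minimum fraction of 3' ESTs from one library), not the query that MAGIC Gene Discovery sends
to the database. The UniScript ids and library memberships below are invented, and the
descending sort order is an assumption.

```python
def preferentially_expressed(uniscripts, library, min_members=4, min_ratio=0.75):
    """Return (uniscript_id, total_ests, ratio) for UniScripts enriched in one library.

    uniscripts maps a UniScript id to a list holding the source library of each
    of its 3' ESTs. ratio is the fraction of those ESTs that come from `library`.
    """
    hits = []
    for uniscript_id, est_libraries in uniscripts.items():
        total = len(est_libraries)
        if total < min_members:
            continue  # too few members to judge preference
        ratio = est_libraries.count(library) / total
        if ratio >= min_ratio:
            hits.append((uniscript_id, total, ratio))
    # Sort first by total number of ESTs, then by ratio (both descending here).
    hits.sort(key=lambda hit: (-hit[1], -hit[2]))
    return hits


# Hypothetical data: ids and library assignments are made up for illustration.
example = {
    "2_0001": ["CCC1", "CCC1", "CCC1", "ANR1"],
    "2_0002": ["CCC1", "CCC1"],
    "2_0003": ["CCC1", "ANR1", "ANR1", "ANR1", "ANR1"],
    "2_0004": ["CCC1"] * 6,
}

for row in preferentially_expressed(example, "CCC1"):
    print(row)
```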

       The Search Annotation page (below) is designed to permit identification of ESTs through
their provisional electronic annotations. We have so far used blastx against PIR_NREF, which is
a well annotated and non-redundant protein database. We will in the coming months be replacing
these provisional annotations with data obtained from UniProt. Be warned that we enter into
MAGIC DB the best hit for every query sequence, irrespective of Expect value up to a maximum
of 10. Our objective is to leave it to the person who views the data to decide the significance
of a hit. Consequently, before taking any annotation seriously, ensure that you
consider the values returned by BLAST. Expect value, score, % identity, % positive, and the
alignment itself are all provided, as illustrated above for the Annotation window. Please note that
when BLAST returns are available from other target databases, these data are made available
through the 'Target Database' window.

       The following screen shot illustrates what is returned when querying the anaerobic root
library (ANR1) for all annotations that include 'dehydrogenase.' Other filters not used here
include target species and, as before, Seq Name. The 76 ESTs retrieved, both 3' and 5', were
first sorted by Expect value to get the best hits at the top of the window, and second by Target
Name. The third row was selected to display the match against sugarcane alcohol
dehydrogenase. Note the value of including Q1, Q2, H1 and H2 (see above for definitions) in
this table. In this case H1 = 1, which means that the selected sequence includes the initiating
methionine. Thus, the corresponding clone is full coding length. As expected, the selected
sequence is a 5' EST, as annotated in a column to the left of those shown in this screen shot.
Also as expected, the reading frame of the match is positive as annotated in a column to the right
of those shown here. Thus, one can not only find clones of interest, but frequently identify full
coding length clones and identify those with the longest 5' UTR. In this case, since Q1 = 147, we
know that the clone includes 146 nt of 5' UTR. Altogether, ~65% of all clones are full coding
length; those prepared over the past three years by Drs. Sumio Sugano and Yutaka Suzuki in
Tokyo are slightly in excess of 80% full coding length!
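
      The arithmetic behind this reasoning is simple enough to state as a short Python sketch.
It assumes, as in the examples in this document, 1-based coordinates, a positive reading frame,
and a target protein that itself begins with the initiating methionine; the function name is
ours and does not correspond to anything in MAGIC Gene Discovery.

```python
from typing import Optional


def five_prime_utr_length(q1: int, h1: int) -> Optional[int]:
    """Length of 5' UTR implied by a blastx hit, or None if not determinable.

    q1 is the first query base in the match and h1 the first amino acid of the
    target protein in the match (both 1-based). When h1 == 1 the match starts
    at the initiating methionine, so every base before q1 is 5' UTR.
    """
    if h1 != 1:
        return None  # match does not reach the start of the protein
    return q1 - 1


print(five_prime_utr_length(147, 1))  # the alcohol dehydrogenase example above
```

A return value of None means only that this simple rule cannot decide; it does not by itself
show that a clone is incomplete.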

      As in the case of other views, after selecting one or more rows access to all previously
described functionalities is provided through the buttons at the bottom of the window.

        A final screen shot illustrates the functionality of the Search Full Length Seqs page.
Typically, a user would select one or more libraries and a target database (PIR_NREF by default
as it is most useful for this application). Optional filters as available on other pages are provided.
In the example presented below, 2955 blastx returns were obtained from PIR_NREF for ANR1
as the query library. Because these are new data, they have not yet been incorporated into
UniScripts or SuperScripts. Data were sorted first by EST direction to bring 5' ESTs to the top,
second by Q1 value to bring the largest values to the top, and third by H1. As before, selecting a
row reveals the actual alignment. In this example, H1 = 1, thereby identifying a full coding
length clone (the target database has been filtered to exclude entries that are not full coding
length), while Q1 = 371, indicating that the clone includes 370 nt of 5' UTR.

       Upon completion of a query, a search of the data returned is conducted. After you either
accept default values for H1, a cut-off for the Expect value, and the total number of HSPs, or
enter values of your own, a variety of information is calculated and displayed. The 'Library
Full Length Sequence Report' window reveals that 0.4% of these inserts are apparently cloned
backwards from expectations, while 90.7% of the clones in this library are full coding length.
The default view of this window is illustrated here; selecting the 'Full Report' option provides
much more information.

      A second new window is also opened (next page), this one displaying 906 rows of data for
all ESTs from this library found or deduced to include the initiating methionine. Data have been
sorted inversely by Q1 and then by H1. There was no need to sort by Expect value because this
was done prior to the calculations described in the previous paragraph. The alignment returned
from blastx is shown for a eukaryotic translation initiation factor from maize. The BLAST return
indicates that the clone includes 370 nt of 5' UTR. Again, buttons along the bottom of the window
provide access to all other functionalities.

       As already described for the Annotation window, sub tables can be created from this and
any other similar view in MAGIC Gene Discovery, thereby permitting iterative queries until the
data have been reduced to exactly what you are looking for. Once a table contains only the data
of interest, those data can be selected using Ctrl-A (alternatively, rows can be selected with a
mouse in conjunction with the Shift and Ctrl keys), copied with Ctrl-C, and then pasted into an
Excel spreadsheet with Ctrl-V or the paste function within Excel. Please recall also that a
function to download data in fasta format is available for all clones that have been included in 3'
EST clusters. This function is available from a number of windows, and always available from
the 'Sequences in Contig and Consensus' window selected from the 'Contig Composition Pie
Chart' window.

     Sequences themselves can be downloaded in fasta format from MAGIC Sequence Viewer,
which is described below.


       A BLAST page at ( permits anyone to BLAST
their favorite sequence(s), nucleotide or amino acid, against either all sorghum ESTs or 3'
sorghum EST clusters. This is a standard NCBI BLAST against databases at Shown
here is an example of a tblastn against the sorghum EST database using default parameters and an
Arabidopsis protein sequence as the query. If you are looking for a sorghum equivalent of an
already known gene, this is likely to be the quickest way to find it.

                     Sequences – including MAGIC Sequence Viewer

       Two options are provided for exploring sequences. Please note the page
( that describes the naming
convention for sorghum sequences at

(1) Java server pages ( provide direct
or drill-down access to sequences. This approach is most useful for those who have low-speed
access to the internet or who have specific objectives that can be satisfied with these pages.
Please note that these pages are currently being modified to accommodate extensive recent
changes in the table structure of our Oracle database, resulting for the moment in some data
requests going unfulfilled.

      Direct access to sequences is achieved via a link at the top of the page listing the sorghum
cDNA libraries. This link takes you to a new page where you can enter either a GenBank
accession id or all or part of a sequence name (e.g., DG1_17_A05 is sufficient). This page is
being modified, as it currently does not work for all sequences as a consequence of recent
database modifications.

      Drill-down access is achieved by selecting first a library, second a 96-well plate in that
library, and third the desired sequence. Note that hyperlinks both to GenBank and to Gramene
are provided when available. Gramene provides hyperlinks back to, both for ESTs and 3' EST
clusters. Please note, however, that at Gramene sorghum ESTs and EST clusters are not shown by
default but must be selected from the Advanced function in the Features menu. We are currently
modifying access to EST clusters from Gramene as the code has to be changed to accommodate
our recently modified
       Raw, untrimmed sequences are displayed, color-coded with respect to quality, vector,
linker and polyT as illustrated here. They can be copied and pasted into your own application in
the usual fashion.

(2) MAGIC Sequence Viewer is a Java GUI ( that provides more
powerful access to sequences. Because of the quantity of information it provides, however, it is
used most conveniently with high-speed internet access. To use this program you will need first
to install Java Web Start (free from Sun) as explained on the page that launches this program and
as described at the end of this document.

       Once the program has downloaded, the first window (right) provides a list of libraries
from which one must be selected. This window also permits selecting either vector or EST
orientation and either all data from the library ('All Blocks') or only a single block.

      Clicking 'Get Sequences' opens two new windows. One (below) lists sequences by name
('Alias Name' and 'User AKA Name' are not relevant here), EST direction, whether vector is
identified at the beginning (VF1) or end (VF2) of the sequence, whether it is all vector (Vtot),
and length of sequence after trimming for vector, adaptor, polyT and quality (Q16VS). PolyT
length is given when appropriate.

      Note that sequences can be sorted by clicking on any column. Thus, for example, all
sequences with vector identified at the end can be brought to the top by clicking on VF2. In this
way, all sequences with inserts shorter than the read length can be identified easily.

      ['PID', 'Method' and 'Failure Report' have no function here. This program is the same as
that used with our production database. Hence, some functions relate to quality control issues
and are not relevant when applied, as here, only to data that have passed quality control.]

      Selected sequences, from one to all, can be downloaded as a fasta file (right) by making
this selection in the 'Select a Sequence' window above. Sequences can be downloaded raw,
trimmed in various ways, and/or reverse complemented as desired. A counter is provided in the
'Select a Sequence' window (above).

      Selecting a single sequence (ANR1_1_A04.g1_A002 in the 'Select a Sequence' window
above) provides detail in the Display Window (below). The various options in the upper right of
this window highlight different features of the sequence. Displayed here in green is that part of
the sequence sent to GenBank.

      Selecting the Blast function (lower right) opens a new window. Clicking on 'Read
Sequence' copies the highlighted region into memory and pastes it into that window. Hyperlinks
to BLAST at NCBI and at are provided.

       The lower part of this window displays phred quality scores as a function of position in
the sequence. Phred quality scores are a quantitative measure of the certainty with which a base
is called correctly. To give an idea of this logarithmic scale, a value of 10 is 90% certainty the
call is correct, while 20, 30, 40, etc., correspond to certainties of 99%, 99.9%, 99.99%, etc.
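
      The values just quoted follow from the standard definition of a phred score,
Q = -10 log10(P), where P is the probability that the base call is wrong. A short Python
sketch of the conversion in both directions is given here for readers who wish to work with
scores numerically; the function names are ours and are not part of MAGIC Sequence Viewer.

```python
import math


def phred_to_accuracy(q: float) -> float:
    """Probability that a base call with phred score q is correct."""
    return 1.0 - 10.0 ** (-q / 10.0)


def accuracy_to_phred(p_correct: float) -> float:
    """Phred score corresponding to a probability of a correct call (< 1.0)."""
    return -10.0 * math.log10(1.0 - p_correct)


for q in (10, 20, 30, 40):
    print(q, f"{phred_to_accuracy(q):.2%}")
```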

      Any region of a sequence in the text window can be selected with a mouse in conventional
fashion. The selected region is then highlighted in both views, with the lower view defining
where it begins and ends and displaying the phred quality scores at the two extremes (line of text
in red near bottom of window). Phred scores for single base calls can be obtained by using a
mouse to place a cursor in the text window. This cursor can then be moved with the four cursor
(arrow) keys on the keyboard.

      A new version of this interface, which we call MAGIC SeqView, is presently near
completion. When available, it will replace this one. Because of the way Java Web Start
functions, it will then be automatically downloaded to your computer the next time you run the
program. While it will look and function pretty much like this version, it will have a number of
added functions. These new functions will include the ability to view and explore translation of a
sequence in all six frames and to download in fasta format any or all of those translations. In
addition, it will be possible to search for any user specified sequence in the EST that is currently
being displayed in the window. It will also provide the reverse complement of a sequence and
the ability to search it as well.
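
      For readers unfamiliar with six-frame translation and reverse complementation, the
following Python sketch shows what these operations compute. It is a generic illustration using
the standard genetic code, not the code used in MAGIC SeqView. Numbering the frames +1 to +3
and -1 to -3 is an assumption made here for the example, as is the use of 'X' for codons that
contain ambiguous bases.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence (A<->T, C<->G, read backwards)."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons from the first base; '*' marks a stop codon."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq: str) -> dict:
    """Translations keyed by frame: 1..3 forward, -1..-3 on the reverse strand."""
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(seq[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames


# Short made-up sequence for illustration only.
for frame, protein in sorted(six_frame_translations("ATGGCCTAA").items()):
    print(frame, protein)
```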

                    Obtaining Java Web Start and Downloading
                 MAGIC Gene Discovery and MAGIC Sequence Viewer

Java Web Start 1.2 must be installed on a PC to run these programs. It is available from Sun at
no cost. It is difficult to provide lasting instructions on how to obtain Java Web Start because
Sun seems to change its web site every couple of months. Currently, it can be obtained at We
routinely use (with Win2000, SP3) J2SE v 1.4.2_05 SDK (developer's version; ~52 MB) and have
successfully tested J2SE v 1.4.2_05 JRE (end-user's version; ~15 MB). Download and install
whichever version you prefer. Installation instructions are provided by Sun from this page.

To use MAGIC Gene Discovery and MAGIC Sequence Viewer, you will also need to be able to
send data from your computer not only to the widely used default port 80, but also to ports 8080
and 1521. If you are behind a firewall, please note that as a security measure some firewalls
prevent sending, especially to port 1521. You will then need to speak with whoever administers
your firewall.
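
If you are unsure whether your firewall permits these connections, a short script such as the
following Python sketch can test them before you launch the programs. It is not part of our
software. The host name must be supplied by you on the command line (use the address of the
server named on the launch page); the script itself makes no assumption about it, and the
3-second timeout is an arbitrary choice.

```python
import socket
import sys


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False  # refused, timed out, blocked, or host not found


def check_ports(host: str, ports=(80, 8080, 1521), timeout: float = 3.0) -> dict:
    """Map each port to True (reachable) or False (not reachable)."""
    return {port: can_connect(host, port, timeout) for port in ports}


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_ports.py <server-host-name>
    for port, reachable in check_ports(sys.argv[1]).items():
        print(port, "reachable" if reachable else "blocked or unavailable")
```

A port reported as blocked may indicate either a firewall restriction or a server that is
temporarily unavailable, so it is worth repeating the test before contacting your administrator.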

When you launch one of these programs for the first time, it is downloaded to your computer
where it is saved for future use. Each time you request the program, your computer compares the
version you already have to that at If the latter is newer, then your computer will
automatically download the latter and replace what you have with the newer version. Thus, you
always use the most current version without any action on your part.

When MAGIC Gene Discovery or MAGIC Sequence Viewer is initially downloaded, you will
likely be warned not to execute the code, as we have not paid for a certificate from any of the
companies (such as Verisign) recognized by internet browsers. It is, however, completely safe to
ignore this warning as long as you trust us. The certificate does nothing more than verify that we
are who we say we are; it does not mean that the code has been verified as safe.

