Guidelines for Genome Annotation - Muktak Aklujkar

					                                                                  Aklujkar, Muktak 1-5

                         Guidelines for Genome Annotation

There are a lot of genomes for us to annotate, and everyone is encouraged to
participate. This is a good way to get one more paper to your credit. However, the
criteria for authorship should be the same as for any other paper: you should make an
intellectual contribution and discover something.

If you are an authority on any particular group of proteins, please share your insights
by e-mailing them as an Excel file.
The file should have seven columns:
    1. The date you submit the file
    2. Your name
    3. Genome name and draft date
    4. Gene number
    5. Suggested gene name (optional)
    6. Description
    7. Evidence.

You don’t have to limit yourself to your area of expertise. Just pick a topic that
interests you, and follow the clues. Anyone can do this sort of research, and the more
you find out, the higher up you will move in the list of authors.

The purpose of these genome annotation projects is not just to corroborate or qualify
what we have learned from Geobacter sulfurreducens, but to record new discoveries
that may be important later on.
It is a huge task to attach meaningful labels to the proteins of unknown function:
     1. Are they conserved?
     2. Are they specific to the Geobacteraceae?
     3. With what other proteins are they expressed?
Make sure that the orf-finding software doesn’t miss any orfs that should be on the
microarrays, especially those that look like other orfs in the same genome!
Check everything that looks suspicious:
     1. Long intergenic regions
     2. Truncated/extended N-termini
     3. Overlapping genes
     4. Orfs split into pieces by frameshifts
     5. Orfs within repetitive DNA
     6. Etc.
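
The first and third checks in this list can be roughed out from gene coordinates alone. The following is a minimal sketch in Python, not part of our annotation pipeline: the Gene record, the function name, and both thresholds (300 bases for a "long" gap, 30 bases for a suspicious overlap) are assumptions chosen for illustration and should be tuned for each genome.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    # Hypothetical record; coordinates are 1-based and inclusive.
    name: str
    start: int
    end: int
    strand: str  # "+" or "-"


def flag_suspicious_neighbors(genes, max_gap=300, max_overlap=30):
    """Return (kind, left_name, right_name, size) for adjacent pairs worth a look.

    max_gap and max_overlap are illustrative thresholds, not established values.
    """
    flags = []
    ordered = sorted(genes, key=lambda g: g.start)
    for left, right in zip(ordered, ordered[1:]):
        gap = right.start - left.end - 1  # negative when the two genes overlap
        if gap > max_gap:
            flags.append(("long_intergenic", left.name, right.name, gap))
        elif -gap > max_overlap:
            flags.append(("overlap", left.name, right.name, -gap))
    return flags


genes = [
    Gene("orf1", 1, 900, "+"),
    Gene("orf2", 850, 1500, "+"),   # overlaps orf1
    Gene("orf3", 2400, 3000, "-"),  # long gap after orf2
]
flags = flag_suspicious_neighbors(genes)
```

A flagged pair is only a prompt to look at the sequence by eye; a long gap may hold a missed orf, a hairpin, or repeats, and an overlap may point to a wrong start codon.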

Double-check all work.

Questions to ask when you look at a gene product annotation:

1.     Is it similar to any other protein? Is the alignment full-length or partial? Are
       there big gaps? Do the homologs have longer or shorter N-termini or C-
       termini? How many homologs are in the same genome?

2.     If a function has been predicted, is a domain associated with that function
       present in the orf? If so, is it the complete domain; if not, was the function
       predicted reasonably? For which homolog is there experimental evidence of

3.     Are there repeats within the orf? Are they plausible protein repeats, or DNA
       repeats that extend into the noncoding regions on either side?

4.     What is on either side of the orf? If it’s a toxin, is the antidote encoded next to
       it? If there’s a big intergenic region, what is in it - any hairpins or repeats? If
       the adjacent orf has the same annotation, is it a duplication or is it another
       piece of the same protein, split up by a frameshift? How well is the gene
       arrangement conserved?

5.     Is the protein typical of the Geobacteraceae or a very different group of
       organisms - e.g., Archaea or Eukarya? Are there multiple proteins with the
       same function but different lineages? Does the protein belong to a mobile
       genetic element such as a plasmid or prophage?

6.     Is there anything suspicious about the protein sequence? Lots of prolines and
       arginines? Is there a more plausible orf in a different reading frame? On the
       opposite strand? Is the orf on the same strand as the adjacent orfs? If not, does
       it seem to interrupt an operon of gene products that work together?

                             Genome Annotation Tools

The place for you to start looking at our Geobacteraceae genomes is our own website, where you can browse from gene to gene,
getting a feel for how little we know. Or, you can click on the "ORFS" button and
search for a word in the gene description, such as "kinase," if you are interested in a
particular sort of protein. The Geobacteraceae genomes encode a lot of cytochromes,
transcriptional regulators, sensor kinases, chemotaxis proteins, and transposases, and
a whole lot of proteins of unknown function (a.k.a. “proteins of imminent
importance”). Maybe it would be better if you went to and
searched for something like “laccase” or “sortase” - whatever nifty protein you heard
about at a seminar, and used the sequence as a query in the UMass BLAST tool.

For each genome, there is a list of contigs (fragments of the genome for which we are
fairly certain of the sequence), and for each contig there is a map showing where
genes have been predicted on one DNA strand (red) or the other (blue). Bacterial
genomes don't waste a lot of space, so if the genes are few and far between, we
should take a closer look: have we missed any genes or is there something else in the
DNA that is interesting? The maps all look circular, but unless the genome is
finished, the contigs are in fact linear. (The G. metallireducens plasmid is the only
small contig that is truly circular.) Sometimes, you will see two independent
predictions of genes (by the automated annotation at Oak Ridge National
Laboratory (ORNL) of the Joint Genome Institute (JGI), and by our collaborator
Julia Krushkal) that may not agree. I want to get rid of as many incorrect gene
predictions as possible.

Each predicted gene has its own page, where you can see its DNA and protein
sequences and functional description (if any exists). These pages are NOT hand-
made; if something on the page doesn't make sense, it probably doesn't belong there!
Click on "BLAST this sequence" to use the UMass BLAST tool to find similar genes
in our other genomes. The ">" symbol is required in the first line (called the FASTA
header) to signal that this line is the gene name, not part of the query sequence. The
BLAST results (or "hits") will include a number called the "e-value" of each hit,
which tells you "how many times you would expect to find this much similarity in a
database of this size, just by chance." An e-value less than 10^-5 is a decent
indication that two sequences had a common origin somewhere, and the closer to zero
the e-value is, the more certain you can be that the two sequences are variations on
the same theme. Keep in mind, however, that BLAST is a "local alignment" tool that
matches pieces of sequences; the two sequences may not be similar over their entire
lengths. You may want to pull out the protein sequences and generate an alignment,
or a phylogenetic tree showing their relationships, which you can do at (be sure to include a FASTA header before each protein
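
The two cautions above (an e-value cutoff, and the fact that a local alignment may cover only part of each sequence) can be combined into one check. The following is a rough sketch in Python, assuming the hits have already been copied out of the BLAST results into simple records. The Hit record, the function name, and the 80% coverage threshold are assumptions made for illustration; only the 10^-5 cutoff comes from the text.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    # Hypothetical record filled in by hand from a BLAST report (1-based, inclusive).
    subject: str
    evalue: float
    query_start: int
    query_end: int
    subject_start: int
    subject_end: int
    subject_length: int


def classify_hit(hit, query_length, max_evalue=1e-5, min_coverage=0.8):
    """Label a hit 'not significant', 'partial', or 'full-length'.

    min_coverage is an illustrative threshold, not an established value.
    """
    if hit.evalue >= max_evalue:
        return "not significant"
    query_cov = (hit.query_end - hit.query_start + 1) / query_length
    subject_cov = (hit.subject_end - hit.subject_start + 1) / hit.subject_length
    if min(query_cov, subject_cov) >= min_coverage:
        return "full-length"
    return "partial"


query_length = 300
hits = [
    Hit("homolog_A", 1e-40, 5, 295, 3, 290, 310),
    Hit("homolog_B", 1e-12, 200, 300, 1, 101, 450),
    Hit("homolog_C", 0.02, 10, 60, 80, 130, 200),
]
labels = [classify_hit(h, query_length) for h in hits]
```

A "partial" label is not a reason to discard a hit; it is a reason to check whether only one domain is shared, or whether one of the two orfs has the wrong start or has been split by a frameshift.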

BLAST can also be done at the National Center for Biotechnology Information
(NCBI) website, where you have the
advantage of seeing what domains your protein contains, and what they might do.
This site does not have several of the genomes-in-progress that you can find at - of particular interest to us are the genomes
with names that begin with “Desulf...” because these are relatives of the

Other tools can tell you more about a protein sequence: some predict where in the
cell it goes, some predict how it folds (alpha-helices, beta-strands, coils), and some
predict membrane protein topology (how many transmembrane segments and which
way they go into the membrane). The so-called "positive-inside rule" is that proteins
thread in and out of the membrane so that
most of the lysines and arginines in between the hydrophobic segments end up in the
cytoplasm. Although lysine and arginine carry a positive charge, the rule may have
nothing to do with that, because the other charged amino acids (histidine, aspartate,
glutamate) don’t affect topology. Rather, it may be significant that lysine and arginine
(and methionine) have long, slender side chains.
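
The positive-inside rule can be tried out by hand once a topology tool has given you the transmembrane segments. The following is a simplified sketch in Python: the function names and the toy sequence are made up for illustration, the segment coordinates are assumed to come from whatever prediction tool you used, and every loop residue is counted equally, whereas real predictors are more selective.

```python
def count_kr_by_side(sequence, tm_segments):
    """Count lysines and arginines in the loops on each side of the membrane.

    tm_segments: sorted list of (start, end) pairs, 0-based, end exclusive.
    Returns (count on the N-terminal side, count on the opposite side).
    """
    loops = []
    previous_end = 0
    for start, end in tm_segments:
        loops.append(sequence[previous_end:start])
        previous_end = end
    loops.append(sequence[previous_end:])
    counts = [0, 0]
    for index, loop in enumerate(loops):
        # Loops alternate sides; even-numbered loops share a side with the N-terminus.
        counts[index % 2] += loop.count("K") + loop.count("R")
    return counts[0], counts[1]


def predict_n_terminus(sequence, tm_segments):
    """Guess where the N-terminus lies, assuming the K/R-rich side is cytoplasmic."""
    same_side, other_side = count_kr_by_side(sequence, tm_segments)
    if same_side > other_side:
        return "cytoplasm"
    if other_side > same_side:
        return "outside"
    return "undetermined"


# Toy protein: K/R-rich tails flanking two hydrophobic segments joined by a short loop.
toy = "MKRK" + "L" * 20 + "SGS" + "A" * 20 + "RRKK"
toy_segments = [(4, 24), (27, 47)]
toy_counts = count_kr_by_side(toy, toy_segments)
toy_orientation = predict_n_terminus(toy, toy_segments)
```

If the counts on the two sides are close, treat the predicted orientation as a guess and look at homologs.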

You might also find some useful tools at

Back to our own website... You can use the Sequence Extractor tool from each gene's
page to pull out the DNA sequence and adjust the numbers to include the sequences
on either side of the gene. Not all genes start with an "ATG" codon; "GTG" is fairly
common, and others such as "TTG" and "ATC" have also been observed in nature.
However, there should be a "putative ribosome-binding sequence" similar to
AGGAGGT on the 5' side of the start codon, separated by 4 to 11 bases. This
sequence pairs with the 3' end of the 16S ribosomal RNA within the ribosome,
positioning the messenger RNA so that the ribosome knows where the start codon is,
and can start to translate protein. If you copy the sequence into a program such as
DNA Strider (available from me for Macintosh computers only) you can identify
other features of the sequence, such as hairpins (a.k.a. stem-loops - palindromic DNA
sequences such as GTGAATcatgttATTCAC in which potential base-pairing is shown
in capital letters). A strong hairpin can terminate transcription at the end of an operon
(a co-transcribed set of adjacent genes on the same strand), whereas the weak hairpin
in the example above blocks an ATG start codon so that the gene can only be
translated if the previous gene (ending with TGA) is fully translated by disrupting the
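
Both of the features described above can be checked with a few lines of code if DNA Strider is not available. The following is a minimal sketch in Python, with assumptions labeled: the function names, the allowance of up to two mismatches to AGGAGGT, and the example upstream sequence are all made up for illustration. The stem check only counts perfect Watson-Crick pairs starting from the two ends of the sequence you give it, so it will not find a hairpin buried inside a longer sequence.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def find_rbs(dna, start_codon_index, motif="AGGAGGT",
             min_spacer=4, max_spacer=11, max_mismatches=2):
    """Find the best motif-like site 4 to 11 bases upstream of a start codon.

    Returns (position, spacer, mismatches) or None. max_mismatches is an
    illustrative tolerance, not an established value.
    """
    dna = dna.upper()
    best = None
    for spacer in range(min_spacer, max_spacer + 1):
        position = start_codon_index - spacer - len(motif)
        if position < 0:
            continue
        window = dna[position:position + len(motif)]
        mismatches = sum(a != b for a, b in zip(window, motif))
        if mismatches <= max_mismatches and (best is None or mismatches < best[2]):
            best = (position, spacer, mismatches)
    return best


def stem_length_from_ends(dna):
    """Count the bases that pair when the two ends of dna fold back on each other."""
    dna = dna.upper()
    folded = reverse_complement(dna)
    paired = 0
    while paired < len(dna) // 2 and dna[paired] == folded[paired]:
        paired += 1
    return paired


# Made-up upstream region: 2 bases, the motif, a 6-base spacer, then ATG at index 15.
upstream = "CC" + "AGGAGGT" + "ACGTAC" + "ATGAAA"
rbs = find_rbs(upstream, 15)

# The hairpin from the text: six paired bases on each side of a six-base loop.
stem = stem_length_from_ends("GTGAATcatgttATTCAC")
```

In messenger RNA, G can also pair with U, so this check will underestimate some stems.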

Other things that you can compute with DNA Strider are the amino acid composition
of a protein and its codon usage. In genomes that are rich in G and C bases, like the
Geobacteraceae, the three stop codons (TAA, TAG, TGA) are infrequent, and so you
often find overlapping open reading frames of considerable length. Which one is
really a gene? You can make an educated guess by considering that the proline
codons (CCA, CCC, CCG, and CCT) would be fairly common in open reading
frames that are not real genes, but most real proteins don’t contain a lot of proline (the
protein would be too flexible because proline can’t donate a hydrogen bond to
anything). Likewise, the arginine codons (AGA, AGG, CGA, CGC, CGG, CGT) are
common in noncoding open reading frames, but a protein bristling with positively
charged arginines isn’t very likely to be useful. The codon usage of a predicted
protein can also be used to make an educated guess about how likely it is to be real. If
you see a lot of the codons GGA (glycine), AGA and CGA (arginine) instead of the
alternatives, you can imagine how hard it would be for the gene to survive when a
single base mutation would introduce a TGA stop codon at any of these locations.
Maybe it’s not a real protein. The codon usage can be used to calculate the Codon
Adaptation Index (CAI) that measures how well a predicted protein matches the
codon usage of highly expressed proteins in the same species. A high CAI suggests
that the protein is real and important; a low CAI could mean that the protein is
expressed at a low level because too much of it would be bad for the cell.
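
The codon checks described above are easy to compute without DNA Strider. The following is a rough sketch in Python: the function names are made up, and the weights in the example are invented numbers for illustration only. For a real CAI, each codon's weight is its frequency in a reference set of highly expressed genes from the same species, divided by the frequency of the most used codon for the same amino acid; the CAI is the geometric mean of the weights over the orf.

```python
import math
from collections import Counter

PROLINE = {"CCA", "CCC", "CCG", "CCT"}
ARGININE = {"AGA", "AGG", "CGA", "CGC", "CGG", "CGT"}
ONE_STEP_FROM_TGA = {"GGA", "AGA", "CGA"}  # a single base change gives TGA


def split_codons(orf):
    """Split an in-frame DNA sequence into codons, dropping any trailing bases."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3
    return [orf[i:i + 3] for i in range(0, usable, 3)]


def codon_report(orf):
    """Return the fraction of codons that are proline, arginine, or TGA-prone."""
    counts = Counter(split_codons(orf))
    total = sum(counts.values())

    def fraction(group):
        return sum(counts[codon] for codon in group) / total

    return {
        "proline": fraction(PROLINE),
        "arginine": fraction(ARGININE),
        "tga_prone": fraction(ONE_STEP_FROM_TGA),
    }


def codon_adaptation_index(orf, weights):
    """Geometric mean of the codon weights; codons with no weight are skipped."""
    logs = [math.log(weights[c]) for c in split_codons(orf) if c in weights]
    if not logs:
        return float("nan")
    return math.exp(sum(logs) / len(logs))


example_orf = "ATGCCGCGAGGA"  # ATG, CCG, CGA, GGA
report = codon_report(example_orf)

# Invented weights for illustration; real ones come from highly expressed genes.
example_weights = {"CCG": 1.0, "CGA": 0.25, "GGA": 0.25}
cai = codon_adaptation_index(example_orf, example_weights)
```

There is no agreed cutoff for any of these numbers. Compare a doubtful orf with orfs from the same genome that you are confident are real.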
