Subsystem Approach to Genome Annotation
 National Microbial Pathogen Data Resource
              www.nmpdr.org

                 Claudia Reich
       NCSA, University of Illinois, Urbana

 Complete Microbial Genomes
• 464 complete microbial genomes in NCBI as of 3-1-07
• 691 microbial genomes in progress as of 3-1-07




   Making Sense of Genome Data
• Locate Genes: identify ORFs automatically
     GeneMark
     NCBI’s ORF Finder
     Glimmer
     Critica
• Assign Function: by sequence similarity to
  experimentally characterized proteins
   BLAST family of sequence comparison tools
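
To make the "Locate Genes" step concrete, here is a minimal Python sketch of a naive ORF scan. It is an illustration only: it assumes a simple ATG-to-stop definition on the forward strand, and the names `find_orfs` and `min_codons` are hypothetical. The tools listed above rely on statistical models of coding sequence rather than this rule.

```python
# Illustrative sketch only: a naive ORF scan, NOT the method used by
# GeneMark, Glimmer, or Critica.
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    `end` is exclusive and includes the stop codon; `min_codons` counts
    codons from the start codon up to, but not including, the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the first ATG seen since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


print(find_orfs("CCATGAAATTTGGGTAACC"))  # [(2, 17, 2)]
```

A real gene finder would also scan the reverse strand and consider alternative start codons.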


Problems with Assignments by Similarity
• When the ORF is a member of a protein family
• Paralogous genes
• ORFs encoding similar proteins acting on
  different substrates
• Assignments can be transitive, and many
  times removed from experimental data



Other Factors Can Aid in Function Assignments
 •   Molecular phylogeny
 •   Paralogous and orthologous families
 •   Conserved gene neighborhood
 •   Metabolic context
 •   Bidirectional best hit matches across
     multiple genomes
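
The bidirectional best hit idea can be sketched in a few lines of Python. This is a hedged illustration, assuming pairwise similarity scores are already available (for example, parsed from BLAST output); the gene IDs, scores, and function names below are all hypothetical.

```python
def best_hits(scores):
    """scores: {query_gene: {subject_gene: score}} -> {query_gene: best subject}."""
    return {q: max(hits, key=hits.get) for q, hits in scores.items() if hits}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Made-up scores for two genomes, A and B.
a_vs_b = {
    "a1": {"b1": 250.0, "b2": 90.0},
    "a2": {"b2": 120.0, "b1": 110.0},
    "a3": {"b1": 60.0},
}
b_vs_a = {
    "b1": {"a1": 248.0, "a3": 61.0},
    "b2": {"a2": 119.0, "a1": 88.0},
}

print(bidirectional_best_hits(a_vs_b, b_vs_a))  # [('a1', 'b1'), ('a2', 'b2')]
```

Here `a3` hits `b1`, but `b1` prefers `a1`, so `a3` is not reported. Repeating the comparison across several genomes gives the multi-genome evidence mentioned above.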



Incorporating Information Other Than Similarity
• KEGG: manually curated pathway and
  metabolic maps
• GO: vocabularies that describe ORFs as
  associated with
   biological processes
   cellular components
   molecular function
• MetaCyc: experimentally elucidated metabolic
  pathways

             What is Needed:
• A system that:
   integrates all the above concepts
   organizes genomic data in structured idioms
   allows high-throughput annotation of newly
    sequenced genomes
   resolves discrepancies between different annotation
    tools
   informs experimental research



              Enter the SEED*
• Database and annotation environment
• Underlies, and is accessible through, NMPDR
  (www.nmpdr.org)
• Expert annotation via subsystem building
• Provides the most accurate genome
  annotations available

  *Argonne National Lab, University of Chicago, UIUC, FIG



          What is a Subsystem?
• Any organizing biological principle:
   metabolic pathway
     • amino acid biosynthesis, nitrogen fixation, glycolysis
   complex structure
     • ribosome, flagellum
   set of defining features
     • virulome, pathogenicity islands
   functional concept
     • bacterial sigma factors, DNA binding proteins



             Subsystems are:
• Sets of functional roles, which are functions,
  or abstractions of functions (such as an EC
  number), that together implement a specific
  biological process or concept
• Created manually by expert curators
• Experts annotate single subsystems over the
  complete collection of genomes, thus
  contributing and sharing their expertise with
  the scientific community

      How Subsystems are Built
• Create a subsystem for the biological concept,
  and define the functional roles
• In one (or a few) key organisms that include
  the subsystem, find the genes and assign
  meaningful functional names
• Project the annotations to orthologous genes
• Expand to more genomes, creating a
  Populated Subsystem
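
The projection step can be pictured with a short Python sketch. It assumes an ortholog map from a key genome to a target genome is already in hand (for instance, from bidirectional best hits); the function name, gene IDs, and data layout are hypothetical and are not taken from the SEED.

```python
def project_roles(key_roles, orthologs):
    """Copy functional roles from a key genome onto orthologs in a target genome.

    key_roles: {gene_in_key_genome: functional_role}
    orthologs: {gene_in_key_genome: gene_in_target_genome}
    Returns {gene_in_target_genome: functional_role}.
    """
    projected = {}
    for gene, role in key_roles.items():
        target = orthologs.get(gene)
        if target is not None:  # no ortholog found: nothing to project
            projected[target] = role
    return projected


# Made-up gene IDs; the roles are two glycolysis enzymes.
key_roles = {"key_0001": "Glucokinase", "key_0002": "Pyruvate kinase"}
orthologs = {"key_0001": "tgt_0417"}

print(project_roles(key_roles, orthologs))  # {'tgt_0417': 'Glucokinase'}
```

Roles with no ortholog in the target genome are left unassigned, which is where expert review comes in.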


       Populated Subsystems
• Are Spreadsheets where:
   Columns: functional roles
   Rows: specific genomes
   Cells: genes in the organism that implement the
    functional role
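
As a rough picture of this layout, the spreadsheet can be modeled as a nested mapping. This is a sketch under stated assumptions: the genome names and gene IDs are invented, and the structure is not the SEED's actual storage format.

```python
# Columns: functional roles.
ROLES = ["RNAP alpha subunit", "RNAP beta subunit", "RNAP beta' subunit"]

# Rows: genomes. Cells: list of gene IDs implementing the role (IDs made up).
SPREADSHEET = {
    "Genome A": {
        "RNAP alpha subunit": ["gA_0012"],
        "RNAP beta subunit": ["gA_0450"],
        "RNAP beta' subunit": ["gA_0451"],
    },
    "Genome B": {
        "RNAP beta subunit": ["gB_0301"],
        "RNAP beta' subunit": ["gB_0302"],
    },
}


def render(spreadsheet, roles):
    """Return the spreadsheet as tab-separated text; empty cells shown as '-'."""
    lines = ["\t".join(["Genome"] + roles)]
    for genome, cells in spreadsheet.items():
        row = [", ".join(cells.get(role, [])) or "-" for role in roles]
        lines.append("\t".join([genome] + row))
    return "\n".join(lines)


print(render(SPREADSHEET, ROLES))
```

Storing a list per cell allows for more than one gene per role, or none at all.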




     How to Access Subsystems
• From Search menu
• From Organism pages
• From search results when found protein is
  included in a subsystem
• From Annotation Overview pages




      Subsystem Pages in NMPDR
•   Table of Functional Roles
•   Subsystem diagram (if appropriate)
•   Populated subsystem spreadsheet
•   Customizable spreadsheet viewing options
•   Functional variants and subsets of roles
•   Curator’s notes




        Benefits of Subsystems
• More accurate annotations
• Annotation of protein families
• Analysis of sets of functionally related
  proteins
• Automatic projections to novel genomes are less
  error-prone




  Subsystems Reveal Interesting
• Pathway variants:
   Are they clustered by phylogeny?
    • Delta subunit of RNA polymerase only in Bacillales
   Are they clustered by functional niche?
   Horizontal gene transfer?
• Fused genes:
   β and β′ subunits of RNA polymerase fused in
     Helicobacter
• Fissioned genes:
   β′ subunit of RNA polymerase is fissioned in
     Cyanobacteria

  Subsystems Reveal Interesting
• Duplicate assignments
   More than one gene for one functional role?
    • Alpha subunit of RNA polymerase in Magnetococcus
      and Francisella
   Same sequenced region in more than one contig
    in partially assembled genomes?
   Frameshifts or other sequencing errors?
   Annotation errors?




  Subsystems Reveal Interesting
• Missing genes:
   Is the function essential?
   Is the function conserved?
   Does the missing gene cluster with homologs in
    other organisms?
   Is the function performed by a newly recruited
    gene?
   Has a gene been acquired by horizontal gene
    transfer and now performs that function?
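
Both duplicate assignments and missing genes show up as unusual cells in a populated subsystem. The Python sketch below flags them; it assumes a genome-to-role-to-genes mapping, and the function name, genome names, and gene IDs are hypothetical. It only locates the cells, and the biological questions above still need a curator.

```python
def audit(spreadsheet, roles):
    """Return (missing, duplicated) lists of (genome, role) pairs.

    spreadsheet: {genome: {role: [gene IDs]}}
    A role is 'missing' if no gene is assigned and 'duplicated' if more
    than one gene is assigned in the same genome.
    """
    missing, duplicated = [], []
    for genome, cells in spreadsheet.items():
        for role in roles:
            n = len(cells.get(role, []))
            if n == 0:
                missing.append((genome, role))
            elif n > 1:
                duplicated.append((genome, role))
    return missing, duplicated


# Made-up example data.
roles = ["alpha subunit", "beta subunit"]
sheet = {
    "Genome A": {"alpha subunit": ["gA_1", "gA_2"], "beta subunit": ["gA_3"]},
    "Genome B": {"alpha subunit": ["gB_1"], "beta subunit": []},
}

missing, duplicated = audit(sheet, roles)
print(missing)     # [('Genome B', 'beta subunit')]
print(duplicated)  # [('Genome A', 'alpha subunit')]
```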



Synthesis of Selenocysteinyl-tRNA
• Two known pathway variants
   One step in Bacteria
    • SelA is annotated
   Two steps in Archaea and Eucarya
    • PSTK (O-phosphoseryl-tRNA kinase) was missing until very recently




       Explore Selenocysteine Usage
• Start by searching for gene name, selA, in an organism known
  to use Sec, E. coli K12
• Or start from the subsystem tree: expand the category "Protein
  metabolism," then the subcategory "Selenoproteins"
• Open "Selenocysteine metabolism" subsystem from protein
  page or SS tree
    Genomes arranged phylogenetically
    Roles defined on mouse-over
    What genes are missing in which organisms?
    Are there Sec metabolism genes present in any organisms that do not
     have proteins that need Sec?
    Are there organisms known to need Sec for certain proteins, but that
     do not have a complete Sec biosynthesis pathway?
    Why is there a hypothetical protein included in this subsystem?
