

 Subsystem Approach to Genome Annotation
 National Microbial Pathogen Data Resource

                 Claudia Reich
       NCSA, University of Illinois, Urbana
 Complete Microbial Genomes
• 464 complete microbial genomes in NCBI as of 3-1-07
• 691 microbial genomes in progress as of 3-1-07

   Making Sense of Genome Data
• Locate Genes: identify ORFs automatically
     GeneMark
     NCBI’s ORF Finder
     Glimmer
     Critica
• Assign Function: by sequence similarity to
  experimentally characterized proteins
   BLAST family of sequence comparison tools
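
To make "identify ORFs automatically" concrete, here is a minimal sketch of a naive ORF scan. It is not how GeneMark, Glimmer, or Critica work (those rely on statistical models of coding sequence); it only looks for ATG-to-stop stretches in the three forward reading frames. The function name `find_orfs` and the `min_codons` threshold are assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) tuples for ATG..stop stretches.

    Only the forward strand is scanned. Coordinates are 0-based,
    end-exclusive, and the stop codon is included in the span.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG seen since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i + 3 - start) // 3
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs
```

A real gene finder would also scan the reverse strand and handle alternative start codons; this sketch leaves both out.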

Problems with Assignments by Sequence Similarity
• When ORF is a member of a protein family
• Paralogous genes
• ORFs encoding similar proteins acting on
  different substrates
• Assignments can be transitive, and many
  times removed from experimental data

Other Factors Can Aid in Function Assignment
 •   Molecular phylogeny
 •   Paralogous and orthologous families
 •   Conserved gene neighborhood
 •   Metabolic context
 •   Bidirectional best hit matches across
     multiple genomes
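
The bidirectional best hit idea can be sketched as follows, assuming similarity scores between two genomes have already been computed (for example from a BLAST run) and loaded into nested dictionaries. The function names and the input layout are assumptions for illustration, not part of any particular tool.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene.

    scores: {query_gene: {subject_gene: similarity_score}}
    """
    return {q: max(hits, key=hits.get) for q, hits in scores.items() if hits}

def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)
```

Extending this across multiple genomes would mean repeating the comparison for every genome pair and keeping the genes whose matches agree.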

Incorporating Information Other Than Sequence Similarity
• KEGG: manually curated pathway and
  metabolic maps
• GO: vocabularies that describe ORFs as
  associated with
   biological processes
   cellular components
   molecular function
• MetaCyc: experimentally elucidated metabolic
  pathways

             What is Needed:
• A system that:
   integrates all the above concepts
   organizes genomic data in structured idioms
   allows high-throughput annotation of newly
    sequenced genomes
   resolves discrepancies between different annotations
   informs experimental research

              Enter the SEED*
• Database and annotation environment
• Underlies, and is accessible through, NMPDR
• Expert annotation via subsystem building
• Provides the most accurate genome
  annotations available

  *Argonne National Lab, University of Chicago, UIUC, FIG

          What is a Subsystem?
• Any organizing biological principle:
   metabolic pathway
     • amino acid biosynthesis, nitrogen fixation, glycolysis
   complex structure
     • ribosome, flagellum
   set of defining features
     • virulome, pathogenicity islands
   functional concept
     • bacterial sigma factors, DNA binding proteins

             Subsystems are:
• Sets of functional roles, which are functions,
  or abstractions of functions (such as an EC
  number), that together implement a specific
  biological process or concept
• Created manually by expert curators
• Experts annotate single subsystems over the
  complete collection of genomes, thus
  contributing and sharing their expertise with
  the scientific community

      How Subsystems are Built
• Create a subsystem for the biological concept,
  and define the functional roles
• In one (or a few) key organisms that include
  the subsystem, find the genes and assign
  meaningful functional names
• Project the annotations to orthologous genes
• Expand to more genomes, creating a
  Populated Subsystem
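
The projection step above can be sketched in a few lines, assuming orthologs of the key organism's genes have already been identified. The function name `project_roles` and the dictionary layout are hypothetical; in practice curators review projected assignments rather than accepting them blindly.

```python
def project_roles(key_annotations, orthologs):
    """Copy functional roles from key-organism genes to their orthologs.

    key_annotations: {gene_in_key_organism: functional_role}
    orthologs:       {gene_in_key_organism: [orthologous genes elsewhere]}
    Returns {orthologous_gene: functional_role}.
    """
    projected = {}
    for gene, role in key_annotations.items():
        for ortholog in orthologs.get(gene, []):
            # Keep the first role projected onto a gene; conflicting
            # projections would need manual review.
            projected.setdefault(ortholog, role)
    return projected
```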

       Populated Subsystems
• Are Spreadsheets where:
   Columns: functional roles
   Rows: specific genomes
   Cells: genes in the organism that implement the
    functional role
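
One way to picture the spreadsheet is as a nested mapping from genome to functional role to genes. The sketch below builds that mapping from (genome, role, gene) triples and prints it as tab-separated rows; the function names and the "-" marker for empty cells are assumptions for illustration, not the SEED's internal representation.

```python
from collections import defaultdict

def build_spreadsheet(assignments):
    """Build {genome: {role: [genes]}} from (genome, role, gene) triples."""
    sheet = defaultdict(lambda: defaultdict(list))
    for genome, role, gene in assignments:
        sheet[genome][role].append(gene)
    return {genome: dict(roles) for genome, roles in sheet.items()}

def format_spreadsheet(sheet, roles):
    """Render rows = genomes, columns = roles, cells = genes ('-' if empty)."""
    lines = ["\t".join(["Genome"] + list(roles))]
    for genome in sorted(sheet):
        cells = [", ".join(sheet[genome].get(role, [])) or "-" for role in roles]
        lines.append("\t".join([genome] + cells))
    return "\n".join(lines)
```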

     How to Access Subsystems
• From Search menu
• From Organism pages
• From search results when found protein is
  included in a subsystem
• From Annotation Overview pages

      Subsystem Pages in NMPDR
•   Table of Functional Roles
•   Subsystem diagram (if appropriate)
•   Populated subsystem spreadsheet
•   Customizable spreadsheet viewing options
•   Functional variants and subsets of roles
•   Curator’s notes

        Benefits of Subsystems
• More accurate annotations
• Annotation of protein families
• Analysis of sets of functionally related
  proteins
• Less error-prone automatic projections to
  novel genomes

  Subsystems Reveal Interesting
• Pathway variants:
   Are they clustered by phylogeny?
    • Delta subunit of RNA polymerase found only in Bacillales
   Are they clustered by functional niche?
   Horizontal gene transfer?
• Fused genes:
     and ’ subunit of RNA polymerase fused in
• Fissioned genes:
     subunit of RNA polymerase is fissioned in

  Subsystems Reveal Interesting
• Duplicate assignments
   More than one gene for one functional role?
    • Alpha subunit of RNA polymerase in Magnetococcus
      and Francisella
   Same sequenced region in more than one contig
    in partially assembled genomes?
   Frameshifts or other sequencing errors?
   Annotation errors?
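
Given a populated subsystem held as a nested mapping (genome to role to list of genes, an assumed layout), duplicate assignments are simply the cells with more than one gene. A minimal sketch, with a hypothetical function name:

```python
def duplicate_assignments(sheet):
    """Return sorted (genome, role, genes) for cells holding more than one gene.

    sheet: {genome: {role: [genes]}}
    """
    return sorted(
        (genome, role, tuple(genes))
        for genome, roles in sheet.items()
        for role, genes in roles.items()
        if len(genes) > 1
    )
```

The code only flags candidates. Deciding whether a flagged cell reflects true paralogs, overlapping contigs, a frameshift, or an annotation error still takes the manual checks listed above.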

  Subsystems Reveal Interesting
• Missing genes:
   Is the function essential?
   Is the function conserved?
   Does the missing gene cluster with homologs in
    other organisms?
   Is the function performed by a newly recruited
    gene?
   Has a gene been acquired by horizontal gene
    transfer and now performs that function?
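
Missing genes show up as empty cells in the populated spreadsheet. Assuming the same nested layout (genome to role to list of genes), a short sketch that lists, per genome, the roles with no assigned gene; the function name is hypothetical:

```python
def missing_roles(sheet, roles):
    """Return {genome: [roles with no assigned gene]} for incomplete rows.

    sheet: {genome: {role: [genes]}}
    roles: the functional roles that define the subsystem
    """
    gaps = {}
    for genome, cells in sheet.items():
        absent = [role for role in roles if not cells.get(role)]
        if absent:
            gaps[genome] = absent
    return gaps
```

Each reported gap is a starting point for the questions above, not evidence that the function is truly absent.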

Synthesis of Selenocysteinyl-tRNA
• Two known pathway variants
   One step in Bacteria
    • SelA is annotated
   Two steps in Archaea and Eucarya
    • PSTK was missing until very recently
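
As an illustration of how a pathway variant might be read off a subsystem row, the sketch below labels a genome by which of the two roles named above it carries. The role labels "SelA" and "PSTK" come from the slide; the function name and the returned strings are assumptions, and real variant codes are assigned by curators.

```python
def sec_pathway_variant(roles_present):
    """Classify a genome by the Sec-tRNA synthesis roles found in it.

    roles_present: collection of role labels annotated in the genome
    """
    has_sela = "SelA" in roles_present
    has_pstk = "PSTK" in roles_present
    if has_sela and has_pstk:
        return "both"
    if has_sela:
        return "one-step (SelA)"
    if has_pstk:
        return "two-step (PSTK)"
    return "none detected"
```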

       Explore Selenocysteine Usage
• Start by searching for gene name, selA, in an organism known
  to use Sec, E. coli K12
• Start from subsystem tree; expand category of "Protein
  metabolism," expand subcategory of "Selenoproteins"
• Open "Selenocysteine metabolism" subsystem from protein
  page or SS tree
    Genomes arranged phylogenetically
    Roles defined on mouse-over
    What genes are missing in which organisms?
    Are there Sec metabolism genes present in any organisms that do not
     have proteins that need Sec?
    Are there organisms known to need Sec for certain proteins, but that
     do not have a complete Sec biosynthesis pathway?
    Why is there a hypothetical protein included in this subsystem?

