Annotating Hymenoptera Genomes


            Chris Elsik
       Georgetown University
    Our roles in honey bee and wasp
    genome analysis and annotation
•   Creating consensus gene set
•   Community manual annotation
•   Hymenoptera Genome Database - BeeBase and NasoniaBase
•   GC composition analysis
•   microRNA prediction
•   Repeat Analysis (bee)
•   Superscaffolding chromosomes (bee)
•   Manual gene annotation - chromosomes 15 and 16; homeobox genes (bee)
•   Porting sequence feature coordinates across assemblies (bee)
Consensus Gene Set
     Objectives for Honey Bee
      Consensus Gene Set
• Obtain an improved set of gene models with
  increased coverage of known genes, while
  maintaining gene model quality.

• Provide a single official gene list that the research
  community could further utilize for consistent and
  comparable analyses and functional annotation.
  GLEAN Consensus Gene Tool

• GLEAN is a tool for creating consensus gene lists
  by integrating gene evidence using Latent Class
  Analysis.
• Does not require a training set.
• Each consensus prediction is labeled with a
  probabilistic confidence score reflecting the
  underlying support for that gene model.
• Other consensus gene tools include JIGSAW,
  EVidenceModeler, Evigan
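
To illustrate the idea behind a latent class approach, here is a minimal sketch of a two-class latent class model fitted with EM. It is not GLEAN's implementation: the function name, the starting values, and the toy vote matrix are all assumptions made for illustration. Each row is a candidate feature (for example an exon), each column records whether one evidence source supports it, and the model estimates, without a training set, how reliable each source is and how probable it is that each feature is real.

```python
def latent_class_em(votes, n_iter=50):
    """Fit a two-class latent class model to 0/1 votes with EM.

    votes: list of tuples, one per candidate feature, one 0/1 entry per
           evidence source (hypothetical input layout).
    Returns (posteriors, p_real, p_false).
    """
    n_sources = len(votes[0])
    prior = 0.5                      # P(feature is real), starting guess
    p_real = [0.7] * n_sources       # P(source votes 1 | feature is real)
    p_false = [0.3] * n_sources      # P(source votes 1 | feature is not real)

    def clamp(x):
        # Keep probabilities away from exactly 0 or 1.
        return min(max(x, 1e-6), 1 - 1e-6)

    posteriors = []
    for _ in range(n_iter):
        # E-step: posterior probability that each feature is real.
        posteriors = []
        for row in votes:
            like_real, like_false = prior, 1 - prior
            for v, pr, pf in zip(row, p_real, p_false):
                like_real *= pr if v else 1 - pr
                like_false *= pf if v else 1 - pf
            posteriors.append(like_real / (like_real + like_false))

        # M-step: re-estimate the prior and the per-source vote rates.
        total_real = max(sum(posteriors), 1e-12)
        total_false = max(len(votes) - sum(posteriors), 1e-12)
        prior = clamp(sum(posteriors) / len(votes))
        for j in range(n_sources):
            p_real[j] = clamp(
                sum(w * row[j] for w, row in zip(posteriors, votes)) / total_real)
            p_false[j] = clamp(
                sum((1 - w) * row[j] for w, row in zip(posteriors, votes)) / total_false)
    return posteriors, p_real, p_false


# Toy data: three evidence sources voting on eight candidate features.
votes = [(1, 1, 1), (1, 1, 0), (1, 1, 1), (0, 0, 0),
         (0, 0, 1), (0, 0, 0), (1, 1, 1), (0, 0, 0)]
posteriors, p_real, p_false = latent_class_em(votes)
for row, post in zip(votes, posteriors):
    print(row, round(post, 3))
```

The posterior for each feature plays the role of the confidence score mentioned above: features supported by sources that agree with one another score high, and the per-source rates are learned from the agreement pattern alone.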
Honey Bee Sources of Gene Evidence

• Ensembl - based on protein and EST evidence
• RefSeq - based on protein and EST evidence,
  combined with Gnomon (hidden Markov model)
• Evolutionary Conserved Core - based on metazoan
  orthologous proteins, using Fgenesh+
• Drosophila Ortholog Set - based on Drosophila
  orthologs using Genewise
• Fgenesh - ab initio using hidden Markov model
• Alignments to honey bee ESTs and to Swissprot
  metazoan proteins using Exonerate for splice site
 Evaluating Honey Bee GLEAN Sets

• Several GLEAN runs using different inputs
• Evaluate by comparing GLEAN sets and individual
  input sets with “Gold Standard”
   – 395 manual annotations based on full-length
• Select best GLEAN set for use in genome analysis
• Further evaluation
   – sensitivity and specificity using expert annotated
     chromosomes 15 and 16 (entire chromosome)
   – Agreement with EST splice sites
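
As a rough sketch of the sensitivity and specificity evaluation, the snippet below scores predictions at the exon level against expert annotations. The slides do not state the level (nucleotide, exon, or gene) or the exact definitions used, so this assumes the convention common in gene prediction, where specificity is the fraction of predicted exons that are correct. The function name and coordinates are hypothetical.

```python
def exon_sensitivity_specificity(predicted, annotated):
    """Exon-level accuracy of a prediction set against expert annotations.

    Each exon is a (scaffold, start, end, strand) tuple; an exon counts as
    correct only if all four fields match exactly (an assumption).
    """
    predicted, annotated = set(predicted), set(annotated)
    true_positives = len(predicted & annotated)
    # Sensitivity: fraction of annotated exons that were predicted.
    sensitivity = true_positives / len(annotated) if annotated else 0.0
    # Specificity (gene-prediction usage): fraction of predicted exons
    # that match an annotated exon.
    specificity = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Hypothetical coordinates for illustration only.
annotated = [("chr15", 100, 200, "+"), ("chr15", 300, 400, "+"),
             ("chr15", 500, 600, "+"), ("chr15", 700, 800, "+")]
predicted = [("chr15", 100, 200, "+"), ("chr15", 300, 400, "+"),
             ("chr15", 520, 600, "+")]
sensitivity, specificity = exon_sensitivity_specificity(predicted, annotated)
print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
```

Using fully annotated chromosomes matters here: specificity can only be estimated when every real exon in the region is known, so that unmatched predictions can be counted as errors.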
  Comparison with Gold Standard
Predicted Gene Set            No. Gene   No. Perfect Alignments   No. Present
                              Models     / weighted by            / weighted by
                                         no. gene models          no. gene models
GLEAN                         10157      111 / .011               356 / .035
Fgenesh                       32664      100 / .003               385 / .012
NCBI                           9759       88 / .009               340 / .035
Evolutionary Conserved Core   10966       39 / .004               284 / .026
Ensembl                       27755       32 / .0012              217 / .008
Drosophila                     8878        4 / .0005              116 / .013
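
The weighted values in the table are consistent with dividing each count by the number of gene models in that set. The short sketch below recomputes them from the table's own numbers; the variable and function names are assumptions.

```python
# (gene models, perfect alignments, present) taken from the table above.
gold_standard_results = {
    "GLEAN": (10157, 111, 356),
    "Fgenesh": (32664, 100, 385),
    "NCBI": (9759, 88, 340),
    "Evolutionary Conserved Core": (10966, 39, 284),
    "Ensembl": (27755, 32, 217),
    "Drosophila": (8878, 4, 116),
}


def weighted(count, n_gene_models):
    """Weight a gold-standard count by the size of the gene set."""
    return count / n_gene_models


for name, (n_models, n_perfect, n_present) in gold_standard_results.items():
    print(f"{name:28s} {weighted(n_perfect, n_models):.4f} "
          f"{weighted(n_present, n_models):.4f}")
```

Weighting penalizes sets that recover gold-standard genes only by predicting very many models: Fgenesh finds the most gold-standard genes (385) but with more than three times as many models as GLEAN.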
Sensitivity and Specificity of Gene Prediction
Methods using 684 Expert Annotated Gene
   Models from Chromosomes 15 & 16
     Nasonia Consensus Gene Set

• Evidence:
• RefSeq (evidence based) - 9230 genes
• Gnomon ab initio (NCBI’s ab initio set) - 16,324
  genes (that don’t overlap RefSeq)
• Fgenesh++ (ab initio plus evidence) - 26,057 genes
• Fgenesh (ab initio) - 32,417 genes
• Augustus (ab initio) - 30,196 genes
• Nasonia EST, Swissprot metazoa homolog, and
  honey bee official gene set homolog alignments
Nasonia Consensus Gene Set Challenges

• RefSeq was the only evidence-based set
   – We chose RefSeq as the Official Gene Set, but it
      only had 9230 genes (there should be more)
   – How do we select ab initio gene models to
      increase the number of genes?
• Initially we did not have Augustus
• We could not use Gnomon ab initio in GLEAN
  because it was in total disagreement with RefSeq
  (which is a subset of the entire Gnomon set)
• We didn’t want to combine RefSeq with Gnomon,
  because we needed a good evidence set
• Fgenesh and Fgenesh++ are similar to each other
  (so together they outvote RefSeq)
            Nasonia GLEAN Sets
Gene Set     Genes   Input to GLEAN
RefSeq        9230
Gnomon ab    16324
Augustus     30196
Fgenesh      32417
Fgenesh++    26057
GENEID       15170
Glean1       28565   All, Fgenesh and Fgenesh++ are not combined
Glean2       26699   All, Fgenesh and Fgenesh++ are combined
Glean3       23371   RefSeq, Augustus, combined Fgenesh/Fgenesh++, GENEID
Glean4       23591   RefSeq, Augustus, Fgenesh, GENEID
Glean5       22844   RefSeq, Augustus, Fgenesh
Glean6       15216   RefSeq, Gnomon, Augustus, Fgenesh
Glean7       28248   RefSeq, Gnomon, Augustus, Fgenesh,
Evaluation of Nasonia GLEAN Sets
• Comparison with 82 manually annotated gene
  models from Hugh Robertson
• Agreement with EST splice sites
• Gene number
• Number of GLEAN genes that overlap RefSeq and
  vice versa
• Overlap of GLEAN sets with each other (estimate
  splitting and merging compared to each other)
• Spot check GLEAN sets in Gbrowse
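
One way to estimate splitting and merging between two gene sets is to count, for each gene in one set, how many genes in the other set it overlaps. The sketch below assumes genes are compared by scaffold, strand, and genomic span only; the record layout, names, and coordinates are hypothetical.

```python
from collections import namedtuple

Gene = namedtuple("Gene", "gene_id scaffold strand start end")


def overlaps(a, b):
    """True if two genes share a scaffold and strand and their spans intersect."""
    return (a.scaffold == b.scaffold and a.strand == b.strand
            and a.start <= b.end and b.start <= a.end)


def overlap_summary(set_a, set_b):
    """Classify each gene in set_a by how many set_b genes it overlaps."""
    counts = {"no_overlap": 0, "overlaps_one": 0, "spans_multiple": 0}
    for a in set_a:
        n = sum(overlaps(a, b) for b in set_b)
        if n == 0:
            counts["no_overlap"] += 1
        elif n == 1:
            counts["overlaps_one"] += 1
        else:
            # a covers several set_b genes: a possible merge in set_a,
            # or a possible split in set_b.
            counts["spans_multiple"] += 1
    return counts


glean_set = [Gene("g1", "scf1", "+", 100, 900),
             Gene("g2", "scf1", "+", 1500, 1800),
             Gene("g3", "scf1", "+", 3000, 3500)]
refseq_set = [Gene("r1", "scf1", "+", 100, 400),
              Gene("r2", "scf1", "+", 500, 900),
              Gene("r3", "scf1", "+", 1500, 1800)]

glean_vs_refseq = overlap_summary(glean_set, refseq_set)
refseq_vs_glean = overlap_summary(refseq_set, glean_set)
print(glean_vs_refseq)
print(refseq_vs_glean)
```

Running the comparison in both directions, as the list above suggests, shows which set is the one doing the merging.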
       Final Nasonia OGS

• 9230 RefSeq genes

• 6935 GLEAN6 genes that do not
  overlap with RefSeq

• 16165 total genes
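
The final set is a union of all RefSeq genes with the GLEAN6 genes that overlap none of them (9230 + 6935 = 16165). A minimal sketch of that selection follows, assuming overlap is judged on genomic span within a scaffold; the slides do not say whether strand or exon overlap was considered, and the function name and toy coordinates are hypothetical.

```python
def non_overlapping(candidates, reference):
    """Return candidate genes whose span overlaps no reference gene.

    Genes are (gene_id, scaffold, start, end) tuples, 1-based inclusive.
    """
    spans_by_scaffold = {}
    for _, scaffold, start, end in reference:
        spans_by_scaffold.setdefault(scaffold, []).append((start, end))

    kept = []
    for gene in candidates:
        _, scaffold, start, end = gene
        hits = spans_by_scaffold.get(scaffold, [])
        if not any(start <= r_end and r_start <= end for r_start, r_end in hits):
            kept.append(gene)
    return kept


refseq = [("R1", "scf1", 100, 900), ("R2", "scf2", 50, 400)]
glean6 = [("G1", "scf1", 850, 1200),   # overlaps R1, dropped
          ("G2", "scf1", 2000, 2600),  # no overlap, kept
          ("G3", "scf3", 10, 500)]     # scaffold without RefSeq genes, kept

official_gene_set = refseq + non_overlapping(glean6, refseq)
print([gene[0] for gene in official_gene_set])

# The totals reported on the slide add up the same way.
assert 9230 + 6935 == 16165
```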
       Community Gene Annotation

• Advantages
   – Biologists can focus on their genes of interest
   – Leverage the expertise of biologists

• Challenges
   – Varying degrees of gene annotation expertise among
   – Errors in data entry (exon coordinates)
   – Need to quality check submitted gene models
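
Some data-entry errors in exon coordinates can be caught automatically before a curator looks at a submission. The sketch below shows a few basic checks; it is an illustration under assumed conventions (1-based inclusive coordinates, hypothetical function name), not the checks BeeBase actually ran.

```python
def check_gene_model(exons, scaffold_length):
    """Return a list of problems found in a submitted gene model.

    exons: list of (start, end) pairs, 1-based inclusive, on one scaffold.
    """
    problems = []
    for start, end in exons:
        if start > end:
            problems.append(f"exon {start}-{end}: start is after end")
        if start < 1 or end > scaffold_length:
            problems.append(f"exon {start}-{end}: outside scaffold bounds")

    # Adjacent exons, once sorted by start, should not overlap.
    ordered = sorted(exons)
    for (s1, e1), (s2, e2) in zip(ordered, ordered[1:]):
        if s2 <= e1:
            problems.append(f"exons {s1}-{e1} and {s2}-{e2} overlap")
    return problems


good_model = check_gene_model([(100, 250), (400, 520)], scaffold_length=10000)
bad_model = check_gene_model([(100, 250), (240, 520), (900, 800)],
                             scaffold_length=10000)
print(good_model)
print(bad_model)
```

Checks like these only catch mechanical mistakes; whether the model is biologically right still needs review against the evidence.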
    Honey Bee Community Annotation

•    61 honey bee biologists from around the world contributed ~3000
     gene models (exon/intron/UTR structure) and functional annotations

•    Participants were divided into groups based on themes, each with a
     group leader
      – Innate immunity
      – Pesticides and stress resistance
      – Neurobiology and behavior
      – Gene regulation
      – Development and metabolism
      – Reproduction
Annotation Tools at BeeBase

• Multiple genome browsers - scaffold coordinates,
  chromosome coordinates, different assemblies

• Special set of BLAST databases
   – scaffolds, contigs, unscaffolded contigs,
     unassembled reads and repeat reads
   – All gene prediction sets
   – Special PSI-BLAST database - NCBI non-
     redundant protein database plus Official Honey
     Bee Predicted Gene Set and ab initio gene
 Community Annotation Process

• Participants submitted annotations online to
  an annotation database at Baylor HGSC
• Data was transferred to BeeBase
• BeeBase reviews gene models and
  incorporates them into the Official Gene Set
Honey Bee Manual Annotation Outcome
          (61 annotators)
Nasonia Community Annotation (in progress)
   • Two methods:
   • 1. Participants use various software (online BLAST,
     cut-and-paste) and submit annotations on Baylor
     HGSC manual annotation website
      – Data will be transferred to NasoniaBase
   • 2. Participants use Apollo Annotation Software
     installed on their computer, and submit annotation as
     XML file to NasoniaBase
      – Users’ Apollo client connects directly to
        NasoniaBase to retrieve gene evidence
      – This method significantly reduces work in
        processing submitted annotations at
    The Apollo Genome Annotation Tool
    developed at UC Berkeley (S. Lewis)
•   A graphical tool that allows you to view genome feature details
    and edit/create gene models
•   Advantages of using a graphical annotation tool over BLAST
    cut and paste methods of annotation
     – Special feature viewing interfaces including genome viewer
        and exon viewer
     – Splice site modeling
     – Start and stop codon detection
     – Keeping track of splice variants at a single gene locus
     – Special menus for saving extra information, such as
        functional description
     – Saving results in different formats (Fasta sequence, XML
        for upload to database, GFF coordinates)
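
For readers unfamiliar with GFF coordinates, the sketch below writes one gene model as tab-delimited rows in the nine-column GFF3 layout (seqid, source, type, start, end, score, strand, phase, attributes). It is a generic illustration, not Apollo's exporter, and Apollo's own GFF output may differ in detail; the function name, identifiers, and the "-RA" transcript suffix are assumptions.

```python
def gene_model_to_gff3(gene_id, seqid, strand, exons, source="manual_annotation"):
    """Format a single-transcript gene model as GFF3-style lines.

    exons: list of (start, end) pairs, 1-based inclusive.
    """
    exons = sorted(exons)
    start, end = exons[0][0], exons[-1][1]
    mrna_id = f"{gene_id}-RA"  # hypothetical transcript naming scheme
    rows = [
        (seqid, source, "gene", start, end, ".", strand, ".",
         f"ID={gene_id}"),
        (seqid, source, "mRNA", start, end, ".", strand, ".",
         f"ID={mrna_id};Parent={gene_id}"),
    ]
    for number, (exon_start, exon_end) in enumerate(exons, 1):
        rows.append((seqid, source, "exon", exon_start, exon_end, ".", strand, ".",
                     f"ID={mrna_id}.exon{number};Parent={mrna_id}"))
    return ["\t".join(str(field) for field in row) for row in rows]


lines = gene_model_to_gff3("GENE0001", "scaffold_12", "+",
                           [(400, 520), (100, 250)])
print("##gff-version 3")
print("\n".join(lines))
```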
           Loading Data into Apollo

• Apollo (installed on user’s computer) can load flatfiles
  stored on your computer
   – Formatting files can be problematic if the user
     does not have bioinformatics expertise
• or
• Apollo installed on user’s computer acts as a client to
  connect to a central database that has been set up to
  host the genome data and gene prediction evidence
   – File formatting by the user is not required
Apollo for Community Annotation
•   One lab (with bioinformatics expertise) sets up a “Chado”
    Postgresql database (the GMOD database schema)
     – Loads the database with reference sequence, gene
       prediction data, homolog and EST alignments
     – Creates Apollo configuration files specific to the data for that
•   Community members (not needing bioinformatics expertise)
     – download the conf files
     – use Apollo to connect to the database over the internet
     – manually correct gene predictions or create new gene models
     – submit manual annotation data back to the central database
•   The central database collects and organizes the manual
    annotation data

[Diagram: community annotation system. Columns: Curators, Data Storage, WWW Front End.
 Labels shown: raw data (flat files: genome assemblies, automated gene calls, protein
 homologs, ESTs/cDNAs) loaded into databases; Incoming Chado (PostgreSQL) holding
 community gene annotations; Main Chado (PostgreSQL); Gbrowse DB (MySQL); genome browser;
 gene annotation submission website (CGI, HTML, MySQL, Chado XML) for submission and
 correction of annotations; Apollo; XORT; GMODTools; curators manually merge gene ...
 every 6 months; non-redundant gene calls submitted to NCBI; Nasonia research community.
 Legend: software, community annotation system, data exchange.]

      Honey Bee Superscaffolds
• Improved chromosome models produced by
  research community member, Hugh
  Robertson (University of Illinois)
• Chromosomes 12, 13, 14, 15, 16
• Manually filled in gaps between assigned
  –   Unassigned scaffolds
  –   Unscaffolded contigs
  –   454 genomic reads
  –   Long PCR
• BeeBase used submitted data to create new
  chromosome sequence fasta files, AGP file
  and Gbrowse
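
The step of turning an ordered list of scaffolds into a chromosome sequence and an accompanying AGP file can be sketched as follows. This is an illustration, not the BeeBase pipeline: the function name, the fixed gap length, and the default gap type, linkage, and evidence values are placeholders, and the column layout is intended to follow the NCBI AGP specification but should be checked against it before real use.

```python
def build_superscaffold(name, parts, gap_length=100, gap_type="scaffold",
                        linkage="yes", evidence="unspecified"):
    """Join ordered components into one sequence and AGP-style rows.

    parts: ordered list of (component_id, sequence, orientation), where
           orientation is "+" or "-".
    Returns (sequence, rows); each row is a 9-field tuple.
    """
    complement = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    pieces, rows = [], []
    position = 1  # 1-based start of the next part on the chromosome
    for component_id, seq, orientation in parts:
        if pieces:
            # Gap row between consecutive components.
            rows.append((name, position, position + gap_length - 1,
                         len(rows) + 1, "N", gap_length, gap_type,
                         linkage, evidence))
            pieces.append("N" * gap_length)
            position += gap_length
        # Reverse-complement components placed on the minus strand.
        oriented = seq if orientation == "+" else seq.translate(complement)[::-1]
        rows.append((name, position, position + len(seq) - 1,
                     len(rows) + 1, "W", component_id, 1, len(seq), orientation))
        pieces.append(oriented)
        position += len(seq)
    return "".join(pieces), rows


sequence, rows = build_superscaffold(
    "chr15",
    [("scaffold_A", "ACGTAC", "+"), ("contig_B", "GGGTT", "-")],
    gap_length=4,
)
print(f">chr15\n{sequence}")
for row in rows:
    print("\t".join(str(field) for field in row))
```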




• (these sites are linked to each other)
•   Current Elsik Lab
     – Justin Reese
     – Chris Childers
     – Darren Hagen
•   Previous Elsik Lab members (BeeBase)
     – Kevin Childs
     – Natalia Milshina
•   Nasonia Genome Sequencing Consortium
•   Honey Bee Genome Sequencing Consortium
•   Baylor HGSC
     – Stephen Richards
     – Kim Worley
•   Funding
     – USDA NRI
     – USDA ARS
     – NIH NHGRI
     – Sioux Honey
     – Golden Heritage Foods
