Bovine Genome Annotation Workshop
January 15, 2007, San Diego

Rough Agenda for Workshop
• Melissa Landrum - Resources at NCBI
• Chris Childers, TAMU - Manual annotation tutorial using tools familiar to biologists
• Chris Elsik, TAMU - Organization of community annotation, submission website, tools and data at BovineGenome.org (to be continued after the Apollo talk)
• Coffee Break 3:00-3:30
• Lynn Crosby, Harvard - The Apollo Annotation Editor
• Chris Elsik - finish up the talk that was started before the coffee break

Outline
• Organization of community annotation
• Rules
• Signing Up
• Objectives
• The Annotation Submission Site
• Chromosome Coordinate Systems
• Resources at BovineGenome.org
• Bovine Data for Apollo

Bovine Annotation Themes and Leaders
• Muscle - Jim Reecy
• Immune function - Ross Tellam, Loren Skow
• Lactation - Tad Sonstegard, Monique Rijnkels
• Energy partitioning, metabolism, rumen function - Steve Moore
• Reproduction, endocrinology, sex determination, development - Mike Roberts
• Imprinted genes, HDACs, methyl transferases - Randy Jirtle
• Bovine models of human diseases - Frank Nicholas
• Non-coding RNA - Brian Dalrymple
• Behavior, maternal nurturing - Clare Gill
• Prion protein - John Williams
• Adipose - Diane Spurlock-Moody
• Blood/Skin - Fernando Garcia
• All Other - Chris Elsik

Rules Related to Cooperation
• When different participants wish to work on the same area, theme leaders will coordinate, and participants will cooperate for joint publication.
• Theme leaders will make sure all contributors are listed as authors.
• When two people submit conflicting gene models, the BovineGenome.org annotator will resolve the conflict.
• Submitted gene models may be modified by the annotator at BovineGenome.org. Modified models will be made available on the website, so that submitters who disagree may politely dispute the modification via email.
• Submitters will follow gene model submission and publication deadlines.

Rules Related to Data
• All annotations will be based on the current assembly (3.1).
• The ONLY exceptions will be genes that cross scaffolds, which may be affected by upcoming assembly changes, or genes that are not found in the assembly.
• Gene models that are annotated using additional sequence data, not found in the assembly, will not be incorporated into the official gene set until there is a future assembly that does contain the gene.
• However, these genes may be submitted as tentative official gene models.
• Gene models must be submitted to the website (not emailed).

How To Sign Up (2 steps)
• 1. Sign up on Baylor’s Bovine Genome Listserv: http://listserv-public.bcm.tmc.edu/archives/bovine-genome.html
• 2. Register at the annotation submission site at BovineGenome.org (not live yet)

Manual Gene Annotation: Objectives
• Check consensus gene models for correct gene structure
• Identify consensus genes that should be split or merged
• Correct 5’ and 3’ ends (missing start and termination codons)
• Annotate splice variants
• Identify diverged gene family members, using the special PSI-BLAST database at BovineGenome.org, which combines the NCBI NR protein database with bovine predicted and ab initio proteins
• Identify novel genes not included in the consensus gene set
• Annotate genes that cross scaffolds
• Functional annotation (homolog description and ids, phylogenetic analysis to identify orthologs)

Objectives (more specifically)
• There will be one Bovine Official Gene Set (OGS) and 6 preliminary gene sets (NCBI, Ensembl, Fgenesh, Fgenesh++, Geneid, SGP).
• The first step is to check whether your gene is in the OGS, then check whether the OGS gene model is correct.
• Gene models from the preliminary gene sets may be used to replace the OGS gene model.
• Or any of these gene models may be modified.
• Indicate in “Data Source” which dataset the gene model you started with came from.
• If your gene is not found in any gene set, but is found in the assembly using BLAST, you may create the gene model from scratch.

The Annotation Submission Site
• A preliminary gene model id will automatically be assigned to your submission.
• Clicking on a field name will allow you to see a description of that field.
• There are a few required fields: data source, chromosome or ChrUn scaffold id, and assembly (if not 3.1).
• If you are changing the sequence of a gene model, the minimum sequence information required is the protein sequence.
• Tabs will be stripped out of all fields, so don’t depend on them.
• You may view and edit your submissions.
• Other users cannot edit your submissions.
• User information is searchable so leaders may contact their group members.

[Screenshots: Top half of submission page; Bottom half of submission page; Top half of edit page]

Coordinate Systems
• There are currently two coordinate systems.
• Terminology for these systems used by NCBI, Baylor, Ensembl, and BovineGenome.org may be different.
• At BovineGenome.org:
– Scaffolds = ChrUn (unassigned) and assigned sequence segments equivalent in the assembly process to the ChrUn sequences. These are the segments that may be ordered and assigned to chromosomes.
– Chromosomes = Chr1-29 and ChrX. These are concatenated scaffolds that represent an entire chromosome.
• ChrUn are always on the scaffold coordinate system (because they have not been assigned to chromosomes).

Coordinate Systems (Continued)
• The BovineGenome.org BLAST site has databases for both coordinate systems.
• The Genome Browser is currently using the chromosome coordinate system for all chromosomes, except ChrUn (always the scaffold system).
• We will provide another browser with the scaffold coordinate system for all scaffolds (not just ChrUn).
• Data for Apollo will always be on the scaffold coordinate system.
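
The relationship between the two coordinate systems can be illustrated with a minimal Python sketch. It assumes a layout table that records where each assigned scaffold sits on its chromosome; the `Placement` record, the scaffold names, the offsets, and the orientation flag are all hypothetical and are not a description of the actual BovineGenome.org data files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """Where one scaffold sits on a chromosome (hypothetical layout record)."""
    chrom: str
    offset: int            # number of chromosome bases before the scaffold's first base
    length: int            # scaffold length in bases
    reverse: bool = False  # True if the scaffold was placed reverse-complemented


def scaffold_to_chrom(placements, scaffold, pos):
    """Convert a 1-based scaffold position to (sequence name, 1-based position).

    Scaffolds without a placement (e.g. ChrUn) stay on the scaffold system.
    """
    p = placements.get(scaffold)
    if p is None:
        return scaffold, pos
    if not 1 <= pos <= p.length:
        raise ValueError(f"position {pos} is outside {scaffold} (length {p.length})")
    if p.reverse:
        return p.chrom, p.offset + (p.length - pos + 1)
    return p.chrom, p.offset + pos


# Made-up layout: two scaffolds concatenated on Chr1 with a 100-base gap.
PLACEMENTS = {
    "ScaffoldA": Placement("Chr1", 0, 1000),
    "ScaffoldB": Placement("Chr1", 1100, 500, reverse=True),
}

print(scaffold_to_chrom(PLACEMENTS, "ScaffoldA", 10))   # ('Chr1', 10)
print(scaffold_to_chrom(PLACEMENTS, "ScaffoldB", 1))    # ('Chr1', 1600)
print(scaffold_to_chrom(PLACEMENTS, "ChrUn.example", 42))  # unchanged
```

Because Apollo data stay on the scaffold system while the main Genome Browser uses chromosome coordinates, a conversion of this kind is what a user would need when comparing positions between the two tools.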

Resources that will be available at BovineGenome.org
• Two genome browsers - chromosome and scaffold coordinate systems
• BLAST site with assembly sequences (chromosome and scaffold sequences), all gene prediction sets, and special requests
• Connection of BLAST results to genome browser
• PSI-BLAST site to identify diverged family members
• FTP site to download Apollo formatted data
• The following slides have examples from BeeBase

[Screenshots: BeeBase - Link from BLAST hit to Gbrowse; BeeBase - Gbrowse with predicted genes and BLAST hits; BeeBase - Gbrowse tracks on assembly 2 scaffolds]

Gbrowse Things
• EST direction:
– The track label will indicate whether arrows designate coding direction or strand of alignment
• Spliced Features - for spliced EST tracks, arrows always indicate coding direction. Splice sites are modeled.
• BLAST or BLAT alignment tracks - gaps between segments roughly indicate introns, but splices are not modeled (splice sites are likely incorrect). This also applies to your BLAST searches viewed in Gbrowse.

Splice Sites
• Apollo helps you find splice sites and start/stop codons
• Gbrowse does not help you find splice sites unless the track specifically models splice sites (predicted CDS and transcript tracks; exonerate and splign alignments)
• BLAST hits do not always reveal correct splice sites
– The ends of BLAST HSPs can be fuzzy - they may extend slightly past the real homology, or stop short

Bovine Data for Apollo
• Bovine data for Apollo will be provided on the BovineGenome.org FTP site:
– Configuration file
– GAME-XML with OGS and preliminary gene sets
– BLAST searches of NCBI NR protein database and ESTs

A Strategy for Apollo
• 1. Search bovine scaffolds with a protein homolog of your gene of interest using TBLASTN at the BovineGenome.org BLAST site.
• 2. Search bovine predicted proteins with your protein homolog using BLASTP to identify the homologous bovine OGS gene model (or gene model from an input gene set).
• 3. Download the Apollo XML and BLAST output file for the correct scaffold from our FTP site.
• 4. Load the data into Apollo and locate the gene model from step 2.
• View all sources of evidence to evaluate and modify the gene model.
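
Step 1 of the strategy ends with a list of TBLASTN hits from which the correct scaffold has to be picked. The Python sketch below shows one way to do that, assuming the hits were saved in the standard 12-column tab-delimited BLAST format. The `best_scaffold` helper, the query name, and the scaffold names in the example are hypothetical; the output format offered by the BovineGenome.org BLAST site is not stated here.

```python
import csv
import io

# Standard 12 columns of tab-delimited BLAST output.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_scaffold(tabular_text):
    """Return (scaffold, start, end, strand) for the top-scoring subject, or None.

    The scaffold is chosen by the single highest-scoring HSP; start and end
    span every HSP reported on that scaffold.
    """
    hsps = []
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        rec = dict(zip(COLUMNS, row))
        hsps.append((rec["sseqid"], int(rec["sstart"]), int(rec["send"]),
                     float(rec["bitscore"])))
    if not hsps:
        return None
    top = max(hsps, key=lambda h: h[3])
    coords = [c for h in hsps if h[0] == top[0] for c in (h[1], h[2])]
    # In TBLASTN output, sstart > send means the hit is on the minus strand.
    strand = "-" if top[1] > top[2] else "+"
    return top[0], min(coords), max(coords), strand


# Made-up hits for illustration only.
EXAMPLE = (
    "myprot\tChrUn.example1\t85.0\t100\t15\t0\t1\t100\t5300\t5001\t1e-50\t200\n"
    "myprot\tChrUn.example1\t80.0\t50\t10\t0\t101\t150\t4800\t4651\t1e-20\t90\n"
    "myprot\tChrUn.example2\t40.0\t60\t36\t0\t20\t79\t100\t279\t1e-5\t45\n"
)

print(best_scaffold(EXAMPLE))  # ('ChrUn.example1', 4651, 5300, '-')
```

The reported span is only a rough guide to where the gene lies: HSP ends can be fuzzy, and widely separated HSPs on one scaffold may belong to different family members, so the gene model itself should still be evaluated against all evidence in Apollo.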