The challenge of annotating a complete eukaryotic genome:
A case study in Drosophila melanogaster
      Martin G. Reese (mgreese@lbl.gov)
       Nomi L. Harris (nlharris@lbl.gov)
   George Hartzell (hartzell@cs.berkeley.edu)
  Suzanna E. Lewis (suzi@fruitfly.berkeley.edu)

              Drosophila Genome Center
       Department of Molecular and Cell Biology
              539 Life Sciences Addition
          University of California, Berkeley

                                           Reese et al., Tutorial #3, ISMB ‘99
            Abstract
Many of the technical issues involved in sequencing complete genomes are essentially solved.
Technologies already exist that provide sufficient solutions for ascertaining sequencing error rates
and for assembling sequence data. Currently, however, standards or rules for the annotation
process are still an outstanding problem.

How shall the genomes be annotated, what shall be annotated, which computational tools are most
effective, how reliable are these annotations, how organism-specific do the tools have to be and
ultimately how should the computational results be presented to the community? All these
questions are unsolved. This tutorial will give an overview and assessment of the current state of
annotation based upon experiences gained at the Drosophila melanogaster genome project.

In the tutorial we will do three things. First, we will break down the annotation process and discuss
the various aspects of the problem. This will serve to clarify the term "annotation", which is often
used to collectively describe a process that has a number of discrete steps. Second, with the
participation of computational biologists from the community we will compare existing tools for
sequence annotation. We will do this by providing a 3 megabase sequence that has already been
well-characterized at our center as a testbed for evaluating other feature-finding algorithms. This is
similar to what has been done at the CASP (critical assessment of techniques for protein structure
prediction) conferences (http://predictioncenter.llnl.gov) for protein structure prediction. Third, we
will discuss which annotation problems are essentially solved and which problems remain.




             Tutorial goals


   Review the algorithms currently used in annotation

   Assess existing methods under “field” conditions

   Identify open issues in annotation




          Tutorial organization
   Definitions
   Annotation
     “Biological” issues
     “Engineering”  issues
     Application of tools within an existing annotation system

   Break (20 minutes)
   Review of existing tools
   Our annotation experiment
   Conclusions and outstanding issues

           What is a gene?


   Definition: An inheritable trait associated with a
    region of DNA that codes for a polypeptide chain
    or specifies an RNA molecule, which in turn
    influences some characteristic phenotype of
    the organism.




          What are annotations?


   Definition: Features on the genome derived
    through the transformation of raw genomic
    sequences into information by integrating
    computational tools, auxiliary biological data, and
    biological knowledge.




            How does an annotation differ
            from a gene?

   Many annotations are the same as ‘genes’
     The annotation describes an inheritable trait associated
      with a region of DNA.


   But an annotation may not always correspond in
    this way, e.g. an STS or a sequence overlap
     The annotated region of genomic DNA or RNA is not
      translated or transcribed



Transcription and translation




            Schematic gene structure
 DNA:
 Promoter       Exon 1              Exon 2                                   Exon 3
          TSS
                         Intron 1                   Intron 2
                   ATG                                                               TAA
                         GT   AG              GT                  AG


transcription
                Exon 1              Exon 2                                   Exon 3
                         Intron 1                    Intron 2
preRNA:            ATG GT     AG              GT                  AG                 TAA

    splicing

                    5'UTR                 ORF                     3'UTR      polyA
mRNA:                     ATG                                 TAA          AAAAAAAAA

    translation
                                     [cleavage product]
 primary
                          ATG                                  TAA
 translation:                                                          amino acid sequence
                          MPYCPLTW            ..............GFL
   modification                     [glycosylation site]

  active protein:
                                CPLTW               ......G
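
The splicing and translation steps of the schematic can be sketched in a few lines of Python. This is a toy illustration only: the sequence, the exon coordinates, and the function names are made up for the example, the sequence is written with T rather than U as on the slide, and post-translational cleavage and modification are not modeled.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def splice(pre_rna, exons):
    """Join exon intervals (0-based, half-open) and drop the introns."""
    return "".join(pre_rna[start:end] for start, end in exons)


def translate(mrna):
    """Translate from the first ATG up to the first in-frame stop codon."""
    start = mrna.find("ATG")
    if start < 0:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Toy pre-RNA: exon 1, an intron with GT...AG boundaries, exon 2.
exon1, intron, exon2 = "GGATGCCGT", "GTAAGTTTTCAG", "ACTGTTAACC"
pre_rna = exon1 + intron + exon2
mrna = splice(pre_rna, [(0, 9), (21, 31)])
protein = translate(mrna)
print(mrna, protein)
```

Everything before the first ATG plays the role of the 5' UTR here, and everything after the stop codon the 3' UTR.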
               Sequence feature types
   Transcribed region
       mRNA, tRNA, snoRNA, snRNA, rRNA
   Structural region
       Exon, intron, 5’ UTR, 3’ UTR, ORF, cleavage product
       Mutations: insertion, deletion, substitution, inversion, translocation
   Functional or signal region
       Promoter, enhancer, DNA/RNA binding site, splice site signal,
        poly-adenylation signal
       Protein processing: glycosylation, methylation, phosphorylation site
   Similarity
       Homolog, paralog, genomic overlap (syntenic region)
   Other feature types
       Transposable element, repetitive element
       Pseudogene
       STS, insertion site

             DNA transcription unit features
   Promoter elements
     Core promoter elements
        TATA box
        Initiator (Inr)
        Downstream promoter element (DPE)

     Transcription factor (“TF”) binding sites
        CAAT boxes
        GC boxes
        SP-1 sites
        GAGA boxes

     Enhancer site(s)




               mRNA features
   Exon
        Initial, internal, terminal
              Codon usage, preference
              Control elements (e.g. splice enhancers)
   Intron
        5’ splice site (“GT”), branchpoint (lariat), 3’ splice site (“AG”)
        Repeat elements
   Start codon (translation start site)
        “Kozak” rule
   UTR (untranslated regions)
        5’ UTR
              Translation regulatory elements
              RNA binding sites
        3’ UTR
              RNA binding sites (cis-acting elements)
   Stop codon
   Poly-adenylation signal and site
   RNA destabilization signal


            Definitions for data modeling
   Feature: An interval or an ordered set of intervals on a
    sequence that describes some biological attribute and is
    justified by evidence.
   Sequence: A linear molecule of DNA, RNA or amino
    acids.
   Evidence: A computational or experimental result
    coming out of an analysis of a sequence
   Annotation: A set of features
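
A minimal Python sketch of these four definitions. The class names, field names, and the 0-based half-open coordinate convention are assumptions made for illustration; this is not the BDGP schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Evidence:
    """A computational or experimental result from analysing a sequence."""
    source: str   # hypothetical label, e.g. the analysis that produced it
    score: float


@dataclass
class Feature:
    """An ordered set of intervals on a sequence, justified by evidence."""
    kind: str                             # e.g. "exon" or "mRNA"
    intervals: List[Tuple[int, int]]      # 0-based, half-open (start, end)
    evidence: List[Evidence] = field(default_factory=list)

    def span(self) -> Tuple[int, int]:
        """Smallest single interval that covers every part of the feature."""
        return (min(s for s, _ in self.intervals),
                max(e for _, e in self.intervals))

    def length(self) -> int:
        """Number of residues covered, ignoring gaps between the parts."""
        return sum(e - s for s, e in self.intervals)


@dataclass
class Annotation:
    """A set of features on one sequence."""
    sequence_id: str
    features: List[Feature] = field(default_factory=list)


mrna = Feature("mRNA", [(100, 250), (400, 520)],
               [Evidence("cDNA alignment", 0.99)])
annotation = Annotation("clone_1", [mrna])
print(mrna.span(), mrna.length())
```

Keeping the evidence attached to each feature is what lets a curator later ask why a feature exists.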


                             Annotation

   [Figure: depth of knowledge (vertical axis) plotted against breadth of
    knowledge (horizontal axis)]
      Detailed analysis (typically biological) of single genes
      Large-scale analysis (typically computational) of entire genome
      Annotated genome
           Annotation process overview

           Data                                   Methods

 Genome           Auxiliary              Computational          Database
Sequence           Data                     Tools               Resources




                          Annotation Systems




                      Understanding of a Genome




         Types of sequence data
   Chromosomal sequence
     Euchromatic
     Heterochromatic
   mRNA sequences
     Full-length cDNA
     5’ EST
     3’ EST
   Protein sequences
   Insertion site flanking sequences




           Auxiliary data
   Maps
     Genetic, physical, radiation hybrid (RH), deletion,
      cytogenetic

   Expression data
     Tissue, stage

   Phenotypes
     Lethality, sterility




          Computational annotation tools
   Gene finding
   Repeat finding
   EST/cDNA alignment
   Homology searching
     BLAST, FASTA, HMM-based methods, etc.
   Protein family searching
     PFAM, Prosite, etc.




           Database resources
   Curated sequence feature data sets
     Repeat elements
     Transposons
     Non-redundant mRNA
     STSs and other sequence markers

   Genome sequence from related species
     D. melanogaster vs. D. virilis, D. hydei
   Genome sequence from more distant species
   Protein sequences from distant species


           Biological issues in annotation
   Common
     Genes within genes
     Alternative splicing
     Alternative poly-adenylation sites

   Rare
     Translational frame shifting
     mRNA editing
     Eukaryotic operons
     Alternative initiation




               Engineering issues in annotation
   What sequence to start with?
       Because features are intervals on a sequence, problems can be caused by
        gaps, frameshifts, and other changes to the sequence. How do you track
        these changes over time and model features that span gaps?
   When to annotate?
       Feature identification can aid in sequencing. It may be advisable to carry
        out sequencing and annotation in parallel thus enabling them to
        complement one another.
   What analyses need to be run and how?
       What dependencies are there between various analysis programs?
       What parameter settings to use?



                 Engineering issues in annotation
   What public sequence data sets are needed?
       What are the mechanics of obtaining public sequence databases?
       Are curated data sets available or do you need to set up a means of
        maintaining your own (for repeats, insertions, organism of interest)?
   How do you achieve computational throughput?
       Workstation farm, or simply a big, powerful box?
       Job flow control
   What do you do with the results?
       Homogenize results into single format?
       Filter results for significance and redundancy




               Engineering issues in annotation
   Interpreting the results
       Is human curation needed?
       How can you achieve consistency between curators?
       How do you design the user interface so that it is simple enough to get the
        task completed speedily but complex enough to deal with biology?
       How do you capture curations?


   How are annotation translations to be described?
       EC terminology
       ProSite families
       Pfam domains
       Is function distinguishable from process?


               Engineering issues in annotation
   How do you manage data?
       What is the appropriate database schema design?
       How is the database to be kept up to date? Will it be directly from
        programs running user interfaces and analyses or via a middleware layer?
       Is a flat file format needed and what should it be?
       What query and retrieval support is needed?


   How do you distribute data?
       For bulk downloads what is the format of the data?
       What information is best summarized in tables?
       What information requires an integrated graphical view?



           Engineering issues in annotation
   How do you update the annotations?
       How frequently are they re-evaluated?
       How can re-evaluation be minimized (only subsets of the
        databanks, only modified sequences)?
       How can differences between old and new computational results
        be detected?
       Changes in computational results may need to trigger changes in
        curated annotations
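
One simple way to detect differences between an old and a new computational run is a set comparison, sketched below in Python. The representation of a result as a (program, start, end) tuple and the function name are assumptions for illustration; a real system would also have to cope with coordinate changes in the underlying sequence.

```python
def diff_results(old, new):
    """Compare two runs of the same analyses on the same sequence.

    Each run is an iterable of hashable hits, here (program, start, end).
    """
    old, new = set(old), set(new)
    return {
        "added": sorted(new - old),
        "removed": sorted(old - new),
        "unchanged": sorted(old & new),
    }


old_run = [("blastx", 100, 400), ("genscan", 50, 900)]
new_run = [("blastx", 100, 400), ("genscan", 50, 950), ("sim4", 60, 880)]

changes = diff_results(old_run, new_run)
# Any added or removed result could flag the curated annotation for review.
needs_review = bool(changes["added"] or changes["removed"])
print(changes, needs_review)
```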




            Drosophila melanogaster
   Drosophila is the most important model organism*

   Drosophila genome:
    4 chromosomes
     180 Mb total sequence
     140 Mb euchromatic sequence
     12,000-14,000 genes




     * source: G.M. Rubin
             Drosophila Genome Project
   Laboratories working on Drosophila sequencing:
       BDGP (Berkeley Drosophila Genome Project)
       EDGP (European Drosophila Genome Project)
       Celera Genomics Inc.


   “Complete” D. melanogaster sequence will be
    finished by the end of 1999

   Comprehensive database - FlyBase



           Goals of the Drosophila Genome
           Project

   Complete genome sequence
   Structure of all transcripts
   Expression pattern of all genes
   Phenotype resulting from mutation of all ORFs
   And more...




           Sequencing at the BDGP
   Genomic sequence
     P1 and BAC clones
     24 Mb of completed sequence (as of July 22, 1999)
     18 Mb unfinished sequence in process

   Complete tiling path in BACs
     1.5x-path draft sequencing
   ESTs and cDNAs
     80,942 ESTs finished (as of March 19, 1999)
     Over 800 full-length cDNAs




The BDGP sequence annotation
process




             What sequence to start with?

   Unit of sequencing at the BDGP
       Completed high-quality clone sequences


   Reassembling the genomic sequence
       Need to place clones in correct genomic positions
       Need to integrate genes that span multiple clones
       Solved by using genomic overlaps to reconstitute full genomic sequence




            Which analyses need to be run?
   Similarity searches
     BLAST (Altschul et al., 1990)
         BLASTN (nucleotide query vs. nucleotide databases)
         BLASTX (six-frame translated query vs. amino acid databases)
         TBLASTX (six-frame translated query vs. six-frame translated
          nucleotide databases)

     sim4 (Miller et al., 1998)
           Sequence alignment program for finding near-perfect matches
            between nucleotide sequences containing introns
   Gene predictors
     Genefinder (Green, unpublished)
     GenScan (Burge and Karlin, 1997)
     Genie (Reese et al., 1997)
   Other analyses
     tRNAscan-SE (Lowe and Eddy, 1996)
             Which analyses need to be run
             and how?

   mRNAs
     ORFFinder (Frise, unpublished)
   Protein translations
       HMMPFAM 2.1 (Eddy 1998) against PFAM (v 2.1.1 Sonnhammer
        et al. 1997, Bateman et al. 1999)
       Ppsearch (Fuchs 1994) against ProSite (release 15.0) filtered with
        EMOTIF (Nevill-Manning et al. 1998)
       Psort II (Horton and Nakai 1997)
       ClustalW (Higgins et al. 1996)



               What public sequence data sets are
               needed?

   Automating updates of public databases:
       GenBank, SwissProt, TrEMBL, BLOCKS, dbEST, EDGP
   Curated data sets
       D. melanogaster genes (FlyBase)
       Transposable elements (EDGP)
       Repeat elements (EDGP)
       STSs (BDGP)




Which analyses need to be run
and how?




              How do you achieve
              computational throughput?
   BDGP computing power
       Sun Ultra 450 (3 machines, 4 processors each)
       Sun Enterprise (1 machine, 8 processors)
       Used these directly, without any system for distributed computing.
   Job flow control: the Genomic Daemon
       Automatic batch analysis of genomic clones
       Berkeley Fly Database is used for queuing system and storage of results
       Many clones can be analyzed simultaneously
       Results are processed and saved in XML format for interactive browsing




               What do you do with the results?
   Berkeley Output Parser (BOP)
     Input to BOP:
        Genomic sequence
        Results of computational analyses
        Filtering preferences
     Parses results from BLAST, sim4, GeneFinder, GenScan, and
      tRNAscan-SE analyses
     Filters BLAST and sim4 results
        Eliminates redundant or insignificant hits
        Merges hits that represent a single region of homology
     Homogenizes results into a single format
     Output: sequence and filtered results in XML format
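
The filtering and merging steps can be illustrated with a short Python sketch. It is not the BOP implementation: the `Hit` record, the E-value cutoff, and the rule that overlapping hits to the same database sequence form one region of homology are all assumptions chosen for the example.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    subject: str    # name of the database sequence that was hit
    start: int      # 0-based start on the genomic sequence
    end: int        # exclusive end on the genomic sequence
    evalue: float   # expectation value reported by the search program


def filter_and_merge(hits: List[Hit], max_evalue: float = 1e-5,
                     max_gap: int = 0) -> List[Hit]:
    """Drop weak hits, then merge overlapping hits to the same subject."""
    kept = sorted((h for h in hits if h.evalue <= max_evalue),
                  key=lambda h: (h.subject, h.start, h.end))
    merged: List[Hit] = []
    for h in kept:
        last = merged[-1] if merged else None
        if (last is not None and last.subject == h.subject
                and h.start <= last.end + max_gap):
            # Same region of homology: extend it and keep the better E-value.
            merged[-1] = Hit(last.subject, last.start,
                             max(last.end, h.end),
                             min(last.evalue, h.evalue))
        else:
            merged.append(h)
    return merged


hits = [
    Hit("proteinA", 100, 200, 1e-20),
    Hit("proteinA", 150, 260, 1e-12),   # overlaps the first hit
    Hit("proteinA", 500, 600, 1e-8),    # separate region
    Hit("proteinB", 120, 180, 0.5),     # not significant
]
result = filter_and_merge(hits)
print(result)
```

Raising `max_gap` would also merge nearby hits that do not strictly overlap, which is one possible reading of "single region of homology".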

             Is human curation needed?

   Not for everything
       Some features are obvious and can be identified computationally
            Known D. melanogaster genes are detected automatically by
             GeneSkimmer
            Repetitive elements


   But still for many things
       Annotating complete gene structure is still hard
       We use CloneCurator (BDGP’s Java graphical editor) for curation




           Gene Skimmer
   Quick way of identifying genes in new sequence before
    curation
   Start with XML output from BOP
   Look for sim4 hits with known Drosophila genes
   Find gene hits with sequence identity >98%,
    coverage >30%
   Verify that hits represent real genes
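
The two thresholds can be expressed as a simple filter, sketched here in Python. The record layout and the way identity and coverage are computed (identical bases over aligned bases, and aligned bases over the length of the known gene sequence) are assumptions for illustration, not the Gene Skimmer code.

```python
def skim_genes(alignments, min_identity=0.98, min_coverage=0.30):
    """Return the names of known genes whose alignments pass both cutoffs.

    Each alignment is a dict with hypothetical keys:
      gene         name of the known gene
      matches      number of identical aligned bases
      aligned      number of aligned bases
      gene_length  length of the known gene sequence
    """
    found = []
    for a in alignments:
        identity = a["matches"] / a["aligned"]
        coverage = a["aligned"] / a["gene_length"]
        if identity > min_identity and coverage > min_coverage:
            found.append(a["gene"])
    return found


alignments = [
    {"gene": "geneA", "matches": 990, "aligned": 1000, "gene_length": 2000},
    {"gene": "geneB", "matches": 950, "aligned": 1000, "gene_length": 1200},
    {"gene": "geneC", "matches": 200, "aligned": 200, "gene_length": 1000},
]
skimmed = skim_genes(alignments)
print(skimmed)
```

Here geneB fails on identity (95%) and geneC on coverage (20%); the final verification step on the slide remains a manual check.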




          Gene Skimmer




URL: http://www.fruitfly.org/sequence/genomic-clones.html



           CloneCurator
   Displays computational results and annotations on a
    genomic clone
   Interactive browsing
     Zoom/scroll
     Change cutoffs for display of results
     Analyze GC content, restriction sites, etc.

   Interactive annotation editing
     Expert “endorses” selected results
   Presents annotations to community via Web site


            How do we annotate gene/protein
            function?
   Gene Ontology Project
     Controlled hierarchical vocabulary for multiple-genome
      annotations and comparisons
     Standardized vocabulary facilitates collaboration
     Good data modeling allows better database querying
     Ontology browser provides interactive search of hierarchical
      terms
     “GO” project (http://www.ebi.ac.uk/~ashburn/GO)




Ontology browser




Ontology browser: searching for
terms




         How do you distribute the data?
   Bulk downloads
     FASTA at http://www.fruitfly.org/sequence/download.html
     Curated data sets

   Tabular data
     At http://www.fruitfly.org/sequence/
     Sequenced genomic clones
     Clone contigs sorted by genomic location
     Clone contigs sorted by size

   Ribbon provides integrated graphical view of
    annotations on physical contigs

             Ribbon

   Human curator annotates individual clones (~100Kb)
   Clones are assembled into physical contigs (regions of
    physical map)
   Clone annotations are merged and renumbered for
    display on whole physical contigs
   Ribbon is our Java display tool for displaying curated
    annotations on physical contigs
   Will soon be available on Web
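
The merge-and-renumber step can be sketched as a coordinate shift followed by de-duplication. This Python example is an illustration under simplifying assumptions: every clone has a known offset on the contig, all clones are in the same orientation, and a feature annotated identically on two overlapping clones should appear once. The names and numbers are hypothetical.

```python
def to_contig_coordinates(clone_features, clone_offsets):
    """Shift clone-relative features onto contig coordinates, once each."""
    placed = set()
    for clone, features in clone_features.items():
        offset = clone_offsets[clone]
        for name, start, end in features:
            placed.add((name, start + offset, end + offset))
    return sorted(placed, key=lambda f: (f[1], f[2], f[0]))


# cloneB starts 80 kb into the contig and overlaps the end of cloneA.
clone_offsets = {"cloneA": 0, "cloneB": 80_000}
clone_features = {
    "cloneA": [("gene1", 10_000, 15_000), ("gene2", 85_000, 95_000)],
    "cloneB": [("gene2", 5_000, 15_000), ("gene3", 40_000, 42_000)],
}
contig = to_contig_coordinates(clone_features, clone_offsets)
print(contig)
```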



Ribbon




          How do you manage the data?
   Using Informix as our database server
   Updated via the Perl DBI module (DBI.pm)
   Development underway in
     Schema revisions
     GAME DTD (Genome Annotation Markup Entities)
     Perl module for annotation objects
     http://www.bioxml.org/ (Ewan Birney)




          How do you maintain annotations?

   Open questions
     How frequently are annotations re-evaluated?
     How can re-evaluation be minimized (only subsets of
      the databanks, only modified sequences)?
     How can differences between old and new
      computational results be detected?
     Changes in computational results may need to trigger
      changes in curated annotations




    Integrated annotation systems

   ACeDB
   Genotator
   Magpie
   GAIA
   TIGR




             Integrated annotation systems:
             ACeDB

   Developed for analysis of the C. elegans genome
   Sophisticated database designed for storing annotations
    and related information
   New Java and Web-based versions available
   Written by Jean Thierry-Mieg and Richard Durbin
   http://www.sanger.ac.uk/Software/Acedb/



ACeDB




            Genotator
   Back end automates sequence analysis; browser
    provides interactive viewing and editing of annotations
   Nomi Harris (1997), Genome Research 7(7), 754-762.
   http://www-hgc.lbl.gov/inf/annotation.html




             Magpie


   Expert system based (PROLOG)
     Data collection daemon
     Data analysis and report daemon

   “Intelligent” integration of various individual feature
    prediction systems
   Allows human interactions
   Gaasterland and Sensen (1996), TIG, 12, 76-78.
   http://genomes.rockefeller.edu/magpie/magpie.html

             GAIA


   Web-based system
   Results displayed as Java applets
   Bailey, L.C., J. Schug, S. Fischer, M. Gibson, J.
    Crabtree, D.B. Searls, and G.C. Overton (1998),
    Genome Research.
   http://daphne.humgen.upenn.edu:1024/gaia/




             TIGR Human Gene Index


   Gene Indices for various organisms
   Databases for transcribed genes linked into
    external/internal genomic databases
   Internal backend analysis software
   http://www.tigr.org/tdb/tdb.html




          Computational analysis tools
   Gene finding
   Repeat finding
   EST/cDNA alignment
   Homology searching
     BLAST, FASTA, HMM-based methods, etc.
   Protein family searching
     PFAM, Prosite, etc.




              Gene finding:
                 Prokaryotes vs. Eukaryotes
   Prokaryotes
     Contiguous open reading frames (ORFs)
     Short intergenic sequences
     Good method: detecting large ORFs
     Complications:
        Partial sequences
        Sequencing errors

        Start codon prediction

        Overlapping genes on both strands
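
Detecting large ORFs is simple enough to sketch directly. The Python example below scans all six reading frames for ATG-to-stop stretches of at least `min_codons` codons. It is a toy illustration: the function name and threshold are made up, only the outermost ATG before each stop is reported, and none of the listed complications (partial sequences, sequencing errors, alternative start codons) are handled.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (strand, start, end) for ATG...stop ORFs in all six frames.

    Coordinates are 0-based, half-open, on the strand that was scanned;
    the end includes the stop codon.
    """
    orfs = []
    reverse_complement = seq.translate(COMPLEMENT)[::-1]
    for strand, s in (("+", seq), ("-", reverse_complement)):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    # Count the coding codons, excluding the stop itself.
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3))
                    start = None
    return orfs


orfs = find_orfs("CCATGAAACCCGGGTAACC")
print(orfs)
```

On real prokaryotic sequence the threshold would be far larger than three codons, since short ORFs occur frequently by chance.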




              Gene finding:
                 Prokaryotes vs. Eukaryotes
   Eukaryotes
     Complex gene structures (exons/introns)
        D. melanogaster has an average of 4 introns/gene
        Very long genes (D. melanogaster X gene 160 kb)

        Very long introns

        Many introns

        “Nested”, overlapping, and alternatively spliced genes

        5’ UTRs with non-coding exons

        Long 3’ UTRs

        Complex transcription machinery

     ORF-finding alone is not adequate

         Integrated gene finding
   Assumptions
     Signal and content sensors alone are not
      sufficient for predicting gene structure
     Gene structure is hierarchical
     Each component (exon, intron, splice site, etc.) can be
      modeled independently
   The approach
     Generate a list of candidates for each component (with
      scores)
     Assemble the components into a “gene model”



          Integrated gene finding:
               Dynamic programming
   Determines the best combination of components
   Two-part problem:
     Develop an “optimal” scoring function
     Use dynamic programming to find an “optimal” path
      through the scoring matrix
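
A heavily simplified Python sketch of the dynamic programming step: given scored candidate exons, pick the non-overlapping chain with the highest total score. Real gene finders also enforce reading-frame and splice-site compatibility between neighbours; the additive scoring function, the candidate values, and the function name here are assumptions for illustration only.

```python
from bisect import bisect_right


def best_gene_model(candidates):
    """Pick the highest-scoring chain of non-overlapping candidates.

    candidates: list of (start, end, score), half-open coordinates.
    Returns (total_score, chosen_candidates_in_order).
    """
    cands = sorted(candidates, key=lambda c: c[1])
    ends = [c[1] for c in cands]
    # best[i] holds the optimal (score, chain) over the first i candidates.
    best = [(0.0, [])]
    for i, (start, end, score) in enumerate(cands):
        # Number of earlier candidates that end at or before this start.
        j = bisect_right(ends, start, 0, i)
        take = (best[j][0] + score, best[j][1] + [(start, end, score)])
        skip = best[i]
        best.append(take if take[0] > skip[0] else skip)
    return best[-1]


candidates = [
    (0, 100, 2.0),
    (50, 150, 3.5),     # overlaps both of its neighbours
    (120, 200, 2.5),
    (210, 300, 1.0),
]
total, model = best_gene_model(candidates)
print(total, model)
```

The single high-scoring candidate (3.5) loses to the chain of three compatible candidates (5.5), which is the kind of trade-off that scoring each component in isolation cannot resolve.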




Integrated gene finding:
     Dynamic programming




               Integrated gene finding:
               Linear and Quadratic
               Discriminant Analysis (LDA/QDA)
   LDA
     Deterministic  calculation of thresholds
     n-class discrimination
     Example:
          HSPL, Solovyev et al. (1997), ISMB, 5, 294-302.
   QDA
     Can represent a great improvement over LDA
     Example:
          MZEF, Michael Zhang (1997), PNAS, 94, 565-568.


               Integrated gene finding:
               Feed-forward neural networks
   Supervised learning
   Training to discriminate between several feature classes
   Computing units
   Gradient descent optimization
   Multi-layer networks
   Limitations
       Black-box predictions
       Local minima
   Example:
       GRAIL, Uberbacher et al. (1991), PNAS, 88, 11261-11265.


                Approaches to gene finding:
                Hidden Markov models
   Model
       A finite model describing a probability distribution over all possible sequences of
        equal length
       “Natural” scoring function
       (Conditional) Maximum likelihood “training”
   Markov
       k-order Markov chain: current state dependent on k previous states
       The next state in a 1st-order Markov model depends on current state
   Hidden
       Hidden states generate visible symbols
   Assumptions
       Independence of states
            No long range correlation
   Example: HMMgene, A. Krogh (1998), In Guide to Human Genome
    Computing, 261-274.
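
To make "hidden states generate visible symbols" concrete, here is a minimal first-order HMM with Viterbi decoding in Python. The two states ("n" for non-coding, "c" for coding) and all probabilities are invented toy values, not parameters from HMMgene or any other gene finder.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path (log-space Viterbi)."""
    # score[s] = best log-probability of any path that ends in state s
    score = {s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}
    backpointers = []
    for symbol in observations[1:]:
        new_score, pointers = {}, {}
        for s in states:
            # First-order assumption: only the previous state matters.
            prev = max(states,
                       key=lambda p: score[p] + math.log(trans_p[p][s]))
            new_score[s] = (score[prev] + math.log(trans_p[prev][s])
                            + math.log(emit_p[s][symbol]))
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


states = ("n", "c")
start_p = {"n": 0.5, "c": 0.5}
trans_p = {"n": {"n": 0.9, "c": 0.1}, "c": {"n": 0.1, "c": 0.9}}
emit_p = {
    "n": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},   # AT-rich toy state
    "c": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},   # GC-rich toy state
}
path = "".join(viterbi("AATTGGCC", states, start_p, trans_p, emit_p))
print(path)
```

The decoder switches state once, where the composition changes, because a single transition penalty is cheaper than four poorly explained symbols.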
               Approaches to gene finding:
               Generalized hidden Markov models
   Each HMM state can be a probabilistic sub-model
   Complex hierarchical system
   Requires care in modeling state overlaps
   Examples:
     Genie, Kulp et al. (1996), ISMB, 4, 134-142
     GenScan, Burge and Karlin (1997), JMB, 268(1), 78-94




             Gene finding software
   Signal recognition
       Promoter prediction
       Splice site prediction
       Start codon prediction
       Poly-adenylation site prediction
   Coding potential
   Coding exons
   Gene structure prediction
       Spliced alignment
       LDA/QDA
       Neural networks
       HMMs and GHMMs

                  Promoter recognition
   PromoterScan
       Identify potential promoter regions
       Based on databases of known TF binding sites
            TFD (Ghosh (1991), TIBS, 16, 445-447)
            TRANSFAC (Heinemeyer et al. (1999), NAR, 27, 318-322)
       Prestridge (1995), JMB, 249, 923-932
       http://bimas.dcrt.nih.gov/molbio/proscan/
   MatInd and MatInspector
       Finding consensus matches to known TF binding sites
       Based on TRANSFAC
            Heinemeyer et al. (1999), NAR, 27, 318-322
       Quandt et al. (1995), NAR, 23, 4878-4884.
       http://transfac.gbf.de/TRANSFAC/
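
The simplest form of consensus matching can be sketched with a regular expression built from IUPAC ambiguity codes. This Python example is an illustration only and is not how MatInspector scores sites (it uses matrix descriptions); the consensus TATAWAW is used as a commonly quoted TATA-box pattern, and the test sequence is made up.

```python
import re

# Subset of the IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}


def find_consensus(seq, consensus):
    """Return (position, matched_text) for every match, overlaps included."""
    body = "".join(IUPAC[c] for c in consensus)
    # The lookahead lets overlapping matches be reported as well.
    pattern = re.compile("(?=(" + body + "))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(seq)]


sites = find_consensus("GGCTATAAAAGGC", "TATAWAW")
print(sites)
```

Exact consensus matching produces many false positives on genomic sequence, which is one reason the tools on these slides combine several signals.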

             Promoter recognition (cont.)

   TSSG/TSSW
     LDA-based combination of several features (TATA-box, Inr
      signal, upstream regions)
     Solovyev et al. (1997), ISMB, 5, 294-302.
     http://genomic.sanger.ac.uk/gf/gf.shtml

   Transcription Element Search Software
     Identify TF binding sites
     Based on TRANSFAC
     http://agave.humgen.upenn.edu/tess/index.html




             Promoter recognition (cont.)
   CBS Promoter 2.0 Prediction Server
     Simulated transcription factors
     Principles common to neural networks and genetic algorithms
     Knudsen (1999), Bioinformatics 15(5), 356-361.
     http://genome.cbs.dtu.dk/services/promoter/

   CorePromoter
     Position dependent 5-tuple
     QDA
     Michael Zhang (1998), Genome Research, 8, 319-326.
     http://scislio.cshl.org/genefinder/CPROMOTER/


             Promoter recognition (cont.)


   Neural network promoter prediction (NNPP)
     Time-delay neural network
     Combining TATA box and initiator
     Reese (1999), in preparation.
     http://www-hgc.lbl.gov/projects/promoter.html




Example: NNPP




              Promoter recognition (cont.)
   Markov chain promoter finder
     Competing interpolated Markov chains for promoters, exons, introns
     Promoter model consists of five states representing the core
      promoter parts
     Ohler, Reese et al. (1999), Bioinformatics 15(5), 362-369.




               Splice site prediction
   Nakata, 1985
     Nakata (1985), NAR, 13(14), 5327-5340.
   BCM GeneFinder
     HSPL - Prediction of splice sites in human DNA sequences
     Triplet frequencies in various functional parts of splice site
      regions
     Combined with codon statistics
     Solovyev et al. (1994), NAR, 22(24), 5156-5163.
     http://genomic.sanger.ac.uk/gf/gf.shtml




               Splice site prediction (cont.)
   Neural Network splice site predictor (NNSPLICE)
       Multi-layered feed-forward neural network
       Modeled after Brunak et al. (1991), JMB, 220, 49-65.
       Reese et al. (1997), JCB, 4(3), 311-323.
       http://www-hgc.lbl.gov/projects/splice.html
   NetGene2
       Combination of neural networks and rule-based system
       Splice site signal neural network combined with coding potential
       Hebsgaard et al. (1996), NAR, 24(17), 3439-3452.
       Brunak et al. (1991), JMB, 220, 49-65.
       http://www.cbs.dtu.dk/services/NetGene2/




              Splice site prediction (cont.)
   SplicePredictor
     Logitlinear models for splice site regions
        Degree of matching to the splice site consensus
        Local compositional contrast

     Brendel and Kleffe (1998), NAR, 26(20), 4748-4757.
     http://gnomic.stanford.edu/~volker/SplicePredictor.html




               Start codon prediction

   NetStart
     Trained on cDNA-like sequences
     Neural network based
        Local start codon information
        Global sequence information

     Pedersen and Nielsen (1997), ISMB, 5, 226-233.
     http://www.cbs.dtu.dk/services/NetStart/




              Poly-adenylation signal prediction
   BCM GeneFinder
     POLYAH - Recognition of 3'-end cleavage and poly-adenylation region
     Triplet frequencies in various functional parts in poly-adenylation
      regions
     LDA
     Solovyev et al. (1994), NAR, 22(24), 5156-5163.
     http://genomic.sanger.ac.uk/gf/gf.shtml




                Prediction of coding potential

   Periodicity detection
     Coding sequences have an inherent periodicity of three
     Especially good on long coding sequences
     Auto-correlation
         Seeking the strongest response when shifted sequence is compared
          with original
         Michel (1986), J. Theor. Biol. 120, 223-236.

     Fourier transformation: Spectral analysis
         Detection of a peak at frequency 1/3 (a period of three bases)
         Silverman and Linsker (1986), J. Theor. Biol. 118, 295-300.
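
The spectral-analysis idea can be sketched in Python as follows. This is a minimal illustration, not the method of the cited papers: it assumes a simple encoding in which each base gets its own 0/1 indicator sequence, and it evaluates the Fourier sum directly at frequency 1/3. The function name `period3_power` and the toy sequences are hypothetical.

```python
import cmath


def period3_power(seq):
    """Spectral power at frequency 1/3, summed over the four base
    indicator sequences and normalized by sequence length.

    Assumption: plain indicator encoding; other encodings are possible.
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0
    total = 0.0
    for base in "ACGT":
        # Fourier coefficient of the 0/1 indicator sequence for this base
        coeff = sum(
            cmath.exp(-2j * cmath.pi * i / 3)
            for i, ch in enumerate(seq)
            if ch == base
        )
        total += abs(coeff) ** 2
    return total / n


# Toy inputs (hypothetical): one sequence with a perfect period of three,
# one with a period of four.
periodic = "GCA" * 30
non_periodic = "ACGT" * 22 + "AC"
print(period3_power(periodic), period3_power(non_periodic))
```

A sequence with strong three-base periodicity gives a much larger value than one without it, which is why the measure works best on long coding stretches.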




              Prediction of coding potential
              (cont.)
   Trifonov (1980;1987)
     G-notG-U periodicity
     JMB, 194, 643-652.

   Fickett (1982)
     Position asymmetry in the three codon positions
     NAR 10(17), 5303-5318.

   Staden (1984)
     Codon usage in tables
     NAR 12, 551-567.



               Prediction of coding potential
               (cont.)
   Claverie and Bougueleret (1986)
     Hexamer frequency differentials
     NAR 14, 179-196.

   Fichant and Gautier (1987)
     Codon usage homogeneity
     CABIOS, 3(4), 287-295.

   GRAIL I (1991)
     Neural network using a shifting fixed size window
     7 sensors as input, 2 hidden layers and 1 unit as output
     Uberbacher et al. (1991), PNAS, 88(24), 11261-11265.

           Prediction of coding potential
           (cont.)
   GeneMark (1986)
     Inhomogeneous Markov chain models
     Easily trainable (closed-form solution for maximum likelihood)
     Used extensively in prokaryotic genomes
     Borodovsky et al. (1993), Computers & Chemistry, 17, 123-133.
   Glimmer (1998)
     Interpolated Markov chains from first to eighth order
     Salzberg et al. (1998), NAR, 26(2), 544-548.
     http://www.tigr.org/softlab/glimmer/glimmer.html
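
As a rough illustration of an inhomogeneous (three-periodic) Markov chain and its count-based maximum-likelihood training, here is a minimal Python sketch. It is not the GeneMark implementation: it assumes a first-order chain, a uniform background of 0.25 per base, and add-one pseudocounts, and the function names are hypothetical.

```python
import math


def train_periodic_chain(coding_seqs, pseudocount=1.0):
    """Estimate P(next base | previous base, codon position of next base).

    The maximum-likelihood estimate is just normalized transition counts;
    the pseudocount (an assumption of this sketch) avoids zero probabilities.
    """
    counts = {
        (phase, a, b): pseudocount
        for phase in range(3) for a in "ACGT" for b in "ACGT"
    }
    for seq in coding_seqs:
        seq = seq.upper()
        for i in range(1, len(seq)):
            key = (i % 3, seq[i - 1], seq[i])
            if key in counts:  # skips ambiguous bases such as N
                counts[key] += 1
    probs = {}
    for phase in range(3):
        for a in "ACGT":
            total = sum(counts[(phase, a, b)] for b in "ACGT")
            for b in "ACGT":
                probs[(phase, a, b)] = counts[(phase, a, b)] / total
    return probs


def coding_log_odds(seq, probs, frame=0):
    """Log-likelihood ratio of seq under the coding chain versus a
    uniform background (0.25 per base), read in the given frame."""
    seq = seq.upper()
    score = 0.0
    for i in range(1, len(seq)):
        p = probs.get(((i + frame) % 3, seq[i - 1], seq[i]))
        if p is not None:
            score += math.log(p / 0.25)
    return score


# Toy training data (hypothetical), not real Drosophila coding sequence.
model = train_periodic_chain(["ATGGCC" * 20])
print(coding_log_odds("ATGGCC" * 5, model), coding_log_odds("CCGGTA" * 5, model))
```

Because training reduces to counting, the model can be re-estimated quickly for a new organism; scoring all three frames and both strands would be the natural next step.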


              Prediction of coding potential
              (cont.)

   Review by Fickett (1992)
     “Assessment of protein coding measures”, NAR, 20, 6441-6450.




               Prediction of coding exons
   SorFind
       Detection of “spliceable” ORFs
       Hutchinson, NAR, 20(13), 3453-3462.
   BCM GeneFinder
       FEXD, FEXN, FEXA, FEXY, FEXH, HEXON
       LDA
       Solovyev et al. (1994), NAR, 22(24), 5156-5163.
       http://genomic.sanger.ac.uk/gf/gf.shtml
   GRAIL II
       Exon candidates, heuristic integration, learning with neural network
       Uberbacher et al., Genet. Eng., 16, 241-253.
       http://compbio.ornl.gov/


             “Integrated” gene models:
             LDA/QDA

   FGene
     LDA-based
     Dynamic programming for the integration of LDA output
     Solovyev et al. (1995), ISMB, 3, 367-375.
     http://genomic.sanger.ac.uk/gf/gf.shtml




             “Integrated” gene models: NN


   GeneParser
     “Gene-parsing” approach
     Potential alternative splicing recognized
     Neural network and dynamic programming
     Snyder and Stormo (1995), JMB, 248, 1-18.




             “Integrated” gene models:
             Artificial intelligence approaches
   GeneID
     Rule-based system
     Homology integration
     Guigó et al. (1992), JMB, 226, 141-157.
     http://www1.imim.es/geneid.html

   GeneID using DP
     DP to combine a set of potential exons
     Guigó et al. (1998), JCB, 5, 681-702.
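
The idea of using dynamic programming to combine a set of potential exons can be sketched as a highest-scoring chain of non-overlapping candidates. This is a simplified illustration only: it assumes each candidate is a `(start, end, score)` tuple with inclusive coordinates, and it ignores the reading-frame and intron constraints that a real gene assembler must enforce. The function name `best_exon_chain` is hypothetical.

```python
def best_exon_chain(exons):
    """Return (total_score, chain) for the best chain of non-overlapping
    candidate exons, using a simple O(n^2) dynamic program.

    exons: list of (start, end, score) tuples, inclusive coordinates.
    """
    ordered = sorted(exons, key=lambda e: e[1])
    best = []  # best[i] = (score, chain) of the best chain ending at ordered[i]
    for i, (start, _end, score) in enumerate(ordered):
        top_score, top_chain = 0.0, []
        for j in range(i):
            # A predecessor must end strictly before this exon starts.
            if ordered[j][1] < start and best[j][0] > top_score:
                top_score, top_chain = best[j]
        best.append((top_score + score, top_chain + [ordered[i]]))
    if not best:
        return 0.0, []
    return max(best, key=lambda entry: entry[0])


# Toy candidates (hypothetical coordinates and scores).
candidates = [(100, 200, 5.0), (150, 300, 6.0), (250, 400, 4.0), (450, 500, 2.0)]
score, chain = best_exon_chain(candidates)
print(score, chain)
```

The quadratic inner loop keeps the sketch short; sorting by end coordinate and using a running maximum would bring this closer to linear time.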




            “Integrated” gene models:
            Artificial intelligence approaches
   GenLang
     Syntactic pattern recognition system
     Formal grammar
     Tools from computational linguistics
     Dong and Searls (1994), Genomics, 23, 540-551.
     http://cbil.humgen.upenn.edu/~sdong/genlang_home.html




              “Integrated” gene models: HMMs
   HMMGene
     Several genes per sequence possible
     User constraints possible
     Krogh (1997), ISMB, 5, 179-186.
     http://www.cbs.dtu.dk/services/HMMgene/

   GeneMark.hmm
     Based on GeneMark program for bacterial sequences
     Can predict frame shifts
     Trained for various organisms
     Lukashin and Borodovsky (1998), NAR, 26, 1107-1115.
       http://genemark.biology.gatech.edu/GeneMark/hmmchoice.html

              “Integrated” gene models:
              GHMMs
   Genie
     Generalized hidden Markov model with length distribution
     Integration of multiple content and signal sensors
        Content: codon statistics, repeats, intron, intergenic, database
         homology hits
        Signal: promoter, start codon, splice sites, stop codon

     Dynamic programming to find optimal parse
     Several genes per sequence possible
     Kulp et al. (1996), ISMB, 4, 134-142.
     Reese et al. (1997), JCB, 4(3), 311-323.
     http://www.cse.ucsc.edu/~dkulp/cgi-bin/genie

Example: Genie




              “Integrated” gene models:
              GHMMs

   GenScan
     Multiple content and signal models
     Semi-hidden Markov model sensors with length distribution
     Takes GC content into account (separate models)
     Several genes per sequence possible
     Burge and Karlin (1997), JMB, 268(1), 78-94.
     http://CCR-081.mit.edu/GENSCAN.html




                EST/cDNA alignment for gene
                finding: Spliced alignments

   PROCRUSTES
     Spliced alignment algorithm
     Dynamic programming to combine a set of potential exons
     Frame conservation
     Homologous sequence needed
     Gelfand et al. (1996), PNAS, 93, 9061-9066.
     http://hto-13.usc.edu/software/procrustes/




            EST/cDNA alignment
   Sim4
     Aligns cDNA to genomic sequence
     Uses local similarity
     Florea et al. (1998), Genome Research, 8, 967-974.

   GeneWise
     Dynamic programming
     Partial genes allowed
     Based on Pfam and statistical splice site models
     Birney (1999), unpublished
     http://www.sanger.ac.uk/Software/Wise2



               EST/cDNA alignment (cont.)


   ACEMBLY
     Aligns ESTs to genomic sequence
     Identifies alternative splicing
     Integrated in ACeDB
     Jean Thierry-Mieg (unpublished)




                Repeat finders

   Censor
     Uses database of repeat sequences
     Jurka et al. (1996), Comp. and Chem., 20(1), 119-122.

   BLAST
     Integrated masking operations
     XBLAST procedure
          Claverie (1994), In Automated DNA Sequencing and Analysis
           Techniques, M. D. Adams, C. Fields and J. C. Venter, eds., 267-279.
     http://www.ncbi.nlm.nih.gov/BLAST




             Repeat finders (cont.)


   RepeatMasker
     Detection of interspersed repeats
     Smit and Green, unpublished results
     http://ftp.genome.washington.edu/RM/RepeatMasker.html




              Homology searching
   BLAST suite
     BLASTN, BLASTX, TBLASTX, PSI-BLAST
     Altschul et al. (1990), JMB, 215, 403-410.
     http://www.ncbi.nlm.nih.gov/BLAST

   FASTA suite
     FASTA, TFASTA
     Pearson and Lipman (1988), PNAS, 85, 2444-2448.

   HMM-based searching
     SAM (UCSC group)
          http://www.cse.ucsc.edu/research/compbio/sam.html
     HMMER, Sean Eddy
          http://hmmer.wustl.edu/
           Gene family searching

   BLOCKS
     http://www.blocks.fhcrc.org

   PROSITE
     http://www.expasy.ch/prosite/

   PFAM
     http://pfam.wustl.edu/

   SCOP
     http://scop.mrc-lmb.cam.ac.uk/scop/




           The genome annotation
           experiment (GASP1)
   Genome Annotation Assessment Project (GASP1)
   Annotation of 2.9 Mb of Drosophila melanogaster
    genomic DNA
   Open to everybody, announced on several mailing lists
   Participants can use any analysis methods they like
    (gene finding programs, homology searches, by-eye
    assessment, combination methods, etc.) and should
    disclose their methods.
   “CASP” like
   12 participating groups
URL: http://www.fruitfly.org/GASP1




           Goals of the experiment

   Compare and contrast various genome annotation
    methods
   Objective assessment of the state of the art in gene
    finding and functional site prediction
   Identify outstanding problems in computational
    methods for the annotation process




             Adh contig

   2.9 Mb contiguous Drosophila sequence from the Adh
    region, one of the best studied genomic regions
     From chromosome 2L (34D-36A)
     Ashburner et al., (to appear in Genetics)
     222 gene annotations (as of July 22, 1999)
     375,585 bases are coding (12.95%)

   We chose the Adh region because it was thought to be
    typical. A representative test bed to evaluate annotation
    techniques.


    Adh paper (to appear in Genetics)




URL: http://www.fruitfly.org/publications/PDF/ADH.pdf
                 Raw sequence: Adh.fa (slide shows an excerpt of the unannotated nucleotide sequence)




              Drosophila data sets provided to
              participants
   Curated Drosophila nuclear DNA "coding sequences" (CDS)
   Curated non-redundant Drosophila genomic DNA data (275
    “multi”- and 144 “single”-exon sequence entries from Genbank)
   Drosophila 5' and 3' splice sites
   Drosophila start codon sites
   Drosophila promoter sequences
   Drosophila repeat sequences
   Drosophila transposon sequences
   Drosophila cDNA sequences
   Drosophila EST sequences
    URL: http://www.fruitfly.org/GASP1/data/data.html
            Timetable
   May 13, 1999 - June 30, 1999
     Distribution of the sample sequence and associated data to the
      predictors. Collection of predictions.
   June 30, 1999 - July 31, 1999
     Evaluation of the predictions by the Drosophila Genome
      Center.
   August 4, 1999
     External expert assessment of the prediction results (HUGO
      meeting, EMBL)
   August 6, 1999
     Tutorial #3 at the ISMB ‘99 conference in Heidelberg, Germany
              Resources for assessing predictions
   80 cDNA sequences NOT in Genbank before
    experiment deadline
     Sequenced from 5 different cDNA libraries
     3 paralogs to other genes in the genome
     19 cDNAs with cloning artifacts
        2 apparently representing unspliced RNA
        Multiple inserts (2 cDNAs cloned in the same vector)

     58   “usable” cDNAs
   33 cDNA sequences in Genbank during experiment
   Annotations from Adh paper

              Curated data sets for assessing
              predictions
   Standard 1 (Adh.std1.gff) “conservative gene set”
     43 gene structures (7 single- and 36 multi-coding-exon genes)
     Criteria for inclusion:
        >=95% (most >=99%) of the cDNA aligned to genomic DNA (using sim4)
        “GT”/“AG” splice site consensus sequences
        Splice site score from neural net
             • 5’ splice sites: >=0.35 threshold (98% True Positive score)
             • 3’ splice sites: >=0.25 threshold (92% True Positive score)
        Start codon and stop codon annotations from Standard 3 (derived
         from Adh paper)
     These 43 genes represent “typical” genes
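
The inclusion criteria for Standard 1 can be expressed as a small filter. The Python sketch below is illustrative only and is not the code used at the Drosophila Genome Center: the `Intron` record and the function `passes_standard1` are hypothetical names, and it assumes the alignment fraction and neural-net splice scores have already been computed elsewhere (for example by sim4 and a splice site predictor).

```python
from dataclasses import dataclass


@dataclass
class Intron:
    donor: str             # first two intron bases (5' splice site)
    acceptor: str          # last two intron bases (3' splice site)
    donor_score: float     # neural-net score for the 5' splice site
    acceptor_score: float  # neural-net score for the 3' splice site


def passes_standard1(aligned_fraction, introns,
                     min_aligned=0.95, min_donor=0.35, min_acceptor=0.25):
    """Apply the three slide criteria: cDNA alignment coverage,
    GT/AG consensus, and splice site score thresholds."""
    if aligned_fraction < min_aligned:
        return False
    for intron in introns:
        if intron.donor.upper() != "GT" or intron.acceptor.upper() != "AG":
            return False
        if intron.donor_score < min_donor or intron.acceptor_score < min_acceptor:
            return False
    return True


# Hypothetical example genes, not taken from the Adh data set.
good = [Intron("GT", "AG", 0.90, 0.80)]
weak = [Intron("GT", "AG", 0.30, 0.80)]
print(passes_standard1(0.99, good), passes_standard1(0.99, weak))
```

A single-exon gene has no introns, so only the alignment criterion applies to it in this sketch.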

              Curated data sets for assessing
              predictions
   Standard 2 (Adh.std2.gff)
     Superset of Standard 1
     15 additional gene structures
     Same alignment criteria as Standard 1 but no splice site
      consensus requirement
     Not used in the experiment




                Curated data sets for assessment
   Standard 3 (Adh.std3.gff) “more complete gene set”
     222 gene structures (39 single- and 183 multi-coding-exon genes)
     Criteria:
        Annotated as described in Ashburner et al.
        cDNA to genomic alignment using sim4

        Start codons predicted by ORFFinder (Frise et al., unpublished)

        ~182 genes have similarity to a homologous protein sequence in
         another organism or have a Drosophila EST hit
            •   Edge verification by partial EST/cDNA alignments
            •   BLASTX, TBLASTX homology results
            •   PFAM alignments
            •   Gene structure verification using GenScan (human)
        14 genes had EST/homology hits but no gene finding predictions
        ~40 genes only have “strong” GenScan predictions
           Submission format

   GFF (Durbin and Haussler, 1998, unpublished)
     http://www.sanger.ac.uk/Software/GFF/




                Sample submission
         # organism: Drosophila melanogaster
         # std1
         # Gene 1
         Adh    std1    TFBS              32002    32006    .    +    .
         Adh    std1    TATA_signal       32009    32012    .    +    .    transcript "1"
         Adh    std1    TSS               32033    32034    .    +    .    transcript "1"
         Adh    std1    prim_transcript   32034    33122    .    +    .    transcript "1"
         Adh    std1    exon              32034    32277    .    +    .    transcript "1"
         Adh    std1    start_codon       32122    32124    .    +    .    transcript "1"
         Adh    std1    CDS               32122    32277    .    +    .    transcript "1"
         Adh    std1    splice5           32277    32278    .    +    .    transcript "1"
         Adh    std1    splice3           32332    32333    .    +    .    transcript "1"
         Adh    std1    exon              32785    32830    .    +    .    transcript "1"
         Adh    std1    CDS               32785    32830    .    +    .    transcript "1"
         Adh    std1    splice5           32830    32831    .    +    .    transcript "1"
         Adh    std1    splice3           32825    32826    .    +    .    transcript "1"
         Adh    std1    CDS               32826    33003    .    +    .    transcript "1"
         Adh    std1    exon              32826    33122    .    +    .    transcript "1"
         Adh    std1    stop_codon        33001    33003    .    +    .    transcript "1"
         Adh    std1    polyA_signal      33090    33095    .    +    .    transcript "1"
         Adh    std1    polyA_site        33101    33102    .    +    .    transcript "1"
         # Gene 2
         Adh    std1    prim_transcript   38100    41973    .    -    .    transcript "2"
         Adh    std1    exon              38100    41973    .    -    .    transcript "2"
         Adh    std1    polyA_site        39620    39621    .    -    .    transcript "2"
         Adh    std1    polyA_signal      39685    39690    .    -    .    transcript "2"
         Adh    std1    stop_codon        40125    40127    .    -    .    transcript "2"
         Adh    std1    CDS               40125    40390    .    -    .    transcript "2"
         Adh    std1    start_codon       40388    40390    .    -    .    transcript "2"
         Adh    std1    TSS               41973    41974    .    -    .    transcript "2"
         Adh    std1    TATA_signal       41998    42001    .    -    .    transcript "2"
         Adh    std1    TFBS              42187    42193    .    -    .
         Adh    std1    TFBS              42211    42216    .    -    .
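
A submission in this layout can be read with a few lines of Python. The sketch below is a minimal, hypothetical reader rather than an official GFF parser: it assumes nine whitespace-separated columns as shown on the slide (the GFF specification itself uses tabs), treats lines beginning with `#` as comments, and keeps the optional group column as an unparsed string.

```python
from collections import namedtuple

Feature = namedtuple(
    "Feature", "seqname source feature start end score strand frame group"
)


def parse_gff(text):
    """Parse GFF-style lines into Feature records.

    Assumption: columns are separated by any whitespace; the ninth
    (group) column is optional and may itself contain spaces.
    """
    features = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(None, 8)
        if len(fields) < 8:
            raise ValueError("expected at least 8 columns: %r" % line)
        seqname, source, feature, start, end, score, strand, frame = fields[:8]
        group = fields[8] if len(fields) > 8 else ""
        features.append(
            Feature(seqname, source, feature, int(start), int(end),
                    score, strand, frame, group)
        )
    return features


sample = '''# organism: Drosophila melanogaster
Adh std1 TFBS 32002 32006 . + .
Adh std1 exon 32034 32277 . + . transcript "1"
Adh std1 CDS 32122 32277 . + . transcript "1"
'''
for feat in parse_gff(sample):
    print(feat.feature, feat.start, feat.end, feat.group)
```

Keeping start and end as integers makes it easy to compare predicted feature boundaries against a curated standard.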


              Submissions
   MAGPIE Team
     Credit
        Terry Gaasterland, Alexander Sczyrba, Elizabeth Thomas, Gulriz
         Kurban, Paul Gordon, Christoph Sensen
        Laboratory for Computational Genomics, Rockefeller University, and
         Institute for Marine Biosciences, Canada
     Method
          Automatic genome analysis system integrating Drosophila Genscan
           predictions, confirming exon boundaries using database searches,
           repeat finding (Calypso, REPuter) and gene function annotations.




        Submissions (cont.)
 References
    “Multigenome MAGPIE” poster at ISMB ‘99.
    Gaasterland and Ragan (1998), J. of Microbial and Comparative
     Genomics, 3, 305-312.
    Gaasterland and Sensen (1996), Biochimie 78, 302-310.

    REPuter: Kurtz and Schleiermacher (1999), Bioinformatics 15(5),
     426-427.




              Submissions (cont.)
   Computational Genomics Group, The Sanger Centre
     Credit
          Victor Solovyev, Asaf Salamov
     Method
        Discriminant analysis based gene prediction programs FGenes
         (trained for human) and FGenesH (trained for Drosophila);
         combining the output of FGenes, FGenesH and BLAST using
         FGenesH+. 3 different “threshold” annotations are submitted.
        The programs' running time is linear in the sequence length.

        Automatic, plus additional user interactive screening.

        Non-redundant NCBI database used for BLAST.

     URL/References
          http://genomic.sanger.ac.uk/gf/gf.shtml
              Submissions (cont.)
   Genome Annotation Group, The Sanger Centre
     Credit
          Ewan Birney
     Method
          Protein family based gene identification using Wise2 (previously
           Genewise) and PFAM.
     URL
          http://www.sanger.ac.uk/Software/Wise2




              Submissions (cont.)
   Pattern Recognition, The University of Erlangen
     Credit
          Uwe Ohler, Georg Stemmer, Stefan Harbeck, Heinrich Niemann
     Method
        Promoter recognition based on interpolated Markov chains;
         “Genscan” like promoter model (MCPromoter); maximal mutual
         information based estimation of interpolated Markov chains.
        Automatic.

        Promoter training data set from
         http://www.fruitfly.org/data/genesets




          Submissions (cont.)
 References
    Ohler, Harbeck, Niemann, Noeth and Reese (1999), Bioinformatics
     15(5), 362-369.
    Ohler, Harbeck and Niemann (1999), Proc. EUROSPEECH, to appear.

 URL
      http://www5.informatik.uni-erlangen/HTML/English/Research/Promoter




              Submissions (cont.)
   Computational Biosciences, Oak Ridge National
    Laboratory
     Credit
          Richard J. Mural, Douglas Hyatt, Frank Larimer, Manesh Shah,
           Morey Parang
     Method
          Integrated neural network based system including gene assembly
           using EST and homology information (GRAILexp).
     URL:
          http://compbio.ornl.gov/droso




             Submissions (cont.)
   Center for Biological Sequence Analysis, Technical
    University of Denmark
     Credit
          Anders Krogh
     Method
        Modular HMM incorporating database hits (proteins and
         ESTs/cDNAs) and other “external information” probabilistically
         (HMMGene); the HMM has modules for coding regions, splice sites,
         translation start/stop, etc.
        It will be a fully automated system.

        Trained on Drosophila data

            • http://www.fruitfly.org/GASP1/data/data.html
          and
            • Victor Solovyev (personal communication)
        Submissions (cont.)
 References
    Krogh (1998), In S.L. Salzberg et al., eds., Computational Methods in
     Molecular Biology, 45-63, Elsevier.
    Krogh (1997), Gaasterland et al., eds., Proc. ISMB 97, 179-186.

    http://www.cbs.dtu.dk/krogh/refs.html

 URL
    http://www.cbs.dtu.dk/services/HMMgene/
    Not yet for Drosophila.




              Submissions (cont.)
   BLOCKS group, Fred Hutchinson Cancer Research
    Center in Seattle, Washington
     Credit
          Jorja Henikoff, Steve Henikoff
     Method
        DNA translation in 6 frames and search against BLOCKS+ and
         against BLOCKS extracted from Smart3.0
         (http://coot-embl-heidelberg.de/SMART/) using BLIMPS; automatic
         post-processing to join multiple predictions from the same block.
        Automatic with some user interactive screening of results.
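
The first step of this method, translating DNA in all six reading frames, can be sketched in Python as below. This is a generic illustration using the standard genetic code, not the BLOCKS group's code; the BLIMPS search itself is not shown, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, laid out in the conventional TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate from the first base; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames["+%d" % (offset + 1)] = translate(dna[offset:])
        frames["-%d" % (offset + 1)] = translate(reverse[offset:])
    return frames


# Toy input (hypothetical), not taken from the Adh region.
for frame, protein in sorted(six_frame_translations("ATGGCCTAA").items()):
    print(frame, protein)
```

Each translated frame would then be searched against the block database, and hits on the minus strand mapped back to genomic coordinates.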




        Submissions (cont.)
 References
    Henikoff, Henikoff and Pietrokovski (1999), Nucl. Acids Res., 27,
     226-228.
    Henikoff and Henikoff (1994), Proc. 27th Ann. Hawaii Intl. Conf. On
     System Sciences, 265-274.
    Henikoff and Henikoff (1994), Genomics, 19, 97-107.

 URL
    http://blocks.fhcrc.org
    http://blocks.fhcrc.org/blocks-bin/getblock.sh?<block name>




              Submissions (cont.)
   Genome Informatics Team, IMIM, Barcelona, Spain
     Credit
          Roderic Guigó, Josep F. Abril, Enrique Blanco, Moises Burset, Genis
           Parra
     Method
        Dynamic programming based system to combine potential exon
         candidates modeled as a fifth order Markov model and functional
         sequence sites modeled as a position weight matrix (Geneid version 3).
        Fully automatic, very fast.

        Trained on Drosophila data

             • http://www.fruitfly.org/GASP1/data/data.html




          Submissions (cont.)
 References
      Guigó et al. (1998), JCB, 5, 681-702.
 URL
      Information on training process:
         • http://www1.imim.es/~rguigo/AnnotationExperiment/index.html
      http://www1.imim.es/geneid.html




              Submissions (cont.)

   Mark Borodovsky's Lab, School of Biology, Georgia
    Institute of Technology
     Credit
          Mark Borodovsky, John Besemer
     Method
          Markov chain models combined with HMM technology
           (Genemark.hmm).
     URL
          http://genemark.biology.gatech.edu/GeneMark/hmmchoice.html




              Submissions (cont.)
   Biodivision, GSF Forschungszentrum für Umwelt und
    Gesundheit, Neuherberg, Germany
     Credit
          Matthias Scherf, Andreas Klingenhoff, Thomas Werner
     Method
        Universal sequence classifier based on correlated word
         analysis to predict initiators and promoter-associated TATA boxes
         (CoreInspector V1.0 beta). Sequences of 100 bp are classified at once.
        Trained on the Eukaryotic Promoter Database (EPD version 5.9).

        Fully automatic, 2 seconds per 1 kb.

     References
          Scherf et al. (1999), in preparation.
     URL
          http://www.gsf.de/biodv/
              Submissions (cont.)
   The Department of Biomathematical Sciences, Mount
    Sinai School of Medicine, New York
     Credit
          Gary Benson
     Method
        Tandem repeats finder (TRF v2.02) uses a theoretical model of the
         similarity between adjacent copies of a pattern (patterns of 1-500 bp
         recognized); dynamic programming for candidate validation.
        Fully automatic; very fast (seconds per 1 Mb).

        http://c3.biomath.mssm.edu/trf/Adh.fa.2.7.7.80.10.50.500.1.html

     References
          Benson (1999), Nucl. Acids Res., 27(2), 573-580.
     URL
          http://c3.biomath.mssm.edu/trf.html
              Submissions (cont.)
   Genie, UC Berkeley/UC Santa Cruz/ Neomorphic Inc.
     Credit
          Martin G. Reese, David Kulp, Hari Tammana, David Haussler
     Method
        Generalized hidden Markov model with optional integration of EST
         hits and homology searches (Genie).
        Trained on Drosophila data

             • http://www.fruitfly.org/GSAC1/data/data.html
        Semi-automatic, in that the overlaps of the analyzed sequence contigs
         (110 kb) were manually run again with Genie to resolve conflicts.
        BLAST used for homology searches on non-redundant protein
         database (nr).



          Submissions (cont.)
 References
    Reese et al. (1997), JCB, 4(3), 311-323.
    Kulp et al. (1997), Biocomputing: Proc. of the 1997 PSB conference,
     232-244.
    Kulp et al. (1996), ISMB, 4, 134-142.

 URL
      http://www.neomorphic.com/genie




                    Submission classes
                    Program name     Gene      Promoter    EST/cDNA       Protein    Repeat       Gene
                                     finding   recognition Alignment      similarity            function

Mural et al.
Oak Ridge, US       GRAILexp            X                      X                                    X

Guigó et al.
Barcelona, ES       GeneID              X

Krogh
Copenhagen, DK      HMMGene             X

Borodovsky et al.
Georgia, US         GeneMark.hmm        X

Henikoff et al.
Fred Hutchinson,    BLOCKS                                                    X                     X
Seattle, US
Solovyev et al.
Sanger, UK          FGenes/FGenesH      X



             Submission classes (cont.)
                  Program name    Gene      Promoter    EST/cDNA    Protein    Repeat      Gene
                                  finding   recognition Alignment   similarity           function

Gaasterland et al.
Rockefeller, US MAGPIE               X          X          X                      X         X

Benson et al.
Mount Sinai, US   TRF                                                             X

Werner et al.
Munich, GER       CoreInspector                 X

Ohler et al.
Nuremberg, GER MCPromoter                       X

Birney
Sanger, UK        Wise2                                                X                    X

Reese et al.
Berkeley/Santa    Genie              X          X
Cruz, US

        Gene finding techniques
                     Program name     Statistics Promoter EST/cDNA    Protein
                                                          Alignment   similarity

Mural et al.
Oak Ridge, US        GRAILexp            X                   X

Guigó et al.
Barcelona, ES        GeneID              X

Krogh
Copenhagen, DK       HMMGene             X                   X             X

Borodovsky et al.
Georgia, US          GeneMark.hmm        X

Solovyev et al.
Sanger, UK           FGenes/FGenesH      X

Gaasterland et al.
Rockefeller, US      MAGPIE              X        X          X

Reese et al.
Berkeley/Santa       Genie               X        X          X             X
Cruz, US

           Measuring success
   By nucleotide
      Sensitivity/Specificity (Sn/Sp)
   By exon
      Sn/Sp
      Missed exons (ME), wrong exons (WE)
   By gene
      Sn/Sp
      Missed genes (MG), wrong genes (WG)
      Average overlap statistics
   Based on Burset and Guigó (1996), “Evaluation of gene
    structure prediction programs”. Genomics, 34(3), 353-367.
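The nucleotide-level and exon-level measures listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation code used for the experiment: exons are inclusive (start, end) pairs on one strand, an exon counts as correct only if both boundaries match, and a missed or wrong exon is taken to be one with no overlap at all on the other side. The function names and example coordinates are hypothetical, and both exon lists are assumed to be non-empty.

```python
from fractions import Fraction
from typing import Dict, List, Set, Tuple

Interval = Tuple[int, int]  # (start, end), inclusive coordinates (assumed convention)


def covered(exons: List[Interval]) -> Set[int]:
    """Set of nucleotide positions covered by a list of exons."""
    return {pos for start, end in exons for pos in range(start, end + 1)}


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two inclusive intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def base_level(actual: List[Interval], predicted: List[Interval]) -> Tuple[Fraction, Fraction]:
    """Nucleotide-level (Sn, Sp): TP is the number of coding bases predicted as coding."""
    a, p = covered(actual), covered(predicted)
    tp = len(a & p)
    return Fraction(tp, len(a)), Fraction(tp, len(p))


def exon_level(actual: List[Interval], predicted: List[Interval]) -> Dict[str, Fraction]:
    """Exon-level Sn, Sp, ME and WE; partial overlaps do not count as correct."""
    exact = len(set(actual) & set(predicted))
    missed = sum(1 for a in actual if not any(overlaps(a, p) for p in predicted))
    wrong = sum(1 for p in predicted if not any(overlaps(p, a) for a in actual))
    return {
        "Sn": Fraction(exact, len(actual)),
        "Sp": Fraction(exact, len(predicted)),
        "ME": Fraction(missed, len(actual)),
        "WE": Fraction(wrong, len(predicted)),
    }


# Hypothetical annotation and prediction: one exact exon, one partial, one wrong.
actual = [(100, 200), (300, 400), (500, 600)]
predicted = [(100, 200), (310, 400), (700, 750)]
print(base_level(actual, predicted))
print(exon_level(actual, predicted))
```

In this example the partially matching exon (310-400) raises the base-level scores but contributes nothing at the exon level, which is one reason the same prediction can look far better by nucleotide than by exon.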

            Definitions and formulae


               Sn = TP/(TP+FN)
               Sp = TP/(TP+FP)


   TP = True positive
   FP = False positive
   FN = False negative
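The two formulae translate directly into code. The sketch below uses exact fractions so the results can be compared with the toy examples; the function names are illustrative. Note that Sp as defined here, TP/(TP+FP), is the quantity often called precision or positive predictive value elsewhere, not the true-negative-based specificity.

```python
from fractions import Fraction


def sensitivity(tp: int, fn: int) -> Fraction:
    """Sn = TP / (TP + FN): fraction of actual features that were predicted."""
    return Fraction(tp, tp + fn)  # raises ZeroDivisionError if there are no actual features


def specificity(tp: int, fp: int) -> Fraction:
    """Sp = TP / (TP + FP): fraction of predicted features that are correct."""
    return Fraction(tp, tp + fp)  # raises ZeroDivisionError if nothing was predicted


# Counts of Pred2 against Std1 in Toy example 1: TP = 2, FP = 5, FN = 1.
print(sensitivity(2, 1), specificity(2, 5))  # prints: 2/3 2/7
```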
Genes: True positives (TP)




Genes: False positives (FP)




Genes: False Negatives (FN)




Toy example 1 (1)


        TP   FP   FN   Sn    Sp
Std1
Pred1    2    1    1   2/3   2/3
Pred2    2    5    1   2/3   2/7




                       Sn = TP/(TP+FN)
                       Sp = TP/(TP+FP)


Genes: Missing Genes (MG)




Genes: Wrong Genes (WG)




 Toy example 1 (2)


        TP   FP   FN   Sn    Sp    MG   WG
Std1
Pred1    2    1    1   2/3   2/3    1    1
Pred2    2    5    1   2/3   2/7    0    4
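The gene-level counts in this table can be derived from gene structures as in the sketch below. The assumptions are loud ones: a gene is a tuple of exons sorted by position, a true positive requires an identical exon structure, and missing/wrong genes are those whose genomic span overlaps nothing on the other side. The coordinates are hypothetical, chosen only so that the counts match the Pred1 row (the slide figure is not reproduced here).

```python
from typing import Dict, List, Tuple

Exon = Tuple[int, int]   # (start, end), inclusive
Gene = Tuple[Exon, ...]  # exons sorted by start coordinate (assumed)


def span(gene: Gene) -> Tuple[int, int]:
    """Genomic extent of a gene, from first exon start to last exon end."""
    return gene[0][0], gene[-1][1]


def genes_overlap(a: Gene, b: Gene) -> bool:
    (a_start, a_end), (b_start, b_end) = span(a), span(b)
    return a_start <= b_end and b_start <= a_end


def gene_level_counts(actual: List[Gene], predicted: List[Gene]) -> Dict[str, int]:
    """TP/FP/FN by exact structure match; MG/WG by complete lack of overlap."""
    tp = sum(1 for p in predicted if p in actual)
    missing = sum(1 for a in actual if not any(genes_overlap(a, p) for p in predicted))
    wrong = sum(1 for p in predicted if not any(genes_overlap(p, a) for a in actual))
    return {
        "TP": tp,
        "FP": len(predicted) - tp,
        "FN": len(actual) - tp,
        "MG": missing,
        "WG": wrong,
    }


# Hypothetical coordinates: two genes predicted exactly, one missed, one wrong.
std1 = [((100, 200), (300, 400)), ((1000, 1200),), ((2000, 2100), (2300, 2400))]
pred1 = [((100, 200), (300, 400)), ((1000, 1200),), ((3000, 3100),)]
print(gene_level_counts(std1, pred1))
```

A false negative is not necessarily a missing gene: a gene that is overlapped by an inexact prediction counts as FN but not as MG, which is why the two columns can differ.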



                      Sn = TP/(TP+FN)
                      Sp = TP/(TP+FP)


Genes: Std 1 versus Std 3




             Std1: “conservative gene set”
             Std3: “more complete gene set”


 Toy example 1 (3)

        TP   FP   FN   Sn    Sp    MG   WG
Std1
Pred1    2    1    1   2/3   2/3    1    1
Pred2    2    5    1   2/3   2/7    0    4
Std3
Pred1    2    1    2   2/4   2/3    2    1
Pred2    3    4    1   3/4   3/7    0    3



                        Sn = TP/(TP+FN)
                        Sp = TP/(TP+FP)

Genes: Std1 and Std3 versus
“real” gene structure




 Toy example 1 (4)

        TP   FP   FN   Sn    Sp    MG   WG
Std1
Pred1    2    1    1   2/3   2/3    1    1
Pred2    2    5    1   2/3   2/7    0    4
Std3
Pred1    2    1    2   2/4   2/3    2    1
Pred2    3    4    1   3/4   3/7    0    3
"Real"
Pred1    3    0    1   3/4   3/3    1    0
Pred2    3    4    1   3/4   3/7    0    3




 Toy example 1 (5): Exon level

        TP   FP   FN   Sn    Sp     ME   WE
Std1
Pred1    5    2    1   5/6   5/7     1    2
Pred2    4    8    2   2/3   1/3     1    7
Std3
Pred1    5    2    2   5/7   5/7     2    2
Pred2    5    7    2   5/7   5/12    1    6
"Real"
Pred1    7    0    2   7/9   7/7     1    0
Pred2    6    6    3   2/3   1/2     1    5




Genes: Joined genes (JG)




Genes: Split genes (SG)




           Definition: “Joined” and “split”
           genes
    JG = (# actual genes that overlap predicted genes)
         / (# predicted genes that overlap one or more actual genes)

    SG = (# predicted genes that overlap actual genes)
         / (# actual genes that overlap one or more predicted genes)

    JG > 1, tendency to join multiple actual genes into one
     prediction
    SG > 1, tendency to split actual genes into separate
     gene predictions

 Inspired by Hayes and Guigó (1999), unpublished.
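One way to compute JG and SG is sketched below. It rests on an interpretation that the slide does not spell out: both numerators are read as the number of overlapping (actual, predicted) gene pairs, so that each ratio is an average number of partners per overlapped gene. Under this assumed reading, a prediction that merges two actual genes gives JG = 2 and SG = 1, as in the Pred1 row of Toy example 2. Genes are reduced to hypothetical (start, end) spans; the function name is illustrative.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # genomic extent of a gene, inclusive (assumed representation)


def overlaps(a: Span, b: Span) -> bool:
    return a[0] <= b[1] and b[0] <= a[1]


def joined_split(actual: List[Span], predicted: List[Span]) -> Tuple[Optional[float], Optional[float]]:
    """Return (JG, SG); (None, None) if no predicted gene overlaps any actual gene."""
    pairs = [(a, p) for a in actual for p in predicted if overlaps(a, p)]
    if not pairs:
        return None, None
    actual_hit = {a for a, _ in pairs}      # actual genes overlapped by a prediction
    predicted_hit = {p for _, p in pairs}   # predicted genes overlapping an actual gene
    jg = len(pairs) / len(predicted_hit)    # average actual genes per overlapping prediction
    sg = len(pairs) / len(actual_hit)       # average predictions per overlapped actual gene
    return jg, sg


actual = [(100, 200), (300, 400), (600, 700)]

# One prediction joins the first two genes; the other overlaps nothing.
print(joined_split(actual, [(100, 400), (450, 550)]))

# The second gene is split into two predictions; the wrong genes do not affect JG or SG.
split_pred = [(100, 200), (300, 340), (360, 400), (600, 650), (800, 850), (900, 950)]
print(joined_split(actual, split_pred))
```

Predictions and actual genes with no overlap at all are excluded from both ratios; they are already accounted for by the wrong-gene and missed-gene counts.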
        Toy example 2 (1)



        TP   FP   FN   Sn    Sp    MG   WG   JG   SG
Std1
Pred1    0    2    3   0     0      1    1   2    1
Pred2    1    7    2   1/3   1/8    0    4   1    1.33




             Annotation experiment results

   Results available during tutorial and at

              http://www.fruitfly.org/GASP1/results/




                 Results: Base level
Program          Sn (Std1)   Sp (Std3)
Fgenes CGG1      0.89        0.77
Fgenes CGG2      0.49        0.86
Fgenes CGG3      0.93        0.60
GeneID v1        0.48        0.84
GeneID v2        0.86        0.83
GeneMark HMM     0.96        0.86
Genie            0.96        0.92
Genie EST        0.97        0.91
Genie EST HOM    0.97        0.83
HMMGene          0.97        0.91
MAGPIE           0.96        0.63
GRAILexp         0.81        0.86


     Sensitivity:
          Low variability among predictors
          ~95% coverage of the proteome

     Specificity
          ~90%
          Programs that are more like Genscan (used for original
           annotation) might do better?
                Results: Exon level
Program          Sn (Std1)   Sp (Std3)   ME(%) (Std1)   WE(%) (Std3)
Fgenes CGG1      0.65        0.49        10.5           31.6
Fgenes CGG2      0.44        0.68        45.5           17.2
Fgenes CGG3      0.75        0.24         5.6           53.3
GeneID v1        0.27        0.29        54.4           47.9
GeneID v2        0.58        0.34        21.1           47.4
GeneMark HMM     0.70        0.47         8.1           28.9
Genie            0.70        0.57         8.1           17.4
Genie EST        0.77        0.55         4.8           20.1
Genie EST HOM    0.79        0.52         3.2           22.8
HMMGene          0.68        0.53         4.8           20.2
MAGPIE           0.63        0.41        12.1           50.2
GRAILexp         0.42        0.41        24.3           28.7

    Higher variability among predictors
    Up to ~75% sensitivity (both exon boundaries correct)
    55% specificity
    Low specificity because partial exon overlaps do not count
    Missing exons below 5%
    Many wrong exons (~20%)
                 Results: Gene level
Program          Sn (Std1)   Sp (Std3)   MG(%) (Std1)   WG(%) (Std3)   SG     JG
Fgenes CGG1      0.51        0.36        27.9           50.3           1.10   1.06
Fgenes CGG2      0.16        0.32        81.3           33.8           1.10   1.09
Fgenes CGG3      0.60        0.14        13.9           74.5           2.11   1.08
GeneID v1        0.07        0.07        81.3           85.4           1.06   1.62
GeneID v2        0.35        0.14        46.5           72.2           1.06   1.11
GeneMark HMM     0.56        0.31        20.9           53.5           1.07   1.11
Genie            0.56        0.37        18.6           39.0           1.17   1.08
Genie EST        0.65        0.38        11.6           41.8           1.15   1.09
Genie EST HOM    0.65        0.34         9.3           45.7           1.16   1.09
HMMGene          0.56        0.39        11.6           42.0           1.04   1.12
MAGPIE           0.47        0.25        27.9           67.0           1.22   1.06
GRAILexp         0.33        0.21        37.2           52.0           1.23   1.08




            Results: Gene level

   60% of actual genes predicted completely correctly
   Specificity only 30-40%
   5-10% missed genes (comparable to Sanger Centre)
   40% wrong genes, a lot of short genes over-predicted
    (possibly not annotated in Standard 3)
   Splitting genes is a bigger problem than joining genes




 Results (protein homology):
 Base level

               BLOCKS   Wise2   MAGPIE cDNA   MAGPIE EST   GRAIL Similarity
Sn (Std1)      0.04     0.12    0.02          0.31         0.31
Sp (Std3)      0.80     0.82    0.55          0.32         0.81




 Results (protein homology):
 Exon level
               BLOCKS   Wise2   MAGPIE cDNA   MAGPIE EST   GRAIL Similarity
Sn (Std1)      0.00     0.06    0.00          0.02         0.07
Sp (Std3)      0.00     0.09    0.04          0.00         0.35
ME(%) (Std3)   86.1     77.2    98.3          64.2         54.4
WE(%) (Std3)   13.2     14.2    25.4          56.4         12.4




Results (protein homology):
Gene level
               BLOCKS   Wise2   MAGPIE cDNA   MAGPIE EST   GRAIL Similarity
Sn (Std1)      0.00     0.00    0.00          0.00         0.07
Sp (Std3)      0.00     0.00    0.00          0.00         0.18
MG(%) (Std3)   95.3     90.6    97.6          88.3         74.4
WG(%) (Std3)   17.5     15.7    52.6          58.5         29.7




Transcription Start Site (TSS):
Standard 1




TSS: Standard 3




       Results:
       TSS recognition

                    MAGPIE        Genie         MCPromoter    CoreInspector

Likely (7.7%)       153 (36.3%)   143 (61.1%)    80 (9.2%)     3 (13.0%)
Unlikely (6.5%)      29 (6.8%)     62 (26.4%)   170 (19.5%)    3 (13.0%)
Possible (86.8%)    239 (56.7%)    29 (12.3%)   619 (71.2%)   17 (74.0%)




Interesting gene examples:
bubblegum




Adh/Adhr (Alcohol
dehydrogenase/Adh related)




Adh/Adhr (cont.)




            osp (outspread)

   Contains Adh and Adhr embedded in an intron




cact (cactus)




kuz (kuzbanian)




beat (beaten path)




Idgf1, Idgf2, Idgf3 (Imaginal Disc
Growth Factor)




            Idgf1, Idgf2, Idgf3 (cont.)
   Chitinase-related
   Gene function has changed (now a growth factor)




             Conclusion of GASP1
   95% coverage of the proteome
   Base level prediction is easier, exon level prediction is
    harder
   Small genes overpredicted (?)
   Long introns
   The high number of “wrong genes” indicates possible
    incomplete annotation in Standard 3 (Are there more
    genes?)
   HMMs currently seem to be the best approach
   Major improvements in multiple gene regions
            Conclusion GASP1 (cont.)
   Much lower false positive rates
   Methods optimized for organism of interest do better
   Including homology in gene finding does not always improve
    prediction
   Splitting genes is more of a problem than joining genes
   No program is perfect




           Discussion GASP1
   Genes in introns
   Alternative splicing
   Genomic contamination in cDNA libraries
   Translation start prediction
   Biological verification of prediction needed
     Improve test bed by cDNA sequencing
     More regulation data needed to confirm promoter assessment

   Combining methods
   Better methods needed
   GASP 2 ?
            Conclusions on annotating
            complete eukaryotic genomes
   Throughput has to improve dramatically
   Not only genes but also their relationships have to be
    elucidated
   Complete-transcript cDNAs are a very powerful tool for
    annotation, including alternative transcripts
   Comparative genomics and expression analysis
    improve/complete genome annotation
   Standardization efforts needed (ontology working
    group, OMG, OiB, NCBI/EBI, BioXML, etc.)
     Standards for description of gene products
     Exchange format (GFF, GenBank, EMBL, XML)
            Conclusions on annotating complete
            eukaryotic genomes (cont.)
   Maintenance requires even more effort than the original
    development
   Automated methods are not good enough
   Human curators can cause problems too
   Functional assignment by homology is sometimes
    unreliable




                Discussion on annotating complete
                eukaryotic genomes
   Re-annotation: updating results and annotations over
    time
       Genomic sequence changes (indels, point mutations)
       Analysis software changes
       New entries in public sequence databases
       Entries removed from sequence databases
   Audit trail for annotations
   Master copy of genome annotations should reside in the
    model organism databases where the expertise resides
   Community collaborative annotation



           Acknowledgments

   Uwe Ohler (University of Erlangen, Germany)
   Gerry Rubin (UC Berkeley)
   Sima Misra (UC Berkeley)
   Erwin Frise (UC Berkeley)
   Roderic Guigó (Barcelona)
   GFF team (headed by Richard Bruskiewich, Sanger Centre)
   Assessment team: Michael Ashburner (EBI), Peer Bork
    (EMBL), Richard Durbin (Sanger), Roderic Guigó (Barcelona),
    Tim Hubbard (Sanger)
   Annotation experiment participants



				