Genome Annotation and Databases

               Genomic DNA sequence
               Genomic annotation




BIO520 Bioinformatics   Jim Lund
       Genome Annotation
• Find known repeats
• Search for new repeated sequences
• Predict Genes
  – BLASTX
  – Genewise, Fgenes,
    Genscan…
• Integrate other data
  sources.

               Accuracy highest in “high homology” class
 Genome annotation servers
• Integrate information from several maps
  – DNA sequence (contigs, quality).
  – Physical (cytogenetic, STS content).
  – Genes (Predicted and known).
     • Several prediction programs.
     • Expressed sequence tags (ESTs, Unigene
       clusters)
     • Evidence (Predicted, confirmed)
     • Non-coding RNA (ncRNA) transcripts.
  – Regions of shared synteny.
                  Data Release
• Human genome sequence released under 1996
  Bermuda rules
   – Assembled sequence longer than 1000 bp is
     deposited in a public database (GenBank/EMBL/DDBJ)
     within 24 hours
   – No patents are filed
• Bermuda principles reaffirmed at January 2003
  WT/NIH meeting
   – Pre-release of data for all “community projects”
   – Nature 421, 875 (2003)
   – NHGRI:
       • http://www.genome.gov/page.cfm?pageID=10506376
   – WT:
       • http://www.wellcome.ac.uk/en/1/awtpubrepdat.html

• Benefits of Open Data Access supported by
  OECD report
   – http://dataaccess.ucsd.edu
            Accessing the Genome
• Genome sequences are becoming available very rapidly
  – Large and difficult to handle computationally
  – Everyone expects to be able to access them immediately
• Bench Biologists
  –   Has my gene been sequenced?
  –   What are the genes in this region?
  –   Where are all the GPCRs?
  –   Connect the genome to other resources
• Research Bioinformatics
  – Give me a dataset of human genomic DNA
  – Give me a protein dataset
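
The bench biologist's question "What are the genes in this region?" comes down to an interval-overlap query over annotated features. A minimal sketch, assuming 1-based inclusive coordinates; the gene names and positions are made up for illustration, and a real server would use an indexed store rather than a linear scan:

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def genes_in_region(genes, chrom, start, end):
    """Return the genes overlapping chrom:start-end, sorted by start."""
    hits = [
        g for g in genes
        if g.chrom == chrom and g.start <= end and g.end >= start
    ]
    return sorted(hits, key=lambda g: g.start)


# Hypothetical annotation table (not real gene coordinates).
GENES = [
    Gene("geneA", "chr1", 1_000, 5_000),
    Gene("geneB", "chr1", 4_500, 9_000),
    Gene("geneC", "chr1", 20_000, 25_000),
    Gene("geneD", "chr2", 1_000, 2_000),
]

if __name__ == "__main__":
    for gene in genes_in_region(GENES, "chr1", 4_800, 21_000):
        print(gene.name, gene.start, gene.end)
```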
   Getting information out

• Search/browse to find the gene or
  region.
• Export formats:
  – Screen shot
  – FASTA seq.
  – GenBank file with features annotated
  – Feature list (GFF, tab-delimited text)
  – PIP (percent identity plot: sequence identity between
    organisms).
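
Two of the export formats above are simple enough to read and write by hand. A minimal sketch, assuming the common nine-column tab-delimited GFF layout (seqid, source, type, start, end, score, strand, phase, attributes) and FASTA lines wrapped at 60 characters; the helper names are illustrative only:

```python
from typing import NamedTuple, Optional


class GffFeature(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int
    end: int
    score: Optional[float]
    strand: str
    phase: str
    attributes: str


def parse_gff_line(line):
    """Parse one GFF line; return None for blank lines and comments."""
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-delimited columns, got {len(fields)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = fields
    return GffFeature(
        seqid, source, ftype, int(start), int(end),
        None if score == "." else float(score),  # "." means no score
        strand, phase, attrs,
    )


def to_fasta(name, seq, width=60):
    """Format a sequence as a FASTA record with fixed-width lines."""
    lines = [f">{name}"]
    lines += [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join(lines) + "\n"
```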
            Challenges
• Scale and data flow
  – Mainly engineering problems
• Presentation, ease of use
  – Engineering problems
  – User interface design
• Algorithmic
  – Partly engineering
  – Partly research
    NCBI sequence assembly
      (sequence → chromosome)

• Remove contaminants
• Bin by chromosome arms
• Sequence Layout
• Sequence Building
• Place on chromosomes
NCBI sequence assembly - a modified greedy approach

[Diagram labels, top to bottom: BAC Sequence; Fragments; Assemble; Order; NCBI Contig]

Sequence Layout
   • Curated Finished Regions
   • Curated assembly instructions
   • MegaBLAST hits
   • Consider clone order
   • BAC chromosome assignment
        – annotation
        – STS markers
        – personal communication
   • Remove conflicting overlaps, redundant BACs

Sequence Building
   • Consider fragment:fragment sequence overlaps for each BAC pair in
     layout
   • Meld overlapping sequence
   • Order and Orient (o+o):
        – alignments (mRNA, EST)
        – BAC annotation
        – paired plasmid reads
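
To make "greedy" concrete, here is a minimal sketch of the textbook greedy overlap-merge, not NCBI's modified procedure: at each step the two fragments with the longest exact suffix/prefix overlap are melded. It assumes error-free fragments on one strand and ignores the curated layout, clone order, and order-and-orient evidence listed above; the function names are illustrative.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Overlaps shorter than min_len are reported as 0.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly meld the pair of fragments with the largest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates
    # Drop fragments fully contained in another fragment (redundant).
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return frags  # no overlaps left: these are the contigs
        melded = frags[best_i] + frags[best_j][best_len:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(melded)


if __name__ == "__main__":
    reads = ["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG", "GCCGGAATAC"]
    print(greedy_assemble(reads))
```

Greedy merging can be misled by repeats, which is one reason the layout step draws on curated instructions and map evidence.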
             NCBI Genome Build Process

[Figure: flowchart of the build process. Labels recoverable from the slide:]
• Resources: dbSNP, STS, Clones, GenomeScan, GenBank, LocusLink, RefSeq
  (Collaboration, Curation)
• Input Data: Sequences, Curated NTs, TPF, BLAST hits
• Steps: Assembly, Freeze, Contig Build & Release, Annotation,
  Resource Updates
• Update: Links, LocusLink, gi's; Prepare for release
• Exclude problem accessions
• Analysis & Review: Corrections for next build
• Public Release: Sequences (contig, mRNA, protein), Map Viewer, FTP, BLAST
• Input Resources
             What is being annotated?

Feature                        Method
Genes                          By alignment, by prediction
Markers                        By ePCR
Variation                      By alignment
Clones/Cytogenetic location    By alignment (BAC ends)
Phenotype (MIM)                Via gene identification, associated markers
Cytogenetic position           By annotated BAC-end sequenced clones;
                               by FISH-mapped clones used in assembly
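
Placing markers "by ePCR" means searching the assembled sequence for both primers of an STS, in the right orientation and about the expected product size apart. A minimal sketch of that idea, assuming exact primer matches on the forward strand only (real e-PCR tolerates mismatches and searches both strands); the function names and the toy sequence are illustrative:

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_all(seq, pattern):
    """All 0-based start positions of pattern in seq (overlaps allowed)."""
    positions = []
    i = seq.find(pattern)
    while i != -1:
        positions.append(i)
        i = seq.find(pattern, i + 1)
    return positions


def epcr_hits(seq, fwd, rev, expected_size, margin=50):
    """Return (start, end, size) for each primer pair placement.

    start/end are 0-based, half-open. A hit needs the forward primer,
    then the reverse complement of the reverse primer downstream, with
    a product size within `margin` of `expected_size`.
    """
    rev_site = revcomp(rev)
    hits = []
    for s in find_all(seq, fwd):
        for r in find_all(seq, rev_site):
            end = r + len(rev_site)
            size = end - s
            if r >= s and abs(size - expected_size) <= margin:
                hits.append((s, end, size))
    return hits


if __name__ == "__main__":
    toy = "TTTT" + "ACGTAC" + "G" * 20 + "TTGACC" + "AAAA"
    print(epcr_hits(toy, "ACGTAC", "GGTCAA", expected_size=32, margin=5))
```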
  RefSeq: a reagent for Contig Annotation

[Diagram: evidence placed on the genome]
• RefSeq mRNAs
• GenBank mRNAs
• ESTs
• TBLASTN
• RPSBLAST
• GenomeScan

Potential Problems:
• Gene Families
• Partial
• Chimeric
• Intron read-through
• Linker
• Vector
• Wrong organism

RefSeq Advantages:
• Separate Gene Families
• Not Partial
• Means to correct problem sequences

RefSeq process results in excluding problem GenBank sequences
from annotation pipeline
     NCBI: Products of annotation
•   RefSeqs (transcripts, proteins)
•   Gene id (LocusID)
•   features in chromosome coordinates
•   features in contig (NT accession)
      coordinates

Available in:
• Map Viewer
   – Graphical display
   – Tabular display
   – Sequence downloads
• FTP
   – RefSeqs (contigs, transcripts, proteins)
   – Mapping Data
   – LocusLink & Other resources
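
Features are reported both in contig (NT accession) coordinates and in chromosome coordinates; converting between them only needs the contig's placement on the chromosome. A minimal sketch, assuming 1-based inclusive coordinates and a placement record with a start, length, and strand; the record layout and the contig name are hypothetical, not NCBI's file format:

```python
from typing import NamedTuple


class Placement(NamedTuple):
    contig: str
    chrom: str
    chrom_start: int  # chromosome position of the contig's leftmost base
    length: int       # contig length in bases
    strand: str       # "+" if the contig runs with the chromosome, else "-"


def contig_to_chrom(p, start, end):
    """Map a 1-based inclusive contig interval to chromosome coordinates."""
    if not (1 <= start <= end <= p.length):
        raise ValueError("interval lies outside the contig")
    if p.strand == "+":
        return p.chrom, p.chrom_start + start - 1, p.chrom_start + end - 1
    # Reversed contig: contig base 1 sits at the rightmost chromosome base.
    chrom_end = p.chrom_start + p.length - 1
    return p.chrom, chrom_end - end + 1, chrom_end - start + 1


if __name__ == "__main__":
    placement = Placement("NT_EXAMPLE", "chr1", 1001, 500, "-")
    print(contig_to_chrom(placement, 10, 20))
```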
NCBI Map Viewer
NCBI Map Viewer: Tabular report
Genes in regions of conserved synteny
[Figure: two Map Viewer displays]
• Anchored by human gene order
• Anchored by mouse gene order
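
A display "anchored by human gene order" can be approximated by walking the human genes in order and grouping consecutive genes whose mouse orthologs fall on the same mouse chromosome. A minimal sketch under that simplification (it ignores gene order and orientation within the mouse chromosome); the gene names and chromosome assignments are invented:

```python
from itertools import groupby


def synteny_blocks(human_order, mouse_chrom):
    """Group consecutive human genes by the mouse chromosome of their ortholog.

    human_order: gene names in human chromosome order.
    mouse_chrom: dict mapping gene name -> mouse chromosome of its ortholog.
    Genes without a known ortholog are skipped.
    """
    mapped = [g for g in human_order if g in mouse_chrom]
    return [
        (chrom, list(run))
        for chrom, run in groupby(mapped, key=mouse_chrom.get)
    ]


if __name__ == "__main__":
    order = ["g1", "g2", "gX", "g3", "g4", "g5", "g6"]
    orthologs = {"g1": "11", "g2": "11", "g3": "11",
                 "g4": "4", "g5": "4", "g6": "11"}
    for chrom, genes in synteny_blocks(order, orthologs):
        print(f"mouse chr{chrom}: {', '.join(genes)}")
```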
Query by sequence: Review the alignment

A click away:
• Alignments (BLAST hit)
• Gene Description (LocusLink)
• Report of all features in the region
• Contig sequence
• Sequence in the region
• Other mRNAs aligning in the region
• Define your own gene model based on alignments in the region
    Quality Control - Genome review

•   Is the sequence correct?
•   Is the feature correctly placed?
•   Is there a feature that should be placed?
•   Are the attributes of the feature correct?

Approaches:
•In-house analysis & review (manual curation)
•Shared information (UCSC/Ensembl)
•Solicited review by experts in local regions
              Ensembl Analysis
• Set of high quality gene predictions
   – From known human mRNAs aligned against the genome
   – From similar proteins and mRNAs aligned against the genome
   – From Genscan predictions confirmed via BLAST against protein,
     cDNA, and EST databases.
• Initial functional annotation from InterPro
• Integration with external resources (SNPs, SAGE,
  OMIM)
• Comparative analysis between mouse/human
   – DNA sequence alignment
   – Protein orthologs
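
The idea of keeping only Genscan predictions "confirmed" by BLAST evidence can be sketched as a coverage filter: keep a prediction when enough of its span is covered by aligned evidence. This is an illustrative simplification, not Ensembl's actual rule; the 50% threshold, the span-level (rather than exon-level) test, and the function names are all assumptions.

```python
def covered_bases(start, end, intervals):
    """Number of bases in [start, end] covered by any interval.

    All coordinates are 1-based and inclusive; overlapping evidence
    intervals are counted once.
    """
    clipped = sorted(
        (max(s, start), min(e, end))
        for s, e in intervals
        if s <= end and e >= start
    )
    total, covered_to = 0, start - 1
    for s, e in clipped:
        if e > covered_to:
            total += e - max(s, covered_to + 1) + 1
            covered_to = e
    return total


def confirmed_predictions(predictions, evidence, min_fraction=0.5):
    """Names of predictions whose span is sufficiently covered by evidence.

    predictions: list of (name, start, end); evidence: list of (start, end).
    """
    kept = []
    for name, start, end in predictions:
        fraction = covered_bases(start, end, evidence) / (end - start + 1)
        if fraction >= min_fraction:
            kept.append(name)
    return kept


if __name__ == "__main__":
    hits = [(90, 120), (110, 150), (180, 250)]
    preds = [("pred1", 100, 199), ("pred2", 1000, 1099)]
    print(confirmed_predictions(preds, hits))
```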
Ensembl prediction pipeline

[Diagram, top to bottom:]
• DNA
• RepeatMasker
• Left branch: Genscan, then BLAST of Genscan peptides vs. protein,
  UniGene, EST, and vertebrate mRNA
• Right branch: Pmatch of all human proteins and cDNAs
• MiniGenewise, MiniEst2genome
• Genes
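
The pipeline starts with RepeatMasker so that downstream gene prediction and alignment steps do not fire on repeats. A minimal sketch of the masking step itself, assuming the repeat locations are already known as 0-based half-open intervals (finding them is the hard part and is not shown); `hard_mask` is an illustrative helper, not part of RepeatMasker:

```python
def hard_mask(seq, repeats):
    """Replace every base inside a repeat interval with 'N'.

    repeats: iterable of 0-based, half-open (start, end) intervals.
    Intervals are clipped to the sequence; overlaps are harmless.
    """
    masked = list(seq)
    for start, end in repeats:
        start, end = max(start, 0), min(end, len(seq))
        if end > start:
            masked[start:end] = "N" * (end - start)
    return "".join(masked)


if __name__ == "__main__":
    print(hard_mask("ACGTACGTAC", [(2, 5), (8, 20)]))
```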
               Genome Annotation

[Figure: The generic structure of an automatic genome annotation pipeline and delivery system]

[Figure: genome browser display. Panels: Chromosome; Overview (genes and markers, 1 Mb); Configuration; Detailed View (genes, ESTs, CpG etc., 100 kb)]
      Useful genomic annotation and browser URLs
Automated annotation pipelines
           EBI/Sanger Institute Ensembl Project: http://www.ensembl.org/Homo_sapiens/
           NCBI Human Genome Browser:
           http://proxy.library.uiuc.edu:3367/genome/guide/human/
           The Oak Ridge National Laboratories Genome Channel:
           http://compbio.ornl.gov/channel/
           Celera Discovery System: http://cds.celera.com/
           Incyte Genomics - Genomics Knowledge Platform:
           http://www.incyte.com/incyte_science/technology/gkp/
           Paracel GeneMatcher2 System: http://www.paracel.com/products/gm2.html
Human genome browsers
           UCSC Human Genome Browser: http://genome.cse.ucsc.edu/cgi-bin/hgGateway/
           Softberry Genome Explorer: http://www.softberry.com/berry.phtml?topic=genomexp
           Viaken Enterprise Ensembl Solution:
           http://www.viaken.com/ns/solutions/ensembl.html
           LabBook Inc. Genomic Explorer Suite:
           http://www.labbook.com/products/ExplorerSuite.asp
           University of Tokyo Gene Resource Locator Browser: http://grl.gi.k.u-tokyo.ac.jp/
Other useful sites
           The Institute for Genomic Research (TIGR): http://www.tigr.org/
           Human Genome Central: http://www.ensembl.org/genome/central/ and
           http://proxy.library.uiuc.edu:3528/genome/central/
            Genome annotation issues
•Annotation servers.
   •Pro: make genomics information accessible to biologists without expert
   bioinformatics skills.
   •Con: makes it difficult to perform large-scale data mining.
   •Solution: enable more experienced users to retrieve the data they require
   and to run analyses locally.

•Open annotation systems.
   •Biologists need to have access to annotations available in the community
   and to share their own contributions with the community.
   •A common protocol between systems that enables genome data to be
   freely exchanged
       •AGAVE (Architecture for Genomic Annotation, Visualization and
       Exchange)
       •Distributed Annotation System (DAS) projects
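
A common exchange protocol mostly comes down to an agreed way of asking any server for the features in a region. A minimal sketch that builds such a request, assuming the DAS 1.x style of addressing a region as `<server>/<source>/features?segment=<ref>:<start>,<stop>`; the server address and data source name are hypothetical, and no request is actually sent:

```python
from urllib.parse import urlencode


def das_features_url(server, source, segment, start, stop):
    """Build a DAS-style 'features' request URL for one region.

    server: base URL of the DAS server (assumed to end in /das).
    source: name of the data source on that server.
    segment/start/stop: reference sequence name and 1-based range.
    """
    query = urlencode({"segment": f"{segment}:{start},{stop}"}, safe=":,")
    return f"{server.rstrip('/')}/{source}/features?{query}"


if __name__ == "__main__":
    # Hypothetical server and source, for illustration only.
    print(das_features_url("http://das.example.org/das", "example_source",
                           "1", 1000, 2000))
```

Because every server answers the same kind of request, a client can overlay annotations from several groups on one reference sequence.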
 Genome annotation servers

• Several ways to find information:
  – Search by clone, gene, EST, marker.
  – Browse sequence.
  – BLAST searches.
  – Homology, start in one organism, jump
    to the syntenic region of another.
UCSC Genome Browser




 http://genome.ucsc.edu/cgi-bin/hgGateway

						