Genome Annotation and Databases
Genomic DNA sequence
Genomic annotation
BIO520 Bioinformatics Jim Lund
Genome Annotation
• Find known repeats
• Search for new repeated sequences
• Predict Genes
– BLASTX
– Genewise, Fgenes, Genscan…
• Integrate other data sources.
Accuracy highest in “high homology” class
Genome annotation servers
• Integrate information from several maps
– DNA sequence (contigs, quality).
– Physical (cytogenetic, STS content).
– Genes (Predicted and known).
• Several prediction programs.
• Expressed sequence tags (ESTs, Unigene clusters)
• Evidence (Predicted, confirmed)
• Non-coding RNA (ncRNA) transcripts.
– Regions of shared synteny.
Data Release
• Human genome sequence released under 1996 Bermuda rules
– Assembled sequence longer than 1000 bp is deposited in a public database (GenBank/EMBL/DDBJ) every 24 hours
– No patents are filed
• Bermuda principles reaffirmed at January 2003 WT/NIH meeting
– Pre-release of data for all “community projects”
– Nature 421, 875 (2003)
– NHGRI:
• http://www.genome.gov/page.cfm?pageID=10506376
– WT:
• http://www.wellcome.ac.uk/en/1/awtpubrepdat.html
• Benefits of Open Data Access supported by OECD report
– http://dataaccess.ucsd.edu
Accessing the Genome
• Genome sequences are becoming available very rapidly
– Large and difficult to handle computationally
– Everyone expects to be able to access them immediately
• Bench Biologists
– Has my gene been sequenced?
– What are the genes in this region?
– Where are all the GPCRs?
– Connect the genome to other resources
• Research Bioinformatics
– Give me a dataset of human genomic DNA
– Give me a protein dataset
Getting information out
• Search/browse to find the gene or region.
• Export formats:
– Screen shot
– FASTA sequence.
– GenBank file with features annotated
– Feature list (GFF, tab-delimited text)
– PIP (percent identity plot: plot of sequence identity between organisms).
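A feature list exported as GFF can be read with a few lines of code. The sketch below assumes the common nine-column, tab-delimited GFF layout; the `Feature` record, the helper names, and the example rows are illustrative, not taken from any particular server's export.

```python
from collections import namedtuple

# One record per GFF row; field names follow the usual nine-column layout.
Feature = namedtuple(
    "Feature", "seqid source type start end score strand phase attributes"
)

def parse_gff(lines):
    """Yield Feature records from tab-delimited GFF lines, skipping comments."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # ignore rows that do not have nine columns
        seqid, source, ftype, start, end, score, strand, phase, attrs = cols
        yield Feature(seqid, source, ftype, int(start), int(end),
                      score, strand, phase, attrs)

def features_in_region(features, seqid, start, end):
    """Return features on seqid that overlap the closed interval [start, end]."""
    return [f for f in features
            if f.seqid == seqid and f.start <= end and f.end >= start]

# Made-up example rows, for illustration only.
EXAMPLE_GFF = [
    "##gff-version 3",
    "chr1\texample\tgene\t1000\t5000\t.\t+\t.\tID=geneA",
    "chr1\texample\tgene\t8000\t9000\t.\t-\t.\tID=geneB",
    "chr2\texample\tgene\t100\t900\t.\t+\t.\tID=geneC",
]

hits = features_in_region(parse_gff(EXAMPLE_GFF), "chr1", 4000, 8500)
print([f.attributes for f in hits])
```

This answers the bench biologist's question "What are the genes in this region?" from a downloaded file rather than from the browser.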
Challenges
• Scale and data flow
– Mainly engineering problems
• Presentation, ease of use
– Engineering problems
– User interface design
• Algorithmic
– Partly engineering
– Partly research
NCBI sequence assembly
(sequence chromosome)
• Remove contaminants
• Bin by chromosome arms
• Sequence Layout
• Sequence Building
• Place on chromosomes
NCBI sequence assembly - a modified greedy approach
[Diagram labels: BAC Sequence, Fragments, Assemble, Order, NCBI Contig]
Sequence Layout
•Curated Finished Regions
•Curated assembly instructions
•MegaBLAST hits
•Consider clone order
•BAC chromosome assignment
•annotation
•STS markers
•personal communication
•Remove conflicting overlaps, redundant BACs
Sequence Building
•Consider fragment:fragment sequence overlaps for each BAC pair in layout
•Meld overlapping sequence
•Order and Orient (o+o):
•alignments (mRNA, EST)
•BAC annotation
•paired plasmid reads
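The idea of greedily melding overlapping sequence can be illustrated with the textbook greedy merge: repeatedly join the two fragments with the longest exact suffix-prefix overlap. This is a minimal sketch of the general technique, not NCBI's modified procedure; it ignores sequencing errors, reverse-complement overlaps, clone order, and curated instructions, and the function names and `min_len` threshold are assumptions made for the example.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a equal to a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(fragments, min_len=3):
    """Merge fragments greedily by longest exact overlap until none remain."""
    frags = list(fragments)
    while len(frags) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # no remaining overlaps: leave the rest as separate contigs
        merged = frags[best_i] + frags[best_j][best_len:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags

contigs = greedy_assemble(["ATGCGT", "CGTACC", "ACCTTG"])
print(contigs)
```

Fragments with no sufficient overlap are left as separate contigs, which is where order-and-orient evidence (mRNA and EST alignments, paired reads) would come in.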
NCBI Genome Build Process
[Flowchart; only the labels survive]
Input Resources: dbSNP, STS, Clones, GenomeScan, GenBank, LocusLink, RefSeq (Collaboration, Curation)
Input Data: Sequences, Curated NTs, TPF, BLAST hits; Exclude Problem accessions
Stages: Freeze, Assembly & Contig Build, Annotation, Analysis & Review (Corrections for next build), Prepare for release (Update: Links, LocusLink, gi’s), Public Release
Public Release: Sequences (contig, mRNA, protein), Map Viewer, FTP, BLAST, Resource Updates
What is being annotated?
Feature: Method
Genes: By alignment, by prediction
Markers: By ePCR
Variation: By alignment
Clones/Cytogenetic location: By alignment (BAC ends)
Phenotype (MIM): Via Gene identification, associated markers
Cytogenetic Position: By annotated BAC-END sequenced clones; by FISH-mapped clones used in assembly
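Placing markers "by ePCR" means finding both primers of an STS in the sequence, in the right orientation and at roughly the expected product size. The sketch below shows that idea under simplifying assumptions: exact primer matches only, forward strand only, and made-up sequences. It is not the e-PCR program itself, which also tolerates mismatches.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))

def find_all(seq, sub):
    """Yield every 0-based start position of sub in seq."""
    pos = seq.find(sub)
    while pos != -1:
        yield pos
        pos = seq.find(sub, pos + 1)

def epcr_hits(seq, fwd, rev, min_size, max_size):
    """Return (start, end, size) for each exact-match product.

    start is 0-based, end is exclusive; only the forward strand is searched.
    """
    seq = seq.upper()
    rev_site = reverse_complement(rev)  # where the reverse primer anneals
    hits = []
    for s in find_all(seq, fwd.upper()):
        for r in find_all(seq, rev_site):
            end = r + len(rev_site)
            size = end - s
            if r >= s and min_size <= size <= max_size:
                hits.append((s, end, size))
    return hits

# Made-up contig and primer pair, for illustration only.
contig = "GGGG" + "ACGTAC" + "TTTTTTTTTT" + "GATTCC" + "AAAA"
marker_hits = epcr_hits(contig, fwd="ACGTAC", rev="GGAATC",
                        min_size=15, max_size=30)
print(marker_hits)
```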
RefSeq: a reagent for Contig Annotation
[Diagram: RefSeq mRNAs, GenBank mRNAs, ESTs, TBLASTN, RPSBLAST and GenomeScan applied to the genome]
Potential Problems:
•Gene Families
•Partial
•Chimeric
•Intron read-through
•Linker
•Vector
•Wrong organism
RefSeq Advantages:
•Separate Gene Families
•Not Partial
•Means to correct problem sequences
RefSeq process results in excluding problem GenBank sequences from annotation pipeline
NCBI: Products of annotation
• RefSeqs (transcripts, proteins)
• Gene id (LocusID)
• features in chromosome coordinates
• features in contig (NT accession) coordinates
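The two coordinate systems are related by where each contig is placed on the chromosome and in which orientation. A minimal sketch of the conversion, assuming 1-based inclusive coordinates and an ungapped placement; the function names and the numbers in the example are hypothetical.

```python
def contig_to_chrom(pos, contig_start, contig_len, orientation="+"):
    """Map a 1-based contig position to a 1-based chromosome position.

    contig_start is the chromosome coordinate of the contig's leftmost base.
    """
    if not 1 <= pos <= contig_len:
        raise ValueError("position lies outside the contig")
    if orientation == "+":
        return contig_start + pos - 1
    # Contig placed on the minus strand: position 1 maps to the right end.
    return contig_start + contig_len - pos

def feature_to_chrom(start, end, strand, contig_start, contig_len, orientation):
    """Convert a feature (start, end, strand) from contig to chromosome coordinates."""
    a = contig_to_chrom(start, contig_start, contig_len, orientation)
    b = contig_to_chrom(end, contig_start, contig_len, orientation)
    if orientation == "-":
        strand = "-" if strand == "+" else "+"
    return min(a, b), max(a, b), strand

# Hypothetical contig: 50,000 bp, leftmost base at chromosome position 1,000,001.
plus_placed = feature_to_chrom(101, 200, "+", 1_000_001, 50_000, "+")
minus_placed = feature_to_chrom(101, 200, "+", 1_000_001, 50_000, "-")
print(plus_placed, minus_placed)
```

Keeping features in contig coordinates means only the contig placements need updating when contigs are reordered or reoriented in a later build.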
Available in:
• Map Viewer
– Graphical display
– Tabular display
– Sequence downloads
• FTP
– RefSeqs (contigs, transcripts, proteins)
– Mapping Data
– LocusLink & Other resources
NCBI Map Viewer
NCBI Map Viewer: Tabular report
Genes in regions of conserved synteny
Anchored by human gene order
Anchored by mouse gene order
Query by sequence: Review the alignment
A click away:
•Alignments (BLAST hit)
•Gene Description (LocusLink)
•Report of all features in the region
•Contig sequence
•Sequence in the region
•Other mRNAs aligning in the region
•Define your own gene model based on alignments in the region
Quality Control - Genome review
• Is the sequence correct?
• Is the feature correctly placed?
• Is there a feature that should be placed?
• Are the attributes of the feature correct?
Approaches:
•In-house analysis & review (manual curation)
•Shared information (UCSC/Ensembl)
•Solicited review by experts in local regions
Ensembl Analysis
• Set of high quality gene predictions
– From known human mRNAs aligned against genome
– From similar protein and mRNAs aligned against genome
– From Genscan predictions confirmed via BLAST of protein, cDNA, and EST databases.
• Initial functional annotation from Interpro
• Integration with external resources (SNPs, SAGE, OMIM)
• Comparative analysis between mouse/human
– DNA sequence alignment
– Protein orthologs
Ensembl prediction pipeline
[Pipeline diagram, top to bottom]
DNA
RepeatMasker
Genscan
Blast Genscan peptides v protein, Unigene, EST, vertebrate mRNA; Pmatch all human proteins and cDNAs
MiniGenewise
MiniEst2genome
Genes
Genome Annotation
The generic structure of an automatic genome annotation pipeline and delivery system
[Browser screenshot labels: Chromosome; Overview (Genes and Markers, 1Mb); Configuration; Detailed View (Genes, ESTs, CpG etc., 100kb)]
Useful genomic annotation and browser URLs
Automated annotation pipelines
EBI/Sanger Institute Ensembl Project: http://www.ensembl.org/Homo_sapiens/
NCBI Human Genome Browser:
http://proxy.library.uiuc.edu:3367/genome/guide/human/
The Oak Ridge National Laboratories Genome Channel:
http://compbio.ornl.gov/channel/
Celera Discovery System: http://cds.celera.com/
Incyte Genomics - Genomics Knowledge Platform:
http://www.incyte.com/incyte_science/technology/gkp/
Paracel GeneMatcher2 System: http://www.paracel.com/products/gm2.html
Human genome browsers
UCSC Human Genome Browser: http://genome.cse.ucsc.edu/cgi-bin/hgGateway/
Softberry Genome Explorer: http://www.softberry.com/berry.phtml?topic=genomexp
Viaken Enterprise Ensembl Solution:
http://www.viaken.com/ns/solutions/ensembl.html
LabBook Inc. Genomic Explorer Suite:
http://www.labbook.com/products/ExplorerSuite.asp
University of Tokyo Gene Resource Locator Browser: http://grl.gi.k.u-tokyo.ac.jp/
Other useful sites
The Institute for Genomic Research (TIGR): http://www.tigr.org/
Human Genome Central: http://www.ensembl.org/genome/central/ and
http://proxy.library.uiuc.edu:3528/genome/central/
Genome annotation issues
•Annotation servers.
•Pro: make genomics information accessible to biologists without expert bioinformatics skills.
•Con: makes it difficult to perform large-scale data mining.
•Solution: enable more experienced users to retrieve the data they require and to run analyses locally.
•Open annotation systems.
•Biologists need to have access to annotations available in the community and to share their own contributions with the community.
•A common protocol between systems that enables genome data to be freely exchanged
•AGAVE (Architecture for Genomic Annotation, Visualization and Exchange)
•Distributed Annotation System (DAS) projects
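A common protocol lets a client ask any participating server for the features in a region using the same request. The sketch below builds such a request in the style of DAS 1.x, assuming its `features` command with a `segment=reference:start,stop` parameter; the server URL and data source name are hypothetical placeholders, and no network request is made.

```python
from urllib.parse import urlencode

def das_features_url(server, source, reference, start, stop):
    """Build a DAS-style features request URL for one genomic segment."""
    segment = f"{reference}:{start},{stop}"
    # Keep ':' and ',' unescaped so the segment stays readable.
    query = urlencode({"segment": segment}, safe=":,")
    return f"{server.rstrip('/')}/{source}/features?{query}"

# Hypothetical server and data source names, for illustration only.
url = das_features_url("http://das.example.org/das", "example_source",
                       "1", 1000, 2000)
print(url)
```

Because every server answers the same kind of request, a client can overlay annotations from several groups on one reference sequence.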
Genome annotation servers
• Several ways to find information:
– Search by clone, gene, EST, marker.
– Browse sequence.
– BLAST searches.
– Homology: start in one organism, jump to the syntenic region of another.
UCSC Genome Browser
http://genome.ucsc.edu/cgi-bin/hgGateway