How to access genomic information using Ensembl
Damian Smedley and Xosé Fernández
Ensembl Project, European Bioinformatics Institute, Cambridge, UK
November 2004
Schedule
Today
• Introduction to the Ensembl system
• Hands-on examples to introduce the system
• Evaluating genes and transcripts
• Variation in Ensembl (SNPs, haplotypes)
Tomorrow
• Data mining with EnsMart
• Comparative genomics and proteomics in Ensembl
• BioMart
• Advanced topics (upload your own data, DAS)
Our goal
Assembly
From 325,109 initial contigs to 26,720 overlapping clones
Other ordering data
Non-redundant, “virtual contig” view
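As a rough illustration of how overlapping clones can collapse into a non-redundant view, the Python sketch below merges overlapping intervals on a single chromosome. It is a toy under stated assumptions, not Ensembl's assembly method: the `merge_overlapping` helper and the clone coordinates are hypothetical and were chosen only to show the idea.

```python
from typing import List, Tuple

# (start, end) in base pairs, 1-based inclusive; hypothetical coordinates
Interval = Tuple[int, int]


def merge_overlapping(clones: List[Interval]) -> List[Interval]:
    """Collapse overlapping intervals into non-redundant spans."""
    merged: List[Interval] = []
    for start, end in sorted(clones):
        if merged and start <= merged[-1][1]:
            # This clone overlaps the previous span: extend that span.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            # No overlap: start a new span.
            merged.append((start, end))
    return merged


# Four hypothetical BAC-sized clones; the first three overlap each other.
clones = [(1, 150_000), (120_001, 270_000), (260_000, 400_000), (500_001, 650_000)]
virtual_contigs = merge_overlapping(clones)
print(virtual_contigs)
```

Here the first three clones form one continuous span and the fourth stays separate, so every base is represented once even though the underlying clones overlap.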
Mapping and Sequencing the human genome
[Figure: mapping and sequencing pipeline. Labels: BACs (bacterial artificial chromosomes, avg size 150 kb); fragment; map; pUCs (avg size 2-4 kb); WGS; sequence; assembly; draft; finished BAC]
Shizuya et al 1992; Dib et al 1996; Deloukas et al 1998; Osoegawa et al 2001
Bentley et al 2001; Bruls et al 2001; McPherson et al 2001; Montgomery et al 2001; Tilford et al 2001
Status of the human sequence
• Finished (red/orange): ~96% (99.999% accurate)
• 30-40% repetitive elements (e.g. alpha satellite, Alu repeats)
• Nearly all known genes correctly identified (99.74%)
• Heterochromatin (grey): ~4%
Assembled draft sequence totals 2.85 Gb
Human genome: Current status
• 22,287 ‘gene loci’ defined, consisting of 19,599 protein-coding genes in the human genome and 2,188 additional DNA segments ‘predicted’ to be protein-coding genes
 – 1,183 genes ‘were born’ in the last 60-100 My
 – ~30 genes ‘died’ in a similar time period
Finishing the euchromatic sequence of the human genome, Nature 431:931-45 (2004)
Ensembl - project aims
• funded to provide metazoan genomes to the world
• aims to provide the world’s best automated genome annotation
• a leading group for human and mouse analysis
• all software, data and results freely available
Ensembl - project background
• group split between EBI and Sanger
• mainly Wellcome Trust funded
• largest dedicated compute in biology in Europe
• developer community > 100 people, including companies
Ensembl – Open source
Freely available
Community development
 – >51 Ensembl installs worldwide
 – Both public and commercial, e.g. Gramene (CSHL), Fugu-sg (IMCB), Ciona-sg (Temasek)
Ensembl
[Figure: Ensembl system overview. Labels: Final DB; Supporting Databases (SNP, Manual Annotation); Analysis DB; CPU]
Genome browsing
Why present the whole genome?
• Explore what is in a chromosome region
• See features in and around a specific gene
• Search & retrieve across the whole genome
• Investigate genome organization
• Compare to other genomes
Genome browsers
• Ensembl (public site + installable system): http://www.ensembl.org
• UCSC Human Genome Browser: http://genome.ucsc.edu
• NCBI Map Viewer: http://www.ncbi.nlm.nih.gov/mapview
Introduction to the Ensembl web site
Ensembl …
• takes genomic sequence assemblies (human build 34, mouse, rat, Fugu, mosquito)
• adds annotation and links
 – automated process
• presents all the data on a web site
Annotation: genes
Known genes
• where?
• genomic structure?
• transcript(s)?
• protein(s)?
• orthologues?
• attach useful links
Novel genes
• how to predict? (requires evidence)
• transcript(s)?
• protein(s)?
• orthologues?
• attach useful links
Annotation: other features
• markers and SNPs
• cytogenetic bands
• repeated sequences
• ESTs & other sequence records
 – where do they show sequence similarity?
• regions homologous to other species
How to get started …
• Species homepage
• Site map
• MapView
• Text search
• BLAST
• SSAHA
• DiseaseView
Homepage
Site map
MapView
AnchorView
BLAST and SSAHA
Regions, maps and markers
ContigView
CytoView
SyntenyView
MultiContigView
MarkerView
SNPView
ContigView
ContigView close-up
• Customising & short cuts
• Evidence
• Transcripts: red & black (Ensembl predictions), blue (Vega)
• Pop-up menu
ContigView - Chromosome 20 close-up
Manual annotation via Vega
Ensembl predictions
Other chromosomes with manual annotation from http://vega.sanger.ac.uk: 6, 7, 9, 10, 13, 14, 20, 22, X
Forward strand
Reverse strand
Ensembl EST-based predictions
CytoView
GeneSNPView
MarkerView
SNPView
SyntenyView
MultiContigView
Genes & gene products
GeneView
TransView
ExonView
ProteinView
GOView
DiseaseView
FamilyView
DomainView
GeneView
TransView
ExonView
ProteinView
FamilyView
GOView
DiseaseView
Data retrieval
EnsMart
ExportView
Data sets on FTP site
MySQL queries of databases
Perl API access to databases
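As a rough illustration of the SQL route listed above, the sketch below runs a region query of the kind one might send to a database of gene locations. It uses Python's built-in `sqlite3` module as a stand-in for a MySQL server so that it runs anywhere. Everything specific here is an assumption made for illustration: the `toy_gene` table, its columns, the `genes_in_region` helper and the gene identifiers are hypothetical and do not reflect the actual Ensembl schema.

```python
import sqlite3

# In-memory database standing in for a MySQL server (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_gene ("
    "stable_id TEXT, chromosome TEXT, start_pos INTEGER, end_pos INTEGER, strand INTEGER)"
)

# Hypothetical genes; identifiers and coordinates are made up.
rows = [
    ("GENE_A", "20", 1000, 3000, 1),
    ("GENE_B", "20", 4500, 9000, -1),
    ("GENE_C", "20", 12000, 15000, 1),
    ("GENE_D", "X", 2000, 2500, 1),
]
conn.executemany("INSERT INTO toy_gene VALUES (?, ?, ?, ?, ?)", rows)


def genes_in_region(conn, chromosome, start, end):
    """Return IDs of genes overlapping chromosome:start-end, ordered by start."""
    cursor = conn.execute(
        "SELECT stable_id FROM toy_gene "
        "WHERE chromosome = ? AND start_pos <= ? AND end_pos >= ? "
        "ORDER BY start_pos",
        (chromosome, end, start),
    )
    return [row[0] for row in cursor]


hits = genes_in_region(conn, "20", 2500, 5000)
print(hits)
```

The overlap test (gene start at or before the region end, gene end at or after the region start) is the usual way to select features that touch a region rather than only those fully contained in it.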
ExportView
EnsMart
Mouse differences
• Genomic sequence assembly based on whole genome shotgun, with finished ‘stitched’ BACs
• BACs are shown in CytoView (FPC map), but for most no sequence is available
Mouse CytoView
Help!
• click for context-sensitive help pages
• access other documentation via generic home page
• email the helpdesk
HelpDesk / Suggestions
Thanks
Ensembl Team
Database Schema and Core API: Arne Stabenau, Yuan Chen, Ian Longden, Craig Melsopp, Glenn Proctor, Daniel Ríos, Guy Slater
Distributed Annotation System: Andreas Kähäri
Project Leader: Ewan Birney (EBI), Tim Hubbard (Sanger)
Ensembl Web Team: James Stalker, Fiona Cunningham, James Smith
Analysis and Annotation Pipeline: Val Curwen, Steve Searle, Dan Andrews, Mario Caccamo, Laura Clarke, Martin Hammond, Jan Hinnerck-Vogel, Kevin Howe, Vivek Iyer, Kerstin Jekosch, Felix Kokocinski, Simon White
Vega Web Team: Patrick Meidl, Steve Trevianon
User Support: Xosé Mª Fernández, Michael Schuster
Comparative Genomics: Abel Ureta-Vidal, Javier Herrero Sánchez, Jessica Severin, Cara Woodwark
EnsMart & BioMart: Arek Kasprzyk, Damian Keefe, Darin London, Damian Smedley