National Center for Biotechnology Information
NCBI FieldGuide
A Field Guide to GenBank
and NCBI’s Molecular Biology Resources
January 30, 2007 Washington University, St. Louis
ftp://ftp.ncbi.nih.gov/pub/FieldGuide/Slides/Current/WashU.01.30.067/Jan30
Topics
About NCBI
GenBank overview
Primary vs derivative databases
The Reference Sequence (RefSeq) project
The Entrez engine and databases
-break-
Entrez text searching
Genomic resources
Sequence similarity - BLAST
An integrated example
The National Institutes of Health
Bethesda, MD
The National Center for Biotechnology Information
Accepts submissions of primary data
Develops tools to analyze these data
Creates derivative databases based on the primary data
Provides free search, link, and retrieval of these data, primarily through the Entrez system
NCBI Web Traffic
[Chart: users per day, 1998-2005, y-axis 100,000 to 600,000; annotation: "Christmas and New Year’s Day"]
[Pie chart: U.S. (.com, .net, .org, .gov, .us) 40%; Other 14%; Japan 6%; Italy 4%; Canada 3%; Germany 3%; United Kingdom 3%; Netherlands 2%; Spain 2%; Brazil 2%; Sweden 1%; Switzerland 1%; Belgium 1%]
NCBI Web Usage – Users per Day, January 2007
[Chart]
NCBI Web Usage – Hits per Day, January 2007
[Chart]
Homepage - accessing the data
all[filter]
01/21/2007
9/19/2006
GenBank
Release 157 December 2006
83 x 10^6 Records
150 x 10^9 Nucleotides
254 Gb (non-WGS) 1072 files (non-WGS)
• full release every two months
• incremental and cumulative updates daily
• available only via ftp
• release notes: gbrel.txt
ftp://ftp.ncbi.nih.gov/genbank/
ftp://genbank.sdsc.edu/pub
ftp://bio-mirror.net/biomirror/genbank
The Growth of GenBank
[Chart: bases (billions, 0 to 160) by date, Aug-97 to Aug-06]
Release 157
WGS: 81.6 billion bases
Non-WGS: 69.0 billion bases
Doubling time 12-14 months
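The doubling time quoted on this slide implies simple exponential growth. The sketch below shows the arithmetic, assuming the 12-14 month doubling time holds constant (a simplification); the function name `projected_bases` is illustrative and not part of any NCBI tool.

```python
def projected_bases(current_bases: float, months_ahead: float,
                    doubling_months: float) -> float:
    """Project database size assuming a constant doubling time."""
    if doubling_months <= 0:
        raise ValueError("doubling_months must be positive")
    return current_bases * 2 ** (months_ahead / doubling_months)


# Release 157 totals taken from the slide: WGS plus non-WGS bases.
release_157_total = 81.6e9 + 69.0e9

if __name__ == "__main__":
    # Bracket a two-year projection with the slow and fast doubling times.
    for doubling in (14, 12):
        size = projected_bases(release_157_total, 24, doubling)
        print(f"doubling every {doubling} months: {size / 1e9:.0f} billion bases")
```

The two doubling times give a lower and an upper bound rather than a single forecast.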
What is GenBank?
Nucleotide only sequence database
Archival in nature
Historical
Reflective of submitter point of view (subjective)
Redundant
GenBank Data
Direct submissions (traditional records)
Batch submissions (EST, GSS, STS)
ftp accounts (genome data)
Three collaborating databases
GenBank
DNA Database of Japan (DDBJ)
European Molecular Biology Laboratory (EMBL) Database
GenBank Divisions

“Organismal” (Traditional)
• Organized by taxonomy (sort of)
• Direct submissions (Sequin/BankIt)
• Accurate (~1 error per 10,000 bp)
• Well characterized

PRI (28)   Primate
ROD (15)   Rodent
PLN (20)   Plant and Fungal
BCT (18)   Bacterial/Archaeal
INV (7)    Invertebrate
VRT (7)    Other Vertebrate
VRL (4)    Viral
MAM (2)    Mammalian
PHG (1)    Phage
SYN (1)    Synthetic
ENV (4)    Envir. samples
UNA (1)    Unannotated

“Functional” (Bulk)
• Organized by sequence type
• Batch submissions (ftp/email)
• Less accurate
• Poorly characterized

EST (570)  Expressed Sequence Tag
GSS (197)  Genome Survey Sequence
HTG (88)   High Throughput Genomic
PAT (27)   Patent
STS (9)    Sequence Tagged Site
CON (1)    Contigs, virtual
GenBank Functional (Bulk) Divisions
EST  Expressed Sequence Tag: 1st pass single read cDNA
GSS  Genome Survey Sequence: 1st pass single read gDNA
HTG  High Throughput Genomic: incomplete sequences of genomic clones
STS  Sequence Tagged Site: PCR-based mapping reagents
WGS  Whole Genome Shotgun
EST Division: Expressed Sequence Tags
[Diagram: nucleus (30,000 genes) → RNA gene products → make cDNA library (80-100,000 unique cDNA clones in library) → isolate unique clones → sequence once from each end (5’ and 3’)]

>IMAGE:275615 5' mRNA sequence
GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGG
TGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAGAATGGAAAGTCAA
TTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAGTTGACTTACTGAAGAATGGAGA
GAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTGTACTAC
TGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNC
AAGTTNAGTTTAAGTGGGNATCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCN
TTGGATTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG
>IMAGE:275615 3', mRNA sequence
NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTA
TTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTCATTCATTATAACAAATTT
AATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTT
CTTATGCACGCTTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAG
GTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANGTTGCCAGCCCTC
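The two EST reads above contain `N` characters, which mark bases that could not be called in a single-pass read. As a rough illustration, the sketch below parses FASTA text and reports the fraction of ambiguous bases per record. The function names are illustrative, and the sample uses only the final line of the 5' read rather than the whole record.

```python
from typing import Iterator, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def ambiguity_fraction(seq: str) -> float:
    """Fraction of positions reported as N (uncalled base)."""
    return seq.upper().count("N") / len(seq) if seq else 0.0


if __name__ == "__main__":
    # Excerpt only: the last line of the 5' read shown on the slide.
    sample = (
        ">IMAGE:275615 5' mRNA sequence\n"
        "TTGGATTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG\n"
    )
    for name, seq in parse_fasta(sample):
        print(name, len(seq), f"{ambiguity_fraction(seq):.3f}")
```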
GenBank Bulk Sequence: EST
poorly characterized
GSS, HTG, WGS
[Diagram: whole BAC insert (or genome) → shred → isolate clones → sequence → GSS division or trace archive → assembly → draft sequence (HTG division) or whole genome shotgun assemblies (wgs projects)]
HTG Example: Honeybee Draft Sequences
LOCUS AC141845 147720 bp DNA linear HTG 19-MAR-2004
DEFINITION Apis mellifera clone CH224-4A2, WORKING DRAFT SEQUENCE,
14 unordered pieces.
ACCESSION AC141845
VERSION AC141845.1 GI:29124029
KEYWORDS HTG; HTGS_PHASE1; HTGS_DRAFT.
• Unfinished sequences of BACs
• Gaps and unordered pieces
• Finished sequences (Phase 3) move to
traditional GenBank division
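The LOCUS and VERSION lines in the honeybee example carry the division code, sequence length, accession, and GI number. A minimal sketch of pulling those fields out follows. It assumes the whitespace-separated, eight-token LOCUS layout shown above; real flat files are column-positional and can vary, so treat this as illustrative rather than a general parser. The function names are hypothetical.

```python
from typing import Dict, Optional


def parse_locus(line: str) -> Dict[str, object]:
    """Split a LOCUS line of the form shown on the slide into named fields."""
    fields = line.split()
    if len(fields) != 8 or fields[0] != "LOCUS" or fields[3] != "bp":
        raise ValueError(f"unexpected LOCUS layout: {line!r}")
    return {
        "name": fields[1],
        "length": int(fields[2]),
        "molecule": fields[4],
        "topology": fields[5],
        "division": fields[6],   # e.g. HTG, EST, PRI
        "date": fields[7],
    }


def parse_version(line: str) -> Dict[str, Optional[object]]:
    """Extract accession, version number, and GI from a VERSION line."""
    fields = line.split()
    if len(fields) < 2 or fields[0] != "VERSION":
        raise ValueError(f"unexpected VERSION layout: {line!r}")
    accession, _, version = fields[1].partition(".")
    gi = None
    for token in fields[2:]:
        if token.startswith("GI:"):
            gi = int(token[3:])
    return {"accession": accession,
            "version": int(version) if version else None,
            "gi": gi}


if __name__ == "__main__":
    print(parse_locus("LOCUS AC141845 147720 bp DNA linear HTG 19-MAR-2004"))
    print(parse_version("VERSION AC141845.1 GI:29124029"))
```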
Whole Genome Shotgun Projects
685 projects
Bacteria (320)
Environmental sequences (14)
Archaea (8)
Eukaryotes (140), including:
Chicken, Rat, Mouse, Dog (2), Chimpanzee, Human
Pufferfish (2)
Honeybee, Anopheles, Fruit Flies (3), Silkworm
Nematode (2)
Yeasts (8), Aspergillus (2)
Rice (2)
Whole Genome Shotgun (WGS) Projects
wgs master[properties]
ftp://ftp.ncbi.nih.gov/genbank/wgs/
Derivative Databases
GenBank: updated by submitters (sequencing centers, labs)
  Bulk divisions: EST, STS, HTG, GSS
  Traditional divisions: INV VRT PHG VRL PRI ROD PLN MAM BCT
Derivative databases: updated ONLY by NCBI
  UniGene
  UniSTS
  RefSeq: Entrez Gene and annotation pipelines
Why Make Reference Sequences?
Entrez Nucleotide query:
human[organism] AND lipase[title]
human[organism] AND lipase[title] AND endothelial[title]
[Screenshot: records of 3927 bp, 4150 bp, 2323 bp, 3927 bp, 261 bp]
RefSeq Benefits
genomes
transcripts
proteins
• non-redundant; best representative
• updates to reflect current sequence data and biology
• distinct, stable accession series
Reference Sequence: RefSeq
Accession Sequence Type
NM_123456789 mRNA
NP_123456789 protein, from NM_
NR_123456 non-coding RNA
XM_123456 predicted mRNA
XP_123456 predicted protein
XR_123456 predicted non-coding RNA
ZP_12345678 predicted from NZ_
NC_123456 genomic, e.g., chromosomes
NG_123455 genomic, incomplete region
NT_123456 genomic, BAC assembly
NW_123456 genomic, WGS assembly
NZ_ABCD12345678 genomic, WGS collection
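Because the RefSeq prefix encodes the sequence type, the table above can be read as a lookup. The sketch below maps an accession to the description given on the slide. The function name `refseq_type` is illustrative, and the table covers only the prefixes listed here.

```python
from typing import Optional

# Prefix-to-type mapping copied from the accession table on this slide.
REFSEQ_PREFIXES = {
    "NM": "mRNA",
    "NP": "protein, from NM_",
    "NR": "non-coding RNA",
    "XM": "predicted mRNA",
    "XP": "predicted protein",
    "XR": "predicted non-coding RNA",
    "ZP": "predicted from NZ_",
    "NC": "genomic, e.g., chromosomes",
    "NG": "genomic, incomplete region",
    "NT": "genomic, BAC assembly",
    "NW": "genomic, WGS assembly",
    "NZ": "genomic, WGS collection",
}


def refseq_type(accession: str) -> Optional[str]:
    """Return the slide's description for a RefSeq accession, else None."""
    prefix, sep, rest = accession.partition("_")
    if not sep or not rest:
        return None  # no underscore: not a RefSeq-style accession
    return REFSEQ_PREFIXES.get(prefix.upper())


if __name__ == "__main__":
    for acc in ("NM_123456789", "XP_123456", "AC141845"):
        print(acc, "->", refseq_type(acc))
```

An accession without the two-letter prefix and underscore, such as the GenBank accession AC141845, returns `None`.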
Annotation Process
Genomic DNA (NC, NT, NW)
Scanning....
Model mRNA (XM), Model protein (XP), (XR)
Curated mRNA (NM), Curated Protein (NP), (NR)
RefSeq
GenBank Sequences
Creating NM_ Records
Genome annotation
NM’s must have
cDNA support
transcript variant 1
transcript variant 2
transcript variant 3
Longest mRNA
entrez
The Entrez System
[Diagram: Entrez at the center, linked to the following databases]
Gene, UniGene, CancerChromosomes, UniSTS, HomoloGene, SNP, Genome, PopSet, Nucleotide, GEO, Books, PubMed, Taxonomy, GENSAT, MeSH, Probe, OMIM, PubChem, Protein, PMC, Journals, Structure, Domains, 3D Domains
Entrez Databases
● All Molecular Database entries are organized
by organism (Taxonomy Database).
● Each record is assigned a UID.
A “unique integer identifier” for internal tracking
● Each record is indexed by data fields.
[author], [title], [organism], and many others
● Each record is given a Document Summary.
a summary of the record’s content (DocSum)
● Each record is manually or computationally
assigned links to biologically related UIDs in
and across databases.
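Fielded queries such as `human[organism] AND lipase[title]` can also be sent programmatically. The sketch below composes such a query and builds a request URL for NCBI's E-utilities ESearch service, which returns matching UIDs. E-utilities is not covered in these slides, so the endpoint, the parameter names (`db`, `term`, `retmax`), and the database name `nucleotide` are stated here as assumptions to check against current NCBI documentation. The code only builds the URL and makes no network request.

```python
from typing import Tuple
from urllib.parse import urlencode

# Assumed endpoint for the E-utilities ESearch service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def fielded_query(*terms: Tuple[str, str]) -> str:
    """Join (value, field) pairs into an Entrez query, e.g. human[organism]."""
    return " AND ".join(f"{value}[{field}]" for value, field in terms)


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build (but do not send) an ESearch request URL."""
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


if __name__ == "__main__":
    query = fielded_query(("human", "organism"), ("lipase", "title"))
    print(query)
    print(esearch_url("nucleotide", query))
```

Adding a third pair such as `("endothelial", "title")` narrows the search in the same way as the longer query on the RefSeq slides.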
Entrez Links
Links
GeneView in dbSNP
Entrez Databases
UniGene Clusters of ESTs, mRNAs
dbSNP Single Nucleotide Polymorphisms
…and more
CDD Conserved Domain Database
protein families (COGs and KOGs)
single domains (PFAM, SMART, CD)
UniGene
Gene-oriented clusters of expressed sequences
• Automatic clustering using MegaBlast
• Each cluster represents a unique gene
• Informed by genome hits
• Information on tissue types and map locations
• Useful for gene discovery and selection of mapping
reagents
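The idea of grouping expressed sequences into gene-oriented clusters can be illustrated with single-linkage clustering over pairwise hits. The sketch below is a toy version using union-find and is not UniGene's actual procedure. Each pair in `hits` stands in for a significant MegaBlast match, and all sequence identifiers are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def cluster_by_hits(ids: List[str],
                    hits: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Group sequence ids so any two linked by a chain of hits share a cluster."""
    parent: Dict[str, str] = {i: i for i in ids}

    def find(x: str) -> str:
        # Follow parent pointers to the cluster root, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hits:  # every id in hits is assumed to appear in ids
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters: Dict[str, List[str]] = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    ids = ["est1", "est2", "est3", "mrna1", "est4"]
    hits = [("est1", "mrna1"), ("est2", "mrna1")]
    for members in cluster_by_hits(ids, hits):
        print(members)
```

Sequences with no hits remain as singleton clusters.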
A Cluster of ESTs
query
5’ EST hits
3’ EST hits
UniGene Collections
Species UniGene Entries
UniGene Hs build 194
UniGene Cluster Hs.95351
Lipase, hormone-sensitive (LIPE)
UniGene Cluster Hs.95351: expression
UniGene Cluster Hs.95351: seqs
Get Sequences
web page
ftp://ftp.ncbi.nih.gov/repository/UniGene/Homo_sapiens/
NCBI’s SNP Database
Primary and derivative (RefSNP)
Single nucleotide polymorphisms
Repeat polymorphisms
Insertion-deletion polymorphisms
Over 30 million refSNPs (rsXXXXXXX)
Searching dbSNP
RefSNP
Conserved Domain Database
Multiple sequence alignments
Position-specific scoring matrices (PSSM)
Sources SMART, PFAM, COGs, KOGs, and
NCBI curated domains (structure-informed
alignments)
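A position-specific scoring matrix assigns each residue a score at each alignment column, so a query can be scored by sliding a window along it. The sketch below shows that idea with a three-column toy matrix. The scores and the default penalty are made-up illustration values, not taken from any CDD model, and actual CD-Search scoring is more involved.

```python
from typing import Dict, List, Tuple

Pssm = List[Dict[str, int]]  # one residue-to-score mapping per column


def score_window(pssm: Pssm, window: str, default: int = -4) -> int:
    """Sum per-column scores for a window exactly as wide as the matrix."""
    if len(window) != len(pssm):
        raise ValueError("window length must equal PSSM width")
    return sum(col.get(res, default) for col, res in zip(pssm, window))


def best_hit(pssm: Pssm, sequence: str, default: int = -4) -> Tuple[int, int]:
    """Return (start position, score) of the best-scoring window."""
    width = len(pssm)
    if len(sequence) < width:
        raise ValueError("sequence shorter than PSSM")
    scored = [(score_window(pssm, sequence[i:i + width], default), i)
              for i in range(len(sequence) - width + 1)]
    score, start = max(scored, key=lambda pair: pair[0])
    return start, score


# Hypothetical toy matrix favouring the tripeptide C-N-S.
TOY_PSSM: Pssm = [{"C": 5}, {"N": 4, "S": 2}, {"S": 5}]

if __name__ == "__main__":
    print(best_hit(TOY_PSSM, "AACNSA"))
```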
CDD
Search Entrez CDD
pep*
CDD
Search with Protein Query
>gi|45549418|gb|AAS67634.1| ATP7A [Solenodon paradoxus]
IVYQPHLITVEEIKKQIKAVGFPAFIKKQPKYLKLGAIDIERLKNIPVKSSEGSQQMSPS
STNDSKVTLTIDGMHCNSCVSNIESALSTLHYVSSIVVSLQNKSAIIKYNANSVTPEIL
KKAIEAISPGQYRVSITSEVESTSNSPSSSSQKAPLNVVSQPLTQVTVININGMTCNS
CVQSIEGVMSKKAGVKSIQVSLANRNGTVEYDPLLTSPEILRE
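The definition line of the protein query packs several identifiers into one pipe-delimited string. The sketch below splits a defline of the form shown above into its GI number, source database tag, accession, version, title, and organism. It assumes exactly this `gi|...|db|accession.version| title [organism]` layout; other defline styles are not handled, and the function name is illustrative.

```python
from typing import Dict, Optional


def parse_gi_defline(defline: str) -> Dict[str, Optional[object]]:
    """Parse a 'gi|number|db|accession.version| title [organism]' defline."""
    parts = defline.lstrip(">").split("|", 4)
    if len(parts) != 5 or parts[0] != "gi":
        raise ValueError(f"unexpected defline layout: {defline!r}")
    _, gi, db, acc_ver, rest = parts
    accession, _, version = acc_ver.partition(".")
    title, organism = rest.strip(), None
    if title.endswith("]") and "[" in title:
        # The trailing bracketed text is taken to be the organism name.
        head, _, organism = title[:-1].rpartition("[")
        title = head.strip()
    return {
        "gi": int(gi),
        "db": db,
        "accession": accession,
        "version": int(version) if version else None,
        "title": title,
        "organism": organism,
    }


if __name__ == "__main__":
    print(parse_gi_defline(
        ">gi|45549418|gb|AAS67634.1| ATP7A [Solenodon paradoxus]"))
```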
CDD
Click on a colored bar to align your sequence to
the CD
CDD
Show Alignment
CDD
Full Result
CD
Pfam
COG
CDD
Domain Architecture
CDD
CDART: Conserved Domain Architecture Retrieval Tool
CDD
Show Structure
Structure – Cn3D
Topics
About NCBI
GenBank overview
Primary vs derivative databases
The Reference Sequence (RefSeq) project
Selected Entrez databases
Bookshelf
-break-
Entrez text searching
Selected genomic resources
Sequence similarity - BLAST
An integrated example
Literature Links
BOOKS Database
BOOKS Database: Hyperlinked Terms
For More Information…
Intermission