NCBI Field Guide
NCBI Molecular Biology
Resources
Part 1
February 2007
The National Center for
Biotechnology Information
Bethesda, MD
Created in 1988 as a part of the
National Library of Medicine at NIH
– Establish public databases
– Research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information
Web Access: www.ncbi.nlm.nih.gov
NCBI Databases and Services
• GenBank largest sequence database
• Free public access to biomedical literature
– PubMed free Medline
– PubMed Central full text online access
• Entrez integrated molecular and literature databases
• BLAST highest volume sequence search service
• VAST structure similarity searches
• Software and Databases
Types of Databases
• Primary Databases
– Original submissions by experimentalists
– Content controlled by the submitter
• Examples: GenBank, SNP, GEO
• Derivative Databases
– Built from primary data
– Content controlled by third party (NCBI)
• Examples: RefSeq, TPA, RefSNP, UniGene, NCBI Protein, Structure, Conserved Domain
Entrez Nucleotides
Primary
• GenBank / EMBL / DDBJ 86,011,283
Derivative
• RefSeq 1,512,656
• Third Party Annotation 5,254
• PDB 7,261
Total 87,536,454
What is GenBank?
NCBI’s Primary Sequence Database
• Nucleotide only sequence database
• Archival in nature
– Historical
– Reflective of submitter point of view (subjective)
– Redundant
• GenBank Data
– Direct submissions (traditional records)
– Batch submissions (EST, GSS, STS)
– ftp accounts (genome data)
• Three collaborating databases
– GenBank
– DNA Database of Japan (DDBJ)
– European Molecular Biology Laboratory (EMBL)
Database
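International Sequence
Database Collaboration
[Diagram: three collaborating databases, each accepting submissions and updates and exchanging records with the other two. GenBank (NCBI, NIH), searched through Entrez; EMBL (EBI), searched through SRS; DDBJ (CIB, NIG), searched through getentry.]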
GenBank: NCBI’s Primary Sequence
Database
Release 157 December 2006
83,434,665 Records
150,630,667,561 Total Bases
254 Gigabytes (non-WGS) 1072 files (non-WGS)
• full release every two months
• incremental updates daily
• available only via ftp
ftp://ftp.ncbi.nih.gov/genbank/
The Growth of GenBank
[Chart: bases (billions) by release date, Aug-97 to Aug-06, y-axis 0 to 160. At Release 157: WGS 81.6 billion bases, non-WGS 69.0 billion bases. Doubling time 12-14 months.]
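The 12-14 month doubling time quoted for GenBank implies exponential growth. As a rough sketch only (the function name and the assumption that the doubling time stays constant are illustrative and not from the slides), a base count can be projected forward like this:

```python
def project_bases(current_bases: float, months: float, doubling_months: float) -> float:
    """Project a base count forward, assuming a constant doubling time."""
    return current_bases * 2 ** (months / doubling_months)


# Release 157 total from the slides (December 2006).
RELEASE_157_BASES = 150_630_667_561

# The quoted 12 and 14 month doubling times bracket a two-year projection.
low = project_bases(RELEASE_157_BASES, months=24, doubling_months=14)
high = project_bases(RELEASE_157_BASES, months=24, doubling_months=12)
print(f"Projected total after 24 months: {low / 1e9:.0f} to {high / 1e9:.0f} billion bases")
```

The projection is only as good as the constant-doubling assumption; it is a way to read the chart, not a forecast.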
Organization of GenBank:
Traditional Divisions
Records are divided into 18 divisions: 12 traditional and 6 bulk.
Traditional Divisions:
• Direct submissions (Sequin and BankIt)
• Accurate
• Well characterized
PRI Primate
PLN Plant and Fungal
BCT Bacterial and Archaeal
INV Invertebrate
ROD Rodent
VRL Viral
VRT Other Vertebrate
MAM Mammalian
PHG Phage
SYN Synthetic (cloning vectors)
ENV Environmental Samples
UNA Unannotated
Entrez query: gbdiv_xxx[Properties]
Organization of GenBank:
Bulk Divisions
Records are divided into 18 divisions: 12 traditional and 6 bulk.
Bulk Divisions:
• Batch submission (email and FTP)
• Inaccurate
• Poorly characterized
EST Expressed Sequence Tag
GSS Genome Survey Sequence
HTG High Throughput Genomic
STS Sequence Tagged Site
HTC High Throughput cDNA
PAT Patent
Entrez query: gbdiv_xxx[Properties]
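The gbdiv_xxx[Properties] pattern can be generated from the three-letter division codes. The sketch below is illustrative: the helper names are hypothetical, and combining terms with NOT and OR assumes standard Entrez Boolean syntax.

```python
# Division codes as listed on the slides.
TRADITIONAL = {"PRI", "PLN", "BCT", "INV", "ROD", "VRL",
               "VRT", "MAM", "PHG", "SYN", "ENV", "UNA"}
BULK = {"EST", "GSS", "HTG", "STS", "HTC", "PAT"}


def division_term(code: str) -> str:
    """Return the Entrez Properties term for a GenBank division code."""
    code = code.upper()
    if code not in TRADITIONAL | BULK:
        raise ValueError(f"unknown GenBank division: {code}")
    return f"gbdiv_{code.lower()}[Properties]"


def exclude_bulk(query: str) -> str:
    """Restrict a query to traditional divisions by excluding every bulk division."""
    bulk_terms = " OR ".join(division_term(code) for code in sorted(BULK))
    return f"{query} NOT ({bulk_terms})"


print(division_term("EST"))
print(exclude_bulk("human[Organism]"))
```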
A Traditional GenBank Record
The Flatfile Format
Header
LOCUS       AY182241     1931 bp    mRNA    linear   PLN 04-MAY-2004
DEFINITION  Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA,
            complete cds.
ACCESSION   AY182241
VERSION     AY182241.2  GI:32265057
KEYWORDS    .
SOURCE      Malus x domestica (cultivated apple)
  ORGANISM  Malus x domestica
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
            rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.
REFERENCE   1  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Cloning and functional expression of an (E,E)-alpha-farnesene
            synthase cDNA from peel tissue of apple fruit
  JOURNAL   Planta 219, 84-94 (2004)
REFERENCE   2  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
REFERENCE   3  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
  REMARK    Sequence update by submitter
COMMENT     On Jun 26, 2003 this sequence version replaced gi:27804758.
Feature Table
FEATURES             Location/Qualifiers
     source          1..1931
                     /organism="Malus x domestica"
                     /mol_type="mRNA"
                     /cultivar="'Law Rome'"
                     /db_xref="taxon:3750"
                     /tissue_type="peel"
     gene            1..1931
                     /gene="AFS1"
     CDS             54..1784
                     /gene="AFS1"
                     /note="terpene synthase"
                     /codon_start=1
                     /product="(E,E)-alpha-farnesene synthase"
                     /protein_id="AAO22848.2"
                     /db_xref="GI:32265058"
                     /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
                     NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
                     EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
                     DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
                     GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI
                     LSLLFQPLVN"
Sequence
ORIGIN
        1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
       61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
      121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
      181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga
      241 agctgtctga gaagttaata gaagaagtta agatttatat atctgctgaa acaatggatt
//
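The CDS feature (54..1784, /codon_start=1) tells a reader where translation begins in the ORIGIN sequence. The sketch below is a minimal illustration, not a GenBank parser: it uses only the first two ORIGIN lines of AY182241, the standard genetic code, and hypothetical helper names, and it checks that the first codons reproduce the start of the /translation qualifier.

```python
import re

# Standard genetic code in the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

# First two ORIGIN lines of AY182241, copied from the record.
ORIGIN_LINES = """
        1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
       61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
"""


def origin_to_sequence(text: str) -> str:
    """Drop position numbers and whitespace from ORIGIN lines."""
    return re.sub(r"[^a-zA-Z]", "", text).upper()


def translate(seq: str, start: int, codon_start: int = 1) -> str:
    """Translate from a 1-based start position until a stop codon or the end of seq."""
    offset = (start - 1) + (codon_start - 1)
    protein = []
    for pos in range(offset, len(seq) - 2, 3):
        residue = CODON_TABLE.get(seq[pos:pos + 3], "X")  # X marks an ambiguous codon
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


sequence = origin_to_sequence(ORIGIN_LINES)
peptide = translate(sequence, start=54)
print(peptide)  # MEFRVHLQADNEQKIFQNQMKP
```

Base 54 is the "a" of the first atg, so the peptide begins with M, matching the first residues of the record's /translation.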
Traditional GenBank Record
ACCESSION U07418
VERSION U07418.1 GI:466461
Accession
•Stable
•Reportable
•Universal
Version: tracks changes in sequence
GI number: NCBI internal use
well annotated
the sequence is the data
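The three identifiers on a VERSION line can be pulled apart programmatically. This is a minimal sketch for lines shaped like the example above; the regular expression and the names are assumptions for illustration and do not cover every accession format.

```python
import re
from typing import NamedTuple


class VersionLine(NamedTuple):
    accession: str  # stable identifier, e.g. U07418
    version: int    # incremented when the sequence changes
    gi: int         # GI number, shown on the slide as for NCBI internal use


# Assumed pattern: letters (and underscores), then digits, then ".version", then "GI:number".
VERSION_RE = re.compile(r"^VERSION\s+([A-Z_]+\d+)\.(\d+)\s+GI:(\d+)\s*$")


def parse_version_line(line: str) -> VersionLine:
    """Parse a flatfile VERSION line into accession, version, and GI number."""
    match = VERSION_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a recognised VERSION line: {line!r}")
    return VersionLine(match.group(1), int(match.group(2)), int(match.group(3)))


print(parse_version_line("VERSION U07418.1 GI:466461"))
print(parse_version_line("VERSION     AY182241.2  GI:32265057"))
```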
Bulk Divisions
•Batch Submission and htg (email and ftp)
•Inaccurate
•Poorly Characterized
• Expressed Sequence Tag
– 1st pass single read cDNA
• Genome Survey Sequence
– 1st pass single read gDNA
• High Throughput Genomic
– incomplete sequences of genomic clones
• Sequence Tagged Site
– PCR-based mapping reagents
GenBank Bulk Sequence:
EST
poorly
characterized
ESTs in Entrez
Total 41 million records
Human 7.9 million
Mouse 4.7 million
Cow 1.3 million
Rice 1.2 million
Zebrafish 1.2 million
Maize 1.2 million
Xenopus tropicalis 1.0 million
Rat 0.9 million
Wheat 0.9 million
Chicken 0.6 million
Barley 0.4 million
HTG Division: Opossum Draft
Sequences
•Unfinished sequences of BACs
•Gaps and unordered pieces
•Finished sequences move to traditional
GenBank division
Whole Genome Shotgun
Projects
ftp://ftp.ncbi.nih.gov/genbank/wgs/
• >450 Projects
• >400 Taxa
– 302 bacteria
– 128 eukaryotes
• 47 fungi
• 53 animals
• 3 flowering plants
Mammalian WGS
• Duck-billed platypus
• Nine-banded armadillo
• Northern tree shrew
• Domestic rabbit
• Guinea pig
• Mouse
• Rat
• Thirteen-lined ground squirrel
• Small-eared galago
• Human
• Chimpanzee
• Rhesus macaque
• Tenrec
• African elephant
• Cat
• Dog
• European hedgehog
• Eurasian shrew
• Cow
• Little brown bat
• Gray short-tailed opossum
Derivative Databases
Entrez Protein: Derivative Data
Source Database Sequences
GenPept 6,749,369
RefSeq 3,261,525
Third Party Annotation 5,079
Swiss Prot 243,887
PIR 30,236
PRF 12,079
PDB 89,953
PAT Division 669,035
Total 10,392,118
BLAST nr total 4,180,857
(no patents or env)
GenPept: GenBank CDS
translations
>gi|463989|gb|AAC50285.1| DNA mismatch repair prote...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
FEATURES Location/Qualifiers
source 1..2484
/organism="Homo sapiens"
/mol_type="mRNA"
/db_xref="taxon:9606"
/chromosome="3"
/map="3p22-p23"
gene 1..2484
/gene="MLH1"
CDS 22..2292
/gene="MLH1"
/note="homolog of S. cerevisiae PMS1 (Swiss-Prot Accession
Number P14242), S. cerevisiae MLH1 (GenBank Accession
Number U07187), E. coli MUTL (Swiss-Prot Accession Number
P23367), Salmonella typhimurium MUTL (Swiss-Prot Accession
Number P14161) and Streptococcus pneumoniae (Swiss-Prot
Accession Number P14160)"
/codon_start=1
/product="DNA mismatch repair protein homolog"
/protein_id="AAC50285.1"
/db_xref="GI:463989"
/translation="MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKS
TSIQVIVKEGGLKLIQIQDNGTGIRKEDLDIVCERFTTSKLQSFEDLASISTYGFRGE
ALASISHVAHVTITTKTADGKCAYRASYSDGKLKAPPKPCAGNQGTQITVEDLFYNIA
TRRKALKNPSEEYGKILEVVGRYSVHNAGISFSVKKQGETVADVRTLPNASTVDNIRS
Redundant Proteins
GenPept
>gi|463989|gb|AAC50285.1| DNA mismatch repair prote...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
>gi|13905126|gb|AAH06850.1| MutL protein homolog 1 ...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
>gi|1079787|gb|AAA82079.1| DNA mismatch repair prot...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
NCBI RefSeq
>gi|4557757|ref|NP_000240.1| MutL protein homolog 1...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
Swiss-Prot
>gi|730028|sp|P40692|MLH1_HUMAN DNA mismatch repair...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
PRF
>gi|741682|prf||2007430A DNA mismatch repair protei...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
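One simple way to see this redundancy is to group FASTA records by exact sequence identity. The sketch below is an illustration, not NCBI's pipeline: the sequences are shortened to the residues visible on the slides, the PDB chain 1B63 is included only as a contrasting record, and the helper names are hypothetical.

```python
from collections import defaultdict

# Deflines from the slides; sequences are truncated to the residues shown there.
FASTA = """
>gi|463989|gb|AAC50285.1| DNA mismatch repair prote...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV
>gi|4557757|ref|NP_000240.1| MutL protein homolog 1...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV
>gi|730028|sp|P40692|MLH1_HUMAN DNA mismatch repair...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV
>gi|741682|prf||2007430A DNA mismatch repair protei...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV
>gi|5542073|pdb|1B63|A Chain A, Mutl Complexed With Adpnp
SHMPIQVLPPQLANQIAAGEVVERPASVVKELVENSLDAGATRIDIDIERGGAKLIRIRDNGCGIKKDEL
"""


def parse_fasta(text):
    """Yield (defline, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def parse_defline(defline):
    """Split a 'gi|number|db|accession|...' defline into its leading fields."""
    fields = defline.split("|")
    # PRF deflines leave the accession slot empty and put the identifier in the next field.
    accession = fields[3] or fields[4].split()[0]
    return {"gi": int(fields[1]), "db": fields[2], "accession": accession}


def group_identical(records):
    """Map each distinct sequence to the list of records that carry it."""
    groups = defaultdict(list)
    for defline, sequence in records:
        groups[sequence].append(parse_defline(defline))
    return dict(groups)


groups = group_identical(parse_fasta(FASTA))
for sequence, members in groups.items():
    sources = ", ".join(f"{m['db']}:{m['accession']}" for m in members)
    print(f"{sequence[:10]}...  {len(members)} record(s): {sources}")
```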
Protein Sequences from
Structures
>gi|5542073|pdb|1B63|A Chain A, Mutl Complexed With Adpnp
SHMPIQVLPPQLANQIAAGEVVERPASVVKELVENSLDAGATRIDIDIERGGAKLIRIRDNGCGIKKDEL
ALALARHATSKIASLDDLEAIISLGFRGEALASISSVSRLTLTSRTAEQQEAWQAYAEGRDMNVTVKPAA
HPVGTTLEVLDLFYNTPARRKFLRTEKTEFNHIDEIIRRIALARFDVTINLSHNGKIVRQYRAVPEGGQK
ERRLGAICGTAFLEQALAIEWQHGDLTLRGWVADPNHTTPALAEIQYCYVNGRMMRDRLINHAIRQACED
KLGADQQPAFVLYLEIDPHQVDVNVHPAKHEVRFHQSRLVHDFIYQGVLSVLQ
Primary vs. Derivative
Sequence Databases
[Diagram: labs and sequencing centers submit sequences (TATAGCCG, AGCTCCGATA, CCGATGACAA) to GenBank, which is updated ONLY by submitters. Curators, genome assembly, and algorithms derive RefSeq and UniGene from those records; the derivative databases are updated continually by NCBI.]
RefSeq: NCBI’s Derivative Sequence
Database
• Curated transcripts and proteins
– reviewed
– human, mouse, rat, fruit fly, zebrafish, arabidopsis, microbial genomes (proteins), and more
• Model transcripts and proteins
• Assembled Genomic Regions (contigs)
– human genome
– mouse genome
– rat genome
– chicken
– honeybee
– sea urchin
• Chromosome records
– Human genome
– microbial
– organelle
Entrez query: srcdb_refseq[Properties]
ftp://ftp.ncbi.nih.gov/refseq/release/
Selected RefSeq Accession
Numbers
mRNAs and Proteins
NM_123456 Curated mRNA
NP_123456 Curated Protein
NR_123456 Curated non-coding RNA
XM_123456 Predicted mRNA
XP_123456 Predicted Protein
XR_123456 Predicted non-coding RNA
Gene Records
NG_123456 Reference Genomic Sequence
Chromosome
NC_123455 Microbial replicons, organelle
Assemblies
NT_123456 Contig
NW_123456 WGS Supercontig
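Because each RefSeq record type has its own two-letter prefix, the type can be read directly off the accession. A minimal sketch covering only the prefixes listed above (the function name is hypothetical, and prefixes not on the slide are deliberately left out):

```python
# Prefix meanings as given on the slide.
REFSEQ_PREFIXES = {
    "NM": "Curated mRNA",
    "NP": "Curated protein",
    "NR": "Curated non-coding RNA",
    "XM": "Predicted mRNA",
    "XP": "Predicted protein",
    "XR": "Predicted non-coding RNA",
    "NG": "Reference genomic sequence",
    "NC": "Chromosome, microbial replicon, or organelle",
    "NT": "Contig",
    "NW": "WGS supercontig",
}


def describe_refseq(accession: str):
    """Return the record type for a RefSeq accession, or None if the prefix is not listed."""
    prefix, separator, _ = accession.partition("_")
    if not separator:
        return None  # no underscore, so not a RefSeq-style accession
    return REFSEQ_PREFIXES.get(prefix.upper())


for acc in ("NM_000249", "NP_000240.1", "AY182241"):
    print(acc, "->", describe_refseq(acc))
```

A GenBank accession such as AY182241 has no underscore, which is what the "distinct accession series" benefit refers to.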
GenBank to RefSeq
RefSeqs: Annotation Reagents
[Diagram: genomic DNA (NC, NT, NW) is scanned to give model mRNA (XM, XR) and model protein (XP). Models are compared (=?) with curated mRNA (NM, NR) and curated protein (NP). Sequence sources shown: RefSeq and GenBank.]
RefSeq Benefits
• non-redundancy
• explicitly linked nucleotide and protein sequences
• updates to reflect current sequence data and biology
• data validation
• format consistency
• distinct accession series
• stewardship by NCBI staff and collaborators
[Figure: Mouse Assembly. Labels: WGS, Other GenBank, UniGene, RefSeq transcript, contig, BAC.]
Expressed Sequences
UniGene
GEO
What is UniGene?
A gene-oriented view of sequence entries
•MegaBlast based automated sequence clustering
•Now informed by genome hits New!
•Nonredundant set of gene oriented clusters
•Each cluster a unique gene
•Information on tissue types and map locations
•Includes known genes and uncharacterized ESTs
•Useful for gene discovery and selection of
mapping reagents
EST hits: Human mRNA
Albumin mRNA
5’ EST hits
3’ EST hits
Chordates
UniGene
Plants
Invertebrates
Fungi et al.
Xenopus laevis MLH1Cluster
Uncharacterized ESTs
UniGene: Expressed
Sequences
Expression Data
Other NCBI Databases
•Structure: imported structures (PDB)
Cn3D viewer, NCBI curation
•CDD: conserved domain database
Protein families (COGs and KOGs)
Single domains (PFAM, SMART, CD)
•dbSNP: nucleotide polymorphism
•Gene: gene records
Unifies LocusLink and Microbial Genomes
NCBI Structures and
Domains
MMDB: Molecular Modeling Data
Base
• Derived from experimentally determined PDB records
• Value added to PDB records including:
– Addition of explicit chemical graph information
– Validation (secondary structure elements)
– Inclusion of Taxonomy, Citation
– Conversion to ASN.1 data description language
• Structure neighbors determined by
Vector Alignment Search Tool (VAST)
Cn3D 4.1: Bacillus thuringiensis
Toxin
VAST: Structure Neighbors
Vector Alignment Search Tool
For each protein chain, locate SSEs (secondary structure elements), and represent them as individual vectors.
Align the vectors.
[Figure: SSE vectors numbered 1 to 6 on human IL-4; alignment of IL-4 and leptin.]
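The first step, turning each secondary structure element into a vector, can be illustrated with a few lines of geometry. This is a toy sketch only: the slides do not describe how VAST scores or aligns vectors, so the coordinates, the function names, and the use of the inter-vector angle as a comparison are all assumptions made for illustration.

```python
import math
from typing import Tuple

Point = Tuple[float, float, float]


def sse_vector(start: Point, end: Point) -> Point:
    """Represent an SSE by the vector from its first to its last coordinate."""
    return tuple(e - s for s, e in zip(start, end))


def angle_degrees(u: Point, v: Point) -> float:
    """Angle between two SSE vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / norms))  # clamp against rounding error
    return math.degrees(math.acos(cosine))


# Hypothetical endpoints for two helices (angstroms); not taken from a real structure.
helix_1 = sse_vector((0.0, 0.0, 0.0), (15.0, 0.0, 0.0))
helix_2 = sse_vector((2.0, 5.0, 0.0), (2.0, 20.0, 0.0))
print(f"Angle between helices: {angle_degrees(helix_1, helix_2):.1f} degrees")
```

Reducing a chain to a handful of vectors is what makes comparing every structure against every other tractable; the residue-level detail comes back only after candidate vector alignments are found.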
Protein Domains
• Structural Domain
– Discrete independently folding unit of a protein
• Conserved Domain (sequence-based)
– Protein region with recognizable position specific
pattern of sequence conservation
• Sequence-based domains often roughly
correspond to structural domains
• Domains often have distinct, identifiable
functions
NCBI’s Conserved Domain
Database
• PSI-BLAST –based score matrices
• Searchable with RPS-BLAST
• Sources
– SMART
– PFAM
– COGs
– NCBI curated domains
• structure informed alignments
Src Domains
Structure vs Conserved Domain
Conserved phosphotyrosine binding residues
SH2
SH2
TyrKC
SH3
Cn3D
NCBI Molecular Biology
Resources
Using Entrez
WWW
Access
Entrez
&
BLAST
Entrez: Database
Integration
[Diagram: Entrez databases and the links between them. PubMed abstracts (neighbors by word weight); Taxonomy (phylogeny); Genomes; 3-D Structure (VAST neighbors: related structures); Nucleotide sequences and Protein sequences (BLAST neighbors: related sequences); BLink; Domains. Hard links connect records in different databases.]
Database Searching with
Entrez
Using limits and field restriction to find human MutL homolog
Linking and neighboring with MutL
Mapping SNPs onto structure and the genome
Global NCBI (Entrez) Search
Human hereditary nonpolyposis colon cancer
Global Entrez Search
Results
Nucleotide Sequences
Nucleotide database now three parts
•EST expressed sequence tags
•GSS genome survey sequences
•CoreNucleotide everything else
Advanced Search Options
Tabs
More Precise Nucleotides
Search
nonpolyposis[All Fields] AND colon cancer[Title] AND human[Organism]
AND biomol_mrna[Properties] AND srcdb_refseq[Properties]
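The same field-restricted query can be sent to Entrez programmatically. The slides only show the web interface, so the sketch below is an assumption layered on top of them: it targets the NCBI E-utilities ESearch endpoint, and it only builds the URL without sending a request.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# E-utilities ESearch endpoint (not mentioned in the slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_query(*terms: str) -> str:
    """Join field-restricted terms with AND, as in the slide's example."""
    return " AND ".join(terms)


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL; urlencode escapes brackets and spaces in the term."""
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


query = build_query(
    "nonpolyposis[All Fields]",
    "colon cancer[Title]",
    "human[Organism]",
    "biomol_mrna[Properties]",
    "srcdb_refseq[Properties]",
)
url = esearch_url("nucleotide", query)
print(url)

# Round-trip check: decoding the URL gives back the original query text.
decoded = parse_qs(urlsplit(url).query)["term"][0]
print(decoded == query)
```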
Useful Field Restrictions
[Title]: Definition line in GenBank / GenPept format shown in Summary format
glyceraldehyde 3 phosphate dehydrogenase[Title]
[Organism]: NCBI’s taxonomy. Organizing system for molecular databases
mouse[organism]; green plants[organism]; Streptomyces coelicolor[organism]
[Properties]: molecule type, location, database source
biomol_mrna[properties]; biomol_genomic[properties];
gene_in_mitochondrion[properties]; srcdb_pdb[properties]
[Filter]: subsets of data, Entrez links
all[filter]; nucleotide mapview[filter]; nucleotide omim[filter]
Organism Field: NCBI’s
Taxonomy
Useful Properties Field Terms
Molecule type
biomol_mrna
biomol_genomic
GenBank division
gbdiv_est
gbdiv_htg
gbdiv_xxx
Gene location
gene_in_mitochondrion
gene_in_chloroplast
gene_in_genomic
Source Database
srcdb_refseq
srcdb_pdb
srcdb_swiss_prot
Human MutL RefSeq
GenBank Records
NM_000249: Links
Literature
Links
OMIM
OMIM: Human Disease Genes
Conserved Domain
Sequence Links
Finding Homologs and Structures
Protein Link
BLAST Link
Conserved Domains
Related Proteins: Homologs and
Redundancy
Bacterial Homologs
Redundant Sequences
BLink: BLAST Link
Redundant GIs
top 200 only
BLink: non-redundant relatives
zebrafish homolog
BLAST
Related Proteins: Structure
Links
Structures
Short Cut: Related Structures
E. coli MutL Structure
Cn3D viewer
Structure Neighbors
Pubchem compound 3D Domain Neighbors
Conserved Domains
MLH1 Domain Structure:
CDD
ATPase Domain
Mismatch Repair Domain
MLH1: ATPase Domain
Mapping Polymorphisms onto
Structure
GeneView: Variations Human
MLH1
ATPase domain
Related Structures
Mapping Variation Onto
Structure
Asn
Ile
Ile – Val Conserved Asn
Genome
Resources
NM_000249: Genome Links
The Map Viewer
Genome BLAST
Previous Builds Available
Map Viewer: Human MLH1
Customizable
Transcripts
EST Hits
Download data and sequences
Models
NCBI Assembly
Gene Annotations
Maps and Options
Mapped
Variations
Synteny: Mammalian Genomes
Homologene
• No longer UniGene based
• Protein similarities first
• Guided by taxonomic tree
• Includes orthologs and paralogs
[Diagram: an early globin gene undergoes gene duplication into an A-chain gene and a B-chain gene. Frog A, chick A, and mouse A are orthologs, as are mouse B, chick B, and frog B; A-chain and B-chain genes are paralogs.]
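The ortholog and paralog labels in the globin example follow from which event separates two genes: speciation or duplication. The sketch below encodes just that one scenario, a single duplication that predates all three species, and is not how HomoloGene itself builds clusters; the names are hypothetical.

```python
from itertools import combinations

# (species, chain) pairs from the globin diagram.
GENES = [
    ("frog", "A"), ("chick", "A"), ("mouse", "A"),
    ("mouse", "B"), ("chick", "B"), ("frog", "B"),
]


def relationship(gene_1, gene_2) -> str:
    """Classify two genes, assuming one duplication that predates every speciation."""
    (species_1, chain_1), (species_2, chain_2) = gene_1, gene_2
    if chain_1 != chain_2:
        return "paralogs"   # separated by the gene duplication
    if species_1 != species_2:
        return "orthologs"  # separated only by speciation
    return "same gene"


counts = {"orthologs": 0, "paralogs": 0}
for gene_1, gene_2 in combinations(GENES, 2):
    counts[relationship(gene_1, gene_2)] += 1
print(counts)
```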
Homologene Cluster
Rice Homolog
The Gene Database
• Gene Centered Information
• Unifies LocusLink and microbial Genomes
• 2.4 million records for 3,822 taxa
Human 38,603 Sea Urchin 30,603
Chimpanzee 31,502 Mosquito 13,763
Mouse 60,746 Fruit Fly 21,116
Rat 38,117 C. elegans 20,935
Dog 20,154 Fungi 168,802
Cow 23,677 Green Plants 76,847
Chicken 18,469 Archaea 74,627
Zebrafish 38,594 Bacteria 1,361,390
Genes MLH1: One Stop Shopping
Genes MLH1: One Stop Shopping
(cont.)
Genes: Display Options and
Links