
National Center for Biotechnology Information










NCBI FieldGuide

A Field Guide to GenBank and NCBI’s Molecular Biology Resources









January 30, 2007 Washington University, St. Louis



ftp://ftp.ncbi.nih.gov/pub/FieldGuide/Slides/Current/WashU.01.30.067/Jan30

Topics

• About NCBI
• GenBank overview
• Primary vs derivative databases
• The Reference Sequence (RefSeq) project
• The Entrez engine and databases
-break-
• Entrez text searching
• Genomic resources
• Sequence similarity - BLAST
• An integrated example

The National Institutes of Health










Bethesda, MD

The National Center for Biotechnology Information

• Accepts submissions of primary data
• Develops tools to analyze these data
• Creates derivative databases based on the primary data
• Provides free search, link, and retrieval of these data, primarily through the Entrez system

NCBI Web Traffic

[Figure: users per day, 1998 to 2005 (y-axis 100,000 to 600,000), annotated "Christmas and New Year’s Day". Inset breakdown of users by origin: U.S. (.com, .net, .org, .gov, .us) 40%, Other 14%, Japan 6%, Italy 4%, Canada 3%, Germany 3%, United Kingdom 3%, Netherlands 2%, Spain 2%, Brazil 2%, Sweden 1%, Switzerland 1%, Belgium 1%]

NCBI Web Usage – Users per Day, January 2007










NCBI Web Usage – Hits per Day, January 2007










Homepage - accessing the data










all[filter]


01/21/2007









9/19/2006

GenBank

Release 157, December 2006
83 x 10^6 records
150 x 10^9 nucleotides

254 Gb (non-WGS), 1072 files (non-WGS)

• full release every two months
• incremental and cumulative updates daily
• available only via ftp
• release notes: gbrel.txt

ftp://ftp.ncbi.nih.gov/genbank/
ftp://genbank.sdsc.edu/pub
ftp://bio-mirror.net/biomirror/genbank

The Growth of GenBank

[Figure: GenBank growth from Aug-97 to Aug-06, in billions of bases (y-axis 0 to 160). At Release 157: WGS 81.6 billion bases, non-WGS 69.0 billion bases. Doubling time 12-14 months.]
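The doubling time on this slide implies roughly exponential growth. A minimal sketch of that arithmetic follows, assuming the doubling time stays constant (the slide only gives a 12-14 month range) and starting from the Release 157 total of about 150 billion bases. The function name `project_bases` is made up for illustration.

```python
def project_bases(current_bases, months_ahead, doubling_months):
    """Project database size assuming steady exponential growth.

    Size doubles once every `doubling_months`, so after `months_ahead`
    months it has doubled months_ahead / doubling_months times.
    """
    return current_bases * 2 ** (months_ahead / doubling_months)


# Release 157 total from the slide: about 150 x 10^9 nucleotides.
release_157_bases = 150e9

# Try both ends of the 12-14 month range quoted on the slide.
for doubling in (12, 14):
    projected = project_bases(release_157_bases, 24, doubling)
    print(f"doubling every {doubling} months: "
          f"{projected / 1e9:.0f} billion bases after 24 months")
```

The spread between the two projections shows how sensitive the estimate is to the assumed doubling time, so treat the result as an order-of-magnitude guide only.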

What is GenBank?

• Nucleotide only sequence database
• Archival in nature
    • Historical
    • Reflective of submitter point of view (subjective)
    • Redundant

• GenBank Data
    • Direct submissions (traditional records)
    • Batch submissions (EST, GSS, STS)
    • ftp accounts (genome data)

• Three collaborating databases
    • GenBank
    • DNA Database of Japan (DDBJ)
    • European Molecular Biology Laboratory (EMBL) Database

GenBank Divisions

“Organismal” (Traditional)
• Organized by taxonomy (sort of)
• Direct submissions (Sequin/Bankit)
• Accurate (~1 error per 10,000 bp)
• Well characterized

PRI (28)   Primate
ROD (15)   Rodent
PLN (20)   Plant and Fungal
BCT (18)   Bacterial/Archeal
INV (7)    Invertebrate
VRT (7)    Other Vertebrate
VRL (4)    Viral
MAM (2)    Mammalian
PHG (1)    Phage
SYN (1)    Synthetic
ENV (4)    Envir. samples
UNA (1)    Unannotated

“Functional” (Bulk)
• Organized by sequence type
• Batch submissions (ftp/email)
• Less accurate
• Poorly characterized

EST (570)  Expressed Sequence Tag
GSS (197)  Genome Survey Sequence
HTG (88)   High Throughput Genomic
PAT (27)   Patent
STS (9)    Sequence Tagged Site
CON (1)    Contigs, virtual

GenBank Functional (Bulk) Divisions

• EST: Expressed Sequence Tag
    • 1st pass single read cDNA
• GSS: Genome Survey Sequence
    • 1st pass single read gDNA
• HTG: High Throughput Genomic
    • incomplete sequences of genomic clones
• STS: Sequence Tagged Site
    • PCR-based mapping reagents

Whole Genome Shotgun

EST Division: Expressed Sequence Tags

[Diagram labels: nucleus (30,000 genes); RNA gene products; make cDNA library (80-100,000 unique cDNA clones in library); isolate unique clones; sequence once from each end (5’ and 3’)]

>IMAGE:275615 5' mRNA sequence
GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGG
TGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAGAATGGAAAGTCAA
TTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAGTTGACTTACTGAAGAATGGAGA
GAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTGTACTAC
TGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNC
AAGTTNAGTTTAAGTGGGNATCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCN
TTGGATTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG

>IMAGE:275615 3', mRNA sequence
NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTA
TTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTCATTCATTATAACAAATTT
AATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTT
CTTATGCACGCTTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAG
GTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANGTTGCCAGCCCTC
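ESTs are single-pass reads, so uncalled bases (N) are common, as in the two IMAGE:275615 reads on this slide. The sketch below parses FASTA text and reports the fraction of N bases per record. The helper names `parse_fasta` and `ambiguous_fraction` are made up for illustration, and the sample input uses only one line excerpted from each read rather than the full sequences.

```python
def parse_fasta(text):
    """Return {header: sequence} for FASTA-formatted text."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]          # drop the leading '>'
            records[header] = []
        elif header is not None:
            records[header].append(line)
    # Join the per-line chunks into one sequence string per record.
    return {h: "".join(parts) for h, parts in records.items()}


def ambiguous_fraction(seq):
    """Fraction of positions called as N (0.0 for an empty sequence)."""
    return seq.upper().count("N") / len(seq) if seq else 0.0


# One line excerpted from each read shown on the slide.
sample = """\
>IMAGE:275615 5' mRNA sequence
AAGTTNAGTTTAAGTGGGNATCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCN
>IMAGE:275615 3', mRNA sequence
NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTA
"""

for header, seq in parse_fasta(sample).items():
    print(f"{header}: {len(seq)} bases, {ambiguous_fraction(seq):.1%} N")
```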

GenBank Bulk Sequence: EST









poorly characterized

GSS, HTG, WGS

[Diagram labels: Whole BAC insert (or genome); shred; isolate clones; sequence; GSS division or trace archive; assembly; whole genome shotgun assemblies (wgs projects); Draft sequence (HTG division)]

HTG Example: Honeybee Draft Sequences

LOCUS       AC141845   147720 bp   DNA   linear   HTG 19-MAR-2004
DEFINITION  Apis mellifera clone CH224-4A2, WORKING DRAFT SEQUENCE,
            14 unordered pieces.
ACCESSION   AC141845
VERSION     AC141845.1  GI:29124029
KEYWORDS    HTG; HTGS_PHASE1; HTGS_DRAFT.

• Unfinished sequences of BACs
• Gaps and unordered pieces
• Finished sequences (Phase 3) move to traditional GenBank division
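The LOCUS line of the honeybee record carries the division code (HTG) that the bullets refer to. A minimal sketch of pulling those fields apart follows. It splits on whitespace and assumes all eight tokens are present, as they are in this example; real flat files can omit fields, so this is not a general GenBank parser. The function name `parse_locus` is made up for illustration.

```python
from datetime import date

MONTHS = {"JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
          "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12}


def parse_locus(line):
    """Split a LOCUS line into named fields (whitespace-based sketch)."""
    fields = line.split()
    if len(fields) != 8 or fields[0] != "LOCUS" or fields[3] != "bp":
        raise ValueError(f"unexpected LOCUS line layout: {line!r}")
    _, name, length, _, molecule, topology, division, stamp = fields
    day, month, year = stamp.split("-")   # e.g. 19-MAR-2004
    return {
        "name": name,
        "length": int(length),
        "molecule": molecule,
        "topology": topology,
        "division": division,
        "date": date(int(year), MONTHS[month.upper()], int(day)),
    }


record = parse_locus("LOCUS AC141845 147720 bp DNA linear HTG 19-MAR-2004")
print(record["name"], record["division"], record["length"], record["date"])
```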

Whole Genome Shotgun Projects

• 685 projects
    • Bacteria (320)
    • Environmental sequences (14)
    • Archaea (8)
    • Eukaryotes (140), including:
        • Chicken, Rat, Mouse, Dog (2), Chimpanzee, Human
        • Pufferfish (2)
        • Honeybee, Anopheles, Fruit Flies (3), Silkworm
        • Nematode (2)
        • Yeasts (8), Aspergillus (2)
        • Rice (2)

Whole Genome Shotgun (WGS) Projects










wgs master[properties]





ftp://ftp.ncbi.nih.gov/genbank/wgs/

Derivative Databases










[Diagram labels: Sequencing Centers; Labs; GenBank, updated by submitters (EST, STS, HTG, GSS, INV, VRT, PHG, VRL, PRI, ROD, PLN, MAM, BCT); UniGene, UniSTS, RefSeq, updated ONLY by NCBI; RefSeq: Entrez Gene and annotation pipelines]

Why Make Reference Sequences?










Entrez Nucleotide query:



human[organism] AND lipase[title]


human[organism] AND lipase[title] AND endothelial[title]
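The slides run these fielded queries in the Entrez web interface. The same query strings can also be sent to NCBI's E-utilities ESearch endpoint; that route is an addition here, not something the slides cover. The sketch below only builds the request URL and does not contact the server. The helper name `esearch_url` and the choice of `retmax` are illustrative assumptions.

```python
from urllib.parse import urlencode

# Base URL of the E-utilities ESearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL for a fielded Entrez query (no network access)."""
    # urlencode escapes the square brackets and turns spaces into '+'.
    params = {"db": db, "term": term, "retmax": retmax}
    return EUTILS_ESEARCH + "?" + urlencode(params)


broad = esearch_url("nucleotide", "human[organism] AND lipase[title]")
narrow = esearch_url(
    "nucleotide",
    "human[organism] AND lipase[title] AND endothelial[title]",
)
print(broad)
print(narrow)
```

Adding the extra `[title]` term narrows the result set in the same way as on the slide; fetching and parsing the response is left out of this sketch.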

[Screenshot labels: 3927 bp, 4150 bp, 2323 bp, 3927 bp, 261 bp]

RefSeq Benefits

genomes, transcripts, proteins

• non-redundant; best representative
• updates to reflect current sequence data and biology
• distinct, stable accession series

Reference Sequence: RefSeq

Accession          Sequence Type

NM_123456789       mRNA
NP_123456789       protein, from NM_
NR_123456          non-coding RNA
XM_123456          predicted mRNA
XP_123456          predicted protein
XR_123456          predicted non-coding RNA
ZP_12345678        predicted from NZ_

NC_123456          genomic, e.g., chromosomes
NG_123455          genomic, incomplete region

NT_123456          genomic, BAC assembly
NW_123456          genomic, WGS assembly
NZ_ABCD12345678    genomic, WGS collection
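Because each RefSeq series has a distinct two-letter prefix, the sequence type can be read straight off the accession. A minimal sketch of that lookup follows, using only the prefixes listed on this slide; the function name `refseq_type` is made up for illustration, and prefixes not shown on the slide simply return `None`.

```python
# Prefix-to-type table copied from the slide.
REFSEQ_PREFIXES = {
    "NM": "mRNA",
    "NP": "protein, from NM_",
    "NR": "non-coding RNA",
    "XM": "predicted mRNA",
    "XP": "predicted protein",
    "XR": "predicted non-coding RNA",
    "ZP": "predicted from NZ_",
    "NC": "genomic, e.g., chromosomes",
    "NG": "genomic, incomplete region",
    "NT": "genomic, BAC assembly",
    "NW": "genomic, WGS assembly",
    "NZ": "genomic, WGS collection",
}


def refseq_type(accession):
    """Return the slide's sequence type for a RefSeq accession, else None."""
    prefix, underscore, rest = accession.partition("_")
    if not underscore or not rest:
        return None  # no underscore: not in the RefSeq accession style
    return REFSEQ_PREFIXES.get(prefix.upper())


for acc in ("NM_123456789", "XP_123456", "NZ_ABCD12345678", "AC141845"):
    print(acc, "->", refseq_type(acc))
```

The last accession, AC141845, is the GenBank HTG record from the honeybee example; it has no underscore and so is not classified as RefSeq.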

Annotation Process

[Diagram labels: Genomic DNA (NC, NT, NW); Scanning....; Model mRNA (XM), (XR); Model protein (XP); Curated mRNA (NM), (NR); Curated Protein (NP); RefSeq; GenBank Sequences]

Creating NM_ Records

Genome annotation

NM’s must have cDNA support

transcript variant 1
transcript variant 2
transcript variant 3
Longest mRNA


The Entrez System

[Diagram: Entrez at the center, surrounded by the databases Gene, UniGene, CancerChromosomes, UniSTS, HomoloGene, SNP, Genome, PopSet, Nucleotide, GEO, Books, PubMed, Taxonomy, GENSAT, MeSH, Probe, OMIM, PubChem, Protein, PMC, Journals, Structure, Domains, 3D Domains]

Entrez Databases

● All Molecular Database entries are organized by organism (Taxonomy Database).
● Each record is assigned a UID.
    • A “unique integer identifier” for internal tracking
● Each record is indexed by data fields.
    • [author], [title], [organism], and many others
● Each record is given a Document Summary.
    • a summary of the record’s content (DocSum)
● Each record is manually or computationally assigned links to biologically related UIDs in and across databases.

Entrez Links










Links

GeneView in dbSNP










Entrez Databases

• UniGene: Clusters of ESTs, mRNAs
• dbSNP: Single Nucleotide Polymorphisms
• CDD: Conserved Domain Database
    protein families (COGs and KOGs)
    single domains (PFAM, SMART, CD)
…and more

UniGene







Gene-oriented clusters of expressed sequences



• Automatic clustering using MegaBlast



• Each cluster represents a unique gene



• Informed by genome hits



• Information on tissue types and map locations

• Useful for gene discovery and selection of mapping reagents

A Cluster of ESTs










query









5’ EST hits

3’ EST hits

UniGene Collections










Species UniGene Entries

UniGene Hs build 194










UniGene Cluster Hs.95351

Lipase, hormone-sensitive (LIPE)










UniGene Cluster Hs.95351: expression


UniGene Cluster Hs.95351: seqs

Get Sequences










web page







ftp://ftp.ncbi.nih.gov/repository/UniGene/Homo_sapiens/


NCBI’s SNP Database

• Primary and derivative (RefSNP)
    • Single nucleotide polymorphisms
    • Repeat polymorphisms
    • Insertion-deletion polymorphisms

• Over 30 million refSNPs (rsXXXXXXX)

Searching dbSNP

RefSNP











Conserved Domain Database

• Multiple sequence alignments
• Position-specific scoring matrices (PSSM)
• Sources: SMART, PFAM, COGs, KOGs, and NCBI curated domains (structure-informed alignments)
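A position-specific scoring matrix is derived from the columns of a multiple sequence alignment. The sketch below illustrates the general idea with log-odds scores against a uniform background and a simple pseudocount. Both choices are simplifying assumptions and are not how CDD builds its matrices. The first two aligned segments are motifs that occur in the ATP7A sequence used later in the deck; the third is a made-up variant, and the function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log2-odds score} dict per alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # uniform background (assumption)
    pssm = []
    for col in range(length):
        # Count residues in this column; gaps and unknown symbols are skipped.
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, segment):
    """Sum per-position scores; segment must use the 20 standard residues."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


# Toy alignment: two motifs from the ATP7A example plus one made-up variant.
toy_alignment = ["GMHCNSC", "GMTCNSC", "GMTCASC"]
pssm = build_pssm(toy_alignment)
print(round(score(pssm, "GMTCNSC"), 2), round(score(pssm, "WWWWWWW"), 2))
```

A segment that matches the conserved columns scores well above an unrelated one, which is the property domain searches rely on.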


CDD


Search Entrez CDD









pep*


Search with Protein Query









>gi|45549418|gb|AAS67634.1| ATP7A [Solenodon paradoxus]
IVYQPHLITVEEIKKQIKAVGFPAFIKKQPKYLKLGAIDIERLKNIPVKSSEGSQQMSPS
STNDSKVTLTIDGMHCNSCVSNIESALSTLHYVSSIVVSLQNKSAIIKYNANSVTPEIL
KKAIEAISPGQYRVSITSEVESTSNSPSSSSQKAPLNVVSQPLTQVTVININGMTCNS
CVQSIEGVMSKKAGVKSIQVSLANRNGTVEYDPLLTSPEILRE


Click on a colored bar to align your sequence to the CD

Show Alignment


Full Result









CD



Pfam



COG


Domain Architecture


CDART: Conserved Domain Architecture Retrieval Tool



Show Structure

Structure – Cn3D










Topics

• About NCBI
• GenBank overview
• Primary vs derivative databases
• The Reference Sequence (RefSeq) project
• Selected Entrez databases
• Bookshelf
-break-
• Entrez text searching
• Selected genomic resources
• Sequence similarity - BLAST
• An integrated example

Literature Links


BOOKS Database


BOOKS Database: Hyperlinked Terms


For More Information…










Intermission

