NCBI Field Guide

					NCBI Molecular Biology
     Resources
         Part 1




        February 2007
      The National Center for
     Biotechnology Information



                                               Bethesda, MD



        Created in 1988 as a part of the
       National Library of Medicine at NIH
–   Establish public databases
–   Research in computational biology
–   Develop software tools for sequence analysis
–   Disseminate biomedical information
Web Access: www.ncbi.nlm.nih.gov
    NCBI Databases and Services
• GenBank: largest sequence database
• Free public access to biomedical literature
    – PubMed: free Medline
    – PubMed Central: full text online access
• Entrez: integrated molecular and literature databases
• BLAST: highest volume sequence search service
• VAST: structure similarity searches
• Software and Databases
           Types of Databases

• Primary Databases
   – Original submissions by experimentalists
   – Content controlled by the submitter
      • Examples: GenBank, SNP, GEO
• Derivative Databases
   – Built from primary data
   – Content controlled by third party (NCBI)
      • Examples: Refseq, TPA, RefSNP, UniGene, NCBI
        Protein, Structure, Conserved Domain
           Entrez Nucleotides
                    Primary
•   GenBank / EMBL / DDBJ       86,011,283
                   Derivative
•   RefSeq                       1,512,656
•   Third Party Annotation           5,254
•   PDB                              7,261
Total                           87,536,454
              What is GenBank?
      NCBI’s Primary Sequence Database
• Nucleotide only sequence database
• Archival in nature
  – Historical
  – Reflective of submitter point of view (subjective)
  – Redundant
• GenBank Data
   – Direct submissions (traditional records)
   – Batch submissions (EST, GSS, STS)
   – ftp accounts (genome data)
• Three collaborating databases
  – GenBank
  – DNA Database of Japan (DDBJ)
  – European Molecular Biology Laboratory (EMBL)
    Database
               International Sequence
               Database Collaboration
[Figure: the three collaborating databases, each receiving submissions and updates: GenBank (NCBI, NIH; served through Entrez), EMBL (EBI; served through SRS), and DDBJ (CIB, NIG; served through getentry).]
GenBank: NCBI’s Primary Sequence
                 Database

   Release 157               December 2006
    83,434,665               Records
150,630,667,561              Total Bases
   254 Gigabytes (non-WGS)   1072 files (non-WGS)


       • full release every two months
       • incremental updates daily
       • available only via ftp

       ftp://ftp.ncbi.nih.gov/genbank/
                   The Growth of GenBank
[Figure: total bases (billions, y-axis 0 to 160) by release date (x-axis Aug-97 to Aug-06), through Release 157. WGS: 81.6 billion bases; non-WGS: 69.0 billion bases. Doubling time 12-14 months.]
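The doubling time quoted on the growth chart can be turned into rough arithmetic. The sketch below assumes pure exponential growth with a fixed doubling time, which is a simplification of the curve on the slide; the function names are illustrative and not part of any NCBI tool.

```python
def projected_bases(current_bases, months_ahead, doubling_months):
    """Project a total under exponential growth with a fixed doubling time."""
    return current_bases * 2 ** (months_ahead / doubling_months)


def monthly_growth_rate(doubling_months):
    """Fractional growth per month implied by a doubling time in months."""
    return 2 ** (1 / doubling_months) - 1


if __name__ == "__main__":
    # Release 157 totals from the slide, in billions of bases.
    total = 81.6 + 69.0  # WGS + non-WGS
    for doubling in (12, 14):
        one_year = projected_bases(total, 12, doubling)
        rate = monthly_growth_rate(doubling)
        print(f"doubling {doubling} months: {rate:.1%} per month, "
              f"about {one_year:.0f} billion bases after one year")
```

With a 12-month doubling time the projection for one year ahead is simply twice the current total; the 14-month figure gives a lower bound for the same period.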
     Organization of GenBank:
       Traditional Divisions
Records are divided into 18 divisions: 12 traditional and 6 bulk.

Traditional divisions:
• Direct submissions (Sequin and BankIt)
• Accurate
• Well characterized

     PRI  Primate
     PLN  Plant and Fungal
     BCT  Bacterial and Archaeal
     INV  Invertebrate
     ROD  Rodent
     VRL  Viral
     VRT  Other Vertebrate
     MAM  Mammalian
     PHG  Phage
     SYN  Synthetic (cloning vectors)
     ENV  Environmental Samples
     UNA  Unannotated

       Entrez query: gbdiv_xxx[Properties]
      Organization of GenBank:
           Bulk Divisions
Records are divided into 18 divisions: 12 traditional and 6 bulk.

Bulk divisions:
• Batch submission (email and FTP)
• Inaccurate
• Poorly characterized

     EST  Expressed Sequence Tag
     GSS  Genome Survey Sequence
     HTG  High Throughput Genomic
     STS  Sequence Tagged Site
     HTC  High Throughput cDNA
     PAT  Patent

         Entrez query: gbdiv_xxx[Properties]
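The gbdiv_xxx[Properties] pattern can be filled in mechanically from the division codes on the two slides above. The following is a minimal sketch; the dictionaries copy the slide contents, while the function name and the lowercase convention (taken from the gbdiv_est and gbdiv_htg examples later in the deck) are assumptions of this example.

```python
# Division codes as listed on the slides: 12 traditional, 6 bulk.
TRADITIONAL_DIVISIONS = {
    "PRI": "Primate", "PLN": "Plant and Fungal", "BCT": "Bacterial and Archaeal",
    "INV": "Invertebrate", "ROD": "Rodent", "VRL": "Viral",
    "VRT": "Other Vertebrate", "MAM": "Mammalian", "PHG": "Phage",
    "SYN": "Synthetic (cloning vectors)", "ENV": "Environmental Samples",
    "UNA": "Unannotated",
}
BULK_DIVISIONS = {
    "EST": "Expressed Sequence Tag", "GSS": "Genome Survey Sequence",
    "HTG": "High Throughput Genomic", "STS": "Sequence Tagged Site",
    "HTC": "High Throughput cDNA", "PAT": "Patent",
}
ALL_DIVISIONS = {**TRADITIONAL_DIVISIONS, **BULK_DIVISIONS}


def division_query(code):
    """Return the Entrez term that restricts a search to one GenBank division."""
    code = code.upper()
    if code not in ALL_DIVISIONS:
        raise ValueError(f"unknown GenBank division: {code}")
    return f"gbdiv_{code.lower()}[Properties]"


if __name__ == "__main__":
    for code in ("PRI", "EST"):
        print(code, "->", division_query(code))
```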
A Traditional GenBank Record: The Flatfile Format
(slide callouts mark three parts of the record: Header, Feature Table, Sequence)

LOCUS       AY182241                1931 bp    mRNA    linear   PLN 04-MAY-2004
DEFINITION  Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA,
            complete cds.
ACCESSION   AY182241
VERSION     AY182241.2  GI:32265057
KEYWORDS    .
SOURCE      Malus x domestica (cultivated apple)
  ORGANISM  Malus x domestica
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
            rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.
REFERENCE   1  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Cloning and functional expression of an (E,E)-alpha-farnesene
            synthase cDNA from peel tissue of apple fruit
  JOURNAL   Planta 219, 84-94 (2004)
REFERENCE   2  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
REFERENCE   3  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
  REMARK    Sequence update by submitter
COMMENT     On Jun 26, 2003 this sequence version replaced gi:27804758.
FEATURES             Location/Qualifiers
     source          1..1931
                     /organism="Malus x domestica"
                     /mol_type="mRNA"
                     /cultivar="'Law Rome'"
                     /db_xref="taxon:3750"
                     /tissue_type="peel"
     gene            1..1931
                     /gene="AFS1"
     CDS             54..1784
                     /gene="AFS1"
                     /note="terpene synthase"
                     /codon_start=1
                     /product="(E,E)-alpha-farnesene synthase"
                     /protein_id="AAO22848.2"
                     /db_xref="GI:32265058"
                     /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
                     NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
                     EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
                     DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
                     GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI
                     LSLLFQPLVN"
ORIGIN
        1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
       61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
      121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
      181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga
      241 agctgtctga gaagttaata gaagaagtta agatttatat atctgctgaa acaatggatt
//
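The link between the CDS feature and its /translation qualifier in the record above can be checked by hand. The sketch below translates the start of the coding region (CDS 54..1784, /codon_start=1) from the first two ORIGIN lines using the standard genetic code. The helper names are illustrative, and only the portion of the sequence shown on the slide is used, so the result is a prefix of the full protein.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "tcag"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

# First two ORIGIN lines of AY182241, copied from the record above.
ORIGIN_LINES = """
        1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
       61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
"""


def origin_to_sequence(origin_text):
    """Strip position numbers and spaces from ORIGIN lines."""
    return "".join(ch for ch in origin_text.lower() if ch in "acgtn")


def translate_from(sequence, cds_start, codon_start=1):
    """Translate from a 1-based CDS start until a stop codon or the end of input."""
    offset = (cds_start - 1) + (codon_start - 1)
    protein = []
    for pos in range(offset, len(sequence) - 2, 3):
        amino_acid = CODON_TABLE.get(sequence[pos:pos + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    seq = origin_to_sequence(ORIGIN_LINES)
    print(translate_from(seq, cds_start=54))  # MEFRVHLQADNEQKIFQNQMKP
```

The printed peptide matches the first residues of the /translation qualifier, which is how GenPept protein records are derived from GenBank CDS features.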
     Traditional GenBank Record

 ACCESSION           U07418
 VERSION             U07418.1   GI:466461

Accession: stable, reportable, universal
Version: tracks changes in sequence
GI number: NCBI internal use

well annotated

the sequence is the data
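The three identifiers on this slide can be pulled apart from a VERSION line with a short pattern. This is a sketch that assumes the accession.version plus GI layout shown in the two records above; the class and function names are made up for the example.

```python
import re
from typing import NamedTuple


class SequenceVersion(NamedTuple):
    accession: str  # stable identifier for the record
    version: int    # incremented when the sequence changes
    gi: int         # GI number paired with this sequence version


# Assumes the layout "VERSION  <accession>.<version>  GI:<number>".
VERSION_PATTERN = re.compile(r"^VERSION\s+([A-Z]+_?[A-Z0-9]+)\.(\d+)\s+GI:(\d+)$")


def parse_version_line(line):
    """Split a flatfile VERSION line into accession, version, and GI."""
    match = VERSION_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"not a VERSION line in the expected layout: {line!r}")
    accession, version, gi = match.groups()
    return SequenceVersion(accession, int(version), int(gi))


if __name__ == "__main__":
    print(parse_version_line("VERSION             U07418.1   GI:466461"))
    print(parse_version_line("VERSION     AY182241.2  GI:32265057"))
```

In the apple record, version 2 with GI 32265057 replaced gi:27804758, which illustrates that the accession stays fixed while the version and GI change with the sequence.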
         Bulk Divisions
• Batch submission and HTG (email and FTP)
• Inaccurate
• Poorly characterized

 • Expressed Sequence Tag
     – 1st pass single read cDNA
 • Genome Survey Sequence
     – 1st pass single read gDNA
 • High Throughput Genomic
     – incomplete sequences of genomic clones
 • Sequence Tagged Site
     – PCR-based mapping reagents
       GenBank Bulk Sequence: EST
[Figure: example EST record, labeled "poorly characterized".]
ESTs in Entrez

   Total                41 million records
   Human                7.9 million
   Mouse                4.7 million
   Cow                  1.3 million
   Rice                 1.2 million
   Zebrafish            1.2 million
   Maize                1.2 million
   Xenopus tropicalis   1.0 million
   Rat                  0.9 million
   Wheat                0.9 million
   Chicken              0.6 million
   Barley               0.4 million
HTG Division: Opossum Draft Sequences

            • Unfinished sequences of BACs
            • Gaps and unordered pieces
            • Finished sequences move to traditional GenBank division
  Whole Genome Shotgun
         Projects

ftp://ftp.ncbi.nih.gov/genbank/wgs/

                          • >450 Projects
                          • >400 Taxa
                              – 302 bacteria
                              – 128 eukaryotes
                                 • 47 fungi
                                 • 53 animals
                                 • 3 flowering plants
Mammalian WGS
 •   Duck-billed platypus
 •   Nine-banded armadillo
 •   Northern tree shrew
 •   Domestic rabbit
 •   Guinea pig
 •   Mouse
 •   Rat
 •   Thirteen-lined ground squirrel
 •   Small-eared galago
 •   Human
 •   Chimpanzee
 •   Rhesus macaque
 •   Tenrec
 •   African elephant
 •   Cat
 •   Dog
 •   European hedgehog
 •   Eurasian shrew
 •   Cow
 •   Little brown bat
 •   Gray short-tailed opossum
Derivative Databases
Entrez Protein: Derivative Data
 Source Database           Sequences
 GenPept                   6,749,369
 RefSeq                    3,261,525
 Third Party Annotation        5,079
 Swiss Prot                  243,887
 PIR                          30,236
 PRF                          12,079
 PDB                          89,953
 PAT Division                669,035
 Total                    10,392,118
 BLAST nr total            4,180,857
 (no patents or env)
            GenPept: GenBank CDS
                 translations
FEATURES      Location/Qualifiers
     source   1..2484
              /organism="Homo sapiens"
              /mol_type="mRNA"
              /db_xref="taxon:9606"
              /chromosome="3"
              /map="3p22-p23"
     gene     1..2484
              /gene="MLH1"
     CDS      22..2292
              /gene="MLH1"
              /note="homolog of S. cerevisiae PMS1 (Swiss-Prot Accession
              Number P14242), S. cerevisiae MLH1 (GenBank Accession
              Number U07187), E. coli MUTL (Swiss-Prot Accession Number
              P23367), Salmonella typhimurium MUTL (Swiss-Prot Accession
              Number P14161) and Streptococcus pneumoniae (Swiss-Prot
              Accession Number P14160)"
              /codon_start=1
              /product="DNA mismatch repair protein homolog"
              /protein_id="AAC50285.1"
              /db_xref="GI:463989"
              /translation="MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKS
              TSIQVIVKEGGLKLIQIQDNGTGIRKEDLDIVCERFTTSKLQSFEDLASISTYGFRGE
              ALASISHVAHVTITTKTADGKCAYRASYSDGKLKAPPKPCAGNQGTQITVEDLFYNIA
              TRRKALKNPSEEYGKILEVVGRYSVHNAGISFSVKKQGETVADVRTLPNASTVDNIRS

GenPept record for this CDS translation (slide overlay):
>gi|463989|gb|AAC50285.1| DNA mismatch repair prote...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
              Redundant Proteins
>gi|463989|gb|AAC50285.1| DNA mismatch repair prote...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...

>gi|13905126|gb|AAH06850.1| MutL protein homolog 1 ...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...   GenPept
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...

>gi|1079787|gb|AAA82079.1| DNA mismatch repair prot...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...

>gi|4557757|ref|NP_000240.1| MutL protein homolog 1...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...   NCBI RefSeq
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...

>gi|730028|sp|P40692|MLH1_HUMAN DNA mismatch repair...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...   Swiss-Prot
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...

>gi|741682|prf||2007430A DNA mismatch repair protei...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...   PRF
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
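The source database of each redundant protein above is visible in its FASTA definition line. The sketch below splits the gi|number|tag|accession| layout used on these slides and groups GIs by source; the tag-to-name table follows the slide labels, and the parsing rule is an assumption based only on the lines shown here.

```python
from collections import defaultdict
from typing import NamedTuple

# Source labels as used on the slides.
DB_TAGS = {
    "gb": "GenPept", "ref": "NCBI RefSeq", "sp": "Swiss-Prot",
    "prf": "PRF", "pdb": "PDB",
}


class Defline(NamedTuple):
    gi: int
    db_tag: str
    accession: str  # empty for the PRF entry, whose ID follows a double bar
    rest: str


def parse_defline(line):
    """Parse a '>gi|<gi>|<tag>|<accession>|<rest>' definition line."""
    if not line.startswith(">gi|"):
        raise ValueError(f"unexpected definition line: {line!r}")
    fields = line[1:].split("|", 4)
    if len(fields) < 5:
        raise ValueError(f"too few fields: {line!r}")
    _, gi, tag, accession, rest = fields
    return Defline(int(gi), tag, accession, rest.strip())


def group_by_source(deflines):
    """Map each source database name to the GIs that came from it."""
    groups = defaultdict(list)
    for line in deflines:
        parsed = parse_defline(line)
        groups[DB_TAGS.get(parsed.db_tag, parsed.db_tag)].append(parsed.gi)
    return dict(groups)


if __name__ == "__main__":
    lines = [
        ">gi|463989|gb|AAC50285.1| DNA mismatch repair prote...",
        ">gi|4557757|ref|NP_000240.1| MutL protein homolog 1...",
        ">gi|730028|sp|P40692|MLH1_HUMAN DNA mismatch repair...",
    ]
    print(group_by_source(lines))
```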
          Protein Sequences from
                 Structures




>gi|5542073|pdb|1B63|A Chain A, Mutl Complexed With Adpnp
SHMPIQVLPPQLANQIAAGEVVERPASVVKELVENSLDAGATRIDIDIERGGAKLIRIRDNGCGIKKDEL
ALALARHATSKIASLDDLEAIISLGFRGEALASISSVSRLTLTSRTAEQQEAWQAYAEGRDMNVTVKPAA
HPVGTTLEVLDLFYNTPARRKFLRTEKTEFNHIDEIIRRIALARFDVTINLSHNGKIVRQYRAVPEGGQK
ERRLGAICGTAFLEQALAIEWQHGDLTLRGWVADPNHTTPALAEIQYCYVNGRMMRDRLINHAIRQACED
KLGADQQPAFVLYLEIDPHQVDVNVHPAKHEVRFHQSRLVHDFIYQGVLSVLQ
               Primary vs. Derivative
               Sequence Databases
[Figure: labs and sequencing centers submit sequences to GenBank, which is updated ONLY by submitters. The derivative databases are updated continually by NCBI: RefSeq (curators, genome assembly) and UniGene (algorithms).]
RefSeq: NCBI’s Derivative Sequence
                      Database
 • Curated transcripts and proteins
    – reviewed
    – human, mouse, rat, fruit fly, zebrafish, arabidopsis,
      microbial genomes (proteins), and more
 • Model transcripts and proteins
 • Assembled Genomic Regions (contigs)
    – human genome       – chicken
    – mouse genome       – honeybee
    – rat genome         – sea urchin
 • Chromosome records
    – Human genome
    – microbial
    – organelle

  Entrez query: srcdb_refseq[Properties]


  ftp://ftp.ncbi.nih.gov/refseq/release/
    Selected RefSeq Accession
             Numbers
mRNAs and Proteins
NM_123456            Curated mRNA
NP_123456            Curated Protein
NR_123456            Curated non-coding RNA
XM_123456            Predicted mRNA
XP_123456            Predicted Protein
XR_123456            Predicted non-coding RNA
Gene Records
NG_123456            Reference Genomic Sequence
Chromosome
NC_123456            Microbial replicons, organelle
Assemblies
NT_123456            Contig
NW_123456            WGS Supercontig
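Because each RefSeq record type has its own accession prefix, the table above works as a lookup. The sketch below maps a prefix to the description given on the slide; the function name is illustrative, and prefixes not listed on the slide return None.

```python
# Prefix descriptions copied from the slide.
REFSEQ_PREFIXES = {
    "NM": "Curated mRNA",
    "NP": "Curated Protein",
    "NR": "Curated non-coding RNA",
    "XM": "Predicted mRNA",
    "XP": "Predicted Protein",
    "XR": "Predicted non-coding RNA",
    "NG": "Reference Genomic Sequence",
    "NC": "Microbial replicons, organelle",
    "NT": "Contig",
    "NW": "WGS Supercontig",
}


def classify_refseq(accession):
    """Return the slide's description for a RefSeq accession, or None."""
    prefix, separator, _ = accession.partition("_")
    if not separator:
        return None  # no underscore, so not a RefSeq-style accession
    return REFSEQ_PREFIXES.get(prefix.upper())


if __name__ == "__main__":
    for acc in ("NM_000249", "NP_000240.1", "AY182241"):
        print(acc, "->", classify_refseq(acc))
```

The missing underscore is also a quick way to tell a GenBank accession such as AY182241 from a RefSeq one.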
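GenBank to RefSeq
  RefSeqs: Annotation Reagents
[Figure: genomic DNA (NC, NT, NW) is scanned to give model mRNA (XM), model non-coding RNA (XR), and model protein (XP), which are compared (=?) with curated mRNA (NM), curated non-coding RNA (NR), and curated protein (NP). Layers shown: RefSeq above GenBank sequences.]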
                RefSeq Benefits
•   non-redundancy
•   explicitly linked nucleotide and protein sequences
•   updates to reflect current sequence data and biology
•   data validation
•   format consistency
•   distinct accession series
•   stewardship by NCBI staff and collaborators
[Figure: mouse assembly, with labels for WGS, other GenBank, BAC, RefSeq contig, RefSeq transcript, and UniGene transcript.]
Expressed Sequences

      UniGene
       GEO
       What is UniGene?
A gene-oriented view of sequence entries
• MegaBlast-based automated sequence clustering
• Now informed by genome hits (new)
• Nonredundant set of gene-oriented clusters
• Each cluster represents a unique gene
• Information on tissue types and map locations
• Includes known genes and uncharacterized ESTs
• Useful for gene discovery and selection of mapping reagents
EST hits: Human mRNA
[Figure: albumin mRNA with 5’ EST hits and 3’ EST hits.]
UniGene
[Figure: UniGene collections for chordates, plants, invertebrates, and fungi et al.]
Xenopus laevis MLH1 Cluster
[Screenshot; callout: Uncharacterized ESTs.]
UniGene: Expressed Sequences
Expression Data
   Other NCBI Databases

•Structure:   imported structures (PDB)
              Cn3D viewer, NCBI curation

•CDD:         conserved domain database
              Protein families (COGs and KOGs)
              Single domains (PFAM, SMART, CD)

•dbSNP:       nucleotide polymorphism
•Gene:        gene records
              Unifies LocusLink and Microbial Genomes
NCBI Structures and
     Domains
 MMDB: Molecular Modeling Database

• Derived from experimentally determined PDB records
• Value added to PDB records including:
   – Addition of explicit chemical graph information
   – Validation (secondary structure elements)
   – Inclusion of Taxonomy, Citation
   – Conversion to ASN.1 data description language
• Structure neighbors determined by
       Vector Alignment Search Tool (VAST)
Cn3D 4.1: Bacillus thuringiensis Toxin
       VAST: Structure Neighbors
            Vector Alignment Search Tool

• For each protein chain, locate SSEs (secondary structure elements) and represent them as individual vectors.
• Align the vectors.
[Figure: numbered SSE vectors (1 to 6) for human IL-4, and an alignment of IL-4 and leptin.]
             Protein Domains

• Structural Domain
  – Discrete independently folding unit of a protein
• Conserved Domain (sequence-based)
  – Protein region with recognizable position specific
    pattern of sequence conservation
• Sequence-based domains often roughly
  correspond to structural domains
• Domains often have distinct, identifiable
  functions
 NCBI’s Conserved Domain
         Database
• PSI-BLAST-based score matrices
• Searchable with RPS-BLAST
• Sources
  – SMART
  – PFAM
  – COGs
  – NCBI curated domains
    • structure informed alignments
Src Domains
Structure vs Conserved Domain
[Figure: Cn3D view with SH2, SH3, and TyrKC domains labeled; callout: conserved phosphotyrosine binding residues.]
NCBI Molecular Biology
     Resources
      Using Entrez
WWW Access: Entrez & BLAST
               Entrez: Database
                       Integration
[Figure: linked Entrez databases. PubMed abstracts (neighbors by word weight); Taxonomy (phylogeny); Genomes; 3-D Structure (VAST neighbors: related structures); Nucleotide sequences (BLAST neighbors: related sequences); Protein sequences (BLAST neighbors: related sequences; BLink; domains). Hard links connect the databases.]
   Database Searching with
           Entrez
Using limits and field restriction to find human MutL homolog
Linking and neighboring with MutL
Mapping SNPs onto structure and the genome
 Global NCBI (Entrez) Search


Human hereditary nonpolyposis colon cancer
Global Entrez Search
       Results
Nucleotide Sequences

           Nucleotide database now three parts
           • EST: expressed sequence tags
           • GSS: genome survey sequences
           • CoreNucleotide: everything else
Advanced Search Options
             Tabs
     More Precise Nucleotides
             Search




nonpolyposis[All Fields] AND colon cancer[Title] AND human[Organism]
AND biomol_mrna[Properties] AND srcdb_refseq[Properties]
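The query above is a set of field-restricted terms joined with AND, so it can be assembled programmatically. The sketch below rebuilds the same string and wraps it in an E-utilities esearch URL. E-utilities is not covered in this deck, so the endpoint, the nuccore database name, and all helper names are assumptions of this example, and no request is actually sent.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (assumed here; the slides use the web form).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field_term(text, field):
    """Restrict a search term to one Entrez field, e.g. human[Organism]."""
    return f"{text}[{field}]"


def and_query(*terms):
    """Join field-restricted terms so that every one must match."""
    return " AND ".join(terms)


def esearch_url(database, query):
    """Build (but do not send) an esearch request URL for the query."""
    return f"{ESEARCH_URL}?{urlencode({'db': database, 'term': query})}"


if __name__ == "__main__":
    query = and_query(
        field_term("nonpolyposis", "All Fields"),
        field_term("colon cancer", "Title"),
        field_term("human", "Organism"),
        field_term("biomol_mrna", "Properties"),
        field_term("srcdb_refseq", "Properties"),
    )
    print(query)
    print(esearch_url("nuccore", query))  # "nuccore" is an assumed db name
```

Each added term narrows the result set: the two Properties terms limit hits to mRNA molecules and to RefSeq records, respectively.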
             Useful Field Restrictions
[Title]: Definition line in GenBank / GenPept format shown in Summary format

    glyceraldehyde 3 phosphate dehydrogenase[Title]

[Organism]: NCBI’s taxonomy. Organizing system for molecular databases

    mouse[organism]; green plants[organism]; Streptomyces coelicolor[organism]

[Properties]: molecule type, location, database source

    biomol_mrna[properties]; biomol_genomic[properties];
    gene_in_mitochondrion[properties]; srcdb_pdb[properties]


[Filter]: subsets of data, Entrez links

    all[filter]; nucleotide mapview[filter]; nucleotide omim[filter]
Organism Field: NCBI’s
     Taxonomy
Useful Properties Field Terms


        Molecule type           GenBank division
        biomol_mrna             gbdiv_est
        biomol_genomic          gbdiv_htg
                                gbdiv_xxx

        Gene location           Source Database
        gene_in_mitochondrion   srcdb_refseq
        gene_in_chloroplast     srcdb_pdb
        gene_in_genomic         srcdb_swiss_prot
         Human MutL RefSeq
[Screenshot; callout: GenBank Records.]
NM_000249: Links
Literature Links: OMIM
OMIM: Human Disease Genes
[Screenshot; callout: Conserved Domain.]
Sequence Links
Finding Homologs and Structures
Protein Link
[Screenshot; callouts: BLAST Link, Conserved Domains.]
Related Proteins: Homologs and Redundancy
[Screenshot; callouts: Bacterial Homologs, Redundant Sequences.]
BLink: BLAST Link
[Screenshot; callouts: Redundant GIs, top 200 only.]
BLink: non-redundant relatives
[Screenshot; callouts: zebrafish homolog, BLAST.]
Related Proteins: Structure Links
Structures
Short Cut: Related Structures
E. coli MutL Structure
[Screenshot; callouts: Cn3D viewer, Structure Neighbors, Pubchem compound, 3D Domain Neighbors, Conserved Domains.]
MLH1 Domain Structure: CDD
[Screenshot; callouts: ATPase Domain, Mismatch Repair Domain.]
MLH1: ATPase Domain
Mapping Polymorphisms onto Structure
GeneView: Variations Human MLH1
[Screenshot; callout: ATPase domain.]
Related Structures
Mapping Variation Onto Structure
[Figure: structure with Asn and Ile residues labeled; callouts: Ile – Val, Conserved Asn.]
Genome Resources
NM_000249: Genome Links
The Map Viewer
[Screenshot; callouts: Genome BLAST, Previous Builds Available.]
Map Viewer: Human MLH1
[Screenshot; callouts: Customizable, Transcripts, EST Hits, Download data and sequences, Models, NCBI Assembly, Gene Annotations.]
Maps and Options
Mapped Variations
Synteny: Mammalian Genomes
                   Homologene
• No longer UniGene based
• Protein similarities first
• Guided by taxonomic tree
• Includes orthologs and paralogs
[Figure: gene tree in which an early globin gene undergoes gene duplication into an A-chain gene and a B-chain gene. Frog A, chick A, and mouse A are orthologs, as are mouse B, chick B, and frog B; the A and B genes are paralogs.]
Homologene Cluster
Rice Homolog
          The Gene Database
• Gene-centered information
• Unifies LocusLink and Microbial Genomes
• 2.4 million records for 3,822 taxa
Human          38,603    Sea Urchin        30,603
Chimpanzee     31,502    Mosquito          13,763
Mouse          60,746    Fruit Fly         21,116
Rat            38,117    C. elegans        20,935
Dog            20,154    Fungi            168,802
Cow            23,677    Green Plants      76,847
Chicken        18,469    Archaea           74,627
Zebrafish      38,594    Bacteria       1,361,390
Genes MLH1: One Stop Shopping
Genes MLH1: One Stop Shopping
                          (cont.)
Genes: Display Options and
          Links

				