A Field Guide to GenBank and NCBI Resources


									          NCBI Molecular Biology Resources

                     A Field Guide




                                             NCBI
Nov. 6, 2001
            NCBI Resources

   About NCBI
   NCBI Sequence Databases
    • Primary Database – GenBank
    • Derivative Databases - RefSeq
   Entrez Databases and Text Searching
   BLAST Services




   Genomic Resources
         The National Center for Biotechnology
                  Information (NCBI)
   Created as a part of the National Library of Medicine in
    1988
    •   Establish public databases
    •   Research in computational biology
    •   Develop software tools for sequence analysis
    •   Disseminate biomedical information
   Tools: BLAST (1990), Entrez (1992)
   GenBank (1992)




   Free MEDLINE (PubMed, 1997)
   Other databases: dbEST, dbGSS, dbSTS, MMDB, OMIM,
    UniGene, GeneMap, Taxonomy, CGAP, SAGE, LocusLink,
    RefSeq
                         Molecular Databases
   Primary Databases
    •   Original submissions by experimentalists
    •   Database staff organize but don’t add additional information
         • Example: GenBank
   Derivative Databases
    •   Human curated
         • compilation and correction of data
         • Example: SWISS-PROT, NCBI RefSeq mRNA
    •   Computationally Derived
         • Example: UniGene




    •   Combinations
         • Example: NCBI Genome Assembly
What is GenBank? NCBI’s Primary Sequence Database
   Nucleotide-only sequence database
   Archival in nature
   GenBank Data
    •   Direct submissions of individual records (BankIt, Sequin)
    •   Batch submissions via email (EST, GSS, STS)
    •   FTP accounts for sequencing centers
   Data shared nightly among three collaborating databases
    • GenBank




    • DNA Database of Japan (DDBJ).
    • European Molecular Biology Laboratory Database (EMBL) at EBI.
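[Figure: data exchange among the three collaborating databases. Each site accepts submissions and updates and shares them with the other two: GenBank (NCBI, NIH; searched through Entrez), EMBL (EBI; searched through SRS), and DDBJ (CIB, NIG; searched through getentry).]
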
               GenBank

Release 126            October 2001
    13,602,262         Records
14,396,883,064         Nucleotides
        80,000 +       Species
• full release every two months
• incremental and cumulative updates daily
• available only through the Internet
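
As a quick check on the scale of these figures, the sketch below computes the average record length from the two totals quoted for Release 126. The constant and function names are illustrative only.

```python
# Totals copied from the Release 126 slide.
RECORDS = 13_602_262
NUCLEOTIDES = 14_396_883_064


def mean_record_length(nucleotides: int, records: int) -> float:
    """Average number of nucleotides per record."""
    return nucleotides / records


if __name__ == "__main__":
    average = mean_record_length(NUCLEOTIDES, RECORDS)
    print(f"about {average:.0f} nucleotides per record")
```

The average comes out at a little over one thousand nucleotides per record.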




     ftp://ncbi.nlm.nih.gov/genbank/
                     or
       ftp://genbank.sdsc.edu/pub/
             GenBank on FTP site




                        ftp> open ftp.ncbi.nlm.nih.gov
                        .
                        .
                        ftp> cd genbank




Release 125: 243 files; 55.23 Gigabytes uncompressed
            GenBank Divisions
Bulk Sequence Divisions
PAT      Patent
EST      Expressed Sequence Tags (133 files)
STS      Sequence Tagged Site
GSS      Genome Survey Sequence (41 files)
HTG      High Throughput Genome (25 files)
HTC      High Throughput cDNA
CON      Contig
Traditional Divisions
BCT INV MAM PHG PLN PRI
ROD SYN UNA VRL VRT
   EST Division: Expressed Sequence Tags
>IMAGE:275615 5' mRNA sequence
GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGGCC
TGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAGAATGGAAAGTCAAAT
TTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAGTTGACTTACTGAAGAATGGAGAGA
GAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTGTACTACAC
TGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNCCC
AAGTTNAGTTTAAGTGGGNATCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCNTT
TTGGATTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG
>IMAGE:275615 3', mRNA sequence
NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTACT
TTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTCATTCATTATAACAAATTTCC
AATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTTAA
CTTATGCACGCTTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAGAT
GTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANGTTGCCAGCCCTC

[Figure: how ESTs are made. About 30,000 genes in the nucleus give 80-100,000 RNA gene products; these are used to make a cDNA library of 80-100,000 unique cDNA clones; unique clones are isolated and sequenced once from each end (5' and 3').]
      STS Division : Sequence Tagged Sites

   Segment of gene, EST, mRNA or genomic DNA of known
    position (microsatellite)
   PCR with STS primers gives unique product (one per
    genome)
   Basis of Radiation Hybrid Mapping
     • UniGene
     • Genome Assembly




   Related resource: Electronic PCR
    http://www.ncbi.nlm.nih.gov/genome/sts/epcr.cgi
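                      RH mapping using STSs

[Figure: a human chromosome carrying STS markers A, B, C, and D is broken into fragments, and different fragments are retained in different hybrid cells.]

PCR Results (one column per hybrid cell line)

A         +                      -                   +
B         +                      -                   +
C         -                      +                   -
D         +                      +                   -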
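
The idea behind the table can be sketched in a few lines: markers that lie close together on the chromosome tend to be retained or lost together, so their rows of PCR results look alike. The code below scores each pair of markers by the fraction of hybrid cell lines in which they agree. This is a deliberately simplified illustration; real radiation hybrid mapping uses many more cell lines and statistical linkage scores, and the function names here are hypothetical.

```python
from itertools import combinations

# Retention patterns copied from the PCR results table,
# one character per hybrid cell line.
PCR_RESULTS = {
    "A": "+-+",
    "B": "+-+",
    "C": "-+-",
    "D": "++-",
}


def concordance(pattern_1: str, pattern_2: str) -> float:
    """Fraction of hybrid cell lines in which two markers give the same result."""
    agreements = sum(a == b for a, b in zip(pattern_1, pattern_2))
    return agreements / len(pattern_1)


def rank_pairs(results: dict[str, str]) -> list[tuple[str, str, float]]:
    """Return all marker pairs, most concordant first."""
    pairs = [
        (m1, m2, concordance(results[m1], results[m2]))
        for m1, m2 in combinations(sorted(results), 2)
    ]
    return sorted(pairs, key=lambda pair: -pair[2])


if __name__ == "__main__":
    for marker_1, marker_2, score in rank_pairs(PCR_RESULTS):
        print(f"{marker_1}-{marker_2}: {score:.2f}")
```

With the table above, A and B agree in every cell line and therefore rank first, while A and C never agree.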
        ePCR Results Hexokinase 1 EST
SHGC-35892

dbSTS id: 44155, GenBank Accession: G29974
Organism: Homo sapiens
Primer1: CATACGACACGGCTCACAAA
Primer2: CTGTTTGTCTCGTGGGGG
STS location: 30..160 Chromosome: 10
Expected amplicon size: 129, Observed amplicon size: 130
Primers match in forward orientation

Query sequence:

  1   TTTTTGAATT   GGTACAAAGT   TTACTAGGTC   ATACGACACG   GCTCACAAAG   CGGTGGGAAA




 61   TTCCAGTGAT   GGCATTGTTT   GTTGGTTGGT   TCCTTTTATC   CAAATGGAGA   CAAGACACAT
121   TTCCGCAGAC   GTGTCCACCT   CCCCCCACGA   GACAAACAGA   ATGCAAGACT   GTCACACGCG
181   GCTAGGACTG   GTTCCACGGA   CACACGATTT   TGTGGCATTG   ACACACCACG   ATGCGATGCC
241   AGGCCACAGT   GGGTGCCAGG   AGGGGAGGAA   GCAGCTAATG   CTATGCCCAC   ACTCGCCTTC
301   AGCATGTGCC   CCGGGAGGAG   GCCCGGCAGT   GTCTGCTGGT   GATAATACAT   TTCACACGGG
361   GAGGGGGAAC   CAAGGATGAG   CTTTGGAGGC   CAGAAGGCTG   TCAGGTGGTG   TG
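
The primer match reported above can be reproduced with a minimal sketch of the electronic PCR idea: find the first primer on the query, find the reverse complement of the second primer downstream of it, and report the span between them. The sketch assumes exact matches and forward orientation only; the actual e-PCR program is more tolerant, and the function names are illustrative.

```python
# First 180 bases of the query sequence shown above (spaces removed below).
QUERY = "".join("""
TTTTTGAATT GGTACAAAGT TTACTAGGTC ATACGACACG GCTCACAAAG CGGTGGGAAA
TTCCAGTGAT GGCATTGTTT GTTGGTTGGT TCCTTTTATC CAAATGGAGA CAAGACACAT
TTCCGCAGAC GTGTCCACCT CCCCCCACGA GACAAACAGA ATGCAAGACT GTCACACGCG
""".split())

PRIMER_1 = "CATACGACACGGCTCACAAA"
PRIMER_2 = "CTGTTTGTCTCGTGGGGG"

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_amplicon(template: str, forward: str, reverse: str):
    """Return (start, end, size) in 1-based coordinates, or None if no product."""
    start = template.find(forward)
    if start < 0:
        return None
    # The second primer anneals to the opposite strand, so on the query
    # it appears as its reverse complement, downstream of the first primer.
    site = reverse_complement(reverse)
    site_start = template.find(site, start + len(forward))
    if site_start < 0:
        return None
    end = site_start + len(site)  # 0-based exclusive == 1-based inclusive
    return start + 1, end, end - start


if __name__ == "__main__":
    print(find_amplicon(QUERY, PRIMER_1, PRIMER_2))  # (30, 159, 130)
```

The 130-base product agrees with the observed amplicon size in the report.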
                 Genome Sequencing

[Figure: a whole BAC insert (or genome) is fragmented by sonication; the fragments are cloned, isolated, and sequenced (GSS division), then assembled into draft sequence (HTG division).]
  GSS Division: Genome Survey Sequences

         •Genomic equivalent of ESTs
         •BAC and other first pass surveys
         •BAC end sequences
         •Whole Genome Shotgun (some)
         •RAPIDS and other anonymous loci

[Figure: a genomic clone (BAC) with its T7 end and SP6 end sequences marked.]
HTG Division: High Throughput Genome Records

 phase 1                                   HTG
 Acc = AC008701   gi = 6601005

 phase 2                                   HTG
 Acc = AC008701   gi = 6671909


 phase 3                                   PRI
 Acc = AC008701    gi = 7328720



                  40,000 to > 350,000 bp
The GenBank Record




             A Simple GenBank Record
LOCUS        AF062069     3808 bp     mRNA           INV       02-MAR-2000
DEFINITION   Limulus polyphemus myosin III mRNA, complete cds.
ACCESSION    AF062069
VERSION      AF062069.2 GI:7144484
KEYWORDS     .
SOURCE       Atlantic horseshoe crab.
  ORGANISM   Limulus polyphemus
             Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata;
             Xiphosura; Limulidae; Limulus.
REFERENCE    1 (bases 1 to 3808)
  AUTHORS    Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
             Greenberg,R.M. and Smith,W.C.
  TITLE      A myosin III from Limulus eyes is a clock-regulated phosphoprotein
  JOURNAL    J. Neurosci. (1998) In press
REFERENCE    2 (bases 1 to 3808)
  AUTHORS    Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
             Greenberg,R.M. and Smith,W.C.
  TITLE      Direct Submission
  JOURNAL    Submitted (29-APR-1998) Whitney Laboratory, University of Florida,
             9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
REFERENCE    3 (bases 1 to 3808)
  AUTHORS    Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
             Greenberg,R.M. and Smith,W.C.
  TITLE      Direct Submission
  JOURNAL    Submitted (02-MAR-2000) Whitney Laboratory, University of Florida,
             9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
  REMARK     Sequence update by submitter
COMMENT      On Mar 2, 2000 this sequence version replaced gi:3132700.
                  GenBank Record, cont.
FEATURES             Location/Qualifiers
     source          1..3808
                     /organism="Limulus polyphemus"
                     /db_xref="taxon:6850"
                     /tissue_type="lateral eye"
     CDS             258..3302
                     /note="N-terminal protein kinase domain; C-terminal myosin
                     heavy chain head; substrate for PKA"
                     /codon_start=1
                     /product="myosin III"
                     /protein_id="AAC16332.2"
                     /db_xref="GI:7144485"

                     /translation="MEYKCISEHLPFETLPDPGDRFEVQELVGTGTYATVYSAIDKQA
                     NKKVALKIIGHIAENLLDIETEYRIYKAVNGIQFFPEFRGAFFKRGERESDNEVWLGI
                     EFLEEGTAADLLATHRRFGIHLKEDLIALIIKEVVRAVQYLHENSIIHRDIRAANIMF
                     SKEGYVKLIDFGLSASVKNTNGKAQSSVGSPYWMAPEVISCDCLQEPYNYTCDVWSIG
                     ITAIELADTVPSLSDIHALRAMFRINRNPPPSVKRETRWSETLKDFISECLVKNPEYR
                     PCIQEIPQHPFLAQVEGKEDQLRSELVDILKKNPGEKLRNKPYNVTFKNGHLKTISGQ

BASE COUNT     1201 a    689 c    782 g   1136 t
ORIGIN
        1 tcgacatctg tggtcgcttt ttttagtaat aaaaaattgt attatgacgt cctatctgtt

     3781 aagatacagt aactagggaa aaaaaaaa
//
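
One simple consistency check on a record like this is that the four numbers on the BASE COUNT line add up to the length given on the LOCUS line. The sketch below does that for the record above; the parsing is a minimal illustration, not a full flat-file parser, and the names are hypothetical.

```python
import re

# Copied from the record above.
DECLARED_LENGTH = 3808  # from the LOCUS line
BASE_COUNT_LINE = "BASE COUNT     1201 a    689 c    782 g   1136 t"


def base_counts(line: str) -> dict[str, int]:
    """Extract the per-base totals from a BASE COUNT line."""
    return {base: int(count) for count, base in re.findall(r"(\d+) ([acgt])\b", line)}


if __name__ == "__main__":
    counts = base_counts(BASE_COUNT_LINE)
    print(counts, sum(counts.values()) == DECLARED_LENGTH)
```

For this record the counts sum to 3808, matching the LOCUS line.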
             Sequence and Database Identifiers
                        Locus, accession, gi, version

LOCUS        AF062069     3808 bp   mRNA         INV       02-MAR-2000
DEFINITION   Limulus polyphemus myosin III mRNA, complete cds.
ACCESSION    AF062069
VERSION      AF062069.2   GI:7144484

LOCUS line fields
    • Locus name: AF062069
    • Sequence length: 3808 bp
    • Mol-type: mRNA (= cDNA); other values include rRNA, snRNA, DNA
    • GB division: INV
    • Modification date: 02-MAR-2000
DEFINITION: DEF line (title)
ACCESSION: accession number
VERSION: accession.version (AF062069.2) and gi number (GI:7144484)
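
The identifiers on these lines can be pulled apart with ordinary string handling. The sketch below assumes the simple whitespace-separated layout shown here; LOCUS lines that carry extra fields (for example a topology such as "circular") would need more careful parsing, and the type and function names are illustrative.

```python
from typing import NamedTuple


class Locus(NamedTuple):
    name: str
    length: int
    mol_type: str
    division: str
    date: str


def parse_locus(line: str) -> Locus:
    """Split a simple LOCUS line into its fields."""
    # tokens: LOCUS, name, length, "bp", mol-type, division, date
    tokens = line.split()
    return Locus(tokens[1], int(tokens[2]), tokens[4], tokens[5], tokens[6])


def parse_version(line: str) -> tuple[str, int, int]:
    """Return (accession, version, gi) from a VERSION line."""
    _, accession_version, gi_field = line.split()
    accession, version = accession_version.split(".")
    return accession, int(version), int(gi_field.split(":")[1])


if __name__ == "__main__":
    print(parse_locus("LOCUS        AF062069     3808 bp   mRNA         INV       02-MAR-2000"))
    print(parse_version("VERSION      AF062069.2   GI:7144484"))
```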
              Keywords, Source-organism
KEYWORDS     .
SOURCE       Atlantic horseshoe crab.
  ORGANISM   Limulus polyphemus
             Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata;
             Xiphosura; Limulidae; Limulus.

    • KEYWORDS: legacy field; exceptions are EST, GSS, and HTG records
    • SOURCE: accepted common name
    • ORGANISM: scientific name, followed by the taxonomic lineage according to GenBank
                                 Citation
REFERENCE   1 (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     A myosin III from Limulus eyes is a clock-regulated phosphoprotein
  JOURNAL   J. Neurosci. (1998) In press
REFERENCE   2 (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (29-APR-1998) Whitney Laboratory, University of Florida,
            9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
REFERENCE   3 (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (02-MAR-2000) Whitney Laboratory, University of Florida,
            9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
  REMARK    Sequence update by submitter
COMMENT     On Mar 2, 2000 this sequence version replaced gi:3132700.

    • REFERENCE 1: article
    • REFERENCE 2: submitter block
    • REFERENCE 3 (with REMARK): update history
    • COMMENT: previous version
                        Feature Table
FEATURES             Location/Qualifiers
     source          1..3808
                     /organism="Limulus polyphemus"
                     /db_xref="taxon:6850"
                     /tissue_type="lateral eye"
     CDS             258..3302
                     /note="N-terminal protein kinase domain;
                     C-terminal myosin heavy chain head; substrate for PKA"
                     /codon_start=1
                     /product="myosin III"
                     /protein_id="AAC16332.2"
                     /db_xref="GI:7144485"
                     /translation="MEYKCISEHLPFETLPDPGDRFEVQELVGTGTYATVYSAIDK
                     NKKVALKIIGHIAENLLDIETEYRIYKAVNGIQFFPEFRGAFFKRGERESDNEVWL
                     "

    • source: biosource
    • CDS: coding sequence
    • /codon_start: reading frame
    • /protein_id, /db_xref: GenPept protein identifiers
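
The CDS location and /codon_start together say how much protein to expect. The sketch below works this out for a simple `start..end` location, under the stated assumption that the CDS is complete and ends with a stop codon (the DEFINITION line says "complete cds"). It does not handle join(), complement(), or partial locations, and the function names are illustrative.

```python
def span_length(location: str) -> int:
    """Length in bases of a simple 'start..end' location (inclusive coordinates)."""
    start, end = (int(part) for part in location.split(".."))
    return end - start + 1


def expected_protein_length(location: str, codon_start: int = 1) -> int:
    """Residues expected from a complete CDS, not counting the stop codon."""
    coding_bases = span_length(location) - (codon_start - 1)
    codons = coding_bases // 3
    return codons - 1  # assumption: the last codon is a stop codon


if __name__ == "__main__":
    print(span_length("258..3302"))              # 3045
    print(expected_protein_length("258..3302"))  # 1014
```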
                               Sequence

BASE COUNT      1201 a    689 c    782 g    1136 t
ORIGIN
        1 tcgacatctg tggtcgcttt ttttagtaat aaaaaattgt attatgacgt cctatctgtt
       <sequence omitted>
     3721 accaatgtta taatatgaaa tgaaataaag cagtcatggt agcagtggct gtttgaaata
     3781 aagatacagt aactagggaa aaaaaaaa
//

    • ORIGIN: indicates beginning of sequence data
    • //: end of record
NCBI Derivative Sequence Databases: RefSeq
    NCBI Reference Sequences
    mRNAs and Proteins
    NM_123456   Curated mRNA
    NP_123456   Curated Protein
    XM_123456   Predicted Transcript
    XP_123456   Predicted Protein

    Gene Records
    NG_123456 Reference Genomic Sequence




    Assemblies
    NT_123456 Contig (Mouse and Human Genomes)
    NC_123456 Chromosome (Microbial Genomes)
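
Because the two-letter prefix carries the record type, an accession can be classified by looking at the text before the underscore. The sketch below uses only the prefixes listed on this slide; the table and function name are illustrative.

```python
# Prefix meanings as listed on the slide.
REFSEQ_PREFIXES = {
    "NM": "Curated mRNA",
    "NP": "Curated Protein",
    "XM": "Predicted Transcript",
    "XP": "Predicted Protein",
    "NG": "Reference Genomic Sequence",
    "NT": "Contig",
    "NC": "Chromosome",
}


def classify_accession(accession: str):
    """Return the RefSeq record type for an accession, or None if the prefix is not listed."""
    prefix, separator, _ = accession.partition("_")
    if not separator:
        return None  # no underscore: not a RefSeq-style accession
    return REFSEQ_PREFIXES.get(prefix)


if __name__ == "__main__":
    for accession in ("NM_000492", "NP_000483", "NT_007935.1", "AF062069"):
        print(accession, "->", classify_accession(accession))
```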
         Curated RefSeq Records: NM_, NP_
RefSeq Nucleotide
LOCUS        NM_000492     6159 bp    mRNA            PRI      26-JUL-1999
DEFINITION   Homo sapiens cystic fibrosis transmembrane conductance
             regulator (CFTR) mRNA.
ACCESSION    NM_000492

RefSeq Protein
LOCUS        NP_000483     1480 aa                    PRI      26-JUL-1999
DEFINITION   cystic fibrosis transmembrane conductance regulator.
ACCESSION    NP_000483
PID          g4502785
VERSION      NP_000483.1  GI:4502785
DBSOURCE     REFSEQ: accession NM_000492.1
COMMENT      REFSEQ: This reference sequence was derived from M55131.
             PROVISIONAL RefSeq: This is a provisional reference sequence
             record that has not yet been subject to human review. The final
             curated reference sequence record may be somewhat different from
             this one.

Reviewed
    REFSEQ: This reference sequence was derived from M28668.1, M55131.1.
    On Feb 17, 2000 this sequence version replaced gi:4502784.
    Summary: Cystic fibrosis transmembrane conductance regulator is
    member 7 of the ATP-binding cassette sub-family C. The protein
    functions as a chloride channel and controls the regulation of
    other transport pathways. Mutations in this gene cause the
    autosomal recessive disorder, cystic fibrosis (CF) and congenital
    bilateral aplasia of the vas deferens (CBAVD). Alternative splice
    variants have been described, many of which result from mutations
    in the CFTR gene.
    COMPLETENESS: full length.
  Alignment Generated Transcripts: XM_, XP_



LOCUS        XM_004980    6128 bp    mRNA            PRI       16-NOV-2000
DEFINITION   Homo sapiens cystic fibrosis transmembrane conductance regulator,
             ATP-binding cassette (sub-family C, member 7) (CFTR), mRNA.
ACCESSION    XM_004980                  mismatch
VERSION      XM_004980.3 GI:13631444




                  RefSeq Human Contig: NT_
 LOCUS       NT_007935 1888399 bp        DNA                  CON       16-NOV-2000
 DEFINITION  Homo sapiens chromosome 7 working draft sequence segment,
             complete sequence.
 ACCESSION   NT_007935
 VERSION     NT_007935.1 GI:11422165
 KEYWORDS    HTG.
 SOURCE      human.
   ORGANISM  Homo sapiens
             Eukaryota; Metazoa; Chordata; Craniata; Vertebrata;
             Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini;
             Hominidae; Homo.
 REFERENCE   1 (bases 1 to 1888399)
   AUTHORS   International Human Genome Project collaborators.
   TITLE     Toward the complete sequence of the human genome
   JOURNAL   Unpublished
 COMMENT     GENOME ANNOTATION REFSEQ: NCBI contigs are derived from
             assembled genomic sequence data. They may include both
             draft and finished sequence.
             COMPLETENESS: not full length.

Annotated features
     mRNA            complement(join(1255889..1257642,1258986..1259091,
                     1259690..1259862,1271619..1271708,1281957..1282112,
                     1296780..1297028,1309837..1309937,1312742..1312969,
                     1313881..1314031,1317797..1317876,1320768..1321018,
                     1321687..1321724,1329492..1329620,1331893..1332616,
                     1334111..1334197,1336717..1336811,1364895..1365086,
                     1375727..1375909,1382442..1382534,1384204..1384450,
                     1387877..1388002,1389139..1389302,1390185..1390274,
                     1393436..1393651,1415408..1415516,1420187..1420297,
                     1444403..1444587))
                     /partial
                     /gene="CFTR"
                     /product="cystic fibrosis transmembrane conductance
                     regulator, ATP-binding cassette (sub-family C, member 7)"
                     /transcript_id="XM_004980.1"
                     /db_xref="LocusID:1080"
                     /db_xref="MIM:602421"
                     /note="derived by automated computational analysis using
                     gene prediction method: Acembly. Supporting evidence
                     includes similarity to: 9 proteins, 1 mRNAs See details in
                     AceView"
     gene            complement(1255889..1444587)
                     /gene="CFTR"
                     /note="CF; MRP7; ABC35; ABCC7"
                     /db_xref="LocusID:1080"

Reordering draft sequence
CONTIG       join(AC073042.3:1155..2680,gap(100),AC074390.2:119526..151445,
             gap(100),AC074390.2:1..5245,gap(100),
             complement(AC074390.2:17705..23645),gap(100),
             AC074390.2:97658..119425,AC073042.3:106479..121155,
             AC074390.2:164226..165036,AC073042.3:70628..79503,gap(100),
             AC073042.3:4627..6382,gap(100),AC073042.3:2781..4526,gap(100),
             complement(AC073042.3:183627..209083),gap(100),
             AC073042.3:79604..88622,gap(100),AC073042.3:139234..160437,
             gap(100),complement(AC073042.3:6483..8319),gap(100),
             complement(AC073042.3:39354..45372),gap(100),
             complement(AC073042.3:21461..24064),gap(100),
             AC074390.2:156347..160294,gap(100),
             complement(AC074390.2:5346..10750),gap(100),
             complement(AC074390.2:153911..156246),gap(100),
             complement(AC074390.2:23746..32402),gap(100),
             complement(AC074390.2:151546..153810),gap(100),
             complement(AC074390.2:57277..75275),gap(100),
             complement(AC074390.2:75376..97557),gap(100),
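
The mRNA feature on this contig is located by a `complement(join(...))` expression, one interval per exon. A minimal sketch of reading such an expression is shown below: it collects the `start..end` pairs with a regular expression and reports how many intervals there are and what region they span. It is not a full feature-location parser, and the function name is illustrative.

```python
import re

# The mRNA location for CFTR on NT_007935, copied from the record above.
MRNA_LOCATION = (
    "complement(join(1255889..1257642,1258986..1259091,"
    "1259690..1259862,1271619..1271708,1281957..1282112,"
    "1296780..1297028,1309837..1309937,1312742..1312969,"
    "1313881..1314031,1317797..1317876,1320768..1321018,"
    "1321687..1321724,1329492..1329620,1331893..1332616,"
    "1334111..1334197,1336717..1336811,1364895..1365086,"
    "1375727..1375909,1382442..1382534,1384204..1384450,"
    "1387877..1388002,1389139..1389302,1390185..1390274,"
    "1393436..1393651,1415408..1415516,1420187..1420297,"
    "1444403..1444587))"
)


def parse_intervals(location: str) -> list[tuple[int, int]]:
    """Return every (start, end) pair found in a location expression."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]


if __name__ == "__main__":
    exons = parse_intervals(MRNA_LOCATION)
    on_minus_strand = MRNA_LOCATION.startswith("complement(")
    spliced_length = sum(end - start + 1 for start, end in exons)
    print(len(exons), "intervals; minus strand:", on_minus_strand)
    print("spans", exons[0][0], "to", exons[-1][1], "; spliced length", spliced_length)
```

The outermost coordinates recovered this way match the `gene` feature, complement(1255889..1444587).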
                  Map View of RefSeqs

[Figure: map view showing NT_, XM_, and NM_ records.]
RefSeq Genome Records: NG_




RefSeq Chromosomes: NC_
LOCUS        NC_002695 5498450 bp    DNA   circular BCT        02-OCT-2001
DEFINITION   Escherichia coli O157:H7, complete genome.
ACCESSION    NC_002695
VERSION      NC_002695.1 GI:15829254
KEYWORDS     .
SOURCE       Escherichia coli O157:H7.
  ORGANISM   Escherichia coli O157:H7
             Bacteria; Proteobacteria; gamma subdivision; Enterobacteriaceae;
             Escherichia.
REFERENCE    1 (sites)
  AUTHORS    Makino,K., Yokoyama,K., Kubota,Y., Yutsudo,C.H., Kimura,S.,
             Kurokawa,K., Ishii,K., Hattori,M., Tatsuno,I., Abe,H., Iida,T.,
             Yamamoto,K., Ohnishi,M., Hayashi,T., Yasunaga,T., Honda,T.,
             Sasakawa,C. and Shinagawa,H.
  TITLE      Complete nucleotide sequence of the prophage VT2-Sakai carrying the




             verotoxin 2 genes of the enterohemorrhagic Escherichia coli O157:H7
             derived from the Sakai outbreak
  JOURNAL    Genes Genet. Syst. 74 (5), 227-239 (1999)
  MEDLINE    20198780
   PUBMED    10734605
    Other NCBI Derivative Databases



UniGene   -   gene oriented expressed sequence
              clusters

LocusLink -   central resource and interface for
              known genes




[Figure: the NCBI homepage, with links to Entrez, similarity searching, and Mendelian Inheritance in Man.]
            Using Entrez

An integrated database search and retrieval system




      Entrez: Neighboring and Hard Links
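[Figure: Entrez neighboring and hard links. Linked databases: PubMed abstracts (neighbors by word weight), Taxonomy (phylogeny), 3-D Structure (MMDB; neighbors by VAST), Genomes, Nucleotide sequences (neighbors by BLAST), and Protein sequences (neighbors by BLAST).]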
                       WWW Entrez
    • PubMed: all of MEDLINE plus others; abstracts; links to online journals
    • Nucleotides: GenBank, EMBL, DDBJ, RefSeq, PDB
    • Proteins: GenBank, DDBJ, EMBL translations; PDB, PIR, SWISS-PROT, PRF, RefSeq
    • Structures: NCBI’s MMDB, derived from PDB
    • Genomes: reference genomes; graphical views, assembled sequence and mapping data
      Database Searching with Entrez

   Using limits and field restriction to find mouse GAPD
   Linking and neighboring with mouse GAPD




 Entrez Nucleotides




Mouse




Document Summaries: Mouse[All Fields]


           3 million records




          Chicken not mouse !?




Entrez Nucleotides: Limits: Preview/Index




         Mouse




                 Entrez Nucleotides: Limits

Search term: Mouse

Field Restriction
    Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter,
    Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism,
    Page Number, Primary Accession, Properties, Protein Name,
    Publication Date, SeqID String, Sequence Length, Substance Name,
    Text Word, Title Word, Uid

Exclude unwanted categories of sequences

Molecule
    Genomic DNA/RNA, mRNA, rRNA

Gene Location
    Genomic DNA/RNA, Mitochondrion, Chloroplast

Only From
    RefSeq, GenBank, EMBL, DDBJ
Entrez Nucleotides: Limits: Organism
       Mouse




Document Summaries: Mouse[Organism]




                  2,976,070[All Fields]
                 -2,921,009[Organism]
                     55,061




Exclude Bulk Sequences, mRNA




Adding Terms: Preview/Index
Term added: glyceraldehyde 3 phosphate dehydrogenase

Index fields
    Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter,
    Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism,
    Page Number, Primary Accession, Properties, Protein Name,
    Publication Date, SeqID String, Sequence Length, Substance Name,
    Text Word, Title Word, Uid, Volume

Search History
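
Field restriction is written by appending the field name in square brackets to a term, as in Mouse[Organism] earlier. The sketch below only builds such a query string; it does not contact Entrez. The choice of Title Word for the enzyme name is an assumption made for illustration, since the slides do not say which field was used, and the helper names are hypothetical.

```python
def field_term(term: str, field: str) -> str:
    """Restrict a search term to one index field, e.g. Mouse[Organism]."""
    return f"{term}[{field}]"


def all_of(*terms: str) -> str:
    """Require every term by joining them with AND."""
    return " AND ".join(terms)


if __name__ == "__main__":
    query = all_of(
        field_term("Mouse", "Organism"),
        # Assumed field: the slides do not state where this term was searched.
        field_term("glyceraldehyde 3 phosphate dehydrogenase", "Title Word"),
    )
    print(query)
```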
Mouse GAPD Records




          Displaying Mouse GAPD Records
Formats
    Summary, Brief, GenBank, ASN.1, FASTA, GI list

Links and neighbors (related records)
    LinkOut, PubMed Links, Protein Links, Nucleotide Neighbors, PopSet Links,
    Structure Links, Genome Links, Taxonomy Links, OMIM Links
Entrez GenBank / GenPept




        GenPept
                            FASTA Format
>gi|193425|gb|M60978.1|MUSGAPDS Mus musculus testis-specific isoform of glycerald
GGCAGCCAGGCCATGAGATCTTAGGCCATGTCGAGACGTGACGTGGTCCTTACCAATGTTACTGTTGTCC
AGCTACGGCGGGACCGATGCCCATGCCCATGCCCATGCCCATGTCCATGCCCATGCCCTGTGATCAGACC
ACCTCCACCCAAGCTTGAGGATCCACCACCCACGGTTGAAGAACAGCCACCGCCACCGCCGCCGCCACCT
CCACCTCCACCACCACCTCCTCCTCCTCCTCCACCCCAGATAGAGCCAGACAAGTTTGAAGAGGCTCCCC
CTCCCCCTCCCCCTCCTCCTCCTCCTCCCCCTCCCCCTCCTCCACCACTCCAAAAGCCAGCTAGAGAGCT
GACAGTGGGTATCAATGGATTTGGACGCATTGGTCGTCTGGTGCTGCGAGTCTGCATGGAGAAGGGCATT
AGGGTGGTAGCAGTGAATGACCCATTCATTGATCCAGAATACATGGTTTACATGTTCAAATATGACTCCA
CACATGGTAGATACAAAGGAAACGTGGAACATAAGAATGGACAACTAGTTGTGGACAACCTTGAGATCAA
CACGTACCAGTGCAAAGACCCTAAAGAAATCCCCTGGAGCTCTATAGGGAATCCCTACGTGGTGGAGTGT
ACAGGCGTCTATCTGTCCATCGAGGCAGCTTCGGCACATATTTCATCTGGTGCCAGGCGTGTGGTGGTCA
CTGCACCCTCCCCCGATGCACCCATGTTTGTCATGGGAGTGAACGAGAAGGACTATAACCCTGGCTCTAT
GACCATTGTCAGCAATGCATCCTGTACCACCAACTGCCTGGCTCCTCTCGCCAAGGTTATTCATGAAAAC
TTCGGGATCGTGGAAGGGCTAATGACCACAGTCCATTCCTACACAGCCACTCAGAAGACAGTGGATGGGC
CATCAAAGAAGGACTGGCGAGGTGGCCGCGGCGCTCACCAAAACATCATCCCATCGTCCACTGGGGCTGC
CAAGGCTGTAGGCAAAGTCATCCCAGAGCTCAAAGGGAAGCTAACAGGAATGGCATTCCGGGTGCCAACC
CCAAACGTGTCAGTTGTGGACCTGACCTGCCGCCTGGCCAAGCCTGCTTCTTACTCGGCTATCACGGAGG
CTGTGAAAGCTGCAGCCAAGGGACCTTTGGCTGGCATCCTTGCTTACACAGAGGACCAGGTGGTCTCCAC
GGACTTTAACGGCAATCCCCATTCTTCCATCTTTGATGCTAAGGCTGGAATTGCCCTCAATGACAACTTC
GTGAAGCTTGTTGCCTGGTACGACAACGAATATGGCTACAGTAACCGAGTGGTCGACCTCCTCCGCTACA
TGTTTAGCCGAGAGAAGTAACACAAAAGGCCCCTCCTTGCTCCCCTGCGCACCTCGCGTTCCTGACTTCG
GCTTCCACTCAAAGGCGCCGCCACCGGGTCAACAATGAAATAAAAACGAGAATGCGC

    FASTA Definition Line
    >gi|193425|gb|M60978.1|MUSGAPDS
        >          marks the definition line
        193425     gi number
        gb         database identifier
        M60978.1   accession number
        MUSGAPDS   locus name

    Database Identifiers
        gb      GenBank
        emb     EMBL
        dbj     DDBJ
        sp      SWISS-PROT
        pdb     Protein Databank
        pir     PIR
        prf     PRF
        ref     RefSeq
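
The definition line is pipe-delimited, so its parts can be recovered with a split. The sketch below handles only the five-field `gi|number|db|accession|locus` layout shown on this slide and maps the database tag through the table above; other layouts are rejected rather than guessed at. The function name is illustrative.

```python
# Database identifier tags as listed on the slide.
DATABASES = {
    "gb": "GenBank",
    "emb": "EMBL",
    "dbj": "DDBJ",
    "sp": "SWISS-PROT",
    "pdb": "Protein Databank",
    "pir": "PIR",
    "prf": "PRF",
    "ref": "RefSeq",
}


def parse_defline(line: str) -> dict:
    """Split a 'gi|number|db|accession|locus description' definition line."""
    if not line.startswith(">"):
        raise ValueError("a FASTA definition line starts with '>'")
    identifiers, _, description = line[1:].partition(" ")
    fields = identifiers.split("|")
    if len(fields) != 5 or fields[0] != "gi":
        raise ValueError("unsupported definition line layout")
    _, gi, db, accession, locus = fields
    return {
        "gi": int(gi),
        "database": DATABASES.get(db, db),
        "accession": accession,
        "locus": locus,
        "description": description,
    }


if __name__ == "__main__":
    print(parse_defline(">gi|193425|gb|M60978.1|MUSGAPDS Mus musculus testis-specific isoform"))
```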
            Abstract Syntax Notation: ASN.1
Seq-entry ::= set {
  level 1 ,
  class nuc-prot ,
  descr {
    title "Mus musculus testis-specific isoform of glyceraldehyde 3-phosphate
 dehydrogenase (Gapd-S) mRNA, and translated products" ,
    update-date
      std {
        year 1994 ,
        month 11 ,
        day 9 } ,
    source {
      org {
        taxname "Mus musculus" ,
        common "house mouse" ,
        db {
          {
            db "taxon" ,
            tag
              id 10090 } } ,

[Figure labels: ASN.1 at the center, with GenBank, GenPept, FASTA Nucleotide, and FASTA Protein views around it.]
                               NCBI Toolbox
/*****************************************************************************
*
*   asn2ff.c
*        convert an ASN.1 entry to flat file format, using the FFPrintArrayPtrs.
*
*****************************************************************************/
#include <accentr.h>
#include "asn2ff.h"
#include "asn2ffp.h"
#include "ffprint.h"
#include <subutil.h>
#include <objall.h>
#include <objcode.h>
#include <lsqfetch.h>
#include <explore.h>

#ifdef ENABLE_ID1
#include <accid1.h>
#endif

FILE *fpl;

Args myargs[] = {
         {"Filename for asn.1 input","stdin",NULL,NULL,TRUE,'a',ARG_FILE_IN,0.0,0,NULL},
         {"Input is a Seq-entry","F", NULL ,NULL ,TRUE,'e',ARG_BOOLEAN,0.0,0,NULL},
         {"Input asnfile in binary mode","F",NULL,NULL,TRUE,'b',ARG_BOOLEAN,0.0,0,NULL},
         {"Output Filename","stdout", NULL,NULL,TRUE,'o',ARG_FILE_OUT,0.0,0,NULL},
         {"Show Sequence?","T", NULL ,NULL ,TRUE,'h',ARG_BOOLEAN,0.0,0,NULL},

                 Toolbox Sources

                  ftp> open ncbi.nlm.nih.gov
                  .
                  .
                  ftp> cd toolbox
                  ftp> cd ncbi_tools

             ftp://ncbi.nlm.nih.gov/toolbox/ncbi_tools
Protein Neighbors-Structure Links
           Related Proteins
           Structure Links
           Cn3D GAPD Structure

Advanced Neighbors: BLink

BLink

PubMed Link

Online Books

         Entrez Structures

Molecular Modeling Database (MMDB) and Cn3D

    MMDB: Molecular Modeling Database

   Derived from experimentally determined PDB records
   Value added to PDB records, including:
    •   Addition of explicit chemical graph information
    •   Validation
    •   Inclusion of taxonomy, citation, and other information
    •   Conversion to the parseable ASN.1 data description language
   Structure neighbors determined by the Vector Alignment Search Tool (VAST)

Searching MMDB
            1CET

Structure Summary
          BLAST neighbors
          VAST neighbors
          Cn3D viewer

Cn3D: Displaying Structures
     Chloroquine

Structure Neighbors

Structural Alignments
           Chloroquine
           NADH

Why do we need similarity searching?

   Identification and annotation
    • Incomplete or no annotations (GenBank)
    • Incorrectly annotated sequences
   Evolutionary relationships
    • Homologous molecules may have similar functions,
      but it ain’t necessarily so!

       Basic Local Alignment Search Tool (BLAST)

   Widely used similarity search tool
   Heuristic approach based on the Smith-Waterman algorithm
   Finds best local alignments
   Provides statistical significance
   All combinations of DNA and protein query and database:
    • DNA vs DNA
    • DNA translation vs Protein
    • Protein vs Protein
    • Protein vs DNA translation
    • DNA translation vs DNA translation
   Available via WWW, email server, standalone programs, and network clients
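
The five query/database combinations listed above correspond to the conventional BLAST program names. The lookup below is only a sketch: the names blastn, blastx, blastp, tblastn and tblastx are not given on this slide and are filled in here as an assumption, and `pick_program` is a hypothetical helper, not part of any NCBI tool.

```python
# Map (query type, database type, both sides translated?) to the
# conventional BLAST program name. The names are assumed, not from the slide.
PROGRAMS = {
    ("dna", "dna", False): "blastn",          # DNA vs DNA
    ("dna", "protein", False): "blastx",      # DNA translation vs Protein
    ("protein", "protein", False): "blastp",  # Protein vs Protein
    ("protein", "dna", False): "tblastn",     # Protein vs DNA translation
    ("dna", "dna", True): "tblastx",          # DNA translation vs DNA translation
}


def pick_program(query, database, translate_both=False):
    """Return the program name for a query/database combination (hypothetical helper)."""
    key = (query, database, translate_both)
    if key not in PROGRAMS:
        raise ValueError("unsupported combination: %r" % (key,))
    return PROGRAMS[key]


if __name__ == "__main__":
    print(pick_program("protein", "dna"))
```

Only DNA against DNA has two entries, because the same pair of inputs can be compared directly or as translations on both sides.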
                Local Alignment Statistics
High scores of local alignments between two random sequences
follow the Extreme Value Distribution

For ungapped alignments, the expected number with score S or greater is

    $E = K m n \, e^{-\lambda S}$
or
    $E = m n \, 2^{-S'}$

    $K$ = scale for search space
    $\lambda$ = scale for scoring system
    $S'$ = bit score = $(\lambda S - \ln K)/\ln 2$

http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
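
The two forms of the expectation formula on this slide can be checked against each other numerically. In the sketch below, m and n are taken to be the query and database lengths (the slide does not define them), and the values of lambda and K are illustrative assumptions of the kind quoted for ungapped BLOSUM62 scoring, not figures from the slide.

```python
import math


def expect_value(raw_score, m, n, lam, K):
    """E = K*m*n*exp(-lambda*S): expected number of chance hits scoring >= S."""
    return K * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score, lam, K):
    """S' = (lambda*S - ln K) / ln 2: score normalized for the scoring system."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def expect_from_bits(bits, m, n):
    """E = m*n*2^(-S'): the same expectation written in terms of the bit score."""
    return m * n * 2.0 ** (-bits)


if __name__ == "__main__":
    lam, K = 0.318, 0.134      # assumed, illustrative parameters
    m, n = 250, 50_000_000     # assumed query and database lengths
    S = 100                    # assumed raw alignment score
    bits = bit_score(S, lam, K)
    print(bits, expect_value(S, m, n, lam, K), expect_from_bits(bits, m, n))
```

Both routes give the same E because substituting S' into the second formula reproduces the first; the bit score simply absorbs lambda and K so that scores from different matrices become comparable.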
                   Scoring Systems
•Nucleic acids: identity matrix
•Proteins
   •Position Independent Matrices
      •PAM Matrices (Percent Accepted Mutation)
          •Implicit model of evolution
          •Higher PAM numbers are all calculated from PAM1
          •PAM250 widely used
      •BLOSUM Matrices (BLOck SUbstitution Matrices)
          •Empirically determined from alignments of conserved blocks
          •Each includes information up to a certain level of identity
          •BLOSUM62 widely used
   •Position Specific Score Matrices (PSSM)
          •PSI-BLAST and RPS-BLAST
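
The statement that higher PAM numbers are all calculated from PAM1 can be illustrated by repeated matrix multiplication: the PAM-n mutation probability matrix is PAM1 raised to the n-th power. The sketch below uses a made-up 3-letter alphabet with hypothetical probabilities; real PAM matrices are 20x20 and are further converted to log-odds scores, which is not shown here.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_pow(matrix, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, matrix)
    return result


# Hypothetical "PAM1" for a toy 3-letter alphabet: each row sums to 1 and
# roughly 1% of positions change per step. These are not real PAM values.
TOY_PAM1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.004, 0.004, 0.992],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)

if __name__ == "__main__":
    for row in toy_pam250:
        print(["%.3f" % p for p in row])
```

After 250 steps the diagonal entries are much smaller than in the one-step matrix, which is why a high-numbered PAM matrix suits more distantly related sequences.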
                    BLOSUM62

A  4
R -1  5
N -2  0  6
D -2 -2  1  6
C  0 -3 -3 -3  9
Q -1  1  0  0 -3  5
E -1  0  0  2 -4  2  5
G  0 -2  0 -1 -3 -2 -2  6
H -2  0  1 -1 -3  0  0 -2  8
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  X

   Common amino acids have low weights
   Rare amino acids have high weights
   Negative for less likely substitutions
   Positive for more likely substitutions
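
A position-independent matrix such as BLOSUM62 is used by summing one table entry per aligned residue pair. The sketch below copies only the entries for A, I, L, S and W from the matrix on this slide; `pair_score` and `ungapped_score` are hypothetical helper names introduced for illustration.

```python
# Lower-triangle BLOSUM62 entries for five residues, copied from the slide.
BLOSUM62_SUBSET = {
    ("A", "A"): 4,
    ("I", "A"): -1, ("I", "I"): 4,
    ("L", "A"): -1, ("L", "I"): 2, ("L", "L"): 4,
    ("S", "A"): 1, ("S", "I"): -2, ("S", "L"): -2, ("S", "S"): 4,
    ("W", "A"): -3, ("W", "I"): -3, ("W", "L"): -2, ("W", "S"): -3, ("W", "W"): 11,
}


def pair_score(x, y):
    """Look up a symmetric substitution score for residues x and y."""
    if (x, y) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(x, y)]
    return BLOSUM62_SUBSET[(y, x)]


def ungapped_score(seq1, seq2):
    """Sum the pair scores of two equal-length sequences aligned without gaps."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length")
    return sum(pair_score(x, y) for x, y in zip(seq1, seq2))


if __name__ == "__main__":
    # W/W = 11, A/S = 1, I/L = 2, S/S = 4
    print(ungapped_score("WAIS", "WSLS"))  # prints 18
```

The rare tryptophan identity (11) contributes far more than an identity between common residues such as alanine or serine (4), and the conservative I/L exchange still scores positively.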
Position Specific Substitution Rates

         Typical serine       Active site serine

          Position Specific Score Matrix (PSSM)

           A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
206  D     0  -2   0   2  -4   2   4  -4  -3  -5  -4   0  -2  -6   1   0  -1  -6  -4  -1
207  G    -2  -1   0  -2  -4  -3  -3   6  -4  -5  -5   0  -2  -3  -2  -2  -1   0  -6  -5
208  V    -1   1  -3  -3  -5  -1  -2   6  -1  -4  -5   1  -5  -6  -4   0  -2  -6  -4  -2
209  I    -3   3  -3  -4  -6   0  -1  -4  -1   2  -4   6  -2  -5  -5  -3   0  -1  -4   0
210  S    -2  -5   0   8  -5  -3  -2  -1  -4  -7  -6  -4  -6  -7  -5   1  -3  -7  -5  -6
211  S     4  -4  -4  -4  -4  -1  -4  -2  -3  -3  -5  -4  -4  -5  -1   4   3  -6  -5  -3
212  C    -4  -7  -6  -7  12  -7  -7  -5  -6  -5  -5  -7  -5   0  -7  -4  -4  -5   0  -4
213  N    -2   0   2  -1  -6   7   0  -2   0  -6  -4   2   0  -2  -5  -1  -3  -3  -4  -3
214  G    -2  -3  -3  -4  -4  -4  -5   7  -4  -7  -7  -5  -4  -4  -6  -3  -5  -6  -6  -6
215  D    -5  -5  -2   9  -7  -4  -1  -5  -5  -7  -7  -4  -7  -7  -5  -4  -4  -8  -7  -7
216  S    -2  -4  -2  -4  -4  -3  -3  -3  -4  -6  -6  -3  -5  -6  -4   7  -2  -6  -5  -5
217  G    -3  -6  -4  -5  -6  -5  -6   8  -6  -8  -7  -5  -6  -7  -6  -4  -5  -6  -7  -7
218  G    -3  -6  -4  -5  -6  -5  -6   8  -6  -7  -7  -5  -6  -7  -6  -2  -4  -6  -7  -7
219  P    -2  -6  -6  -5  -6  -5  -5  -6  -6  -6  -7  -4  -6  -7   9  -4  -4  -7  -7  -6
220  L    -4  -6  -7  -7  -5  -5  -6  -7   0  -1   6  -6   1   0  -6  -6  -5  -5  -4   0
221  N    -1  -6   0  -6  -4  -4  -6  -6  -1   3   0  -5   4  -3  -6  -2  -1  -6  -1   6
222  C     0  -4  -5  -5  10  -2  -5  -5   1  -1  -1  -5   0  -1  -4  -1   0  -5   0   0
223  Q     0   1   4   2  -5   2   0   0   0  -4  -2   1   0   0   0  -1  -1  -3  -3  -4
224  A    -1  -1   1   3  -4  -1   1   4  -3  -4  -3  -1  -2  -2  -3   0  -2  -2  -2  -3

   Slide annotations: "Serine scored differently in these two positions"; "Active site nucleophile"
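
Unlike a position-independent matrix, a PSSM gives the same residue a different score at each alignment position. The sketch below copies two serine rows (211 and 216) from the table on this slide; which rows the slide's annotation actually points at is not recoverable from the text, so this choice is an assumption, and `pssm_score` is a hypothetical helper.

```python
# Column order used by the PSSM table on the slide.
COLUMNS = "ARNDCQEGHILKMFPSTWYV"

# Two rows copied from the slide; both positions hold serine in the query.
PSSM_ROWS = {
    211: [4, -4, -4, -4, -4, -1, -4, -2, -3, -3, -5, -4, -4, -5, -1, 4, 3, -6, -5, -3],
    216: [-2, -4, -2, -4, -4, -3, -3, -3, -4, -6, -6, -3, -5, -6, -4, 7, -2, -6, -5, -5],
}


def pssm_score(position, residue):
    """Return the score for placing `residue` at alignment `position`."""
    return PSSM_ROWS[position][COLUMNS.index(residue)]


if __name__ == "__main__":
    for residue in "SAT":
        print(residue, pssm_score(211, residue), pssm_score(216, residue))
```

At position 211 serine, alanine and threonine all score positively (4, 4, 3), while at position 216 only serine does (7, against -2 for both alanine and threonine), so a substitution tolerated at one serine is penalized at the other.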
               Gapped Alignments

•Gapping provides more biologically realistic alignments
•Statistical behavior is not completely understood for gapped alignments
   •Gapped BLAST parameters must be found by simulations for each matrix
•Affine gap costs: a gap of length k receives the score -(a+bk)
   a = gap open penalty, b = gap extend penalty
   A gap of length 1 receives the score -(a+b)

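
The affine gap cost on the Gapped Alignments slide can be written as a one-line function. In this sketch the penalties a = 11 and b = 1 are illustrative assumptions, not values taken from the slide, and `affine_gap_score` is a hypothetical name.

```python
def affine_gap_score(length, gap_open, gap_extend):
    """Score of one gap of `length` residues: -(a + b*k)."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return -(gap_open + gap_extend * length)


if __name__ == "__main__":
    a, b = 11, 1  # assumed, illustrative penalties
    one_long_gap = affine_gap_score(4, a, b)        # -(11 + 4)
    two_short_gaps = 2 * affine_gap_score(2, a, b)  # 2 * -(11 + 2)
    print(one_long_gap, two_short_gaps)  # prints -15 -26
```

Because the open penalty is paid once per gap, a single gap of four residues costs less than two separate gaps of two residues each, which favors alignments with fewer, longer gaps.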
Intermission



