					Sequence Databases – 20 June 2008

Learning objectives-
   Be able to describe how information is stored in GenBank.
   Be able to read a GenBank flat file.
   Be able to search GenBank for information.
   Be able to explain the content difference between a header,
    features and sequence.
   Be able to say what distinguishes a primary database from
    a secondary database.
   Be able to use and talk about the RefSeq and dbEST
    databases as they fit into the objectives above.
   Be able to access and navigate the ENTREZ platform for
    biological data analysis.
BIOSEQs – entry common to all
sequence databases
BIOSEQ = Biological sequence
   Central element in the NCBI database model.
   Found in both the nucleotide and protein databases
Comprises the sequence of a single continuous molecule of
nucleic acid or protein. Each entry must have:
   At least one sequence identifier (Seq-id)
   Information on the physical type of molecule (DNA, RNA, or
    protein)
   Descriptors, which describe the entire Bioseq
   Annotations, which provide information regarding specific
    locations within the Bioseq
What is GenBank?
The NIH genetic sequence database, an annotated
collection of all publicly available NUCLEIC ACID sequences.
Each record represents a single contiguous stretch of DNA
or RNA
    DNA stretches may have more than one coding region
   (i.e., more than one gene).
   RNA sequences are presented with T, not U
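
Because GenBank writes RNA records with T, a reader who wants conventional RNA notation has to convert the letters back. A minimal sketch in Python; the function name as_rna is made up for illustration:

```python
def as_rna(genbank_seq: str) -> str:
    """Rewrite a sequence stored with T (GenBank style) using U instead."""
    # Handle both lower case (flat file style) and upper case input.
    return genbank_seq.replace("t", "u").replace("T", "U")


print(as_rna("gatcctccat"))  # gauccuccau
```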
Records are generated from direct submissions to the DNA
sequence databases from the investigators (authors).
GenBank is part of the International Nucleotide Sequence
Database Collaboration.
The number of base pairs is now at over 85 billion.
The number of sequences is approaching 83 million.
General Comments on GBFF (GenBank flat file format)
Three sections:
   1) Header: information about the whole record
   2) Features: description of annotations, each represented
    by a key.
   3) Nucleotide sequence: each record ends with // on its last line
A nucleic acid (DNA or RNA) sequence translated
to an amino acid sequence is a “feature”
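
The three-section layout can be sketched in Python. This is an illustrative sketch, not an official parser. It assumes the features section starts at the FEATURES line and the sequence section starts at BASE COUNT or ORIGIN (ORIGIN is the line that precedes the sequence in real flat files). The function name and the shortened sample record are made up for the example.

```python
def split_flat_file(record: str) -> dict:
    """Split one flat file record into header, features and sequence lines."""
    sections = {"header": [], "features": [], "sequence": []}
    current = "header"
    for line in record.splitlines():
        if line.startswith("//"):  # end-of-record marker
            break
        if line.startswith("FEATURES"):
            current = "features"
        elif line.startswith(("BASE COUNT", "ORIGIN")):
            current = "sequence"
        sections[current].append(line)
    return sections


# Shortened, hand-made sample record (illustration only).
SAMPLE = """\
LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
ACCESSION   U49845
FEATURES             Location/Qualifiers
     source          1..5028
BASE COUNT     1510 a   1074 c    835 g   1609 t
ORIGIN
        1 gatcctccat atacaacggt
//
"""

parts = split_flat_file(SAMPLE)
print(len(parts["header"]), len(parts["features"]), len(parts["sequence"]))  # 2 2 3
```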
GenBank Flat File (MyoD1 as an example)
   Feature Keys
 1) Indicates biological nature of sequence
 2) Supplies information about changes to the sequence
Feature Key                 Description
conflict           Separate determinations of the same seq. differ
rep_origin         Origin of replication
protein_bind       Protein binding site on DNA
CDS                Protein coding sequence
   Feature Keys-Terminology
Feature Key       Location/Qualifiers
CDS               23..400
                  /product=“alcohol dehydro.”
The feature CDS is a coding sequence beginning at base 23
  and ending at base 400, has a product called “alcohol
  dehydrogenase” and corresponds to the gene called
   Feature Keys-Terminology
Feat. Key        Location/Qualifiers
CDS              join(544..589,688..>1032)
                 /product=“T-cell recep. B-ch.”

The feature CDS is a partial coding sequence formed by joining
  the indicated elements to form one contiguous sequence
  encoding a product called T-cell receptor beta-chain.
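
Reading a location by hand, as in the two examples above, can also be sketched in Python. The sketch below covers only simple forms: a single span, join(...), complement(...), and the < and > markers for partial ends. The function names and the returned dictionary layout are assumptions made for illustration. Real records can contain other location forms that this does not handle.

```python
import re


def parse_location(location: str) -> dict:
    """Parse a simple feature location into strand, spans and partial flags."""
    text = location.replace(" ", "")
    strand = "+"
    if text.startswith("complement(") and text.endswith(")"):
        strand = "-"  # feature lies on the complementary strand
        text = text[len("complement("):-1]
    if text.startswith("join(") and text.endswith(")"):
        text = text[len("join("):-1]
    spans = []
    for part in text.split(","):
        match = re.fullmatch(r"(<?)(\d+)\.\.(>?)(\d+)", part)
        if match is None:
            raise ValueError(f"unsupported location element: {part!r}")
        spans.append((int(match.group(2)), int(match.group(4))))
    return {
        "strand": strand,
        "spans": spans,
        "partial_start": "<" in text,  # begins before the first listed base
        "partial_end": ">" in text,    # continues past the last listed base
    }


def feature_length(location: str) -> int:
    """Total number of bases covered by all spans of a location."""
    return sum(end - start + 1 for start, end in parse_location(location)["spans"])


print(parse_location("23..400"))
print(feature_length("join(544..589,688..1032)"))  # 46 + 345 = 391
```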

(For MyoD1 – Accession number X61655)
           Record from GenBank
LOCUS        SCU49845         5028 bp         DNA                     PLN           21-JUN-1999
DEFINITION   Saccharomyces cerevisiae TCP1-beta gene, partial cds, and
             Axl2p (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION    U49845
VERSION      U49845.1     GI:1293613
KEYWORDS     .
SOURCE       baker's yeast.
  ORGANISM   Saccharomyces cerevisiae
             Eukaryota; Fungi; Ascomycota; Hemiascomycetes; Saccharomycetales;
             Saccharomycetaceae; Saccharomyces.

Notes on the header fields:
   LOCUS: locus name (SCU49845), sequence length, molecule type, GenBank
    division (PLN = plant, fungal and algal) and modification date.
   DEFINITION: brief description of the sequence; "cds" refers to the coding region.
   ACCESSION: unique identifier (never changes).
   VERSION: nucleotide sequence identifier in accession.version form (changes
    when there is a change in sequence). GI is the GeneInfo identifier (changes
    whenever there is a change).
   KEYWORDS: word or phrase describing the sequence (not based on controlled
    vocabulary). Not used in newer records.
   SOURCE: common name for the organism.
   ORGANISM: formal scientific name for the source organism and its lineage,
    based on the NCBI Taxonomy Database.
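
The LOCUS and VERSION lines are whitespace-separated, so their fields can be pulled apart with a few lines of Python. This sketch assumes the older layout shown in this record (name, length, "bp", molecule type, division, date, and a GI on the VERSION line). Other records may carry extra or fewer fields, and the function names are made up for illustration.

```python
def parse_locus(line: str) -> dict:
    """Read the fields of a LOCUS line laid out like the sample record."""
    fields = line.split()
    # Expected: ['LOCUS', name, length, 'bp', molecule, division, date]
    return {
        "name": fields[1],
        "length": int(fields[2]),
        "molecule": fields[4],
        "division": fields[5],
        "date": fields[6],
    }


def parse_version(line: str) -> dict:
    """Split a VERSION line into accession, version number and GI."""
    fields = line.split()
    accession, version = fields[1].rsplit(".", 1)
    gi = fields[2].split(":", 1)[1]
    return {"accession": accession, "version": int(version), "gi": gi}


locus = parse_locus("LOCUS        SCU49845         5028 bp         DNA"
                    "                     PLN           21-JUN-1999")
version = parse_version("VERSION      U49845.1     GI:1293613")
print(locus["division"], locus["length"])         # PLN 5028
print(version["accession"], version["version"])   # U49845 1
```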
        Record from GenBank (cont.1)
REFERENCE    1 (bases 1 to 5028)
  AUTHORS    Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.
  TITLE      Cloning and sequence of REV7, a gene whose function is required
             for DNA damage-induced mutagenesis in Saccharomyces cerevisiae
  JOURNAL    Yeast 10 (11), 1503-1509 (1994)
  MEDLINE    95176709
REFERENCE    2 (bases 1 to 5028)
  AUTHORS    Roemer,T., Madden,K., Chang,J. and Snyder,M.
  TITLE      Selection of axial growth sites in yeast requires Axl2p, a
             novel plasma membrane glycoprotein
  JOURNAL    Genes Dev. 10 (7), 777-793 (1996)
  MEDLINE    96194260
REFERENCE    3 (bases 1 to 5028)
  AUTHORS    Roemer,T.
  TITLE      Direct Submission
  JOURNAL    Submitted (22-FEB-1996) Terry Roemer, Biology, Yale University,
             New Haven, CT, USA

Notes on the reference fields:
   MEDLINE: Medline UID of the cited article.
   Direct Submission: names the submitter of the sequence (always the last
    reference).
          Record from GenBank (cont.2)
        Each feature has three parts: a key (a keyword indicating the functional group), a location
        (instructions for finding the feature), and qualifiers (auxiliary information about the feature)

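  FEATURES                    Location/Qualifiers
       source                 1..5028
                              /organism="Saccharomyces cerevisiae"
                              /db_xref="taxon:4932"
       CDS                    <1..206
                              /codon_start=3
                              /product="TCP1-beta"
                              /protein_id="AAA98665.1"

Notes on the feature table:
   Keys: source and CDS are the feature keys.
   Location: 1..5028 is the location of the source feature.
   Database cross-refs: /db_xref="taxon:4932" is a cross-reference to the
    NCBI Taxonomy database.
   <1..206: the 5’ end of the coding sequence begins upstream of the first
    nucleotide of the sequence. The 3’ end is complete. Note: only a partial
    sequence.
   /codon_start=3: start of the open reading frame.
   /product: descriptive free text must be in quotations.
   /protein_id: protein sequence ID number.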
  Record from GenBank (cont.3)
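  gene            687..3158
  CDS             687..3158
                  /note="plasma membrane glycoprotein"
                  /function="required for axial budding pattern of S.
  gene            complement(3300..4037)
  CDS             complement(3300..4037)

Notes on these features:
   687..3158 and complement(3300..4037) are further examples of locations.
   complement(...) means the feature lies on the strand complementary to the
    one shown in the record.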
           Record from GenBank (cont.4)
BASE COUNT      1510 a    1074 c    835 g   1609 t
          1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
         61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct . .
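
The BASE COUNT line can be checked against the sequence lines themselves. A small Python sketch, assuming each sequence line starts with a position number followed by blocks of bases as shown above. The function name is made up, and only the two lines shown here are counted, so the totals will not match the full 5028 bp record.

```python
from collections import Counter


def count_bases(sequence_lines):
    """Count each base in flat file sequence lines, ignoring position numbers."""
    counts = Counter()
    for line in sequence_lines:
        blocks = line.split()[1:]  # drop the leading position number
        counts.update("".join(blocks))
    return counts


lines = [
    "        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg",
    "       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct",
]
counts = count_bases(lines)
print(sum(counts.values()))  # 120 bases in these two lines
print(sorted(counts))        # ['a', 'c', 'g', 't']
```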
Primary databases vs.
Secondary databases
Primary database
 comprises information submitted directly by the
 investigators.
 is called an archival database.

Secondary database
 comprises information derived from primary
 databases.
 is a curated database.
      NCBI site map
 Things to notice on the map
   General organization
   Where the following fit:
       RefSeq (nucleotide, protein)
       dbEST

       Others of interest to you

  NCBI site map:
Types of primary databases
carrying biological information
PDB-Three-dimensional structure
coordinates of biological molecules
PROSITE-database of protein
domain/function relationships.
Types of secondary databases
carrying biological information
RefSeq- Comprehensive, integrated, non-redundant,
well-annotated set of sequences,
including genomic DNA, transcripts, and proteins.
     Types of secondary databases
     carrying biological information
Some nucleotide secondary databases
   dbEST- Sequence data and other information on "single-pass" cDNA
    (RNA-based) sequences, or "Expressed Sequence Tags", from a number
    of organisms
   Genome databases-(there are over 20 genome databases that can be
   EPD:eukaryotic promoter database
   NR-non-redundant GenBank+EMBL+DDBJ+PDB. Entries with 100%
    sequence identity are merged as one.
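
The merging rule for NR (entries with 100% sequence identity become one) can be illustrated with a few lines of Python. This is only a sketch of the idea, not how NCBI builds NR. The entry IDs and sequences below are invented for the example.

```python
from collections import defaultdict


def merge_identical(entries):
    """Group entry IDs whose sequences are exactly identical."""
    merged = defaultdict(list)
    for entry_id, sequence in entries:
        # Compare case-insensitively so "acgt" and "ACGT" count as identical.
        merged[sequence.upper()].append(entry_id)
    return dict(merged)


# Hypothetical entries from different source databases.
entries = [
    ("genbank_1", "MKTAYIAK"),
    ("embl_7", "MKTAYIAK"),
    ("pdb_3", "MKTAYIAR"),
]
nr = merge_identical(entries)
print(len(nr))          # 2 merged entries
print(nr["MKTAYIAK"])   # ['genbank_1', 'embl_7']
```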
  Types of secondary databases
  carrying biological information
Some protein secondary databases
  ProDom
     Fingerprints – conserved motifs used to classify
     Highly conserved regions of proteins
References for understanding the
NCBI sequence database model

Here is the website for NCBI developer
RNA processing

[Figure: RNA processing. Labels: "RNA, but NOT mRNA" (shown twice) and "Mature mRNA".]