SeqDB

Protein Sequence Databases for Proteomics

Nathan Edwards
Center for Bioinformatics and Computational Biology
University of Maryland, College Park

Protein Sequence Databases

• Link between mass spectra and proteins
• A protein’s amino-acid sequence provides
  a basis for interpreting
 • Enzymatic digestion
 • Separation protocols
 • Fragmentation
 • Peptide ion masses
• We must interpret database information as
  carefully as mass spectra.
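
As a rough illustration of how a sequence supports interpretation of digestion and peptide ion masses, the sketch below performs an in-silico tryptic digest and computes peptide m/z values. The cleavage rule (after K or R, but not before P), the absence of missed cleavages and modifications, and the function names are assumptions made for this example, not something the slides specify.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # mass of H2O added to the residue sum
PROTON = 1.00728   # mass of one charge-carrying proton


def tryptic_peptides(sequence):
    """Cleave after K or R unless the next residue is P (assumed rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        nxt = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide
    return peptides


def peptide_mz(peptide, charge=1):
    """Monoisotopic m/z of an unmodified peptide ion at the given charge."""
    neutral = sum(RESIDUE_MASS[aa] for aa in peptide) + WATER
    return (neutral + charge * PROTON) / charge


for pep in tryptic_peptides("MAVMAPRTLLLLLSGALALTQTWAGSHSMR"):
    print(pep, round(peptide_mz(pep), 4))
```

An error in the database sequence changes both the predicted peptides and their masses, which is why the sequence deserves the same scrutiny as the spectra.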
More than sequence…

Protein sequence databases provide much
 more than sequence:

•   Names
•   Descriptions
•   Facts
•   Predictions
•   Links to other information sources

Protein databases provide a link to the current
 state of our understanding about a protein.
Much more than sequence

Names
  • Accession, Name, Description
Biological Source
  • Organism, Source, Taxonomy
Literature
Function
  • Biological process, molecular function,
    cellular component
  • Known and predicted
Features
  • Polymorphism, Isoforms, PTMs, Domains
Derived Data
  • Molecular weight, pI

Database types

Curated
  • Swiss-Prot
  • PIR
  • RefSeq NP
Translated
  • TrEMBL
  • RefSeq XP, ZP
Omnibus
  • NCBI’s nr
  • MSDB
  • IPI
Other
  • PDB
  • HPRD
  • EST
  • Genomic
SwissProt

• From ExPASy
 • Expert Protein Analysis System
 • Swiss Institute of Bioinformatics
• ~ 180,000 protein sequence “entries”
• ~ 9,000 species represented
• ~ 12,000 Human proteins
• Highly curated
• Minimal redundancy
• Some restrictions on commercial use

PIR

• Protein Information Resource
    • Georgetown University Medical Center
•   ~ 280,000 protein sequence “entries”
•   Highly curated
•   Public domain resource
•   ~ 10,500 Human proteins
•   Grew out of the Atlas of Protein Sequence
    and Structure (1965-1978) edited by
    Margaret Dayhoff.


TrEMBL

• Translated EMBL nucleotide sequences
 • European Molecular Biology Laboratory
   • European Bioinformatics Institute (EBI)
• Computer annotated
• Only sequences absent from SwissProt
• ~ 165,000 protein sequence “entries”
• ~ 88,000 species
• ~ 52,000 Human proteins

RefSeq

• Reference Sequence
  • From NCBI (National Center for
    Biotechnology Information), NLM, NIH
• Integrated genomic, transcript, and
  protein sequences.
• Varying levels of curation
  • Reviewed, Validated, …, Predicted, …
• ~ 1,350,000 protein sequence “entries”
  • ~ 44,000 reviewed
• ~ 28,000 Human proteins
RefSeq

• Particular focus on major research
  organisms
 • Tightly integrated with genome projects.
• Curated entries: NP accessions
• Predicted entries: XP accessions




UniProt

• Universal Protein Resource
• Combination of
  • Swiss-Prot
  • TrEMBL
  • PIR
• Knowledgebase is highly curated
• “Similar sequence” clusters are available
  • 50%, 90%, 100% sequence similarity


IPI

• International Protein Index
  • From EBI
• For a specific species, combines
  • UniProt, RefSeq, Ensembl
  • Species specific databases
• ~ 48,000 protein sequence entries
• Human, mouse, rat, zebrafish,
  Arabidopsis

NCBI’s nr

• non-redundant
• Contains
 • GenBank CDS translations
 • RefSeq Proteins
 • Protein Data Bank (PDB)
 • SwissProt, TrEMBL, PIR
 • Others
• “Similar sequences” suppressed
 • 100% sequence similarity
• ~ 1,800,000 protein sequence “entries”
• ~ 33,000 species
MSDB

• From Imperial College London
• Combines
 • PIR, TrEMBL, GenBank, SwissProt
• Distributed with Mascot
 • …so well integrated with Mascot




Others

• HPRD
 • Manually curated integration of literature
• PDB
 • Focus on protein structure
• dbEST
 • Part of GenBank - EST sequences
• Genome Sequences



Human Sequences

• Number of Human Genes is believed to
  be between 20,000 and 25,000
• Human sequences per database:
  PIR         ~ 10,500
  SwissProt   ~ 12,000
  RefSeq      ~ 28,000
  IPI-HUMAN   ~ 48,000
  TrEMBL      ~ 52,000
  MSDB        ~ 105,000


DNA to Protein Sequence

Derived from http://online.itp.ucsb.edu/online/infobio01/burge

Genome Browsers

• Link genomic, transcript, and protein
  sequence in a graphical manner
  • Genes, ESTs, SNPs, cross-species, etc.
• UC Santa Cruz
  • http://genome.ucsc.edu
• Ensembl
  • http://www.ensembl.org
• NCBI Map View
  • http://www.ncbi.nlm.nih.gov/mapview


UCSC Genome Browser

• Shows many
  sources of protein
  sequence
  evidence in a
  unified display
• Can use EST
  accession as a
  location!



Accessions

•   Permanent labels
•   Short, machine readable
•   Enable precise communication
•   Typos render them unusable!
•   Each database uses a different format

•   Swiss-Prot: P17947
•   Ensembl: ENSG00000066336
•   PIR: S60367; S60367
•   GO: GO:0003700;
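
Because each database uses its own accession format, a simple pattern check can catch some typos before a lookup fails. The sketch below is illustrative only: the Swiss-Prot pattern follows the published UniProt accession format, while the Ensembl human gene and GO patterns are inferred from the examples above, and the names `ACCESSION_PATTERNS` and `classify_accession` are made up for this example.

```python
import re

# Format patterns only; a well-formed accession can still be the wrong one.
ACCESSION_PATTERNS = {
    "Swiss-Prot": re.compile(
        r"(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})"
    ),
    # Assumed from the example: ENSG followed by 11 digits.
    "Ensembl human gene": re.compile(r"ENSG[0-9]{11}"),
    # Assumed from the example: GO: followed by 7 digits.
    "GO": re.compile(r"GO:[0-9]{7}"),
}


def classify_accession(text):
    """Return the names of every format the whole string matches."""
    return [name for name, pattern in ACCESSION_PATTERNS.items()
            if pattern.fullmatch(text)]


for acc in ["P17947", "ENSG00000066336", "GO:0003700", "P1794"]:
    print(acc, classify_accession(acc))
```

A transposed digit (P17974 for P17947) still passes this check, so a format test complements an actual database lookup rather than replacing it.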
Names / IDs

•   Compact mnemonic labels
•   Not guaranteed permanent
•   Require careful curation
•   Conceptual objects
• Swiss-Prot names changed recently!
• ALBU_HUMAN
    • Serum Albumin
• RT30_HUMAN
    • Mitochondrial 28S ribosomal protein S30
• CP3A7_HUMAN
    • Cytochrome P450 3A7
Description / Name

• Free text description
• Human readable
• Space limited
• Hard for computers to interpret!
• No standard nomenclature or format
• Often abused….

• COX7R_HUMAN
 • Cytochrome c oxidase subunit VIIa-
   related protein, mitochondrial [Precursor]
FASTA Format

• Header line begins with >
• Accession number
  • No uniform format
  • Multiple accessions separated by |
• One line of description
  • Usually pretty cryptic
• Organism of sequence?
  • No uniform format
  • Official Latin name not necessarily used
• Amino-acid sequence in single-letter code
  • Usually spread over multiple lines.
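
A minimal reader for the layout just described might look like the sketch below. It assumes the accession field ends at the first space and that multiple accessions are separated by |; real files deviate from this, and the function names are hypothetical.

```python
def read_fasta(lines):
    """Yield (accessions, description, sequence) for each FASTA record."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield _record(header, chunks)


def _record(header, chunks):
    # Assumption: everything up to the first space is the identifier field.
    identifier, _, description = header.partition(" ")
    accessions = [field for field in identifier.split("|") if field]
    return accessions, description, "".join(chunks)


example = [
    ">sp|Q9H0A8|COM4_HUMAN COMM domain containing protein 4",
    "MAVMAP",  # placeholder residues, not the real sequence
    "RTLLLL",
]
for accessions, description, sequence in read_fasta(example):
    print(accessions, description, sequence)
```

The organism is not extracted here because, as noted above, it has no uniform position or spelling in the description line.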
Organism / Species / Taxonomy
• The protein’s organism…
  • …or the source of the biological sample
• The most reliable sequence annotation
  available
• Useful only to the extent that it is correct
• NCBI’s taxonomy is widely used
  • Provides a standard of sorts; hierarchical
  • Other databases don’t necessarily keep up
• Organism specific sequence databases
  starting to become available.
Organism / Species / Taxonomy
•   Buffalo rat            •   Rattus sp. strain Wistar
•   Gunn rats              •   Sprague-Dawley rat
•   Norway rat             •   Wistar rats
•   Rattus PC12 clone IS   •   brown rat
•   Rattus norvegicus      •   laboratory rat
•   Rattus norvegicus8     •   rat
•   Rattus norwegicus      •   rats
•   Rattus rattiscus       •   zitter rats

• Rattus sp.
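
One way to cope with labels like these is a hand-curated synonym table that maps free-text organism strings to a single canonical name. The sketch below is only an illustration: the table entries are a small hypothetical subset chosen for this example, and which labels truly denote Rattus norvegicus is a curation decision, not something the code can determine.

```python
import re

# Hypothetical, hand-curated synonym table (lower-cased keys).
SYNONYMS = {
    "rattus norvegicus": "Rattus norvegicus",
    "rattus norwegicus": "Rattus norvegicus",  # misspelling from the list above
    "norway rat": "Rattus norvegicus",
    "brown rat": "Rattus norvegicus",
    "sprague-dawley rat": "Rattus norvegicus",
}


def normalize_organism(label):
    """Return the canonical name, or None when the label is not in the table."""
    key = label.strip().lower()
    key = re.sub(r"\d+$", "", key)           # drop stray trailing digits
    key = re.sub(r"\s+", " ", key).strip()   # collapse repeated whitespace
    return SYNONYMS.get(key)


for label in ["Rattus norvegicus8", "Norway rat", "Rattus sp."]:
    print(repr(label), "->", normalize_organism(label))
```

Returning None for unknown labels, rather than guessing, keeps ambiguous entries such as "Rattus sp." visible for manual review.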

Controlled Vocabulary

• Middle ground between computers and
  people
• Provides precision for concepts
 • Searching, sorting, browsing
 • Concept relationships
• Vocabulary / Ontology must be established
 • Human curation
• Link between concept and object:
 • Manually curated
 • Automatic / Predicted
Ontology Structure

• NCBI Taxonomy
 • Tree
• Gene Ontology (GO)
 • Molecular function
 • Biological process
 • Cellular component
 • Directed acyclic graph (DAG)
• Unstructured labels
 • Overlapping?
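
The practical difference between a tree and a DAG is that a DAG term can have several parents, so collecting everything a term implies means following every upward path. The sketch below shows this on a toy graph; the term names and the `parents` mapping are invented for illustration and are not real GO terms.

```python
def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:   # a DAG can reach a term by several paths
                seen.add(parent)
                stack.append(parent)
    return seen


# Toy DAG: "specific" has two parents that share one root.
toy_parents = {
    "specific": ["general_a", "general_b"],
    "general_a": ["root"],
    "general_b": ["root"],
    "root": [],
}
print(sorted(ancestors("specific", toy_parents)))
```

In a tree such as the NCBI taxonomy the same function simply walks one lineage, since every node has at most one parent.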

Protein Families

• Similar sequence implies similar function
• Similar structure implies similar function
• Common domains imply similar function

• Bootstrap up from small sets of proteins
  with well understood characteristics

• Usually a hybrid manual / automatic
  approach
Protein Families

• PROSITE, PFam, InterPro, PRINTS
• Swiss-Prot keywords

• Differences:
  • Motif style, ontology structure, degree of
    manual curation
• Similarities:
  • Primarily sequence based, cross species


Gene Ontology

• Hierarchical
  • Molecular function
  • Biological process
  • Cellular component

• Describes the vocabulary only!
• Protein families provide GO association
  • Not necessarily any appropriate GO category.
  • Not necessarily in all three hierarchies.
  • Sometimes general categories are used because
    none of the specific categories are correct.


Sequence Variants

• Protein sequence can vary due to
 • Polymorphism
 • Alternative splicing
 • Post-translational modification
• Sequence databases typically do not
  capture all versions of a protein’s
  sequence



Sequence Variants

Swiss-Prot;
 a curated protein sequence database which
 strives to provide a high level of annotation
 (such as the description of the function of a
 protein, its domains structure, post-
 translational modifications, variants, etc.), a
 minimal level of redundancy and high level
 of integration with other databases

- Swiss-Prot web site front page
Sequence Variants

b) Minimal redundancy

 Many sequence databases contain, for a given
 protein sequence, separate entries which
 correspond to different literature reports. In Swiss-
 Prot we try as much as possible to merge all these
 data so as to minimize the redundancy of the
 database. If conflicts exist between various
 sequencing reports, they are indicated in the feature
 table of the corresponding entry.

- Swiss-Prot User Manual, Section 1.1


Sequence Variants
IPI provides a top level guide to the main databases
that describe the proteomes of higher eukaryotic
organisms. IPI:
1. effectively maintains a database of cross
references between the primary data sources
2. provides minimally redundant yet maximally
complete sets of proteins for featured species (one
sequence per transcript)
3. maintains stable identifiers (with incremental
versioning) to allow the tracking of sequences in IPI
between IPI releases.
- IPI web site front page
Sequence Variants

• Swiss-Prot variants, isoforms and conflicts
  are retained as features
• Script varsplic.pl can enumerate all
  sequence variants

• Command-line options for full enumeration
  -which full -varsplic -variant -conflict
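
To make the idea of full enumeration concrete, the sketch below expands a base sequence and a list of optional edits into every combination. It is not a description of how varsplic.pl works internally; the feature representation (0-based, end-exclusive, non-overlapping spans with a replacement string) and the function name are assumptions for this example.

```python
from itertools import product


def enumerate_variants(sequence, features):
    """Return the set of sequences produced by every subset of `features`.

    features: list of (start, end, replacement) tuples with 0-based,
    end-exclusive coordinates; spans are assumed not to overlap.
    """
    # Apply edits right to left so earlier coordinates stay valid.
    ordered = sorted(features, reverse=True)
    results = set()
    for choices in product((False, True), repeat=len(ordered)):
        seq = sequence
        for use, (start, end, replacement) in zip(choices, ordered):
            if use:
                seq = seq[:start] + replacement + seq[end:]
        results.add(seq)
    return results


# Toy example: one substitution and one deletion give four sequences.
for variant in sorted(enumerate_variants("ABCDEF", [(1, 2, "X"), (3, 5, "")])):
    print(variant)
```

The number of sequences doubles with each independent feature, which is one reason databases store variants as annotations rather than as separate entries.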




Swiss-Prot Variant Annotations

• Feature viewer
 • Variants

Swiss-Prot VarSplic Output
P13746-00-01-00   MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF
P13746-01-01-00   MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF
P13746-00-00-00   MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF
P13746-00-03-00   MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF
P13746-01-03-00   MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF
P13746-00-04-00   MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGKPRFIAVGYVDDTQFVRF
P13746-01-04-00   MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGKPRFIAVGYVDDTQFVRF
P13746-00-05-00   MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF
P13746-01-05-00   MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF
P13746-01-00-00   MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF
P13746-00-02-00   MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF
P13746-01-02-00   MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRF
                  ******************************************:*****************


P13746-00-01-00   SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYTQAASSDSAQ
P13746-01-01-00   SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYTQAASSDSAQ
P13746-00-00-00   SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYTQAASSDSAQ
P13746-00-03-00   SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYTQAASSDSAQ
P13746-01-03-00   SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYTQAASSDSAQ
P13746-00-04-00   SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYTQAASSDSAQ
P13746-01-04-00   SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYTQAASSDSAQ
P13746-00-05-00   SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYTQAASSDSAQ
P13746-01-05-00   SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYTQAASSDSAQ
P13746-01-00-00   SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYTQAASSDSAQ
P13746-00-02-00   SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSS------DRKGGSYSQAASSDSAQ
P13746-01-02-00   SSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSGGEGVKDRKGGSYSQAASSDSAQ
                  *************************************     *******:*********


Omnibus Database Redundancy Elimination
• Source databases often contain the same
  sequences with different descriptions
• Omnibus databases keep one copy of the
  sequence, and
  • An arbitrary description, or
  • All descriptions, or
  • Particular description, based on source preference
• Good definitions can be lost, including
  taxonomy
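
The three description policies can be sketched as follows. Identical sequences are grouped, all descriptions are kept, and one is marked as preferred according to a source ranking. The ranking shown (Swiss-Prot first) and the record layout are assumptions for illustration, not the policy of any particular omnibus database.

```python
def collapse_redundant(records,
                       preference=("sp", "ref", "pir", "gb", "emb", "dbj")):
    """Group records by sequence and choose a preferred description.

    records: iterable of (source, description, sequence) tuples.
    preference: source tags from most to least trusted (assumed order).
    """
    groups = {}
    for source, description, sequence in records:
        groups.setdefault(sequence, []).append((source, description))

    rank = {source: i for i, source in enumerate(preference)}
    merged = {}
    for sequence, entries in groups.items():
        # Unknown sources sort after every listed source.
        best = min(entries, key=lambda e: rank.get(e[0], len(rank)))
        merged[sequence] = {
            "preferred": best[1],
            "all": [description for _, description in entries],
        }
    return merged


records = [
    ("emb", "hypothetical protein [Homo sapiens]", "MAVMAP"),  # placeholder sequence
    ("sp", "COM4_HUMAN COMM domain containing protein 4", "MAVMAP"),
]
print(collapse_redundant(records)["MAVMAP"]["preferred"])
```

Keeping the full list alongside the preferred description avoids losing a good definition, or the taxonomy it carries, when an uninformative one happens to come first.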


Description Elimination
• gi|12053249|emb|CAB66806.1|
  hypothetical protein [Homo sapiens]
• gi|46255828|gb|AAH68998.1|
  COMMD4 protein [Homo sapiens]
• gi|42632621|gb|AAS22242.1|
  COMMD4 [Homo sapiens]
• gi|21361661|ref|NP_060298.2|
  COMM domain containing 4 [Homo sapiens]
• gi|51316094|sp|Q9H0A8|
  COM4_HUMAN COMM domain containing protein 4
• gi|49065330|emb|CAG38483.1|
  COMMD4 [Homo sapiens]

Description Elimination
• gi|2947219|gb|AAC39645.1|
  UDP-galactose 4' epimerase [Homo sapiens]
• gi|1119217|gb|AAB86498.1|
  UDP-galactose-4-epimerase [Homo sapiens]
• gi|14277913|pdb|1HZJ|B
  Chain B, Human Udp-Galactose 4-Epimerase: Accommodation
  Of Udp-N- Acetylglucosamine Within The Active Site
• gi|14277912|pdb|1HZJ|A
  Chain A, Human Udp-Galactose 4-Epimerase: Accommodation
  Of Udp-N- Acetylglucosamine Within The Active Site
• gi|2494659|sp|Q14376|
  GALE_HUMAN UDP-glucose 4-epimerase (Galactowaldenase)
  (UDP-galactose 4-epimerase)
• gi|1585500|prf||2201313A
  UDP galactose 4'-epimerase

Description Elimination
• gi|4261710|gb|AAD14010.1|
  chlordecone reductase [Homo sapiens]
• gi|2117443|pir||A57407
  chlordecone reductase (EC 1.1.1.225) / 3alpha-
  hydroxysteroid dehydrogenase (EC 1.1.1.-) I [validated]
  – human
• gi|1839264|gb|AAB47003.1|
  HAKRa product/3 alpha-hydroxysteroid dehydrogenase
  homolog [human, liver, Peptide, 323 aa]
• gi|1705823|sp|P17516|AKC4_HUMAN Aldo-keto reductase
  family 1 member C4 (Chlordecone reductase) (CDR) (3-
  alpha-hydroxysteroid dehydrogenase) (3-alpha-HSD)
  (Dihydrodiol dehydrogenase 4) (DD4) (HAKRA)
• gi|7328948|dbj|BAA92885.1|
  dihydrodiol dehydrogenase 4 [Homo sapiens]
• gi|7328971|dbj|BAA92893.1|
  dihydrodiol dehydrogenase 4 [Homo sapiens]
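
The identifier fields in the headers above can be split into (source tag, values) pairs so that the source of each description is available when choosing among them. The sketch below assumes the set of tags seen in these examples and treats any other field as a value of the preceding tag; `DB_TAGS` and `parse_identifier` are names invented for this illustration.

```python
# Source tags appearing in the example headers (assumed to be the full set here).
DB_TAGS = {"gi", "gb", "emb", "dbj", "ref", "sp", "pir", "prf", "pdb"}


def parse_identifier(identifier):
    """Split a |-separated identifier into (tag, values) pairs."""
    xrefs = []
    for token in identifier.split("|"):
        if token in DB_TAGS:
            xrefs.append((token, []))
        elif xrefs and token:          # skip empty fields such as pir||A57407
            xrefs[-1][1].append(token)
    return [(tag, tuple(values)) for tag, values in xrefs]


for header in ["gi|51316094|sp|Q9H0A8|",
               "gi|14277913|pdb|1HZJ|B",
               "gi|2117443|pir||A57407"]:
    print(parse_identifier(header))
```

A value that happens to equal a tag name would be misread by this simple scan, so a production parser would need the per-database field counts instead.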
Summary

• Protein sequence databases should be
  interpreted with as much care as mass
  spectra
• Protein sequences come from genes
• Use controlled vocabularies
• Understand the structure of ontologies
• Take advantage of computational
  predictions
• Look for sequence variants
• Be careful with omnibus databases