Automated Annotation Literature Based Curation Database Cross References

Protein Bioinformatics – Advances and Challenges

By Sona Vasudevan and Peter McGarvey



      Outline
• What is Bioinformatics?
  Past & Present
• About PIR
• PIR resources
• UniProt resources
• PIR’s leading role in caBIG, Biodefense, and Ontology
         What is Bioinformatics?
NIH Biomedical Information Science and Technology Initiative
              (BISTI) Working Definition (2000)
   Bioinformatics: Research, development, or
    application of computational tools and
    approaches for expanding the use of biological,
    medical, behavioral or health data, including
    those to acquire, store, organize, archive,
    analyze, or visualize such data.
    Computer (Information) + Mouse (Biology) = Bioinformatics
“A science which hesitates to
forget its founders is lost.”
       - A. N. Whitehead
Evolution of Protein Databases
Dr. Margaret Oakley Dayhoff (1925 – 1983), Georgetown University
The origin of the single-letter code for the amino acids
Challenges we are facing today!
    Total number of sequences in NR            ~4,919,302
    Total number of environmental sequences    ~6,028,191 (NCBI)
    Number of domain families (Pfam)           ~8,957
    Number of domain families (SMART)          ~665
    Number of structures (PDB)                 ~43,339
    Number of COGs                             ~4,873 (Unicellular)
                                               ~4,852 (Eukaryote)
Molecular Biology Databases
[Figure: NAR Molecular Biology Database Collection; database number (0 to 800) by year, 1999 to 2005]
719 Databases in 14 categories
The DNA sequence database has exceeded 100 gigabases.
The birth of “omes” and the “omic” era in biology
• Genomics
• Proteomics
• Functionomics
• Unknomics
• Metagenomics
Protein Information Resource
Integrated Protein Informatics Resource for Proteomics Research
http://pir.georgetown.edu
• UniProt Universal Protein Resource: Central Resource of Protein Sequence and Function
• PIRSF Protein Family Classification System: Protein Classification and Functional Annotation
• iProClass Integrated Protein Knowledgebase: Data Integration and Functional Associative Analysis
UniProt Databases
• UniParc: Comprehensive Sequence Archive with Sequence History
• UniProt: Knowledgebase with Full Classification and Functional Annotation
• UniRef: Non-redundant Reference Databases for Sequence Search

[Diagram, top layer to bottom layer]
    Clustering at 100, 90, 50% Identity:                                 UniRef100 (NREF), UniRef90, UniRef50
    Classification, Literature-Based & Automated Annotation:             UniProt (Knowledgebase)
    Merging:                                                             UniParc (Archive)
    Source data: Swiss-Prot, TrEMBL, PIR-PSD, RefSeq, GenBank/EMBL/DDBJ, Ensembl, PDB, Patent Data, Other Data
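
The "clustering at 100, 90, 50% identity" layer can be illustrated with a toy greedy procedure. This is only a minimal sketch and not the UniRef production pipeline: the `identity` function assumes equal-length, pre-aligned sequences, and the names `identity`, `greedy_cluster`, and the sequences P1 to P5 are hypothetical.

```python
def identity(a: str, b: str) -> float:
    """Fraction of identical positions; assumes pre-aligned, equal-length sequences."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs: dict, threshold: float) -> dict:
    """Assign each sequence to the first representative it matches at >= threshold.

    Returns a mapping of representative name -> list of member names.
    """
    clusters = {}
    for name, seq in seqs.items():
        for rep in clusters:
            if identity(seqs[rep], seq) >= threshold:
                clusters[rep].append(name)
                break
        else:
            # No existing representative is similar enough: start a new cluster.
            clusters[name] = [name]
    return clusters


# Hypothetical toy sequences (not real database entries).
seqs = {
    "P1": "MKTAYIAKQR",
    "P2": "MKTAYIAKQR",  # identical to P1
    "P3": "MKTAYIAKQK",  # 9 of 10 positions match P1
    "P4": "MKTAYGGGGG",  # 5 of 10 positions match P1
    "P5": "WWWWWWWWWW",  # unrelated
}
for level in (1.0, 0.9, 0.5):
    print(level, greedy_cluster(seqs, level))
```

Lowering the threshold merges more sequences under one representative, which is the trade-off between redundancy reduction and resolution that the three UniRef levels expose.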
UniProt Knowledgebase
• Objective: Stable, Comprehensive, Fully Classified, Richly and Accurately Annotated
• Information Content
     • Isoform Presentation
     • Nomenclature
     • Family Classification and Domain Identification
     • Functional Annotation
• Approaches
     • Full Classification
     • Automated Annotation
     • Literature-Based Curation
     • Database Cross-References
     • Controlled Vocabularies & Ontologies
     • Evidence Attribution
PIRSF Classification System
• PIRSF:
     • Reflects evolutionary relationships of full-length proteins
     • A network structure from superfamilies to subfamilies
• Definitions:
     • Homeomorphic Family (HF): Basic Unit
     • Homologous: Common ancestry, inferred by sequence similarity
     • Homeomorphic: Full-length similarity & common domain architecture
     • Hierarchy: Flexible number of levels with varying degrees of sequence conservation
     • Network Structure: Allows multiple parents
• Advantages:
     • Annotate both general biochemical and specific biological functions
     • Accurate propagation of annotation and development of standardized protein nomenclature and ontology
Credit: AN Nikolskaya
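
The "network structure" with multiple parents can be pictured as a directed acyclic graph rather than a strict tree. The following is a minimal sketch under that assumption; the class `FamilyNetwork` and all node names are hypothetical placeholders, not real PIRSF identifiers or PIR software.

```python
from collections import defaultdict


class FamilyNetwork:
    """Toy classification network in which a node may have several parents."""

    def __init__(self):
        self.parents = defaultdict(set)  # child -> set of direct parents

    def add_link(self, child, parent):
        self.parents[child].add(parent)

    def ancestors(self, node):
        """Return every node reachable by following parent links upward."""
        seen = set()
        stack = [node]
        while stack:
            for p in self.parents.get(stack.pop(), ()):
                if p not in seen:
                    seen.add(p)
                    stack.append(p)
        return seen


net = FamilyNetwork()
# A homeomorphic family linked to two domain superfamilies (multiple parents).
net.add_link("HF_example", "SF_domain_A")
net.add_link("HF_example", "SF_domain_B")
# A subfamily below the homeomorphic family.
net.add_link("SubF_example", "HF_example")

print(sorted(net.ancestors("SubF_example")))
```

A strict tree would force the multi-domain family under a single superfamily; allowing a set of parents keeps both domain relationships visible.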
PIRSF Classification System
Protein Classification and Functional Annotation
(http://pir.georgetown.edu/pirsf/)

• Comprehensive Classification of All UniProt Proteins
• Curated Families with Protein Name and Site Rules
• Classification and Visualization Tools
[Figure labels: Taxonomy Distribution and Phylogenetic Pattern; Iterative BlastClust Tree with Annotation Table, MSA & Phylogenetic tree]
Classification Tool: BlastClust
• Curator-guided clustering
• Single-linkage clustering using BLAST
• Retrieve all proteins sharing a common domain
• Iterative BlastClust (fixed length coverage)
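
Single-linkage clustering over pairwise similarity hits can be sketched with a union-find structure. This is an illustrative sketch only: it does not invoke BLAST or the BlastClust tool, and the hit tuples `(a, b, identity, coverage)`, the thresholds, and the function name `single_linkage` are hypothetical stand-ins for parsed search results.

```python
def single_linkage(ids, hits, min_identity, min_coverage):
    """Group ids so that any pair joined by a qualifying hit shares a cluster."""
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, ident, cov in hits:
        if ident >= min_identity and cov >= min_coverage:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return sorted(sorted(c) for c in clusters.values())


ids = ["A", "B", "C", "D", "E"]
hits = [
    ("A", "B", 0.45, 0.90),
    ("B", "C", 0.40, 0.85),
    ("C", "D", 0.60, 0.50),  # high identity but short coverage
    ("D", "E", 0.20, 0.95),  # long coverage but low identity
]

# Fixed coverage, then a relaxed coverage to show how clusters merge.
print(single_linkage(ids, hits, min_identity=0.3, min_coverage=0.8))
print(single_linkage(ids, hits, min_identity=0.3, min_coverage=0.5))
```

Because linkage is transitive, A and C end up together through B even without a direct A to C hit, which is why a curator typically reviews the thresholds between iterations.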
PIRSF-Based Protein Annotation
• Classification-Driven Rule-Based Annotation
• Provides Consistent Annotation and Database Integrity Check
• Includes:
     • Site Rule (PIRSR): Position-Specific Site Feature (FT)
     • Name Rule (PIRNR): Transfer name from PIRSF to individual proteins
          • Protein Name (DE) with Synonym, EC, Misnomer
          • GO Term
Name Rule Interface

Rule ID          Rule Condition                            Rule Description
PIRNR000881-1    PIRSF000881 member and vertebrates        Name: S-acyl fatty acid synthase thioesterase
                                                           EC: oleoyl-[acyl-carrier-protein] hydrolase (EC 3.1.2.14)
PIRNR000881-2    PIRSF000881 member and not vertebrates    Name: Type II thioesterase
                                                           EC: thiolester hydrolases (EC 3.1.2.-)
PIRNR025624-1    PIRSF025624 member                        Name: ACT domain protein
                                                           Misnomer: chorismate mutase
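
The name rules listed for this slide can be read as condition/annotation pairs. The sketch below encodes the three rules as data and applies the first one whose condition holds. The rule IDs, names, and EC numbers come from the slide; the `Protein` and `NameRule` classes, the lineage check, and the accessions are hypothetical and do not reflect PIR's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Protein:
    accession: str            # hypothetical accession
    pirsf: str                # assigned PIRSF family ID
    lineage: Tuple[str, ...]  # taxonomic lineage names


@dataclass
class NameRule:
    rule_id: str
    pirsf: str
    condition: Callable[[Protein], bool]
    name: str
    ec: Optional[str] = None
    misnomer: Optional[str] = None


def is_vertebrate(p: Protein) -> bool:
    # Assumption: vertebrates are recognized by a "Vertebrata" lineage entry.
    return "Vertebrata" in p.lineage


RULES = [
    NameRule("PIRNR000881-1", "PIRSF000881", is_vertebrate,
             "S-acyl fatty acid synthase thioesterase", ec="EC 3.1.2.14"),
    NameRule("PIRNR000881-2", "PIRSF000881", lambda p: not is_vertebrate(p),
             "Type II thioesterase", ec="EC 3.1.2.-"),
    NameRule("PIRNR025624-1", "PIRSF025624", lambda p: True,
             "ACT domain protein", misnomer="chorismate mutase"),
]


def apply_name_rules(p: Protein, rules=RULES) -> Optional[NameRule]:
    """Return the first rule whose family and condition both match, else None."""
    for rule in rules:
        if rule.pirsf == p.pirsf and rule.condition(p):
            return rule
    return None


rat = Protein("X00001", "PIRSF000881", ("Eukaryota", "Vertebrata", "Rattus"))
bact = Protein("X00002", "PIRSF000881", ("Bacteria", "Streptomyces"))
act = Protein("X00003", "PIRSF025624", ("Bacteria",))
for prot in (rat, bact, act):
    rule = apply_name_rules(prot)
    print(prot.accession, rule.rule_id, rule.name)
```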
Rule-based Annotation of Protein Entries Using PIRSF
[Figure labels: Structure; Binding/active sites; Identification of residues]
Methodology
• Defining a Rule
     • Select template structure
     • Align curated PIRSF seed members and structural template
     • Structure-based sequence alignment of seeds
     • Edit MSA retaining conserved regions covering all site residues
     • Build Site HMM from concatenated conserved regions
• Rule Condition
     • Membership Check (PIRSF HMM threshold)
     • Conserved Region Check (site HMM threshold)
     • Site Residue Check (position-specific residue in HMMAlign)
• Rule Propagation
     • Propagate conserved feature annotation to all members that fit the rule
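
The three rule-condition checks and the propagation step can be sketched as a simple filter. This is a minimal sketch under stated assumptions: the HMM scores are taken as already computed by an external search, `aligned_residues` stands in for the template-position-to-residue mapping that an HMM alignment would provide, and every name, threshold, and position below is hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SiteRule:
    family_threshold: float        # PIRSF HMM score cutoff (membership check)
    site_threshold: float          # site HMM score cutoff (conserved region check)
    site_residues: Dict[int, str]  # template position -> required residue


@dataclass
class Candidate:
    accession: str
    family_score: float
    site_score: float
    aligned_residues: Dict[int, str]  # template position -> residue in this protein


def passes_rule(rule: SiteRule, c: Candidate) -> bool:
    """Apply the membership, conserved-region, and site-residue checks in order."""
    if c.family_score < rule.family_threshold:
        return False
    if c.site_score < rule.site_threshold:
        return False
    return all(c.aligned_residues.get(pos) == aa
               for pos, aa in rule.site_residues.items())


def propagate(rule: SiteRule, candidates: List[Candidate]) -> List[str]:
    """Return accessions of members that fit the rule and receive the annotation."""
    return [c.accession for c in candidates if passes_rule(rule, c)]


# Hypothetical rule with three required site residues.
rule = SiteRule(100.0, 20.0, {57: "H", 102: "D", 195: "S"})
candidates = [
    Candidate("X1", 150.0, 25.0, {57: "H", 102: "D", 195: "S"}),  # fits
    Candidate("X2", 150.0, 25.0, {57: "H", 102: "D", 195: "A"}),  # site residue differs
    Candidate("X3", 50.0, 25.0, {57: "H", 102: "D", 195: "S"}),   # below family cutoff
    Candidate("X4", 150.0, 10.0, {57: "H", 102: "D", 195: "S"}),  # below site cutoff
]
print(propagate(rule, candidates))
```

Ordering the checks from coarse to fine means a protein that is a family member but has lost a site residue is excluded from propagation rather than annotated incorrectly.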
An example of a PIR rule integrated into an SP record
[Figure: SP record; callout label: PIR Rule]
PIRSF Protein Classification provides a platform for protein annotation
• Improves Annotation Quality
     • Annotation of biological function of whole proteins
     • Annotation of uncharacterized hypothetical proteins (functional predictions helped by newly detected family relationships)
     • Correction of annotation errors
     • Improvement of under- or over-annotated proteins
• Standardization of Protein Names
Data Integration
• Data Warehouse
     • Local Copy of Databases in a Unified Database Schema
     • Allows Local Control of Data; Update Problem
• Hypertext Navigation
     • Browsing Model with Hypertext Links
     • Allows Direct Interaction; Easily Lost in Cyberspace
• iProClass Approach
     • Data Warehouse + Hypertext Navigation
     • Rich Links (Links + Executive Summaries)
     • Modular and Open Framework for Adding New Components in Distributed Networking Environment
iProClass Database
Integrated Protein Family, Function, Structure Information
• ~5,000,000 Protein Sequences
• Rich Links to >80 Databases
• Value-Added Views for UniProt

[Diagram: iProClass at the center, linked to the database groups below]
    Protein Sequence:   PIR-NREF, PIR-PSD, Swiss-Prot, TrEMBL, RefSeq, GenePept
    Gene/Genome:        GenBank/EMBL/DDBJ, LocusLink, UniGene, GDB, OMIM, SGD, MGI, FlyBase, MIPS, TIGR
    Function/Pathway:   EC-IUBMB, KEGG, BRENDA, WIT, MetaCyc, EcoCyc, Gene Ontology
    Structure:          PDB, SCOP, CATH, PDBSum, MMDB, FFSP
    Family:             PIR Superfamily, PIR-ASDB, InterPro, Pfam, PROSITE, COG, BLOCKS, ProClass, MetaFam
    Modification:       RESID, PhosphoBase, PhosphorylationSite
    Interaction:        DIP, BIND
    Expression:         PMG
    Literature:         PubMed
    Taxonomy:           NCBI Taxon

iProClass report sections: Protein Sequence; Superfamily/Domain/Motif; Protein Function/Pathway; Protein Interaction; Protein Modification; Protein Expression; Protein Structure; Gene
iProClass Views
[Figure labels: Family Report; Sequence Report]
PIR iProClass Searches
[Figure labels: ID Mapping; Text Search; Peptide Search; BLAST Search]
1. Albert Einstein College of Medicine
   T. gondii, C. parvum
2. Caprion Pharmaceuticals
   B. abortus
3. Harvard Institute of Proteomics
   V. cholerae, B. anthracis
4. Myriad Genetics
   B. anthracis, Y. pestis, F. tularensis, Vaccinia, Variola
5. Pacific Northwest National Laboratory
   S. typhimurium, S. typhi, Vaccinia, Monkeypox
6. Scripps
   SARS CoV, Influenza
7. University of Michigan
   B. anthracis

[Diagram: research centers (Albert Einstein, PNNL, U of Michigan, Harvard, Myriad, Scripps, Caprion) connected through DATA to the Resource Center (PIR, SSS, VBI)]
[Figure labels: Organism; Research Center; Data Type]
Master Protein Directory
29 Colonization Pathway Proteins
Currently contains 3,733 ORF Clones out of 3,784 Proteins
[Figure labels: Protein Summary Report; Protein and Reagent Information; Clone Sequences; Order Clones from Repositories; Search for Related Proteins in Catalog by Family Classification or Similarity Searches]
Mouse proteins detected in B. anthracis and S. typhimurium infected macrophages
    NCI caBIG Initiative


cancer Biomedical Informatics Grid:
•   Informatics platform to enable sharing of research, data and tools
     • Designed and built by an open federation of organizations
     • Facilitate connectivity via common standards and unifying architecture
     • Open source and open access principles
•   Domain Workspaces
     • Clinical Trial Management Systems
     • Integrative Cancer Research
     • Imaging
     • Tissue Banks and Pathology Tools
•   Cross Cutting Workspaces
     • Architecture
     • Vocabularies and Common Data Elements
    PIR Activities in caBIG™

•   Integrative Cancer Research Workspace
     • Developer
          • Grid-enablement of PIR
     • Adopter
          • SEED Genome Annotation Tool (completed)
          • GeneConnect Genomic Identifier Mapping Service
•   Vocabularies and Common Data Elements
     • Participant