					                 Proteome Annotation:
                 Resources and Tools

                      Sandra Orchard
                       HUPO 2007




           Protein annotation
1. Use sequence homology against a highly
   curated protein database (UniProtKB) to
   find a match or identify orthologues and
   transfer annotation

2. Use “guilt by association” – infer function
   from the molecules your proteins interact
   with
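The second approach can be illustrated with a toy example. This is a minimal sketch, assuming a hypothetical interaction list and hypothetical function labels (none of the protein names or annotations below come from a real database): an unannotated protein is assigned the function most common among its annotated interaction partners.

```python
from collections import Counter

# Hypothetical binary interactions (pairs of protein names); illustrative only.
interactions = [
    ("protX", "protA"),
    ("protX", "protB"),
    ("protX", "protC"),
    ("protA", "protB"),
]

# Hypothetical known annotations; protX is unannotated.
known_function = {
    "protA": "DNA repair",
    "protB": "DNA repair",
    "protC": "transport",
}


def partners(protein, pairs):
    """Return the set of proteins that interact with `protein`."""
    found = set()
    for a, b in pairs:
        if a == protein:
            found.add(b)
        elif b == protein:
            found.add(a)
    return found


def infer_function(protein, pairs, annotations):
    """Suggest the most common function among annotated partners, or None."""
    votes = Counter(
        annotations[p] for p in partners(protein, pairs) if p in annotations
    )
    if not votes:
        return None
    return votes.most_common(1)[0][0]


print(infer_function("protX", interactions, known_function))
```

A suggestion of this kind is only a hypothesis for follow-up; it says nothing about how strong the underlying interaction evidence is.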




               What is UniProt

• Based on the original work on PIR, Swiss-Prot
  and TrEMBL

• Funded mainly by NIH

• Collaboration between EBI, SIB and PIR




               What is UniProt

• UniProt Knowledgebase:
  2 sections

  – UniProtKB/Swiss-Prot: non-redundant, high-quality
    manual annotation. Aims to describe in a single
    record all protein products derived from a given
    gene in a given species

  – UniProtKB/TrEMBL: redundant, automatically
    annotated




[Figure: UniProt data sources and data flow. Components shown, from the
top of the diagram to the bottom:]
  UniRef 50, UniRef 90, UniRef 100; Proteome Sets; IPI
  UniSave; UniProtKB; UniMes
  UniParc
  Database sources: PDB, Sub/Peptide Data, FlyBase, WormBase, Patent Data,
  INSDC (incl. WGS, Env.), RefSeq, Ensembl, VEGA




      The New Website




   www.uniprot.org




1. Sequence curation, stable identifiers, versioning and archiving
    For example: erroneous gene model predictions, frameshifts,
    premature stop codons, read-throughs, erroneous initiator methionines…




UniProtKB/Swiss-Prot – sequence annotation
   New field – Protein existence (PE)

   PE level, with the following values:
           1: Evidence at protein level
           2: Evidence at transcript level
           3: Inferred from homology
           4: Predicted
           5: Uncertain
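The five values lend themselves to a small lookup type. This is a minimal sketch, assuming the levels can be treated as an ordered scale where a lower number means stronger evidence; the class name, the helper function and the accessions are hypothetical and are not part of any UniProt software.

```python
from enum import IntEnum


class ProteinExistence(IntEnum):
    """PE levels as listed above."""
    PROTEIN_LEVEL = 1      # Evidence at protein level
    TRANSCRIPT_LEVEL = 2   # Evidence at transcript level
    HOMOLOGY = 3           # Inferred from homology
    PREDICTED = 4          # Predicted
    UNCERTAIN = 5          # Uncertain


def keep_well_supported(entries, max_level=ProteinExistence.TRANSCRIPT_LEVEL):
    """Keep accessions whose PE level is at or below `max_level`."""
    return [acc for acc, level in entries if level <= max_level]


# Hypothetical accessions and levels, for illustration only.
entries = [("ACC0001", 1), ("ACC0002", 4), ("ACC0003", 2)]
print(keep_well_supported(entries))
```

A filter like this could be used, for example, to restrict a search database to entries with experimental support.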




   2. Identification of splice variants




    3. Identification of variants (at amino acid level) and of PTMs,
       and also:
       – Domain annotation
       – Binding sites




4. Consistent nomenclature (& synonyms)




5. Annotation of experimental data from the literature in 27 defined fields.




Controlled vocabularies used whenever possible…




  Binary interactions taken from the IntAct database




Disease-specific annotation added to
human entries, with supporting cross-referencing




Source references included in entry




6. Extensive cross-referencing, a central portal to a wealth
of external resources…




  InterPro – defines protein family
membership and enables domain annotation




      UniProtKB/TrEMBL


• Redundant – only 100% identical sequences
merged

• Automated clean-up of annotation from original
nucleotide sequence entry

• Additional value added by using automatic
annotation
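The redundancy rule in the first bullet can be sketched in a few lines. This is an illustration only, assuming that "merged" means grouping entries by exact sequence identity; the accessions and sequences are made up, and the real pipeline may apply further criteria (such as organism) that are not described here.

```python
from collections import defaultdict


def merge_identical(entries):
    """Group accessions whose sequences are 100% identical."""
    groups = defaultdict(list)
    for accession, sequence in entries:
        groups[sequence].append(accession)
    return list(groups.values())


# Hypothetical entries: E1 and E2 share a sequence, E3 differs by one residue.
entries = [("E1", "MKTAY"), ("E2", "MKTAY"), ("E3", "MKTAYQ")]
print(merge_identical(entries))
```

Because only exact matches are merged, near-identical sequences such as E3 stay as separate entries, which is why the set remains redundant.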




     Automatic Annotation


• Recognises common annotation belonging to a
  closely related family within UniProtKB/Swiss-Prot

• Identifies all members of this family using
  patterns, motifs and HMMs in InterPro

• Transfers common annotation to related family
  members in TrEMBL
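The three steps above can be sketched as follows. This is a simplified illustration, not the actual annotation pipeline: family membership is reduced to a single hypothetical signature identifier per entry, "common annotation" is taken to mean annotation shared by every curated member, and all accessions, family identifiers and annotation terms are invented.

```python
# Hypothetical curated entries: accession -> (family signature, annotations)
curated = {
    "SP1": ("FAM_0001", {"kinase activity", "ATP binding", "nucleus"}),
    "SP2": ("FAM_0001", {"kinase activity", "ATP binding", "cytoplasm"}),
}

# Hypothetical unreviewed entries: accession -> family signature hit
unreviewed = {"TR1": "FAM_0001", "TR2": "FAM_0002"}


def common_annotation(curated_entries):
    """For each family, keep only annotation shared by every curated member."""
    shared = {}
    for family, annotations in curated_entries.values():
        if family in shared:
            shared[family] &= annotations
        else:
            shared[family] = set(annotations)
    return shared


def transfer(unreviewed_entries, shared):
    """Give each unreviewed entry the common annotation of its family, if any."""
    return {
        acc: sorted(shared.get(family, set()))
        for acc, family in unreviewed_entries.items()
    }


result = transfer(unreviewed, common_annotation(curated))
print(result)
```

Annotation that differs between curated members (here the subcellular location) is not transferred, and an entry whose family has no curated members receives nothing.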




                        www.ebi.ac.uk/integr8




                 Integr8 – Integr8or








   Non-redundant proteome sets
• Complete experimentally determined protein sets not yet
  available for higher organisms

• Require inclusion of predicted proteins to give full
  proteome

• International Protein Index (IPI) merges data from UniProt,
  Ensembl and RefSeq to produce a non-redundant dataset




      International Protein Index

 • Non-redundant protein sets produced for human,
   mouse, rat, Arabidopsis, zebrafish, cow and chicken
 • Effectively maintains a database of cross-references
   between the primary data sources
 • Provides minimally redundant yet maximally complete
   sets of proteins for featured species (one sequence
   per transcript)
 • Maintains stable identifiers (with incremental
   versioning) to allow the tracking of sequences in IPI
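The idea of a stable identifier with incremental versioning can be shown with a toy registry. This is a minimal sketch under stated assumptions: the class, the `identifier.version` output format and the example identifier are hypothetical, and the rule that the version increases only when the sequence changes is an assumption rather than a description of how IPI is implemented.

```python
class VersionedStore:
    """Toy registry: the identifier stays fixed, the version counts sequence changes."""

    def __init__(self):
        self._records = {}  # identifier -> (version, sequence)

    def submit(self, identifier, sequence):
        """Register a sequence; bump the version only if the sequence changed."""
        if identifier not in self._records:
            self._records[identifier] = (1, sequence)
        else:
            version, current = self._records[identifier]
            if sequence != current:
                self._records[identifier] = (version + 1, sequence)
        return f"{identifier}.{self._records[identifier][0]}"


store = VersionedStore()
first = store.submit("ID0001", "MKT")    # new identifier, version 1
same = store.submit("ID0001", "MKT")     # unchanged sequence, still version 1
updated = store.submit("ID0001", "MKTA") # changed sequence, version 2
print(first, same, updated)
```

Keeping the identifier fixed while the version moves lets a result recorded against an older release be traced to the current sequence.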




Why are we interested in molecular
          interactions?

1. As a means of precisely understanding a
   protein’s role inside a specific cell type

2. Guilt by association – it may be the only means
   of predicting a protein’s function

3. As building blocks for Systems Biology




                      IntAct

    IntAct provides
    • data repository for molecular interactions
    • data analysis for molecular interactions
    • graph visualisation of molecular interactions
    with GO and InterPro
    • data resource for molecular interactions




           www.ebi.ac.uk/intact




                   IntAct

• A completely open-source resource

• Consists of a database + software suite +
  data

• All freely available at the EBI website
          www.ebi.ac.uk/intact




               Data Model

                Experiment
                     ↓
                Interaction
                     ↓
                Interactors
                     ↓
                 Features
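The hierarchy above maps naturally onto nested record types. This is a minimal sketch, assuming each level simply holds a list of the level below it; the class and field names are hypothetical and do not reproduce the IntAct schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Feature:
    description: str   # free-text label for a region of the interactor
    start: int         # 1-based start position on the interactor sequence
    end: int           # 1-based end position (inclusive)


@dataclass
class Interactor:
    accession: str     # identifier of the participating molecule
    features: List[Feature] = field(default_factory=list)


@dataclass
class Interaction:
    interactors: List[Interactor] = field(default_factory=list)


@dataclass
class Experiment:
    reference: str     # placeholder for the source of the data
    interactions: List[Interaction] = field(default_factory=list)


# Hypothetical example: one experiment, one interaction, two interactors.
experiment = Experiment(
    reference="example-reference",
    interactions=[
        Interaction(
            interactors=[
                Interactor("ACC_A", [Feature("example region", 1, 10)]),
                Interactor("ACC_B"),
            ]
        )
    ],
)

interactor_count = sum(len(i.interactors) for i in experiment.interactions)
print(interactor_count)
```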




            The Experiment

• How is the interaction identified?
  – CV “InteractionDetectionMethod”
• How are the participants identified?
  – CV “ParticipantDetectionMethod”
• Where does the experiment take place?
  – Host Organism
• Where did the data originate from?
  – Primary reference (PMID)




      Controlled Vocabularies




            The Interaction

• One experiment can have many
  interactions associated with it
• Each interaction can be given a type,
  assisting the user in assessing data quality




            The Interactors

• Usually identified by an accession number
  from an external database – UniProtKB,
  EMBL, ChEBI. Can annotate at splice
  isoform level.

• If not present in an external resource, may
  have to be created within IntAct and given an
  internal AC number




            The Interactors

 Both the role of each
 interactor within an
 experiment and its
 biological role can
 be given




                Features

• Many proteins used in interaction
  experiments are modified e.g. tags added,
  deletion mutants created.

• IntAct can map all these to a given sequence
  – features are remapped whenever there is
  a sequence update
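One simple way to picture remapping after a sequence update is to look for the feature's residues in the new sequence. This is a deliberately naive sketch and not IntAct's method: the function name is hypothetical, positions are assumed to be 1-based and inclusive, and a feature is only relocated when its exact residues occur once in the updated sequence.

```python
def remap_feature(old_sequence, new_sequence, start, end):
    """Relocate a feature (1-based, inclusive) after a sequence update.

    Returns the new (start, end), or None if the feature's residues
    are missing from, or occur more than once in, the updated sequence.
    """
    segment = old_sequence[start - 1:end]
    if not segment:
        return None  # empty or invalid range
    first = new_sequence.find(segment)
    if first == -1:
        return None  # residues no longer present
    if new_sequence.find(segment, first + 1) != -1:
        return None  # ambiguous: more than one possible location
    return first + 1, first + len(segment)


# Made-up sequences: the update adds two residues at the N-terminus.
old = "MKTAYIAKQR"
new = "GSMKTAYIAKQR"
print(remap_feature(old, new, 4, 7))
```

Cases that return None here (changed or repeated residues) are the ones that would need an alignment-based approach or a curator's decision.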




           Data availability

• A team of manual curators annotates the
  literature

• Weekly release on the FTP site

• Data available in PSI-MI XML 1.0 and 2.5, and
  MITAB 2.5
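MITAB 2.5 is a tab-delimited format with 15 columns per binary interaction, which makes it easy to read with standard tools. This is a minimal sketch: the shortened column names are my own labels for the 15 fields, the sample record is made up (with "-" marking empty fields), and no file name or download location is assumed.

```python
import csv
import io

# Shortened labels for the 15 tab-separated MITAB 2.5 fields.
MITAB25_COLUMNS = [
    "id_a", "id_b", "alt_ids_a", "alt_ids_b", "aliases_a", "aliases_b",
    "detection_method", "first_author", "publication_ids",
    "taxid_a", "taxid_b", "interaction_type", "source_db",
    "interaction_ids", "confidence",
]

# Made-up record for illustration only.
sample = "\t".join(
    ["uniprotkb:ACC_A", "uniprotkb:ACC_B"] + ["-"] * 13
) + "\n"


def read_mitab(handle):
    """Yield one dict per interaction line, skipping blank and '#' header lines."""
    # QUOTE_NONE: fields may contain double quotes that are not CSV quoting.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith("#"):
            continue
        yield dict(zip(MITAB25_COLUMNS, row))


records = list(read_mitab(io.StringIO(sample)))
print(records[0]["id_a"], records[0]["id_b"])
```

To read a downloaded file instead of the in-memory sample, pass an open text file handle to `read_mitab`.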




Searching IntAct











