Assessment of Genome-wide
Protein Function Classification for
Drosophila melanogaster
Huaiyu Mi
mihn@fc.celera.com
Panther Protein Informatics group
Celera Genomics
How to classify proteins in
a robust and accurate way?
Outline
1. Introduction to PANTHER
2. Comparison of functional classification of
Drosophila proteins by FlyBase and
PANTHER
What is PANTHER?
PANTHER library (PANTHER/LIB)
a family tree
a multisequence alignment
an HMM
PANTHER index (PANTHER/X)
Molecular function
Biological process
Building the library
1. 500,000 protein sequences (filtered GenBank NR)
2. Clustering into 2200+ protein family clusters
3. MSA, HMM, and tree for each family; 40,000 subfamilies => PANTHER library (PANTHER/LIB)
4. Biologist curation: each family and subfamily was labeled with a name and classified by PANTHER/X categories => PANTHER index (PANTHER/X)
PANTHER/X
RECEPTOR
=> G-protein coupled receptor
=> protein kinase receptor
=> => serine/threonine protein kinase receptor
=> => tyrosine protein kinase receptor
GO
signal transducer GO:0004871
=> receptor GO:0004872
=> => transmembrane receptor GO:0004888
transmembrane receptor protein kinase GO:0019199
=> transmembrane receptor protein serine/threonine kinase GO:0004675
=> => transforming growth factor alpha receptor GO:0005023
=> => transforming growth factor beta receptor GO:0005024
=> => => activin receptor GO:0017002
=> => => => type I activin receptor GO:0016361
=> => => => type II activin receptor GO:0016362
=> => => type I transforming growth factor beta receptor GO:0005025
=> => => => type I activin receptor GO:0016361
=> => => type II transforming growth factor beta receptor GO:0005026
=> => => => type II activin receptor GO:0016362
=> transmembrane receptor protein tyrosine kinase GO:0004714
=> => boss receptor GO:0008288
=> => ephrin receptor GO:0005003
=> => => GPI-linked ephrin receptor GO:0005004
=> => => transmembrane-ephrin receptor GO:0005005
=> => epidermal growth factor receptor GO:0005006
=> => => gurken receptor GO:0008313
=> => fibroblast growth factor receptor GO:0005007
=> => hepatocyte growth factor receptor GO:0005008
=> => insulin receptor GO:0005009
=> => insulin-like growth factor receptor GO:0005010
=> => macrophage colony stimulating factor receptor GO:0005011
=> => macrophage receptor GO:0008019
=> => Neu/ErbB-2 receptor GO:0005012
=> => neurotrophin TRK receptor GO:0005013
=> => => neurotrophin TRKA receptor GO:0005014
=> => => neurotrophin TRKB receptor GO:0005015
=> => => neurotrophin TRKC receptor GO:0005016
=> => platelet-derived growth factor receptor GO:0005017
=> => => platelet-derived growth factor\, alpha-receptor GO:0005018
=> => => platelet-derived growth factor\, beta-receptor GO:0005019
=> => stem cell factor receptor GO:0005020
=> => vascular endothelial growth factor receptor GO:0005021
=> vascular endothelial growth factor receptor GO:0005021
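The arrow-prefixed GO listing above encodes a hierarchy by indentation depth. As a minimal sketch, assuming each `=>` marks one level of nesting and each line ends in a GO identifier, the listing can be turned into a child-to-parent map. The function name `parse_outline` and the regular expression are illustrative choices, not part of PANTHER or GO tooling.

```python
import re

# One line = zero or more "=>" markers, a term name, and a trailing GO ID.
LINE = re.compile(r"^((?:=>\s*)*)(.+?)\s+(GO:\d{7})$")


def parse_outline(lines):
    """Map each GO ID to the GO ID of its parent (None for the top of the excerpt)."""
    parents = {}
    stack = []  # (depth, go_id) for the chain of ancestors of the current line
    for line in lines:
        m = LINE.match(line.strip())
        if not m:
            continue  # skip headers or lines without a GO ID
        depth = m.group(1).count("=>")
        go_id = m.group(3)
        # Drop entries at the same or deeper level; what remains is the ancestry.
        while stack and stack[-1][0] >= depth:
            stack.pop()
        parents[go_id] = stack[-1][1] if stack else None
        stack.append((depth, go_id))
    return parents


# Excerpt copied from the listing above.
excerpt = [
    "=> transmembrane receptor protein tyrosine kinase GO:0004714",
    "=> => ephrin receptor GO:0005003",
    "=> => => GPI-linked ephrin receptor GO:0005004",
    "=> => => transmembrane-ephrin receptor GO:0005005",
    "=> => epidermal growth factor receptor GO:0005006",
]
parents = parse_outline(excerpt)
```

Storing the depth alongside each identifier keeps the parse correct even when an excerpt starts below the root level.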
PANTHER Scoring
1. Input: a FASTA file
2. Score against family and subfamily HMMs
3. Score above threshold?
   yes => Classified (Name, Molecular function, Biological process)
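The scoring flow above can be sketched as a threshold test over per-model scores. This is only an illustration under stated assumptions: the `HmmModel` class, the per-model `threshold` field, the convention that higher scores are better, and the choice of the best-scoring passing model are all hypothetical and not taken from the PANTHER software. The model names and scores are invented for the example.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class HmmModel:
    """Hypothetical stand-in for one family or subfamily HMM and its labels."""
    name: str
    molecular_function: str
    biological_process: str
    threshold: float  # assumed score cutoff for this model


def classify(scores: dict[str, float], models: dict[str, HmmModel]) -> Optional[HmmModel]:
    """Return the best-scoring model whose score clears its threshold, else None.

    `scores` maps a model name to the score of one query sequence against that
    HMM (higher is better in this sketch).
    """
    passing = [
        (score, name)
        for name, score in scores.items()
        if name in models and score >= models[name].threshold
    ]
    if not passing:
        return None  # no score above threshold: the sequence stays unclassified
    _, best = max(passing)
    return models[best]


# Illustrative values only.
models = {
    "family_A": HmmModel("family_A", "receptor", "signal transduction", 50.0),
    "family_B": HmmModel("family_B", "kinase", "protein phosphorylation", 80.0),
}
hit = classify({"family_A": 62.5, "family_B": 40.0}, models)
miss = classify({"family_A": 10.0, "family_B": 20.0}, models)
```

A classified sequence inherits the name, molecular function, and biological process of the model it hits; a sequence with no passing score returns `None`.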
How accurate is PANTHER?
FlyBase: a manually curated database for Drosophila genes
PANTHER: an automated annotation process
Assess the associations
Process for comparison
1. Fly protein sequences
2. FlyBase annotation with GO terms; PANTHER annotation by scoring against PANTHER
3. Automated comparison of FlyBase and PANTHER assignments
   => Match
   => Not match => Manual review => Correct, Incorrect, or Inconclusive
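The comparison process above can be sketched as a triage step that separates automatic matches from proteins needing manual review. The matching rule used here (the two sources share at least one identical GO term) is an assumption for illustration; the slides do not state the actual criterion. The protein identifiers are placeholders.

```python
from __future__ import annotations


def auto_compare(flybase_terms: set[str], panther_terms: set[str]) -> bool:
    """Assumed rule: the assignments match if they share at least one GO term."""
    return bool(flybase_terms & panther_terms)


def triage(annotations: dict[str, tuple[set[str], set[str]]]) -> tuple[set[str], set[str]]:
    """Split proteins into (auto-matched, needs manual review).

    `annotations` maps a protein ID to (FlyBase GO terms, PANTHER GO terms).
    Manual review then labels each unmatched association as correct,
    incorrect, or inconclusive; that step is human work and is not modeled.
    """
    matched, review = set(), set()
    for protein, (flybase_terms, panther_terms) in annotations.items():
        if auto_compare(flybase_terms, panther_terms):
            matched.add(protein)
        else:
            review.add(protein)
    return matched, review


# Placeholder protein IDs; GO IDs are taken from the receptor listing earlier.
annotations = {
    "protein_1": ({"GO:0005006"}, {"GO:0005006", "GO:0004714"}),
    "protein_2": ({"GO:0005009"}, {"GO:0005003"}),
}
matched, review = triage(annotations)
```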
Coverage of Drosophila proteins classified by FlyBase and PANTHER.
Panels A-C: molecular function; panels D-F: biological process (FlyBase, PANTHER, both).

                                        Molecular function   Biological process
FlyBase classified to GO                       6301                 2794
FlyBase not classified to GO                   8031                11538
PANTHER HMM hits classified to GO              4862                 3658
PANTHER HMM hits not classified to GO          3265                 4469
PANTHER not hit                                6205                 6205
Classified to GO by both (overlap)             3283                 1159
Assessment of molecular function associations

                 PANTHER   FlyBase
Auto match         2737      2747
Manual match        700       663
Correct             345       195
Incorrect            50        58
Inconclusive         35        37
Total              3867      3700
Types of errors
•Homology error – an error caused by an incorrect
functional prediction based on sequence
homology.
•Human error – an error on the part of the human
curator.
•Evidence error – an error caused by relying on
incorrect evidence.
Analysis of errors

                                         PANTHER   FlyBase
Number of homology errors                    8        35
Number of human errors                      40        23
Number of evidence errors                    2         0
Total number of incorrect associations      50        58
Association error rate (%)                 1.3%      1.6%
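The association error rates follow from the incorrect counts divided by the total molecular function associations reported in the Summary (3867 for PANTHER, 3700 for FlyBase). A short check of that arithmetic, with `error_rate` as an illustrative helper name:

```python
def error_rate(incorrect: int, total: int) -> float:
    """Percentage of associations judged incorrect."""
    return 100.0 * incorrect / total


panther_rate = error_rate(50, 3867)   # 8 homology + 40 human + 2 evidence errors
flybase_rate = error_rate(58, 3700)   # 35 homology + 23 human + 0 evidence errors

print(f"PANTHER: {panther_rate:.1f}%  FlyBase: {flybase_rate:.1f}%")
# prints "PANTHER: 1.3%  FlyBase: 1.6%"
```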
Example of homology error
PANTHER function inference in the context of a protein sequence tree
FBgn0032382 (CG14934)
FlyBase: alpha glucosidase, neutral amino acid transporter
PANTHER: alpha glucosidase
[Figure: protein sequence tree with CG14934 marked; clades labeled alpha glucosidase, alpha amylase, and neutral a.a. transporter]
Summary
•PANTHER is an automated method to classify proteins in a robust way.
•The accuracy of PANTHER was assessed by comparing its classification of
Drosophila proteins with FlyBase’s.
•A total of 3283 Drosophila proteins were associated with at least one molecular
function category by both FlyBase and PANTHER (3867 molecular function
associations by PANTHER, and 3700 by FlyBase).
•About 90% of the associations made by FlyBase and by PANTHER match.
•Total error rate is < 2% for both methods.
Acknowledgements
Celera Genomics
Paul Thomas
Jody Vandergriff
Michael Campbell
Apurva Narechania
William Majoros
Karen Diemer
Olivier Doremieux
Nan Guo
Anish Kejariwal
Steven Ladunga
Betty Lazareva
Anushya Muruganujan
Steve Rabkin
FlyBase
Michael Ashburner
Susanna Lewis