TAGGO: a Tool for
Automatic Grouping
of Gene Ontology
Annotations
George Papachristoudis
Sophia Kossida
Meeting on Bioinformatics and Medical Informatics
Athens, 4 – 5 October 2006
Organization of the presentation
• Introduction
• Why was the development
of TAGGO a necessity?
• Description of the tool
• Results
• Comparison
• Conclusions
• Acknowledgements
Introduction
What is TAGGO?
Is it Tango?
No!
But they are both appealing… each in its own domain
TAGGO is a tool
which tries to derive
the protein’s
main functions
automatically
What are the main questions to fully characterize
a protein’s main activities?
nucleus
In what processes is it involved?
metabolism
protein
protein binding
Why was the development of
TAGGO a necessity?
three main reasons justify this…
1) Biologists currently waste a lot of time
and effort in searching manually for the
characteristics of each protein.
2) Besides saving time, the tool refines the
information provided by large amounts of data.
Each dataset contains 1000 – 3000 proteins,
while each protein is annotated to 5 – 12 GO terms.
3) The process is hampered further by the wide
variations in terminology (nomenclature) that each
group of biologists adopts.
The Gene Ontology (GO) project provides:
• controlled, well-structured vocabularies
• a widely accepted nomenclature
The Gene Ontology Dictionary:
http://www.geneontology.org/doc/GODict.DAT
Description of the tool – 1st Part
What is the Gene Ontology project?
The Gene Ontology (GO) project is a collaborative effort to address the
need for consistent descriptions of gene products in different databases.
The project began as a collaboration between three model organism
databases, FlyBase (Drosophila), the Saccharomyces Genome
Database (SGD) and the Mouse Genome Database (MGD), in 1998.
Since then it has grown to include several plant, animal and microbial
databases.
What is the Gene Ontology Consortium?
The GO Consortium is the set of model organism and protein databases
and biological research communities actively involved in the development
and application of the Gene Ontology project.
Source:
An Introduction to the Gene Ontology
The GO Consortium
GOA Databases
What is the Gene Ontology Annotation
(GOA) project?
The GOA project provides
high-quality Gene Ontology
(GO) annotations to proteins in
the UniProt Knowledgebase
(SWISS-PROT / TrEMBL /
PIR-PSD) and InterPro
databases.
Genomes of seven (7) species supplied:
Human, Mouse, Rat, Arabidopsis,
Zebrafish, Chicken, Cow
Source:
Gene Ontology Annotation (GOA) Database
[Workflow diagram]
Input: list of protein identifiers (e.g. O00154, ATP9A_HUMAN, IPI00304150, …)
User options: SELECT species (Human, Mouse, …, Zebrafish); SELECT undesired ECs
GOA DB → exclude entries with undesired ECs → GO annotations file
GO annotations file + Gene Ontology → Results (CC, MF, BP)
• GOA file format
The associations between gene products and
GO terms are stored in tab-delimited files.
Each record comprises 15 fields (not all mandatory).
Most important fields:
1. gene / gene product ID (e.g. Q5T0T3)
2. gene / gene product symbol (e.g. TRIO_HUMAN)
3. gene / gene product synonym (IPI) (e.g. IPI00396431)
4. GO term ID (e.g. GO:0007582)
5. GO term aspect (P, F or C)
6. Evidence Code (EC) (e.g. TAS, IEA)
The EC is an index of the record's reliability.
13 different EC types: 12 manual, 1 electronic (IEA)
Source: Annotation File Fields; Guide to GO Evidence Codes
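Reading such a file and discarding records with undesired evidence codes could look roughly like the sketch below. It assumes the usual 15-column GO annotation layout (ID in column 2, symbol in column 3, GO ID in column 5, evidence code in column 7, aspect in column 9, synonym in column 11). The names Annotation and read_annotations are hypothetical and are not taken from TAGGO itself.

```python
import csv
import io
from typing import Iterable, Iterator, NamedTuple, Set


class Annotation(NamedTuple):
    protein_id: str
    symbol: str
    go_id: str
    evidence: str
    aspect: str
    synonyms: str


def read_annotations(lines: Iterable[str], undesired_ecs: Set[str]) -> Iterator[Annotation]:
    """Yield annotations whose evidence code is not in undesired_ecs."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines, comment lines ("!...") and incomplete records.
        if not row or row[0].startswith("!") or len(row) < 15:
            continue
        # Assumed 0-based positions: 1=ID, 2=symbol, 4=GO ID, 6=EC, 8=aspect, 10=synonym.
        record = Annotation(row[1], row[2], row[4], row[6], row[8], row[10])
        if record.evidence not in undesired_ecs:
            yield record


# Two sample records for protein O00154; the "name" field is a placeholder.
SAMPLE = (
    "UniProt\tO00154\tBACH_HUMAN\t\tGO:0000062\tPMID:10578051\tTAS\t\tF\t"
    "name\tIPI00010415\tprotein\ttaxon:9606\t20030904\tPINC\n"
    "UniProt\tO00154\tBACH_HUMAN\t\tGO:0004759\tGOA:spkw\tIEA\t\tF\t"
    "name\tIPI00010415\tprotein\ttaxon:9606\t20060721\tUniProt\n"
)

# Exclude electronically inferred annotations (IEA), keep the manual one.
kept = list(read_annotations(io.StringIO(SAMPLE), undesired_ecs={"IEA"}))
```

Passing the set of undesired codes as a parameter mirrors the "SELECT undesired ECs" option of the tool; an empty set keeps every record.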
Example records from the GOA file:
DB | ID | Symbol | GO ID | DB:Reference | EC | Aspect | Name | Synonym | Type | Taxon | Date | Assigned by
UniProt | O00115 | DNS2A_HUMAN | GO:0005764 | PMID:9714827 | TAS | C | DNASE2, DNASE2A… | IPI00010348 | protein | taxon:9606 | 20030904 | PINC
UniProt | O00154 | BACH_HUMAN | GO:0000062 | PMID:10578051 | TAS | F | ACOT7, BACH: Cytos… | IPI00010415 | protein | taxon:9606 | 20030904 | PINC
UniProt | O00154 | BACH_HUMAN | GO:0004759 | GOA:spkw | IEA | F | ACOT7, BACH: Cytos… | IPI00010415 | protein | taxon:9606 | 20060721 | UniProt
UniProt | O00462 | MANBA_HUMAN | GO:0004567 | PMID:9384606 | TAS | F | MANBA, MANB1: Beta-manno.. | IPI00298793 | protein | taxon:9606 | 20030904 | PINC
UniProt | Q5T0T3 | Q5T0T3_HUMAN | GO:0005622 | GOA:Interpro | IEA | C | RP11-508N12.1, RP11-508N12… | IPI00748153 | protein | taxon:9606 | 20060721 | UniProt
…
Input proteins
------------------
BACH_HUMAN
O00115
IPI00748153
…
Matches found in the GOA file:
BACH_HUMAN → GO:0000062
BACH_HUMAN → GO:0004759
O00115 → GO:0005764
IPI00748153 → GO:0005622
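The matching of input proteins to GO terms can be sketched as a dictionary index. The example input list mixes accession numbers, symbols and IPI synonyms, so this sketch assumes any of the three may serve as a key. The helper names build_index and lookup are hypothetical, and real synonym fields may hold several values, which would have to be split first.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# (protein ID, symbol, synonym, GO term ID) tuples from the example records of this section.
RECORDS: List[Tuple[str, str, str, str]] = [
    ("O00115", "DNS2A_HUMAN", "IPI00010348", "GO:0005764"),
    ("O00154", "BACH_HUMAN", "IPI00010415", "GO:0000062"),
    ("O00154", "BACH_HUMAN", "IPI00010415", "GO:0004759"),
    ("Q5T0T3", "Q5T0T3_HUMAN", "IPI00748153", "GO:0005622"),
]


def build_index(records: Iterable[Tuple[str, str, str, str]]) -> Dict[str, Set[str]]:
    """Map every known identifier of a protein to the set of its GO term IDs."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for protein_id, symbol, synonym, go_id in records:
        for key in (protein_id, symbol, synonym):
            index[key].add(go_id)
    return index


def lookup(index: Dict[str, Set[str]], identifiers: Iterable[str]) -> Dict[str, List[str]]:
    """Return the sorted GO term IDs for each input identifier (empty list if unknown)."""
    return {name: sorted(index.get(name, ())) for name in identifiers}


matches = lookup(build_index(RECORDS), ["BACH_HUMAN", "O00115", "IPI00748153"])
```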
Description of the tool – 2nd Part
What is the Gene Ontology (GO)?
The Gene Ontology is in fact an
“umbrella” under which reside
three structured, controlled and
orthogonal vocabularies
(ontologies) that describe gene
products in terms of their
associated biological processes,
cellular components and
molecular functions in a species-independent
manner.
[Diagram: Gene Ontology → Biological Process Ontology, Molecular Function Ontology, Cellular Component Ontology]
Structure of each ontology
• Each ontology is totally independent of the others
• It is structured as a Directed Acyclic Graph (DAG)
• A term may have multiple parents
• Terms are connected by “IS_A” and “PART_OF” links
IS_A: each term inherits the attributes of its parent(s)
PART_OF: each term is a component of its parent(s)
• IS_A relations outnumber PART_OF relations
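A DAG with typed links can be held in a plain dictionary that maps each term to its (relation, parent) pairs. The sketch below uses a small fragment of the CC ontology; the specific links are illustrative assumptions and are not read from the ontology file. The function ancestors is a hypothetical helper that collects every term reachable through any path to the root.

```python
from typing import Dict, List, Set, Tuple

# term -> list of (relation, parent); illustrative fragment, links are assumed.
PARENTS: Dict[str, List[Tuple[str, str]]] = {
    "cellular_component": [],
    "cell": [("is_a", "cellular_component")],
    "cell part": [("part_of", "cell")],
    "intracellular": [("is_a", "cell part")],
    "intracellular part": [("is_a", "cell part"), ("part_of", "intracellular")],
    "organelle": [("is_a", "cellular_component")],
    "intracellular organelle": [("is_a", "organelle"), ("is_a", "intracellular part")],
}


def ancestors(term: str, parents: Dict[str, List[Tuple[str, str]]]) -> Set[str]:
    """Collect all ancestors of a term, following both IS_A and PART_OF links."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in parents.get(current, []):
            if parent not in seen:  # a term with multiple parents is visited once
                seen.add(parent)
                stack.append(parent)
    return seen


found = ancestors("intracellular organelle", PARENTS)
```

The `seen` set is what makes multiple parents safe: a shared ancestor reached along two paths is recorded only once.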
[Figure: example of the BP ontology]
[Figure: example of the CC ontology]
[Figure: example of the MF ontology]
We now have gene product – GO term pairs
O00115 – endonuclease activity
O00115 – DNA metabolism
O00151 – zinc ion binding
O00168 – integral to membrane
O00170 – transcription factor binding
O00217 – iron ion binding
O00264 – integral to plasma membrane
O00487 – ubiquitin-dependent protein catabolism
…and we want to group them into general categories:
catalytic activity, membrane, ion binding, protein binding, metabolism
How can we achieve that?
Here comes the Gene Ontology…
• The terms high in the hierarchy (shallow terms)
are sufficiently abstract (general)
• Thus, we can consider them as categories
that hold the basic features of their children
• Parse all the paths leading from
the term to the root of the ontology
• Collect all the parents (ancestors) of the term along these paths
• Sort the parents in order of descending generality
• Take the most general parent as the
term’s category
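The last two steps (sorting by generality and picking the category) can be sketched as follows, assuming a measure of generality is already available for every candidate parent. Here a lower score means a more general term. The scores in TOY_SCORE are made-up numbers for illustration only, and the function names are hypothetical.

```python
from typing import Dict, Iterable, List


def rank_by_generality(candidates: Iterable[str], score: Dict[str, float]) -> List[str]:
    """Sort terms from most general (lowest score) to most specific (highest score)."""
    return sorted(candidates, key=lambda term: score[term])


def top_categories(candidates: Iterable[str], score: Dict[str, float], k: int = 1) -> List[str]:
    """Return the k most general candidates; k=1 yields the term's category."""
    return rank_by_generality(candidates, score)[:k]


# Made-up scores for the parents of one CC term (illustration only).
TOY_SCORE = {
    "organelle": 1.1,
    "intracellular part": 0.8,
    "cell": 0.2,
    "intracellular": 0.6,
    "cell part": 0.3,
}

ranked = rank_by_generality(TOY_SCORE, TOY_SCORE)
category = top_categories(TOY_SCORE, TOY_SCORE, k=1)
```

Keeping `k` as a parameter allows the same helper to return more than one candidate category per term when a single, very abstract category is not informative enough.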
Defining generality…
The Information Content (IC) of a term is an
indication of its generality.
The less informative a term c is, the more general it is.
$IC(c) = -\log_b \mathrm{Prob}(c)$
In our case b = e (natural logarithm)
where:
$\mathrm{Prob}(c) = n_c / n_r$
$n_c$: the number of times the term (or any of its children) occurs
$n_r$: the number of times the root of the ontology occurs
In other words:
$n_c$: number of the term’s children, including the term itself
$n_r$: number of the root’s children, including the root itself
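The formula can be evaluated directly once the two counts are known. The sketch below follows the "in other words" reading, counting a term together with all terms below it, and assumes that "children" means all descendants. The toy graph and the helper names are hypothetical.

```python
import math
from typing import Dict, List


def count_with_descendants(term: str, children: Dict[str, List[str]]) -> int:
    """Number of distinct terms at or below `term` (the term itself included)."""
    seen = {term}
    stack = [term]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in seen:  # a child with several parents is counted once
                seen.add(child)
                stack.append(child)
    return len(seen)


def information_content(n_c: int, n_r: int) -> float:
    """IC(c) = -ln(n_c / n_r), using the natural logarithm (b = e)."""
    if not 0 < n_c <= n_r:
        raise ValueError("expected 0 < n_c <= n_r")
    return -math.log(n_c / n_r)


# Toy DAG: term "c" has two parents, "a" and "b".
CHILDREN = {"root": ["a", "b"], "a": ["c"], "b": ["c"], "c": []}

n_r = count_with_descendants("root", CHILDREN)
ic = {t: information_content(count_with_descendants(t, CHILDREN), n_r) for t in CHILDREN}
```

In this toy graph the root gets an IC of zero and the leaf gets the highest IC, which matches the rule that less informative terms are more general.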
Example (BP ontology): parents of “tissue regeneration”, sorted by descending generality
1. development
2. response to stimulus
3. morphogenesis
4. response to stress
5. tissue development
6. response to external stimulus
7. growth
8. developmental growth
9. wound healing
[Figure: graph of the terms above leading down to “tissue regeneration”, with an arrow labelled “descending IC”]
Example (CC ontology): parents of “intracellular organelle”, sorted by descending generality
1. cell
2. cell part
3. intracellular
4. intracellular part
5. organelle
A subtle change in the CC ontology…
[Figure: graph of the terms above leading down to “intracellular organelle”, with an arrow labelled “descending IC”]
Description of the tool – 3rd Part
Creating pie charts…
Results
• we used a human brain protein set as input: 1401 proteins
• each protein is annotated to approximately 8 terms:
11429 protein – GO term annotated pairs
• huge amount of collected data
• finding the ten most general
categories of each term
• determination of the categories of
each protein
• finding 15 CC categories,
18 BP categories,
20 MF categories
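Turning per-protein categories into the shares shown in a pie chart is a simple tally. The sketch below assumes each protein counts at most once per category. The function name category_shares and the three toy proteins are hypothetical and are not results of the tool.

```python
from collections import Counter
from typing import Dict, List


def category_shares(protein_categories: Dict[str, List[str]]) -> Dict[str, float]:
    """Percentage of category occurrences, counting each protein once per category."""
    counts = Counter(
        category
        for categories in protein_categories.values()
        for category in set(categories)  # ignore repeats within one protein
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: 100.0 * n / total for category, n in counts.items()}


# Toy input (illustration only).
TOY = {
    "P1": ["membrane", "nucleus", "membrane"],
    "P2": ["membrane"],
    "P3": ["cytoplasm"],
}

shares = category_shares(TOY)
```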
Comparison
CC Ontology
[Pie chart, manual grouping (CC). Categories: membrane, mitochondrion, cytoplasm, nucleus, endoplasmatic reticulum, extracellular, intracellular, cytoskeleton, cellular component unknown, no entry, lysosomal, other]
[Pie chart, automatic grouping (CC). Categories: membrane, mitochondrion, cytoplasm, nucleus, endoplasmic reticulum, extracellular region, intracellular, cytoskeleton, cellular component unknown, no entry, cell, protein complex, synapse part, extracellular matrix part, synapse, virion]
The two pie charts closely resemble each other.
Only “intracellular” shows a big difference.
Manual vs. Automatic Grouping: CC ontology
[Bar chart: occurrences per CC category, manual and automatic grouping side by side]
• Slight differences; the most notable one is in “intracellular”
• Manual: 1783 occurrences; Automatic: 1961 occurrences
MF Ontology
[Pie chart, manual grouping (MF). Categories: structural activity, transporter activity, enzyme activity, catalytic activity, antioxidant activity, ion binding, signaling, protein/ribonucleoprotein binding, nucleic acid binding, ATP binding, molecular function unknown, no entry, GTP-, ATPase activity, other]
[Pie chart, automatic grouping (MF). Categories: structural molecule activity, transporter activity, enzyme regulator activity, catalytic activity, antioxidant activity, ion binding, signal transducer activity, protein/ribonucleoprotein binding, nucleic acid binding, nucleotide binding, molecular function unknown, no entry, motor activity, transcription regulator activity, chaperone regulator activity, binding]
The two pie charts closely resemble each other.
Only “catalytic activity” shows a big difference.
Manual vs. Automatic Grouping: MF ontology
[Bar chart: occurrences per MF category, manual and automatic grouping side by side]
• Slight differences; the most notable one is in “catalytic activity”
• Manual: 2078 occurrences; Automatic: 2858 occurrences
BP Ontology
[Pie chart, manual grouping (BP). Categories: apoptosis, metabolism, cell adhesion, cell cycle, development, transport, signaling, biological process unknown, no entry, stress related, other]
[Pie chart, automatic grouping (BP). Categories: death, metabolism, cell adhesion, cellular physiological process, development, localization, cell communication, biological process unknown, no entry, regulation of physiological process, organismal physiological process, response to stimulus, regulation of cellular process, cell differentiation, homeostasis, regulation of biological process, reproduction]
The two pie charts closely resemble each other.
Manual vs. Automatic Grouping: BP ontology
[Bar chart: occurrences per BP category, manual and automatic grouping side by side]
• Slight differences; the most notable one is in “metabolism”
• Manual: 2167 occurrences; Automatic: 2353 occurrences
Conclusions
What are the advantages of this tool?
• Annotations supported by less reliable ECs can be discarded
• Extremely fast process
• All protein – GO term pairs are considered
• The ten most general categories of each term can be looked up conveniently
• The results are derived from one of the most reliable biological ontologies,
the Gene Ontology (well-structured ontologies based on
biological evidence, widely accepted nomenclature)
On the other hand…
• Sometimes, terms are assigned to very abstract categories (low information
content)
• Categories with tiny percentages often do not merge into a broader
one
Acknowledgements
My project supervisor,
Assoc. Professor Sophia Kossida,
for the constant feedback and support
as well as for our
constructive discussions
throughout our collaboration
Special thanks go to
Karin Soderman
for supplying the data and
contributing to the results’ comparison
and evaluation
I also want to sincerely thank
Fotis E. Psomopoulos
(PhD student of ISSEL)
for generously and promptly offering
his help whenever
I asked for his advice,
as well as the Director of the
Intelligent Systems and Software
Engineering Lab (ISSEL),
Professor Pericles A. Mitkas,
and my diploma thesis supervisor,
Sotiris T. Diplaris (PhD student of ISSEL),
for introducing me to such topics
Questions
3… 2… 1… time’s up…! ;-)
the end