Associating Biomedical Terms:
Case Study for Acetylation
Aaron Buechlein
Indiana University School of Informatics
Advisor: Dr. Predrag Radivojac
Overview
• Background
• Previous Work
• Methods
• Results
Central Dogma
http://www.accessexcellence.org/RC/VL/GG/images/central.gif
Post-Translational Modifications
(PTMs)
Acetylation
• Acetylation involves the substitution of an acetyl group (-COCH3) for hydrogen
• Typically occurs on N-terminal tails and lysine residues (Lys or K)
Previous Predictors
• Several PTM predictors have been created prior to this work
• Acetylation predictors also existed prior to this work
  • NetAcet is a predictor for N-terminal sites only
  • AutoMotif Server is a predictor for various PTMs and includes an acetylation portion
  • PAIL is a lysine acetylation predictor
Methods
• Create Dataset
  • Download articles relevant to acetylation and extract sites
  • Rank articles in order to elucidate sites quickly
  • SwissProt and Human Protein Reference Database (HPRD)
• Create Predictors
  • Leave-one-protein-out validation
  • Matlab
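Leave-one-protein-out validation can be illustrated with a short sketch. The original work used Matlab; the Python below is only an illustration, and the function name `leave_one_protein_out` is hypothetical. It assumes each lysine example carries the ID of the protein it came from, so that all lysines of one protein are held out together.

```python
from collections import defaultdict

def leave_one_protein_out(protein_ids):
    """Yield (held_out_protein, train_indices, test_indices), one fold per protein."""
    by_protein = defaultdict(list)
    for index, protein in enumerate(protein_ids):
        by_protein[protein].append(index)
    for held_out, test_indices in by_protein.items():
        # Train on every example that does not belong to the held-out protein.
        train_indices = [i for i, p in enumerate(protein_ids) if p != held_out]
        yield held_out, train_indices, test_indices

# One protein ID per lysine example (three proteins, ten lysines).
protein_ids = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
folds = list(leave_one_protein_out(protein_ids))
for held_out, train, test in folds:
    print(f"protein {held_out}: {len(train)} training, {len(test)} test examples")
```

Holding out whole proteins keeps lysines from the same sequence out of both the training and the test set at once, which would otherwise tend to inflate the estimated performance.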
Article Retrieval
• Searched individual journal sites for articles relevant to acetylation
• Saved resultant HTML pages for each journal
• These pages were then used as the input for a web crawler to download articles
• Due to varying journal site construction, each journal required a unique regular expression to extract links for articles
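The per-journal link extraction might look roughly like the following sketch. The journal keys, the patterns in `LINK_PATTERNS`, and the sample page are all hypothetical; the slides do not give the actual expressions used.

```python
import re

# Hypothetical patterns, one per journal site; the real ones are not in the slides.
LINK_PATTERNS = {
    "journal_a": re.compile(r'href="(/cgi/content/full/[^"]+)"'),
    "journal_b": re.compile(r'<a class="article" href="([^"]+\.html)"'),
}

def extract_article_links(journal, html):
    """Return article links from a saved search-result page, in order, without duplicates."""
    links = []
    for link in LINK_PATTERNS[journal].findall(html):
        if link not in links:
            links.append(link)
    return links

# Made-up search-result page for illustration only.
sample_page = (
    '<a href="/cgi/content/full/1/2/3">Paper one</a>'
    '<a href="/about">About</a>'
    '<a href="/cgi/content/full/4/5/6">Paper two</a>'
    '<a href="/cgi/content/full/1/2/3">Paper one again</a>'
)
links = extract_article_links("journal_a", sample_page)
print(links)
```

Keeping the expressions in one table keyed by journal means a new site only needs a new entry, not new crawler logic.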
Rank Articles
• First locate occurrences of first phrase: “phrase 1”
  • A = {a1, a2, …, a|A|}
• Next locate occurrences of second phrase: “phrase 2”
  • R = {r1, r2, …, r|R|}
• $f(x) = c \, e^{-d x}$
  • c and d are constants
  • x is the distance in characters between r and the nearest word a
An example: acetylation
1. word “acetylat”
   A = {a1, a2, …, am}
2. regular expression
   (k|lys|lysine)(space)*(digit)+
   R = {r1, r2, …, rn}
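The two patterns in this example can be sketched in Python as follows. The word-boundary anchor `\b`, the case-insensitive matching, and the sample sentence are assumptions added for illustration; the slide only gives the pattern in schematic form.

```python
import re

# Phrase 1: the word stem "acetylat" (matches acetylation, acetylated, ...).
KEYWORD = re.compile(r"acetylat", re.IGNORECASE)
# Phrase 2: k, lys or lysine, optional spaces, then one or more digits.
# The \b anchor is an assumption so that the "k" ending another word does not match.
SITE = re.compile(r"\b(?:lysine|lys|k)\s*\d+", re.IGNORECASE)

def find_positions(pattern, text):
    """Return the character offset of every match of pattern in text."""
    return [match.start() for match in pattern.finditer(text)]

# Made-up sentence for illustration only.
text = "Acetylation of Lys 14 and K9 was detected; lysine 56 is also acetylated."
A = find_positions(KEYWORD, text)
R = find_positions(SITE, text)
print(A, R)
```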
An example: acetylation
Score for article S:
$$S = \sum_{i=1}^{n} \mathrm{score}(r_i, A)$$
where:
$$\mathrm{score}(r_i, A) = f\left(\left|\mathrm{position}(r_i) - \mathrm{position}(a_k)\right|\right)$$
and
$$k = \arg\min_{j = 1 \ldots m} \left|\mathrm{position}(r_i) - \mathrm{position}(a_j)\right|$$
with
$$f(x) = 10 \, e^{-0.005x}$$
[Figure: f(x), from 0 to 10, plotted against distance in characters, from 0 to 1000]
Papers with S > 100 are rich in sites; S < 30 is the “twilight” zone
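A minimal sketch of this scoring in Python, assuming the positions of the keyword matches (A) and site matches (R) have already been found. The function names and the example positions are hypothetical, and returning 0 when an article has no keyword match is an assumption the slides do not cover.

```python
import math

def decay(x, c=10.0, d=0.005):
    """f(x) = c * exp(-d * x), with the constants shown on the slide as defaults."""
    return c * math.exp(-d * x)

def article_score(site_positions, keyword_positions):
    """Sum f(distance to the nearest keyword) over all site matches."""
    if not keyword_positions:
        return 0.0  # assumption: no keyword in the article means no score
    total = 0.0
    for r in site_positions:
        nearest = min(abs(r - a) for a in keyword_positions)
        total += decay(nearest)
    return total

# Hypothetical character offsets for one short text.
A = [0, 61]        # keyword matches
R = [15, 26, 43]   # site matches
S = article_score(R, A)
print(round(S, 2))
```

Each site match contributes at most 10 points, so a score above 100 requires more than ten site mentions lying close to the keyword.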
Elucidate Sites
• Sites were manually extracted from articles beginning with the highest rank
• The original experimental paper for these sites was verified for traceable evidence
• Sites were extracted from SwissProt
• Sites were extracted from HPRD
Predictors
• Support Vector Machine
• Artificial Neural Network
• Decision Tree
Predictor Input
• Positives taken as all lysines found to be acetylated
• Negatives taken as all lysines not found to be acetylated
• Features created based on characteristics surrounding lysines
  • Amino acid content, hydrophobicity, charge, disorder, etc.
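As a rough illustration of window-based features, the sketch below computes two simple quantities around each lysine. The window size, the residue groupings, the feature names, and the example sequence are all assumptions; the slides do not specify the actual feature definitions, and disorder is left out because it requires a separate predictor.

```python
# Assumed residue groupings, for illustration only.
HYDROPHOBIC = set("AVILMFWC")
POSITIVE = set("KR")
NEGATIVE = set("DE")

def lysine_windows(sequence, half_width=3):
    """Yield (1-based position, flanking residues) for every lysine in the sequence."""
    for i, residue in enumerate(sequence):
        if residue == "K":
            start = max(0, i - half_width)
            # The window excludes the central lysine itself.
            yield i + 1, sequence[start:i] + sequence[i + 1:i + 1 + half_width]

def window_features(window):
    """Return the hydrophobic fraction and net charge of the flanking residues."""
    n = len(window) or 1
    hydrophobic = sum(r in HYDROPHOBIC for r in window) / n
    charge = sum(r in POSITIVE for r in window) - sum(r in NEGATIVE for r in window)
    return {"hydrophobic_fraction": hydrophobic, "net_charge": charge}

sequence = "MAKDEVLKR"  # made-up sequence
features = {pos: window_features(win) for pos, win in lysine_windows(sequence)}
print(features)
```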
Predictor Input
Protein  Features                                           Acetylated
1         8  1  0.48609  0.001767  0.48979  0.51508         1
1         7  1  0.92146  0.03019   0.96423  0.79416         1
1         0  0  0.50622  0.015251  0.52335  0.51855         0
2        10  2  0.2008   0.038708  0.25441  0.36071         1
2         1  0  0.62016  0.009772  0.62846  0.67525         0
2         0  0  0.27783  0.028957  0.32162  0.34207         0
3        11  1  0.89239  0.018354  0.91884  0.88125         1
3        12  2  0.87354  0.022307  0.90349  0.87446         1
3         8  1  0.81549  0.025339  0.85289  0.85702         1
3         2  0  0.84588  0.024766  0.88219  0.86599         0
Article and Ranking Results
• 4888 articles from 10 sites were searched
  • Nature provided 2147 articles
  • Science Direct provided 1519 articles
• The highest ranking article was obtained from the Journal of Biological Chemistry
  • Score of 151.87
  • Contained 10 acetylation sites
• When histones are excluded, the highest ranking article was obtained from Nature
  • Previously ranked at #5
  • Score of 116.36
  • Contained 9 unique acetylation sites
Top 25
Rank  Score     Sites  Article Source
1)    151.8667  10     Journal of Biological Chemistry
2)    123.2314  12     Cell / Science Direct
3)    121.9031  6      Nature
4)    117.7988  9      Journal of Proteome Research
5)    116.3582  9      Nature
6)    111.1745  14     Biochemistry
7)    104.4652  6      Cell / Science Direct
8)    104.0166  7      Nature
9)    102.0683  13     Molecular Cell / Science Direct
10)   98.80812  6      Journal of Biological Chemistry
11)   97.64634  6      Biochemistry
12)   96.76536  6      Journal of Biological Chemistry
13)   96.0845   9      International Journal of Mass Spectrometry / Science Direct
14)   88.12967  9      Biochemistry
15)   86.17157  6      Journal of Biological Chemistry
16)   81.78705  5      Nucleic Acids Research
17)   81.30967  6      Biochemistry
18)   81.06128  6      Molecular Cell / Science Direct
19)   80.74899  9      Journal of Biological Chemistry
20)   80.16261  9      Nature
21)   79.65658  6      Molecular Cell / Science Direct
22)   77.9022   4      Cell / Science Direct
23)   77.88304  5      Nucleic Acids Research
24)   77.60087  8      Gene / Science Direct
25)   77.44198  6      Journal of the American Society for Mass Spectrometry
Ranking Results
• Articles with scores greater than 30 had potential for providing at least one site
• As scores approached 30, articles became less fruitful
Dataset Results
• Dataset included 1442 total sites and 1085 non-redundant sites
• HPRD contributed 90 total sites
• Swiss-Prot contributed 825
• Our Study contributed 527
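The difference between total and non-redundant counts can be illustrated with a small merge. This sketch assumes that "non-redundant" means unique (protein, position) pairs across sources, which the slides do not spell out; the function name and all records are made up.

```python
def merge_sites(sources):
    """Return (total site count, set of unique (protein, position) pairs)."""
    total = sum(len(sites) for sites in sources.values())
    unique = set()
    for sites in sources.values():
        unique.update(sites)
    return total, unique

# Hypothetical records: (protein identifier, lysine position).
sources = {
    "HPRD": [("P1", 9), ("P1", 14)],
    "Swiss-Prot": [("P1", 14), ("P2", 120)],
    "literature": [("P1", 9), ("P3", 5), ("P2", 120)],
}
total, unique = merge_sites(sources)
print(total, len(unique))
```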
Sensitivity, Specificity, and Precision
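• Sensitivity (sn) - $\frac{TP}{TP + FN}$
• Specificity (sp) - $\frac{TN}{TN + FP}$
• Precision (pr) - $\frac{TP}{TP + FP}$
• TP, TN, FP, FN: true positives, true negatives, false positives, false negatives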
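These three measures can be computed from binary labels and predictions using the usual confusion-matrix definitions, as in the sketch below. The function name and the predictions are hypothetical; returning 0 for an empty denominator is an assumption made to keep the example safe to run.

```python
def classification_metrics(labels, predictions):
    """Return sensitivity, specificity and precision for binary 0/1 labels."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)

    def ratio(numerator, denominator):
        # Assumption: report 0.0 when a denominator is empty.
        return numerator / denominator if denominator else 0.0

    return {
        "sn": ratio(tp, tp + fn),  # fraction of acetylated lysines recovered
        "sp": ratio(tn, tn + fp),  # fraction of non-acetylated lysines rejected
        "pr": ratio(tp, tp + fp),  # fraction of predicted sites that are real
    }

labels = [1, 1, 0, 1, 0, 0, 1, 1, 1, 0]        # 1 = acetylated
predictions = [1, 0, 0, 1, 1, 0, 1, 1, 0, 0]   # made-up predictor output
metrics = classification_metrics(labels, predictions)
print(metrics)
```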
Accuracy and AUC
• Accuracy (acc) - $\frac{sn + sp}{2}$
• Area Under Curve (AUC)
  • Refers to the area under the Receiver Operating Characteristic (ROC) curve
  • ROC is the graphical plot of sensitivity vs. 1-specificity
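One way to compute the AUC without drawing the curve is the rank-based formulation: the fraction of (positive, negative) pairs in which the positive example receives the higher score, counting ties as one half. The sketch below uses that equivalence; the function name and the scores are hypothetical and this is not necessarily how the original Matlab code computed it.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))

labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.8, 0.3, 0.4]  # made-up predictor outputs
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of acetylated from non-acetylated lysines.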
SVM Predictor
Polynomial kernel
Degree     sn    sp    pr    acc   AUC
p=1        52.3  71.0  24.6  61.6  65.2
p=2        46.1  69.8  20.3  57.9  62.8
p=3        31.6  80.8  23.5  56.2  60.3
Gaussian kernel
Width      sn    sp    pr    acc   AUC
σ = 10^-2  43.8  75.8  24.9  59.8  64.3
σ = 10^-3  54.1  72.1  25.9  63.1  68.1
σ = 10^-6  52.8  70.7  24.6  61.8  65.3
Artificial Neural Network
Hidden Neurons  sn    sp    pr    acc   AUC
1               68.0  47.7  20.7  57.8  61.9
3               65.2  47.7  19.4  56.4  58.9
5               65.0  47.2  19.1  56.1  57.5
Decision Tree
Algorithm      sn    sp    pr    acc   AUC
Decision Tree  61.7  45.9  18.3  53.8  42.1
Algorithm Comparison
Algorithm       sn    sp    pr    acc   AUC
SVM             54.1  72.1  25.9  63.1  68.1
Neural Network  68.0  47.7  20.7  57.8  61.9
Decision Tree   61.7  45.9  18.3  53.8  42.1
I would like to acknowledge those who have helped me throughout the duration of this project: Dr. Predrag Radivojac, Dr. Haixu Tang, and Wyatt Clark.
I welcome your questions and comments.