


Improved prediction of protein-protein binding sites using a support vector machines approach

James R. Bradford and David R. Westhead*

School of Biochemistry and Molecular Biology, University of Leeds, Leeds, LS2 9JT, UK.

Telephone: +44 113 2333116

Fax: +44 113 2333167


*To whom correspondence should be addressed.

Running title: Binding site prediction using SVMs.

A comparison with other patch analysis methods

We also evaluated our method on two smaller data sets previously used in patch analysis by Jones and
Thornton (1997) and Neuvirth et al. (2004). For these data sets, we used updated versions of obsolete
PDB files. Where a structure was not explicitly separated into two chains in its PDB file, we replaced
it with a homologous structure that was, or removed the protein completely if no suitable alternative
could be found. The modified data set of Jones and Thornton, which we refer to as Thornton_97,
contained 47 proteins, 22 of which were involved in homo-obligate interactions, 19 in enzyme-inhibitor
interactions and 6 in NEIT interactions. Likewise, the modified data set of Neuvirth et al.
(Schreiber_04) contained 87 proteins, 25 of which were involved in enzyme-inhibitor interactions, 8 in
hetero-obligate interactions and 54 in NEIT interactions.

Redundancy between data sets meant that we were unable to train an SVM on our own data set and then
use it to predict on the Thornton_97 and Schreiber_04 data sets. Therefore, we used the leave-one-out
method as before, with training and cross validation carried out within each data set separately.
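
The within-data-set protocol can be sketched as follows. This is a minimal illustration under stated
assumptions rather than our implementation: a nearest-centroid classifier stands in for the SVM so that
the example needs only the standard library, and the names `train_centroids`, `predict` and
`leave_one_out`, the protein identifiers and the two-dimensional feature vectors are all hypothetical.

```python
from statistics import mean

def train_centroids(examples):
    """Stand-in for SVM training (assumption): one mean feature vector per label."""
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {label: [mean(col) for col in zip(*rows)]
            for label, rows in by_label.items()}

def predict(model, features):
    """Assign the label whose centroid is nearest (squared Euclidean distance)."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: dist(model[label]))

def leave_one_out(dataset):
    """dataset maps a protein id to a list of (patch_features, label) pairs.

    Each protein is held out in turn, the model is trained on the patches of
    all remaining proteins in the same data set, and the held-out protein's
    patches are then classified.
    """
    results = {}
    for held_out in dataset:
        training = [example
                    for protein, examples in dataset.items() if protein != held_out
                    for example in examples]
        model = train_centroids(training)
        results[held_out] = [predict(model, features)
                             for features, _ in dataset[held_out]]
    return results

# Hypothetical toy data: label 1 = interface patch, label 0 = non-interface patch.
toy = {
    "protA": [([0.9, 0.8], 1), ([0.1, 0.2], 0)],
    "protB": [([0.8, 0.9], 1), ([0.2, 0.1], 0)],
    "protC": [([0.7, 0.7], 1), ([0.3, 0.2], 0)],
}
predictions = leave_one_out(toy)
```

Holding out whole proteins, rather than individual patches, keeps patches from the test protein out of
the training set for that round.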

Supplementary Table 2: Summary of mean results from five leave-one-out cross validations on the
Thornton_97 (Jones and Thornton 1997) and Schreiber_04 (Neuvirth et al. 2004) data sets.

We successfully predicted the location of the interface in 72% (34/47) of the Thornton_97 data set
using leave-one-out cross validation as before (Supplementary Table 2a). Predictions on
enzyme-inhibitor interactions were particularly encouraging, achieving a success rate of 79% (15/19)
compared to only 33% (2/6) for NEIT interactions, possibly because NEIT interactions made up the
minority of the training set.

With their patch analysis method and using their own criteria based on sensitivity values only, Jones
and Thornton achieved a success rate of 64% (30/47) on the 47 test cases. Therefore our method
appeared to perform at least as well as their method, if not better, even though Jones and Thornton used
two different scoring systems: one for homodimers and small protomers, the other for large protomers.
We applied one SVM to all interaction types because the number of examples in each interaction type
in this data set was too small for meaningful SVM training, and, more importantly, we wanted a
method that would be applicable to as many proteins as possible regardless of interaction type.

In 41 out of the 47 test cases, patch sizes used by Jones and Thornton were larger than ours. If we had
increased our patch size then our sensitivity values would have increased as well but at the risk of
decreasing specificity (not reported by Jones and Thornton). We also defined our interfaces more
stringently than Jones and Thornton, who included residues that lost more than 1 Å² of accessible
surface area upon complex formation, whereas we only used residues that lost more than 99% accessible
surface area. As a result, Jones and Thornton generated larger interfaces than us in most cases, thus
exaggerating sensitivity values in patches around the edge of the interface. Taken together, the larger
patch sizes and interfaces used by Jones and Thornton would have increased the probability of an
apparently successful outcome. This is reflected in the numbers of successes one would expect to
obtain by random sampling of patches. Based on data from their paper, we estimate that 17 of the 30
successes achieved by Jones and Thornton could have been due to random chance. This compares to
only 15 of the 34 successes that we attained (Supplementary Table 2a).
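
The comparison can be made explicit by subtracting the number of successes expected by chance from the
number observed. The sketch below uses only the counts quoted above; the function name
`excess_over_chance` is hypothetical, and the simple subtraction is an illustration rather than a
statistical test.

```python
def excess_over_chance(successes, expected_by_chance, n_cases):
    """Return (observed rate, chance rate, successes beyond chance) for one method."""
    return (successes / n_cases,
            expected_by_chance / n_cases,
            successes - expected_by_chance)

# Counts taken from the text: 47 test cases in Thornton_97.
jones_thornton = excess_over_chance(30, 17, 47)
this_work = excess_over_chance(34, 15, 47)

for name, (rate, chance, excess) in [("Jones and Thornton", jones_thornton),
                                     ("This work", this_work)]:
    print(f"{name}: observed {rate:.0%}, chance {chance:.0%}, excess {excess}/47")
```

On these figures, 13 of the successes of Jones and Thornton and 19 of ours exceed the random
expectation.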

Schreiber_04 was a more difficult data set because the majority of the test cases were of the NEIT
interaction type, on which our method had performed worse in both the Thornton_97 data set and our own.
Two points should be noted: we reduced the original bound data set of Neuvirth et al. from 92
proteins to 87 structures that could be processed by our method, and we trained the SVM and
cross-validated our method on all 87 bound proteins. Neuvirth et al., on the other hand, selected 53 of
these proteins that had a homologous unbound structure of at least 70% sequence identity within the
PDB, and trained and cross-validated on these unbound examples. One would expect a higher success rate
on a bound data set than on an unbound equivalent, so direct comparisons between the two methods were
difficult. Even so, our method is tolerant to conformational changes between bound and unbound forms
(data presented in main text), so the following comparisons were still useful. Our leave-one-out cross
validation successfully predicted the location of the interface on 51% (44/87) of the bound proteins
(Supplementary Table 2b). Neuvirth et al. achieved a success rate of 62% (33/53) on their unbound data
set, with the criterion for success being a predicted patch with over 50% specificity. However,
applying our own criteria of 50% specificity and 20% sensitivity reduced their success rate to 36%
(19/53). On the same set of proteins in their bound form, our success rate was 53% (28/53), of which
68% (19/28) were top ranked patches.
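
The effect of the two criteria can be illustrated with a short sketch. It assumes that specificity is
the fraction of patch residues that lie in the interface and that sensitivity is the fraction of
interface residues covered by the patch; these definitions are not restated in this section. The
function name `evaluate_patch`, the residue numbers and the treatment of values exactly at a threshold
(counted as passing) are likewise assumptions made for illustration.

```python
def evaluate_patch(patch, interface, min_specificity=0.5, min_sensitivity=0.2):
    """Score a predicted patch against the known interface residues.

    Assumed definitions: specificity = overlap / patch size,
    sensitivity = overlap / interface size.
    """
    patch, interface = set(patch), set(interface)
    overlap = len(patch & interface)
    specificity = overlap / len(patch) if patch else 0.0
    sensitivity = overlap / len(interface) if interface else 0.0
    success = specificity >= min_specificity and sensitivity >= min_sensitivity
    return specificity, sensitivity, success

# Hypothetical interface of 20 residues, numbered 1 to 20.
interface = range(1, 21)

# A patch of 10 residues, 6 of them in the interface.
balanced = evaluate_patch([1, 2, 3, 4, 5, 6, 101, 102, 103, 104], interface)

# A very small patch lying entirely inside the interface.
tiny = evaluate_patch([1, 2, 3], interface)

# The same small patch judged on specificity alone.
tiny_specificity_only = evaluate_patch([1, 2, 3], interface, min_sensitivity=0.0)
```

The small patch passes a specificity-only criterion but fails once a minimum sensitivity of 20% is
also required, which is the kind of case that separates the two success definitions.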

Neuvirth et al. put even greater emphasis on specificity than we did, to the detriment of their
sensitivity values. We agree with Neuvirth et al. that specificity is important, but a good prediction
should cover at least 20% of the interface. We believe our method provides the better balance between
specificity and sensitivity of the two methods. For example, the mean sensitivity of patches with over
50% specificity from our method was 51% compared to approximately 20% from the Neuvirth et al. method.

The negative influence of including so many proteins involved in NEIT interactions was illustrated by
the success rates of each interaction type separately. Binding sites involved in enzyme-inhibitor and
hetero-obligate interactions were predicted successfully in 56% (14/25) and 62% (5/8) of cases
respectively, compared to only 44% (24/54) of those involved in NEIT interactions. This suggested that
additional attributes are needed that would better distinguish the interface from the rest of the
surface of proteins involved in NEIT interactions.


Jones,S. and Thornton,J.M. (1997) Prediction of protein-protein interaction sites using patch analysis,
J Mol Biol, 272, 133-143.

Neuvirth,H., Raz,R. and Schreiber,G. (2004) ProMate: A structure based prediction program to identify
the location of protein-protein binding sites, J Mol Biol, 338, 181-199.
