Improved prediction of protein-protein binding sites using a support vector machines approach
James R. Bradford and David R. Westhead*
School Of Biochemistry and Molecular Biology, University of Leeds, Leeds, LS2 9JT, UK.
Telephone: +44 113 2333116
Fax: +44 113 2333167
Email: westhead@bmb.leeds.ac.uk
*To whom correspondence should be addressed.
Running title: Binding site prediction using SVMs.
Important properties
Our original choice of the seven properties used for training the SVM was based on past
studies that have implicated them in distinguishing binding sites from the rest of the protein
surface. Here we present our own posterior analysis of how each property contributed to
training the SVM. Although the SVM did not explicitly provide such details, training the SVM
on each property separately gave us a measure of their relative importance. As well as carrying
out this procedure on the whole training set, we used two subsets of the training set: one
containing the 66 proteins involved in transient interactions, the other containing the 114
proteins involved in obligomeric interactions. This allowed us to compare important properties
at transient binding sites with those at obligomeric binding sites.
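The single-property procedure described above can be sketched as a small loop. This is only an illustration under stated assumptions, not the authors' implementation: `score_properties` and `placeholder_scorer` are hypothetical names, and the scorer is a stand-in for "train the SVM on these properties and return the training MCC". The property names are those used in the text.

```python
from typing import Callable, Dict, List, Sequence

# The seven surface-patch properties named in the text.
PROPERTIES = [
    "interface residue propensity",
    "hydrophobicity",
    "ASA",
    "shape index",
    "curvedness",
    "conservation",
    "electrostatic potential",
]


def score_properties(
    properties: Sequence[str],
    train_and_score: Callable[[List[str]], float],
) -> Dict[str, float]:
    """Score each property on its own, then all properties together.

    `train_and_score` is a hypothetical callable that trains a classifier
    on the given property subset and returns one performance value.
    """
    scores = {name: train_and_score([name]) for name in properties}
    scores["all properties"] = train_and_score(list(properties))
    return scores


def placeholder_scorer(subset: List[str]) -> float:
    # Hypothetical stand-in so the sketch runs: it returns the fraction of
    # properties used, NOT a real MCC. Replace with actual SVM training.
    return len(subset) / len(PROPERTIES)


scores = score_properties(PROPERTIES, placeholder_scorer)
for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```

The same function could be called once per data set (whole training set, transient subset, obligomeric subset) by passing a scorer bound to that subset.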
We evaluated training performance using the Matthews Correlation Coefficient (MCC;
Matthews 1975):
\[
\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
\]
where TP = true positives (correctly classified interacting patches), TN = true negatives
(correctly classified non-interacting patches), FP = false positives (non-interacting patches
classified as interacting patches), FN = false negatives (interacting patches classified as non-
interacting patches). An MCC of +1 represents perfect training classification (no false
positives or negatives) whereas –1 represents a complete failure (all interacting patches
classified as non-interacting patches and vice versa). Because the non-interacting patch was
chosen at random, it was not possible to obtain identical results from any two training runs.
Therefore, we repeated training on each data set ten times, and calculated the mean and
standard deviation of each performance indicator.
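As a minimal sketch of this evaluation, the MCC can be computed from the four confusion counts and then summarised over repeated runs. Several points are assumptions rather than details from the text: the function names `mcc` and `summarise` are hypothetical, an undefined MCC (zero denominator) is returned as 0 by convention, the sample standard deviation is used, and the confusion counts in the demonstration are made up purely for illustration.

```python
import math
import statistics
from typing import Sequence, Tuple


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Assumption: treat the undefined case as 0 (no correlation).
        return 0.0
    return (tp * tn - fp * fn) / denominator


def summarise(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over repeated training runs."""
    return statistics.mean(values), statistics.stdev(values)


# Made-up (TP, TN, FP, FN) counts for three runs; illustrative only.
example_runs = [(150, 140, 40, 30), (145, 150, 30, 35), (155, 135, 45, 25)]
run_scores = [mcc(*counts) for counts in example_runs]
mean_mcc, sd_mcc = summarise(run_scores)
print(f"MCC = {mean_mcc:.2f} +/- {sd_mcc:.2f}")
```

Perfect classification gives +1 and a completely inverted classification gives -1, matching the interpretation of the coefficient given above.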
Generally, results obtained using the whole training set were as expected (Supplementary
Table 3a). Attributes based on interface residue propensity, hydrophobicity and ASA achieved
the highest MCC values. Shape index and conservation also performed well. Electrostatic
potential showed some discriminating power, although past studies have used it more
successfully to predict DNA-binding sites (Jones et al. 2003). In general, using
the SVM on all attributes gave better performance than any one attribute alone, indicating that
attributes give complementary information.
Supplementary Table 3: Importance of each attribute in training the SVM.
Results from training on obligomeric and transient interfaces separately go some way towards
explaining the heterogeneous cross-validation results. All the properties that are important at
an obligomeric interface (Supplementary Table 3b) seem to be important to a transient
interface (Supplementary Table 3c) as well. The major differences concern electrostatic
potential and curvedness. The higher MCC value achieved with curvedness on transient
interfaces probably reflects the number of enzyme-inhibitor interfaces in the subset. It is
common for a protrusion on the inhibitor surface to bind inside a cleft on the enzyme surface
and these protrusions and clefts will be highly curved. Electrostatic potential has almost no
distinguishing power on transient interfaces, achieving an MCC value of only 0.08 ± 0.01 in
contrast to obligomeric interfaces where it achieves an MCC value of 0.27 ± 0.05.
A higher MCC value (0.72 ± 0.03) was achieved with training on obligomeric interfaces using
all attributes than with transient interfaces (0.63 ± 0.04) suggesting that obligomeric interfaces
contain stronger signals that distinguish them from the rest of the protein surface than transient
interfaces.
References
Matthews,B.W. (1975) Comparison of the predicted and observed secondary structure of T4
phage lysozyme, Biochim Biophys Acta, 405, 442-451.
Jones,S., Shanahan,H.P., Berman,H.M. and Thornton,J.M. (2003) Using electrostatic potentials
to predict DNA-binding sites on DNA-binding proteins, Nucleic Acids Res, 31, 7189-7198.