Improved prediction of protein-protein binding sites using a support vector machines approach

James R. Bradford and David R. Westhead*

School of Biochemistry and Molecular Biology, University of Leeds, Leeds, LS2 9JT, UK.
Telephone: +44 113 2333116
Fax: +44 113 2333167
Email: firstname.lastname@example.org

*To whom correspondence should be addressed.

Running title: Binding site prediction using SVMs.

A comparison with other patch analysis methods

We also evaluated our method against smaller data sets previously used by Jones and Thornton (1997) and Neuvirth et al. (2004) in patch analysis. For these data sets, we used updated versions of obsolete PDB files. Where a structure was not explicitly separated into two chains in the PDB file, we replaced it with a homologous structure that was, or removed the protein completely if no suitable alternative could be found. The modified data set of Jones and Thornton, which we refer to as Thornton_97, contained 47 proteins, 22 of which were involved in homo-obligate interactions, 19 in enzyme-inhibitor interactions and 6 in NEIT interactions. Likewise, the modified data set of Neuvirth et al. (Schreiber_04) contained 87 proteins, 25 of which were involved in enzyme-inhibitor interactions, 8 in hetero-obligate interactions and 54 in NEIT interactions.

Inter-data set redundancy meant that we were unable to train an SVM on our data set and then use it to predict on the Thornton_97 and Schreiber_04 data sets. Therefore, we had to use the leave-one-out method as before, with training and cross validation occurring within each data set separately.

Supplementary Table 2: Summary of mean results from five leave-one-out cross validations on the Thornton_97 (Jones and Thornton 1997) and Schreiber_04 (Neuvirth et al. 2004) data sets.

Thornton_97

We successfully predicted the location of the interface in 72% (34/47) of the data set using leave-one-out cross validation as before (Supplementary Table 2a).
Predictions on enzyme-inhibitor interactions were particularly encouraging, achieving a success rate of 79% (15/19) compared to only 33% (2/6) for NEIT interaction types, possibly because NEIT interactions made up the minority of the training set. With their patch analysis method and using their own criteria based on sensitivity values only, Jones and Thornton achieved a success rate of 64% (30/47) on the 47 test cases. Therefore our method appeared to perform at least as well as theirs, if not better, even though Jones and Thornton used two different scoring systems: one for homodimers and small protomers, the other for large protomers. We applied one SVM to all interaction types because the number of examples in each interaction type in this data set was too small for meaningful SVM training, and, more importantly, we wanted a method that would be applicable to as many proteins as possible regardless of interaction type.

In 41 out of the 47 test cases, patch sizes used by Jones and Thornton were larger than ours. Had we increased our patch size, our sensitivity values would also have increased, but at the risk of decreasing specificity (not reported by Jones and Thornton). We also defined our interfaces more stringently than Jones and Thornton, who included residues that lost more than 1 Å² accessible surface area upon complex formation, whereas we only used residues that lost more than 99% accessible surface area. As a result, Jones and Thornton generated larger interfaces than ours in most cases, thus exaggerating sensitivity values in patches around the edge of the interface. Taken together, the larger patch sizes and interfaces used by Jones and Thornton would have increased the probability of an apparently successful outcome. This is reflected in the numbers of successes one would expect to obtain by random sampling of patches.
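The dependence of chance success on patch size can be illustrated with a minimal sketch. This is not the procedure used to obtain the random-sampling estimates reported in this supplement. It assumes that a random patch is any set of surface residues drawn uniformly without replacement (real patches are spatially contiguous), that sensitivity is the fraction of interface residues covered by the patch, and that specificity is the fraction of patch residues lying in the interface. The function name, the thresholds and the residue counts are hypothetical and chosen only for illustration.

```python
from math import comb


def chance_success_probability(n_surface, n_interface, patch_size,
                               min_sensitivity=0.0, min_specificity=0.0):
    """Probability that a uniformly random patch meets both thresholds.

    Assumption: the patch is a random subset of surface residues, so its
    overlap with the interface follows a hypergeometric distribution.
    Sensitivity = overlap / n_interface; specificity = overlap / patch_size.
    """
    if not 0 < n_interface <= n_surface:
        raise ValueError("need 0 < n_interface <= n_surface")
    if not 0 < patch_size <= n_surface:
        raise ValueError("need 0 < patch_size <= n_surface")

    total = comb(n_surface, patch_size)  # all possible random patches
    favourable = 0                       # patches meeting both thresholds
    for overlap in range(min(n_interface, patch_size) + 1):
        sensitivity = overlap / n_interface
        specificity = overlap / patch_size
        if sensitivity >= min_sensitivity and specificity >= min_specificity:
            # Choose `overlap` interface residues and fill the rest of the
            # patch from non-interface residues (comb returns 0 if impossible).
            favourable += (comb(n_interface, overlap)
                           * comb(n_surface - n_interface, patch_size - overlap))
    return favourable / total


if __name__ == "__main__":
    # Hypothetical protein: 100 surface residues, 20 of them in the interface.
    for size in (15, 30):
        p_sens = chance_success_probability(100, 20, size, min_sensitivity=0.5)
        p_both = chance_success_probability(100, 20, size,
                                            min_sensitivity=0.2,
                                            min_specificity=0.5)
        print(size, p_sens, p_both)
```

Under these assumptions, a success criterion based on sensitivity alone becomes easier to satisfy by chance as the patch grows, whereas adding a specificity threshold penalises patches that are large relative to the interface.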
Based on data from their paper, we estimate that 17 of the 30 successes achieved by Jones and Thornton could have been due to random chance. This compares to only 15 of the 34 successes that we attained (Supplementary Table 2a).

Schreiber_04

Schreiber_04 was a more difficult data set because the majority of the test cases were of the NEIT interaction type, and our method had performed worse on this type in both the Thornton_97 data set and our own. Two points should be noted. First, we reduced the original bound data set of Neuvirth et al. from 92 proteins to 87 structures that could be processed by our method. Second, we trained the SVM and cross-validated our method on all 87 bound proteins. Neuvirth et al., on the other hand, selected 53 of these proteins that had a homologous unbound structure of at least 70% sequence identity within the PDB, and trained and cross-validated on these unbound examples. One would expect a higher success rate on a bound data set than on an unbound equivalent, so direct comparisons between the two methods were difficult. Even so, our method is tolerant to conformational changes between bound and unbound forms (data presented in main text), so the following comparisons were useful.

Our leave-one-out cross validation successfully predicted the location of the interface on 51% (44/87) of the 87 bound proteins (Supplementary Table 2b). Neuvirth et al. achieved a success rate of 62% (33/53) on their unbound data set, with the criterion for success being a predicted patch with over 50% specificity. However, applying our own criteria of 50% specificity and 20% sensitivity reduced their success rate to 36% (19/53). On the same set of proteins in their bound form, our success rate was 53% (28/53), of which 68% (19/28) were top ranked patches. Neuvirth et al. put even greater emphasis on specificity than we did, to the detriment of their sensitivity values. We agree with Neuvirth et al.
that specificity is important, but a good prediction should cover at least 20% of the interface. We believe our method provides the better balance between specificity and sensitivity of the two methods. For example, the mean sensitivity of patches with over 50% specificity from our method was 51%, compared to approximately 20% from the Neuvirth et al. method.

The negative influence of including so many proteins involved in NEIT interactions was illustrated by the success rates of each interaction type separately. Of the binding sites involved in enzyme-inhibitor and hetero-obligate interactions, 56% (14/25) and 62% (5/8) respectively were predicted successfully, compared to only 44% (24/54) of those involved in NEIT interactions. This suggested that additional attributes are needed that would better distinguish the interface from the rest of the surface of proteins involved in NEIT interactions.

References

Jones,S. and Thornton,J.M. (1997) Prediction of protein-protein interaction sites using patch analysis, J Mol Biol, 272, 133-143.
Neuvirth,H., Raz,R. and Schreiber,G. (2004) ProMate: A structure based prediction program to identify the location of protein-protein binding sites, J Mol Biol, 338, 181-199.