A comparative study of family-specific protein–ligand complex affinity prediction based on random forest approach

Journal of Computer-Aided Molecular Design

Abstract

Assessing the binding affinity between ligands and their target proteins plays an essential role in drug discovery and design. As an alternative to widely used scoring approaches, machine learning methods have been proposed for fast prediction of binding affinity, with promising results. Most of them, however, were developed as all-purpose models that ignore the specific functions of different protein families, even though proteins from different functional families differ in structure and physicochemical features. In this study, we propose a random forest method to predict protein–ligand binding affinity based on a comprehensive feature set covering protein sequence, binding pocket, ligand structure and intermolecular interaction. Feature processing and compression were carried out separately for each protein family dataset. The results indicate that different features contribute to different models, so an individual representation for each protein family is necessary. Three family-specific models were constructed for three important protein target families: HIV-1 protease, trypsin and carbonic anhydrase. For comparison, two generic models covering diverse protein families were also built. The evaluation results show that models trained on family-specific datasets outperform those trained on the generic datasets. On the test sets, the Pearson correlation coefficients (Rp) are 0.740, 0.874 and 0.735, and the Spearman correlation coefficients (Rs) are 0.697, 0.853 and 0.723, for HIV-1 protease, trypsin and carbonic anhydrase respectively. Comparisons with other methods further demonstrate that individual representation and model construction for each protein family is a more reasonable way to predict the affinity of one particular protein family.
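The two evaluation metrics named in the abstract, the Pearson (Rp) and Spearman (Rs) correlation coefficients between measured and predicted affinities, can be illustrated with a minimal Python sketch. It is not the authors' code: the function names and the affinity values are hypothetical and chosen only for illustration, and it assumes Rs is computed as the Pearson correlation of average ranks, which is the usual definition.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to the end of the current run of tied values.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical measured vs. predicted affinities (illustrative values only).
measured = [4.2, 5.1, 6.3, 7.0, 8.4]
predicted = [4.8, 5.0, 6.9, 6.5, 8.0]

print(f"Rp = {pearson(measured, predicted):.3f}")
print(f"Rs = {spearman(measured, predicted):.3f}")
```

Rp measures how linearly the predictions track the measurements, while Rs only looks at whether the ranking of complexes is preserved, which is why the two are typically reported together when a scoring method is evaluated.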

Acknowledgments

This work was funded by the National Natural Science Foundation of China (No. 21175095, 21273154, 21375090).

Author information

Corresponding authors

Correspondence to Yanzhi Guo or Menglong Li.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (XLSX 62 kb)

Supplementary material 2 (DOCX 22 kb)

About this article

Cite this article

Wang, Y., Guo, Y., Kuang, Q. et al. A comparative study of family-specific protein–ligand complex affinity prediction based on random forest approach. J Comput Aided Mol Des 29, 349–360 (2015). https://doi.org/10.1007/s10822-014-9827-y

  • DOI: https://doi.org/10.1007/s10822-014-9827-y
