
Estimating the domain of applicability for machine learning QSAR models: a study on aqueous solubility of drug discovery molecules

Journal of Computer-Aided Molecular Design

Abstract

We investigate the use of different machine learning methods to construct models for aqueous solubility. The models are based on about 4000 compounds, including an in-house set of 632 drug discovery molecules from Bayer Schering Pharma. For each modelling approach, we also consider an appropriate way to obtain error bars, in order to estimate the domain of applicability (DOA) of each model. Specifically, we investigate error bars from a Bayesian model (Gaussian Process, GP), from an ensemble-based approach (Random Forest), and from approaches based on the Mahalanobis distance to the training data (for Support Vector Machine and Ridge Regression models). We evaluate all approaches in terms of their prediction accuracy (in cross-validation and on an external validation set of 536 molecules) and in terms of how faithfully the individual error bars represent the actual prediction error.
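As a purely illustrative sketch (not the pipeline, descriptors, or in-house toolbox used in the study), the snippet below shows one common way to obtain each of the three flavors of error bar named above: the predictive standard deviation of a Gaussian Process, the spread across the trees of a Random Forest, and the Mahalanobis distance of a query compound to the training data. All variable names and the toy data are assumptions for illustration only.

```python
# Illustrative sketch of three generic error-bar flavors; random toy data stand in
# for real molecular descriptors and measured log solubilities.
import numpy as np
from scipy.spatial.distance import mahalanobis
from sklearn.ensemble import RandomForestRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(300, 20)), rng.normal(size=(50, 20))
y_train = X_train[:, 0] + 0.1 * rng.normal(size=300)

# 1) Bayesian error bars: a Gaussian Process returns a predictive standard deviation.
gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(X_train, y_train)
gp_mean, gp_std = gp.predict(X_test, return_std=True)

# 2) Ensemble error bars: spread of the individual trees of a Random Forest.
rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_train, y_train)
per_tree = np.stack([tree.predict(X_test) for tree in rf.estimators_])
rf_mean, rf_std = per_tree.mean(axis=0), per_tree.std(axis=0)

# 3) Distance-based error bars (usable with SVM or Ridge Regression models):
#    Mahalanobis distance of each query compound to the training-data centroid;
#    a larger distance suggests the compound lies outside the model's DOA.
cov_inv = np.linalg.pinv(np.cov(X_train, rowvar=False))
centroid = X_train.mean(axis=0)
maha_dist = np.array([mahalanobis(x, centroid, cov_inv) for x in X_test])
```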



Notes

  1. For some compounds, experimental values for both solubility and log D were available. For these compounds, we used log D predictions generated with a cross-validation procedure, so that each prediction came from a log D model that had not been trained on the experimental log D value of the respective compound. This is necessary to avoid over-optimistic predictions (a schematic sketch of this scheme follows these notes).

  2. It has been suggested to use numeric criteria, such as the log probability of the predictive distribution, for this purpose. In our experience these criteria can be misleading; in particular, the log probability tends to favor over-optimistic models, so we did not use them here (the quantity is sketched below).
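Both notes describe procedures rather than results, so a short schematic sketch follows. It is illustrative only: the toy data, the choice of Random Forest as the log D model, and all variable names are assumptions, and scikit-learn/NumPy stand in for the in-house toolbox used in the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

# Hypothetical stand-ins: 200 compounds with 50 descriptors and measured log D.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y_logd = X[:, 0] + 0.1 * rng.normal(size=200)

# Note 1: each compound's log D "prediction" comes from folds that exclude that
# compound, so its own experimental log D never leaks into the descriptor fed to
# the downstream solubility model.
logd_feature = cross_val_predict(
    RandomForestRegressor(n_estimators=100, random_state=0), X, y_logd, cv=10
)

# Note 2: the numeric criterion mentioned there is the average log probability of
# held-out targets under Gaussian predictive distributions N(mu_i, sigma_i^2).
def mean_log_predictive_density(y_true, mu, sigma):
    var = np.asarray(sigma, dtype=float) ** 2
    resid = np.asarray(y_true, dtype=float) - np.asarray(mu, dtype=float)
    return float(np.mean(-0.5 * np.log(2.0 * np.pi * var) - resid**2 / (2.0 * var)))
```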


Acknowledgements

The authors gratefully acknowledge partial support from the PASCAL Network of Excellence (EU #506778) and DFG grant MU 987/4-1. We thank Vincent Schütz and Carsten Jahn for maintaining the PCADMET database, and Gilles Blanchard for implementing the random forest method as part of our machine learning toolbox.

Author information


Corresponding author

Correspondence to Timon Sebastian Schroeter.


About this article

Cite this article

Schroeter, T.S., Schwaighofer, A., Mika, S. et al. Estimating the domain of applicability for machine learning QSAR models: a study on aqueous solubility of drug discovery molecules. J Comput Aided Mol Des 21, 485–498 (2007). https://doi.org/10.1007/s10822-007-9125-z

