Rational selection of training and test sets for the development of validated QSAR models

Journal of Computer-Aided Molecular Design

Abstract

Quantitative Structure–Activity Relationship (QSAR) models are increasingly used to screen chemical databases and virtual chemical libraries for potentially bioactive molecules. This trend makes rigorous model validation essential to ensure that the models have acceptable predictive power. Using the k-nearest-neighbors (kNN) variable selection QSAR method to analyze several datasets, we recently demonstrated that the widely accepted leave-one-out (LOO) cross-validated R² (q²) is an inadequate measure of a model's predictive ability [Golbraikh, A., Tropsha, A., Beware of q²!, J. Mol. Graphics Mod. 20 (2002) 269-276]. Herein, we provide additional evidence that there is no correlation between the value of q² for the training set and the accuracy of prediction (R²) for the test set, and we argue that this observation is a general property of any QSAR model developed with LOO cross-validation. We suggest that external validation using rationally selected training and test sets provides a means to establish a reliable QSAR model. We propose several approaches to dividing experimental datasets into training and test sets and apply them in QSAR studies of 48 functionalized amino acid anticonvulsants and a series of 157 epipodophyllotoxin derivatives with antitumor activity. Finally, we formulate a set of general criteria for evaluating the predictive power of QSAR models.
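The two statistics contrasted in the abstract can be illustrated with a minimal sketch. It is not the authors' kNN variable selection method: it assumes a plain kNN regressor with Euclidean distances and no descriptor selection, takes the external R² to be the squared Pearson correlation between observed and predicted test-set activities, and uses synthetic descriptors and a simple sequential split. The function names `knn_predict`, `loo_q2`, and `external_r2` are hypothetical.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3):
    """Predict each query activity as the mean activity of its k nearest training compounds."""
    # Euclidean distance between every query compound and every training compound
    d = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

def loo_q2(X, y, k=3):
    """Leave-one-out cross-validated R^2 (q^2) computed on the training set only."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i  # hold out compound i
        preds[i] = knn_predict(X[mask], y[mask], X[i:i + 1], k)[0]
    press = np.sum((y - preds) ** 2)  # predictive residual sum of squares
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

def external_r2(y_obs, y_pred):
    """Squared Pearson correlation between observed and predicted activities (assumed definition)."""
    return np.corrcoef(y_obs, y_pred)[0, 1] ** 2

# Synthetic stand-in for a descriptor matrix and activities (not data from the paper)
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=60)

# Simple sequential split, used here only for illustration
X_train, y_train = X[:45], y[:45]
X_test, y_test = X[45:], y[45:]

q2 = loo_q2(X_train, y_train, k=3)
r2 = external_r2(y_test, knn_predict(X_train, y_train, X_test, k=3))
print(f"q2 (training, LOO) = {q2:.2f}; R2 (external test set) = {r2:.2f}")
```

Note that q² is computed without ever touching the test set, whereas R² depends only on compounds the model has not seen, which is why the abstract treats them as distinct measures.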

Cite this article

Golbraikh, A., Shen, M., Xiao, Z. et al. Rational selection of training and test sets for the development of validated QSAR models. J Comput Aided Mol Des 17, 241–253 (2003). https://doi.org/10.1023/A:1025386326946

  • DOI: https://doi.org/10.1023/A:1025386326946
