
Boosted leave-many-out cross-validation: the effect of training and test set diversity on PLS statistics

Journal of Computer-Aided Molecular Design

Abstract

It is becoming increasingly common in quantitative structure/activity relationship (QSAR) analyses to use external test sets to evaluate the likely stability and predictivity of the models obtained. In some cases, such as those involving variable selection, an internal test set (i.e., a cross-validation set) is also used. Care is sometimes taken to ensure that the subsets used exhibit response and/or property distributions similar to those of the data set as a whole, but more often the individual observations are simply assigned 'at random'. In the special case of multiple linear regression (MLR) without variable selection, it can be analytically demonstrated that this strategy is inferior to others. Most particularly, D-optimal design performs better if the form of the regression equation is known and the variables involved are well behaved. This report introduces an alternative, non-parametric approach termed 'boosted leave-many-out' (boosted LMO) cross-validation. In this method, relatively small training sets are chosen by applying optimizable k-dissimilarity selection (OptiSim) using a small subsample size (k = 4, in this case), with the unselected observations being reserved as a test set for the corresponding reduced model. Predictive errors for the full model are then estimated by aggregating results over several such analyses. The countervailing effects of training and test set size, diversity, and representativeness on PLS model statistics are described for CoMFA analysis of a large data set of COX2 inhibitors.
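The procedure outlined in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration rather than the published implementation: the function names `optisim_select`, `boosted_lmo` and `fit_line` are hypothetical, distances are plain Euclidean, the selection step omits refinements such as an exclusion radius, a one-descriptor least-squares line stands in for the PLS/CoMFA model, and errors are aggregated as a pooled root-mean-square error, which the abstract does not specify.

```python
import math
import random


def optisim_select(points, n_select, k=4, rng=None):
    """Simplified OptiSim-style selection of n_select diverse observations.

    Assumptions: Euclidean distance, no exclusion radius, and rejected
    candidates stay in the pool. Returns the selected indices.
    """
    if rng is None:
        rng = random.Random(0)
    pool = list(range(len(points)))
    rng.shuffle(pool)
    selected = [pool.pop()]  # start from a randomly chosen observation
    while len(selected) < n_select and pool:
        # Draw a subsample of up to k candidates from the remaining pool.
        candidates = rng.sample(pool, min(k, len(pool)))
        # Keep the candidate whose nearest selected neighbour is farthest away.
        best = max(
            candidates,
            key=lambda i: min(math.dist(points[i], points[j]) for j in selected),
        )
        selected.append(best)
        pool.remove(best)
    return selected


def fit_line(X_train, y_train):
    """Stand-in model (NOT PLS): least-squares line on the first descriptor."""
    xs = [x[0] for x in X_train]
    mx = sum(xs) / len(xs)
    my = sum(y_train) / len(y_train)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (t - my) for x, t in zip(xs, y_train))
    slope = sxy / sxx if sxx else 0.0
    return lambda x: my + slope * (x[0] - mx)


def boosted_lmo(X, y, fit, n_train, n_runs=10, k=4, seed=0):
    """Pooled RMSE over several diverse-training-set / held-out-test splits."""
    rng = random.Random(seed)
    squared_error, count = 0.0, 0
    for _ in range(n_runs):
        train = optisim_select(X, n_train, k=k, rng=rng)
        train_set = set(train)
        # Every unselected observation is reserved as the test set.
        test = [i for i in range(len(X)) if i not in train_set]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        for i in test:
            squared_error += (y[i] - predict(X[i])) ** 2
            count += 1
    return math.sqrt(squared_error / count)


# Toy data (made up): the response depends linearly on the first descriptor.
X = [(float(i), float((i * 7) % 5)) for i in range(30)]
y = [2.0 * x[0] + 1.0 for x in X]
rmse = boosted_lmo(X, y, fit_line, n_train=6, n_runs=5, k=4)
print(rmse)
```

In this sketch the subsample size `k` controls the trade-off the abstract alludes to: `k = 1` reduces to random selection, while larger values push the training set toward maximal diversity. A real analysis would replace `fit_line` with a PLS model built on the chosen descriptors.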




Cite this article

Clark, R.D. Boosted leave-many-out cross-validation: the effect of training and test set diversity on PLS statistics. J Comput Aided Mol Des 17, 265–275 (2003). https://doi.org/10.1023/A:1025366721142


  • DOI: https://doi.org/10.1023/A:1025366721142
