Boosted leave-many-out cross-validation: the effect of training and test set diversity on PLS statistics

Clark, Robert D.

doi:10.1023/A:1025366721142

Published: February 2003

Journal of Computer-Aided Molecular Design, Volume 17, pages 265–275 (2003)

Robert D. Clark¹


Abstract

It is becoming increasingly common in quantitative structure/activity relationship (QSAR) analyses to use external test sets to evaluate the likely stability and predictivity of the models obtained. In some cases, such as those involving variable selection, an internal test set – i.e., a cross-validation set – is also used. Care is sometimes taken to ensure that the subsets used exhibit response and/or property distributions similar to those of the data set as a whole, but more often the individual observations are simply assigned 'at random.' In the special case of MLR without variable selection, it can be analytically demonstrated that this strategy is inferior to others. Most particularly, D-optimal design performs better if the form of the regression equation is known and the variables involved are well behaved. This report introduces an alternative, non-parametric approach termed 'boosted leave-many-out' (boosted LMO) cross-validation. In this method, relatively small training sets are chosen by applying optimizable k-dissimilarity selection (OptiSim) using a small subsample size (k = 4, in this case), with the unselected observations being reserved as a test set for the corresponding reduced model. Predictive errors for the full model are then estimated by aggregating results over several such analyses. The countervailing effects of training and test set size, diversity, and representativeness on PLS model statistics are described for CoMFA analysis of a large data set of COX2 inhibitors.
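The procedure outlined in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation, and several parts are assumptions made for illustration: the function names (`optisim_select`, `fit_line`, `boosted_lmo`) are hypothetical, the selection step is a simplified OptiSim-style maximin pick from a random subsample of k candidates (with no exclusion radius or candidate recycling), Euclidean distance stands in for the dissimilarity measure, a one-descriptor least-squares line stands in for the PLS/CoMFA model, and pooling squared test-set errors into a single RMSE is only one possible way of "aggregating results over several such analyses".

```python
import math
import random


def optisim_select(points, n_select, k=4, rng=None):
    """Simplified OptiSim-style selection; returns indices of chosen points.

    Assumption: each step draws up to k random candidates and keeps the one
    whose nearest already-selected neighbour is farthest away (maximin).
    """
    rng = rng or random.Random(0)
    pool = list(range(len(points)))
    rng.shuffle(pool)
    selected = [pool.pop()]  # random starting observation
    while len(selected) < n_select and pool:
        candidates = rng.sample(pool, min(k, len(pool)))
        best = max(
            candidates,
            key=lambda i: min(math.dist(points[i], points[j]) for j in selected),
        )
        selected.append(best)
        pool.remove(best)  # unchosen candidates stay in the pool
    return selected


def fit_line(xs, ys):
    """Least-squares line y = a + b*x (a stand-in for the PLS model)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx else 0.0
    return my - b * mx, b


def boosted_lmo(points, ys, n_train, n_runs=5, k=4, seed=0):
    """Pooled test-set RMSE over several small, diverse training sets."""
    rng = random.Random(seed)
    sq_err, n_pred = 0.0, 0
    for _ in range(n_runs):
        train = optisim_select(points, n_train, k, rng)
        held_in = set(train)
        # Every unselected observation is reserved as the test set.
        test = [i for i in range(len(points)) if i not in held_in]
        # The stand-in model regresses on the first descriptor only.
        a, b = fit_line([points[i][0] for i in train], [ys[i] for i in train])
        for i in test:
            sq_err += (ys[i] - (a + b * points[i][0])) ** 2
            n_pred += 1
    return math.sqrt(sq_err / n_pred)


if __name__ == "__main__":
    data_rng = random.Random(42)
    pts = [(data_rng.uniform(0, 10), data_rng.uniform(0, 10)) for _ in range(60)]
    resp = [1.5 * p[0] + data_rng.gauss(0, 0.5) for p in pts]
    print("pooled test RMSE:", boosted_lmo(pts, resp, n_train=10))
```

In this sketch, a small k makes each pick close to random (more representative training sets), while a large k approaches pure maximum-dissimilarity selection (more diverse training sets); that is the trade-off between diversity and representativeness that the abstract refers to.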

Author information

Authors and Affiliations

Tripos, Inc., 1699 S. Hanley Road, St. Louis, MO, 63144, USA
Robert D. Clark

About this article

Cite this article

Clark, R.D. Boosted leave-many-out cross-validation: the effect of training and test set diversity on PLS statistics. J Comput Aided Mol Des 17, 265–275 (2003). https://doi.org/10.1023/A:1025366721142

Issue Date: February 2003
DOI: https://doi.org/10.1023/A:1025366721142

Keywords: cross-validation; dissimilarity selection; molecular diversity; OptiSim; PLS; projection onto latent structures; representativeness; boosted LMO
