Abstract
It is becoming increasingly common in quantitative structure/activity relationship (QSAR) analyses to use external test sets to evaluate the likely stability and predictivity of the models obtained. In some cases, such as those involving variable selection, an internal test set (i.e., a cross-validation set) is also used. Care is sometimes taken to ensure that the subsets used exhibit response and/or property distributions similar to those of the data set as a whole, but more often the individual observations are simply assigned "at random." In the special case of multiple linear regression (MLR) without variable selection, it can be analytically demonstrated that random assignment is inferior to other strategies. In particular, D-optimal design performs better if the form of the regression equation is known and the variables involved are well behaved. This report introduces an alternative, non-parametric approach termed "boosted leave-many-out" (boosted LMO) cross-validation. In this method, relatively small training sets are chosen by applying optimizable k-dissimilarity selection (OptiSim) using a small subsample size (k = 4, in this case), with the unselected observations being reserved as a test set for the corresponding reduced model. Predictive errors for the full model are then estimated by aggregating results over several such analyses. The countervailing effects of training and test set size, diversity, and representativeness on partial least squares (PLS) model statistics are described for CoMFA analysis of a large data set of COX2 inhibitors.
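The selection-and-aggregation loop described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the published implementation. `optisim_select` is a simplified rendering of OptiSim-style selection: a random seed observation is taken, subsamples of up to k candidates are drawn, the candidate farthest from the already-selected set is kept, and the rest are recycled. Euclidean distance over descriptor tuples, the exclusion `radius`, the pooled root-mean-square error used for aggregation, and the one-descriptor least-squares model standing in for PLS are all assumptions, and every function name here is hypothetical.

```python
import math
import random


def optisim_select(points, n_select, k=4, radius=0.0, rng=None):
    """Simplified OptiSim-style selection; returns indices of chosen points."""
    if rng is None:
        rng = random.Random(0)
    pool = list(range(len(points)))
    rng.shuffle(pool)
    selected = [pool.pop()]  # random seed observation
    recycle = []

    def min_dist(i):
        # Distance from candidate i to its nearest already-selected point.
        return min(math.dist(points[i], points[j]) for j in selected)

    while len(selected) < n_select and (pool or recycle):
        if not pool:  # pool exhausted: reuse the recycled candidates
            pool, recycle = recycle, []
            rng.shuffle(pool)
        subsample = []
        while pool and len(subsample) < k:
            candidate = pool.pop()
            # Candidates within the exclusion radius are discarded for good.
            if min_dist(candidate) > radius:
                subsample.append(candidate)
        if not subsample:
            continue
        best = max(subsample, key=min_dist)  # most dissimilar candidate
        selected.append(best)
        recycle.extend(c for c in subsample if c != best)
    return selected


def fit_first_descriptor(X, y):
    """Least-squares line on the first descriptor only (stand-in for PLS)."""
    xs = [row[0] for row in X]
    n = len(xs)
    mx, my = sum(xs) / n, sum(y) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (v - my) for x, v in zip(xs, y))
    slope = sxy / sxx if sxx else 0.0
    intercept = my - slope * mx
    return lambda row: intercept + slope * row[0]


def boosted_lmo(points, y, n_train, n_runs=10, k=4,
                fit=fit_first_descriptor, seed=0):
    """Pooled RMS prediction error over several reduced models."""
    rng = random.Random(seed)
    sq_errors = []
    for _ in range(n_runs):
        train = optisim_select(points, n_train, k=k, rng=rng)
        chosen = set(train)
        # Every unselected observation is held out as the test set.
        test = [i for i in range(len(points)) if i not in chosen]
        predict = fit([points[i] for i in train], [y[i] for i in train])
        sq_errors.extend((predict(points[i]) - y[i]) ** 2 for i in test)
    return math.sqrt(sum(sq_errors) / len(sq_errors))


if __name__ == "__main__":
    # Synthetic, noise-free data used only to exercise the sketch.
    pts = [(float(i), float(i % 3)) for i in range(30)]
    resp = [2.0 * p[0] + 1.0 for p in pts]
    print(boosted_lmo(pts, resp, n_train=6))
```

Passing a different `fit` callable (for example, a wrapper around a PLS implementation) would replace the stand-in model without changing the selection or aggregation steps.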
Cite this article
Clark, R.D. Boosted leave-many-out cross-validation: the effect of training and test set diversity on PLS statistics. J Comput Aided Mol Des 17, 265–275 (2003). https://doi.org/10.1023/A:1025366721142
DOI: https://doi.org/10.1023/A:1025366721142