Abstract
Building a robust and reliable QSAR/QSPR model requires careful attention to two tasks: selecting the optimal variable subset from a large pool of molecular descriptors and detecting outliers among the samples. The two problems are similar and, to some extent, complementary. For a given learning algorithm and data set, one should therefore consider how variable selection and outlier detection interact. In this paper, we describe a consistent methodology for performing variable subset selection and outlier detection simultaneously, based on statistical distributions that can be simulated by building many cross-predictive linear models. The approach exploits the fact that the distribution of linear model coefficients provides a mechanism for ranking and interpreting the effects of variables, while the distribution of prediction errors provides a mechanism for differentiating outliers from normal samples. Statistics of these distributions, namely the mean and standard deviation, provide a feasible way to describe the information contained in the original samples. Several examples demonstrate the predictive ability of the proposed approach through comparison with different approaches and their combinations.
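The idea summarized above can be illustrated with a minimal sketch. The abstract does not specify the resampling scheme, the form of the linear model, or any decision thresholds, so the code below assumes repeated random train/test splits and ordinary least squares; the function name `cross_predictive_distributions`, its parameters, and the synthetic data are hypothetical and are not taken from the paper.

```python
import numpy as np


def cross_predictive_distributions(X, y, n_models=300, train_fraction=0.7, seed=0):
    """Collect coefficient and prediction-error distributions over random splits.

    Assumption: plain least squares on random train/test splits stands in for
    the "cross-predictive linear models" mentioned in the abstract.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    n_train = int(round(train_fraction * n_samples))
    coefs = np.empty((n_models, n_features))
    errors = [[] for _ in range(n_samples)]  # test errors collected per sample

    for m in range(n_models):
        perm = rng.permutation(n_samples)
        train, test = perm[:n_train], perm[n_train:]
        # Least-squares fit with an explicit intercept column.
        design = np.column_stack([np.ones(n_train), X[train]])
        beta, *_ = np.linalg.lstsq(design, y[train], rcond=None)
        coefs[m] = beta[1:]
        # Prediction errors only for samples left out of this model.
        pred = beta[0] + X[test] @ beta[1:]
        for i, err in zip(test, y[test] - pred):
            errors[i].append(err)

    coef_mean = coefs.mean(axis=0)
    coef_std = coefs.std(axis=0, ddof=1)
    err_mean = np.array([np.mean(e) if e else np.nan for e in errors])
    err_std = np.array([np.std(e, ddof=1) if len(e) > 1 else np.nan for e in errors])
    return coef_mean, coef_std, err_mean, err_std


# Synthetic illustration: two informative descriptors, four noise descriptors,
# and one sample (index 0) whose response is shifted to act as an outlier.
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(80, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * data_rng.normal(size=80)
y[0] += 10.0

coef_mean, coef_std, err_mean, err_std = cross_predictive_distributions(X, y)

# Rank variables by |mean| / std of their coefficient distribution.
stability = np.abs(coef_mean) / coef_std
top_variables = set(np.argsort(stability)[::-1][:2].tolist())

# Flag the sample whose mean prediction error is largest in magnitude.
suspect_sample = int(np.nanargmax(np.abs(err_mean)))
```

In this sketch a variable whose coefficient has a large mean relative to its standard deviation is treated as important, and a sample with a large mean prediction error is treated as a candidate outlier. Both criteria are one plausible reading of "mean value and standard deviation" in the abstract and would need to be checked against the full text.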
Acknowledgments
We would like to thank the reviewers for their useful discussions, comments and suggestions throughout this work. This work was financially supported by the National Nature Foundation Committee of P.R. China (Grants No. 20875104 and No. 10771217) and the International Cooperation Project on Traditional Chinese Medicines of the Ministry of Science and Technology of China (Grant No. 2007DFA40680). The studies meet with the approval of the university’s review board.
Cite this article
Cao, D., Liang, Y., Xu, Q. et al. Toward better QSAR/QSPR modeling: simultaneous outlier detection and variable selection using distribution of model features. J Comput Aided Mol Des 25, 67–80 (2011). https://doi.org/10.1007/s10822-010-9401-1
DOI: https://doi.org/10.1007/s10822-010-9401-1