Toward better QSAR/QSPR modeling: simultaneous outlier detection and variable selection using distribution of model features

Journal of Computer-Aided Molecular Design

Abstract

Building a robust and reliable QSAR/QSPR model requires attention to two tasks: selecting the optimal variable subset from a large pool of molecular descriptors, and detecting outliers among the samples. The two problems are similar and, to some extent, complementary. Given a particular learning algorithm and a particular data set, one should consider how variable selection and outlier detection interact. In this paper, we describe a consistent methodology for performing variable subset selection and outlier detection simultaneously, based on statistical distributions that can be simulated by building many cross-predictive linear models. The approach exploits the fact that the distribution of linear model coefficients provides a mechanism for ranking and interpreting the effects of variables, while the distribution of prediction errors provides a mechanism for differentiating outliers from normal samples. Statistics of these distributions, namely the mean and the standard deviation, provide a feasible way to describe the information contained in the original samples. Several examples demonstrate the predictive ability of the proposed approach through comparison with different approaches and their combinations.
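The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the linear learner, the resampling scheme, or any thresholds, so the code below assumes ordinary least squares, random train/test splits, and a hypothetical function name and parameters (`cross_predictive_distributions`, `n_models`, `train_fraction`). It only shows how repeated cross-predictive models yield a distribution of coefficients per variable and a distribution of prediction errors per sample, each summarized by its mean and standard deviation.

```python
import numpy as np


def cross_predictive_distributions(X, y, n_models=500, train_fraction=0.7, seed=0):
    """Summarize coefficient and prediction-error distributions.

    Assumptions (not from the paper): ordinary least squares as the linear
    model and uniformly random train/test splits as the resampling scheme.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_vars = X.shape
    n_train = int(round(train_fraction * n_samples))

    coefs = np.empty((n_models, n_vars))       # one coefficient vector per model
    errors = [[] for _ in range(n_samples)]    # held-out errors per sample

    for m in range(n_models):
        perm = rng.permutation(n_samples)
        train, test = perm[:n_train], perm[n_train:]

        # Fit intercept + slopes on the training part only.
        design = np.column_stack([np.ones(n_train), X[train]])
        beta, *_ = np.linalg.lstsq(design, y[train], rcond=None)
        coefs[m] = beta[1:]

        # Record prediction errors for samples the model has not seen.
        predictions = beta[0] + X[test] @ beta[1:]
        for index, error in zip(test, y[test] - predictions):
            errors[index].append(error)

    coef_mean = coefs.mean(axis=0)
    coef_std = coefs.std(axis=0, ddof=1)
    # A sample held out fewer than twice has no usable spread estimate.
    err_mean = np.array([np.mean(e) if e else np.nan for e in errors])
    err_std = np.array([np.std(e, ddof=1) if len(e) > 1 else np.nan for e in errors])
    return coef_mean, coef_std, err_mean, err_std
```

Under these assumptions, a variable whose coefficient has a large absolute mean relative to its standard deviation is a candidate for retention, and a sample whose held-out errors have a large absolute mean is a candidate outlier. How the paper actually combines the two statistics into decision rules is not stated in the abstract.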

Acknowledgments

We would like to thank the reviewers for their useful discussions, comments and suggestions throughout this work. This work was financially supported by the National Nature Foundation Committee of P.R. China (Grants No. 20875104 and No. 10771217) and by the international cooperation project on traditional Chinese medicines of the Ministry of Science and Technology of China (Grant No. 2007DFA40680). The studies meet with the approval of the university’s review board.

Author information

Corresponding authors

Correspondence to Yizeng Liang or Qingsong Xu.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (DOC 969 kb)

About this article

Cite this article

Cao, D., Liang, Y., Xu, Q. et al. Toward better QSAR/QSPR modeling: simultaneous outlier detection and variable selection using distribution of model features. J Comput Aided Mol Des 25, 67–80 (2011). https://doi.org/10.1007/s10822-010-9401-1

  • DOI: https://doi.org/10.1007/s10822-010-9401-1
