Influence of feature rankers in the construction of molecular activity prediction models

Journal of Computer-Aided Molecular Design

Abstract

In the construction of activity prediction models, feature ranking methods provide a useful mechanism for ordering features by their significance to the predictive model. This paper studies the influence of feature rankers in the construction of molecular activity prediction models; for this purpose, a comparative study of fourteen ranking methods for feature selection was conducted. The activity prediction models were constructed using four well-known classifiers and a wide collection of datasets. The ranking algorithms were compared in terms of the performance of these classifiers under different metrics and the consistency of the ranked features.
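
To make the two ideas in the abstract concrete (ranking features by relevance, then checking how consistent two rankings are), the following is a minimal sketch in plain Python. It rests on several assumptions: the abstract does not name the fourteen rankers or the consistency measure, so the absolute-correlation score, the top-k Jaccard overlap, the function names `rank_features` and `top_k_overlap`, and the toy data are all hypothetical stand-ins rather than the authors' method.

```python
from statistics import mean, pstdev


def rank_features(X, y):
    """Return feature indices ordered from most to least relevant.

    Relevance is |Pearson correlation| between a descriptor column and the
    binary activity label. This is an illustrative filter ranker only,
    not one of the rankers evaluated in the paper.
    """
    n_features = len(X[0])
    y_mean, y_sd = mean(y), pstdev(y)
    scores = []
    for j in range(n_features):
        col = [row[j] for row in X]
        c_mean, c_sd = mean(col), pstdev(col)
        if c_sd == 0 or y_sd == 0:
            scores.append(0.0)  # a constant column carries no information
            continue
        cov = mean([(c - c_mean) * (t - y_mean) for c, t in zip(col, y)])
        scores.append(abs(cov / (c_sd * y_sd)))
    # Highest score first; ties are broken by feature index for determinism.
    return sorted(range(n_features), key=lambda j: (-scores[j], j))


def top_k_overlap(rank_a, rank_b, k):
    """Jaccard overlap of the top-k features of two rankings (1.0 = same set)."""
    a, b = set(rank_a[:k]), set(rank_b[:k])
    return len(a & b) / len(a | b)


# Toy data (hypothetical): 6 molecules x 3 descriptors.
# Descriptor 0 tracks the label closely; descriptor 2 is constant.
X = [[0.9, 0.2, 1.0], [0.8, 0.7, 1.0], [0.7, 0.1, 1.0],
     [0.2, 0.6, 1.0], [0.1, 0.3, 1.0], [0.3, 0.8, 1.0]]
y = [1, 1, 1, 0, 0, 0]

ranking = rank_features(X, y)
overlap = top_k_overlap(ranking, [0, 2, 1], k=2)
```

In a comparison of the kind the abstract describes, a classifier would then be trained on the top-ranked descriptors from each ranker and scored with the chosen metrics, and an overlap measure such as this one would be computed between rankings obtained on different data partitions.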

Acknowledgements

This work was supported in part by Project TIN2015-66108-P of the Spanish Ministry of Science and Innovation.

Author information

Corresponding author

Correspondence to Gonzalo Cerruela-García.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary file1 (PDF 28686 kb)

About this article

Cite this article

Cerruela-García, G., Pérez-Parra Toledano, J., de Haro-García, A. et al. Influence of feature rankers in the construction of molecular activity prediction models. J Comput Aided Mol Des 34, 305–325 (2020). https://doi.org/10.1007/s10822-019-00273-1

  • DOI: https://doi.org/10.1007/s10822-019-00273-1
