A methodology for evaluating multi-objective evolutionary feature selection for classification in the context of virtual screening

  • Methodologies and Application
Soft Computing

Abstract

Virtual screening (VS) methods have been shown to increase success rates in many drug discovery campaigns when they complement experimental approaches such as high-throughput screening or classical medicinal chemistry. Nevertheless, the predictive capability of VS is not yet optimal, mainly because of limitations in the underlying physical principles used to describe drug binding. Machine learning can improve VS methods: when enough experimental data are available for training, predictive capability can increase considerably. In this work we show how a multi-objective evolutionary search strategy for feature selection, which yields small and accurate decision trees that chemists can easily understand, can drastically increase the applicability and predictive ability of these techniques and therefore aid considerably in drug discovery. With the proposed methodology, we find classification models with accuracy between 0.9934 and 1.00 and area under the ROC curve between 0.96 and 1.00 when evaluated on the full training sets, and accuracy between 0.9849 and 0.9940 and area under the ROC curve between 0.89 and 0.93 when evaluated with tenfold cross-validation over 30 iterations, while substantially reducing model size.
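The trade-off behind a multi-objective search for feature selection can be illustrated with a small Pareto-dominance filter. The sketch below is not the authors' algorithm: it assumes two objectives (maximise accuracy, minimise the number of selected features as a stand-in for model size), and the candidate names and scores are invented purely for illustration.

```python
from typing import List, Tuple

# A candidate feature subset: (label, accuracy, number of selected features).
# All values used below are hypothetical, not results from the article.
Candidate = Tuple[str, float, int]


def dominates(a: Candidate, b: Candidate) -> bool:
    """Return True if a is no worse than b on both objectives and strictly better on one."""
    _, acc_a, size_a = a
    _, acc_b, size_b = b
    no_worse = acc_a >= acc_b and size_a <= size_b
    strictly_better = acc_a > acc_b or size_a < size_b
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


candidates = [
    ("A", 0.990, 12),
    ("B", 0.985, 5),
    ("C", 0.980, 9),   # dominated by B: lower accuracy and more features
    ("D", 0.994, 20),
    ("E", 0.985, 7),   # dominated by B: same accuracy, more features
]

front = pareto_front(candidates)
print([label for label, _, _ in front])
```

An evolutionary search would repeatedly generate and score new subsets and retain such a non-dominated set, leaving the final choice between a smaller or a more accurate model to the practitioner.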



Acknowledgements

This study was supported by computing facilities of Extremadura Research Centre for Advanced Technologies (CETA-CIEMAT), funded by the European Regional Development Fund (ERDF). CETA-CIEMAT belongs to CIEMAT and the Government of Spain. This work was partially funded by the Fundación Séneca del Centro de Coordinación de la Investigación de la Región de Murcia under Project 18946/JLI/13.

Author information

Corresponding authors

Correspondence to Fernando Jiménez or Horacio Pérez-Sánchez.

Ethics declarations

Conflict of interest

Author Fernando Jiménez Barrionuevo declares that he has no conflict of interest. Author Horacio Pérez Sánchez declares that he has no conflict of interest. Author José Palma Méndez declares that he has no conflict of interest. Author Gracia Sánchez Carpena declares that she has no conflict of interest. Author Carlos Martínez Cortés declares that he has no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Communicated by V. Loia.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Jiménez, F., Pérez-Sánchez, H., Palma, J. et al. A methodology for evaluating multi-objective evolutionary feature selection for classification in the context of virtual screening. Soft Comput 23, 8775–8800 (2019). https://doi.org/10.1007/s00500-018-3479-0


  • DOI: https://doi.org/10.1007/s00500-018-3479-0
