Abstract
We present SLUG, a method that uses genetic algorithms as a wrapper for genetic programming (GP) to perform feature selection while inducing models. The method is first tested on four regular binary classification datasets, and then on 10 synthetic datasets produced by GAMETES, a tool for embedding epistatic gene-gene interactions into noisy datasets. We compare the results of SLUG with those obtained by other GP-based methods that had already been applied to the GAMETES problems, and conclude that the proposed approach is very successful, particularly on the epistatic datasets. We discuss the merits and weaknesses of SLUG and of its parts, i.e. the wrapper and the learner, and we perform additional experiments comparing SLUG with other state-of-the-art learners, such as decision trees, random forests and extreme gradient boosting. Although SLUG is not the most efficient method in terms of training time, it is confirmed as the most effective method in terms of accuracy.
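The wrapper idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the learner here is a simple lookup-table classifier standing in for GP, and the toy XOR data, the population size, tournament selection, one-point crossover, bit-flip mutation and all function names are assumptions made for illustration only. Each GA individual is a bit mask over the features, and its fitness is the accuracy the learner reaches when restricted to the selected features.

```python
import random

def make_xor_data(n_rows, n_features, rng):
    """Toy data (assumption): label is x0 XOR x1, all other columns are noise."""
    rows, labels = [], []
    for i in range(n_rows):
        noise = [rng.randint(0, 1) for _ in range(n_features - 2)]
        row = [i & 1, (i >> 1) & 1] + noise
        rows.append(row)
        labels.append(row[0] ^ row[1])
    return rows, labels

def table_learner_accuracy(mask, train, valid):
    """Stand-in learner (NOT GP): majority label per pattern of selected features."""
    selected = [i for i, bit in enumerate(mask) if bit]
    if not selected:
        return 0.0  # an empty feature subset cannot be used
    (x_train, y_train), (x_valid, y_valid) = train, valid
    votes = {}
    for row, label in zip(x_train, y_train):
        key = tuple(row[i] for i in selected)
        votes.setdefault(key, [0, 0])[label] += 1
    default = int(sum(y_train) * 2 >= len(y_train))  # fallback for unseen patterns
    correct = 0
    for row, label in zip(x_valid, y_valid):
        counts = votes.get(tuple(row[i] for i in selected))
        pred = default if counts is None else int(counts[1] >= counts[0])
        correct += pred == label
    return correct / len(y_valid)

def ga_feature_selection(fitness, n_features, rng,
                         pop_size=20, generations=15, p_mut=0.1):
    """GA over bit masks; returns (best_mask, best_fitness) of the last generation."""
    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    scored = [(fitness(m), m) for m in pop]
    for _ in range(generations):
        elite = max(scored, key=lambda s: s[0])[1]
        children = [elite[:]]  # elitism: keep the best mask unchanged
        while len(children) < pop_size:
            # Tournament selection of size 3 for each parent.
            p1 = max(rng.sample(scored, 3), key=lambda s: s[0])[1]
            p2 = max(rng.sample(scored, 3), key=lambda s: s[0])[1]
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [bit ^ 1 if rng.random() < p_mut else bit for bit in child]
            children.append(child)
        scored = [(fitness(m), m) for m in children]
    best_fit, best_mask = max(scored, key=lambda s: s[0])
    return best_mask, best_fit

rng = random.Random(0)
X, y = make_xor_data(200, 10, rng)
train, valid = (X[:100], y[:100]), (X[100:], y[100:])
fitness = lambda mask: table_learner_accuracy(mask, train, valid)
best_mask, best_fit = ga_feature_selection(fitness, 10, rng)
print("selected features:", [i for i, b in enumerate(best_mask) if b])
print("validation accuracy:", best_fit)
```

In this toy setting a mask that keeps only the two interacting features scores perfectly, while masks that add noise features fragment the lookup table and tend to generalize worse, which is the kind of pressure a wrapper relies on. Replacing `table_learner_accuracy` with a GP run on the selected features would bring the sketch closer to the scheme the abstract describes, at a much higher cost per fitness evaluation.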
Notes
- 1.
- 2. We performed 30 runs of STGP using the same total number of comparisons as SLUG (10000 individuals and 1500 generations). The median test accuracy achieved was 0.4982, and the best was 0.5348.
Acknowledgment
This work was supported by FCT, Portugal, through funding of LASIGE Research Unit (UIDB/00408/2020 and UIDP/00408/2020); MAR2020 program via project MarCODE (MAR-01.03.01-FEAMP-0047); projects BINDER (PTDC/CCI-INF/29168/2017), AICE (DSAIPA/DS/0113/2019), OPTOX (PTDC/CTA-AMB/30056/2017) and GADgET (DSAIPA/DS/0022/2018). Nuno Rodrigues and João Batista were supported by PhD Grants 2021/05322/BD and SFRH/BD/143972/2019, respectively; William La Cava was supported by the National Library of Medicine of the National Institutes of Health under Award Number R00LM012926.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Rodrigues, N.M., Batista, J.E., La Cava, W., Vanneschi, L., Silva, S. (2022). SLUG: Feature Selection Using Genetic Algorithms and Genetic Programming. In: Medvet, E., Pappa, G., Xue, B. (eds) Genetic Programming. EuroGP 2022. Lecture Notes in Computer Science, vol 13223. Springer, Cham. https://doi.org/10.1007/978-3-031-02056-8_5
DOI: https://doi.org/10.1007/978-3-031-02056-8_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-02055-1
Online ISBN: 978-3-031-02056-8
eBook Packages: Computer Science, Computer Science (R0)