Abstract
The high-dimensional nature of medical data frequently results in suboptimal performance of machine learning models, making feature selection before classification necessary to improve classifier performance. Although evolutionary wrapper feature selection methods are acknowledged for their strong ability to explore optimal feature subsets, they carry a risk of overfitting and may lose effective search capability in the later stages of evolution. To address these issues, we propose a generalized wrapper feature selection method called Two Populations Based Feature Selection (TPBFS), which evolves two populations in opposite directions to improve convergence speed. It introduces a probability-based crossover operation to mitigate overfitting and a record list that systematically tracks and replaces optimal individuals, helping the search escape local optima in the later stages of evolution. Experimental results demonstrate that TPBFS effectively reduces the dimensionality of various medical datasets while preserving classifier performance.
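To make the abstract's three ingredients concrete, the following is a minimal, hypothetical Python sketch of how a dual-population wrapper selector of this kind might be organized. The dense-versus-sparse initialization (standing in for the two "opposite directions"), the probability-based crossover, the stagnation rule, and all names such as prob_crossover and record_list are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: a dual-population wrapper feature selector.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def fitness(mask, X, y):
    """Wrapper evaluation: cross-validated accuracy of the selected subset."""
    if mask.sum() == 0:
        return 0.0
    clf = KNeighborsClassifier(n_neighbors=5)
    return cross_val_score(clf, X[:, mask], y, cv=3).mean()

def prob_crossover(a, b, p=0.5):
    """Probability-based crossover: each bit is drawn from parent a with probability p."""
    pick = rng.random(a.size) < p
    return np.where(pick, a, b)

def evolve(X, y, n_gen=15, pop_size=10):
    d = X.shape[1]
    # Two populations biased in opposite directions:
    # P1 starts dense (most features selected), P2 starts sparse.
    P1 = rng.random((pop_size, d)) < 0.8
    P2 = rng.random((pop_size, d)) < 0.2
    record_list = []                      # best individual recorded each generation
    best, best_fit = None, -1.0
    for _ in range(n_gen):
        for P in (P1, P2):
            fits = np.array([fitness(m, X, y) for m in P])
            order = np.argsort(fits)[::-1]
            P[:] = P[order]               # sort population by fitness, best first
            # mate the leaders of the two populations
            child = prob_crossover(P1[0], P2[0], p=0.5)
            flip = rng.random(d) < 1.0 / d          # light bit-flip mutation
            child = np.logical_xor(child, flip)
            P[-1] = child                 # child replaces the worst individual
            if fits[order[0]] > best_fit:
                best, best_fit = P[0].copy(), fits[order[0]]
        record_list.append((best_fit, best.copy()))
        # if the leader stagnates, reinject a previously recorded optimum
        if len(record_list) > 3 and record_list[-1][0] <= record_list[-4][0]:
            P1[0] = record_list[int(rng.integers(len(record_list)))][1]
    return best, best_fit

if __name__ == "__main__":
    X, y = load_breast_cancer(return_X_y=True)
    mask, acc = evolve(X, y)
    print(f"selected {mask.sum()}/{X.shape[1]} features, CV accuracy {acc:.3f}")
```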
Data availability
The datasets utilized in this study are sourced from the UCI Machine Learning Repository and are publicly accessible at https://archive.ics.uci.edu/.
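For readers who want to pull a UCI dataset programmatically, the snippet below is a minimal sketch assuming the community ucimlrepo helper package; the dataset id used here (45, Heart Disease) is only an example and is not tied to the paper's experimental setup.

```python
# Illustrative only: fetching a UCI dataset with the `ucimlrepo` helper package.
from ucimlrepo import fetch_ucirepo

heart_disease = fetch_ucirepo(id=45)   # fetch a dataset by its repository id
X = heart_disease.data.features        # pandas DataFrame of predictor columns
y = heart_disease.data.targets         # pandas DataFrame of the label column
print(X.shape, y.shape)
```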
Acknowledgements
This work is supported by the National Natural Science Foundation of China (NSFC) under grant 82374559, the Sichuan Science and Technology Program under grants 2023YFSY0027 and 2023YFS0325, the Natural Science Foundation of Sichuan under grants 2022NSFSC0958 and 2024NSFSC0717, the Fundamental Research Funds for the Central Universities under grants ZYGX2021YGLH012 and ZYGX2021J020, the Ningbo Major Research and Development Plan Project under grant 20241ZDYF020354, and the Committee of Cadre Health of Sichuan Province under grant 2023-220.
Author information
Contributions
All authors read and approved the final manuscript. Haodi Quan: Conceptualization, Methodology, Software, Writing - original draft. Yun Zhang: Conceptualization, Methodology, Writing - review & editing. Qiaoqin Li: Validation, Supervision. Yongguo Liu: Validation, Supervision.
Ethics declarations
Conflict of Interest
The authors have no conflicts of interest to declare.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Quan, H., Zhang, Y., Li, Q. et al. TPBFS: two populations based feature selection method for medical data. Cluster Comput 27, 11553–11568 (2024). https://doi.org/10.1007/s10586-024-04557-6