Abstract
Advances in artificial intelligence (AI) and machine learning make it possible to diagnose and detect diseases early by learning from existing healthcare data on similar diseases. However, analyzing healthcare data is challenging because such data are often imbalanced: the class distribution is skewed, with far more majority-class instances than minority-class instances. As a result, a classifier tends to be biased toward the majority class when it classifies an unknown instance, which is unacceptable for disease diagnosis in particular. It is therefore important to balance the dataset while taking both the majority and the minority class samples into account. To this end, a Genetic Algorithm based on Autoencoder (GAAE) model is proposed for processing imbalanced data. First, an autoencoder is trained using genetic operators on both the majority and the minority samples. The chromosome is designed to represent an autoencoder efficiently, and the fitness function is an error function built with the help of an ensemble of classifiers. The optimized autoencoder then generates synthetic minority-class data to balance the dataset. Once the data have been balanced using the proposed GAAE, features are selected based on the correlation coefficient. Several classifiers, namely multi-layer perceptron (MLP), k-nearest neighbor (k-NN), C4.5 decision tree (DT) and random forest (RF), are then employed for classification. The proposed approach is evaluated through extensive simulation and compared with existing approaches. The results show that considering both the majority and the minority samples when generating synthetic data helps the classifiers perform better. A statistical analysis has also been performed.
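The balancing step described above can be illustrated with a minimal sketch. It is not the authors' implementation, and it makes several assumptions that the abstract does not specify: a flat real-valued chromosome holding the weights of a one-hidden-layer autoencoder (tanh encoder, linear decoder), truncation selection, one-point crossover, Gaussian mutation, elitism, and a two-class problem. Most importantly, the fitness used here is plain reconstruction error, swapped in for the ensemble-of-classifiers error function that the paper uses. All function and parameter names are hypothetical.

```python
import numpy as np

def decode_chromosome(chrom, n_features, n_hidden):
    """Split a flat chromosome into encoder and decoder weight matrices."""
    split = n_features * n_hidden
    w_enc = chrom[:split].reshape(n_features, n_hidden)
    w_dec = chrom[split:].reshape(n_hidden, n_features)
    return w_enc, w_dec

def reconstruct(chrom, X, n_hidden):
    """Pass X through the autoencoder encoded by the chromosome."""
    w_enc, w_dec = decode_chromosome(chrom, X.shape[1], n_hidden)
    return np.tanh(X @ w_enc) @ w_dec

def fitness(chrom, X, n_hidden):
    """Assumed stand-in fitness: negative mean squared reconstruction error.
    The paper instead builds its error function from an ensemble of classifiers."""
    return -np.mean((reconstruct(chrom, X, n_hidden) - X) ** 2)

def evolve_autoencoder(X, n_hidden=2, pop_size=30, generations=60, seed=0):
    """Evolve autoencoder weights with a simple genetic algorithm."""
    rng = np.random.default_rng(seed)
    n_genes = 2 * X.shape[1] * n_hidden
    pop = rng.normal(0.0, 0.5, size=(pop_size, n_genes))
    for _ in range(generations):
        scores = np.array([fitness(c, X, n_hidden) for c in pop])
        order = np.argsort(scores)[::-1]              # best chromosome first
        parents = pop[order[: pop_size // 2]]         # truncation selection
        children = []
        while len(children) < pop_size - 1:
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n_genes)            # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            mask = rng.random(n_genes) < 0.1          # mutate about 10% of genes
            children.append(child + mask * rng.normal(0.0, 0.1, n_genes))
        pop = np.vstack([pop[order[0]]] + children)   # elitism: keep the best
    scores = np.array([fitness(c, X, n_hidden) for c in pop])
    return pop[np.argmax(scores)]

def balance_with_autoencoder(X, y, minority_label, n_hidden=2, seed=0):
    """Add synthetic minority samples until both classes have equal size."""
    rng = np.random.default_rng(seed)
    # The autoencoder is evolved on majority and minority samples together.
    best = evolve_autoencoder(X, n_hidden=n_hidden, seed=seed)
    X_min = X[y == minority_label]
    n_needed = int(np.sum(y != minority_label)) - len(X_min)
    if n_needed <= 0:
        return X, y
    # Perturb randomly chosen minority samples, then reconstruct them.
    picked = X_min[rng.integers(len(X_min), size=n_needed)]
    noisy = picked + rng.normal(0.0, 0.05, picked.shape)
    synthetic = reconstruct(best, noisy, n_hidden)
    X_bal = np.vstack([X, synthetic])
    y_bal = np.concatenate([y, np.full(n_needed, minority_label)])
    return X_bal, y_bal
```

Calling `balance_with_autoencoder(X, y, minority_label=1)` on a feature matrix `X` and a binary label vector `y` returns the original samples followed by the synthetic minority samples. In a fuller pipeline, correlation-based feature selection and the MLP, k-NN, C4.5 DT and RF classifiers named in the abstract would then be applied to the balanced data.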
Data Availability
The data used in this study are freely accessible and are described in Sect. 7.1.
Funding
The research reported in this article was not funded by any organization or agency.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Cite this article
Ram, P.K., Kuila, P. GAAE: a novel genetic algorithm based on autoencoder with ensemble classifiers for imbalanced healthcare data. J Supercomput 79, 541–572 (2023). https://doi.org/10.1007/s11227-022-04679-x
DOI: https://doi.org/10.1007/s11227-022-04679-x