GAAE: a novel genetic algorithm based on autoencoder with ensemble classifiers for imbalanced healthcare data

The Journal of Supercomputing

Abstract

With advances in artificial intelligence (AI) and machine learning, diseases can be diagnosed and detected early with the help of known healthcare data for similar diseases. However, analyzing healthcare data is challenging because such data are often imbalanced. An imbalanced dataset has a skewed class distribution: it contains majority and minority instances, and the majority instances far outnumber the minority instances. As a result, a classifier tends to become biased toward the majority class when classifying an unknown instance. Such bias is unacceptable for any classifier, especially in disease diagnosis. Therefore, it is important to balance the dataset by considering both the majority and minority class samples. In this context, a Genetic Algorithm based on Autoencoder (GAAE) model is proposed to process imbalanced data. Initially, an autoencoder is trained using genetic operators on both the majority and minority samples. The chromosome is designed to represent an autoencoder efficiently. The error function serves as the fitness function and is designed with the help of an ensemble of classifiers. The optimized autoencoder generates synthetic minority-class data to balance the dataset. Once the data are balanced using the proposed GAAE, feature selection is performed based on the correlation coefficient. Then various classifiers, namely multi-layer perceptron (MLP), k-nearest neighbor (k-NN), C4.5 decision tree (DT), and random forest (RF), are employed for classification. Extensive simulations are performed on the proposed approach, and the results are compared with existing approaches. It is observed that considering both the majority and minority samples when generating the synthetic data helps the classifiers perform better. In addition, a statistical analysis has been performed.
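
The general idea of evolving an autoencoder with genetic operators and then using it to synthesize minority samples can be illustrated with a minimal sketch. It is not the authors' implementation: the chromosome layout (flattened encoder and decoder weights), the single hidden layer, the operator settings, and the toy data are all assumptions made for illustration. In particular, the fitness below is plain reconstruction error, whereas the abstract states that the GAAE fitness is designed with an ensemble of classifiers. The function names `evolve_autoencoder` and `generate_minority` are hypothetical.

```python
import numpy as np

def decode_chromosome(chrom, n_features, n_hidden):
    """Split a flat chromosome into encoder and decoder weight matrices."""
    split = n_features * n_hidden
    w_enc = chrom[:split].reshape(n_features, n_hidden)
    w_dec = chrom[split:].reshape(n_hidden, n_features)
    return w_enc, w_dec

def reconstruct(chrom, X, n_hidden):
    """Pass X through the one-hidden-layer autoencoder encoded by chrom."""
    w_enc, w_dec = decode_chromosome(chrom, X.shape[1], n_hidden)
    return np.tanh(X @ w_enc) @ w_dec

def fitness(chrom, X, n_hidden):
    """Stand-in fitness (assumption): mean squared reconstruction error."""
    return float(np.mean((reconstruct(chrom, X, n_hidden) - X) ** 2))

def evolve_autoencoder(X, n_hidden=3, pop_size=30, generations=40, seed=0):
    """Evolve autoencoder weights with selection, crossover, and mutation."""
    rng = np.random.default_rng(seed)
    n_genes = 2 * X.shape[1] * n_hidden
    pop = rng.normal(0.0, 0.5, size=(pop_size, n_genes))
    history = []  # best error seen in each generation
    for _ in range(generations):
        errors = np.array([fitness(c, X, n_hidden) for c in pop])
        order = np.argsort(errors)
        pop = pop[order]
        history.append(float(errors[order[0]]))
        parents = pop[: pop_size // 2]        # truncation selection
        children = [pop[0].copy()]            # elitism: keep the best as is
        while len(children) < pop_size:
            a, b = parents[rng.integers(len(parents), size=2)]
            mask = rng.random(n_genes) < 0.5  # uniform crossover
            child = np.where(mask, a, b)
            # Gaussian mutation applied to roughly 10% of the genes
            mutate = rng.random(n_genes) < 0.1
            child = child + rng.normal(0.0, 0.05, n_genes) * mutate
            children.append(child)
        pop = np.array(children)
    errors = np.array([fitness(c, X, n_hidden) for c in pop])
    history.append(float(errors.min()))
    return pop[np.argmin(errors)], history

def generate_minority(best, X_min, n_new, n_hidden, noise=0.05, seed=1):
    """Create synthetic minority rows by reconstructing jittered minority rows."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(len(X_min), size=n_new)
    jittered = X_min[idx] + rng.normal(0.0, noise, size=(n_new, X_min.shape[1]))
    return reconstruct(best, jittered, n_hidden)

# Toy imbalanced data (assumption): 90 majority rows, 10 minority rows.
data_rng = np.random.default_rng(42)
X_maj = data_rng.normal(0.0, 1.0, size=(90, 6))
X_min = data_rng.normal(1.5, 1.0, size=(10, 6))

# Train on both classes, then oversample the minority class to match.
best, history = evolve_autoencoder(np.vstack([X_maj, X_min]))
X_syn = generate_minority(best, X_min, n_new=len(X_maj) - len(X_min), n_hidden=3)
X_min_balanced = np.vstack([X_min, X_syn])
```

Because the best chromosome is carried over unchanged each generation, the recorded best error never increases. Swapping the reconstruction error for a classifier-based score would bring the sketch closer to the fitness described in the abstract, at the cost of training classifiers inside every fitness evaluation.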

Data Availability

The data used in this study are freely accessible and are discussed in Sect. 7.1.

Funding

The research work in this article was not funded by any organization or agency.

Author information

Corresponding author

Correspondence to Pratyay Kuila.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Ram, P.K., Kuila, P. GAAE: a novel genetic algorithm based on autoencoder with ensemble classifiers for imbalanced healthcare data. J Supercomput 79, 541–572 (2023). https://doi.org/10.1007/s11227-022-04679-x

  • DOI: https://doi.org/10.1007/s11227-022-04679-x
