
A clustering and generative adversarial networks-based hybrid approach for imbalanced data classification

  • Original Research
Journal of Ambient Intelligence and Humanized Computing

Abstract

Imbalanced data are common in fields such as banking, insurance, security, and medical care. An imbalanced class distribution biases decision-making, so that minority-class samples are easily misclassified. Imbalanced data classification is therefore a challenging task that has attracted extensive research across many disciplines. In this study, we propose a method that combines clustering with a generative adversarial network (GAN) to address imbalanced data classification. First, we divide the majority class samples into three types using the K-nearest neighbor algorithm. Second, we undersample each of the three types of majority class data with a clustering method. Third, we design a GAN model for tabular data generation to oversample the minority class. Finally, the preprocessed majority and minority data are used to train the machine learning model. We evaluated the method on real-world data sets. The experimental results show that data processed by the proposed method achieve excellent results on two evaluation metrics across three common classification methods.
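The first two steps of the pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not name the three sample types, the neighbor-ratio thresholds, or the clustering algorithm, so the labels `safe`, `borderline`, and `noisy`, the cut-offs 0.8 and 0.4, the use of k-means, and the rule of keeping one representative per cluster are all assumptions made for illustration. The function names `knn_types` and `cluster_undersample` are hypothetical. The GAN-based oversampling step is omitted.

```python
import math
import random


def knn_types(X, y, majority=0, k=5):
    """Label each majority sample by the share of its k nearest neighbours
    that also belong to the majority class.

    The type names and thresholds are illustrative assumptions,
    not values taken from the paper.
    """
    types = {}
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != majority:
            continue
        # Distances to every other sample, nearest first.
        neighbours = sorted(
            (math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i
        )[:k]
        ratio = sum(1 for _, j in neighbours if y[j] == majority) / len(neighbours)
        if ratio >= 0.8:
            types[i] = "safe"
        elif ratio >= 0.4:
            types[i] = "borderline"
        else:
            types[i] = "noisy"
    return types


def cluster_undersample(points, n_keep, n_iter=20, seed=0):
    """Run a small k-means (assumed clustering method) and keep the member
    closest to each non-empty cluster centre."""
    rng = random.Random(seed)
    centres = rng.sample(points, n_keep)
    for _ in range(n_iter):
        groups = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda c: math.dist(p, centres[c]))
            groups[nearest].append(p)
        # Move each centre to the mean of its group; keep it if the group is empty.
        centres = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centres[c]
            for c, g in enumerate(groups)
        ]
    return [
        min(g, key=lambda p: math.dist(p, centres[c]))
        for c, g in enumerate(groups)
        if g
    ]


# Toy data: 25 majority points on a grid, 5 minority points, and one
# majority point sitting inside the minority region (index 30).
X = [(i * 0.1, j * 0.1) for i in range(5) for j in range(5)]
y = [0] * 25
X += [(5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1), (5.2, 5.2)]
y += [1] * 5
X.append((5.05, 5.05))
y.append(0)

types = knn_types(X, y)
safe = [X[i] for i, t in types.items() if t == "safe"]
kept = cluster_undersample(safe, n_keep=5)
```

In this toy example the isolated majority point is surrounded only by minority samples, so it falls into the lowest-ratio type, while the grid points are reduced to at most five representatives. How each type should be treated during undersampling is a design decision the abstract leaves open.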


Data availability

The data sets used in this article are common benchmark data sets for imbalanced learning and are available from the KEEL repository: http://www.keel.es/


Acknowledgements

This research was supported by the National Key R&D Program of China (No. 2018YFC1604000).

Author information


Corresponding author

Correspondence to Hongwei Ding.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Ding, H., Cui, X. A clustering and generative adversarial networks-based hybrid approach for imbalanced data classification. J Ambient Intell Human Comput 14, 8003–8018 (2023). https://doi.org/10.1007/s12652-023-04610-z


  • DOI: https://doi.org/10.1007/s12652-023-04610-z
