Abstract
Imbalanced data are common in fields such as banking, insurance, security, and medical care. An imbalanced class distribution biases a classifier's decisions, so minority class samples are easily misclassified. As a challenging task, imbalanced data classification has therefore attracted extensive research in many disciplines. In this study, we propose a method that combines clustering with a generative adversarial network (GAN) to address imbalanced data classification. First, we divide the majority class samples into three types using the K-nearest neighbor algorithm. Second, we undersample each of the three types of majority class data with a clustering method. Third, we design a GAN model for tabular data generation to oversample the minority class. Finally, the preprocessed majority and minority data are used to train a machine learning model. We evaluate the method on real-world data sets. The experimental results show that data processed by the proposed method achieve excellent results on two evaluation metrics across three common classification methods.
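The first two steps of the pipeline can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the three majority types, the thresholds, or the clustering algorithm, so the labels ("safe", "borderline", "overlapping"), the cut-offs `safe_max` and `border_max`, the per-type `keep_ratio`, and the use of plain k-means are all assumptions made for illustration. The GAN oversampling step is not shown.

```python
import math
import random


def knn_types(majority, minority, k=5, safe_max=0.2, border_max=0.6):
    """Label each majority sample by the share of minority points among
    its k nearest neighbours. Labels and thresholds are assumptions."""
    pool = [(p, 0) for p in majority] + [(p, 1) for p in minority]
    labels = []
    for i, x in enumerate(majority):
        # pool[i] is x itself, so it is skipped.
        others = sorted(
            ((math.dist(x, p), y) for j, (p, y) in enumerate(pool) if j != i),
            key=lambda t: t[0],
        )
        share = sum(y for _, y in others[:k]) / k
        if share <= safe_max:
            labels.append("safe")
        elif share <= border_max:
            labels.append("borderline")
        else:
            labels.append("overlapping")
    return labels


def kmeans_centers(points, n_clusters, n_iter=20, seed=0):
    """Plain k-means (an assumed stand-in for the paper's clustering)."""
    rng = random.Random(seed)
    centers = rng.sample(points, n_clusters)
    for _ in range(n_iter):
        assign = [
            min(range(n_clusters), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        for c in range(n_clusters):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centre if a cluster is empty
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centers


def cluster_undersample(points, n_keep, seed=0):
    """Keep one real sample per cluster: the one closest to each centre."""
    if n_keep <= 0:
        return []
    if n_keep >= len(points):
        return list(points)
    chosen = []
    for center in kmeans_centers(points, n_keep, seed=seed):
        idx = min(
            (i for i in range(len(points)) if i not in chosen),
            key=lambda i: math.dist(points[i], center),
        )
        chosen.append(idx)
    return [points[i] for i in chosen]


def undersample_majority(majority, minority, keep_ratio, k=5, seed=0):
    """Type the majority samples, then undersample each type separately."""
    labels = knn_types(majority, minority, k=k)
    kept = []
    for kind, ratio in keep_ratio.items():
        group = [p for p, lab in zip(majority, labels) if lab == kind]
        kept.extend(cluster_undersample(group, round(ratio * len(group)), seed))
    return kept


# Toy data: a 6x6 majority grid with a small minority group in one corner.
majority = [(float(x), float(y)) for x in range(6) for y in range(6)]
minority = [(5.2, 5.2), (5.4, 5.1), (5.1, 5.4), (4.8, 5.3), (5.3, 4.8)]
ratios = {"safe": 0.5, "borderline": 1.0, "overlapping": 0.0}  # assumed
reduced = undersample_majority(majority, minority, ratios)
print(len(majority), "->", len(reduced))
```

In this sketch the retained majority samples are always real observations (the sample nearest each cluster centre) rather than synthetic centroids. The reduced majority set would then be combined with the GAN-augmented minority set before training a classifier.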
Data availability
The data sets used in this article are commonly used for imbalanced learning and are available at http://www.keel.es/
Acknowledgements
This research was supported by the National Key R&D Program of China (No. 2018YFC1604000).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Ding, H., Cui, X. A clustering and generative adversarial networks-based hybrid approach for imbalanced data classification. J Ambient Intell Human Comput 14, 8003–8018 (2023). https://doi.org/10.1007/s12652-023-04610-z