Abstract
Multi-class imbalanced classification is a challenging problem in machine learning. Among the many methods proposed to address it, oversampling is one of the most popular: it alleviates class imbalance by generating instances for the minority classes. However, existing oversampling methods apply a single generation strategy to all minority instances, which neglects the intrinsic differences among them and can make the synthetic instances redundant or ineffective. In this work, we propose a global-local-based oversampling method, termed GLOS. We introduce a new discreteness-based metric (DID) and distinguish the minority classes from the majority classes by comparing their class-level discreteness values. Then, for each minority class, we select the difficult-to-learn instances whose instance-level dispersion is smaller than the corresponding class-level value and use them to generate synthetic instances; the number of synthetic instances equals the difference between the two dispersion values. The selected instances are assigned to different groups according to their local distribution, and GLOS applies a group-specific synthesis strategy to each. Finally, all minority-class instances, part of the majority-class instances, and the synthetic data form the training set. In this way, both the quantity and quality of the synthetic instances are controlled. Experimental results on KEEL and UCI data sets demonstrate the effectiveness of our proposal.
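The pipeline sketched in the abstract — compare instance-level dispersion against a class-level value, keep the instances below it as seeds, and interpolate new samples from those seeds — can be illustrated with a minimal stand-in. The paper's actual DID metric and grouping step are not reproduced here; this sketch substitutes mean distance to the class centroid as the dispersion measure and balances each minority class up to the largest class size, so `oversample_glos_like` and its internals are hypothetical names, not the authors' implementation.

```python
import numpy as np

def oversample_glos_like(X, y, rng=None):
    """SMOTE-style oversampling guided by a dispersion comparison.

    Hypothetical stand-in for GLOS: dispersion is approximated by
    distance to the class centroid, not the paper's DID metric.
    """
    rng = np.random.default_rng(rng)
    classes, counts = np.unique(y, return_counts=True)
    # Treat classes smaller than the average class size as minority classes.
    minority = classes[counts < counts.mean()]
    X_new, y_new = [X], [y]
    for c in minority:
        Xc = X[y == c]
        centroid = Xc.mean(axis=0)
        d = np.linalg.norm(Xc - centroid, axis=1)  # instance-level dispersion
        class_disp = d.mean()                      # class-level dispersion
        seeds = Xc[d < class_disp]                 # "difficult" seed instances
        if len(seeds) < 2:
            continue
        n_syn = counts.max() - len(Xc)             # balance up to the largest class
        for _ in range(n_syn):
            a, b = seeds[rng.choice(len(seeds), 2, replace=False)]
            lam = rng.random()
            X_new.append((a + lam * (b - a))[None, :])  # linear interpolation
            y_new.append(np.array([c]))
    return np.vstack(X_new), np.concatenate(y_new)
```

Interpolating only between the selected seeds keeps new samples inside the dense part of the minority class, which is the intuition behind restricting generation to instances with below-average dispersion.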
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Nos. U21A20513, 62076154, U1805263, 62276157, 61503229), the Key R&D Program of Shanxi Province (International Cooperation) (No. 201903D421050), the Natural Science Foundation of Shanxi Province (No. 201901D111033), and the Central Guidance on Local Science and Technology Development Fund of Shanxi Province (YDZX20201400001224).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Han, M., Guo, H., Li, J. et al. Global-local information based oversampling for multi-class imbalanced data. Int. J. Mach. Learn. & Cyber. 14, 2071–2086 (2023). https://doi.org/10.1007/s13042-022-01746-w