
Global-local information based oversampling for multi-class imbalanced data

  • Original Article

International Journal of Machine Learning and Cybernetics

Abstract

Multi-class imbalanced classification is a challenging problem in machine learning. Many methods have been proposed to address it, and oversampling, which alleviates class imbalance by generating instances for the minority classes, is among the most popular. However, most oversampling methods use a single generation strategy for all minority classes, which neglects the intrinsic differences among minority-class instances and can make the synthetic instances redundant or ineffective. In this work, we propose a global-local information based oversampling method, termed GLOS. We introduce a new discreteness-based metric (DID) and distinguish minority classes from majority classes by comparing each class-level discreteness value. Then, for each minority class, difficult-to-learn instances (those whose instance-level dispersion is smaller than the corresponding class-level value) are selected to seed synthetic instances, and the number of synthetic instances equals the difference between the two dispersion values. The selected instances are assigned to groups according to their local distribution, and GLOS applies a synthesis strategy tailored to each group. Finally, all minority-class instances, a subset of the majority-class instances, and the synthetic data are used as training data. In this way, both the quantity and the quality of synthetic instances are controlled. Experimental results on KEEL and UCI data sets demonstrate the effectiveness of our proposal.
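The selection step described above can be sketched in code. This is a minimal illustrative sketch only, not the authors' implementation: it assumes dispersion is measured as mean Euclidean distance to the class centroid (the paper's DID metric is not defined in this excerpt), and it generates synthetic points by interpolating between each selected difficult instance and the centroid.

```python
import numpy as np

def oversample_minority(X_min, n_new, rng=None):
    """Sketch of dispersion-based minority oversampling.

    Instances whose distance to the class centroid (instance-level
    dispersion) falls below the class-level mean are treated as
    'difficult' and used to seed synthetic points, mirroring the
    selection rule described in the abstract. The DID metric and
    group-wise synthesis strategies of GLOS are not reproduced here.
    """
    rng = np.random.default_rng(rng)
    centroid = X_min.mean(axis=0)
    dists = np.linalg.norm(X_min - centroid, axis=1)
    hard = X_min[dists < dists.mean()]  # instance-level < class-level
    # Interpolate: new = hard + gap * (centroid - hard), gap in (0, 1)
    idx = rng.integers(0, len(hard), size=n_new)
    gaps = rng.random((n_new, 1))
    return hard[idx] + gaps * (centroid - hard[idx])

# Toy example: a 2-D minority class with 20 points
rng = np.random.default_rng(0)
X_min = rng.normal(size=(20, 2))
synthetic = oversample_minority(X_min, n_new=10, rng=1)
print(synthetic.shape)  # (10, 2)
```

Interpolating toward the centroid keeps synthetic points inside the minority region; the full method instead adapts the synthesis direction per local group.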





Acknowledgements

This work was supported by the National Natural Science Foundation of China (Nos. U21A20513, 62076154, U1805263, 62276157, 61503229), the Key R&D Program of Shanxi Province (International Cooperation) (No. 201903D421050), the Natural Science Foundation of Shanxi Province (No. 201901D111033), and the Central Guidance on Local Science and Technology Development Fund of Shanxi Province (YDZX20201400001224).

Author information


Corresponding author

Correspondence to Wenjian Wang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Han, M., Guo, H., Li, J. et al. Global-local information based oversampling for multi-class imbalanced data. Int. J. Mach. Learn. & Cyber. 14, 2071–2086 (2023). https://doi.org/10.1007/s13042-022-01746-w

