Abstract
Multi-class imbalanced classification is a challenging problem in machine learning. Among the many methods proposed to address it, oversampling is one of the most popular: it alleviates class imbalance by generating instances for the minority classes. However, existing oversampling methods apply a single generation strategy to all minority instances, which neglects the intrinsic differences among them and can make the synthetic instances redundant or ineffective. In this work, we propose a global-local-based oversampling method, termed GLOS. We introduce a new discreteness-based metric (DID) and distinguish the minority classes from the majority classes by comparing their class-level discreteness values. Then, for each minority class, we select the difficult-to-learn instances whose instance-level dispersion is smaller than the corresponding class-level value and use them to generate synthetic instances; the number of synthetic instances equals the difference between the two dispersion values. The selected instances are assigned to different groups according to their local distribution, and GLOS applies a group-specific synthesis strategy to each. Finally, all minority-class instances, part of the majority-class instances, and the synthetic data form the training set. In this way, both the quantity and quality of the synthetic instances are controlled. Experimental results on KEEL and UCI data sets demonstrate the effectiveness of our proposal.
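The pipeline sketched in the abstract — compare instance-level dispersion against a class-level value, keep the instances below it as seeds, and interpolate new samples from those seeds — can be illustrated with a minimal stand-in. The paper's actual DID metric and grouping step are not reproduced here; this sketch substitutes mean distance to the class centroid as the dispersion measure and balances each minority class up to the largest class size, so `oversample_glos_like` and its internals are hypothetical names, not the authors' implementation.

```python
import numpy as np

def oversample_glos_like(X, y, rng=None):
    """SMOTE-style oversampling guided by a dispersion comparison.

    Hypothetical stand-in for GLOS: dispersion is approximated by
    distance to the class centroid, not the paper's DID metric.
    """
    rng = np.random.default_rng(rng)
    classes, counts = np.unique(y, return_counts=True)
    # Treat classes smaller than the average class size as minority classes.
    minority = classes[counts < counts.mean()]
    X_new, y_new = [X], [y]
    for c in minority:
        Xc = X[y == c]
        centroid = Xc.mean(axis=0)
        d = np.linalg.norm(Xc - centroid, axis=1)  # instance-level dispersion
        class_disp = d.mean()                      # class-level dispersion
        seeds = Xc[d < class_disp]                 # "difficult" seed instances
        if len(seeds) < 2:
            continue
        n_syn = counts.max() - len(Xc)             # balance up to the largest class
        for _ in range(n_syn):
            a, b = seeds[rng.choice(len(seeds), 2, replace=False)]
            lam = rng.random()
            X_new.append((a + lam * (b - a))[None, :])  # linear interpolation
            y_new.append(np.array([c]))
    return np.vstack(X_new), np.concatenate(y_new)
```

Interpolating only between the selected seeds keeps new samples inside the dense part of the minority class, which is the intuition behind restricting generation to instances with below-average dispersion.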
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Nos. U21A20513, 62076154, U1805263, 62276157, 61503229), the Key R&D Program of Shanxi Province (International Cooperation) (No. 201903D421050), the Natural Science Foundation of Shanxi Province (No. 201901D111033), and the Central Guidance on Local Science and Technology Development Fund of Shanxi Province (YDZX20201400001224).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Han, M., Guo, H., Li, J. et al. Global-local information based oversampling for multi-class imbalanced data. Int. J. Mach. Learn. & Cyber. 14, 2071–2086 (2023). https://doi.org/10.1007/s13042-022-01746-w