Abstract
Learning from datasets whose classes differ in absolute frequency is one of the most challenging tasks in machine learning. Efforts to tackle class imbalance have produced solutions at both the data and algorithmic levels. To categorize these solutions by the imbalance level of the problem, and to obtain meaningful and consistent interpretations from experiments, it is essential to be able to quantify the extent of dataset imbalance. A competent scale for summarizing the severity of inter-class imbalance must meet at least the following three conditions: (1) it can calculate the imbalance extent for both binary and multi-class datasets, (2) its output lies within a definite, fixed range of values, and (3) it correlates with the performance of different classifiers. Nevertheless, none of the scales introduced so far satisfies all of these requirements. In this study, we propose an informative scale called the imbalance factor (IF), based on information theory, which quantifies the extent of dataset imbalance as a single value in the range [0, 1], independent of the number of classes. In addition, IF offers various limiting cases with different growth rates according to its order α. This property is critical because it helps resolve cases in which distinct distributions would otherwise receive the same imbalance extent. Finally, empirical experiments indicate that, with an average correlation of 0.766 with classification accuracies over 15 real datasets, IF is remarkably more sensitive to changes in class imbalance than previous scales.
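To make the idea of an entropy-based imbalance scale in [0, 1] concrete, the following is a minimal sketch and not the paper's definition of IF, which the abstract does not state. It assumes one plausible construction: one minus the Rényi entropy of order α of the class distribution, divided by its maximum value log K for K classes. The function names `renyi_entropy` and `imbalance_score` and the `n_classes` parameter are hypothetical.

```python
import math
from collections import Counter


def renyi_entropy(probs, alpha):
    """Renyi entropy (natural log) of a discrete distribution, alpha >= 0."""
    if alpha < 0:
        raise ValueError("alpha must be non-negative")
    probs = [p for p in probs if p > 0]
    if math.isclose(alpha, 1.0):
        # Limit alpha -> 1 is the Shannon entropy.
        return -sum(p * math.log(p) for p in probs)
    if math.isinf(alpha):
        # Limit alpha -> infinity is the min-entropy.
        return -math.log(max(probs))
    return math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)


def imbalance_score(labels, alpha=1.0, n_classes=None):
    """ASSUMED scale: 1 - H_alpha(p) / log(K); 0 = balanced, 1 = one class only.

    This normalization is an illustrative assumption, not the published IF.
    """
    counts = Counter(labels)
    k = n_classes if n_classes is not None else len(counts)
    if k < 2:
        raise ValueError("need at least two classes")
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    return 1.0 - renyi_entropy(probs, alpha) / math.log(k)


if __name__ == "__main__":
    balanced = [0] * 50 + [1] * 50
    skewed = [0] * 99 + [1] * 1
    for a in (0.5, 1.0, 2.0, math.inf):
        print(a, imbalance_score(balanced, a), imbalance_score(skewed, a))
```

Under this assumed form, a perfectly balanced dataset scores 0 for every order α, while the score of a skewed dataset changes with α, which illustrates how different orders can separate distributions that a single fixed measure might score alike.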
Availability of data and materials
Data is available from https://archive.ics.uci.edu/ml/datasets.php.
Code availability
The source code for calculating the Imbalance Factor for classification problems is implemented in Python and is available from the paper DOI.
Funding
The authors did not receive support from any organization for the submitted work.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
About this article
Cite this article
Pirizadeh, M., Farahani, H. & Kheradpisheh, S.R. Imbalance factor: a simple new scale for measuring inter-class imbalance extent in classification problems. Knowl Inf Syst 65, 4157–4183 (2023). https://doi.org/10.1007/s10115-023-01881-y
DOI: https://doi.org/10.1007/s10115-023-01881-y