Imbalance factor: a simple new scale for measuring inter-class imbalance extent in classification problems

  • Regular Paper
  • Knowledge and Information Systems

Abstract

Learning from datasets whose classes differ in absolute frequency is one of the most challenging tasks in machine learning. Efforts to tackle class imbalance have produced solutions at both the data level and the algorithmic level. To categorize these solutions by the imbalance level of the problem, and to draw meaningful and consistent interpretations from experiments, it is essential to be able to quantify the extent of dataset imbalance. A competent scale for summarizing the severity of inter-class imbalance must meet at least three conditions: (1) it can be calculated for both binary and multi-class datasets, (2) its output lies within a definite, fixed range of values, and (3) it correlates with the performance of different classifiers. None of the scales introduced so far satisfies all of these requirements. In this study, we propose an informative scale called the imbalance factor (IF), based on information theory, which quantifies the imbalance extent of a dataset as a single value in the range [0, 1], independent of the number of classes. IF also admits several limiting cases with different growth rates, depending on its order α. This property matters because it helps distinguish distributions that would otherwise receive the same imbalance value. Finally, experiments on 15 real datasets indicate that IF, with an average correlation of 0.766 with classification accuracy, is considerably more sensitive to changes in class imbalance than previous scales.
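The abstract does not state the formula for IF, so the following Python sketch is only an illustration of how an entropy-based scale with an order α and a fixed [0, 1] range could behave. It assumes a score of the form 1 minus the Rényi entropy of the class distribution divided by the logarithm of the number of observed classes, with 0 meaning perfectly balanced. The function names `renyi_entropy` and `normalized_renyi_imbalance` are hypothetical, and the paper's actual definition may differ.

```python
import math
from collections import Counter


def renyi_entropy(probs, alpha=2.0):
    """Rényi entropy (natural log) of order alpha > 0 for a probability vector."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    probs = [p for p in probs if p > 0]
    if math.isclose(alpha, 1.0):
        # The limit alpha -> 1 is the Shannon entropy.
        return -sum(p * math.log(p) for p in probs)
    if math.isinf(alpha):
        # The limit alpha -> infinity is the min-entropy.
        return -math.log(max(probs))
    return math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)


def normalized_renyi_imbalance(labels, alpha=2.0):
    """Assumed illustrative score: 0 for balanced classes, near 1 for extreme imbalance.

    This is NOT taken from the paper; it only mirrors the properties the
    abstract lists (any number of classes, fixed range, order alpha).
    """
    counts = Counter(labels)
    k = len(counts)  # only classes that actually occur are counted
    if k < 2:
        raise ValueError("at least two observed classes are required")
    n = sum(counts.values())
    probs = [c / n for c in counts.values()]
    return 1.0 - renyi_entropy(probs, alpha) / math.log(k)


if __name__ == "__main__":
    balanced = [0] * 50 + [1] * 50
    skewed = [0] * 99 + [1] * 1
    multi = ["a"] * 80 + ["b"] * 15 + ["c"] * 5
    for name, y in [("balanced", balanced), ("skewed", skewed), ("multi", multi)]:
        print(name, round(normalized_renyi_imbalance(y, alpha=2.0), 3))
```

Under this assumed form, changing `alpha` changes how quickly the score grows as the distribution moves away from uniform, which is one way two distinct distributions with equal scores at one order can be separated at another.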

Availability of data and materials

Data is available from https://archive.ics.uci.edu/ml/datasets.php.

Code availability

The source code for calculating the imbalance factor in classification problems is implemented in Python and is available via the paper's DOI.

Funding

The authors did not receive support from any organization for the submitted work.

Author information

Corresponding author

Correspondence to Hadi Farahani.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

About this article

Cite this article

Pirizadeh, M., Farahani, H. & Kheradpisheh, S.R. Imbalance factor: a simple new scale for measuring inter-class imbalance extent in classification problems. Knowl Inf Syst 65, 4157–4183 (2023). https://doi.org/10.1007/s10115-023-01881-y

  • DOI: https://doi.org/10.1007/s10115-023-01881-y
