
The Effects of Class Imbalance and Training Data Size on Classifier Learning: An Empirical Study

  • Original Research
  • Published in: SN Computer Science

Abstract

This study examines the effects of class imbalance and training data size on the predictive performance of classifiers. An empirical study was performed on ten classifiers drawn from seven categories, all frequently employed and previously identified as effective. In addition, comprehensive hyperparameter tuning was performed on every dataset to maximize the performance of each classifier. The results indicate that (1) naïve Bayes, logistic regression, and the logit leaf model are less susceptible to class imbalance, although their predictive performance is relatively poor; (2) the ensemble classifiers AdaBoost, XGBoost, and parRF are considerably less stable under class imbalance, although they achieve superior predictive accuracy; (3) the accuracy of every classifier in this study decreased once the class skew reached 0.10; note that although a balanced class distribution is the ideal condition for maximizing classifier performance, as long as the skew is larger than 0.10, comprehensive hyperparameter tuning may be able to eliminate the effect of class imbalance; (4) no single classifier proved robust to changes in training data size; (5) CART ranked last among the ten classifiers.
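
The experimental design can be illustrated with a minimal Python sketch, assuming scikit-learn; the paper itself publishes no code, and its actual classifier pool includes implementations such as parRF from R. The classifier set, parameter grids, skew levels, and sample sizes below are illustrative assumptions: training sets of varying size and minority-class proportion (the skew) are generated, each classifier is tuned by grid search, and held-out accuracy is recorded.

    # Hypothetical sketch, not the authors' code: probing how class skew and
    # training-set size affect the accuracy of tuned classifiers.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import balanced_accuracy_score
    from sklearn.model_selection import GridSearchCV, train_test_split

    # Illustrative classifier pool and hyperparameter grids (assumptions).
    classifiers = {
        "logreg": (LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1, 10]}),
        "adaboost": (AdaBoostClassifier(), {"n_estimators": [50, 100, 200]}),
        "rf": (RandomForestClassifier(), {"n_estimators": [100, 300]}),
    }

    for skew in [0.50, 0.25, 0.10, 0.05]:      # minority-class proportion
        for n in [500, 2000, 8000]:            # training-set size
            # Synthetic data; 2000 extra samples are held out for testing.
            X, y = make_classification(n_samples=n + 2000, n_features=20,
                                       weights=[1 - skew, skew], random_state=0)
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, train_size=n, stratify=y, random_state=0)
            for name, (clf, grid) in classifiers.items():
                # Comprehensive tuning per dataset, as the abstract describes.
                search = GridSearchCV(clf, grid, cv=5,
                                      scoring="balanced_accuracy")
                search.fit(X_tr, y_tr)
                acc = balanced_accuracy_score(y_te, search.predict(X_te))
                print(f"skew={skew:.2f} n={n} {name}: acc={acc:.3f}")

Balanced accuracy is used here so that scores remain comparable across skew levels; whether it matches the paper's actual metric is an assumption.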


Notes

  1. NFL theorem: If algorithm A outperforms algorithm B on some cost functions, then, loosely speaking, there must exist exactly as many other cost functions on which B outperforms A.
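
For reference, the theorem as formulated by Wolpert and Macready can be written as follows, where f ranges over all cost functions and d_m^y denotes the sequence of cost values an algorithm observes after m distinct evaluations:

    % No-free-lunch theorem: summed over all cost functions f, any two
    % search algorithms A_1 and A_2 induce the same distribution of
    % observed cost sequences d_m^y after m evaluations.
    \[
      \sum_{f} P\bigl(d_m^{y} \mid f, m, A_1\bigr)
      = \sum_{f} P\bigl(d_m^{y} \mid f, m, A_2\bigr)
    \]

Hence any advantage of A_1 on one subset of cost functions is exactly offset on the complement, which is the informal statement in the note above.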


Author information


Corresponding author

Correspondence to Wanwan Zheng.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Zheng, W., Jin, M. The Effects of Class Imbalance and Training Data Size on Classifier Learning: An Empirical Study. SN COMPUT. SCI. 1, 71 (2020). https://doi.org/10.1007/s42979-020-0074-0

