
The Effects of Class Imbalance and Training Data Size on Classifier Learning: An Empirical Study

  • Original Research
  • Published in: SN Computer Science

Abstract

This study examines the effects of class imbalance and training data size on the predictive performance of classifiers. An empirical study was performed on ten classifiers drawn from seven categories, all frequently employed and previously identified as effective. In addition, comprehensive hyperparameter tuning was performed on every dataset to maximize the performance of each classifier. The results indicate that (1) naïve Bayes, logistic regression, and the logit leaf model are less susceptible to class imbalance, although their predictive performance is relatively poor; (2) the ensemble classifiers AdaBoost, XGBoost, and parRF are considerably less stable under class imbalance, although they achieve superior predictive accuracy; (3) the accuracy of every classifier in this study decreased once the class skew reached 0.10; note that although a balanced class distribution is the ideal condition for maximizing classifier performance, as long as the skew is larger than 0.10, comprehensive hyperparameter tuning may be able to eliminate the effect of class imbalance; (4) no single classifier proved robust to changes in training data size; (5) CART ranked last among the ten classifiers.
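
The experimental design can be illustrated with a minimal Python sketch, assuming scikit-learn; the paper itself publishes no code, and its actual classifier pool includes implementations such as parRF from R. The classifier set, parameter grids, skew levels, and sample sizes below are illustrative assumptions: training sets of varying size and minority-class proportion (the skew) are generated, each classifier is tuned by grid search, and held-out accuracy is recorded.

    # Hypothetical sketch, not the authors' code: probing how class skew and
    # training-set size affect the accuracy of tuned classifiers.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import balanced_accuracy_score
    from sklearn.model_selection import GridSearchCV, train_test_split

    # Illustrative classifier pool and hyperparameter grids (assumptions).
    classifiers = {
        "logreg": (LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1, 10]}),
        "adaboost": (AdaBoostClassifier(), {"n_estimators": [50, 100, 200]}),
        "rf": (RandomForestClassifier(), {"n_estimators": [100, 300]}),
    }

    for skew in [0.50, 0.25, 0.10, 0.05]:      # minority-class proportion
        for n in [500, 2000, 8000]:            # training-set size
            # Synthetic data; 2000 extra samples are held out for testing.
            X, y = make_classification(n_samples=n + 2000, n_features=20,
                                       weights=[1 - skew, skew], random_state=0)
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, train_size=n, stratify=y, random_state=0)
            for name, (clf, grid) in classifiers.items():
                # Comprehensive tuning per dataset, as the abstract describes.
                search = GridSearchCV(clf, grid, cv=5,
                                      scoring="balanced_accuracy")
                search.fit(X_tr, y_tr)
                acc = balanced_accuracy_score(y_te, search.predict(X_te))
                print(f"skew={skew:.2f} n={n} {name}: acc={acc:.3f}")

Balanced accuracy is used here so that scores remain comparable across skew levels; whether it matches the paper's actual metric is an assumption.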


Notes

  1. NFL theorem: If algorithm A outperforms algorithm B on some cost functions, then, loosely speaking, there must exist exactly as many other cost functions on which B outperforms A.
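
For reference, the theorem as formulated by Wolpert and Macready can be written as follows, where f ranges over all cost functions and d_m^y denotes the sequence of cost values an algorithm observes after m distinct evaluations:

    % No-free-lunch theorem: summed over all cost functions f, any two
    % search algorithms A_1 and A_2 induce the same distribution of
    % observed cost sequences d_m^y after m evaluations.
    \[
      \sum_{f} P\bigl(d_m^{y} \mid f, m, A_1\bigr)
      = \sum_{f} P\bigl(d_m^{y} \mid f, m, A_2\bigr)
    \]

Hence any advantage of A_1 on one subset of cost functions is exactly offset on the complement, which is the informal statement in the note above.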


Author information


Corresponding author

Correspondence to Wanwan Zheng.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Zheng, W., Jin, M. The Effects of Class Imbalance and Training Data Size on Classifier Learning: An Empirical Study. SN COMPUT. SCI. 1, 71 (2020). https://doi.org/10.1007/s42979-020-0074-0

