
A novel ensemble method for classification in imbalanced datasets using split balancing technique based on instance hardness (sBal_IH)

  • Original Article
  • Published in: Neural Computing and Applications

Abstract

Classification tasks on datasets with high class imbalance pose a challenge to machine learning algorithms, and such datasets are prevalent in many real-world domains and applications. In machine learning research, ensemble methods for classification on imbalanced datasets have attracted considerable attention because of their ability to improve classification performance. However, these methods remain prone to the negative effects of noise in the training sets. Furthermore, many of them alter the original class distribution through over-sampling or under-sampling, which can lead to overfitting or to discarding useful data, respectively, and thus may still hamper performance. In this work, we propose a novel ensemble method for classification that creates an arbitrary number of balanced splits (sBal) of the data, using instance hardness as the weighting mechanism for building balanced bags. Each generated bag contains all the minority instances together with a mixture of majority instances of varying hardness (easy, normal, and hard); we call this approach the sBal_IH technique. This enables base learners to train on different balanced bags that capture varied characteristics of the training data. We evaluated the proposed method on a total of 100 datasets, comprising 30 synthetic datasets with controlled levels of noise, 29 balanced real-world datasets, and 41 imbalanced real-world datasets, and compared its performance with both traditional ensemble methods (Bagging, Wagging, Random Forest, and AdaBoost) and methods specialized for class imbalance (Balanced Bagging, Balanced Random Forest, RUSBoost, and Easy Ensemble). The results reveal that the proposed method brings a substantial improvement in classification performance relative to the compared methods. For statistical significance analysis, we conducted Friedman's nonparametric test with the Bergmann post hoc test; the analysis shows that our method performs significantly better than both the traditional and the specialized ensemble methods across many datasets.
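As a rough illustration of the bag-construction idea described above, the Python sketch below shows one way hardness-stratified balanced bags could be assembled. It is not the authors' implementation: the hardness estimator (out-of-fold misclassification probability under a random forest), the tertile-based easy/normal/hard split of the majority class, and the function names (estimate_hardness, make_balanced_bags, fit_ensemble) are all assumptions made for illustration, and labels are assumed to be encoded as integers 0/1.

```python
# Minimal sketch of hardness-stratified balanced bagging in the spirit of
# sBal_IH. Assumption: instance hardness is approximated by the out-of-fold
# probability of misclassification; the paper's exact measure may differ.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier


def estimate_hardness(X, y, cv=5, seed=0):
    """Hardness proxy: 1 - out-of-fold probability of the true class.

    Assumes y is encoded as integers 0..K-1 (binary: 0/1)."""
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    proba = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")
    return 1.0 - proba[np.arange(len(y)), y]


def make_balanced_bags(X, y, n_bags=10, seed=0):
    """Each bag keeps every minority instance and draws an equal number of
    majority instances, mixing easy, normal, and hard examples."""
    rng = np.random.default_rng(seed)
    hardness = estimate_hardness(X, y)
    classes, counts = np.unique(y, return_counts=True)
    minority, majority = classes[np.argmin(counts)], classes[np.argmax(counts)]
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y == majority)
    # Split majority instances into easy/normal/hard strata by tertiles.
    q1, q2 = np.quantile(hardness[maj_idx], [1 / 3, 2 / 3])
    easy = maj_idx[hardness[maj_idx] <= q1]
    normal = maj_idx[(hardness[maj_idx] > q1) & (hardness[maj_idx] <= q2)]
    hard = maj_idx[hardness[maj_idx] > q2]
    per_stratum = len(min_idx) // 3
    bags = []
    for _ in range(n_bags):
        chosen = np.concatenate([
            rng.choice(easy, per_stratum, replace=True),
            rng.choice(normal, per_stratum, replace=True),
            rng.choice(hard, len(min_idx) - 2 * per_stratum, replace=True),
        ])
        bags.append(np.concatenate([min_idx, chosen]))  # 1:1 class balance
    return bags


def fit_ensemble(X, y, n_bags=10, seed=0):
    """Train one decision tree per balanced bag."""
    models = []
    for i, bag in enumerate(make_balanced_bags(X, y, n_bags, seed)):
        models.append(DecisionTreeClassifier(random_state=seed + i).fit(X[bag], y[bag]))
    return models


def predict(models, X):
    """Aggregate the per-bag trees by majority vote."""
    votes = np.stack([m.predict(X) for m in models])
    return np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, votes)
```

Because every bag is balanced, each base learner sees a 1:1 class ratio without ever discarding minority data, while the hardness strata vary the majority-class view across bags. For the significance analysis mentioned above, scipy.stats.friedmanchisquare can compute Friedman's test from per-dataset scores of each compared method.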



Author information

Corresponding author

Correspondence to Halimu Chongomweru.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Chongomweru, H., Kasem, A. A novel ensemble method for classification in imbalanced datasets using split balancing technique based on instance hardness (sBal_IH). Neural Comput & Applic 33, 11233–11254 (2021). https://doi.org/10.1007/s00521-020-05570-7

