A cluster impurity-based hybrid resampling for imbalanced classification problems

Applied Intelligence

Abstract

Classification, a supervised learning technique, plays a crucial role in categorizing and predicting observations across a wide range of machine learning applications, such as software defect detection, fraud detection in the financial sector, fault and defect detection in the manufacturing industry, and medical diagnosis. However, most classification algorithms have been developed under the assumption that the class distribution is balanced, although unequal class distributions are common in practice. When a class imbalance problem exists, the classifier generally becomes biased towards the majority class, so minority class instances are often misclassified as majority class instances. Class overlap is another known source of difficulty for the learning task and can degrade classification performance, especially when it occurs together with class imbalance. In this research, we therefore propose a cluster impurity-based hybrid resampling method, including a partially balanced strategy, to improve classification performance on class-imbalanced data by considering the intra-cluster class imbalance and inter-cluster overlap problems. Specifically, several clustering methods are employed to identify the groups (i.e., clusters) of all instances, and the cluster impurity of each instance is computed to measure the degree of cluster overlap. Then, based on the cluster impurity, instances are generated and eliminated recursively. To demonstrate the effectiveness of the proposed method, comprehensive experiments are conducted on forty imbalanced datasets, and two non-parametric hypothesis tests are employed to assess the statistical difference in classification performance between the proposed method and other traditional resampling methods.
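
The abstract does not give the formula for cluster impurity, so the following is only a minimal sketch under an assumed definition: the impurity of an instance is taken to be the share of instances in its cluster that carry a different class label. The function name `cluster_impurity`, the toy cluster assignments, and the labels `"maj"` and `"min"` are all hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def cluster_impurity(cluster_ids, class_labels):
    """Return one impurity score per instance.

    Assumed definition (not the paper's): the fraction of instances in the
    same cluster whose class label differs from this instance's label.
    """
    # Count how many instances of each class fall into each cluster.
    counts = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, class_labels):
        counts[cluster][label] += 1

    scores = []
    for cluster, label in zip(cluster_ids, class_labels):
        cluster_size = sum(counts[cluster].values())
        same_class = counts[cluster][label]
        scores.append(1.0 - same_class / cluster_size)
    return scores


# Toy example: cluster 0 holds only majority instances,
# cluster 1 mixes three majority instances with one minority instance.
cluster_ids = [0, 0, 0, 1, 1, 1, 1]
class_labels = ["maj", "maj", "maj", "maj", "maj", "maj", "min"]
scores = cluster_impurity(cluster_ids, class_labels)
print(scores)
```

Under this assumed definition, instances in a single-class cluster score 0, while the lone minority instance in the mixed cluster scores highest. A resampling step could then use such scores to decide where to generate or eliminate instances, but the actual rule used by the proposed method is not stated in the abstract.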

Data availability

The datasets used for training and testing are available from the public data repository https://sci2s.ugr.es/keel/imbalanced.php.

Funding

This work was supported by the General Research Program funded by the Ministry of Science and Technology, Taiwan (Grant No. MOST 110-2221-E-027-106-MY3).

Author information

Contributions

The concept and design of the study were developed by You-Jin Park. Material preparation, data collection and analysis were performed by Ke-Yong Cheng. The first draft of the manuscript was written by You-Jin Park and Ke-Yong Cheng, and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to You-Jin Park.

Ethics declarations

Ethical approval and informed consent

The datasets used and analyzed during the current study are available from the public data repository.

Competing interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Park, YJ., Cheng, KY. A cluster impurity-based hybrid resampling for imbalanced classification problems. Appl Intell 54, 9671–9684 (2024). https://doi.org/10.1007/s10489-024-05644-2

  • DOI: https://doi.org/10.1007/s10489-024-05644-2