A cluster impurity-based hybrid resampling for imbalanced classification problems

Park, You-Jin; Cheng, Ke-Yong

doi:10.1007/s10489-024-05644-2

Published: 20 July 2024

Volume 54, pages 9671–9684, (2024)

Applied Intelligence

Abstract

Classification, a supervised learning technique, plays a crucial role in categorizing and predicting observations across a wide range of machine learning applications, such as software defect detection, fraud detection in the financial sector, fault and defect detection in manufacturing, and medical diagnosis. However, most classification algorithms have been developed under the assumption that the class distribution is balanced, although unequal class distributions are common in practice. When classes are imbalanced, the classifier tends to become biased towards the majority class, so minority class instances are often misclassified as majority class instances. Class overlap is another known source of difficulty for the learning task and can degrade classification performance, especially when it occurs together with class imbalance. In this research, we therefore propose a cluster impurity-based hybrid resampling method, including a partially balanced strategy, that improves classification performance on class-imbalanced data by accounting for both intra-cluster class imbalance and inter-cluster overlap. Specifically, several clustering methods are employed to identify groups (i.e., clusters) of instances, and the cluster impurity of each instance is computed to measure the degree of cluster overlap. Then, based on the cluster impurity, instances are generated and eliminated recursively. To demonstrate the effectiveness of the proposed method, comprehensive experiments are conducted on forty imbalanced datasets, and two non-parametric hypothesis tests are employed to assess the statistical significance of differences in classification performance between the proposed method and traditional resampling methods.
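The abstract does not state how cluster impurity is computed, so the following is only an illustrative sketch under an assumed definition: the impurity of an instance is taken to be the fraction of instances in its cluster that belong to a different class. The function name `cluster_impurity`, the toy cluster assignments, and the class labels are all hypothetical, and the cluster ids are assumed to come from any clustering method applied beforehand.

```python
from collections import Counter, defaultdict


def cluster_impurity(cluster_ids, class_labels):
    """Per-instance impurity under an ASSUMED definition (not from the paper):
    the share of the instance's cluster that belongs to other classes."""
    if len(cluster_ids) != len(class_labels):
        raise ValueError("cluster_ids and class_labels must have the same length")

    # Count how many instances of each class fall inside every cluster.
    counts = defaultdict(Counter)
    for cid, label in zip(cluster_ids, class_labels):
        counts[cid][label] += 1

    impurities = []
    for cid, label in zip(cluster_ids, class_labels):
        cluster_size = sum(counts[cid].values())
        same_class = counts[cid][label]
        impurities.append(1.0 - same_class / cluster_size)
    return impurities


# Toy data: cluster 0 is dominated by the majority class, cluster 1 is mixed.
cluster_ids = [0, 0, 0, 0, 1, 1, 1, 1]
class_labels = ["maj", "maj", "maj", "min", "min", "min", "maj", "maj"]

scores = cluster_impurity(cluster_ids, class_labels)
for cid, label, score in zip(cluster_ids, class_labels, scores):
    print(f"cluster={cid} class={label} impurity={score:.2f}")
```

Under this assumed definition, a high score marks an instance that sits in a cluster mostly occupied by another class. A resampling scheme could use such a score to decide where to generate or eliminate instances, but the thresholds and the recursive procedure used by the authors are not given in the abstract and are not reproduced here.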

Data availability

The datasets used for training and testing are available from the public KEEL data repository: https://sci2s.ugr.es/keel/imbalanced.php.

Funding

This work was supported by the General Research Program funded by the Ministry of Science and Technology, Taiwan (Grant No. MOST 110-2221-E-027-106-MY3).

Author information

Authors and Affiliations

Department of Industrial Engineering and Management, National Taipei University of Technology, Taipei, 10608, Taiwan
You-Jin Park & Ke-Yong Cheng

Contributions

The concept and design of the study were developed by You-Jin Park. Material preparation, data collection and analysis were performed by Ke-Yong Cheng. The first draft of the manuscript was written by You-Jin Park and Ke-Yong Cheng, and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to You-Jin Park.

Ethics declarations

Ethical approval and informed consent

The datasets used and analyzed during the current study are available from the public data repository.

Competing interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Park, YJ., Cheng, KY. A cluster impurity-based hybrid resampling for imbalanced classification problems. Appl Intell 54, 9671–9684 (2024). https://doi.org/10.1007/s10489-024-05644-2

Accepted: 25 June 2024
Published: 20 July 2024
Issue Date: October 2024
DOI: https://doi.org/10.1007/s10489-024-05644-2
