Abstract
The Synthetic Minority Oversampling Technique (SMOTE) is widely regarded as the benchmark method for class-imbalance learning. Since SMOTE was proposed, many variants have emerged, which fall into two types: pre-processing and post-processing. However, most pre-processing methods do not filter noisy samples, while post-processing methods pay little attention to the data in the focus area. In this paper, we present an oversampling method based on k-means SMOTE and the Iterative Partitioning Filter (KSIPF), which overcomes both shortcomings. KSIPF first uses k-means to cluster the data and selects the clusters to oversample; IPF is then applied to remove noisy samples. KSIPF is compared with SMOTE and its variants on 30 synthetic and 20 real-world imbalanced data sets, and the balanced data sets are used to train SVM and AdaBoost classifiers to assess its effectiveness. The experimental results demonstrate that KSIPF outperforms the comparison methods in terms of area under the curve, F1-measure, and statistical tests.
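The pipeline the abstract describes — cluster with k-means, SMOTE-oversample the minority-dominated clusters, then remove likely-noise samples with an iterative partition-based filter — can be sketched in miniature as follows. All function names, the 1-NN base learner, and the majority-dominance threshold here are illustrative assumptions, not the authors' implementation (the published IPF, for instance, typically uses decision-tree base learners, and k-means SMOTE allocates synthetic samples with a density-based weighting):

```python
import random

def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def kmeans_assign(X, k, iters=20):
    """Plain Lloyd's k-means; returns one cluster index per sample.
    Deterministic first-k initialization keeps the sketch reproducible."""
    centers = [list(X[i]) for i in range(k)]
    assign = [0] * len(X)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: dist2(x, centers[c])) for x in X]
        for c in range(k):
            members = [x for x, a in zip(X, assign) if a == c]
            if members:
                centers[c] = [sum(m[d] for m in members) / len(members)
                              for d in range(len(members[0]))]
    return assign

def smote(minority, n_new, rng):
    """SMOTE's core step: interpolate between random minority pairs."""
    out = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        t = rng.random()
        out.append(tuple(p + t * (q - p) for p, q in zip(a, b)))
    return out

def ipf_filter(data, n_parts=3, max_rounds=5, rng=None):
    """IPF-style filter: split the data into partitions, treat each
    partition as a weak learner (1-NN here), drop samples that a
    majority of the learners misclassify, and repeat until stable."""
    rng = rng or random.Random(1)
    data = list(data)
    nn = lambda x, part: min(part, key=lambda s: dist2(x, s[0]))[1]
    for _ in range(max_rounds):
        rng.shuffle(data)
        parts = [data[i::n_parts] for i in range(n_parts)]
        kept = [(x, lbl) for x, lbl in data
                if 2 * sum(nn(x, p) == lbl for p in parts) > n_parts]
        if len(kept) == len(data):
            break
        data = kept
    return data

def ksipf(X, y, k=2, seed=0):
    """Cluster, oversample minority-dominated clusters, then filter."""
    rng = random.Random(seed)
    minority = min(set(y), key=y.count)
    n_needed = max(y.count(lbl) for lbl in set(y)) - y.count(minority)
    assign = kmeans_assign(X, k)
    eligible = []
    for c in range(k):
        members = [j for j, a in enumerate(assign) if a == c]
        mins = [X[j] for j in members if y[j] == minority]
        # oversample only clusters dominated by the minority class
        if len(mins) >= 2 and 2 * len(mins) > len(members):
            eligible.append(mins)
    total = sum(len(m) for m in eligible)
    synthetic = []
    for mins in eligible:
        synthetic += smote(mins, round(n_needed * len(mins) / total), rng)
    return ipf_filter(list(zip(X, y)) + [(s, minority) for s in synthetic],
                      rng=rng)

# Toy demo: 12 majority points near the origin, 4 minority points near (5, 5)
maj = [(i * 0.1, j * 0.1) for i in range(4) for j in range(3)]
mino = [(5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.2, 5.1)]
X = [maj[0], mino[0]] + maj[1:] + mino[1:]
y = [0, 1] + [0] * 11 + [1] * 3
balanced = ksipf(X, y, k=2)
```

On the toy data the eligible minority cluster receives 8 synthetic points, balancing the two classes at 12 samples each; because the blobs are well separated, the filter removes nothing here, but on noisy data the same majority vote would strip mislabeled samples, including synthetic ones generated near label noise.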
Data availability
No data sets were generated or analyzed during the current study.
Author information
Authors and Affiliations
Contributions
Pengfei Sun helped in conceptualization, methodology, software, validation, formal analysis, data curation, writing—original draft, writing—review & editing, and visualization. Zhiping Wang helped in conceptualization, methodology, writing—review & editing, and supervision. Liyan Jia helped in conceptualization, methodology, software, and writing—review & editing. Xiaoxi Wang helped in data curation and supervision.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Sun, P., Wang, Z., Jia, L. et al. KSIPF: an effective noise filtering oversampling method based on k-means and iterative-partitioning filter. J Supercomput 81, 596 (2025). https://doi.org/10.1007/s11227-025-07081-5