Abstract
The Synthetic Minority Oversampling Technique (SMOTE) is widely regarded as the benchmark method for class-imbalance learning. Since SMOTE was proposed, many variants have emerged, which fall into two types: pre-processing and post-processing. However, most pre-processing methods do not filter noisy samples, while post-processing methods pay little attention to the data in the focus area. In this paper, we present an oversampling method based on k-means SMOTE and the Iterative Partitioning Filter (KSIPF), which overcomes both shortcomings. KSIPF first uses k-means to cluster the data and selects the clusters to oversample; IPF is then applied to remove noisy samples. KSIPF is compared with SMOTE and its variants on 30 synthetic and 20 real-world imbalanced data sets, and the balanced data sets are used to train SVM and AdaBoost classifiers to assess its effectiveness. The experimental results demonstrate that KSIPF outperforms the comparison methods in terms of area under the curve, F1-measure, and statistical tests.
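The pipeline the abstract describes — cluster with k-means, SMOTE-oversample the minority-dominated clusters, then remove likely-noise samples with an iterative partition-based filter — can be sketched in miniature as follows. All function names, the 1-NN base learner, and the majority-dominance threshold here are illustrative assumptions, not the authors' implementation (the published IPF, for instance, typically uses decision-tree base learners, and k-means SMOTE allocates synthetic samples with a density-based weighting):

```python
import random

def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def kmeans_assign(X, k, iters=20):
    """Plain Lloyd's k-means; returns one cluster index per sample.
    Deterministic first-k initialization keeps the sketch reproducible."""
    centers = [list(X[i]) for i in range(k)]
    assign = [0] * len(X)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: dist2(x, centers[c])) for x in X]
        for c in range(k):
            members = [x for x, a in zip(X, assign) if a == c]
            if members:
                centers[c] = [sum(m[d] for m in members) / len(members)
                              for d in range(len(members[0]))]
    return assign

def smote(minority, n_new, rng):
    """SMOTE's core step: interpolate between random minority pairs."""
    out = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        t = rng.random()
        out.append(tuple(p + t * (q - p) for p, q in zip(a, b)))
    return out

def ipf_filter(data, n_parts=3, max_rounds=5, rng=None):
    """IPF-style filter: split the data into partitions, treat each
    partition as a weak learner (1-NN here), drop samples that a
    majority of the learners misclassify, and repeat until stable."""
    rng = rng or random.Random(1)
    data = list(data)
    nn = lambda x, part: min(part, key=lambda s: dist2(x, s[0]))[1]
    for _ in range(max_rounds):
        rng.shuffle(data)
        parts = [data[i::n_parts] for i in range(n_parts)]
        kept = [(x, lbl) for x, lbl in data
                if 2 * sum(nn(x, p) == lbl for p in parts) > n_parts]
        if len(kept) == len(data):
            break
        data = kept
    return data

def ksipf(X, y, k=2, seed=0):
    """Cluster, oversample minority-dominated clusters, then filter."""
    rng = random.Random(seed)
    minority = min(set(y), key=y.count)
    n_needed = max(y.count(lbl) for lbl in set(y)) - y.count(minority)
    assign = kmeans_assign(X, k)
    eligible = []
    for c in range(k):
        members = [j for j, a in enumerate(assign) if a == c]
        mins = [X[j] for j in members if y[j] == minority]
        # oversample only clusters dominated by the minority class
        if len(mins) >= 2 and 2 * len(mins) > len(members):
            eligible.append(mins)
    total = sum(len(m) for m in eligible)
    synthetic = []
    for mins in eligible:
        synthetic += smote(mins, round(n_needed * len(mins) / total), rng)
    return ipf_filter(list(zip(X, y)) + [(s, minority) for s in synthetic],
                      rng=rng)

# Toy demo: 12 majority points near the origin, 4 minority points near (5, 5)
maj = [(i * 0.1, j * 0.1) for i in range(4) for j in range(3)]
mino = [(5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.2, 5.1)]
X = [maj[0], mino[0]] + maj[1:] + mino[1:]
y = [0, 1] + [0] * 11 + [1] * 3
balanced = ksipf(X, y, k=2)
```

On the toy data the eligible minority cluster receives 8 synthetic points, balancing the two classes at 12 samples each; because the blobs are well separated, the filter removes nothing here, but on noisy data the same majority vote would strip mislabeled samples, including synthetic ones generated near label noise.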
Data availability
No data sets were generated or analyzed during the current study.
Author information
Authors and Affiliations
Contributions
Pengfei Sun helped in conceptualization, methodology, software, validation, formal analysis, data curation, writing—original draft, writing—review & editing, and visualization. Zhiping Wang helped in conceptualization, methodology, writing—review & editing, and supervision. Liyan Jia helped in conceptualization, methodology, software, and writing—review & editing. Xiaoxi Wang helped in data curation and supervision.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Sun, P., Wang, Z., Jia, L. et al. KSIPF: an effective noise filtering oversampling method based on k-means and iterative-partitioning filter. J Supercomput 81, 596 (2025). https://doi.org/10.1007/s11227-025-07081-5