
KSIPF: an effective noise filtering oversampling method based on k-means and iterative-partitioning filter

Published in The Journal of Supercomputing

Abstract

The Synthetic Minority Oversampling Technique (SMOTE) is the benchmark method for class-imbalance learning. Since SMOTE was proposed, many variants have emerged, which fall into two types: pre-processing and post-processing. However, most pre-processing methods do not filter noisy samples, while post-processing methods pay no attention to the focus-area data. In this paper, we present KSIPF, an oversampling method based on k-means SMOTE and the Iterative-Partitioning Filter (IPF), which overcomes the shortcomings of both types. First, KSIPF uses k-means to cluster the data and selects the clusters to oversample; then, IPF removes the noisy samples from the data. KSIPF is compared with SMOTE and its variants on 30 synthetic and 20 real-world imbalanced data sets, and the balanced data sets are used to train SVM and AdaBoost classifiers to assess its effectiveness. The experimental results demonstrate that KSIPF outperforms the comparison methods in terms of area under the curve (AUC), F1-measure, and statistical tests.
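The two-stage pipeline described in the abstract (k-means-guided SMOTE oversampling, then iterative ensemble-based noise filtering) can be sketched in Python. This is a minimal illustration under simplifying assumptions, not the authors' implementation: cluster selection is reduced to size-proportional quotas, and the iterative-partitioning filter is approximated by repeatedly training a cross-validated decision tree and discarding samples whose predicted label disagrees with the given one.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict


def kmeans_smote(X_min, n_new, n_clusters=3, k=5, seed=0):
    """Stage 1 (sketch): cluster the minority class with k-means, then
    apply SMOTE-style interpolation inside each cluster. Cluster
    selection is simplified to quotas proportional to cluster size."""
    rng = np.random.default_rng(seed)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X_min)
    synthetic = []
    for c in range(n_clusters):
        Xc = X_min[km.labels_ == c]
        if len(Xc) < 2:
            continue  # cannot interpolate within a singleton cluster
        quota = int(round(n_new * len(Xc) / len(X_min)))
        for _ in range(quota):
            i = rng.integers(len(Xc))
            # nearest neighbours of Xc[i] within the same cluster
            d = np.linalg.norm(Xc - Xc[i], axis=1)
            nn = np.argsort(d)[1:k + 1]
            j = rng.choice(nn)
            gap = rng.random()  # random point on the segment Xc[i] -> Xc[j]
            synthetic.append(Xc[i] + gap * (Xc[j] - Xc[i]))
    return np.array(synthetic)


def ipf_filter(X, y, max_rounds=5, stop_frac=0.01, seed=0):
    """Stage 2 (sketch of an iterative-partitioning-style filter):
    repeatedly cross-validate a tree classifier and drop samples it
    mislabels, stopping when few samples are removed in a round."""
    X, y = X.copy(), y.copy()
    for _ in range(max_rounds):
        pred = cross_val_predict(
            DecisionTreeClassifier(random_state=seed), X, y, cv=5)
        keep = pred == y
        if (~keep).sum() <= stop_frac * len(y):
            break
        X, y = X[keep], y[keep]
    return X, y
```

A typical use would oversample the minority class to balance the data and then filter the combined set before training a classifier such as SVM or AdaBoost.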


Figures 1–6 and Algorithm 1 appear in the full text of the article.


Data availability

No data sets were generated or analyzed during the current study.


Author information

Authors and Affiliations

Authors

Contributions

Pengfei Sun helped in conceptualization, methodology, software, validation, formal analysis, data curation, writing—original draft, writing—review & editing, and visualization. Zhiping Wang helped in conceptualization, methodology, writing—review & editing, and supervision. Liyan Jia helped in conceptualization, methodology, software, and writing—review & editing. Xiaoxi Wang helped in data curation and supervision.

Corresponding authors

Correspondence to Zhiping Wang or Xiaoxi Wang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article


Cite this article

Sun, P., Wang, Z., Jia, L. et al. KSIPF: an effective noise filtering oversampling method based on k-means and iterative-partitioning filter. J Supercomput 81, 596 (2025). https://doi.org/10.1007/s11227-025-07081-5
