
OALDPC: oversampling approach based on local density peaks clustering for imbalanced classification

Published in Applied Intelligence

Abstract

SMOTE has been favored by researchers for improving imbalanced classification. Nevertheless, imbalance within minority classes and noise generation are two main challenges for SMOTE. Recently, clustering-based oversampling methods have been developed to improve SMOTE by eliminating imbalance within minority classes and/or avoiding noise generation. Yet they still suffer from the following challenges: a) some create more synthetic minority samples in large or high-density regions; b) most fail to remove noise from the training set; c) most rely heavily on more than one parameter; d) most cannot handle non-spherical data; e) almost all of the adopted clustering methods are poorly suited to class-imbalanced data. To overcome these issues of existing clustering-based oversampling methods, this paper proposes a novel oversampling approach based on local density peaks clustering (OALDPC). First, a novel local density peaks clustering (LDPC) is proposed to partition the class-imbalanced training set into separate sub-clusters of different sizes and densities. Second, a novel LDPC-based noise filter is proposed to identify and remove suspicious noise from the class-imbalanced training set. Third, a novel sampling weight is proposed, calculated by weighing the sample number and density of each minority-class sub-cluster. Fourth, a novel interpolation method based on the sampling weight and LDPC is proposed to create more synthetic minority-class samples in sparser minority-class regions. Extensive experiments demonstrate that OALDPC outperforms eight state-of-the-art oversampling techniques in improving the F-measure and G-mean of Random Forest, Neural Network and XGBoost on synthetic data and a wide range of real benchmark data sets from industrial applications.
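The abstract compresses OALDPC into four steps: density-peaks-style clustering, noise filtering, a per-sub-cluster sampling weight, and weighted interpolation. The Python sketch below illustrates two of those ideas, the clustering and the density-weighted oversampling, under explicit assumptions that are mine rather than the paper's: local density is taken as the inverse mean distance to the k nearest neighbors, cluster peaks are picked with a simple percentile cutoff on the distance-to-nearest-denser-point quantity (the decision-graph heuristic of classical density peaks clustering), and the sampling weight is the inverse of sub-cluster size times mean density. The paper's exact LDPC, noise filter and weighting differ, so treat this only as an orientation aid.

```python
import numpy as np

def ldpc_like_clusters(X, k=5, peak_pct=90):
    """A minimal density-peaks-style clustering sketch (not the paper's
    exact LDPC). Density = inverse mean distance to the k nearest
    neighbors; points unusually far from any denser point are peaks."""
    n = X.shape[0]
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    knn = np.sort(d, axis=1)[:, 1:k + 1]            # drop self-distance
    rho = 1.0 / (knn.mean(axis=1) + 1e-12)          # local density
    order = np.argsort(-rho)                        # densest first
    parent = np.full(n, -1)
    delta = np.empty(n)
    delta[order[0]] = d[order[0]].max()             # global density peak
    for rank in range(1, n):
        i, higher = order[rank], order[:rank]
        j = higher[np.argmin(d[i, higher])]         # nearest denser point
        parent[i], delta[i] = j, d[i, j]
    # percentile cutoff is a stand-in for the paper's peak selection
    parent[delta > np.percentile(delta, peak_pct)] = -1
    labels, next_lab = np.full(n, -1), 0
    for i in order:                                 # parents labeled first
        if parent[i] == -1:
            labels[i], next_lab = next_lab, next_lab + 1
        else:
            labels[i] = labels[parent[i]]
    return labels, rho

def oversample_sparse_first(X_min, labels, rho, n_new, seed=0):
    """Weight each minority sub-cluster inversely to (size * mean density),
    so sparser and smaller regions receive more synthetic points, then
    interpolate within each sub-cluster (SMOTE-style line segments)."""
    rng = np.random.default_rng(seed)
    clusters = np.unique(labels)
    w = np.array([1.0 / (np.sum(labels == c) * rho[labels == c].mean())
                  for c in clusters])
    quota = np.round(n_new * w / w.sum()).astype(int)
    out = []
    for c, q in zip(clusters, quota):
        members = X_min[labels == c]
        if len(members) < 2:
            continue                                # nothing to interpolate
        for _ in range(q):
            a, b = members[rng.choice(len(members), 2, replace=False)]
            out.append(a + rng.random() * (b - a))
    return np.vstack(out) if out else np.empty((0, X_min.shape[1]))
```

A minimal usage pattern under these assumptions: run ldpc_like_clusters on the minority-class samples only, then request enough synthetic points to balance the classes, e.g. X_syn = oversample_sparse_first(X_min, labels, rho, n_maj - n_min). The inverse size-times-density weight is what steers generation toward sparse minority regions instead of amplifying already dense ones.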


Data availability

The datasets and third-party libraries used in the experiments are open source and accessible online (http://archive.ics.uci.edu/ml/datasets.php).


Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grant 62006029, the Postdoctoral Innovative Talent Support Program of Chongqing under Grant CQBX2021024, the Natural Science Foundation of Chongqing under Grant CSTB2022NSCQMSX0258, and the Chongqing Municipal Education Commission (China) under Grant KJQN202001434.

Author information


Contributions

Junnan Li: Software, Conceptualization, Methodology, Formal analysis.

Qingsheng Zhu: Supervision.

Corresponding author

Correspondence to Junnan Li.

Ethics declarations

Ethical and informed consent for data used

The authors declare informed consent to publish and for the data used. This research involves neither human participants nor animals.

Competing Interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Li, J., Zhu, Q. OALDPC: oversampling approach based on local density peaks clustering for imbalanced classification. Appl Intell 53, 30987–31017 (2023). https://doi.org/10.1007/s10489-023-05030-4

