
OALDPC: oversampling approach based on local density peaks clustering for imbalanced classification

Published in Applied Intelligence

Abstract

SMOTE has been favored by researchers for improving imbalanced classification. Nevertheless, imbalance within minority classes and noise generation are two main challenges for SMOTE. Recently, clustering-based oversampling methods have been developed to improve SMOTE by eliminating imbalance within minority classes and/or avoiding noise generation. Yet they still suffer from the following challenges: a) some create more synthetic minority samples in large or high-density regions; b) most fail to remove noise from the training set; c) most rely heavily on more than one parameter; d) most cannot handle non-spherical data; e) almost all of the adopted clustering methods are poorly suited to class-imbalanced data. To overcome these issues of existing clustering-based oversampling methods, this paper proposes a novel oversampling approach based on local density peaks clustering (OALDPC). First, a novel local density peaks clustering (LDPC) is proposed to partition the class-imbalanced training set into separate sub-clusters of different sizes and densities. Second, a novel LDPC-based noise filter is proposed to identify and remove suspicious noise from the class-imbalanced training set. Third, a novel sampling weight is proposed, calculated by weighing the sample number and density of each minority-class sub-cluster. Fourth, a novel interpolation method based on the sampling weight and LDPC is proposed to create more synthetic minority-class samples in sparser minority-class regions. Extensive experiments demonstrate that OALDPC outperforms eight state-of-the-art oversampling techniques in improving the F-measure and G-mean of Random Forest, Neural Network and XGBoost on synthetic data and a wide range of real benchmark data sets from industrial applications.
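The abstract compresses OALDPC into four steps: density-peaks-style clustering, noise filtering, a per-sub-cluster sampling weight, and weighted interpolation. The Python sketch below illustrates two of those ideas, the clustering and the density-weighted oversampling, under explicit assumptions that are mine rather than the paper's: local density is taken as the inverse mean distance to the k nearest neighbors, cluster peaks are picked with a simple percentile cutoff on the distance-to-nearest-denser-point quantity (the decision-graph heuristic of classical density peaks clustering), and the sampling weight is the inverse of sub-cluster size times mean density. The paper's exact LDPC, noise filter and weighting differ, so treat this only as an orientation aid.

```python
import numpy as np

def ldpc_like_clusters(X, k=5, peak_pct=90):
    """A minimal density-peaks-style clustering sketch (not the paper's
    exact LDPC). Density = inverse mean distance to the k nearest
    neighbors; points unusually far from any denser point are peaks."""
    n = X.shape[0]
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    knn = np.sort(d, axis=1)[:, 1:k + 1]            # drop self-distance
    rho = 1.0 / (knn.mean(axis=1) + 1e-12)          # local density
    order = np.argsort(-rho)                        # densest first
    parent = np.full(n, -1)
    delta = np.empty(n)
    delta[order[0]] = d[order[0]].max()             # global density peak
    for rank in range(1, n):
        i, higher = order[rank], order[:rank]
        j = higher[np.argmin(d[i, higher])]         # nearest denser point
        parent[i], delta[i] = j, d[i, j]
    # percentile cutoff is a stand-in for the paper's peak selection
    parent[delta > np.percentile(delta, peak_pct)] = -1
    labels, next_lab = np.full(n, -1), 0
    for i in order:                                 # parents labeled first
        if parent[i] == -1:
            labels[i], next_lab = next_lab, next_lab + 1
        else:
            labels[i] = labels[parent[i]]
    return labels, rho

def oversample_sparse_first(X_min, labels, rho, n_new, seed=0):
    """Weight each minority sub-cluster inversely to (size * mean density),
    so sparser and smaller regions receive more synthetic points, then
    interpolate within each sub-cluster (SMOTE-style line segments)."""
    rng = np.random.default_rng(seed)
    clusters = np.unique(labels)
    w = np.array([1.0 / (np.sum(labels == c) * rho[labels == c].mean())
                  for c in clusters])
    quota = np.round(n_new * w / w.sum()).astype(int)
    out = []
    for c, q in zip(clusters, quota):
        members = X_min[labels == c]
        if len(members) < 2:
            continue                                # nothing to interpolate
        for _ in range(q):
            a, b = members[rng.choice(len(members), 2, replace=False)]
            out.append(a + rng.random() * (b - a))
    return np.vstack(out) if out else np.empty((0, X_min.shape[1]))
```

A minimal usage pattern under these assumptions: run ldpc_like_clusters on the minority-class samples only, then request enough synthetic points to balance the classes, e.g. X_syn = oversample_sparse_first(X_min, labels, rho, n_maj - n_min). The inverse size-times-density weight is what steers generation toward sparse minority regions instead of amplifying already dense ones.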


Data availability

The datasets and third-party libraries used in the experiments are open source and accessible online (http://archive.ics.uci.edu/ml/datasets.php).


Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grant 62006029, the Postdoctoral Innovative Talent Support Program of Chongqing under Grant CQBX2021024, the Natural Science Foundation of Chongqing under Grant CSTB2022NSCQMSX0258, and the Chongqing Municipal Education Commission (China) under Grant KJQN202001434.

Author information


Contributions

Junnan Li: Software, Conceptualization, Methodology, Formal analysis.

Qingsheng Zhu: Supervision.

Corresponding author

Correspondence to Junnan Li.

Ethics declarations

Ethical and informed consent for data used

The authors declare informed consent to publish and for the data used. This research involves neither human participants nor animals.

Competing Interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Li, J., Zhu, Q. OALDPC: oversampling approach based on local density peaks clustering for imbalanced classification. Appl Intell 53, 30987–31017 (2023). https://doi.org/10.1007/s10489-023-05030-4

