Abstract
Learning a classifier from class-imbalanced data is an important challenge. Among existing solutions, SMOTE has received wide praise and enjoys an extensive range of practical applications. However, SMOTE and its extensions often degrade due to noise generation and within-class imbalance. Although many SMOTE variants have been developed, few of them address both problems at the same time, and many improvements rely on advanced models that introduce external parameters. To address between-class and within-class imbalance while avoiding noise generation, a novel synthetic minority oversampling technique based on relative and absolute densities (SMOTE-RD) is proposed. First, a novel noise filter based on relative density is proposed to remove noise and smooth the class boundary. Second, sparsity and boundary weights are proposed, calculated from relative and absolute densities, respectively. Third, normalized weights combining the sparsity and boundary weights are proposed to generate more synthetic minority-class samples in boundary and sparse regions. The main advantages of the proposed algorithm are that: (a) it effectively avoids noise generation while removing noise and smoothing the class boundary in the original data; (b) it generates more synthetic samples in class-boundary and sparse regions; (c) no additional parameters are introduced. Extensive experiments show that SMOTE-RD outperforms 7 popular oversampling methods in average AUC, average F-measure, and average G-mean on real data sets at an acceptable time cost.
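To make the sparsity-weighting idea concrete, the following is a minimal sketch of SMOTE-style interpolation in which seed points are sampled with probability inversely proportional to a k-nearest-neighbor relative-density estimate, so sparse minority regions receive more synthetic samples. The density definition, function names, and parameters here are illustrative assumptions, not the paper's exact formulation; in particular, the boundary weight and the noise-filtering step of SMOTE-RD are omitted.

```python
import numpy as np

def relative_density(X, k=5):
    """Density proxy: inverse mean distance to the k nearest neighbors,
    normalized by its average over all points (an assumed definition)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                  # exclude self-distance
    knn_mean = np.sort(d, axis=1)[:, :k].mean(axis=1)
    dens = 1.0 / (knn_mean + 1e-12)
    return dens / dens.mean()

def density_weighted_smote(X_min, n_new, k=3, seed=None):
    """Generate n_new synthetic minority samples by SMOTE interpolation,
    choosing seed points with probability proportional to 1/density
    (the sparsity-weighting idea only)."""
    rng = np.random.default_rng(seed)
    k_eff = min(k, len(X_min) - 1)
    dens = relative_density(X_min, k=k_eff)
    w = 1.0 / dens
    w /= w.sum()                                 # sampling distribution over seeds
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn_idx = np.argsort(d, axis=1)[:, :k_eff]    # k nearest minority neighbors
    out = []
    for _ in range(n_new):
        i = rng.choice(len(X_min), p=w)          # sparse points chosen more often
        j = rng.choice(nn_idx[i])                # random neighbor of the seed
        gap = rng.random()                       # interpolation factor in [0, 1)
        out.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(out)
```

Because each synthetic point is a convex combination of two minority samples, all generated points stay inside the bounding box of the minority class; the isolated point far from the cluster is low-density and therefore seeds disproportionately many synthetic samples.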
Data availability
The datasets and third-party libraries used in the experiments are open source and accessible online (http://archive.ics.uci.edu/ml/datasets.php).
Code availability
The source code is available at https://github.com/liurj2021/SMOTERDCodes.git
Cite this article
Liu, R. A novel synthetic minority oversampling technique based on relative and absolute densities for imbalanced classification. Appl Intell 53, 786–803 (2023). https://doi.org/10.1007/s10489-022-03512-5