
CCR-GSVM: A boundary data generation algorithm for support vector machine in imbalanced majority noise problem


Abstract

In imbalanced data classification, training a classification model with synthetic samples can effectively improve performance in mining minority samples. However, the majority class is more likely to produce noise samples than the minority class. Majority noise samples disturb the data generation algorithm and make it difficult to generate synthetic boundary data for the minority class. This paper proposes a boundary synthetic data generation algorithm called CCR-GSVM, based on Combined Cleaning and Resampling (CCR) and Granular Support Vector Machine with repetitive undersampling (GSVM-RU). CCR-GSVM combines the boundary information of SVM and GSVM-RU to filter out majority noise samples, so that CCR generates synthetic data from more reliable majority samples. Synthetic samples located on the margin boundary are retained only if they improve classification performance. Comparative experiments on 12 imbalanced datasets show that the boundary data generated by CCR-GSVM helps the support vector machine improve F1-measure and G-mean under majority noise, indicating that CCR-GSVM generates synthetic boundary samples more effectively.
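The abstract outlines a three-stage pipeline: use the boundary information of SVM and GSVM-RU to filter majority-class noise, generate synthetic minority data with CCR, and keep only the synthetic samples on the margin boundary when they improve performance. The minimal sketch below illustrates that flow with scikit-learn only; it is not the authors' implementation. The helper names, the repetitive-undersampling loop, the margin test (|decision value| <= 1), and the acceptance check are assumptions for exposition, and binary labels are assumed with the majority class encoded as 0 and the minority as 1.

```python
# Illustrative sketch of the CCR-GSVM workflow described in the abstract.
# Not the authors' code; helper names and heuristics are assumptions.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import recall_score, f1_score


def g_mean(y_true, y_pred):
    # Geometric mean of per-class recalls (the G-mean reported in the paper).
    rec = recall_score(y_true, y_pred, average=None)
    return float(np.sqrt(np.prod(rec)))


def filter_majority_noise(X, y, majority=0, rounds=3):
    # GSVM-RU-style step (sketch): repeatedly train an SVM, collect the
    # majority-class support vectors ("negative granules"), and remove them
    # from the pool so later rounds expose new boundary samples. Majority
    # samples that never become support vectors are treated as redundant or
    # noisy and are dropped from the training set.
    pool = np.arange(len(y))
    granules = []
    for _ in range(rounds):
        clf = SVC(kernel="rbf", gamma="scale").fit(X[pool], y[pool])
        sv = pool[clf.support_]              # map back to original indices
        neg_sv = sv[y[sv] == majority]
        if neg_sv.size == 0:
            break
        granules.append(neg_sv)
        pool = np.setdiff1d(pool, neg_sv)
    kept_majority = np.concatenate(granules) if granules else np.array([], dtype=int)
    minority_idx = np.where(y != majority)[0]
    keep = np.concatenate([kept_majority, minority_idx])
    return X[keep], y[keep]


def add_boundary_synthetics(X, y, X_syn, y_syn, metric=f1_score):
    # Keep only synthetic minority samples that fall inside the SVM margin,
    # and accept them only if the chosen metric does not degrade.
    base = SVC(kernel="rbf").fit(X, y)
    in_margin = np.abs(base.decision_function(X_syn)) <= 1.0
    X_aug = np.vstack([X, X_syn[in_margin]])
    y_aug = np.concatenate([y, y_syn[in_margin]])
    cand = SVC(kernel="rbf").fit(X_aug, y_aug)
    if metric(y, cand.predict(X)) >= metric(y, base.predict(X)):
        return X_aug, y_aug
    return X, y
```

In practice, X_syn and y_syn would come from a CCR-style cleaning-and-resampling step applied to the filtered data, and the F1-measure or G-mean acceptance test would be evaluated on a held-out validation split rather than on the training data as in this simplified sketch.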


References

1. Nami S, Shajari M (2018) Cost-sensitive payment card fraud detection based on dynamic random forest and k-nearest neighbors. Expert Syst Appl 110:381–392

2. Prati RC, Luengo J, Herrera F (2019) Emerging topics and challenges of learning from noisy data in nonstandard classification: a survey beyond binary class noise. Knowl Inf Syst 60(1):63–97

3. Nematzadeh Z, Ibrahim R, Selamat A (2020) Improving class noise detection and classification performance: a new two-filter CNDC model. Appl Soft Comput 94:106428

4. Sabzevari M, Martínez-Muñoz G, Suárez A (2018) A two-stage ensemble method for the detection of class-label noise. Neurocomputing 275:2374–2383

5. Hazarika BB, Gupta D (2021) Density-weighted support vector machines for binary class imbalance learning. Neural Comput Applic 33(9):4243–4261

6. Richhariya B, Tanveer M (2020) A reduced universum twin support vector machine for class imbalance learning. Pattern Recogn 102:107150

7. Yu S, Li X, Zhang X, Wang H (2019) The OCS-SVM: an objective-cost-sensitive SVM with sample-based misclassification cost invariance. IEEE Access 7:118931–118942

8. Wei J, Huang H, Yao L, Hu Y, Fan Q, Huang D (2021) New imbalanced bearing fault diagnosis method based on sample-characteristic oversampling technique (SCOTE) and multi-class LS-SVM. Appl Soft Comput 101:107043

9. Wang Q, Luo Z, Huang J, Feng Y, Liu Z (2017) A novel ensemble method for imbalanced data learning: bagging of extrapolation-SMOTE SVM. Comput Intell Neurosci 2017

10. Koziarski M, Woźniak M (2017) CCR: a combined cleaning and resampling algorithm for imbalanced data classification. Int J Appl Math Comput Sci 27(4)

11. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357

12. Koziarski M, Woźniak M, Krawczyk B (2020) Combined cleaning and resampling algorithm for multi-class imbalanced data with label noise. Knowl-Based Syst 204:106223

13. Tang Y, Zhang Y (2006) Granular SVM with repetitive undersampling for highly imbalanced protein homology prediction. In: IEEE International Conference on Granular Computing

14. Li M, Xiong A, Wang L, Deng S, Ye J (2020) ACO resampling: enhancing the performance of oversampling methods for class imbalance classification. Knowl-Based Syst 196:105818

15. Elreedy D, Atiya AF (2019) A comprehensive analysis of synthetic minority oversampling technique (SMOTE) for handling class imbalance. Inf Sci 505:32–64

16. Verbiest N, Ramentol E, Cornelis C, Herrera F (2012) Improving SMOTE with fuzzy rough prototype selection to detect noise in imbalanced classification data. In: Ibero-American Conference on Artificial Intelligence, pp 169–178. Springer

17. Sui Y, Wei Y, Zhao D (2015) Computer-aided lung nodule recognition by SVM classifier based on combination of random undersampling and SMOTE. Comput Math Methods Med 2015

18. Li J, Zhu Q, Wu Q, Zhang Z, Gong Y, He Z, Zhu F (2021) SMOTE-NaN-DE: addressing the noisy and borderline examples problem in imbalanced classification by natural neighbors and differential evolution. Knowl-Based Syst 223:107056

19. Chen B, Xia S, Chen Z, Wang B, Wang G (2021) RSMOTE: a self-adaptive robust SMOTE for imbalanced problems with label noise. Inf Sci 553:397–428

20. Liang X, Jiang A, Li T, Xue Y, Wang G (2020) LR-SMOTE: an improved unbalanced data set oversampling based on k-means and SVM. Knowl-Based Syst 196:105845

21. Wang CR, Shao XH (2020) An improving majority weighted minority oversampling technique for imbalanced classification problem. IEEE Access 9:5069–5082

22. Sáez JA, Luengo J, Stefanowski J, Herrera F (2015) SMOTE-IPF: addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering. Inf Sci 291:184–203

23. Rivera WA (2017) Noise reduction a priori synthetic over-sampling for class imbalanced data sets. Inf Sci 408:146–161

24. Vo MT, Nguyen T, Vo HA, Le T (2021) Noise-adaptive synthetic oversampling technique. Appl Intell, pp 1–10

25. Wei J, Huang H, Yao L, Hu Y, Fan Q, Huang D (2020) NI-MWMOTE: an improving noise-immunity majority weighted minority oversampling technique for imbalanced classification problems. Expert Syst Appl 158:113504

26. Ramentol E, Caballero Y, Bello R, Herrera F (2012) SMOTE-RSB*: a hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory. Knowl Inf Syst 33(2):245–265

27. Cheng K, Zhang C, Yu H, Yang X, Zou H, Gao S (2019) Grouped SMOTE with noise filtering mechanism for classifying imbalanced data. IEEE Access 7:170668–170681

28. Lee W, Jun CH, Lee JS (2017) Instance categorization by support vector machines to adjust weights in AdaBoost for imbalanced data classification. Inf Sci 381:92–103

29. Garcia L, Lehmann J, de Carvalho AC, Lorena AC (2019) New label noise injection methods for the evaluation of noise filters. Knowl-Based Syst 163:693–704

30. Kovács G (2019) smote-variants: a Python implementation of 85 minority oversampling techniques. Neurocomputing 366:352–354

31. Lemaître G, Nogueira F, Aridas CK (2017) Imbalanced-learn: a Python toolbox to tackle the curse of imbalanced datasets in machine learning. J Mach Learn Res 18(1):559–563

32. Bej S, Davtyan N, Wolfien M, Nassar M, Wolkenhauer O (2021) LoRAS: an oversampling approach for imbalanced datasets. Mach Learn 110(2):1–23

33. Douzas G, Bacao F, Last F (2018) Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE. Inf Sci 465:1–20

34. Douzas G, Bacao F (2019) Geometric SMOTE: a geometrically enhanced drop-in replacement for SMOTE. Inf Sci 501:118–135

35. Guan H, Zhang Y, Xian M, Cheng HD, Tang X (2020) SMOTE-WENN: solving class imbalance and small sample problems by oversampling and distance scaling. Appl Intell (4)


Author information


Corresponding author

Correspondence to Kai Huang.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Huang, K., Wang, X. CCR-GSVM: A boundary data generation algorithm for support vector machine in imbalanced majority noise problem. Appl Intell 53, 1192–1204 (2023). https://doi.org/10.1007/s10489-022-03408-4

