
Perturbation-based oversampling technique for imbalanced classification problems

  • Original Article
  • Published in International Journal of Machine Learning and Cybernetics

Abstract

We present a simple yet effective idea, perturbation-based oversampling (POS), for tackling imbalanced classification problems. The method perturbs each feature of a given minority instance to generate a new instance. The originality and advantage of POS lie in a hyperparameter p that controls the variance of the perturbation, providing the flexibility to adapt the algorithm to data with different characteristics. Experimental results obtained with five types of classifiers and 11 performance metrics on 103 imbalanced datasets show that POS offers results comparable to or better than those of 11 reference methods across multiple performance metrics. An important finding of this work is that a simple perturbation-based oversampling method can yield better classification results than many advanced oversampling methods by controlling the variance of the input perturbation. This suggests that comparisons with simple oversampling methods such as POS should be conducted when designing new oversampling approaches.
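The abstract's core idea (perturb each feature of a sampled minority instance, with noise variance governed by a hyperparameter p) can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `pos_oversample` and the choice to scale per-feature Gaussian noise by p times the minority class's feature standard deviations are assumptions for illustration; the paper's exact parameterization may differ.

```python
import numpy as np

def pos_oversample(X_min, n_new, p=0.1, random_state=0):
    """Sketch of perturbation-based oversampling.

    Each synthetic instance is a randomly chosen minority instance with
    Gaussian noise added to every feature. The hyperparameter p scales the
    per-feature standard deviation of the noise (an assumed
    parameterization, not necessarily the paper's).
    """
    rng = np.random.default_rng(random_state)
    X_min = np.asarray(X_min, dtype=float)
    stds = X_min.std(axis=0)                # per-feature spread of the minority class
    idx = rng.integers(0, len(X_min), size=n_new)   # seed instances to perturb
    noise = rng.normal(0.0, p * stds, size=(n_new, X_min.shape[1]))
    return X_min[idx] + noise

# Example: grow a toy minority class of 5 points by 10 synthetic points
X_min = np.array([[0.0, 1.0], [0.2, 0.9], [0.1, 1.1], [0.3, 1.0], [0.0, 0.8]])
X_new = pos_oversample(X_min, n_new=10, p=0.1)
print(X_new.shape)  # (10, 2)
```

A small p keeps synthetic points close to real minority instances; a larger p spreads them further, which is the knob the abstract credits for adapting the method to data with different characteristics.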



Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grant 61876066, in part by Guangdong Province Science and Technology Plan Project (Collaborative Innovation and Platform Environment Construction) 2019A050510006, in part by Science and Technology Program of Guangzhou under Grant SL2023A04J01464, in part by China Postdoctoral Science Foundation under Grant 2021M700930, and in part by Guangzhou Postdoctoral Research Foundation under Grant BHSKY20211204. Support from the Canada Research Chair (CRC) is fully acknowledged.

Author information

Authors and Affiliations

Authors

Contributions

Conceptualization, methodology, writing—original draft preparation: JZ; Formal analysis and investigation: TW; writing—review and editing: WP; funding acquisition: TW, WWYN; resources, supervision: WWYN.

Corresponding authors

Correspondence to Ting Wang or Wing W. Y. Ng.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (PDF 264 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Zhang, J., Wang, T., Ng, W.W.Y. et al. Perturbation-based oversampling technique for imbalanced classification problems. Int. J. Mach. Learn. & Cyber. 14, 773–787 (2023). https://doi.org/10.1007/s13042-022-01662-z

