Abstract
We present a simple yet effective idea, perturbation-based oversampling (POS), for tackling imbalanced classification problems. The method generates a new instance by perturbing each feature of a given minority instance. The novelty and advantage of POS is a hyperparameter p that controls the variance of the perturbation, which gives the algorithm the flexibility to adapt to data with different characteristics. Experiments with five types of classifiers and 11 performance metrics on 103 imbalanced datasets show that POS yields results comparable to or better than those of 11 reference methods on multiple performance metrics. An important finding of this work is that a simple perturbation-based oversampling method can yield better classification results than many advanced oversampling methods by controlling the variance of the input perturbation. This suggests that new oversampling approaches should also be compared against simple oversampling methods such as POS.
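The idea of perturbing each feature of a minority instance can be illustrated with a minimal sketch. The abstract does not specify the noise distribution or how p maps to the variance, so the code below makes two assumptions of its own: the noise is zero-mean Gaussian, and its standard deviation for each feature is p times that feature's standard deviation within the minority class. The function name `perturbation_oversample` and its parameters are hypothetical and are not taken from the paper.

```python
import random
import statistics


def perturbation_oversample(minority, n_new, p=0.5, seed=0):
    """Create n_new synthetic rows by perturbing randomly chosen minority rows.

    ASSUMPTION (not stated in the abstract): the noise is zero-mean Gaussian
    with standard deviation p * (per-feature std of the minority class).
    """
    rng = random.Random(seed)
    n_features = len(minority[0])
    # Per-feature spread of the minority class (0.0 for a constant feature).
    spreads = [
        statistics.pstdev([row[j] for row in minority])
        for j in range(n_features)
    ]
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)  # minority instance to perturb
        # Perturb every feature independently; p scales the noise.
        synthetic.append(
            [x + rng.gauss(0.0, p * s) for x, s in zip(base, spreads)]
        )
    return synthetic


if __name__ == "__main__":
    minority = [[1.0, 10.0], [2.0, 12.0], [3.0, 11.0]]
    for row in perturbation_oversample(minority, n_new=3, p=0.5):
        print(row)
```

In this sketch, p = 0 reduces to plain random oversampling (exact copies of minority instances), while larger values of p spread the synthetic instances further from their originals. That is one way to read the flexibility the abstract attributes to the hyperparameter.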
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grant 61876066, in part by Guangdong Province Science and Technology Plan Project (Collaborative Innovation and Platform Environment Construction) 2019A050510006, in part by Science and Technology Program of Guangzhou under Grant SL2023A04J01464, in part by China Postdoctoral Science Foundation under Grant 2021M700930, and in part by Guangzhou Postdoctoral Research Foundation under Grant BHSKY20211204. Support from the Canada Research Chair (CRC) is fully acknowledged.
Author information
Contributions
Conceptualization, methodology, and writing (original draft preparation): JZ; formal analysis and investigation: TW; writing (review and editing): WP; funding acquisition: TW, WWYN; resources and supervision: WWYN.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, J., Wang, T., Ng, W.W.Y. et al. Perturbation-based oversampling technique for imbalanced classification problems. Int. J. Mach. Learn. & Cyber. 14, 773–787 (2023). https://doi.org/10.1007/s13042-022-01662-z
DOI: https://doi.org/10.1007/s13042-022-01662-z