Abstract
Imbalanced classification has long been a popular research topic in machine learning, data mining, and pattern recognition. Many techniques exist to mitigate the negative impact of class imbalance on classification performance, and oversampling is the most commonly used. In this paper, we examine the relationship between the imbalance ratio and classification performance during oversampling from a novel perspective: oversampling may distort the original data distribution even as it strengthens the minority class. We further propose a novel cross-validation framework, called "icross-validation", that can be used during sampling to find a better state than the balanced one. The framework is general and can be combined with various oversampling methods. Experimental results on real data sets, compared against several state-of-the-art and widely used oversampling methods, demonstrate the effectiveness of icross-validation. All code has been released in the open-source icross-validation library at https://github.com/syxiaa/icross-valiation.
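The core idea of searching for a better sampling state than the balanced one can be sketched as follows. This is a minimal illustration, not the paper's icross-validation algorithm: it uses simple random duplication of minority samples (in place of a sophisticated oversampler) and a plain stratified cross-validation loop to score several candidate minority/majority ratios, including ratios below 1.0 (the balanced state). The helper names `oversample` and `cv_score_at_ratio` are our own, hypothetical constructs.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

def oversample(X, y, ratio, rng):
    """Randomly duplicate minority samples until minority/majority reaches `ratio`."""
    counts = np.bincount(y)
    maj, mino = counts.argmax(), counts.argmin()
    n_target = int(ratio * counts[maj])
    idx_min = np.where(y == mino)[0]
    n_extra = max(0, n_target - len(idx_min))
    extra = rng.choice(idx_min, size=n_extra, replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])

def cv_score_at_ratio(X, y, ratio, seed=0):
    """Mean F1 over stratified folds, oversampling only the training fold
    to avoid leaking synthetic copies into the test fold."""
    rng = np.random.default_rng(seed)
    scores = []
    for tr, te in StratifiedKFold(5, shuffle=True, random_state=seed).split(X, y):
        Xtr, ytr = oversample(X[tr], y[tr], ratio, rng)
        clf = DecisionTreeClassifier(random_state=seed).fit(Xtr, ytr)
        scores.append(f1_score(y[te], clf.predict(X[te])))
    return float(np.mean(scores))

# Toy imbalanced data set: class 1 is the minority (~10%).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Sweep candidate ratios; ratio 1.0 corresponds to the fully balanced state.
candidates = [0.2, 0.4, 0.6, 0.8, 1.0]
best_ratio = max(candidates, key=lambda r: cv_score_at_ratio(X, y, r))
print("best minority/majority ratio:", best_ratio)
```

Whichever ratio wins the cross-validated score is the "state" kept for final training; the point of the sketch is that nothing forces the winner to be 1.0.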
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grant Nos. 62176033 and 61936001, the Key Cooperation Project of the Chongqing Municipal Education Commission under Grant No. HZ2021008, the Natural Science Foundation of Chongqing under Grant No. cstc2019jcyj-cxttX0002, and the National Key Research and Development Program of China under Grant No. 2019QY(Y)0301.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Below is the link to the electronic supplementary material.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Dai, Q., Li, D. & Xia, S. A cross-validation framework to find a better state than the balanced one for oversampling in imbalanced classification. Int. J. Mach. Learn. & Cyber. 14, 2877–2886 (2023). https://doi.org/10.1007/s13042-023-01804-x
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s13042-023-01804-x