Abstract
Most real-world datasets are imbalanced. Learning from data in which one class (the minority) has far fewer samples than another (the majority) produces classifiers that are biased toward the majority class. On such datasets, overall prediction accuracy is often above 90%, yet accuracy on the minority class is considerably lower. In this paper, we first propose a new under-sampling technique for the majority class of imbalanced datasets, based on the Peak clustering method. We then propose a novel boosting-based algorithm for learning from imbalanced datasets, named OUBoost, which combines the proposed Peak under-sampling algorithm with the SMOTE over-sampling technique inside the boosting procedure. In OUBoost, misclassified examples are not given equal weights: the algorithm selects useful examples from the majority class and creates synthetic examples for the minority class, thereby updating the sample weights indirectly. We designed experiments on 30 real-world imbalanced datasets using several evaluation metrics, including Recall, MCC, G-mean, and F-score. The results show improved prediction performance on the minority class for most of the datasets. We further report time comparisons and statistical tests to analyze the proposed algorithm in more detail.
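The per-round resampling idea described above — density-peak-based under-sampling of the majority class combined with SMOTE over-sampling of the minority class — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `peak_undersample` scoring rule (density times separation, after Rodriguez and Laio's density-peaks method), and the kernel bandwidth `dc` are assumptions, and the boosting loop that OUBoost wraps around this resampling step is omitted.

```python
import numpy as np

def smote_sample(X_min, n_new, k=3, seed=0):
    """Minimal SMOTE: synthesize n_new points by interpolating between a
    random minority point and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    nn = np.argsort(d, axis=1)[:, 1:k + 1]          # k nearest, excluding self
    base = rng.integers(0, len(X_min), n_new)       # random minority anchors
    nbr = nn[base, rng.integers(0, k, n_new)]       # random neighbour of each
    gap = rng.random((n_new, 1))                    # interpolation coefficient
    return X_min[base] + gap * (X_min[nbr] - X_min[base])

def peak_undersample(X_maj, n_keep, dc=1.0):
    """Density-peak-style under-sampling (illustrative scoring rule):
    score each majority point by local density (rho) times distance to the
    nearest denser point (delta), and keep the n_keep highest-scoring,
    most representative points."""
    d = np.linalg.norm(X_maj[:, None, :] - X_maj[None, :, :], axis=2)
    rho = np.exp(-(d / dc) ** 2).sum(axis=1) - 1.0  # Gaussian local density
    delta = np.empty(len(X_maj))
    for i in range(len(X_maj)):
        denser = rho > rho[i]
        delta[i] = d[i, denser].min() if denser.any() else d[i].max()
    keep = np.argsort(rho * delta)[-n_keep:]        # top-scoring points
    return X_maj[keep]

# Demo: rebalance a 100-vs-10 toy dataset before one boosting round.
rng = np.random.default_rng(42)
X_maj = rng.normal(0.0, 1.0, (100, 2))              # majority class
X_min = rng.normal(3.0, 0.5, (10, 2))               # minority class
X_maj_red = peak_undersample(X_maj, n_keep=30)
X_min_aug = np.vstack([X_min, smote_sample(X_min, n_new=20)])
print(X_maj_red.shape, X_min_aug.shape)             # (30, 2) (30, 2)
```

In the full algorithm, a resampled training set like this would be fed to the weak learner at each boosting iteration, so the sample weights are updated indirectly through which majority examples are kept and which synthetic minority examples are created.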
Data Availability
We have used a set of general datasets and cite them in the manuscript.
Cite this article
Mostafaei, S.H., Tanha, J. OUBoost: boosting based over and under sampling technique for handling imbalanced data. Int. J. Mach. Learn. & Cyber. 14, 3393–3411 (2023). https://doi.org/10.1007/s13042-023-01839-0