
OUBoost: boosting based over and under sampling technique for handling imbalanced data

  • Original Article
  • Published: 2023
  • International Journal of Machine Learning and Cybernetics

Abstract

Most real-world datasets contain imbalanced data. Learning from datasets in which the number of samples in one class (the minority) is much smaller than in another class (the majority) produces classifiers biased toward the majority class: overall prediction accuracy can exceed 90% while accuracy on the minority class remains considerably lower. In this paper, we first propose a new technique for under-sampling the majority class of imbalanced datasets, based on the Peak clustering method. We then propose OUBoost, a novel boosting-based algorithm for learning from imbalanced datasets that combines the proposed Peak under-sampling algorithm with an over-sampling technique (SMOTE) inside the boosting procedure. In OUBoost, misclassified examples are not given equal weights: the algorithm selects useful examples from the majority class and creates synthetic examples for the minority class, thereby updating the sample weights indirectly. We designed experiments on 30 real-world imbalanced datasets using several evaluation metrics, such as Recall, MCC, G-mean, and F-score. The results show improved prediction performance on the minority class for most of the datasets used. We further report time comparisons and statistical tests to analyze the proposed algorithm in more detail.
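To make the procedure concrete, the sketch below shows how the pieces described above could fit together: in each boosting round the majority class is under-sampled around density peaks and the minority class is over-sampled with SMOTE before a weak learner is fit, so the effective sample weights are adjusted through resampling rather than by direct reweighting. This is a minimal sketch of the idea only, not the authors' implementation: the helper density_peak_undersample, the depth-2 decision tree, and the 2:1 under-sampling ratio are illustrative assumptions, the density-peak step is reduced to keeping the highest-density majority points, and the sampling here ignores the weights that the full hybrid-boosting algorithms sample by. It uses scikit-learn and imbalanced-learn.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE

def density_peak_undersample(X_maj, n_keep, dc=1.0):
    # Gaussian-kernel local density, as used in density-peak clustering;
    # keep the n_keep majority points with the highest density.
    d2 = ((X_maj[:, None, :] - X_maj[None, :, :]) ** 2).sum(-1)
    rho = np.exp(-d2 / dc ** 2).sum(axis=1)
    return X_maj[np.argsort(rho)[::-1][:n_keep]]

def ouboost_fit(X, y, rounds=10):
    # AdaBoost-style loop with per-round resampling (the SMOTEBoost/RUSBoost
    # template). Assumes binary labels y in {0, 1} and at least six minority
    # samples (the default SMOTE neighborhood size).
    n = len(y)
    w = np.full(n, 1.0 / n)            # weights live on the ORIGINAL data
    maj = 0 if (y == 0).sum() > (y == 1).sum() else 1
    mino = 1 - maj
    learners, alphas = [], []
    for _ in range(rounds):
        X_maj, X_min = X[y == maj], X[y == mino]
        X_keep = density_peak_undersample(X_maj, n_keep=2 * len(X_min))
        Xb = np.vstack([X_keep, X_min])
        yb = np.concatenate([np.full(len(X_keep), maj), np.full(len(X_min), mino)])
        Xb, yb = SMOTE().fit_resample(Xb, yb)    # synthesize minority examples
        h = DecisionTreeClassifier(max_depth=2).fit(Xb, yb)
        miss = h.predict(X) != y
        err = w[miss].sum()                      # weighted error on original data
        if err == 0 or err >= 0.5:
            break
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(np.where(miss, alpha, -alpha))
        w /= w.sum()                             # weights updated only indirectly
        learners.append(h)
        alphas.append(alpha)
    return learners, alphas

def ouboost_predict(X, learners, alphas):
    # Weighted vote of the per-round weak learners.
    votes = sum(a * np.where(h.predict(X) == 1, 1.0, -1.0)
                for h, a in zip(learners, alphas))
    return (votes > 0).astype(int)

On a training set (X, y) with binary labels, ouboost_fit(X, y) returns the weak learners and their vote weights, and ouboost_predict applies the weighted vote to new data; evaluation with Recall, MCC, G-mean, or F-score then proceeds on the original, unresampled test set.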



Data Availability

The 30 benchmark datasets used in this study are publicly available and are cited in the manuscript.


Author information

Correspondence to Jafar Tanha.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Mostafaei, S.H., Tanha, J. OUBoost: boosting based over and under sampling technique for handling imbalanced data. Int. J. Mach. Learn. & Cyber. 14, 3393–3411 (2023). https://doi.org/10.1007/s13042-023-01839-0

  • DOI: https://doi.org/10.1007/s13042-023-01839-0
