Abstract
Imbalanced datasets, i.e., datasets with an uneven distribution of samples among classes, cause training biases and degrade the performance of learning algorithms. Several solutions for handling data imbalance have been proposed in the past, but most of them focus on removing majority class instances, which leads to a loss of important information. An alternative strategy investigated in the literature is the generation of minority class samples. However, generating high-quality synthetic samples for the minority class remains an open problem. In this study, a fusion of the grey wolf optimizer (GWO) with the artificial bee colony (ABC) algorithm is proposed to generate representative samples of the minority class. This combination is analysed because GWO has good exploitation abilities, while ABC is good at exploration. The effectiveness of the proposed method is tested on 20 real-world benchmark datasets and on one real-life application, scam video classification on YouTube, using standard assessment indicators. The performance of the proposed method is compared against 18 state-of-the-art data imbalance handling methods using three classification algorithms: support vector machine (SVM), k-nearest neighbours (KNN) and decision tree (DT). Our experimental results show an improvement in G-mean score on 18 out of 20 datasets for SVM, with a maximum improvement of 8%, and on 17 out of 20 datasets for both KNN and DT, with maximum improvements of 10.7% and 6.3%, respectively. An improvement in AUC score is also seen on 17 out of 20 datasets for SVM and DT, with maximum improvements of 4.5% and 6%, respectively, and on 16 out of 20 datasets for KNN, with a maximum improvement of 7.7%. These results indicate that the proposed method is robust across the tested datasets and classifiers.
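To illustrate why G-mean is reported for imbalanced data, here is a minimal sketch. It assumes the standard definition of G-mean as the geometric mean of sensitivity and specificity, and binary labels with the minority class encoded as 1. The function names and the toy data are illustrative and are not taken from the paper.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity (assumed definition)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


# Hypothetical toy data: 90 majority samples (0) followed by 10 minority samples (1).
y_true = [0] * 90 + [1] * 10

# A classifier that always predicts the majority class.
always_majority = [0] * 100

# A classifier that misses 10 majority and 2 minority samples.
more_balanced = [0] * 80 + [1] * 10 + [1] * 8 + [0] * 2

for name, y_pred in [("always majority", always_majority),
                     ("more balanced", more_balanced)]:
    print(f"{name}: accuracy={accuracy(y_true, y_pred):.2f}, "
          f"g_mean={g_mean(y_true, y_pred):.2f}")
# always majority: accuracy=0.90, g_mean=0.00
# more balanced: accuracy=0.88, g_mean=0.84
```

On this toy data the majority-only classifier has the higher accuracy but a G-mean of zero, because it never detects the minority class. This is the kind of bias that imbalance-aware indicators such as G-mean and AUC are meant to expose.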




Notes
https://sci2s.ugr.es/keel/datasets.php
https://developers.google.com/youtube/v3/docs
Funding
The authors did not receive support from any organization for the submitted work.
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.
Human and animal resources
This article does not contain any studies with human participants or animals performed by any of the authors.
Informed consent
This research does not involve any participants from whom informed consent would be required. Hence, this statement is not applicable to this article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Bharti, K.K., Tripathi, A. & Ghosh, M. A fused grey wolf and artificial bee colony model for imbalanced data classification problems. Int J Syst Assur Eng Manag 15, 4085–4104 (2024). https://doi.org/10.1007/s13198-024-02412-w
DOI: https://doi.org/10.1007/s13198-024-02412-w