Abstract
Classifying imbalanced datasets is one of the key difficulties that prevent meaningful information from being extracted. In real-world data applications, it is common for instances of a class of primary concern to be overshadowed by instances of other classes. Most known methods are designed for binary imbalanced classification with well-defined minority and majority classes. Moreover, in frameworks that process data in a distributed manner, such as Apache Spark, data partitioning is not controlled, which produces chunks of data with corrupted spatial relationships. In this study, we propose an efficient method for multi-class classification of imbalanced large datasets in Apache Spark. It uses Random Sample Partitioning to distribute the dataset effectively among partitions, and block-level sampling for efficient data sampling. The method overcomes the spatial limitations of distributed environments by applying an improved version of the synthetic minority oversampling technique (SMOTE). Synthetic data points generated through the improved SMOTE ensure a sufficient number of minority-class instances relative to the entire dataset. Extensive experiments were carried out on an Apache Spark cluster using the Random Forest and Naive Bayes algorithms on four benchmark imbalanced datasets. With relatively limited data samples, the proposed methodology can effectively classify unknown data samples from large datasets.
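The classic SMOTE interpolation underlying the approach can be sketched as follows. This is a minimal single-node NumPy sketch of standard SMOTE (nearest-neighbour interpolation within the minority class), not the paper's improved, distributed variant; the function name and parameters are illustrative.

```python
import numpy as np

def smote_oversample(X_min, n_synthetic, k=5, rng=None):
    """Generate synthetic minority samples by interpolating each chosen
    sample toward one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbour
    # Indices of the k nearest neighbours of each minority sample.
    nn = np.argsort(d, axis=1)[:, :min(k, n - 1)]
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(n)                   # random minority sample
        j = nn[i, rng.integers(nn.shape[1])]  # one of its neighbours
        gap = rng.random()                    # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

Because each synthetic point lies on the segment between a minority sample and one of its neighbours, the generated data stays within the convex hull of the minority class; the distributed setting complicates this step, since a partition may not contain a sample's true nearest neighbours, which is what the paper's partitioning strategy addresses.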
Data availability
The datasets analyzed during the current study are available at:
a. https://archive.ics.uci.edu/ml/datasets/Covertype
b. https://archive.ics.uci.edu/ml/datasets/detection_of_IoT_botnet_attacks_N_BaIoT
c. https://www.seer.cancer.gov
d. https://catalog.data.gov/dataset/traffic-violations-56dda
Funding
No funding was received to assist with the preparation of this manuscript.
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Singh, T., Khanna, R., Satakshi et al. Improved multi-class classification approach for imbalanced big data on spark. J Supercomput 79, 6583–6611 (2023). https://doi.org/10.1007/s11227-022-04908-3