Improved multi-class classification approach for imbalanced big data on spark

Published in The Journal of Supercomputing

Abstract

The difficulty of classifying imbalanced datasets is one of the obstacles to extracting meaningful information from real-world data, where instances of a class of primary concern are frequently overshadowed by instances of other classes. Most known methods are designed for binary imbalanced classification with well-defined minority and majority classes. Moreover, in frameworks that process data in a distributed manner, such as Apache Spark, data partitioning is not controlled, which yields chunks of data with corrupted spatial relationships. In this study, we propose an efficient method for the multi-class classification of imbalanced large datasets in Apache Spark. It utilizes Random Sample Partitioning to distribute the dataset effectively among partitions, and block-level sampling to sample the data efficiently. The method overcomes the spatial limitations of distributed environments through an improved version of the synthetic minority oversampling technique (SMOTE): generating synthetic data points ensures a sufficient number of instances for the minority classes relative to the entire dataset. Extensive experiments were carried out on an Apache Spark cluster using the Random Forest and Naive Bayes algorithms on four benchmark imbalanced datasets. With relatively limited data samples, the proposed methodology can effectively classify unknown samples from large datasets.
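As a rough single-machine illustration of the SMOTE-style interpolation the abstract refers to (this is the classic SMOTE idea of Chawla et al.; the paper's improved, partition-aware variant and its Spark integration are not reproduced here, and the function name `smote_sample` is ours):

```python
import random

def smote_sample(minority, k=3, n_synthetic=10, seed=42):
    """Generate synthetic minority-class points by interpolating between
    a randomly chosen minority point and one of its k nearest minority
    neighbours (classic SMOTE; illustrative sketch only)."""
    rng = random.Random(seed)

    def dist(a, b):
        # Euclidean distance between two feature tuples.
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    synthetic = []
    for _ in range(n_synthetic):
        p = rng.choice(minority)
        # k nearest minority-class neighbours of p (excluding p itself).
        neighbours = sorted((q for q in minority if q is not p),
                            key=lambda q: dist(p, q))[:k]
        q = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1]
        # New point lies on the segment between p and its neighbour q.
        synthetic.append(tuple(x + gap * (y - x) for x, y in zip(p, q)))
    return synthetic
```

Because each synthetic point is a convex combination of two existing minority points, the generated samples always lie inside the convex hull of the minority class; the distributed challenge the paper addresses is that uncontrolled Spark partitioning can separate true neighbours into different partitions, distorting this neighbourhood computation.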



Data availability

The datasets analyzed during the current study are available at:

  • Covertype: https://archive.ics.uci.edu/ml/datasets/Covertype
  • N-BaIoT: https://archive.ics.uci.edu/ml/datasets/detection_of_IoT_botnet_attacks_N_BaIoT
  • SEER: https://www.seer.cancer.gov
  • Traffic violations: https://catalog.data.gov/dataset/traffic-violations-56dda


Funding

No funding was received to assist with the preparation of this manuscript.

Author information


Corresponding author

Correspondence to Tinku Singh.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Singh, T., Khanna, R., Satakshi et al. Improved multi-class classification approach for imbalanced big data on spark. J Supercomput 79, 6583–6611 (2023). https://doi.org/10.1007/s11227-022-04908-3

