Abstract
Classifying imbalanced datasets is one of the key difficulties that prevent meaningful information from being extracted. In real-world data applications, it is common for instances of a class of primary concern to be overshadowed by instances of other classes. Most known methods are designed for binary imbalanced classification with well-defined minority and majority classes. Moreover, in frameworks that process data in a distributed manner, such as Apache Spark, data partitioning is not controlled, which produces chunks of data with corrupted spatial relationships. In this study, we propose an efficient method for multi-class classification of imbalanced large datasets in Apache Spark. It uses Random Sample Partitioning to distribute the dataset effectively among partitions, and block-level sampling for efficient data sampling. The method overcomes the spatial limitations of distributed environments by applying an improved version of the synthetic minority oversampling technique (SMOTE). Synthetic data points generated through the improved SMOTE ensure a sufficient number of minority-class instances relative to the entire dataset. Extensive experiments were carried out on an Apache Spark cluster using the Random Forest and Naive Bayes algorithms on four benchmark imbalanced datasets. With relatively limited data samples, the proposed methodology can effectively classify unknown data samples from large datasets.
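The classic SMOTE interpolation underlying the approach can be sketched as follows. This is a minimal single-node NumPy sketch of standard SMOTE (nearest-neighbour interpolation within the minority class), not the paper's improved, distributed variant; the function name and parameters are illustrative.

```python
import numpy as np

def smote_oversample(X_min, n_synthetic, k=5, rng=None):
    """Generate synthetic minority samples by interpolating each chosen
    sample toward one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbour
    # Indices of the k nearest neighbours of each minority sample.
    nn = np.argsort(d, axis=1)[:, :min(k, n - 1)]
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(n)                   # random minority sample
        j = nn[i, rng.integers(nn.shape[1])]  # one of its neighbours
        gap = rng.random()                    # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

Because each synthetic point lies on the segment between a minority sample and one of its neighbours, the generated data stays within the convex hull of the minority class; the distributed setting complicates this step, since a partition may not contain a sample's true nearest neighbours, which is what the paper's partitioning strategy addresses.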
Data availability
The datasets analyzed during the current study are available at:
a. https://archive.ics.uci.edu/ml/datasets/Covertype
b. https://archive.ics.uci.edu/ml/datasets/detection_of_IoT_botnet_attacks_N_BaIoT
c. https://www.seer.cancer.gov
d. https://catalog.data.gov/dataset/traffic-violations-56dda
Funding
No funding was received to assist with the preparation of this manuscript.
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Singh, T., Khanna, R., Satakshi et al. Improved multi-class classification approach for imbalanced big data on spark. J Supercomput 79, 6583–6611 (2023). https://doi.org/10.1007/s11227-022-04908-3