Analyzing distributed Spark MLlib regression algorithms for accuracy, execution efficiency and scalability using best subset selection approach

Sewal, Piyush; Singh, Hari

doi:10.1007/s11042-023-17330-5

Published: 17 October 2023

Volume 83, pages 44047–44066 (2024)

Multimedia Tools and Applications

A Correction to this article was published on 25 January 2024

This article has been updated

Abstract

Many studies of machine learning regression models emphasize accuracy, yet scalability and execution efficiency, which are critical for large datasets and extensive computations, are often overlooked. This paper introduces a scalable, distributed Spark MLlib regression model, built with the best subset selection approach, to predict Covid-19 statistics in India with high accuracy, scalability, and execution efficiency. Notably, little research has examined tree-based regression, particularly gradient boosted regression, on Covid-19 data. The proposed work optimizes regression models for accuracy and execution time on Spark clusters of varying sizes using the best subset selection approach. Accuracy is evaluated with Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and the coefficient of determination (R²), alongside an analysis of execution time. Results indicate that tree-based regression predicts most accurately, with Gradient Boosted Tree Regression (GBTR) leading and Random Forest Regression (RFR) surpassing Decision Tree Regression (DTR). Accuracy remains consistent across Python's machine learning library, Spark MLlib on a single machine, and Spark clusters of varying sizes, while Spark MLlib shows lower execution times than Python's machine learning library on a single machine. Furthermore, execution times decrease substantially on Spark clusters, particularly for the iterative GBTR. This research examines scalability and execution efficiency alongside accuracy, highlighting the accuracy of tree-based regression and the ability of Spark MLlib to improve execution efficiency, especially on multi-node clusters.
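
The abstract names best subset selection and three accuracy metrics without showing how they fit together. The following is a minimal sketch of the idea, not the authors' implementation: it enumerates every non-empty subset of candidate features, fits a model on each, and keeps the subset with the lowest RMSE. Several things here are assumptions made for illustration: the feature names `x1`, `x2`, `x3` and the data are synthetic, an ordinary least-squares fit in plain Python stands in for the Spark MLlib tree-based regressors so the example runs without a cluster, and scoring is done on the training data for brevity.

```python
from itertools import combinations
from math import sqrt


def rmse(y_true, y_pred):
    """Root Mean Square Error."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_ols(rows, y):
    """Least-squares coefficients (intercept first) via the normal equations.

    Stand-in model (assumption): the paper uses Spark MLlib regressors instead.
    """
    X = [[1.0] + list(r) for r in rows]
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y], solved by Gauss-Jordan elimination.
    A = [[sum(x[i] * x[j] for x in X) for j in range(n)]
         + [sum(x[i] * t for x, t in zip(X, y))] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * p for a, p in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


def predict(coef, rows):
    """Apply intercept-first coefficients to each row of feature values."""
    return [coef[0] + sum(c * v for c, v in zip(coef[1:], row)) for row in rows]


def best_subset(data, target, tol=1e-9):
    """Try every non-empty feature subset and keep the one with the lowest RMSE.

    A later subset replaces the current best only if it lowers RMSE by more
    than tol, so ties go to the subset seen first (smaller subsets come first).
    """
    names = sorted(data)
    best = None
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            rows = list(zip(*(data[name] for name in subset)))
            preds = predict(fit_ols(rows, target), rows)
            score = rmse(target, preds)
            if best is None or score < best[1] - tol:
                best = (subset, score)
    return best


# Synthetic data (assumption): the target depends on x1 and x2 only.
data = {
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2, 1, 4, 3, 6, 5],
    "x3": [1, 0, 0, 1, 0, 1],
}
target = [2 * a + 3 * b + 1 for a, b in zip(data["x1"], data["x2"])]

subset, score = best_subset(data, target)
print(subset, round(score, 6))
```

Exhaustive search evaluates 2^p - 1 subsets for p features, which is cheap for a handful of predictors but grows quickly. In a realistic setting the score would be computed on held-out data, and MAE and R² would be reported alongside RMSE for the chosen subset.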

Data availability

The datasets generated and analyzed during the current study are available from the corresponding author on reasonable request.

Change history

25 January 2024
A Correction to this paper has been published: https://doi.org/10.1007/s11042-023-17827-z

Funding

No funding was received for this research work.

Author information

Authors and Affiliations

Computer Science & Engineering Department, Jaypee University of Information Technology, Solan, Himachal Pradesh, India
Piyush Sewal & Hari Singh

Contributions

Piyush Sewal: Conceptualization, Methodology, Writing (original draft preparation), Visualization, Investigation, Software, Validation.

Hari Singh: Supervision, Reviewing and Editing.

Both authors read and approved the final manuscript.

Corresponding author

Correspondence to Piyush Sewal.

Ethics declarations

Competing interests

We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The original online version of this article was revised: Table 3 in the original publication of this article contains incorrect values for the parameters "Days," "Confirmed," and "Cured" in the "Independent Parameters" section.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sewal, P., Singh, H. Analyzing distributed Spark MLlib regression algorithms for accuracy, execution efficiency and scalability using best subset selection approach. Multimed Tools Appl 83, 44047–44066 (2024). https://doi.org/10.1007/s11042-023-17330-5

Received: 08 May 2023
Revised: 16 August 2023
Accepted: 27 September 2023
Published: 17 October 2023
Issue Date: May 2024
DOI: https://doi.org/10.1007/s11042-023-17330-5
