Analyzing distributed Spark MLlib regression algorithms for accuracy, execution efficiency and scalability using best subset selection approach

Multimedia Tools and Applications

A Correction to this article was published on 25 January 2024

This article has been updated

Abstract 

Many studies of machine learning regression models emphasize accuracy, yet scalability and execution efficiency, which are critical for large datasets and extensive computations, are often overlooked. This paper introduces a scalable, distributed Spark MLlib regression model, built with the best subset selection approach, to predict Covid-19 statistics in India with high accuracy, scalability, and execution efficiency. Notably, little research has addressed tree-based regression, particularly gradient boosted regression, on Covid-19 datasets. The proposed work optimizes regression models for accuracy and execution time on Spark clusters of varying sizes using the best subset selection approach. Accuracy is evaluated with Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and R², alongside an analysis of execution time. Results show that tree-based regression gives the best prediction accuracy: Gradient Boosted Tree Regression (GBTR) leads, and Random Forest Regression (RFR) outperforms Decision Tree Regression (DTR). Accuracy remains consistent across the Python machine learning library, Spark MLlib on a single machine, and Spark clusters of varying sizes, while Spark MLlib shows lower execution times than the Python machine learning library on a single machine. Execution times decrease substantially further on Spark clusters, particularly for the iterative GBTR. This research examines scalability and execution efficiency alongside accuracy, highlights the accuracy of tree-based regression, and supports the use of Spark MLlib to improve execution efficiency, especially on multi-node clusters.
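The abstract does not spell out how best subset selection or the accuracy metrics are computed, so the following is only a minimal sketch of the general technique, not the authors' implementation. It assumes a plain least-squares regressor in pure Python as a stand-in for the Spark MLlib tree-based models, a tiny synthetic dataset with hypothetical columns `a`, `b`, `c`, and `y`, and selection by RMSE on a held-out split; all function names are illustrative.

```python
from itertools import combinations
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean square error."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_ols(rows, y):
    """Least squares with an intercept, solved via the normal equations."""
    X = [[1.0] + [float(v) for v in r] for r in rows]
    k = len(X[0])
    # Augmented matrix [X^T X | X^T y], reduced by Gauss-Jordan elimination.
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * t for x, t in zip(X, y))]
         for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        pivot = A[c][c]
        A[c] = [v / pivot for v in A[c]]
        for r in range(k):
            if r != c:
                factor = A[r][c]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[c])]
    return [A[i][k] for i in range(k)]


def predict(coef, rows):
    """Apply intercept plus weighted sum of features to each row."""
    return [coef[0] + sum(w * v for w, v in zip(coef[1:], r)) for r in rows]


def best_subset(features, target, train, test, tol=1e-9):
    """Fit every non-empty feature subset; keep the lowest held-out RMSE.

    A larger subset replaces the current best only if it improves RMSE
    by more than `tol`, so ties favour the smaller model.
    """
    best = None
    y_train = [row[target] for row in train]
    y_test = [row[target] for row in test]
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            coef = fit_ols([[row[f] for f in subset] for row in train], y_train)
            y_pred = predict(coef, [[row[f] for f in subset] for row in test])
            score = rmse(y_test, y_pred)
            if best is None or score < best["rmse"] - tol:
                best = {"features": subset, "rmse": score,
                        "mae": mae(y_test, y_pred), "r2": r2(y_test, y_pred)}
    return best


# Synthetic toy data (an assumption, not the paper's dataset):
# y depends on a and b only; c is an irrelevant column.
rows = [{"a": i, "b": (i * i) % 7, "c": (5 * i) % 3} for i in range(12)]
for row in rows:
    row["y"] = 1 + 2 * row["a"] + 3 * row["b"]
train, test = rows[:8], rows[8:]

best = best_subset(["a", "b", "c"], "y", train, test)
print(best["features"], best["rmse"], best["mae"], best["r2"])
```

Exhaustive search fits 2^p - 1 models for p candidate features, so it is practical only when the feature set is small. Scoring on a single held-out split is a simplification; cross-validation would be a more robust choice.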

Data availability

The datasets generated and analyzed during the current study are available from the corresponding author on reasonable request.

Funding

No funding was received for this research work.

Author information

Contributions

Piyush Sewal: Conceptualization, Methodology, Writing (original draft preparation), Visualization, Investigation, Software, Validation.

Hari Singh: Supervision, Reviewing and Editing.

Both authors read and approved the final manuscript.

Corresponding author

Correspondence to Piyush Sewal.

Ethics declarations

Competing interests

We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The original online version of this article was revised: Table 3 in the original publication of this article contains incorrect values for the parameters "Days," "Confirmed," and "Cured" in the "Independent Parameters" section.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sewal, P., Singh, H. Analyzing distributed Spark MLlib regression algorithms for accuracy, execution efficiency and scalability using best subset selection approach. Multimed Tools Appl 83, 44047–44066 (2024). https://doi.org/10.1007/s11042-023-17330-5

  • DOI: https://doi.org/10.1007/s11042-023-17330-5
