Abstract
Many studies of machine learning regression models emphasize accuracy, yet scalability and execution efficiency, which are critical for large datasets and extensive computations, are often overlooked. This paper introduces a scalable, distributed Spark MLlib regression model, built using the best subset selection approach, to predict Covid-19 statistics in India with high accuracy, scalability, and execution efficiency. Notably, little research has examined tree-based regression, particularly gradient boosted regression, on Covid-19 datasets. The proposed work optimizes regression models for accuracy and execution time on Spark clusters of varying sizes using the best subset selection approach. Evaluation covers accuracy, measured by Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and the coefficient of determination (R2), together with execution time analysis. Results indicate that tree-based regression yields the best prediction accuracy: Gradient Boosted Tree Regression (GBTR) performs best, followed by Random Forest Regression (RFR) and then Decision Tree Regression (DTR). Accuracy remains consistent across the Python machine learning library, Spark MLlib on a single machine, and Spark clusters of varying sizes, while Spark MLlib shows lower execution times than the Python machine learning library on a single machine. Furthermore, execution times decrease substantially on Spark clusters, particularly for the iterative GBTR. This research examines scalability and execution efficiency, highlights the accuracy of tree-based regression, and supports the use of Spark MLlib to improve execution efficiency, especially on multi-node clusters.
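As a rough illustration of the best subset selection idea and the accuracy metrics named in the abstract, the following minimal sketch enumerates every feature subset, fits a model on each, and keeps the subset with the lowest held-out RMSE. It is not the authors' implementation: it swaps the paper's Spark MLlib tree-based regressors for a plain-Python ordinary least squares fit so that it runs without dependencies, the selection criterion (held-out RMSE) is an assumption, and the data values, the target `deaths`, and all function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def fit_least_squares(rows, targets):
    """Ordinary least squares with an intercept, via the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    # Augmented system [X^T X | X^T y].
    aug = [[sum(d[i] * d[j] for d in design) for j in range(k)]
           + [sum(d[i] * t for d, t in zip(design, targets))]
           for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


def predict(coefs, rows):
    return [coefs[0] + sum(c * x for c, x in zip(coefs[1:], r)) for r in rows]


def metrics(actual, predicted):
    """RMSE, MAE and R2 for one set of predictions."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mean = sum(actual) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return {"rmse": sqrt(ss_res / n),
            "mae": sum(abs(e) for e in errors) / n,
            "r2": 1 - ss_res / ss_tot}


def best_subset(train_x, train_y, test_x, test_y, names):
    """Try every non-empty feature subset; keep the lowest held-out RMSE."""
    best = None
    for size in range(1, len(names) + 1):
        for subset in combinations(range(len(names)), size):
            tr = [[row[i] for i in subset] for row in train_x]
            te = [[row[i] for i in subset] for row in test_x]
            scores = metrics(test_y, predict(fit_least_squares(tr, train_y), te))
            # The tolerance makes ties favour the smaller, earlier subset.
            if best is None or scores["rmse"] < best[1]["rmse"] - 1e-9:
                best = ([names[i] for i in subset], scores)
    return best


# Hypothetical toy data: columns are days, confirmed, cured.
names = ["days", "confirmed", "cured"]
features = [[1, 5, 2], [2, 3, 7], [3, 8, 1], [4, 1, 8],
            [5, 9, 2], [6, 2, 8], [7, 7, 1], [8, 4, 8]]
deaths = [3 + 2 * row[0] for row in features]  # depends on days only

best_features, best_scores = best_subset(
    features[:6], deaths[:6], features[6:], deaths[6:], names)
print(best_features, best_scores)
```

Exhaustive enumeration grows as 2^p in the number of candidate features p, so it is only practical for small feature sets; distributing the per-subset fits across a cluster is one way to reduce that cost.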
Data availability
The datasets generated and analyzed during the current study are available from the corresponding author on reasonable request.
Change history
25 January 2024
A Correction to this paper has been published: https://doi.org/10.1007/s11042-023-17827-z
Funding
No funding was received for this research work.
Author information
Contributions
Piyush Sewal: Conceptualization, Methodology, Writing (original draft preparation), Visualization, Investigation, Software, Validation.
Hari Singh: Supervision, Reviewing and Editing.
Both authors read and approved the final manuscript.
Ethics declarations
Competing interests
We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.
Additional information
The original online version of this article was revised: Table 3 in the original publication of this article contains incorrect values for the parameters "Days," "Confirmed," and "Cured" in the "Independent Parameters" section.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Sewal, P., Singh, H. Analyzing distributed Spark MLlib regression algorithms for accuracy, execution efficiency and scalability using best subset selection approach. Multimed Tools Appl 83, 44047–44066 (2024). https://doi.org/10.1007/s11042-023-17330-5
DOI: https://doi.org/10.1007/s11042-023-17330-5