Abstract
As lifestyles change, certain biological dimensions of human life are changing with them, making people more vulnerable to stroke. Stroke is a medical condition in which parts of the brain are deprived of blood supply, and it can be fatal. Because stroke cases are increasing at an alarming rate, there is a need to analyze the factors driving this growth and to design an approach that predicts whether a person will be affected by stroke. This paper analyzes different machine learning algorithms for stroke prediction: Naive Bayes, Logistic Regression, Decision Tree, Random Forest and Gradient Boosting. We use a dataset consisting of 11 features such as age, gender and BMI (body mass index). These features are analyzed using univariate and multivariate plots to observe the correlations between them. The analysis also shows that some features, such as age, gender and smoking status, are important factors, while others, such as residence, are of less importance. The proposed work is implemented using Apache Spark, a distributed general-purpose cluster-computing framework. The Receiver Operating Characteristic (ROC) curves of the algorithms are compared, and the Gradient Boosting algorithm gives the best results, with an ROC area score of 0.90. After fine-tuning certain parameters of the Gradient Boosting algorithm (the learning rate, the depth of the trees, the number of trees and the minimum sample split), the ROC area score obtained is 0.94. The other performance measures, Accuracy, Precision, Recall and F1 score, are 0.867, 0.8673, 0.866 and 0.8659 respectively before fine-tuning and 0.9449, 0.9453, 0.9449 and 0.9448 respectively after fine-tuning.
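To make the evaluation measures named in the abstract concrete, the following is a minimal sketch in plain Python of how an ROC area score and support-weighted precision, recall and F1 can be computed from labels and predicted scores. It is not the authors' Spark implementation: the function names `roc_auc` and `weighted_metrics`, the toy data and the 0.5 decision threshold are all assumptions made for illustration. The abstract does not state which averaging was used for precision, recall and F1; support-weighted averaging is assumed here, and under that assumption recall always equals accuracy, which would be consistent with the near-identical accuracy and recall values reported.

```python
from collections import Counter


def roc_auc(labels, scores):
    """Area under the ROC curve via the pairwise-ranking formulation.

    Counts the fraction of (positive, negative) pairs in which the positive
    example receives the higher score; ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def weighted_metrics(labels, preds):
    """Accuracy plus support-weighted precision, recall and F1 (assumed averaging)."""
    total = len(labels)
    support = Counter(labels)
    accuracy = sum(y == p for y, p in zip(labels, preds)) / total
    precision = recall = f1 = 0.0
    for cls, n in support.items():
        tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
        predicted = sum(1 for p in preds if p == cls)
        prec_c = tp / predicted if predicted else 0.0
        rec_c = tp / n
        f1_c = 2 * prec_c * rec_c / (prec_c + rec_c) if prec_c + rec_c else 0.0
        weight = n / total  # each class contributes in proportion to its support
        precision += weight * prec_c
        recall += weight * rec_c
        f1 += weight * f1_c
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical toy data: 1 = stroke, 0 = no stroke; scores are model probabilities.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1]
preds = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

auc = roc_auc(labels, scores)
metrics = weighted_metrics(labels, preds)
print(f"ROC area: {auc:.4f}")
print({name: round(value, 4) for name, value in metrics.items()})
```

The pairwise formulation is quadratic in the number of examples, so it is only suitable for small illustrations; on a large dataset a sort-based computation or a library evaluator would be the practical choice.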
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Rajora, M., Rathod, M., Naik, N.S. (2021). Stroke Prediction Using Machine Learning in a Distributed Environment. In: Goswami, D., Hoang, T.A. (eds) Distributed Computing and Internet Technology. ICDCIT 2021. Lecture Notes in Computer Science, vol 12582. Springer, Cham. https://doi.org/10.1007/978-3-030-65621-8_15
DOI: https://doi.org/10.1007/978-3-030-65621-8_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-65620-1
Online ISBN: 978-3-030-65621-8
eBook Packages: Computer Science, Computer Science (R0)