Abstract
This paper presents a predictive analysis of data on heart disease patients to identify possible risk factors associated with their heart disease status. Two independent but similar published heart disease datasets were analysed: the Cleveland data, used to build the classification models, and the Statlog data, used to validate the results. A detailed exploratory analysis using the Chi-square test of independence was performed on the Cleveland data, after which ten standard classification models were trained for class prediction. The models were built by randomly partitioning the Cleveland data into 208 (70%) training samples and 89 (30%) test samples over 200 replications. Preliminary results showed that some of the bio-clinical categorical variables are strongly associated with the patients' heart disease status (p < 0.001). On the test samples, the support vector machine yielded the best predictive performance, with 85% accuracy, 82% sensitivity, 88% specificity, 87% precision, 91% area under the ROC curve, and a log loss value of 38%. The results were validated on the Statlog data using tenfold cross-validation and were consistent with those obtained from the Cleveland dataset.
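The six performance measures reported above can all be derived from a vector of true labels and a vector of predicted class probabilities. The following is a minimal Python sketch of those definitions, not the authors' code: the function names `binary_metrics` and `auc_by_pairs`, the 0.5 decision threshold, and the toy data are assumptions made purely for illustration.

```python
import math


def auc_by_pairs(y_true, y_prob):
    """Area under the ROC curve as the share of (positive, negative) pairs
    in which the positive case gets the higher score (ties count one half)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        return float("nan")
    wins = 0.0
    for pp in pos:
        for pn in neg:
            if pp > pn:
                wins += 1.0
            elif pp == pn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def binary_metrics(y_true, y_prob, threshold=0.5):
    """Accuracy, sensitivity, specificity, precision, AUC and log loss
    for a binary classifier. The threshold of 0.5 is an assumption."""
    eps = 1e-15  # clip probabilities so log() never receives 0
    tp = fp = tn = fn = 0
    loss = 0.0
    for y, p in zip(y_true, y_prob):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
        p = min(max(p, eps), 1 - eps)
        loss -= y * math.log(p) + (1 - y) * math.log(1 - p)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    n = len(y_true)
    return {
        "accuracy": ratio(tp + tn, n),
        "sensitivity": ratio(tp, tp + fn),   # true positive rate
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "precision": ratio(tp, tp + fp),
        "auc": auc_by_pairs(y_true, y_prob),
        "log_loss": ratio(loss, n),
    }


# Toy example (illustrative values only, not taken from the paper).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
for name, value in binary_metrics(labels, scores).items():
    print(f"{name}: {value:.3f}")
```

In a design like the one described, with 200 random 70/30 partitions, a function of this kind would be called once per replication on the held-out test samples and the resulting values averaged. Note that log loss is an unbounded non-negative quantity rather than a proportion, so lower values indicate better calibrated probabilities.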
Ethics declarations
Conflict of interest
All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.
About this article
Cite this article
Ogundepo, E.A., Yahya, W.B. Performance analysis of supervised classification models on heart disease prediction. Innovations Syst Softw Eng 19, 129–144 (2023). https://doi.org/10.1007/s11334-022-00524-9
DOI: https://doi.org/10.1007/s11334-022-00524-9