Machine learning-based risk prediction model for cardiovascular disease using a hybrid dataset

doi:10.1016/j.datak.2022.102042

Data & Knowledge Engineering

Volume 140, July 2022, 102042

Abstract

Cardiovascular disease (CVD) is one of the most common causes of death in the world today. CVD prediction allows health professionals to make informed decisions about their patients’ health. Data mining is the process of transforming large amounts of raw medical data into actionable insights that can be used to make intelligent forecasts and decisions. Machine learning (ML) based prediction models can support the diagnosis of patients in the health care industry. The objective of this research is to create a hybrid dataset to aid in the development of the best CVD risk prediction model. The Hungarian, Switzerland, Cleveland, and Long Beach datasets are the most commonly used datasets in heart disease (HD) prediction. These datasets have a maximum of 303 instances, with missing values in their features, and the presence of missing values reduces the accuracy of the prediction model. So, in this article, we created the “Sathvi” dataset by combining these datasets; it has 531 instances with 12 attributes and no missing data. Pearson’s correlation method was used to eliminate redundant features during the feature selection process. The Naive Bayes (NB), XGBoost, k-nearest neighbour (k-NN), multilayer perceptron (MLP), support vector machine (SVM), and CatBoost ML classifiers were applied for prediction. The CatBoost ML classifier was validated with 10-fold cross validation; its accuracy across the folds ranged from 88.67% to 98.11%, with a mean of 94.34%.

Introduction

Heart disease (HD) is a prevalent disease that afflicts many people in middle or old age, and it frequently results in fatal complications. According to government statistics for 2008, stroke accounted for about one in every 18 deaths in the United States (US) [1]. In the US, about 655,000 people die of HD each year. CVDs affect the cardiovascular system. To manage CVD, lifestyle changes are necessary, or the healthcare provider may prescribe medications. The earlier CVD is detected, the easier it is to treat. The common symptoms of CVD include chest pain, an irregular heartbeat, and nausea. Body mass index (BMI) remained the most frequently identified potential risk factor for CVD, while high cholesterol and high blood pressure were the second and third most common risk factors. According to a 2011 survey, men were 1.64 times more likely than women to have CVD [2]. Faced with a global viral pandemic like COVID-19 [3], we must emphasize international objectives to reduce the early mortality caused by CVD, which limits healthy and sustainable development in all countries around the world. There is an abundance of research data and hospital patient records available. Many open resources provide access to healthcare information, and research can be conducted to determine how various information and communication technologies can be used to predict or diagnose HD before it turns fatal. ML-based techniques are becoming more common in business and society, and they are now being applied to healthcare [4]. ML is a scientific discipline that studies how machines acquire knowledge from data and improve themselves. It is primarily based on statistics and probability [5]. However, when it comes to the decision-making process, it outperforms standard statistical methodologies. The information gathered from a dataset and fed into the algorithm is referred to as features. The quality of the features offered to the algorithm determines the model’s prediction accuracy.

The job of the ML developer is to identify the subset of attributes that best fits the objective, thereby boosting the model’s accuracy. There are three basic steps in developing an ML prediction model, namely training, testing, and validation [6]. Training is essential because the accuracy of the prediction or classification model depends on the training data. The algorithm’s performance is evaluated using the test dataset. k-fold cross validation is required to determine the stability of the model [7]. The primary aim of this research is to build the best early-stage CVD prediction model based on the most optimal attributes. The sub-goals are: reviewing existing approaches for detecting CVD; creating a hybrid dataset with no missing values; determining the best features using the Pearson’s r correlation coefficient feature selection technique; building various prediction models on the “Sathvi” dataset using different ML algorithms; and evaluating the performance of the best ML algorithm using k-fold cross validation.
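The k-fold validation step mentioned above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the helper names `k_fold_indices` and `cross_validate`, the shuffling seed, and the majority-class stand-in classifier are all assumptions made for illustration. Each fold is held out once while the model is fitted on the remaining folds, and the spread of the per-fold accuracies indicates how stable the model is.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    if k < 2 or k > n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th index starting at i, so sizes differ by at most 1.
    return [indices[i::k] for i in range(k)]


def cross_validate(X, y, fit, predict, k=10, seed=0):
    """Return the accuracy obtained on each of the k held-out folds."""
    accuracies = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return accuracies


# Hypothetical stand-in classifier: always predicts the majority training label.
def fit_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)


def predict_majority(model, x):
    return model


# Toy data (not from the paper): 50 samples, 30 of class 0 and 20 of class 1.
X = [[i] for i in range(50)]
y = [0] * 30 + [1] * 20
scores = cross_validate(X, y, fit_majority, predict_majority, k=10)
print("min:", min(scores), "max:", max(scores), "mean:", mean(scores))
```

Reporting the minimum, maximum, and mean of the fold accuracies mirrors the way the abstract summarises the 10-fold result; any real classifier can be plugged in through the `fit` and `predict` arguments.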

Related work

An algorithm’s ability to learn from its own data and experience is known as ML. It is regarded as a component of artificial intelligence. It has a wide range of applications in fields such as electrical engineering [8], health care [9], agriculture [10], and meteorology [11]. The HD risk prediction model was developed by Shah et al. [12] using 14 essential attributes. They used NB, decision trees, k-NN, and random forests for data mining classification. They discovered that the k-NN classifier has

Materials and methods

The following stages are involved in the development of a CVD risk prediction model. It begins with the creation of the “Sathvi” dataset, followed by pre-processing the data, feature selection, application of ML classification algorithms, identification of the best ML algorithm, and k-fold cross validation of the selected model.
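The feature selection stage can be sketched as follows. This is only an illustration under stated assumptions: the 0.9 threshold, the greedy keep-first strategy, the helper names `pearson_r` and `drop_redundant`, and the toy column names are hypothetical and are not taken from the paper. The idea is that a feature whose absolute Pearson correlation with an already kept feature is very high adds little new information and can be treated as redundant.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no linear relationship to report
    return cov / sqrt(var_x * var_y)


def drop_redundant(columns, threshold=0.9):
    """Keep a feature only if |r| with every already kept feature is below threshold."""
    kept = []
    for name in columns:
        if all(abs(pearson_r(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy columns (hypothetical values): "age_months" is an exact multiple of "age".
columns = {
    "age": [40, 50, 60, 70],
    "age_months": [480, 600, 720, 840],
    "chol": [200, 180, 240, 190],
}
print(drop_redundant(columns))
```

In this toy case `age_months` is perfectly correlated with `age` and is dropped, while `chol` is kept because its correlation with `age` is weak.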

Proposed machine learning classifiers

The NB, XGBoost, k-NN, SVM, MLP, and CatBoost ML classifiers have been applied for prediction. They are described in this section.
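To make one of the listed classifiers concrete, here is a minimal k-NN sketch in plain Python. It is an illustration only: the function name `knn_predict`, the choice of k = 3, the Euclidean distance, and the toy points are assumptions, and the paper's actual configuration is not given in this excerpt.

```python
from collections import Counter
from math import dist


def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    # Sort training points by Euclidean distance to x and keep the k closest.
    nearest = sorted(zip(X_train, y_train), key=lambda pair: dist(pair[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two well-separated clusters labelled 0 and 1.
X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_train = [0, 0, 0, 1, 1, 1]
print(knn_predict(X_train, y_train, [0.5, 0.5]))
print(knn_predict(X_train, y_train, [5.5, 5.5]))
```

Because k-NN relies on distances, features on very different scales would normally be normalised first; that step is omitted here for brevity.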

Training and test dataset

The modelling step infers a representative model from the data. Training datasets are collections of data used to construct models, and they contain known features as well as the target. Validating the created model also requires comparison against another dataset with known labels, referred to as the test dataset or validation dataset. To facilitate this process, the entire known dataset can be partitioned into a training set and a test set [32]. The “Sathvi” dataset has an 80:20 split between
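An 80:20 partition of the kind described above can be sketched in plain Python. The shuffling, the seed, the rounding rule, and the helper name `split_train_test` are assumptions for illustration; the excerpt does not say whether the authors shuffled or stratified the data.

```python
import random


def split_train_test(X, y, test_fraction=0.2, seed=42):
    """Shuffle the samples and hold out a fraction of them as the test set."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(X) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Placeholder data with the same number of instances as the "Sathvi" dataset.
X = [[i] for i in range(531)]
y = [i % 2 for i in range(531)]
X_train, X_test, y_train, y_test = split_train_test(X, y)
print(len(X_train), len(X_test))
```

With 531 instances and this rounding rule, the sketch holds out 106 samples for testing and keeps 425 for training; the exact counts used in the paper may differ.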

Conclusion

In this research, the “Sathvi” dataset, with 531 instances, was created from the four existing CVD datasets. It does not have any missing values. The “hybrid” and “Sathvi” datasets are available as supplementary files for public use. The risk prediction model was developed with six ML classifiers, and the CatBoost ML classifier was identified as the best performer, with a mean accuracy of 94.34% under 10-fold cross validation. The risk prediction model was developed with 10 attributes.

CRediT authorship contribution statement

Karthick Kanagarathinam: Conceptualization, Data curation, Writing – original draft, Investigation, Methodology. Durairaj Sankaran: Supervision, Validation, Writing – review & editing. R. Manikandan: Formal analysis, Software, Visualization.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Dr. K. Karthick is working as an Associate Professor in the Department of Electrical and Electronics Engineering, GMR Institute of Technology, Rajam, India. He received his B.E. degree in Electrical and Electronics Engineering from Periyar University, Salem, India and an M.E. degree in Power Electronics and Drives from Anna University, Chennai, India. He completed his Doctorate in Electrical Engineering at Anna University, Chennai. He has more than 16 years of experience in teaching. He is the

References (32)

Davenport, Thomas et al. The potential for artificial intelligence in healthcare. Future Healthc. J. (2019)
Ma, Fuzhe et al. Detection and diagnosis of chronic kidney disease using deep learning-based heterogeneous modified artificial neural network. Future Gener. Comput. Syst. (2020)
Gómez, Diego et al. Machine learning approach to predict leaf colour change in Fagus sylvatica L. (Spain). Agricult. Forest Meteorol. (2021)
Mienye, Ibomoiye Domor et al. An improved ensemble learning approach for the prediction of heart disease risk. Inform. Med. Unlocked (2020)
Rahhal, M.M.A. et al. Deep learning approach for active classification of electrocardiogram signals. Inform. Sci. (2016)
Pires, I.M. et al. Machine learning for the evaluation of the presence of heart disease. Procedia Comput. Sci. (2020)
Huang, Guomin et al. Evaluation of CatBoost method for prediction of reference evapotranspiration in humid regions. J. Hydrol. (2019)
Roger et al. Heart disease and stroke statistics–2012 update: a report from the American Heart Association. Circulation (2012)
Tran, D.M.T. et al. Risk factors associated with cardiovascular disease among adult Nevadans. PLoS One (2021)
Kanagarathinam, Karthick et al. Analysis of ‘earlyR’ epidemic model and time series model for prediction of COVID-19 registered cases. Mater. Today: Proc. (2020)
Makridakis, S. et al. Statistical and machine learning forecasting methods: Concerns and ways forward. PLoS One (2018)
Vabalas, A. et al. Machine learning algorithm validation with a limited sample size. PLoS One (2019)
Pal, K. et al. Data classification with k-fold cross validation and holdout accuracy estimation methods with 5 different machine learning techniques
Sekar, K. et al. Power quality disturbance detection using machine learning algorithm
Sharma, A. et al. Machine learning applications for precision agriculture: A comprehensive review. IEEE Access (2021)
Shah, D. et al. Heart disease prediction using machine learning techniques. SN Comput. Sci. (2020)

Dr. S. Durairaj is working as an Assistant Professor in the Department of Mechatronics Engineering at K S Rangasamy College of Technology, Tiruchengode, Tamil Nadu, India. He completed his Doctorate in May 2017 at Anna University, Chennai, India. He completed his M.E. degree in Power Electronics and Drives in 2009. He has more than 12 years of experience in teaching. His research interests include green energy, power electronics and drives, and machine learning.

Dr. R. Manikandan received his B.E. degree in Electronics and Instrumentation Engineering from Annamalai University, Chidambaram in 2002. He obtained his M.E. degree in Applied Electronics from Anna University, Chennai in 2008 and his Ph.D. degree in Image/Video Processing from the Department of Advanced Sports Training and Technology at Tamil Nadu Physical Education and Sports University, Chennai in 2014. His main research interests include automation, computer vision, and image/video processing. He is now a Professor in the Department of Electronics and Communication Engineering at Panimalar Engineering College, Chennai.
