Abstract
Noncommunicable diseases are among the most significant health threats in our society, with cardiovascular diseases (CVD) being the most prevalent. Because of the severity and prevalence of these illnesses, early detection and prevention are critical for reducing the worldwide health and economic burden. Although machine learning (ML) methods often outperform conventional approaches in many domains, class imbalance can hinder the learning process. Oversampling the minority classes can help to overcome this issue. In this paper we apply oversampling methods to categorical data, aiming to improve the identification of risk factors associated with CVD. The study uses categorical questionnaire data collected by the Norwegian Centre for E-health Research from healthy individuals and CVD patients. The goal of this work is two-fold: first, to evaluate the influence of combining oversampling techniques with linear and nonlinear supervised ML methods in binary classification tasks; second, to identify the most relevant features for predicting healthy and CVD cases. Experimental results show that oversampling and feature selection (FS) techniques help to improve CVD prediction. Specifically, combining Generative Adversarial Networks with linear models usually achieves the best performance (area under the curve of 67%), outperforming other oversampling techniques. Synthetic data generation proved beneficial both for identifying risk factors and for building models with reasonable generalization capability in CVD prediction.
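To make the idea of minority-class oversampling on categorical data concrete, the following is a minimal sketch of plain random oversampling, which duplicates existing minority records until the classes are balanced. It is a much simpler technique than the Generative Adversarial Network approach reported above and is not the authors' pipeline; the function name `random_oversample`, the two questionnaire-style features, and the toy records are all hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until every class
    has as many rows as the largest class (illustrative helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # All original rows belonging to this class.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        # Append copies until the class reaches the target size.
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Hypothetical categorical questionnaire answers; 1 = CVD, 0 = healthy.
rows = [
    ("never", "low"), ("never", "high"), ("daily", "low"),
    ("never", "medium"), ("weekly", "high"), ("daily", "low"),
]
labels = [0, 0, 1, 0, 0, 1]

bal_rows, bal_labels = random_oversample(rows, labels)
print(Counter(bal_labels))
```

In a real evaluation, oversampling of this kind would normally be applied only to the training split, so that duplicated or synthetic records do not leak into the data used to measure performance.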
Acknowledgements
This work has been partly supported by the European Commission through H2020-EU.3.1.4.2., European Project WARIFA (Watching the risk factors: Artificial intelligence and the prevention of chronic conditions) under Grant Agreement 101017385; by the Spanish Government through the grants BigTheory (PID2019-106623RB-C41) and AAVis-BMR (PID2019-107768RA-I00); by Project Ref. 2020-661, financed by Rey Juan Carlos University and the Community of Madrid; and by the Research Council of Norway (HELSE-EU-project 269882).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
García-Vicente, C. et al. (2022). Clinical Synthetic Data Generation to Predict and Identify Risk Factors for Cardiovascular Diseases. In: Rezig, E.K., et al. Heterogeneous Data Management, Polystores, and Analytics for Healthcare. DMAH Poly 2022. Lecture Notes in Computer Science, vol 13814. Springer, Cham. https://doi.org/10.1007/978-3-031-23905-2_6
DOI: https://doi.org/10.1007/978-3-031-23905-2_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-23904-5
Online ISBN: 978-3-031-23905-2
eBook Packages: Computer Science (R0)