Clinical Synthetic Data Generation to Predict and Identify Risk Factors for Cardiovascular Diseases

García-Vicente, Clara; Chushig-Muzo, David; Mora-Jiménez, Inmaculada; Fabelo, Himar; Gram, Inger Torhild; Løchen, Maja-Lisa; Granja, Conceição; Soguero-Ruiz, Cristina

doi:10.1007/978-3-031-23905-2_6

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13814)

Included in the following conference series:

VLDB Workshop on Data Management and Analytics for Medicine and Healthcare
VLDB Workshop on Polystore Systems for Heterogeneous Data in Multiple Databases with Privacy and Security Assurances

Abstract

Noncommunicable diseases are among the most significant health threats in our society, with cardiovascular diseases (CVD) being the most prevalent. Because of the severity and prevalence of these illnesses, early detection and prevention are critical for reducing the worldwide health and economic burden. Though machine learning (ML) methods usually outperform conventional approaches in many domains, class imbalance can hinder the learning process. Oversampling the minority classes can help to overcome this issue. In this paper, we apply oversampling methods to categorical data, aiming to improve the identification of risk factors associated with CVD. To conduct this study, we consider categorical questionnaire data from healthy individuals and CVD patients, obtained by the Norwegian Centre for E-health Research. The goal of this work is two-fold: first, to evaluate the influence of combining oversampling techniques with linear and nonlinear supervised ML methods in binary classification tasks; second, to identify the most relevant features for predicting healthy and CVD cases. Experimental results show that oversampling and feature selection (FS) techniques help to improve CVD prediction. Specifically, combining Generative Adversarial Networks with linear models usually achieves the best performance (area under the curve of 67%), outperforming other oversampling techniques. Synthetic data generation has proved beneficial both for identifying risk factors and for creating models with reasonable generalization capability in CVD prediction.
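
To make the two ingredients named in the abstract concrete (balancing the minority class, then scoring a binary classifier by area under the curve), the following is a minimal sketch using only the Python standard library. It is not the authors' pipeline: plain random oversampling stands in for the GAN-based generators the abstract mentions, and the function names, the toy questionnaire answers, and the labels are all hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until every class
    has as many samples as the largest class.

    This is the simplest oversampling baseline; it copies existing rows
    instead of generating new synthetic ones (as SMOTE-like or GAN-based
    methods would).
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def roc_auc(scores, labels):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs in which the positive sample receives
    the higher score; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical categorical answers (smoking habit, activity level);
# label 1 marks the minority class.
rows = [("never", "low"), ("daily", "high"), ("never", "high"),
        ("weekly", "low"), ("daily", "low")]
labels = [0, 0, 0, 0, 1]

balanced_rows, balanced_labels = random_oversample(rows, labels)
class_counts = Counter(balanced_labels)
```

In practice, oversampling is applied to the training split only, so that duplicated or synthetic samples never leak into the data used to compute the area under the curve.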

Acknowledgements

This work has been partly supported by the European Commission through H2020-EU.3.1.4.2., European Project WARIFA (Watching the risk factors: Artificial intelligence and the prevention of chronic conditions) under Grant Agreement 101017385; by the Spanish Government through the grants BigTheory (PID2019-106623RB-C41) and AAVis-BMR (PID2019-107768RA-I00); by Project Ref. 2020-661, financed by Rey Juan Carlos University and the Community of Madrid; and by the Research Council of Norway (HELSE-EU-project 269882).

Author information

Authors and Affiliations

Department of Signal Theory and Communications, Telematics and Computing Systems, Rey Juan Carlos University, Madrid, 28943, Spain
Clara García-Vicente, David Chushig-Muzo, Inmaculada Mora-Jiménez & Cristina Soguero-Ruiz
Research Institute for Applied Microelectronics, University of Las Palmas de Gran Canaria, Las Palmas de Gran Canaria, Spain
Himar Fabelo
Fundación Canaria Instituto de Investigación Sanitaria de Canarias (FIISC), Las Palmas de Gran Canaria, Spain
Himar Fabelo
Norwegian Centre for E-health Research, University Hospital of North Norway, Tromsø, 9019, Norway
Inger Torhild Gram & Conceição Granja
Faculty of Health Sciences, Department of Community Medicine, UiT The Arctic University of Norway, Tromsø, 9019, Norway
Inger Torhild Gram & Maja-Lisa Løchen
Faculty of Nursing and Health Sciences, Nord University, Bodø, Norway
Conceição Granja

Corresponding author

Correspondence to Cristina Soguero-Ruiz.

Editor information

Editors and Affiliations

Massachusetts Institute of Technology, Cambridge, MA, USA
El Kindi Rezig
Massachusetts Institute of Technology, Lexington, MA, USA
Vijay Gadepally
Intel Corporation, Portland, OR, USA
Timothy Mattson
Massachusetts Institute of Technology, Cambridge, MA, USA
Michael Stonebraker
Massachusetts Institute of Technology, Cambridge, MA, USA
Tim Kraska
Georgia State University, Atlanta, GA, USA
Jun Kong
University of Washington, Seattle, WA, USA
Gang Luo
Shandong University, Qingdao, China
Dejun Teng
Stony Brook University, Stony Brook, NY, USA
Fusheng Wang

About this paper

Cite this paper

García-Vicente, C. et al. (2022). Clinical Synthetic Data Generation to Predict and Identify Risk Factors for Cardiovascular Diseases. In: Rezig, E.K., et al. Heterogeneous Data Management, Polystores, and Analytics for Healthcare. DMAH 2022, Poly 2022. Lecture Notes in Computer Science, vol 13814. Springer, Cham. https://doi.org/10.1007/978-3-031-23905-2_6

DOI: https://doi.org/10.1007/978-3-031-23905-2_6
Published: 21 January 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-23904-5
Online ISBN: 978-3-031-23905-2
eBook Packages: Computer Science, Computer Science (R0)
