Clinical Synthetic Data Generation to Predict and Identify Risk Factors for Cardiovascular Diseases

  • Conference paper
Heterogeneous Data Management, Polystores, and Analytics for Healthcare (DMAH 2022, Poly 2022)

Abstract

Noncommunicable diseases are among the most significant health threats in our society, with cardiovascular diseases (CVD) being the most prevalent. Because of the severity and prevalence of these illnesses, early detection and prevention are critical for reducing the worldwide health and economic burden. Though machine learning (ML) methods often outperform conventional approaches in many domains, class imbalance can hinder the learning process. Oversampling the minority classes can help to overcome this issue. In this paper, we apply oversampling methods to categorical data, aiming to improve the identification of risk factors associated with CVD. The study uses categorical questionnaire data, obtained by the Norwegian Centre for E-health Research, from healthy individuals and CVD patients. The goal of this work is two-fold: first, to evaluate the influence of combining oversampling techniques with linear and nonlinear supervised ML methods in binary classification tasks; second, to identify the most relevant features for predicting healthy and CVD cases. Experimental results show that oversampling and feature selection (FS) techniques help to improve CVD prediction. Specifically, combining Generative Adversarial Networks with linear models usually achieves the best performance (area under the curve of 67%), outperforming other oversampling techniques. Synthetic data generation has proved beneficial both for identifying risk factors and for creating models with reasonable generalization capability in CVD prediction.
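
To make the idea of oversampling a minority class concrete, the following is a minimal sketch of plain random oversampling on categorical records. It is not the paper's method: the abstract reports Generative Adversarial Networks among the techniques compared, whereas this sketch only duplicates existing minority rows. The function name `random_oversample`, the toy answers, and the class labels are hypothetical and chosen purely for illustration.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until every class
    has as many samples as the largest class (illustrative sketch only)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Rows of the original data that belong to this class.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Hypothetical categorical questionnaire answers (NOT the paper's data).
rows = [
    ("never", "low"), ("daily", "high"), ("never", "medium"),
    ("weekly", "low"), ("never", "high"), ("daily", "low"),
]
labels = ["healthy", "healthy", "healthy", "healthy", "cvd", "cvd"]

balanced_rows, balanced_labels = random_oversample(rows, labels)
balanced_counts = Counter(balanced_labels)
print(balanced_counts)
```

Because duplicated rows carry no new information, this baseline mainly reweights the classes. Generative approaches such as those named in the abstract instead aim to synthesize new, plausible minority records.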

Acknowledgements

This work has been partly supported by the European Commission through H2020-EU.3.1.4.2., European Project WARIFA (Watching the risk factors: Artificial intelligence and the prevention of chronic conditions), under Grant Agreement 101017385; by the Spanish Government through the grants BigTheory (PID2019-106623RB-C41) and AAVis-BMR (PID2019-107768RA-I00); by Project Ref. 2020-661, financed by Rey Juan Carlos University and the Community of Madrid; and by the Research Council of Norway (HELSE-EU-project 269882).

Author information

Corresponding author

Correspondence to Cristina Soguero-Ruiz.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

García-Vicente, C. et al. (2022). Clinical Synthetic Data Generation to Predict and Identify Risk Factors for Cardiovascular Diseases. In: Rezig, E.K., et al. Heterogeneous Data Management, Polystores, and Analytics for Healthcare. DMAH 2022, Poly 2022. Lecture Notes in Computer Science, vol 13814. Springer, Cham. https://doi.org/10.1007/978-3-031-23905-2_6

  • DOI: https://doi.org/10.1007/978-3-031-23905-2_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-23904-5

  • Online ISBN: 978-3-031-23905-2

  • eBook Packages: Computer Science, Computer Science (R0)
