SMOOTH-GAN: Towards Sharp and Smooth Synthetic EHR Data Generation

Rashidian, Sina; Wang, Fusheng; Moffitt, Richard; Garcia, Victor; Dutt, Anurag; Chang, Wei; Pandya, Vishwam; Hajagos, Janos; Saltz, Mary; Saltz, Joel

doi:10.1007/978-3-030-59137-3_4

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12299)

Included in the following conference series:

International Conference on Artificial Intelligence in Medicine

Abstract

Generative adversarial networks (GANs) have been highly successful at generating realistic synthetic data. In healthcare, synthetic data generation can help produce annotated data and improve data-driven research without concerns about data privacy. However, electronic health records (EHRs) are noisy, incomplete and complex, and existing work on EHR data is mainly devoted to generating discrete elements such as diagnosis codes and medications, or frequent laboratory values. In this work, we propose SMOOTH-GAN, a novel approach for generating reliable EHR data, such as laboratory values and medications, given diagnosis codes. SMOOTH-GAN uses a conditional GAN architecture with WGAN-GP loss and is able to learn transitions between disease stages with high flexibility over data customization. Our experiments demonstrate the model's effectiveness in terms of both statistical similarity and accuracy on machine learning based prediction. To further demonstrate the usage of our model, we apply counterfactual reasoning and generate data with the occurrence of multiple diseases, which can provide unique datasets for artificial intelligence driven healthcare research.

References

1. Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein generative adversarial networks. In: International Conference on Machine Learning, pp. 214–223 (2017)
2. Ashfaq, A., Sant’Anna, A., Lingman, M., Nowaczyk, S.: Readmission prediction using deep learning on electronic health records. Journal of biomedical informatics 97, 103256 (2019)
3. Avati, A., Jung, K., Harman, S., Downing, L., Ng, A., Shah, N.H.: Improving palliative care with deep learning. BMC medical informatics and decision making 18(4), 122 (2018)
4. Bounliphone, W., Belilovsky, E., Blaschko, M.B., Antonoglou, I., Gretton, A.: A test of relative similarity for model selection in generative models. arXiv preprint arXiv:1511.04581 (2015)
5. Che, Z., Cheng, Y., Zhai, S., Sun, Z., Liu, Y.: Boosting deep learning risk prediction with generative adversarial networks for electronic health records. In: 2017 IEEE International Conference on Data Mining (ICDM), pp. 787–792. IEEE (2017)
6. Choi, E., Bahadori, M.T., Schuetz, A., Stewart, W.F., Sun, J.: Doctor AI: predicting clinical events via recurrent neural networks. In: Machine Learning for Healthcare Conference, pp. 301–318 (2016)
7. Choi, E., Biswal, S., Malin, B., Duke, J., Stewart, W.F., Sun, J.: Generating multi-label discrete patient records using generative adversarial networks. In: Machine Learning for Healthcare Conference, pp. 286–305 (2017)
8. Esteban, C., Hyland, S.L., Rätsch, G.: Real-valued (medical) time series generation with recurrent conditional gans. arXiv preprint arXiv:1706.02633 (2017)
9. Goodfellow, I.: Nips 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160 (2016)
10. Goodfellow, I., et al.: Generative adversarial nets. In: Advances in neural information processing systems, pp. 2672–2680 (2014)
11. Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.J.: A kernel method for the two-sample-problem. In: Advances in neural information processing systems, pp. 513–520 (2007)
12. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.C.: Improved training of wasserstein gans. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems 30, pp. 5767–5777. Curran Associates, Inc. (2017), http://papers.nips.cc/paper/7159-improved-training-of-wasserstein-gans.pdf
13. Liu, S., Kailkhura, B., Loveland, D., Han, Y.: Generative counterfactual introspection for explainable deep learning. arXiv preprint arXiv:1907.03077 (2019)
14. Mirza, M., Osindero, S.: Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 (2014)
15. Odena, A., Olah, C., Shlens, J.: Conditional image synthesis with auxiliary classifier gans. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 2642–2651. JMLR.org (2017)
16. Perarnau, G., Van De Weijer, J., Raducanu, B., Álvarez, J.M.: Invertible conditional gans for image editing. arXiv preprint arXiv:1611.06355 (2016)
17. Rashidian, S., et al.: Deep learning on electronic health records to improve disease coding accuracy. In: AMIA Summits on Translational Science Proceedings, vol. 2019, p. 620 (2019)
18. Steiner, C.A., Barrett, M.L., Weiss, A.J., Andrews, R.M.: Trends and projections in hospital stays for adults with multiple chronic conditions, 2003–2014: Statistical brief# 183. Healthcare Cost and Utilization Project (HCUP) Statistical Briefs. Rockville: Agency for Health Care Policy and Research (US) (2006)
19. Xu, L., Skoularidou, M., Cuesta-Infante, A., Veeramachaneni, K.: Modeling tabular data using conditional gan. In: Advances in Neural Information Processing Systems, pp. 7333–7343 (2019)

Acknowledgments

The authors wish to thank Aryan Arbabi for his constructive comments.

Author information

Authors and Affiliations

Stony Brook University, Stony Brook, NY, 11794, USA
Sina Rashidian, Fusheng Wang, Anurag Dutt, Wei Chang & Vishwam Pandya
Stony Brook Medicine, Stony Brook, NY, 11794, USA
Fusheng Wang, Richard Moffitt, Victor Garcia, Janos Hajagos, Mary Saltz & Joel Saltz

Corresponding author

Correspondence to Sina Rashidian.

Editor information

Editors and Affiliations

School of Nursing, University of Minnesota, Minneapolis, MN, USA
Martin Michalowski
Ben-Gurion University of the Negev, Tonawanda, NY, USA
Robert Moskovitch

A Appendix

A.1 Binary Data Distribution

Because GANs are known to struggle with generating binary values, Fig. 4 illustrates the dimension-wise probability of medications in real versus synthetic data.
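
As a minimal sketch of what dimension-wise probability means for binary columns, the Python snippet below computes, for each column, the fraction of records in which that column equals 1, and then compares real against synthetic data. The function name `dimension_wise_probability` and the two toy matrices are assumptions made purely for illustration; they are not the paper's data or code.

```python
import numpy as np


def dimension_wise_probability(binary_matrix):
    """Return, per column, the fraction of rows in which the column equals 1."""
    return np.asarray(binary_matrix, dtype=float).mean(axis=0)


# Hypothetical toy data: rows are patients, columns are medications (1 = prescribed).
real = np.array([[1, 0, 1],
                 [1, 0, 0],
                 [0, 0, 1],
                 [1, 1, 1]])
synthetic = np.array([[1, 0, 1],
                      [1, 0, 1],
                      [0, 0, 0],
                      [1, 0, 1]])

p_real = dimension_wise_probability(real)
p_synthetic = dimension_wise_probability(synthetic)

# Largest per-medication disagreement between the two datasets.
max_gap = np.max(np.abs(p_real - p_synthetic))
```

Plotting `p_real` against `p_synthetic` gives one point per medication; points close to the diagonal indicate that the generator reproduces that medication's marginal frequency.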

A.2 Is Training Data Memorized by the GAN?

To assess privacy, that is, whether the GAN generates new cases or merely memorizes the training set, we followed [8]: we measured the maximum mean discrepancy (MMD) and applied the three-sample test [4, 11]. MMD tests whether two sets of samples were drawn from the same distribution. If the GAN had memorized the training data, MMD(synthetic, training) would be significantly lower than MMD(synthetic, test). We therefore state the null hypothesis as: the GAN has not memorized the training set, and consequently MMD(synthetic, test) \(\le\) MMD(synthetic, training). We sampled from these three datasets 35 times and calculated the MMDs and the p-value for the hypothesis each time. The mean p-value with its standard deviation is \(0.26\pm 0.15\), so we cannot reject the null hypothesis; we find no evidence that the GAN memorized the training set.
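
The comparison of MMD(synthetic, training) with MMD(synthetic, test) can be sketched in Python as follows. This is an illustration only, under several assumptions: it uses the standard unbiased estimator of squared MMD with a Gaussian (RBF) kernel of fixed bandwidth, it runs on randomly generated toy data, and it does not implement the three-sample test or the p-value computation of [4]. The names `rbf_kernel`, `mmd2_unbiased` and all arrays are hypothetical.

```python
import numpy as np


def rbf_kernel(a, b, bandwidth):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq_dist = (np.sum(a ** 2, axis=1)[:, None]
               + np.sum(b ** 2, axis=1)[None, :]
               - 2.0 * a @ b.T)
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth ** 2))


def mmd2_unbiased(x, y, bandwidth=1.0):
    """Unbiased estimate of squared MMD between samples x and y."""
    m, n = len(x), len(y)
    k_xx = rbf_kernel(x, x, bandwidth)
    k_yy = rbf_kernel(y, y, bandwidth)
    k_xy = rbf_kernel(x, y, bandwidth)
    # Within-sample terms exclude the diagonal (each point paired with itself).
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    return term_xx + term_yy - 2.0 * k_xy.mean()


# Toy data (assumption): training and test sets drawn from the same distribution.
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(200, 10))
test = rng.normal(0.0, 1.0, size=(200, 10))

# A "generator" that memorizes: near-copies of the training records.
memorized = train + rng.normal(0.0, 0.01, size=train.shape)
# A "generator" that draws fresh samples from the same distribution.
fresh = rng.normal(0.0, 1.0, size=(200, 10))

mmd_mem_train = mmd2_unbiased(memorized, train)
mmd_mem_test = mmd2_unbiased(memorized, test)
mmd_fresh_train = mmd2_unbiased(fresh, train)
mmd_fresh_test = mmd2_unbiased(fresh, test)
```

On this toy data the memorizing generator yields a clearly smaller value against the training set than against the test set, while for the fresh generator the two values differ only by sampling noise. Deciding whether an observed gap is significant requires a proper test such as the one in [4]; the bandwidth of 1.0 is likewise an arbitrary choice here.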

About this paper

Cite this paper

Rashidian, S. et al. (2020). SMOOTH-GAN: Towards Sharp and Smooth Synthetic EHR Data Generation. In: Michalowski, M., Moskovitch, R. (eds) Artificial Intelligence in Medicine. AIME 2020. Lecture Notes in Computer Science (LNAI), vol 12299. Springer, Cham. https://doi.org/10.1007/978-3-030-59137-3_4

DOI: https://doi.org/10.1007/978-3-030-59137-3_4
Published: 26 September 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-59136-6
Online ISBN: 978-3-030-59137-3
eBook Packages: Computer Science, Computer Science (R0)
