Generating Longitudinal Synthetic EHR Data with Recurrent Autoencoders and Generative Adversarial Networks

Sun, Siao; Wang, Fusheng; Rashidian, Sina; Kurc, Tahsin; Abell-Hart, Kayley; Hajagos, Janos; Zhu, Wei; Saltz, Mary; Saltz, Joel

doi:10.1007/978-3-030-93663-1_12

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 12921)

Included in the following conference series:

VLDB Workshop on Data Management and Analytics for Medicine and Healthcare
VLDB Workshop on Polystore Systems for Heterogeneous Data in Multiple Databases with Privacy and Security Assurances

Abstract

Synthetic electronic health records (EHR) can facilitate effective use of clinical data in software development, medical education, and medical research without the concerns of data privacy. We propose a novel Generative Adversarial Network (GAN) approach, called Longitudinal GAN (LongGAN), that can generate synthetic longitudinal EHR data. LongGAN employs a recurrent autoencoder and the Wasserstein GAN Gradient Penalty (WGAN-GP) architecture with conditional inputs. We evaluate LongGAN with the task of generating training data for machine/deep learning methods. Our experiments show that predictive models trained with synthetic data from LongGAN achieve comparable performance to those trained with real data. Moreover, these models have up to 0.27 higher AUROC and up to 0.21 higher AUPRC values than models trained with synthetic data from RCGAN and TimeGAN, the two most relevant methods for longitudinal data generation. We also demonstrate that LongGAN is able to preserve patient privacy in a given attribute disclosure attack setting.
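The evaluation idea in the abstract (train a predictive model on synthetic data, test it on real data, and compare AUROC against a model trained on real data) can be sketched in a few lines. The following is only a minimal illustration under stated assumptions, not the authors' pipeline: the two-feature toy cohorts, the nearest-centroid scorer, and the names `make_cohort`, `fit_centroids`, `score`, and `evaluate` are all hypothetical, and the "synthetic" cohort is just a noisier draw from the same toy distribution standing in for generator output.

```python
import random

def auroc(labels, scores):
    """AUROC via the Mann-Whitney U statistic; tied scores get average ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def make_cohort(n_per_class, noise_sd, rng):
    """Hypothetical toy data: two features per patient, binary outcome."""
    rows, labels = [], []
    for label in (0, 1):
        mean = 1.5 * label
        for _ in range(n_per_class):
            rows.append([rng.gauss(mean, noise_sd), rng.gauss(mean, noise_sd)])
            labels.append(label)
    return rows, labels

def fit_centroids(rows, labels):
    """Nearest-centroid 'model': the mean feature vector of each class."""
    cents = {}
    for label in (0, 1):
        members = [r for r, y in zip(rows, labels) if y == label]
        cents[label] = [sum(col) / len(members) for col in zip(*members)]
    return cents

def score(cents, row):
    """Higher score means closer to the positive centroid than the negative."""
    d_neg = sum((x - c) ** 2 for x, c in zip(row, cents[0]))
    d_pos = sum((x - c) ** 2 for x, c in zip(row, cents[1]))
    return d_neg - d_pos

def evaluate(train, test):
    """Fit on `train`, then report AUROC on the held-out `test` cohort."""
    cents = fit_centroids(*train)
    return auroc(test[1], [score(cents, r) for r in test[0]])

rng = random.Random(0)
real_train = make_cohort(200, 1.0, rng)
real_test = make_cohort(200, 1.0, rng)
synthetic_train = make_cohort(200, 1.5, rng)  # assumption: stand-in for generator output

auroc_real = evaluate(real_train, real_test)
auroc_syn = evaluate(synthetic_train, real_test)
print(f"trained on real: {auroc_real:.3f}, trained on synthetic: {auroc_syn:.3f}")
```

Both models are scored on the same held-out real cohort, so the gap between the two printed values is the quantity of interest: a small gap suggests the synthetic data preserved the signal the downstream model needs. AUPRC could be compared in the same way.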

Author information

Authors and Affiliations

Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY, USA
Siao Sun & Wei Zhu
Department of Computer Science, Stony Brook University, Stony Brook, NY, USA
Fusheng Wang & Sina Rashidian
Department of Biomedical Informatics, Renaissance School of Medicine at Stony Brook University, Stony Brook, NY, USA
Fusheng Wang, Tahsin Kurc, Kayley Abell-Hart, Janos Hajagos, Mary Saltz & Joel Saltz

Corresponding author

Correspondence to Siao Sun.

Editor information

Editors and Affiliations

Massachusetts Institute of Technology, Cambridge, MA, USA
El Kindi Rezig
Lincoln Laboratory, Massachusetts Institute of Technology, Lexington, MA, USA
Vijay Gadepally
Intel Corporation, Portland, ME, USA
Timothy Mattson
Massachusetts Institute of Technology, Cambridge, MA, USA
Michael Stonebraker
Massachusetts Institute of Technology, Cambridge, MA, USA
Tim Kraska
Stony Brook University, Lake Grove, NY, USA
Fusheng Wang
University of Utah, Salt Lake City, UT, USA
Gang Luo
Georgia State University, Atlanta, GA, USA
Jun Kong
Lucerne University of Applied Sciences, Rotkreuz, Zug, Switzerland
Alevtina Dubovitskaya

About this paper

Cite this paper

Sun, S. et al. (2021). Generating Longitudinal Synthetic EHR Data with Recurrent Autoencoders and Generative Adversarial Networks. In: Rezig, E.K., et al. Heterogeneous Data Management, Polystores, and Analytics for Healthcare. DMAH 2021, Poly 2021. Lecture Notes in Computer Science, vol 12921. Springer, Cham. https://doi.org/10.1007/978-3-030-93663-1_12

DOI: https://doi.org/10.1007/978-3-030-93663-1_12
Published: 01 January 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-93662-4
Online ISBN: 978-3-030-93663-1
eBook Packages: Computer Science, Computer Science (R0)
