
Differentially Private Medical Texts Generation Using Generative Neural Networks

Published: 15 October 2021

Abstract

Technological advancements in data science have given us affordable storage and efficient algorithms for querying large volumes of data. Our health records are a significant part of this data; they are pivotal for healthcare providers and can be used to improve our well-being. Clinical notes in electronic health records are one such category: free-text records that capture a patient's complete medical information at different stages of patient care. These unstructured notes document events from a patient's admission to discharge and can therefore inform future medical decisions. However, because these texts also contain sensitive information about the patient and the attending medical professionals, such notes cannot be shared publicly. This privacy constraint has thwarted timely discoveries from this wealth of untapped information. In this work, we therefore generate synthetic medical texts from a private or sanitized (de-identified) clinical text corpus and rigorously analyze their utility across different metrics and levels. Experimental results support the applicability of the generated data, which achieves more than 80% accuracy on different pragmatic classification problems and matches (or outperforms) the original text data.
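
The full training procedure is described in the article body; as a rough, self-contained illustration of the core idea (training a generative language model under differential privacy and then sampling synthetic notes from it), the sketch below applies DP-SGD-style updates (per-example gradient clipping plus Gaussian noise) to a tiny character-level language model in plain PyTorch. The toy corpus, model size, clipping norm, noise multiplier, and learning rate are illustrative assumptions, not the authors' model, data, or privacy budget.

# Illustrative sketch only: a tiny character-level LSTM language model trained with
# DP-SGD-style updates (per-example gradient clipping + Gaussian noise on the summed
# gradients). The toy corpus and hyperparameters are assumptions for demonstration,
# not the authors' configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

corpus = "patient admitted with chest pain. discharged in stable condition. "
vocab = sorted(set(corpus))
stoi = {c: i for i, c in enumerate(vocab)}

def batch(seq_len=20, batch_size=8):
    # Sample random (input, target) character windows from the toy corpus.
    ids = torch.tensor([stoi[c] for c in corpus])
    starts = torch.randint(0, len(ids) - seq_len - 1, (batch_size,)).tolist()
    x = torch.stack([ids[s:s + seq_len] for s in starts])
    y = torch.stack([ids[s + 1:s + seq_len + 1] for s in starts])
    return x, y

class CharLM(nn.Module):
    def __init__(self, vocab_size, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.out(h)

model = CharLM(len(vocab))
clip_norm, noise_multiplier, lr = 1.0, 1.1, 0.05  # assumed hyperparameters

for step in range(200):
    x, y = batch()
    summed = [torch.zeros_like(p) for p in model.parameters()]
    for i in range(x.size(0)):
        # Per-example gradient, clipped to bound any single record's influence.
        model.zero_grad()
        logits = model(x[i:i + 1])
        loss = F.cross_entropy(logits.reshape(-1, len(vocab)), y[i:i + 1].reshape(-1))
        loss.backward()
        grads = [p.grad.detach().clone() for p in model.parameters()]
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = min(1.0, clip_norm / (norm.item() + 1e-6))
        for acc, g in zip(summed, grads):
            acc.add_(g, alpha=scale)
    with torch.no_grad():
        for p, acc in zip(model.parameters(), summed):
            # Gaussian noise calibrated to the clipping norm provides the DP guarantee.
            noise = torch.randn_like(acc) * noise_multiplier * clip_norm
            p.add_((acc + noise) / x.size(0), alpha=-lr)

After private training, synthetic notes would be sampled from the model token by token; because the differential privacy guarantee attaches to the training procedure itself, the sampled texts inherit that guarantee and can then be analyzed for downstream utility.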


Published in

ACM Transactions on Computing for Healthcare, Volume 3, Issue 1 (January 2022), 255 pages
EISSN: 2637-8051
DOI: 10.1145/3485154
            Copyright © 2021 Association for Computing Machinery.

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 1 July 2020
• Revised: 1 May 2021
• Accepted: 1 May 2021
• Published: 15 October 2021

