SynthNotes: A Generator Framework for High-volume, High-fidelity Synthetic Mental Health Notes

Begoli, Edmon; Brown, Kris; Srinivasan, Sudarshan; Tamang, Suzanne

doi:10.1109/BigData.2018.8621981

Conference · December 1, 2018

DOI: https://doi.org/10.1109/BigData.2018.8621981 · OSTI ID: 1507868

Begoli, Edmon ^[1]; Brown, Kris ^[1]; Srinivasan, Sudarshan ^[1]; Tamang, Suzanne ^[2]

[1] Oak Ridge National Laboratory (ORNL)
[2] Stanford University

A key emerging challenge connecting the "Big Data" and AI domains is the availability of sufficient volumes of training data for AI and machine learning tasks. SynthNotes is a framework for generating standards-compliant, realistic mental health progress report notes at very large, population-level scale and in a strictly privacy-preserving manner. Motivated by the need to explore, evaluate, and train computational methods that address the emerging mental health crisis in the US, the framework is useful for benchmarking, optimizing, and training biomedical natural language processing, information extraction, and machine learning systems intended to operate at "Big Data" scale (billions of notes). The free-text notes generated by SynthNotes are based on the literature and on public statistical models, allowing for a realistic natural language representation of a patient and his or her mental health characteristics. Additionally, SynthNotes can partially simulate the stylistic, grammatical, and expressive characteristics of a licensed mental health professional. SynthNotes is modular and flexible, allowing for representation of a variety of conditions, incorporation of alternative foundational models, and parametrization of the variability of the structure, content, and size of the synthetically generated corpus. In this paper, we report on the initial use and performance characteristics of the SynthNotes framework and on ongoing work to incorporate content planning and deep learning-based generative methods trained on real data.

Research Organization: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)

Sponsoring Organization: USDOE

DOE Contract Number: AC05-00OR22725

OSTI ID: 1507868

Resource Relation: Conference: 2018 IEEE International Conference on Big Data - Seattle, Washington, United States of America - 8/10/2018 8:00:00 AM-8/13/2018 8:00:00 AM

Country of Publication: United States

Language: English
