Challenges in Synthesizing Surrogate PHI in Narrative EMRs

Stubbs, Amber; Uzuner, Özlem; Kotfila, Christopher; Goldstein, Ira; Szolovits, Peter

doi:10.1007/978-3-319-23633-9_27



Abstract

Preparing narrative medical records for use outside of their originating institutions requires that protected health information (PHI) be removed from the records. If researchers intend to use these records for natural language processing, then preparing the medical documents requires two steps: (1) identifying the PHI and (2) replacing the PHI with realistic surrogates. In this chapter we discuss the challenges associated with generating these realistic surrogates and describe the algorithms we used to prepare the 2014 i2b2/UTHealth shared task corpus for distribution and use in a natural language processing task focused on de-identification.
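
The second step above, replacing already-identified PHI with surrogates, can be illustrated with a minimal sketch. It is not the algorithm used for the 2014 i2b2/UTHealth corpus; the function name `replace_phi`, the `SURROGATES` pools, the span format, and the sample note are all hypothetical. It assumes PHI spans have already been annotated as character offsets, and it assumes that repeated mentions of the same PHI should receive the same surrogate.

```python
from typing import Dict, List, Tuple

# Hypothetical surrogate pools; a real system would need far larger resources.
SURROGATES: Dict[str, List[str]] = {
    "NAME": ["Avery", "Morgan", "Quinn"],
    "HOSPITAL": ["Lakeside Medical Center", "Northgate Hospital"],
}

# An annotated PHI span: (start offset, end offset, PHI category).
Span = Tuple[int, int, str]


def replace_phi(text: str, spans: List[Span]) -> str:
    """Replace each annotated, non-overlapping PHI span with a surrogate."""
    mapping: Dict[Tuple[str, str], str] = {}  # (category, original) -> surrogate
    used: Dict[str, int] = {}                 # category -> next pool index
    out: List[str] = []
    cursor = 0
    for start, end, category in sorted(spans):
        original = text[start:end]
        key = (category, original.lower())
        if key not in mapping:
            # Assumption: take the next unused pool entry, wrapping if exhausted.
            pool = SURROGATES[category]
            index = used.get(category, 0)
            mapping[key] = pool[index % len(pool)]
            used[category] = index + 1
        out.append(text[cursor:start])  # untouched text before the span
        out.append(mapping[key])        # surrogate in place of the PHI
        cursor = end
    out.append(text[cursor:])           # untouched text after the last span
    return "".join(out)


# Invented sample note with hand-annotated spans.
note = "Mr. Smith was seen at Mercy Hospital. Smith reports chest pain."
spans = [(4, 9, "NAME"), (22, 36, "HOSPITAL"), (38, 43, "NAME")]
result = replace_phi(note, spans)
print(result)
```

Keying the mapping on category and lowercased text is one simple way to keep a surrogate consistent within a document; it does not handle name variants, dates, or other categories whose surrogates need more care.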


Notes

1. Somewhat confusingly, “re-identification” is also sometimes used to refer to determining a person’s true identity from de-identified data [7], so we avoid that term for the remainder of this chapter.
2. Informatics for Integrating Biology and the Bedside.
3. University of Texas Health Science Center at Houston.
4. 45 CFR 164.514.
5. http://www.asbestos.com/occupations/.

References

Berman, J.J.: Concept-match medical data scrubbing. How pathology text can be used in research. Arch. Pathol. Lab. Med. 127(6), 680–6 (2003)
Chakaravarthy, V.T., Gupta, H., Roy, P., Mohania, M.K.: Efficient techniques for document sanitization. In: Proceedings of the 17th ACM conference on Information and knowledge management, pp. 843–852 (2008)
Clifford, G.D., Scott, D.J., Villarroel, M.: User Guide and Documentation for the MIMIC II Database, database version 2.6. Available online: https://mimic.physionet.org/UserGuide/UserGuide.html (2012)
Deleger, L., Lingren, T., Ni, Y., Kaiser, M., Stoutenborough, L., Marsolo, K., Kouril, M., Molnar, K., Solti, I.: Preparing an annotated gold standard corpus to share with extramural investigators for de-identification research. J. Biomed. Inform. 50, 173–183 (2014). doi:10.1016/j.jbi.2014.01.014
Douglass, M.M.: Computer-assisted de-identification of free-text nursing notes. MEng thesis, Massachusetts Institute of Technology (2005)
Douglass, M.M., Clifford, G.D., Reisner, A., Moody, G.B., Mark, R.G.: Computer-assisted deidentification of free text in the MIMIC II database. Comput. Cardiol. 31, 341–344 (2004)
El Emam, K., Buckeridge, D., Tamblyn, R., Neisa, A., Jonker, E., Verma, A.: The re-identification risk of Canadians from longitudinal demographics. BMC Med. Inform. Decis. Mak. 11, 46 (2011)
Gardner, J., Xiong, L.: An integrated framework for de-identifying unstructured medical data. Data Knowl. Eng. 68(12), 1441–1451 (2009)
Goldberger, A.L., Amaral, L.A.N., Glass, L., Hausdorff, J.M., Ivanov, P., Mark, R.G., Mietus, J.E., Moody, G.B., Peng, C.-K., Stanley, H.E.: PhysioBank, PhysioToolkit, and Physionet: components of a new research resource for complex physiologic signals. Circulation 101(23), e215-e220 (June 13, 2000). http://circ.ahajournals.org/cgi/content/full/101/23/e215
Golle, P.: Revisiting the uniqueness of simple demographics in the US population. In: Workshop on Privacy in the Electronic Society (2006)
Gupta, D., Saul, M., Gilbertson, J.: Evaluation of a deidentification (De-Id) software engine to share pathology reports and clinical documents for research. Am. J. Clin. Pathol. 121(2), 176–186 (2004)
HHS (Department of Health and Human Services). Standards for Privacy of Individually Identifiable Health Information, 45 CFR Parts 160 and 164. December 3, 2002 Revised April 3, 2003. Available from: http://www.hhs.gov/ocr/privacy/hipaa/understanding/coveredentities/introdution.html
Jiang, W., Murugesan, M., Clifton, C., Si, L.: t-Plausibility: semantic preserving text sanitization. In: 2009 International Conference on Computational Science and Engineering (CSE), pp. 68–75 (2009). doi:10.1109/CSE.2009.353
Kumar, V., Stubbs, A., Shaw, S., Uzuner, Ö.: Creation of a new longitudinal corpus of clinical narratives. J. Biomed. Inform. (2015)
Kushida, C.A., Nichols, D.A., Jadrnicek, R., Miller, R., Walsh, J.K., Griffin, K.: Strategies for de-identification and anonymization of electronic health record data for use in multicenter research studies. Med. Care 50, S82–S101 (2012)
Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the 18th International Conference on Machine Learning, pp. 282–289. Morgan Kaufmann, San Francisco (2001)
Lafky, D.: The Safe Harbor method of de-identification: an empirical test. Fourth National HIPAA Summit West. http://www.ehcca.com/presentations/HIPAAWest4/lafky_2.pdf (2010)
Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. Doklady Akademii Nauk SSSR. 163(4), 845–848 (1965) [Russian]. English translation in Sov. Phys. Dokl. 10(8), 707–710 (1966)
Li, M., Carrell, D., Aberdeen, J., Hirschman, L., Malin, B.: De-identification of clinical narratives through writing complexity measures. Int. J. Med. Inform. 83(10), 750–767 (2014)
McMurry, A.J., Fitch, B., Savova, G., Kohane, I.S., Reis, B.Y.: Improved de-identification of physician notes through integrative modeling of both public and private medical text. BMC Med. Inform. Decis. Mak. 13, 112 (2013). doi:10.1186/1472-6947-13-112
Meystre, S., Friedlin, F., South, B., Shen, S., Samore, M.: Automatic de-identification of textual documents in the electronic health record: a review of recent research. BMC Med. Res. Methodol. 10, 70 (2010)
Meystre, S., Shen, S., Hofmann, D., Gundlapalli, A.: Can physicians recognize their own patients in de-identified notes? Stud. Health Technol. Inform. 205, 778–782 (2014)
Neamatullah, I., Douglass, M., Lehman, L.-W., Reisner, A., Villarroel, M., Long, W., Szolovits, P., Moody, G., Mark, R., Clifford, G.: Automated de-identification of free-text medical records. BMC Med. Inform. Decis. Mak. 8, 32 (2008)
Stubbs, A., Kotfila, C., Uzuner, Ö.: Automated systems for the de-identification of longitudinal clinical narratives. J. Biomed. Inform. 2015 Jul 28. pii: S1532-0464(15)00117-3. doi:10.1016/j.jbi.2015.06.007
Stubbs, A., Uzuner, Ö.: Annotating longitudinal clinical narratives for de-identification: the 2014 i2b2/UTHealth corpus. J. Biomed. Inform. 2015 Aug 28. pii: S1532-0464(15)00182-3. doi:10.1016/j.jbi.2015.07.020
Sun, W., Rumshisky, A., Uzuner, Ö.: Evaluating temporal relations in clinical text: 2012 i2b2 Challenge. J. Am. Med. Inform. Assoc. Published Online First 5 April 2013
Sweeney, L.: Replacing personally-identifying information in medical records, the scrub system. In: Cimino, J.J. (ed.) Proceedings, Journal of the American Medical Informatics Association, pp. 333–337. Hanley and Belfus, Washington (1996)
Sweeney, L.: Uniqueness of Simple Demographics in the U.S. Population. Carnegie Mellon University, School of Computer Science, Data Privacy Laboratory, Technical Report LIDAP-WP4. Pittsburgh (2000)
Uzuner, Ö., Luo, Y., Szolovits, P.: Evaluating the state-of-the-art in automatic de-identification. J. Am. Med. Inform. Assoc. 14(5), 550–563 (2007)
Uzuner, Ö., Stubbs, A., Xu, H., co-chairs.: “Data Release and Call for Participation: 2014 i2b2/UTHealth Shared-Tasks and Workshop on Challenges in Natural Language Processing for Clinical Data”. https://www.i2b2.org/NLP/HeartDisease/
Yeniterzi, R., Aberdeen, J., Bayer, S., Wellner, B., Hirschman, L., Malin, B.: Effects of personal identifier resynthesis on clinical text de-identification. J. Am. Med. Inform. Assoc. 17, 159–168 (2010)


Acknowledgements

This project was funded by NIH NLM 2U54LM008748 (PI: Isaac Kohane) and by NIH NLM 5R13LM011411 (PI: Özlem Uzuner).

Author information

Authors and Affiliations

School of Library and Information Science, Simmons College, Boston, MA, USA
Amber Stubbs
State University of New York, Albany, NY, USA
Özlem Uzuner & Christopher Kotfila
Department of Computer Science, Siena College, Loudonville, NY, USA
Ira Goldstein
Department of Computer Science and Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
Peter Szolovits


Corresponding author

Correspondence to Amber Stubbs.

Editor information

Editors and Affiliations

IBM Research - Ireland, Mulhuddart, Dublin, Ireland
Aris Gkoulalas-Divanis
Cardiff University, Cardiff, United Kingdom
Grigorios Loukides


About this chapter

Cite this chapter

Stubbs, A., Uzuner, Ö., Kotfila, C., Goldstein, I., Szolovits, P. (2015). Challenges in Synthesizing Surrogate PHI in Narrative EMRs. In: Gkoulalas-Divanis, A., Loukides, G. (eds) Medical Data Privacy Handbook. Springer, Cham. https://doi.org/10.1007/978-3-319-23633-9_27


DOI: https://doi.org/10.1007/978-3-319-23633-9_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-23632-2
Online ISBN: 978-3-319-23633-9
eBook Packages: Computer Science, Computer Science (R0)
