DOI: 10.1145/3570991.3571011
short-paper

A robust section identification method for scanned electronic health records

Published: 04 January 2023

ABSTRACT

An Electronic Health Record (EHR) is a digital document containing critical information about a patient’s visit to a hospital. Because EHRs are often archived as scanned images, Optical Character Recognition (OCR) is used to extract the clinical text for analytics. OCR accuracy degrades when the scans contain noise artifacts or are of poor quality. Clinical text sections in an EHR help locate information pertinent to a specific aspect of a patient’s visit, which is vital for downstream clinical analytics activities such as medical coding, medical necessity assessment, and diagnosis identification. Section identification is the task of identifying the sections present in an EHR with the help of their headers. Traditionally, rule-based keyword matching and statistical approaches have been employed to solve this problem. However, these approaches rely on external lookups and knowledge bases and are therefore susceptible to the errors introduced by OCR. We propose a character-based word sequence modeling approach for clinical section identification from scanned EHRs that is robust against OCR-induced errors. We also apply character augmentation techniques from the existing literature to further improve the models’ robustness to OCR errors. We empirically demonstrate that our models, trained with and without character augmentation, significantly outperform existing approaches on a medical dataset with OCR errors.
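
To make the failure mode described in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a small hand-written table of visually confusable characters (`OCR_CONFUSIONS`) and a toy header lexicon (`KNOWN_HEADERS`); both are hypothetical and chosen only for illustration. The sketch shows how a character-level augmentation can simulate OCR-style noise, and why an exact keyword lookup stops matching once a header has been corrupted.

```python
import random

# Hypothetical confusion table of visually similar characters.
# Illustrative only; not taken from the paper or from any OCR engine.
OCR_CONFUSIONS = {
    "o": "0", "0": "o",
    "l": "1", "1": "l",
    "i": "l",
    "s": "5", "5": "s",
    "e": "c",
    "b": "6",
}


def ocr_augment(text, rate=0.1, seed=None):
    """Replace each confusable character with probability `rate`.

    Lookup is case-insensitive, so the original letter case is not
    preserved for substituted characters. Text length is unchanged.
    """
    rng = random.Random(seed)
    out = []
    for ch in text:
        sub = OCR_CONFUSIONS.get(ch.lower())
        if sub is not None and rng.random() < rate:
            out.append(sub)
        else:
            out.append(ch)
    return "".join(out)


# Hypothetical lexicon standing in for a rule-based keyword-matching baseline.
KNOWN_HEADERS = {"history of present illness", "medications", "allergies"}


def keyword_match(line):
    """Exact, case-insensitive lookup of a line in the header lexicon."""
    return line.strip().lower() in KNOWN_HEADERS


clean = "History of Present Illness"
# rate=1.0 substitutes every confusable character, giving a worst case.
noisy = ocr_augment(clean, rate=1.0, seed=0)

print(clean, "->", keyword_match(clean))
print(noisy, "->", keyword_match(noisy))
```

In a training pipeline, a function like `ocr_augment` would typically be applied with a small `rate` to produce noisy copies of clean section headers, so that a character-based model sees corrupted spellings during training. The exact augmentation scheme and rates used in the paper are not stated in the abstract.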


    • Published in

      CODS-COMAD '23: Proceedings of the 6th Joint International Conference on Data Science & Management of Data (10th ACM IKDD CODS and 28th COMAD)
      January 2023
      357 pages
      ISBN: 9781450397971
      DOI: 10.1145/3570991

      Copyright © 2023 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Qualifiers

      • short-paper
      • Research
      • Refereed limited

      Acceptance Rates

      Overall Acceptance Rate: 197 of 680 submissions, 29%