short-paper

A robust section identification method for scanned electronic health records

Authors:
Anand Subramanian (BUDDI AI, India; ORCID 0000-0003-4711-5457)
Praveen Kumar Suresh (BUDDI AI, India; ORCID 0000-0001-7513-8408)
Sudarsun Santhiappan (BUDDI AI, India; ORCID 0000-0001-5769-2405)

CODS-COMAD '23: Proceedings of the 6th Joint International Conference on Data Science & Management of Data (10th ACM IKDD CODS and 28th COMAD), January 2023, Pages 213–217. https://doi.org/10.1145/3570991.3571011

Published: 04 January 2023

ABSTRACT

An Electronic Health Record (EHR) is a digital document containing critical information concerning a patient’s visit to a hospital. However, since EHRs are often archived as scanned images, Optical Character Recognition (OCR) is used to extract the clinical text for analytics. The accuracy of OCR is compromised when the scanned EHRs contain noise artifacts or when the scans are of poor quality. Clinical text sections in the EHR help precisely locate information pertinent to a specific aspect of a patient’s visit, which is vital for downstream clinical analytics activities such as medical coding, medical necessity assessment, and diagnosis identification. Section Identification is the task of identifying the different sections present in an EHR with the help of their headers. Traditionally, rule-based keyword matching and statistical approaches have been employed to solve this problem. However, these approaches rely on external lookups and knowledge bases and are therefore susceptible to the errors introduced by OCR. We propose a character-based word sequence modeling approach for Clinical Section Identification from scanned EHRs that is robust against OCR-induced errors. We also apply character augmentation techniques from existing literature to further improve the models’ robustness to OCR errors. We empirically demonstrate that our models, trained with and without character augmentation, significantly outperform existing approaches on a medical dataset with OCR errors.
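To make the failure mode described above concrete, the following is a minimal sketch, not the authors' implementation, of character-level augmentation that imitates OCR confusions, together with an exact-match header lookup that breaks on the corrupted text. The confusion table `OCR_CONFUSIONS`, the header set `KNOWN_HEADERS`, and the function names are hypothetical and chosen only for illustration; the paper's actual augmentation scheme and section labels are not given in this abstract.

```python
import random

# Hypothetical OCR confusion pairs, chosen for illustration only.
OCR_CONFUSIONS = {
    "O": ["0"], "0": ["O"],
    "I": ["1", "l"], "l": ["1", "I"], "1": ["l", "I"],
    "S": ["5"], "5": ["S"],
    "B": ["8"], "E": ["F"], "m": ["rn"],
}


def ocr_augment(text, rate=0.2, seed=None):
    """Replace each confusable character with probability `rate`."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        options = OCR_CONFUSIONS.get(ch)
        if options and rng.random() < rate:
            out.append(rng.choice(options))
        else:
            out.append(ch)
    return "".join(out)


# Hypothetical header lexicon standing in for an external lookup.
KNOWN_HEADERS = {
    "HISTORY OF PRESENT ILLNESS",
    "MEDICATIONS",
    "ASSESSMENT AND PLAN",
}


def exact_lookup(header):
    """Keyword-style matching: succeeds only on an exact string match."""
    return header in KNOWN_HEADERS


clean = "HISTORY OF PRESENT ILLNESS"
noisy = ocr_augment(clean, rate=1.0, seed=0)  # corrupt every confusable char
print(noisy)
print(exact_lookup(clean), exact_lookup(noisy))
```

With `rate=1.0` every confusable character is substituted, so the exact lookup succeeds on the clean header and fails on the corrupted one. Applying such corruption to training text at a lower rate is one plausible way to expose a character-based model to OCR-like noise.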

Index Terms

A robust section identification method for scanned electronic health records
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Information extraction

Published in

CODS-COMAD '23: Proceedings of the 6th Joint International Conference on Data Science & Management of Data (10th ACM IKDD CODS and 28th COMAD)
January 2023
357 pages
ISBN:9781450397971
DOI:10.1145/3570991

Copyright © 2023 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 4 January 2023
Author Tags
CNN
Clinical Section Identification
Deep Learning
LSTM
Natural Language Processing
OCR
Qualifiers
- short-paper
- Research
- Refereed limited
Acceptance Rates
Overall Acceptance Rate: 197 of 680 submissions, 29%