ABSTRACT
An Electronic Health Record (EHR) is a digital document containing critical information about a patient's visit to a hospital. Because EHRs are often archived as scanned images, Optical Character Recognition (OCR) is used to extract the clinical text for analytics. OCR accuracy suffers when the scanned EHRs contain noise artifacts or when the scans are of poor quality. Clinical text sections in an EHR help locate information pertinent to a specific aspect of a patient's visit, which is vital for downstream clinical analytics activities such as medical coding, medical necessity assessment, and diagnosis identification. Section Identification is the task of identifying the different sections present in an EHR with the help of their headers. Traditionally, rule-based keyword matching and statistical approaches have been employed to solve this problem. However, these approaches rely on external lookups and knowledge bases and are therefore susceptible to the errors introduced by OCR. We propose a character-based word sequence modeling approach for Clinical Section Identification from scanned EHRs that is robust against OCR-induced errors. We also apply character augmentation techniques from the existing literature to further improve the models' robustness to OCR errors. We empirically demonstrate that our models, trained both with and without character augmentation, significantly outperform existing approaches on a medical dataset with OCR errors.
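To make the idea of character augmentation concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a small hand-written table of visually confusable characters (`OCR_CONFUSIONS`) and a hypothetical helper `augment_ocr`; neither comes from the paper, and a real pipeline would likely use a richer confusion model.

```python
import random

# Hypothetical confusion table (an assumption for illustration):
# visually similar characters that OCR engines commonly swap.
OCR_CONFUSIONS = {
    "0": "O", "O": "0", "1": "l", "l": "1", "I": "l",
    "5": "S", "S": "5", "m": "rn", "e": "c", "B": "8",
}


def augment_ocr(text, rate=0.1, seed=None):
    """Return a copy of `text` with simulated OCR character errors.

    Each character that has an entry in OCR_CONFUSIONS is replaced
    with probability `rate`; all other characters are kept as-is.
    """
    rng = random.Random(seed)  # local generator, so results are reproducible
    out = []
    for ch in text:
        if ch in OCR_CONFUSIONS and rng.random() < rate:
            out.append(OCR_CONFUSIONS[ch])
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    header = "HISTORY OF PRESENT ILLNESS"
    # Generate a few noisy variants of a section header for training.
    for i in range(3):
        print(augment_ocr(header, rate=0.3, seed=i))
```

Noisy variants produced this way could be added to the training data so that a character-level model sees corrupted section headers such as those an OCR engine might emit.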
Index Terms
- A robust section identification method for scanned electronic health records