Abstract
Most text classification algorithms have been developed and evaluated on texts written by humans and originally produced as digital text. However, in today's world of ubiquitous smartphones and readily available cameras, an ever-increasing amount of textual information comes from text captured in photographs of objects such as road and business signs, product labels and price tags, or phrases on t-shirts; the list is endless. One way to process such information is to pass the image through an Optical Character Recognition (OCR) engine and then apply a natural language processing (NLP) system to the resulting text. However, OCRed text is not quite equivalent to 'natural', human-written text, because OCR spelling errors differ from those typically committed by humans. Assuming that the distribution of human errors differs from the distribution of OCR errors, we compare how much, and in what ways, this affects classifiers. We focus on deterministic classifiers such as fuzzy search, as well as on popular neural-network-based classifiers including CNN, BERT, and RoBERTa. We found that applying a spell corrector to OCRed text increases the F1 score by 4% for CNN and by 2% for BERT.
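The deterministic fuzzy-search approach mentioned above can be illustrated with a minimal sketch: tokens from the OCRed text are matched against class keyword lists using a string-similarity ratio, so that OCR misspellings still count toward the correct class. The class labels, keyword lists, and similarity threshold below are hypothetical illustrations, not the paper's actual configuration.

```python
from difflib import SequenceMatcher

# Toy keyword lexicon per class (hypothetical labels, for illustration only).
LEXICON = {
    "price_tag": ["price", "sale", "discount", "offer"],
    "road_sign": ["street", "avenue", "exit", "parking"],
}

def fuzzy_classify(ocr_text, lexicon=LEXICON, threshold=0.8):
    """Assign the class whose keywords best fuzzy-match the OCRed tokens.

    A token counts as a hit for a class if its similarity ratio to any
    keyword reaches `threshold`, so OCR misreadings such as 'discnunt'
    still match 'discount'.
    """
    scores = {label: 0 for label in lexicon}
    for token in ocr_text.lower().split():
        for label, keywords in lexicon.items():
            if any(SequenceMatcher(None, token, kw).ratio() >= threshold
                   for kw in keywords):
                scores[label] += 1
    return max(scores, key=scores.get)

# OCR noise: 'discount' misread as 'discnunt', 'price' as 'pricc'.
print(fuzzy_classify("big discnunt on pricc today"))  # → price_tag
```

Exact keyword matching would miss both corrupted tokens here, which is why fuzzy matching (or a spell corrector applied upstream of any classifier) helps on OCRed input.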
Acknowledgments
The work was done with partial support from the Mexican Government through grant A1-S-47854 of CONACYT, Mexico, and grants 20220852 and 20220859 of the Secretaría de Investigación y Posgrado of the Instituto Politécnico Nacional, Mexico. The authors thank CONACYT for the computing resources provided through the Plataforma de Aprendizaje Profundo para Tecnologías del Lenguaje of the Laboratorio de Supercómputo of the INAOE, Mexico, and acknowledge the support of Microsoft through the Microsoft Latin America PhD Award.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Vitman, O., Kostiuk, Y., Plachinda, P., Zhila, A., Sidorov, G., Gelbukh, A. (2022). Evaluating the Impact of OCR Quality on Short Texts Classification Task. In: Pichardo Lagunas, O., Martínez-Miranda, J., Martínez Seis, B. (eds) Advances in Computational Intelligence. MICAI 2022. Lecture Notes in Computer Science(), vol 13613. Springer, Cham. https://doi.org/10.1007/978-3-031-19496-2_13
Print ISBN: 978-3-031-19495-5
Online ISBN: 978-3-031-19496-2