Abstract
Post-OCR processing has improved significantly over the past few years. However, these improvements have primarily benefited texts consisting of natural, alphabetical words, as opposed to documents of a numerical nature such as invoices, payslips, and medical certificates. To assess how difficult OCR post-processing is for such documents, we propose a method to estimate the denoising complexity of a text. We evaluate it on several datasets of varying nature and show that texts of a numerical nature are at a significant disadvantage. To validate the estimator, we compare the estimated complexity ranking against the error rates of modern denoising approaches.
Notes
1. https://www.digitisation.eu/.
2. en_core_web_sm from SpaCy v3.4.4 from https://spacy.io/.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Hemmer, A., Brachat, J., Coustaty, M., Ogier, JM. (2023). Estimating Post-OCR Denoising Complexity on Numerical Texts. In: Nguyen, N.T., et al. Recent Challenges in Intelligent Information and Database Systems. ACIIDS 2023. Communications in Computer and Information Science, vol 1863. Springer, Cham. https://doi.org/10.1007/978-3-031-42430-4_6
DOI: https://doi.org/10.1007/978-3-031-42430-4_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-42429-8
Online ISBN: 978-3-031-42430-4
eBook Packages: Computer Science (R0)