Estimating Post-OCR Denoising Complexity on Numerical Texts

Hemmer, Arthur; Brachat, Jérôme; Coustaty, Mickaël; Ogier, Jean-Marc

doi:10.1007/978-3-031-42430-4_6


Part of the book series: Communications in Computer and Information Science (CCIS, volume 1863)

Included in the following conference series:

Asian Conference on Intelligent Information and Database Systems

Abstract

Post-OCR processing has improved significantly over the past few years. However, these improvements have primarily benefited texts consisting of natural, alphabetical words, as opposed to documents of a numerical nature such as invoices, payslips, and medical certificates. To evaluate the OCR post-processing difficulty of such documents, we propose a method to estimate the denoising complexity of a text, apply it to several datasets of varying nature, and show that texts of a numerical nature are at a significant disadvantage. We validate the estimator by comparing the estimated complexity ranking with the error rates of modern denoising approaches.
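
The abstract does not say how the estimated ranking is compared with the observed error rates. As a minimal sketch of one way such a comparison could be made, the Python below computes Spearman's rank correlation between per-dataset complexity estimates and denoiser error rates. The choice of Spearman's coefficient, the function names, the dataset labels, and all numbers are illustrative assumptions and are not taken from the paper.

```python
from typing import Dict, List


def ranks(values: List[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = sum((a - mean_x) ** 2 for a in rx) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in ry) ** 0.5
    if sd_x == 0 or sd_y == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


# Hypothetical placeholder values, NOT results from the paper.
estimated_complexity: Dict[str, float] = {
    "corpus_a": 0.10,
    "corpus_b": 0.25,
    "corpus_c": 0.40,
}
observed_error_rate: Dict[str, float] = {
    "corpus_a": 0.02,
    "corpus_b": 0.05,
    "corpus_c": 0.09,
}

names = sorted(estimated_complexity)
rho = spearman(
    [estimated_complexity[n] for n in names],
    [observed_error_rate[n] for n in names],
)
print(f"Spearman rho over {len(names)} datasets: {rho:.3f}")
```

A coefficient near 1 would mean the estimator orders the datasets the same way the denoisers' error rates do; a value near 0 would mean the two orderings are unrelated.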

Notes

1. https://www.digitisation.eu/
2. en_core_web_sm from spaCy v3.4.4, https://spacy.io/

Author information

Authors and Affiliations

Shift Technology, Paris, France
Arthur Hemmer & Jérôme Brachat
La Rochelle Université, L3i - La Rochelle Université, Avenue Michel Crépeau, 17042, La Rochelle, France
Arthur Hemmer, Mickaël Coustaty & Jean-Marc Ogier

Corresponding author

Correspondence to Arthur Hemmer.

Editor information

Editors and Affiliations

Wrocław University of Technology, Wrocław, Poland
Ngoc Thanh Nguyen
King Mongkut's Institute of Technology Ladkrabang, Bangkok, Thailand
Siridech Boonsang
Iwate Prefectural University, Iwate, Japan
Hamido Fujita
Wrocław University of Science and Technology, Wrocław, Poland
Bogumiła Hnatkowska
National University of Kaohsiung, Kaohsiung, Taiwan
Tzung-Pei Hong
King Mongkut's Institute of Technology, Ladkrabang, Thailand
Kitsuchart Pasupa
Malaysia Japan International Institute of Technology, Kuala Lumpur, Malaysia
Ali Selamat

Cite this paper

Hemmer, A., Brachat, J., Coustaty, M., Ogier, JM. (2023). Estimating Post-OCR Denoising Complexity on Numerical Texts. In: Nguyen, N.T., et al. Recent Challenges in Intelligent Information and Database Systems. ACIIDS 2023. Communications in Computer and Information Science, vol 1863. Springer, Cham. https://doi.org/10.1007/978-3-031-42430-4_6

DOI: https://doi.org/10.1007/978-3-031-42430-4_6
Published: 29 September 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-42429-8
Online ISBN: 978-3-031-42430-4
eBook Packages: Computer Science, Computer Science (R0)
