Estimating Post-OCR Denoising Complexity on Numerical Texts

  • Conference paper
Recent Challenges in Intelligent Information and Database Systems (ACIIDS 2023)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1863)

Abstract

Post-OCR processing has significantly improved over the past few years. However, these improvements have primarily benefited texts consisting of natural, alphabetical words, as opposed to documents of a numerical nature such as invoices, payslips, and medical certificates. To evaluate the OCR post-processing difficulty of such documents, we propose a method to estimate the denoising complexity of a text. We evaluate it on several datasets of varying nature and show that texts of a numerical nature are at a significant disadvantage. To validate the estimator, we compare the estimated complexity ranking against the error rates of modern denoising approaches.

Notes

  1. https://www.digitisation.eu/

  2. en_core_web_sm from SpaCy v3.4.4, https://spacy.io/

Author information

Corresponding author

Correspondence to Arthur Hemmer.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Hemmer, A., Brachat, J., Coustaty, M., Ogier, JM. (2023). Estimating Post-OCR Denoising Complexity on Numerical Texts. In: Nguyen, N.T., et al. Recent Challenges in Intelligent Information and Database Systems. ACIIDS 2023. Communications in Computer and Information Science, vol 1863. Springer, Cham. https://doi.org/10.1007/978-3-031-42430-4_6

  • DOI: https://doi.org/10.1007/978-3-031-42430-4_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-42429-8

  • Online ISBN: 978-3-031-42430-4

  • eBook Packages: Computer Science; Computer Science (R0)
