ESTER-Pt: An Evaluation Suite for TExt Recognition in Portuguese

  • Conference paper
Document Analysis and Recognition - ICDAR 2023 (ICDAR 2023)

Abstract

Optical Character Recognition (OCR) is a technology that enables machines to read and interpret printed or handwritten texts from scanned images or photographs. However, the accuracy of OCR systems can vary depending on several factors, such as the quality of the input image, the font used, and the language of the document. As a general tendency, OCR algorithms perform better in resource-rich languages as they have more annotated data to train the recognition process. In this work, we propose ESTER-Pt, an Evaluation Suite for TExt Recognition in Portuguese. Despite being one of the largest languages in terms of speakers, OCR in Portuguese remains largely unexplored. Our evaluation suite comprises four types of resources: synthetic text-based documents, synthetic image-based documents, real scanned documents, and a hybrid set with real image-based documents that were synthetically degraded. Additionally, we provide results of OCR engines and post-OCR correction tools on ESTER-Pt, which can serve as a baseline for future work.
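
The abstract does not say which metrics the baselines are scored with. As a hedged illustration only, one common way to compare an OCR engine's output against a ground-truth transcription is the character error rate (CER): the edit distance between the two strings divided by the length of the reference. The sketch below assumes that metric; the function names `levenshtein` and `character_error_rate` are hypothetical and are not taken from the paper or from any of the tools it evaluates.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `hypothesis` into `reference`."""
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,                       # reference char missing from hypothesis
                curr[j - 1] + 1,                   # extra char in hypothesis
                prev[j - 1] + substitution_cost,   # match or substitution
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalised by reference length (assumed metric)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Illustrative strings only; not drawn from ESTER-Pt.
    ground_truth = "reconhecimento de texto"
    ocr_output = "reconhecirnento de texto"
    print(character_error_rate(ground_truth, ocr_output))
```

Under this assumption, a post-OCR correction tool could be assessed by computing the rate before and after correction against the same ground truth; a lower value after correction would indicate an improvement.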

Notes

  1. https://zenodo.org/record/7872951#.ZEue0XbMJhE

  2. https://en.unesco.org/sites/default/files/accord_unesco_langue_portuguaise_conference_generale_eng.pdf

  3. https://dumps.wikimedia.org/ptwiki/

  4. http://www.dominiopublico.gov.br/

  5. http://bndigital.bn.gov.br/acervodigital/

  6. http://bndigital.bn.gov.br/acervodigital/

  7. https://www.gutenberg.org/

  8. https://github.com/tesseract-ocr/tesseract

  9. https://cloud.google.com/document-ai

  10. https://github.com/wolfgarbe/SymSpell

  11. https://github.com/impactcentre/ocrevalUAtion

Acknowledgment

This work has been financed in part by CAPES Finance Code 001 and CNPq/Brazil.

Author information

Corresponding author

Correspondence to Moniele Kunrath Santos.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Santos, M.K., Bazzo, G., de Oliveira, L.L., Moreira, V.P. (2023). ESTER-Pt: An Evaluation Suite for TExt Recognition in Portuguese. In: Fink, G.A., Jain, R., Kise, K., Zanibbi, R. (eds) Document Analysis and Recognition - ICDAR 2023. ICDAR 2023. Lecture Notes in Computer Science, vol 14189. Springer, Cham. https://doi.org/10.1007/978-3-031-41682-8_23

  • DOI: https://doi.org/10.1007/978-3-031-41682-8_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-41681-1

  • Online ISBN: 978-3-031-41682-8

  • eBook Packages: Computer Science, Computer Science (R0)
