ESTER-Pt: An Evaluation Suite for TExt Recognition in Portuguese

Santos, Moniele Kunrath; Bazzo, Guilherme; de Oliveira, Lucas Lima; Moreira, Viviane Pereira

doi:10.1007/978-3-031-41682-8_23

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14189)

Included in the following conference series:

International Conference on Document Analysis and Recognition

Abstract

Optical Character Recognition (OCR) is a technology that enables machines to read and interpret printed or handwritten text from scanned images or photographs. However, the accuracy of OCR systems varies with several factors, such as the quality of the input image, the font used, and the language of the document. In general, OCR algorithms perform better in resource-rich languages, which have more annotated data for training the recognition process. Although Portuguese is one of the most widely spoken languages, OCR for Portuguese remains largely unexplored. In this work, we propose ESTER-Pt, an Evaluation Suite for TExt Recognition in Portuguese. Our evaluation suite comprises four types of resources: synthetic text-based documents, synthetic image-based documents, real scanned documents, and a hybrid set of real image-based documents that were synthetically degraded. Additionally, we provide results of OCR engines and post-OCR correction tools on ESTER-Pt, which can serve as a baseline for future work.
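
The abstract does not say which metrics the baselines report. As an illustration only, the sketch below computes character error rate (CER) and word error rate (WER), two measures commonly used to compare OCR output against a ground-truth transcription. The function names and the sample strings are hypothetical and are not taken from ESTER-Pt.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: insertions, deletions and substitutions cost 1 each."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: the same ratio over whitespace-separated tokens."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    # Hypothetical example: the OCR output loses the cedilla and the tilde.
    truth = "o coração bate"
    ocr_output = "o coracao bate"
    print(f"CER = {cer(truth, ocr_output):.3f}")
    print(f"WER = {wer(truth, ocr_output):.3f}")
```

Diacritics are a frequent source of OCR errors in Portuguese, which is why the sample strings differ only in accented characters. Both functions normalize by the length of the reference, so a noisy hypothesis that is much longer than the reference can score above 1.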

Acknowledgment

This work has been financed in part by CAPES Finance Code 001 and CNPq/Brazil.

Author information

Authors and Affiliations

Institute of Informatics, Federal University of Rio Grande do Sul, Porto Alegre, Brazil
Moniele Kunrath Santos, Guilherme Bazzo, Lucas Lima de Oliveira & Viviane Pereira Moreira

Corresponding author

Correspondence to Moniele Kunrath Santos.

Editor information

Editors and Affiliations

TU Dortmund University, Dortmund, Germany
Gernot A. Fink
Adobe, College Park, MD, USA
Rajiv Jain
Osaka Metropolitan University, Osaka, Japan
Koichi Kise
Rochester Institute of Technology, Rochester, NY, USA
Richard Zanibbi

Cite this paper

Santos, M.K., Bazzo, G., de Oliveira, L.L., Moreira, V.P. (2023). ESTER-Pt: An Evaluation Suite for TExt Recognition in Portuguese. In: Fink, G.A., Jain, R., Kise, K., Zanibbi, R. (eds) Document Analysis and Recognition - ICDAR 2023. ICDAR 2023. Lecture Notes in Computer Science, vol 14189. Springer, Cham. https://doi.org/10.1007/978-3-031-41682-8_23

DOI: https://doi.org/10.1007/978-3-031-41682-8_23
Published: 19 August 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-41681-1
Online ISBN: 978-3-031-41682-8
eBook Packages: Computer Science, Computer Science (R0)