Abstract
Optical Character Recognition (OCR) is a technology that enables machines to read and interpret printed or handwritten text from scanned images or photographs. The accuracy of OCR systems varies with several factors, such as the quality of the input image, the font used, and the language of the document. OCR algorithms generally perform better on resource-rich languages, for which more annotated data is available to train the recognition process. In this work, we propose ESTER-Pt, an Evaluation Suite for TExt Recognition in Portuguese. Although Portuguese is one of the largest languages in terms of speakers, OCR for Portuguese remains largely unexplored. Our evaluation suite comprises four types of resources: synthetic text-based documents, synthetic image-based documents, real scanned documents, and a hybrid set of real image-based documents that were synthetically degraded. Additionally, we provide results of OCR engines and post-OCR correction tools on ESTER-Pt, which can serve as a baseline for future work.
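The abstract does not state which metrics are used to score OCR output against ground truth. As an illustration only, the sketch below assumes character error rate (CER), a common choice for this kind of evaluation: the edit distance between the OCR output and the reference text, divided by the reference length. The function names `levenshtein` and `character_error_rate` are hypothetical and are not taken from ESTER-Pt.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    # Keep the shorter string on the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / reference length (assumed metric, see lead-in)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical example: the OCR engine read the letter "l" as the digit "1".
    print(character_error_rate("livro", "1ivro"))
```

Under this definition, a lower value means the OCR output is closer to the ground truth, and the value can exceed 1.0 when the output is much longer than the reference.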
Acknowledgment
This work has been financed in part by CAPES Finance Code 001 and CNPq/Brazil.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Santos, M.K., Bazzo, G., de Oliveira, L.L., Moreira, V.P. (2023). ESTER-Pt: An Evaluation Suite for TExt Recognition in Portuguese. In: Fink, G.A., Jain, R., Kise, K., Zanibbi, R. (eds) Document Analysis and Recognition - ICDAR 2023. ICDAR 2023. Lecture Notes in Computer Science, vol 14189. Springer, Cham. https://doi.org/10.1007/978-3-031-41682-8_23
DOI: https://doi.org/10.1007/978-3-031-41682-8_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-41681-1
Online ISBN: 978-3-031-41682-8
eBook Packages: Computer Science, Computer Science (R0)