
Evaluating and mitigating the impact of OCR errors on information retrieval

Published in: International Journal on Digital Libraries

Abstract

Optical character recognition (OCR) is typically used to extract the textual contents of scanned documents. The output of OCR can be noisy, especially when the quality of the scanned image is poor, which in turn can impact downstream tasks such as information retrieval (IR). Post-processing OCR-ed documents is a way to fix digitization errors and, intuitively, improve the results of downstream tasks. This work evaluates the impact of OCR digitization and correction on IR. We compared different digitization and correction methods on real OCR-ed data from an IR test collection with 22k documents and 34 query topics in the geoscientific domain in Portuguese. Our results showed significant differences in IR metrics across digitization methods (up to 5 percentage points in terms of mean average precision). Regarding the impact of error correction, retrieval quality metrics changed very little on average over the complete set of query topics. However, a more detailed analysis revealed that correction improved 19 out of 34 query topics. Our findings indicate that, contrary to previous work, long documents are affected by OCR errors.

Code Availability

The code we implemented to run the experiments in this article is available at https://github.com/lucaslioli/solr-query-script. The datasets generated and analyzed during the current study are available at https://github.com/Petroles/regis-collection and https://github.com/lucaslioli/regis-collection-gs.

Notes

  1. According to Google Trends: https://trends.google.com/trends/explore?date=all&q=%2Fm%2F0600q.

  2. https://www.pdfa.org/wp-content/uploads/2018/06/1330_Johnson.pdf.

  3. A digitization error happens when the OCR software fails to correctly recognize the characters in the input document. This is different from misspellings, which are human-generated.

  4. Clastic is an adjective that describes a type of rock consisting of broken pieces of other rocks (Cambridge Dictionary).

  5. https://github.com/Petroles/regis-collection.

  6. https://github.com/lucaslioli/regis-collection-gs.

  7. https://tika.apache.org/.

  8. https://github.com/tesseract-ocr/tesseract.

  9. https://www.abbyy.com/.

  10. https://petroles.puc-rio.ai/index_en.html, see tab Development in progress.

  11. https://github.com/freedesktop/poppler.

  12. https://github.com/facebookresearch/detectron2.

  13. https://github.com/pdfminer/pdfminer.six.

  14. https://github.com/camelot-dev/camelot.

  15. https://github.com/spotify/luigi.

  16. https://github.com/wolfgarbe/SymSpell.

  17. https://lucene.apache.org/solr/.

  18. https://trec.nist.gov/trec_eval/.

  19. https://github.com/impactcentre/ocrevalUAtion.

  20. The strict and tolerant scenarios only affect metrics that use binary relevance judgments (i.e., relevant/not relevant). MAP is one such metric. NDCG, on the other hand, works by definition with multiple levels of relevance.

  21. REGIS documents have a total of 2.4 million pages. The costs mentioned by [22] range between US$1.50 and US$60 per 1000 pages.
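
The distinction drawn in note 20 can be sketched in a few lines of Python. This is a minimal illustration with hypothetical relevance judgments and the standard textbook formulas for AP and NDCG, not the paper's evaluation code (the authors used trec_eval); it shows why binarizing graded judgments into "strict" and "tolerant" sets changes MAP but leaves NDCG untouched.

```python
import math

def average_precision(ranked, relevant):
    """AP (the per-topic component of MAP): needs a binary relevant/not-relevant set."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def ndcg(ranked, gains):
    """NDCG: consumes graded relevance (a gain per document) directly."""
    dcg = sum(gains.get(d, 0) / math.log2(r + 1) for r, d in enumerate(ranked, start=1))
    ideal = sorted(gains.values(), reverse=True)[:len(ranked)]
    idcg = sum(g / math.log2(r + 1) for r, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0

# Hypothetical graded judgments: 2 = highly relevant, 1 = partially relevant.
graded = {"d1": 2, "d2": 1, "d3": 2}
ranking = ["d1", "d4", "d2", "d3"]

# Strict scenario: only grade-2 documents count as relevant.
strict = {d for d, g in graded.items() if g >= 2}
# Tolerant scenario: any positive grade counts as relevant.
tolerant = {d for d, g in graded.items() if g >= 1}

print(average_precision(ranking, strict))    # strict MAP for this topic
print(average_precision(ranking, tolerant))  # tolerant MAP for this topic
print(ndcg(ranking, graded))                 # identical under either scenario
```

The two AP values differ because the binarization threshold changes which documents count as hits, while the single NDCG value is computed from the graded gains themselves.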

References

  1. Bazzo, G.T., Lorentz, G.A., Vargas, D.S., et al.: Assessing the impact of OCR errors in information retrieval. In: European Conference on Information Retrieval, pp. 102–109 (2020)

  2. Bender, E.M.: On achieving and evaluating language-independence in NLP. Linguist. Issues Lang. Technol. 6 (2011)

  3. Bia, A., Muñoz, R., Gómez, J.: DiCoMo: the digitization cost model. Int. J. Digital Lib. 11(2), 141–153 (2010)

  4. Boros, E., Nguyen, N.K., Lejeune, G., et al.: Assessing the impact of OCR noise on multilingual event detection over digitised documents. Int. J. Digital Lib., pp. 1–26 (2022)

  5. Buckley, C., Voorhees, E.M.: Evaluating evaluation measure stability. In: ACM SIGIR Forum, pp. 235–242 (2017)

  6. Carrasco, R.C.: An open-source OCR evaluation tool. In: Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage, pp. 179–184 (2014)

  7. Castro, J.D.B., Canchumuni, S.W.A., Villalobos, C.E.M., et al.: Improvement optical character recognition for structured documents using generative adversarial networks. In: 2021 21st International Conference on Computational Science and Its Applications (ICCSA), pp. 285–292 (2021)

  8. Chiron, G., Doucet, A., Coustaty, M., et al: ICDAR2017 competition on post-OCR text correction. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), pp. 1423–1428 (2017)

  9. Consoli, B., Santos, J., Gomes, D., et al.: Embeddings for named entity recognition in geoscience Portuguese literature. In: Proceedings of The 12th Language Resources and Evaluation Conference, pp. 4625–4630 (2020)

  10. Croft, W.B., Harding, S., Taghva, K., et al.: An evaluation of information retrieval accuracy with simulated OCR output. In: Symposium on Document Analysis and Information Retrieval, pp. 115–126 (1994)

  11. Drobac, S., Lindén, K.: Optical character recognition with neural networks and post-correction with finite state methods. Int. J. Document Anal. Recog. (IJDAR) 23(4), 279–295 (2020)

  12. Dutta, H., Gupta, A.: PNRank: Unsupervised ranking of person name entities from noisy OCR text. Decis. Support Syst. 152, 113662 (2022)

  13. Ehrmann, M., Hamdi, A., Pontes, E.L., et al.: Named entity recognition and classification on historical documents: A survey. arXiv preprint arXiv:2109.11406 (2021)

  14. Evershed, J., Fitch, K.: Correcting noisy OCR: Context beats confusion. In: Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage, pp. 45–51 (2014)

  15. Flores, F.N., Moreira, V.P.: Assessing the impact of stemming accuracy on information retrieval: a multilingual perspective. Inf. Process. Manag. 52(5), 840–854 (2016)

  16. Francois, M., Eglin, V., Biou, M.: Text detection and post-OCR correction in engineering documents. In: Uchida, S., Barney, E., Eglin, V. (eds.) Document Analysis Systems, pp. 726–740. Springer International Publishing, Cham (2022)

  17. Ghosh, K., Chakraborty, A., Parui, S.K., et al.: Improving information retrieval performance on OCRed text in the absence of clean text ground truth. Inf. Process. Manag. 52(5), 873–884 (2016)

  18. Gomes, D., Cordeiro, F., Consoli, B., et al.: Portuguese word embeddings for the oil and gas industry: Development and evaluation. Comput. Ind. 124, 103347 (2021)

  19. Gupte, A., Romanov, A., Mantravadi, S., et al.: Lights, camera, action! A framework to improve NLP accuracy over OCR documents (2021)

  20. Hämäläinen, M., Hengchen, S.: From the Paft to the Fiiture: a Fully Automatic NMT and Word Embeddings Method for OCR Post-Correction. In: Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pp. 431–436 (2019)

  21. Hamdi, A., Jean-Caurant, A., Sidère, N., et al.: Assessing and minimizing the impact of OCR quality on named entity recognition. In: International Conference on Theory and Practice of Digital Libraries, Springer, pp. 87–101 (2020)

  22. Hegghammer, T.: OCR with Tesseract, Amazon Textract, and Google Document AI: a benchmarking experiment. J. Comput. Social Sci., 1–22 (2021)

  23. Hull, D.: Using statistical testing in the evaluation of retrieval experiments. In: Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 329–338 (1993)

  24. Huynh, V.N., Hamdi, A., Doucet, A.: When to use OCR post-correction for named entity recognition? In: International Conference on Asian Digital Libraries, Springer, pp. 33–42 (2020)

  25. Jiang, M., Hu, Y., Worthey, G., et al.: Impact of OCR quality on BERT embeddings in the domain classification of book excerpts. In: CEUR Workshop Proceedings, ISSN 1613-0073 (2021)

  26. Jing, H., Lopresti, D., Shih, C.: Summarization of noisy documents: A pilot study. In: Proceedings of the HLT-NAACL 03 text summarization workshop, pp. 25–32 (2003)

  27. Johnson, S., Jourlin, P., Jones, K.S., et al.: Spoken document retrieval for TREC-7 at Cambridge University. In: TREC, p. 1 (1999)

  28. Kantor, P.B., Voorhees, E.M.: The TREC-5 confusion track: Comparing retrieval methods for scanned text. Inf. Retrieval 2(2), 165–176 (2000)

  29. Kettunen, K., Keskustalo, H., Kumpulainen, S., et al.: OCR quality affects perceived usefulness of historical newspaper clippings – a user study (2022). https://arxiv.org/abs/2203.03557

  30. Lam-Adesina, A.M., Jones, G.J.: Examining and improving the effectiveness of relevance feedback for retrieval of scanned text documents. Inf. Process. Manag. 42(3), 633–649 (2006)

  31. Lawley, C.J., Raimondo, S., Chen, T., et al.: Geoscience language models and their intrinsic evaluation. Appl. Comput. Geosci., 100084 (2022)

  32. Lin, X.: Impact of imperfect OCR on part-of-speech tagging. In: Seventh International Conference on Document Analysis and Recognition, Proceedings., pp. 284–288 (2003)

  33. Linhares Pontes, E., Hamdi, A., Sidere, N., et al.: Impact of OCR quality on named entity linking. In: International Conference on Asian Digital Libraries, Springer, pp. 102–115 (2019)

  34. Linhares Pontes, E., Cabrera-Diego, L.A., Moreno, J.G., et al.: MELHISSA: a multilingual entity linking architecture for historical press articles. Int. J. Digital Lib. 1–28 (2021)

  35. Ma, X., Pradeep, R., Nogueira, R., et al.: Document expansion baselines and learned sparse lexical representations for MS MARCO V1 and V2. In: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 3187–3197 (2022)

  36. Martínek, J., Lenc, L., Král, P.: Building an efficient OCR system for historical documents with little training data. Neural Comput. Appl. 32(23), 17209–17227 (2020)

  37. Mei, J., Islam, A., Moh’d, A., et al.: Statistical learning for OCR error correction. Inf. Process. Manag. 54(6), 874–887 (2018)

  38. Miller, D., Boisen, S., Schwartz, R., et al.: Named entity extraction from noisy input: speech and OCR. In: Sixth Applied Natural Language Processing Conference, pp. 316–324 (2000)

  39. Mittendorf, E., Schäuble, P.: Information retrieval can cope with many errors. Inf. Retrieval 3(3), 189–216 (2000)

  40. Mutuvi, S., Doucet, A., Odeo, M., et al.: Evaluating the impact of OCR errors on topic modeling. In: International Conference on Asian Digital Libraries, pp. 3–14 (2018)

  41. Nguyen, T., Jatowt, A., Coustaty, M., et al.: Deep statistical analysis of OCR errors for effective post-OCR processing. In: Joint Conference on Digital Libraries (JCDL), pp. 29–38 (2019)

  42. Nguyen, T.T.H., Jatowt, A., Coustaty, M., et al.: Survey of post-OCR processing approaches. ACM Comput. Surv. (CSUR) 54(6), 1–37 (2021)

  43. Nogueira, R., Cho, K.: Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085 (2019)

  44. Lima de Oliveira, L., Romeu, R.K., Moreira, V.P.: REGIS: A test collection for geoscientific documents in Portuguese. In: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2363–2368 (2021)

  45. Rigaud, C., Doucet, A., Coustaty, M., et al.: ICDAR 2019 competition on post-OCR text correction. In: 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 1588–1593 (2019)

  46. Sakai, T.: Statistical reform in information retrieval? In: ACM SIGIR Forum, pp. 3–12 (2014)

  47. Santos, D., Rocha, P.: The key to the first CLEF with portuguese: Topics, questions and answers in CHAVE. In: Workshop of the Cross-Language Evaluation Forum for European Languages, pp. 821–832 (2004)

  48. Singh, S.: Optical character recognition techniques: a survey. J. Emerg. Trends Comput. Inf. Sci. 4(6), 545–550 (2013)

  49. Smucker, M.D., Allan, J., Carterette, B.: A comparison of statistical significance tests for information retrieval evaluation. In: Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, pp. 623–632 (2007)

  50. van Strien, D., Beelen, K., Ardanuy, M.C., et al.: Assessing the impact of OCR quality on downstream NLP tasks. In: Proceedings of the 12th International Conference on Agents and Artificial Intelligence, ICAART, pp. 484–496 (2020)

  51. Taghva, K., Borsack, J., Condit, A., et al.: The effects of noisy data on text retrieval. J. Am. Soc. Inf. Sci. 45(1), 50–58 (1994)

  52. Taghva, K., Borsack, J., Condit, A.: Effects of OCR errors on ranking and feedback using the vector space model. Inf. Process. Manag. 32(3), 317–327 (1996)

  53. Taghva, K., Borsack, J., Condit, A.: Evaluation of model-based retrieval effectiveness with OCR text. ACM Trans. Inf. Syst. (TOIS) 14(1), 64–93 (1996)

  54. Traub, M.C., Samar, T., Van Ossenbruggen, J., et al.: Impact of crowdsourcing OCR improvements on retrievability bias. In: Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries, pp. 29–36 (2018)

  55. Vargas, D.S., de Oliveira, L.L., Moreira, V.P., et al.: sOCRates: a post-OCR text correction method. In: Anais do XXXVI Simpósio Brasileiro de Bancos de Dados, pp. 61–72 (2021)

  56. Wiedenhofer, L., Hein, H.G., Dengel, A.: Post-processing of OCR results for automatic indexing. In: Proceedings of 3rd International Conference on Document Analysis and Recognition, IEEE, pp. 592–596 (1995)

  57. Zhuang, S., Zuccon, G.: Dealing with typos for BERT-based passage retrieval and ranking. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 2836–2842 (2021)

  58. Zosa, E., Mutuvi, S., Granroth-Wilding, M., et al.: Evaluating the robustness of embedding-based topic models to OCR noise. In: International Conference on Asian Digital Libraries, Springer, pp. 392–400 (2021)

  59. Zu, G., Murata, M., Ohyama, W., et al.: The impact of OCR accuracy on automatic text classification. In: Advanced Workshop on Content Computing, pp. 403–409 (2004)

Download references

Acknowledgements

The authors thank the anonymous reviewers whose suggestions helped us improve our manuscript. We also thank Moniele K. Santos for her help in creating the ground truth. This work was partially supported by Petrobras 2017/00752-3, CAPES Finance Code 001, and CNPq/Brazil. The authors acknowledge the National Laboratory for Scientific Computing (LNCC/MCTI, Brazil) for providing HPC resources of the SDumont supercomputer, which have contributed to the research results reported within this article (URL: http://sdumont.lncc.br).

Author information

Authors and Affiliations

Authors

Contributions

LLdO was involved in the conceptualization, methodology, software, investigation, writing—original draft, and visualization. DSV contributed to the methodology, software, and data curation. AMAA helped in the conceptualization, software, and writing—review and editing. FCC assisted in the conceptualization, supervision, writing—review and editing, and funding acquisition. DdSMG contributed to the conceptualization and writing—review and editing. MCR assisted in the conceptualization and writing—review and editing. RKR performed the conceptualization and writing—review and editing. VPM contributed to the conceptualization, methodology, writing—original draft, writing—review and editing, and project administration.

Corresponding author

Correspondence to Viviane Pereira Moreira.

Ethics declarations

Conflict of Interest

The authors have no competing interests to declare.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

de Oliveira, L.L., Vargas, D.S., Alexandre, A.M.A. et al. Evaluating and mitigating the impact of OCR errors on information retrieval. Int J Digit Libr 24, 45–62 (2023). https://doi.org/10.1007/s00799-023-00345-6
