DOI: 10.1145/3529372.3533298
short-paper

A prototype Gutenberg-HathiTrust sentence-level parallel corpus for OCR error analysis: pilot investigations

Published: 20 June 2022

Abstract

This exploratory study proposes a prototype sentence-level parallel corpus to support the study of optical character recognition (OCR) quality in curated digitized library collections. Existing data resources, such as ICDAR2019 [21] and GT4HistOCR [23], generally align content by artifact publishing characteristics such as documents or lines, which limits their usefulness for exploring OCR noise at natural-language granularities such as sentences and chapters. Building upon an existing volume-aligned corpus that collected human-proofread texts from Project Gutenberg and paired OCR views from the HathiTrust Digital Library, we extracted and aligned 167,079 sentences from 189 sampled books in four domains published from 1793 to 1984. To support downstream research on OCR quality, we conducted an analysis of OCR errors with a specific focus on their associations with the source text metadata. We found that sampled data in agriculture has a higher ratio of real-word errors than other domains, while sentences from social-science volumes contain more non-word errors. In addition, data sampled from older volumes tend to have a high ratio of non-word errors, while samples from more recently published volumes are likely to have more real-word errors. Based on these findings, we suggest that scholars consider the potential influence of source data characteristics on their results when studying OCR quality issues.
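
The distinction between the two error types in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline, which this page does not describe. It assumes the usual definitions: a non-word error yields a token that is not in the lexicon, while a real-word error yields a valid word that differs from the proofread text. The function name `classify_ocr_errors`, the toy lexicon, and the example sentence pair are all hypothetical.

```python
from difflib import SequenceMatcher


def classify_ocr_errors(clean_sentence, ocr_sentence, lexicon):
    """Label each substituted OCR token as a 'real-word' or 'non-word' error.

    Hypothetical helper for illustration only; whitespace tokenization and a
    set-based lexicon are simplifying assumptions.
    """
    clean_tokens = clean_sentence.lower().split()
    ocr_tokens = ocr_sentence.lower().split()
    matcher = SequenceMatcher(a=clean_tokens, b=ocr_tokens, autojunk=False)
    errors = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only equal-length replacements give a one-to-one token pairing;
        # insertions, deletions, and split or merged tokens are skipped here.
        if tag != "replace" or (i2 - i1) != (j2 - j1):
            continue
        for truth, observed in zip(clean_tokens[i1:i2], ocr_tokens[j1:j2]):
            kind = "real-word" if observed in lexicon else "non-word"
            errors.append((truth, observed, kind))
    return errors


# Toy lexicon and sentence pair (assumed, not drawn from the corpus).
lexicon = {"the", "modern", "modem", "farmer", "sowed", "his", "corn"}
clean = "The modern farmer sowed his corn"
ocr = "The modem farrner sowed his corn"

errors = classify_ocr_errors(clean, ocr, lexicon)
for truth, observed, kind in errors:
    print(f"{truth} -> {observed}: {kind}")
```

In this toy pair, "modem" is in the lexicon and so counts as a real-word error, whereas "farrner" is not and counts as a non-word error. A realistic analysis would need a full dictionary and a tokenizer that handles punctuation and hyphenation.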

References

[1]
Guilherme Torresan Bazzo, Gustavo Acauan Lorentz, Danny Suarez Vargas, and Viviane P Moreira. 2020. Assessing the impact of OCR errors in information retrieval. Advances in Information Retrieval 12036 (2020), 102.
[2]
Nathanael Chambers and Dan Jurafsky. 2009. Unsupervised learning of narrative schemas and their participants. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP. 602--610.
[3]
Rui Dong and David A Smith. 2018. Multi-input attention for unsupervised OCR correction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2363--2372.
[4]
John Evershed and Kent Fitch. 2014. Correcting noisy OCR: Context beats confusion. In Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage. 45--51.
[5]
Dennis Freeborn. 1998. From Old English to Standard English: A course book in language variation across time. University of Ottawa Press.
[6]
Simon Gabay, Thibault Clérice, and Christian Reul. 2020. OCR17: Ground truth and models for 17th c. French prints (and hopefully more). (May 2020).
[7]
Lee Gillam and Khurshid Ahmad. 2005. Pattern mining across domain-specific text collections. In International Workshop on Machine Learning and Data Mining in Pattern Recognition. Springer, 570--579.
[8]
Harsh Gupta, Luciano Del Corro, Samuel Broscheit, Johannes Hoffart, and Eliot Brenner. 2021. Unsupervised multi-view post-OCR error correction with language models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 8647--8652.
[9]
Adam Jatowt, Mickael Coustaty, Nhu-Van Nguyen, Antoine Doucet, et al. 2019. Deep statistical analysis of OCR errors for effective post-OCR processing. In 2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL). IEEE, 29--38.
[10]
Ming Jiang, Jennifer D'Souza, Sören Auer, and J Stephen Downie. 2021. Evaluating BERT-based scientific relation classifiers for scholarly knowledge graph construction on digital library collections. International Journal on Digital Libraries (2021), 1--19.
[11]
Ming Jiang, Yuerong Hu, Glen Worthey, Ryan C Dubnicek, Boris Capitanu, Deren Kudeki, and J Stephen Downie. 2021. The Gutenberg-HathiTrust parallel corpus: A real-world dataset for noise investigation in uncorrected OCR texts. iConference 2021 (Poster) (2021).
[12]
Ming Jiang, Yuerong Hu, Glen Worthey, Ryan C Dubnicek, Ted Underwood, and J Stephen Downie. 2021. Impact of OCR quality on BERT embeddings in the domain classification of book excerpts. Proceedings of the Second Conference on Computational Humanities Research 1613 (2021), 266--279.
[13]
Paul B Kantor and Ellen M Voorhees. 2000. The TREC-5 confusion track: Comparing retrieval methods for scanned text. Information Retrieval 2, 2 (2000), 165--176.
[14]
Daniel Lopresti. 2009. Optical character recognition errors and their effects on natural language processing. International Journal on Document Analysis and Recognition (IJDAR) 12, 3 (2009), 141--151.
[15]
Lijun Lyu, Maria Koutraki, Martin Krickl, and Besnik Fetahu. 2021. Neural OCR post-hoc correction of historical corpora. Transactions of the Association for Computational Linguistics 9 (2021), 479--493.
[16]
Diego Molla and Steve Cassidy. 2017. Overview of the 2017 ALTA shared task: Correcting OCR errors. In Proceedings of the Australasian Language Technology Association Workshop 2017. 115--118.
[17]
Thi Tuyet Hai Nguyen, Adam Jatowt, Mickael Coustaty, and Antoine Doucet. 2021. Survey of post-OCR processing approaches. ACM Computing Surveys (CSUR) 54, 6 (2021), 1--37.
[18]
Thi Tuyet Hai Nguyen, Adam Jatowt, Nhu-Van Nguyen, Mickael Coustaty, and Antoine Doucet. 2020. Neural machine translation with BERT for post-OCR error detection and correction. In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020. 333--336.
[19]
Christos Papadopoulos, Stefan Pletschacher, Christian Clausner, and Apostolos Antonacopoulos. 2013. The IMPACT dataset of historical document images. In Proceedings of the Second International Workshop on Historical Document Imaging and Processing. 123--130.
[20]
Charuta Pethe, Allen Kim, and Steven Skiena. 2020. Chapter captor: Text segmentation in novels. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 8373--8383.
[21]
Christophe Rigaud, Antoine Doucet, Mickaël Coustaty, and Jean-Philippe Moreux. 2019. ICDAR 2019 competition on post-OCR text correction. In 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 1588--1593.
[22]
Shruti Rijhwani, Antonios Anastasopoulos, and Graham Neubig. 2020. OCR Post-Correction for Endangered Language Texts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 5931--5942.
[23]
Uwe Springmann, Christian Reul, Stefanie Dipper, and Johannes Baiter. 2018. Ground truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin. arXiv preprint arXiv:1809.05501 (2018).
[24]
Daniel van Strien, Kaspar Beelen, Mariona Coll Ardanuy, Kasra Hosseini, Barbara McGillivray, and Giovanni Colavizza. 2020. Assessing the impact of OCR quality on downstream NLP tasks. (2020).

Cited By

  • (2024) Attention on Attention as a part of decoder for the vehicle license plate recognition. 2024 IEEE AITU: Digital Generation, 33-37. DOI: 10.1109/IEEECONF61558.2024.10585566. Online publication date: 3-Apr-2024

      Published In

      JCDL '22: Proceedings of the 22nd ACM/IEEE Joint Conference on Digital Libraries
      June 2022
      392 pages
      ISBN: 9781450393454
      DOI: 10.1145/3529372
      General Chairs: Akiko Aizawa, Thomas Mandl, Zeljko Carevic
      Program Chairs: Annika Hinze, Philipp Mayr, Philipp Schaer
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Sponsors

      In-Cooperation

      • IEEE Technical Committee on Digital Libraries (TC DL)

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. data curation
      2. digital humanities
      3. digital libraries
      4. error analysis
      5. optical character recognition
      6. sentence-level parallel corpus

      Qualifiers

      • Short-paper

      Conference

      JCDL '22
      Acceptance Rates

      JCDL '22 Paper Acceptance Rate 35 of 132 submissions, 27%;
      Overall Acceptance Rate 415 of 1,482 submissions, 28%

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months): 32
      • Downloads (Last 6 weeks): 5
      Reflects downloads up to 27 Feb 2025
