DOI:10.1145/3322905.3322916

Labelling OCR Ground Truth for Usage in Repositories

Published: 08 May 2019

Abstract

Over the last decade, rapid developments in deep/machine learning have led such algorithms to largely replace traditional pattern/language-based approaches to OCR. Training these new tools requires scanned images alongside their transcriptions (Ground Truth, GT). OCR of historical documents with high accuracy requires GT of wide variety and variability, so that highly specific models can be created for specific document corpora.
In this paper we present an XML-based format that exhaustively describes the features of OCR GT relevant to training, storage and retrieval (GT metadata, GTM), as well as the tools for creating GT. We discuss the OCRD-ZIP format for bundling digitized books, including METS, images, transcription, GT metadata and more. We show how these data formats are used in different repository solutions within the OCR-D framework.

Cited By

  • (2023) Document Layout Analysis with Deep Learning and Heuristics. Proceedings of the 7th International Workshop on Historical Document Imaging and Processing, 73-78. DOI:10.1145/3604951.3605513. Online publication date: 25-Aug-2023
  • (2021) A survey of OCR evaluation tools and metrics. Proceedings of the 6th International Workshop on Historical Document Imaging and Processing, 13-18. DOI:10.1145/3476887.3476888. Online publication date: 5-Sep-2021
  • (2019) okralact - a multi-engine Open Source OCR training system. Proceedings of the 5th International Workshop on Historical Document Imaging and Processing, 25-30. DOI:10.1145/3352631.3352638. Online publication date: 20-Sep-2019

Published In

DATeCH2019: Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage
May 2019
163 pages
ISBN:9781450371940
DOI:10.1145/3322905

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Corpus
  2. Ground Truth
  3. Metadata
  4. Optical Character Recognition
  5. Repository

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

DATeCH2019

Acceptance Rates

Overall acceptance rate: 60 of 86 submissions (70%)
