DOI: 10.1145/3322905.3322917
research-article

OCR-D: An end-to-end open source OCR framework for historical printed documents

Published: 08 May 2019

Abstract

Various research projects have been concerned with the development and adaptation of OCR methods specifically for historical printed documents (cf. METAe [20], IMPACT [1], eMOP [9]). However, these initiatives ended before the wide adoption of deep neural networks and, despite the various projects' achievements, there remains a lack of OCR software that is a) comprehensive with regard to the challenges presented by the wide variety of historical documents and b) available as ready-to-use Free Software. The OCR-D project aims to rectify that.
In this paper we introduce the background of OCR-D and the main challenges and shortcomings in the availability of open tools and resources for OCR of historical printed documents, and discuss the various software modules and related components (repositories, workflows) that are being made available through OCR-D. Finally, we provide an outlook on a number of remaining challenges that are not addressed by OCR-D and point out several examples of the positive community aspects that have arisen through the creation and sharing of open resources for historical German OCR.

References

[1]
Hildelies Balk and Aly Conteh. 2011. IMPACT: Centre of Competence in Text Digitisation. In Proceedings of the 2011 Workshop on Historical Document Imaging and Processing (HIP '11). ACM, New York, NY, USA, 155--160. https://doi.org/10.1145/2037342.2037369
[2]
Scott Bradner. 1997. Key words for use in RFCs to Indicate Requirement Levels. BCP 14. RFC Editor. http://www.rfc-editor.org/rfc/rfc2119.txt
[3]
Christian Clausner and Apostolos Antonacopoulos. 2018. Ontology and framework for semantic labelling of document data and software methods. In Proceedings of the 13th IAPR International Workshop on Document Analysis Systems (DAS2018). IEEE, New York, NY, USA, 73--78. https://doi.org/10.1109/DAS.2018.46
[4]
Ryan Cordell and David Smith. 2018. A Research Agenda for Historical and Multilingual Optical Character Recognition. http://hdl.handle.net/2047/D20297452. Accessed: 2019-01-18.
[5]
Maria Federbusch, Christian Polzin, and Thomas Stäcker. 2013. Volltext via OCR - Möglichkeiten und Grenzen. Beiträge aus der Staatsbibliothek zu Berlin - Preußischer Kulturbesitz 43 (2013), 1--138.
[6]
Emilio Granell, Verónica Romero, and Carlos D. Martinez-Hinarejos. 2018. Multimodality, interactivity, and crowdsourcing for document transcription. Computational Intelligence 34, 2 (2018), 398--419. https://doi.org/10.1111/coin.12169
[7]
Thomas Jejkal, Alexander Vondrous, Andreas Kopmann, Rainer Stotzka, and Volker Hartmann. 2014. KIT Data Manager: The Repository Architecture Enabling Cross-Disciplinary Research. KIT, Karlsruhe, 9--11.
[8]
John Kunze, Justin Littman, Elizabeth Madden, John Scancella, and Chris Adams. 2018. The BagIt File Packaging Format (V1.0). https://tools.ietf.org/html/draft-kunze-bagit-17. Accessed: 2019-01-18.
[9]
Laura C. Mandell, Clemens Neudecker, Apostolos Antonacopoulos, Elizabeth Grumbach, Loretta Auvil, Matthew J. Christy, Jacob A. Heil, and Todd Samuelson. 2017. Navigating the storm: IMPACT, eMOP, and agile steering standards. Digital Scholarship in the Humanities 32, 1 (2017), 189--194. https://doi.org/10.1093/llc/fqv062
[10]
Clemens Neudecker, Sven Schlarb, Zeki Mustafa Dogan, Paolo Missier, Shoaib Sufi, Alan Williams, and Katy Wolstencroft. 2011. An Experimental Workflow Development Platform for Historical Document Digitisation and Analysis. In Proceedings of the 2011 Workshop on Historical Document Imaging and Processing (HIP '11). ACM, New York, NY, USA, 161--168. https://doi.org/10.1145/2037342.2037370
[11]
Clemens Neudecker and Asaf Tzadok. 2010. User collaboration for improving access to historical texts. Liber Quarterly 20, 1 (2010), 119--128. https://doi.org/10.18352/lq.7981
[12]
Stefan Pletschacher and Apostolos Antonacopoulos. 2010. The PAGE (Page Analysis and Ground-Truth Elements) Format Framework. In 2010 20th International Conference on Pattern Recognition. IEEE, New York, NY, USA, 257--260. https://doi.org/10.1109/ICPR.2010.72
[13]
Ajinkya Prabhune, Rainer Stotzka, Vaibhav Sakharkar, Jürgen W. Hesser, and Michael Gertz. 2018. MetaStore: an adaptive metadata management framework for heterogeneous metadata models. Distributed and parallel databases 36, 1 (2018), 153--194. https://doi.org/10.1007/s10619-017-7210-4
[14]
Ulrich Reffle and Christoph Ringlstetter. 2013. Unsupervised profiling of OCRed historical documents. Pattern Recognition 46, 5 (2013), 1346--1357. https://doi.org/10.1016/j.patcog.2012.10.002
[15]
Christian Reul, Uwe Springmann, and Frank Puppe. 2017. LAREX: A Semi-automatic Open-source Tool for Layout Analysis and Region Extraction on Early Printed Books. In Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage (DATeCH2017). ACM, New York, NY, USA, 137--142. https://doi.org/10.1145/3078081.3078097
[16]
Ray Smith. 2007. An Overview of the Tesseract OCR Engine. In Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), Vol. 2. IEEE, New York, NY, USA, 629--633. https://doi.org/10.1109/ICDAR.2007.4376991
[17]
Uwe Springmann. 2016. OCR für alte Drucke. Informatik Spektrum 39, 6 (2016), 459--462. https://doi.org/10.1007/s00287-016-1004-3
[18]
Uwe Springmann, Christian Reul, Stefanie Dipper, and Johannes Baiter. 2018. Ground Truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin. CoRR abs/1809.05501 (2018). arXiv:1809.05501 http://arxiv.org/abs/1809.05501
[19]
Christoph Stollwerk. 2016. Machbarkeitsstudie zu Einsatzmöglichkeiten von OCR-Software im Bereich "Alter Drucke" zur Vorbereitung einer vollständigen Digitalisierung deutscher Druckerzeugnisse zwischen 1500 und 1930. DARIAH-DE working papers 16 (2016). http://nbn-resolving.de/urn:nbn:de:gbv:7-dariah-2016-2-8
[20]
Simon Tanner. 2001. Digitization of Printed Material: The Metadata Engine Project (METAe). Library Hi Tech News 18, 4 (2001). https://doi.org/10.1108/lhtn.2001.23918daf.002
[21]
Thorsten Vobl, Annette Gotscharek, Uli Reffle, Christoph Ringlstetter, and Klaus U. Schulz. 2014. PoCoTo - an Open Source System for Efficient Interactive Postcorrection of OCRed Historical Texts. In Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage (DATeCH '14). ACM, New York, NY, USA, 57--61. https://doi.org/10.1145/2595188.2595197


    Published In

    ACM Other conferences
    DATeCH2019: Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage
    May 2019
    163 pages
    ISBN:9781450371940
    DOI:10.1145/3322905
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. digital libraries
    2. digitization
    3. historical prints
    4. open source
    5. optical character recognition

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    DATeCH2019

    Acceptance Rates

    Overall Acceptance Rate 60 of 86 submissions, 70%

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 77
    • Downloads (Last 6 weeks): 10
    Reflects downloads up to 07 Mar 2025

    Cited By

    • (2024) Boundary Gaussian Distance Loss Function for Enhancing Character Extraction from High-Resolution Scans of Ancient Metal-Type Printed Books. Electronics 13:10 (1957). https://doi.org/10.3390/electronics13101957. Online publication date: 16-May-2024
    • (2024) Investigating OCR-Sensitive Neurons to Improve Entity Recognition in Historical Documents. Sustainability and Empowerment in the Context of Digital Libraries (54--66). https://doi.org/10.1007/978-981-96-0865-2_5. Online publication date: 6-Dec-2024
    • (2024) New Transformer Approach to the Recognition of Mediaeval Arabic Historical Manuscripts. Artificial Intelligence and Its Practical Applications in the Digital Economy (271--283). https://doi.org/10.1007/978-3-031-71429-0_20. Online publication date: 18-Dec-2024
    • (2024) TrOCR Meets Language Models: An End-to-End Post-correction Approach. Document Analysis and Recognition – ICDAR 2024 Workshops (12--26). https://doi.org/10.1007/978-3-031-70645-5_2. Online publication date: 11-Sep-2024
    • (2024) fang: Fast Annotation of Glyphs in Historical Printed Documents. Document Analysis Systems (377--392). https://doi.org/10.1007/978-3-031-70442-0_23. Online publication date: 30-Aug-2024
    • (2024) Automation of historical weather data rescue. Geoscience Data Journal 12:1. https://doi.org/10.1002/gdj3.261. Online publication date: 26-Sep-2024
    • (2023) Digitale Sammlungen als offene Daten für die Forschung. Bibliothek Forschung und Praxis 47:2 (200--212). https://doi.org/10.1515/bfp-2023-0021. Online publication date: 18-Jul-2023
    • (2023) Digital Curation and AI. AI in Museums (149--162). https://doi.org/10.14361/9783839467107-013. Online publication date: 4-Dec-2023
    • (2023) Document Layout Analysis with Deep Learning and Heuristics. Proceedings of the 7th International Workshop on Historical Document Imaging and Processing (73--78). https://doi.org/10.1145/3604951.3605513. Online publication date: 25-Aug-2023
    • (2023) Search in Archival Facsimile Documents for Digital History. 2023 IEEE 19th International Conference on e-Science (e-Science) (1--10). https://doi.org/10.1109/e-Science58273.2023.10254826. Online publication date: 9-Oct-2023
