DOI: 10.1145/3352631.3352638
research-article

okralact - a multi-engine Open Source OCR training system

Published: 20 September 2019

ABSTRACT

Optical character recognition (OCR) of historical documents has been significantly more difficult than OCR of modern texts, largely due to the idiosyncrasies and wide variability of font, layout, language, and orthography in texts printed before ca. 1850. However, traditional OCR engines were optimized to support the widest possible set of modern texts ("OmniFont OCR"), with few or no facilities for the user to adapt the engine. Since OCR technologies began embracing deep neural networks, various Free Software OCR engines have become available that can in principle be adapted to different types of documents by training specific models from ground truth (GT). What these engines offer in terms of implementation finesse, they lack in interoperability and standardization. To overcome this, we developed okralact, a set of specifications and a prototypical implementation of an engine-agnostic system for training Open Source OCR engines such as Tesseract, OCRopus, kraken, or Calamari. We discuss the training of these engines, compare their features, describe the specifications and functionality of okralact, and outline how a turn-key system for adapting Open Source OCR engines can contribute to better OCR for historical documents and to the general Open Source OCR ecosystem.

      • Published in

        HIP '19: Proceedings of the 5th International Workshop on Historical Document Imaging and Processing
        September 2019
        98 pages
        ISBN: 9781450376686
        DOI: 10.1145/3352631

        Copyright © 2019 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Qualifiers

        • research-article
        • Research
        • Refereed limited

        Acceptance Rates

        HIP '19 Paper Acceptance Rate: 15 of 26 submissions, 58%. Overall Acceptance Rate: 52 of 90 submissions, 58%.
