research-article

okralact - a multi-engine Open Source OCR training system

Authors:
Konstantin Baierer (Staatsbibliothek zu Berlin, Preußischer Kulturbesitz)
Rui Dong (Khoury College of Computer Sciences, Northeastern University)
Clemens Neudecker (Staatsbibliothek zu Berlin, Preußischer Kulturbesitz)

HIP '19: Proceedings of the 5th International Workshop on Historical Document Imaging and Processing, September 2019, Pages 25–30. https://doi.org/10.1145/3352631.3352638

Published: 20 September 2019

ABSTRACT

Optical character recognition (OCR) of historical documents has been significantly more difficult than OCR of modern texts, largely due to the idiosyncrasies and wide variability of font, layout, language, and orthography in texts printed before ca. 1850. However, traditional OCR engines were optimized towards supporting the widest possible set of modern text ("OmniFont OCR"), with few or no facilities for the user to adapt the engine. Since OCR technologies began embracing deep neural networks, various Free Software OCR engines have become available that can in principle be adapted to different types of documents by training specific models from ground truth (GT). What these engines offer in terms of implementation finesse, they lack in interoperability and standardization. To overcome this, we developed okralact, a set of specifications and a prototypical implementation of an engine-agnostic system for training Open Source OCR engines like Tesseract, OCRopus, kraken or Calamari. We discuss training of these engines, compare their features, describe the specifications and functionality of okralact, and outline how a turn-key system for adapting Open Source OCR engines can contribute to better OCR for historical documents and to the general Open Source OCR ecosystem.

Index Terms

okralact - a multi-engine Open Source OCR training system
1. Applied computing
  1. Document management and text processing
    1. Document capture
      1. Optical character recognition
2. Computing methodologies
  1. Machine learning
    1. Machine learning approaches
      1. Neural networks

Published in

HIP '19: Proceedings of the 5th International Workshop on Historical Document Imaging and Processing
September 2019
98 pages
ISBN: 9781450376686
DOI: 10.1145/3352631

Copyright © 2019 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
HIP '19 Paper Acceptance Rate: 15 of 26 submissions, 58%
Overall Acceptance Rate: 52 of 90 submissions, 58%