ABSTRACT
Optical character recognition (OCR) of historical documents has been significantly more difficult than OCR of modern texts, largely because of the idiosyncrasies and wide variability in font, layout, language, and orthography of texts printed before ca. 1850. Traditional OCR engines, however, were optimized to support the widest possible range of modern texts ("OmniFont OCR") and gave the user few or no facilities for adapting the engine. Since OCR technologies began embracing deep neural networks, various Free Software OCR engines have become available that can in principle be adapted to different types of documents by training specific models from ground truth (GT). What these engines offer in terms of implementation finesse, they lack in interoperability and standardization. To overcome this, we developed okralact, a set of specifications and a prototypical implementation of an engine-agnostic system for training Open Source OCR engines such as Tesseract, OCRopus, kraken, or Calamari. We discuss the training of these engines, compare their features, describe the specifications and functionality of okralact, and outline how a turn-key system for adapting Open Source OCR engines can contribute to better OCR for historical documents and to the general Open Source OCR ecosystem.