Abstract
The accessibility of digitized historical documents is essential for fast and efficient retrieval of historical information and for knowledge extraction from such data. To provide this functionality, document images must be converted into plain text using optical character recognition (OCR). Many OCR-related methods and tools have been proposed; however, they are often too complicated for a standard user, lack some important components, or are not available in free versions. Therefore, this paper describes a complex and flexible web framework for historical document manipulation and analysis with a main focus on OCR. The framework contains eight modules that support three main tasks: image pre-processing and segmentation, creation of data for OCR model training, and the OCR itself. The framework is freely available for non-commercial purposes. We have experimentally evaluated the framework on real data and shown that it is efficient and can save human labour in the preparation of annotated data. Moreover, we have reached state-of-the-art OCR results.
Acknowledgements
This work has been partly supported by ERDF “Research and Development of Intelligent Components of Advanced Technologies for the Pilsen Metropolitan Area (InteCom)” (no.: CZ.02.1.01/0.0/0.0/17_048/0007267), by Cross-border Cooperation Program Czech Republic - Free State of Bavaria ETS Objective 2014–2020 (project no. 211), and by Grant No. SGS-2019-018 Processing of heterogeneous data and its specialized applications.
About this article
Cite this article
Lenc, L., Martínek, J., Král, P. et al. HDPA: historical document processing and analysis framework. Evolving Systems 12, 177–190 (2021). https://doi.org/10.1007/s12530-020-09343-4
DOI: https://doi.org/10.1007/s12530-020-09343-4