HDPA: historical document processing and analysis framework

Lenc, Ladislav; Martínek, Jiří; Král, Pavel; Nicolao, Anguelos; Christlein, Vincent

doi:10.1007/s12530-020-09343-4


Original Paper
Published: 20 May 2020

Evolving Systems, volume 12, pages 177–190 (2021)

Ladislav Lenc¹, Jiří Martínek², Pavel Král¹,² (ORCID: orcid.org/0000-0002-3096-675X), Anguelos Nicolao³ & Vincent Christlein³


Abstract

The accessibility of digitized historical documents is essential for fast and efficient retrieval of historical information and for knowledge extraction from such data. Providing this functionality requires converting document images into plain text using optical character recognition (OCR). Many OCR-related methods and tools have been proposed; however, they are often too complicated for a standard user, lack some important parts, or are not available in free versions. This paper therefore describes a complex and flexible web framework for historical document manipulation and analysis, with the main focus on OCR. The framework contains eight modules that support three main tasks: image pre-processing and segmentation, creation of data for OCR model training, and the OCR itself. The framework is freely available for non-commercial purposes. We have experimentally evaluated the framework on real data and shown that it is efficient and can save human labour in the preparation of annotated data. Moreover, we have reached state-of-the-art OCR results.
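The first of the three tasks named in the abstract, image pre-processing and segmentation, can be illustrated with a minimal sketch. It is not HDPA's implementation: the fixed global threshold, the row-by-row ink scan, and the function names `binarize` and `segment_lines` are assumptions chosen purely for illustration, and the framework's actual methods are described in the full paper.

```python
from typing import List, Tuple

# Hypothetical representation: a grayscale page as rows of pixel values,
# where 0 is black ink and 255 is white paper.
Image = List[List[int]]


def binarize(image: Image, threshold: int = 128) -> List[List[int]]:
    """Global threshold (an assumed, simplistic choice).

    Returns a matrix where 1 marks an ink pixel and 0 marks background.
    """
    return [[1 if px < threshold else 0 for px in row] for row in image]


def segment_lines(binary: List[List[int]]) -> List[Tuple[int, int]]:
    """Find text lines as maximal runs of consecutive rows that contain ink.

    Returns (start_row, end_row_exclusive) pairs, top to bottom.
    """
    spans: List[Tuple[int, int]] = []
    start = None
    for y, row in enumerate(binary):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                    # a new text line begins
        elif not has_ink and start is not None:
            spans.append((start, y))     # the current text line ends
            start = None
    if start is not None:                # line touching the bottom edge
        spans.append((start, len(binary)))
    return spans


# Toy page: two "text lines" separated by one blank row.
page = [
    [255, 255, 255],
    [0, 255, 0],
    [255, 0, 255],
    [255, 255, 255],
    [10, 10, 10],
]
line_spans = segment_lines(binarize(page))
line_images = [page[a:b] for a, b in line_spans]  # crops to pass on to OCR
```

Each cropped line image would then be handed to an OCR model. Real historical scans usually need adaptive binarization and skew correction before a row scan like this becomes reliable.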





Acknowledgements

This work has been partly supported by the ERDF project “Research and Development of Intelligent Components of Advanced Technologies for the Pilsen Metropolitan Area (InteCom)” (no. CZ.02.1.01/0.0/0.0/17_048/0007267), by the Cross-border Cooperation Program Czech Republic - Free State of Bavaria ETS Objective 2014–2020 (project no. 211), and by Grant No. SGS-2019-018 Processing of heterogeneous data and its specialized applications.

Author information

Authors and Affiliations

NTIS-New Technologies for the Information Society, Faculty of Applied Sciences, University of West Bohemia, Plzeň, Czech Republic
Ladislav Lenc & Pavel Král
Department of Computer Science and Engineering, Faculty of Applied Sciences, University of West Bohemia, Univerzitní 8, 306 14, Plzeň, Czech Republic
Jiří Martínek & Pavel Král
Pattern Recognition Lab, Friedrich-Alexander-University Erlangen-Nurnberg, Erlangen, Germany
Anguelos Nicolao & Vincent Christlein


Corresponding author

Correspondence to Pavel Král.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Cite this article

Lenc, L., Martínek, J., Král, P. et al. HDPA: historical document processing and analysis framework. Evolving Systems 12, 177–190 (2021). https://doi.org/10.1007/s12530-020-09343-4


Received: 20 December 2019
Accepted: 23 April 2020
Published: 20 May 2020
Issue Date: March 2021
DOI: https://doi.org/10.1007/s12530-020-09343-4
