
HDPA: historical document processing and analysis framework

  • Original Paper
  • Evolving Systems

Abstract

Accessible digitized historical documents are essential for fast, efficient retrieval of historical information and for knowledge extraction from such data. Providing this functionality requires converting document images into plain text using optical character recognition (OCR). Many OCR-related methods and tools have been proposed; however, they are often too complicated for a standard user, lack some important components, or are not available in free versions. This paper therefore describes a complex and flexible web framework for historical document manipulation and analysis, with the main focus on OCR. The framework contains eight modules that support three main tasks: image pre-processing and segmentation, creation of data for OCR model training, and the OCR itself. The framework is freely available for non-commercial purposes. We have experimentally evaluated it on real data and shown that the system is efficient and can save human labour in preparing annotated data. Moreover, we have reached state-of-the-art OCR results.




Notes

  1. https://read.transkribus.eu/about/.

  2. https://www.abbyy.com/.

  3. http://www.portafontium.cz/.


Acknowledgements

This work has been partly supported by the ERDF project “Research and Development of Intelligent Components of Advanced Technologies for the Pilsen Metropolitan Area (InteCom)” (no. CZ.02.1.01/0.0/0.0/17_048/0007267), by the Cross-border Cooperation Program Czech Republic - Free State of Bavaria ETS Objective 2014–2020 (project no. 211), and by Grant No. SGS-2019-018 Processing of heterogeneous data and its specialized applications.

Author information

Corresponding author

Correspondence to Pavel Král.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Lenc, L., Martínek, J., Král, P. et al. HDPA: historical document processing and analysis framework. Evolving Systems 12, 177–190 (2021). https://doi.org/10.1007/s12530-020-09343-4


  • DOI: https://doi.org/10.1007/s12530-020-09343-4
