
Improving Handwritten Cyrillic OCR by Font-Based Synthetic Text Generator

  • Conference paper
Dynamics of Information Systems (DIS 2023)

Abstract

In this paper, we propose a straightforward and effective Font-based Synthetic Text Generator (FbSTG) to alleviate the need for annotated data in handwritten text recognition, for Cyrillic and beyond. Unlike standard GAN-based methods, FbSTG does not have to be trained to learn new characters and styles; all it needs is the fonts, the text, and sampled page backgrounds. To demonstrate the benefits of the proposed method, we train and test two OCR systems (Tesseract and TrOCR) on the Handwritten Kazakh and Russian (HKR) dataset, both with and without synthetic data. In addition, we evaluate both systems on a private NKVD dataset of historical documents from Ukraine with a high proportion of out-of-vocabulary (OoV) words, an extremely challenging task for current state-of-the-art methods. Adding the synthetic data significantly decreased the CER and WER of the TrOCR-Base-384 model on both datasets. More precisely, we reduced the relative CER/WER error (i) on HKR-Test1 with OoV samples by around \(20\%\)/\(10\%\), and (ii) on the NKVD dataset by \(24\%\) CER and \(8\%\) WER. The FbSTG code is available at: https://github.com/mhlzcu/doc_gen.
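The abstract describes the core idea of FbSTG: instead of training a GAN, it renders the target text with handwriting-like fonts onto sampled page backgrounds. The sketch below is only an illustration of that idea, assuming Pillow is available; the font path, background image, and jitter parameters are hypothetical and not taken from the authors' implementation (see the linked repository for the actual FbSTG code).

```python
# Minimal sketch of a font-based synthetic text-line generator.
# NOTE: illustrative approximation only, not the authors' FbSTG code;
# the font file, background image, and noise parameters are hypothetical.
import random
from PIL import Image, ImageDraw, ImageFont


def render_text_line(text: str,
                     font_path: str = "fonts/handwriting_cyrillic.ttf",   # hypothetical font
                     background_path: str = "backgrounds/page_scan.png",  # hypothetical sampled page
                     font_size: int = 48,
                     margin: int = 16) -> Image.Image:
    """Render `text` with the given font onto a randomly cropped page background."""
    font = ImageFont.truetype(font_path, font_size)

    # Measure the rendered text to size the output patch.
    left, top, right, bottom = font.getbbox(text)
    w, h = right - left + 2 * margin, bottom - top + 2 * margin

    # Sample a background patch from a scanned page.
    page = Image.open(background_path).convert("RGB")
    x0 = random.randint(0, max(0, page.width - w))
    y0 = random.randint(0, max(0, page.height - h))
    patch = page.crop((x0, y0, x0 + w, y0 + h)).resize((w, h))

    # Draw the text in a dark "ink" colour with slight positional jitter.
    draw = ImageDraw.Draw(patch)
    ink = tuple(random.randint(0, 60) for _ in range(3))
    draw.text((margin + random.randint(-2, 2), margin + random.randint(-2, 2)),
              text, font=font, fill=ink)
    return patch


if __name__ == "__main__":
    sample = render_text_line("Пример рукописного текста")
    sample.save("synthetic_line.png")
```

In such a pipeline, the generated line images would be paired with their source strings to form additional (image, transcription) training pairs for an OCR model such as TrOCR.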

This research was supported by the Ministry of Culture of the Czech Republic, project No. DG20P02OVV018. The work described herein has also been supported by the Ministry of Education, Youth and Sports of the Czech Republic, Project No. LM2023062 LINDAT/CLARIAH-CZ. Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the programme “Projects of Large Research, Development, and Innovations Infrastructures” (CESNET LM2015042), is greatly appreciated.



Author information


Corresponding author

Correspondence to Ivan Gruber.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Gruber, I., Picek, L., Hlaváč, M., Neduchal, P., Hrúz, M. (2024). Improving Handwritten Cyrillic OCR by Font-Based Synthetic Text Generator. In: Moosaei, H., Hladík, M., Pardalos, P.M. (eds) Dynamics of Information Systems. DIS 2023. Lecture Notes in Computer Science, vol 14321. Springer, Cham. https://doi.org/10.1007/978-3-031-50320-7_8


  • DOI: https://doi.org/10.1007/978-3-031-50320-7_8


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-50319-1

  • Online ISBN: 978-3-031-50320-7

  • eBook Packages: Computer Science, Computer Science (R0)
