DiffusionPen: Towards Controlling the Style of Handwritten Text Generation

  • Conference paper
  • In: Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Handwritten Text Generation (HTG) conditioned on text and style is a challenging task due to the variability of inter-user characteristics and the unlimited combinations of characters that form new words unseen during training. Diffusion Models have recently shown promising results in HTG but remain under-explored. We present DiffusionPen (DiffPen), a 5-shot style handwritten text generation approach based on Latent Diffusion Models. By utilizing a hybrid style extractor that combines metric learning and classification, our approach captures both textual and stylistic characteristics of seen and unseen words and styles, generating realistic handwritten samples. Moreover, we explore several variation strategies of the data with multi-style mixtures and noisy embeddings, enhancing the robustness and diversity of the generated data. Extensive experiments on the IAM offline handwriting database show that our method outperforms existing methods qualitatively and quantitatively, and its additional generated data can improve the performance of Handwriting Text Recognition (HTR) systems. The code is available at: https://github.com/koninik/DiffusionPen.
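To make the abstract's key ingredients concrete, the Python sketch below illustrates one plausible reading of the hybrid style extractor (a shared encoder trained jointly with a triplet metric-learning loss and a writer-classification loss) and of the 5-shot style conditioning with multi-style mixtures and noisy embeddings. The MobileNetV2 backbone, the 256-dimensional embedding, the margin, the loss weight `lam`, and the averaging-plus-Gaussian-noise scheme are all illustrative assumptions, not details confirmed by the text above.

```python
import torch
import torch.nn as nn
from torchvision import models


class HybridStyleExtractor(nn.Module):
    """Style encoder trained with both metric learning and classification.

    The MobileNetV2 backbone, 256-d embedding, and linear writer head are
    assumptions for illustration; any CNN feature extractor would fit here.
    """

    def __init__(self, num_writers: int, emb_dim: int = 256):
        super().__init__()
        self.backbone = models.mobilenet_v2(weights=None)
        self.backbone.classifier = nn.Identity()   # keep the 1280-d pooled features
        self.proj = nn.Linear(1280, emb_dim)       # style-embedding head
        self.writer_head = nn.Linear(emb_dim, num_writers)

    def forward(self, x: torch.Tensor):
        emb = self.proj(self.backbone(x))          # style embedding
        return emb, self.writer_head(emb)          # embedding + writer logits


triplet = nn.TripletMarginLoss(margin=0.2)
ce = nn.CrossEntropyLoss()


def hybrid_loss(model, anchor, positive, negative, writer_ids, lam=1.0):
    """Metric-learning term pulls same-writer pairs together and pushes
    different writers apart; the classification term keeps the embedding
    writer-discriminative. `lam` is a hypothetical balancing weight."""
    emb_a, logits_a = model(anchor)
    emb_p, _ = model(positive)
    emb_n, _ = model(negative)
    return triplet(emb_a, emb_p, emb_n) + lam * ce(logits_a, writer_ids)


def style_condition(model, style_images, noise_std=0.1):
    """5-shot conditioning: average the embeddings of K reference word
    images (possibly from several writers, giving a multi-style mixture)
    and add Gaussian noise, one plausible reading of the abstract's
    'noisy embeddings' variation strategy."""
    model.eval()
    with torch.no_grad():
        embs = torch.stack([model(img.unsqueeze(0))[0].squeeze(0)
                            for img in style_images])
    cond = embs.mean(dim=0)
    return cond + noise_std * torch.randn_like(cond)
```

In DiffusionPen the resulting style embedding would condition the latent diffusion model alongside the text condition; that conditioning interface is beyond the scope of this sketch.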



Acknowledgment

The computations and data handling were enabled by the Berzelius resource provided by the Knut and Alice Wallenberg Foundation at the National Supercomputer Centre at Linköping University. The publication/registration fees were partially covered by the University of West Attica.

Author information

Correspondence to Konstantina Nikolaidou.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 5724 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Nikolaidou, K., Retsinas, G., Sfikas, G., Liwicki, M. (2025). DiffusionPen: Towards Controlling the Style of Handwritten Text Generation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15143. Springer, Cham. https://doi.org/10.1007/978-3-031-73013-9_24

  • DOI: https://doi.org/10.1007/978-3-031-73013-9_24

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73012-2

  • Online ISBN: 978-3-031-73013-9

  • eBook Packages: Computer Science, Computer Science (R0)
