Font Style Translation in Scene Text Images with CLIPstyler

  • Conference paper
  • Pattern Recognition (ICPR 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15319)

Abstract

Scene text editing is widely used in various fields, such as poster design and correcting spelling mistakes in images. Editing text in images is a challenging task that requires integrating text accurately and naturally into complex backgrounds. Existing methods can replace the text content with a target text without altering the text style or the image background; however, arbitrary style transformation of the text region in an image has not yet been achieved. To address this issue, we propose a new framework named FontCLIPstyler, which enables the style transformation of text in scene text images using prompts. The proposed method comprises two networks: MaskNet, which extracts mask images of the text region, and StyleNet, which generates the stylized image. In addition, we propose a new loss function named Text-aware Loss, which guides StyleNet to transfer style features to the text region without changing the background. Through extensive experiments and ablation studies, we demonstrate the effectiveness of our method for scene text style transformation. The experimental results show that our approach successfully transfers the semantic style of the input prompt to the text region of the image, creating naturally stylized scene text while preserving the readability of the text and leaving the background unchanged.
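
The abstract describes the architecture only at a high level; as a rough, unofficial illustration of the core idea, the PyTorch sketch below combines a prompt-driven CLIP style term on the masked text region with a pixel-level term that keeps the background fixed. The names MaskNet and StyleNet come from the abstract, but the function text_aware_loss, the use of a plain global CLIP similarity (simpler than CLIPstyler's patch-wise directional loss), and the loss weighting are all assumptions rather than the authors' implementation.

```python
# Minimal sketch of a mask-restricted, prompt-guided style loss, reconstructed
# from the abstract alone. The exact form of the paper's Text-aware Loss is
# not reproduced here; everything below is an assumption for illustration.
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model = clip_model.float()  # keep fp32 so fp32 image tensors match

# Official CLIP preprocessing statistics.
CLIP_MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073]).view(1, 3, 1, 1)
CLIP_STD = torch.tensor([0.26862954, 0.26130258, 0.27577711]).view(1, 3, 1, 1)


def clip_image_features(images: torch.Tensor) -> torch.Tensor:
    """Encode (B, 3, H, W) images in [0, 1] into normalized CLIP features."""
    images = F.interpolate(images, size=(224, 224), mode="bilinear",
                           align_corners=False)
    images = (images - CLIP_MEAN.to(images.device)) / CLIP_STD.to(images.device)
    feats = clip_model.encode_image(images)
    return feats / feats.norm(dim=-1, keepdim=True)


def text_aware_loss(content: torch.Tensor, stylized: torch.Tensor,
                    mask: torch.Tensor, prompt: str,
                    bg_weight: float = 10.0) -> torch.Tensor:
    """Hypothetical loss: pull the masked text region toward the prompt in
    CLIP space while penalizing any change outside the text mask.

    content, stylized: (B, 3, H, W) in [0, 1]; mask: (B, 1, H, W) in [0, 1].
    """
    tokens = clip.tokenize([prompt]).to(content.device)
    with torch.no_grad():
        text_feat = clip_model.encode_text(tokens)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    # Style term: CLIP similarity between the stylized text region and prompt.
    region_feat = clip_image_features(stylized * mask)
    style_loss = 1.0 - (region_feat @ text_feat.T).mean()

    # Background term: pixels outside the text mask should stay invariant.
    bg_loss = F.mse_loss(stylized * (1 - mask), content * (1 - mask))

    return style_loss + bg_weight * bg_loss  # weighting is an arbitrary guess
```

In a training loop one would plausibly compute `mask = mask_net(content)` and `stylized = style_net(content)` and update only StyleNet's parameters with this loss; that division of labor is inferred from the abstract, not taken from the paper.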

Author information

Corresponding author

Correspondence to Keiji Yanai.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Yuan, H., Yanai, K. (2025). Font Style Translation in Scene Text Images with CLIPstyler. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15319. Springer, Cham. https://doi.org/10.1007/978-3-031-78495-8_7

  • DOI: https://doi.org/10.1007/978-3-031-78495-8_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-78494-1

  • Online ISBN: 978-3-031-78495-8

  • eBook Packages: Computer Science, Computer Science (R0)
