Abstract
Scene text editing is widely used in applications such as poster design and correcting spelling mistakes in images. It is a challenging task that requires integrating text accurately and naturally into complex backgrounds. Existing methods can replace the text content with a target text without altering the text style or the image background; however, arbitrary style transformation of the text region has not yet been achieved. To address this issue, we propose a new framework named FontCLIPstyler, which enables prompt-driven style transformation of text in scene text images. The proposed method comprises two networks: MaskNet, which extracts a mask of the text region, and StyleNet, which generates the stylized image. In addition, we propose a new loss function, the Text-aware Loss, which guides StyleNet to transfer style features to the text region without changing the background. Extensive experiments and ablation studies demonstrate the effectiveness of our method for scene text style transformation. The results show that our approach successfully transfers the semantic style of the input prompt to the text region and creates naturally stylized scene text while preserving both the readability of the text and the background.
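To make the idea of a mask-gated, text-aware objective concrete, the following is a minimal PyTorch-style sketch, not the paper's exact formulation. It assumes a soft text mask produced by a MaskNet-like model, a generic image encoder `embed` standing in for a CLIP image encoder, a precomputed prompt direction `text_dir` in the embedding space, and an illustrative weight `lambda_bg`; all of these names and the specific terms are assumptions for illustration only.

```python
# Hypothetical sketch of a mask-gated "text-aware" style loss.
# Assumptions: `stylized` and `source` are (B, 3, H, W) images in [0, 1];
# `mask` is a soft text-region mask in [0, 1] with the same spatial size;
# `embed` maps an image batch to (B, D) embeddings (e.g. a CLIP image encoder);
# `text_dir` is a unit (D,) vector from a source prompt to the style prompt.
import torch
import torch.nn.functional as F

def text_aware_loss(stylized, source, mask, text_dir, embed, lambda_bg=10.0):
    # Composite image: take the stylized pixels only inside the text region,
    # and the original pixels everywhere else.
    text_region = mask * stylized + (1.0 - mask) * source

    # Directional term: the change in embedding caused by stylizing the text
    # region should align with the prompt direction.
    delta = embed(text_region) - embed(source)
    delta = delta / (delta.norm(dim=-1, keepdim=True) + 1e-8)
    dir_loss = (1.0 - (delta * text_dir).sum(dim=-1)).mean()

    # Background preservation: outside the mask, the output should stay
    # close to the original image.
    bg_loss = F.l1_loss((1.0 - mask) * stylized, (1.0 - mask) * source)

    return dir_loss + lambda_bg * bg_loss
```

In this sketch the mask plays the same role the abstract attributes to MaskNet: it confines the style signal to the text region while a separate term penalizes any change to the background, so the stylization network is never rewarded for repainting areas outside the text.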
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Yuan, H., Yanai, K. (2025). Font Style Translation in Scene Text Images with CLIPstyler. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15319. Springer, Cham. https://doi.org/10.1007/978-3-031-78495-8_7
DOI: https://doi.org/10.1007/978-3-031-78495-8_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-78494-1
Online ISBN: 978-3-031-78495-8