CaVIT: An integrated method for image style transfer using parallel CNN and vision transformer


Abstract

This study focuses on image style transfer: generating images in a desired style while preserving the underlying content structure. Existing models struggle to represent both content and style features accurately. To address this, we propose CaVIT, an integrated style transfer method that combines a Convolutional Neural Network (CNN) and a Vision Transformer (VIT) in parallel. The method encodes style features with a VGG-19 network extended with residual blocks for finer refinement, and introduces the PA-Trans Encoder Layer, inspired by the standard Transformer encoder layer, to encode content features efficiently while preserving the complete content structure. The fused features are then decoded into stylized images by a CNN decoder. Qualitative and quantitative evaluations show that the proposed method outperforms existing models and delivers high-quality results.
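The pipeline described above can be summarized in code. The following PyTorch sketch mirrors only what the abstract states: a VGG-19 style branch refined by residual blocks, a transformer-based content branch standing in for the PA-Trans Encoder Layer, and a CNN decoder over the fused features. The layer sizes, the cut point in VGG-19, the patch size, and the fusion-by-addition step are illustrative assumptions, not the authors' implementation.

```python
# Structural sketch of the CaVIT pipeline as described in the abstract.
# All layer sizes, the residual-block design, and the fusion step are
# illustrative assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn
import torchvision.models as models

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
    def forward(self, x):
        return x + self.body(x)

class StyleEncoder(nn.Module):
    """CNN branch: VGG-19 features (here cut at relu4_1) plus residual refinement."""
    def __init__(self):
        super().__init__()
        vgg = models.vgg19(weights=None).features  # pretrained weights assumed in practice
        self.vgg = nn.Sequential(*list(vgg.children())[:21])  # through relu4_1 (512 channels)
        self.refine = nn.Sequential(ResidualBlock(512), ResidualBlock(512))
    def forward(self, x):
        return self.refine(self.vgg(x))

class ContentEncoder(nn.Module):
    """Transformer branch; a standard encoder stands in for the PA-Trans Encoder Layer."""
    def __init__(self, dim=512, patch=8):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # patch embedding
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=3)
    def forward(self, x):
        z = self.embed(x)                      # B x C x H' x W'
        b, c, h, w = z.shape
        tokens = self.encoder(z.flatten(2).transpose(1, 2))  # B x (H'W') x C
        return tokens.transpose(1, 2).reshape(b, c, h, w)

class Decoder(nn.Module):
    """CNN decoder: upsamples fused features back to an RGB image."""
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(dim, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode='nearest'),
            nn.Conv2d(256, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode='nearest'),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode='nearest'),
            nn.Conv2d(64, 3, 3, padding=1),
        )
    def forward(self, x):
        return self.net(x)

class CaVITSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.style_enc = StyleEncoder()
        self.content_enc = ContentEncoder()
        self.decoder = Decoder()
    def forward(self, content, style):
        fs = self.style_enc(style)      # style features, CNN branch
        fc = self.content_enc(content)  # content features, transformer branch
        # The fusion step is unspecified in the abstract; simple addition used here.
        fused = fc + nn.functional.interpolate(fs, size=fc.shape[-2:])
        return self.decoder(fused)

stylized = CaVITSketch()(torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256))
print(stylized.shape)  # torch.Size([1, 3, 256, 256])
```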




Data Availability

The data that support the findings of this study are publicly available from the official dataset websites: https://cocodataset.org/#download (MS-COCO) and https://paperswithcode.com/dataset/wikiart (WikiArt).
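For readers reproducing the training setup, a minimal, hypothetical dataset wrapper for images downloaded from the links above might look as follows; the class and directory layout are illustrative, not part of the study.

```python
# Hypothetical loader for the data named above (MS-COCO for content images,
# WikiArt for style images). Paths are placeholders for local downloads.
from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset

class FlatImageFolder(Dataset):
    """Yields RGB images from a directory tree, e.g. coco/train2017 or wikiart/."""
    def __init__(self, root, transform=None):
        self.paths = sorted(p for p in Path(root).rglob('*')
                            if p.suffix.lower() in {'.jpg', '.jpeg', '.png'})
        self.transform = transform
    def __len__(self):
        return len(self.paths)
    def __getitem__(self, i):
        img = Image.open(self.paths[i]).convert('RGB')
        return self.transform(img) if self.transform else img
```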


Author information

Corresponding author

Correspondence to ZaiFang Zhang.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have influenced the work reported in this study.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhang, Z., Lu, S., Guo, Q. et al. CaVIT: An integrated method for image style transfer using parallel CNN and vision transformer. Appl Intell 55, 306 (2025). https://doi.org/10.1007/s10489-024-06114-5
