Abstract
This study addresses image style transfer: generating an image that adopts a desired style while preserving the underlying content structure. Existing models struggle to represent both content and style features accurately. To address this, we propose CaVIT, an integrated style-transfer method built on a parallel CNN and Vision Transformer, combining a Convolutional Neural Network (CNN) with a Vision Transformer (ViT) for enhanced performance. Our method uses VGG-19 augmented with residual blocks to encode and refine style features. In addition, a PA-Trans Encoder Layer, inspired by the standard Transformer encoder layer, encodes content features efficiently while preserving the complete content structure. The fused features are then decoded into the stylized image by a CNN decoder. Qualitative and quantitative evaluations demonstrate that the proposed method outperforms existing models, delivering high-quality results.
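The abstract describes the general pipeline (encode style, encode content with attention, fuse, decode) without code. The fusion step can be illustrated with a minimal toy sketch: a transformer-style self-attention pass over content patch features, followed by statistic-based style injection in the spirit of AdaIN. All names, shapes, and the specific fusion rule below are illustrative assumptions, not the authors' PA-Trans layer or CaVIT implementation.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    # Standard attention: softmax(QK^T / sqrt(d)) V, computed row-wise
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def adain_fuse(content_feat, style_feat, eps=1e-5):
    # AdaIN-style fusion: shift content feature statistics (per channel)
    # to match the style features' mean and standard deviation
    c_mean, c_std = content_feat.mean(0), content_feat.std(0) + eps
    s_mean, s_std = style_feat.mean(0), style_feat.std(0) + eps
    return (content_feat - c_mean) / c_std * s_std + s_mean

# Toy run: 16 content patches and 16 style patches, 64-dim features
rng = np.random.default_rng(0)
content = rng.standard_normal((16, 64))
style = rng.standard_normal((16, 64))

attended = scaled_dot_product_attention(content, content, content)
fused = adain_fuse(attended, style)
print(fused.shape)  # (16, 64)
```

In a real model the fused features would then pass through a CNN decoder to produce the stylized image; here the sketch only shows how attention-refined content features can be re-statisticized toward the style.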
Data Availability
The data that support the findings of this study are publicly accessible at the official website address: https://cocodataset.org/#download and https://paperswithcode.com/dataset/wikiart.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have influenced the work reported in this study.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, Z., Lu, S., Guo, Q. et al. CaVIT: An integrated method for image style transfer using parallel CNN and vision transformer. Appl Intell 55, 306 (2025). https://doi.org/10.1007/s10489-024-06114-5