DOI: 10.1145/3581783.3612518
research-article

Swin-UNIT: Transformer-based GAN for High-resolution Unpaired Image Translation

Published: 27 October 2023

Abstract

Transformer models have achieved considerable success in a variety of computer vision tasks owing to their capacity to model long-range dependencies. However, their application to high-resolution unpaired image translation with GANs has been limited by the quadratic complexity of self-attention with respect to the spatial resolution of the input features. In this paper, we propose Swin-UNIT, a novel transformer-based GAN for high-resolution unpaired image translation. Its two-stage generator consists of a global style translation (GST) module and a recurrent detail supplement (RDS) module. The GST module uses self-attention to translate low-resolution global features. The RDS module uses cross-attention to propagate information quickly from the global features to the high-resolution detail features. Moreover, we customize a dual-branch discriminator to guide the generator. Extensive experiments demonstrate that our model achieves state-of-the-art results on unpaired image translation tasks.
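
To make the complexity argument in the abstract concrete, the following minimal sketch counts query-key score entries for three attention patterns. It is an illustration only and not the authors' implementation: the function names, the 64x64 and 256x256 feature-map sizes, and the window size of 8 are all assumptions chosen for the example, and the count ignores channel width and the number of heads.

```python
def self_attention_pairs(height: int, width: int) -> int:
    """Score entries for global self-attention over a height x width feature map."""
    tokens = height * width
    return tokens * tokens  # every token attends to every token


def windowed_attention_pairs(height: int, width: int, window: int) -> int:
    """Score entries when attention is limited to non-overlapping window x window tiles."""
    if height % window or width % window:
        raise ValueError("height and width must be divisible by window")
    num_windows = (height // window) * (width // window)
    tokens_per_window = window * window
    return num_windows * tokens_per_window * tokens_per_window


def cross_attention_pairs(query_hw: tuple, key_hw: tuple) -> int:
    """Score entries when queries come from one map and keys from another."""
    query_tokens = query_hw[0] * query_hw[1]
    key_tokens = key_hw[0] * key_hw[1]
    return query_tokens * key_tokens


if __name__ == "__main__":
    # Hypothetical sizes: a low-resolution global map and a high-resolution detail map.
    low, high = (64, 64), (256, 256)
    print("global self-attention, low-res :", self_attention_pairs(*low))
    print("global self-attention, high-res:", self_attention_pairs(*high))
    print("windowed attention, high-res   :", windowed_attention_pairs(*high, window=8))
    print("cross-attention, high -> low   :", cross_attention_pairs(high, low))
```

Under these assumed sizes, a 4x increase in side length multiplies the global self-attention count by 256, whereas letting high-resolution queries attend to low-resolution keys costs 16x less than high-resolution self-attention. This is one plausible reading of why the abstract assigns self-attention to low-resolution global features and cross-attention to the high-resolution detail features.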

Cited By

  • (2025) RRCGAN: Unsupervised Compression of Radiometric Resolution of Remote Sensing Images Using Contrastive Learning. IEEE Transactions on Geoscience and Remote Sensing 63, 1-20. DOI: 10.1109/TGRS.2025.3528052. Online publication date: 2025
  • (2024) Generalizing ISP Model by Unsupervised Raw-to-raw Mapping. Proceedings of the 32nd ACM International Conference on Multimedia, 3809-3817. DOI: 10.1145/3664647.3681666. Online publication date: 28-Oct-2024
  • (2024) MappingFormer: Learning Cross-modal Feature Mapping for Visible-to-infrared Image Translation. Proceedings of the 32nd ACM International Conference on Multimedia, 10745-10754. DOI: 10.1145/3664647.3681375. Online publication date: 28-Oct-2024

Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. high-resolution
  2. transformer
  3. unpaired image translation

Qualifiers

  • Research-article

Funding Sources

  • National Key Research and Development Project of New Generation Artificial Intelligence of China
  • Key Research and Development Program of Shaanxi Province

Conference

MM '23
MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa ON, Canada

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 167
  • Downloads (last 6 weeks): 17
Reflects downloads up to 05 Mar 2025
