research-article

Swin-UNIT: Transformer-based GAN for High-resolution Unpaired Image Translation

Authors:

Yuehu Liu

MM '23: Proceedings of the 31st ACM International Conference on Multimedia

Pages 4657 - 4665

https://doi.org/10.1145/3581783.3612518

Published: 27 October 2023

Abstract

The transformer model has achieved considerable success in various computer vision tasks owing to its capacity to model long-range dependencies. However, its application to high-resolution unpaired image translation with GANs has been limited because its complexity grows quadratically with the spatial resolution of the input features. In this paper, we propose Swin-UNIT, a novel transformer-based GAN for high-resolution unpaired image translation. We design a two-stage generator that consists of a global style translation (GST) module and a recurrent detail supplement (RDS) module. The GST module uses self-attention to translate low-resolution global features. The RDS module uses cross-attention to propagate information quickly from the global features to the high-resolution detail features. Moreover, we customize a dual-branch discriminator to guide the generator. Extensive experiments demonstrate that our model achieves state-of-the-art results on unpaired image translation tasks.
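The cost argument in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it uses plain single-head scaled dot-product attention without learned projections or shifted windows, and the feature sizes (an 8x8 global map, a 32x32 detail map, 8 channels) and all names are assumptions chosen for illustration. It compares the size of the attention matrix when high-resolution detail tokens attend to low-resolution global tokens (cross-attention) with the size that full self-attention at high resolution would need.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)  # shape: (num_queries, num_keys)
    weights = softmax(scores, axis=-1)
    return weights @ values, weights


def token_count(height, width):
    """Number of tokens when every spatial position is one token."""
    return height * width


rng = np.random.default_rng(0)
dim = 8                 # hypothetical channel width
low_res = (8, 8)        # hypothetical global feature map size
high_res = (32, 32)     # hypothetical detail feature map size

global_tokens = rng.standard_normal((token_count(*low_res), dim))
detail_tokens = rng.standard_normal((token_count(*high_res), dim))

# Self-attention over the low-resolution global tokens: a 64 x 64 matrix.
_, self_weights = attention(global_tokens, global_tokens, global_tokens)

# Cross-attention: detail tokens are the queries, global tokens supply
# keys and values, so the attention matrix is 1024 x 64.
fused, cross_weights = attention(detail_tokens, global_tokens, global_tokens)

# Full self-attention at high resolution would need a 1024 x 1024 matrix.
full_self_attention_entries = token_count(*high_res) ** 2

print("low-res self-attention matrix:", self_weights.shape)
print("cross-attention matrix:", cross_weights.shape)
print("entries, cross vs. full high-res self-attention:",
      cross_weights.size, "vs.", full_self_attention_entries)
```

Under these assumed sizes the cross-attention matrix has 16 times fewer entries than full self-attention at high resolution would, which is the kind of saving that motivates restricting self-attention to low-resolution features.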

Index Terms

Computing methodologies

Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia

October 2023

9913 pages

ISBN: 9798400701085

DOI: 10.1145/3581783

General Chairs:
Abdulmotaleb El Saddik, University of Ottawa, Canada & MBZUAI, UAE
Tao Mei, HiDream.ai, China
Rita Cucchiara, University of Modena and Reggio Emilia, Italy

Program Chairs:
Marco Bertini, University of Florence, Italy
Diana Patricia Tobon Vallejo, Universidad de Medellin, Colombia
Pradeep K. Atrey, University at Albany, State University of New York, USA
M. Shamim Hossain, King Saud University, KSA

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

National Key Research and Development Project of New Generation Artificial Intelligence of China
Key Research and Development Program of Shaanxi Province

Conference

MM '23: The 31st ACM International Conference on Multimedia

October 29 - November 3, 2023

Ottawa ON, Canada

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
