DMDIT: Diverse multi-domain image-to-image translation
Introduction
Numerous computer vision tasks can be cast as image-to-image translation problems, such as super-resolution [1], [2], [3], style transfer [4], [5], [6], [7], image inpainting [8], [9], [10], makeup transfer [11], etc. Image translation aims to learn the mapping between two visual domains, and it has made impressive progress recently. However, combining multi-domain and multi-modality translation remains challenging. In this paper, we explore a unified framework that performs diverse, multi-domain image-to-image translation without paired data.
Recently, several cross-domain image translation methods have been proposed. For instance, Isola et al. [12] presented a conditional adversarial network for image translation, which not only learns the mapping between the source and target domains but also adaptively learns the loss function for model optimization. Building on [12], BicycleGAN [13] enforces an invertible connection between the output and the latent code, which helps to generate multiple outputs. Yet it can only produce diversity in color and illumination.
The aforementioned methods require paired data during training, which are difficult to obtain or even nonexistent for some visual tasks. To alleviate this issue, cycle consistency regularization was proposed in CycleGAN [14], DiscoGAN [15], and DualGAN [16], which learn a deterministic mapping under a reconstruction-loss constraint. In addition, UNIT [17] hypothesizes a shared latent space and adopts a weight-sharing strategy during training to guarantee a common feature space.
Nevertheless, the previous methods only consider learning the mapping between two domains. That is, for k domains, k(k−1) models need to be trained, which greatly limits their practical application. To address this concern, IcGAN [18] and Fader Networks [19] combine an encoder–decoder architecture with a GAN to translate multiple attributes. Subsequently, StarGAN [20] adds a domain label to control the translation among multiple domains. AttGAN [21] imposes attribute classification constraints on the generated images, which preserves the characteristics of attribute-independent regions to a certain extent. Building on AttGAN, STGAN [22] adds a selective transfer unit to edit attributes more precisely.
Although the above studies achieve image translation among several domains, they still learn a deterministic mapping for each domain, whereas the data distribution of a domain may contain more than one modality. Hence, different approaches have recently been explored to obtain diverse outputs. DRIT [23] and MUNIT [24] assume that an image can be decomposed into a content space and an attribute space, so that multiple outputs can be obtained by swapping the domain-specific attribute codes. SDIT [25] proposes a framework that combines multi-domain and multi-modality translation, but there are merely tiny differences between its diverse outputs. MSGAN [26] puts forward an effective regularization term in the loss function to increase diversity. Although StarGAN v2 [27] can produce diverse, visually pleasing results, its domains only cover several coarse categories, which cannot accurately control the editing of specific details.
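For context, the regularization in MSGAN [26] encourages the generator to maximize the ratio of the distance between two generated images to the distance between their latent codes, roughly of the form

\mathcal{L}_{ms} = \max_G \frac{d_I\big(G(c, z_1), G(c, z_2)\big)}{d_z(z_1, z_2)},

where d_I and d_z denote distance metrics in the image and latent space, respectively (this is a restatement of the mode-seeking term in [26], not the loss proposed in this paper).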
To sum up, a versatile framework for multi-domain and multi-modality image translation is still lacking. Noise injection is a common way to improve diversity in image translation; however, the manner and position of noise injection directly affect the diversity and quality of the generated images. Considering output quality and the risk of mode collapse, in this paper we propose a unified model to produce diverse outputs in multiple domains, as Fig. 1 shows. Latent noise sampled from a normal distribution is injected into the adaptive instance normalization layer to obtain different outputs. In addition, a straightforward and effective style regularization loss is proposed to enhance diversity. To prevent mode collapse, a separation module is embedded in the discriminator to constrain the latent noise. Besides, an attention mechanism is applied to preserve the features of attribute-independent regions as much as possible.
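To illustrate the idea of injecting latent noise through adaptive instance normalization, the following minimal PyTorch-style sketch maps a noise vector to per-channel scale and bias parameters; module and variable names are placeholders chosen for clarity, not the exact implementation used in this paper.

```python
import torch
import torch.nn as nn

class NoiseAdaIN(nn.Module):
    """Adaptive instance normalization whose affine parameters are
    predicted from a latent noise vector (illustrative sketch)."""
    def __init__(self, num_channels, noise_dim):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_channels, affine=False)
        # maps the noise vector to per-channel scale (gamma) and bias (beta)
        self.style = nn.Linear(noise_dim, num_channels * 2)

    def forward(self, feat, z):
        # feat: (N, C, H, W) feature map, z: (N, noise_dim) sampled from N(0, I)
        gamma, beta = self.style(z).chunk(2, dim=1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return (1 + gamma) * self.norm(feat) + beta

# Different noise samples yield different stylizations of the same feature map:
# feat = torch.randn(4, 256, 32, 32); z = torch.randn(4, 16)
# out = NoiseAdaIN(256, 16)(feat, z)
```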
Overall, our contributions are as follows:
- A novel and unified framework is proposed, which achieves multi-domain and multi-modality image-to-image translation without paired data.
- A straightforward and effective style regularization loss is put forward to increase the diversity of outputs.
- To avoid mode collapse, a latent code separation module is embedded in the discriminator to impose a restriction on the latent noise. Besides, an attention mechanism is used to guide the model to focus on the most attribute-relevant regions.
In the following sections, we first introduce the related work in Section 2, and then elaborate the details of the proposed method in Section 3. Section 4 describes the experimental setup, including datasets, network structure, and training details. In Section 5, extensive experiments are presented to verify the effectiveness of our method.
Generative adversarial networks
Generative adversarial networks (GANs) [28] have flourished as an effective image generation model since their inception. The vanilla GAN consists of two parts: a generator G and a discriminator D. G is a network that takes a random noise vector z as input and generates an image G(z); D is a network that discriminates whether its input is real. G and D evolve in a minimax game to reach a balance between them. Numerous derivatives of GAN have been proposed [29], [30], [31], [32].
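For reference, the vanilla GAN objective [28] is the minimax game

\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big].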
DMDIT: Diverse multi-domain image-to-image translation
The purpose of this paper is to explore a unified model that realizes multi-domain and multi-modality image-to-image translation without paired data. Multi-domain means that the translation of multiple attributes can be performed simultaneously with a single generator. Multi-modality means that the mapping for each domain is one-to-many, so the translation results in each domain are varied rather than deterministic. As illustrated in Fig. 2, our model consists of a generator, into which latent noise is injected through adaptive instance normalization, and a discriminator with an embedded latent code separation module.
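The one-to-many mapping can be sketched as follows; the generator interface (taking an image, a target domain label, and a noise vector, and returning a translated image with an attention mask) and the mask-based blending are illustrative assumptions, not the exact implementation of this paper.

```python
import torch

def diverse_translate(generator, x, target_label, num_samples=3, noise_dim=16):
    """Illustration of one-to-many translation: for the same input image and
    target domain label, sampling different latent noise vectors yields
    diverse outputs. `generator` is assumed to be a callable taking
    (image, label, noise) and returning a translated image and an attention
    mask -- a hypothetical interface for this sketch."""
    outputs = []
    for _ in range(num_samples):
        z = torch.randn(x.size(0), noise_dim, device=x.device)  # z ~ N(0, I)
        y_fake, attn_mask = generator(x, target_label, z)
        # a common use of such a mask: keep attribute-independent regions from x
        outputs.append(attn_mask * y_fake + (1 - attn_mask) * x)
    return outputs
```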
Datasets
In this paper, all experiments are conducted on the CelebFaces Attributes (CelebA) dataset, which contains 202,599 face images of 10,177 identities. Each image is annotated with a face bounding box, 5 landmark positions, and 40 binary attribute labels. The dataset is divided into three parts for training, validation, and testing. For data augmentation, all images are horizontally flipped, randomly cropped to 178 × 178, and finally resized to 128 × 128. In this paper, 13 attributes are selected for the experiments.
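A minimal sketch of the preprocessing described above, assuming standard torchvision transforms (the normalization to [-1, 1] is a common GAN convention and an assumption here, not stated in the paper):

```python
from torchvision import transforms

# horizontal flip, random 178x178 crop, then resize to 128x128, as described above
celeba_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(178),
    transforms.Resize(128),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),  # scale to [-1, 1]
])
```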
Experimental results
In this section, we demonstrate the effectiveness of the proposed DMDIT in three ways. In each part, quantitative and qualitative experiments are conducted to compare with state-of-the-art methods. Finally, an ablation study is presented to analyze the effect of the different components of our method.
Conclusion
In this paper, DMDIT is introduced to perform multi-domain and multi-modality image translation within a unified framework. Multi-domain translation is achieved by conditioning the generator on the target domain label, and multi-modality by applying adaptive instance normalization with noise injection. Furthermore, an attention module is used to generate a mask, which better preserves the underlying characteristics. To further increase diversity, a novel style regularization loss and a latent code separation module are introduced.
CRediT authorship contribution statement
Mingwen Shao: Supervision, Validation. Youcai Zhang: Investigation, Writing – original draft. Huan Liu: Methodology, Data curation. Chao Wang: Formal analysis, Writing – review & editing. Le Li: Conceptualization, Formal analysis. Xun Shao: Writing – review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
The authors are very indebted to the anonymous referees for their critical comments and suggestions for the improvement of this paper. This work was supported by grants from the National Natural Science Foundation of China (Nos. 61673396, 61976245). All authors approved the version of the manuscript to be published.
References (41)
- et al., Multi-scale generative adversarial inpainting network based on cross-layer attention transfer mechanism, Knowl.-Based Syst. (2020)
- et al., Toward multimodal image-to-image translation
- et al., IIT-GAT: Instance-level image transformation via unsupervised generative attention networks with disentangled representations, Knowl.-Based Syst. (2021)
- et al., Image super-resolution using deep convolutional networks, IEEE Trans. Pattern Anal. Mach. Intell. (2015)
- X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, C. Change Loy, ESRGAN: Enhanced super-resolution generative...
- W. Zhang, Y. Liu, C. Dong, Y. Qiao, RankSRGAN: Generative adversarial networks with ranker for image super-resolution,...
- T. Karras, S. Laine, T. Aila, A style-based generator architecture for generative adversarial networks, in: Proceedings...
- H. Chang, J. Lu, F. Yu, A. Finkelstein, PairedCycleGAN: Asymmetric style transfer for applying and removing makeup, in:...
- et al., Unsupervised image-to-image translation using intra-domain reconstruction loss, Int. J. Mach. Learn. Cybern. (2020)
- et al., Face attribute editing based on generative adversarial networks, Signal Image Video Process. (2020)
- Unsupervised image-to-image translation networks
- Fader networks: Manipulating images by sliding attributes