DMDIT: Diverse multi-domain image-to-image translation
Introduction
Numerous computer vision tasks can be cast as image-to-image translation problems, such as super-resolution [1], [2], [3], style transfer [4], [5], [6], [7], image inpainting [8], [9], [10], makeup transfer [11], etc. Image translation aims to learn the mapping between two visual domains, and it has made impressive progress recently. However, combining multi-domain and multi-modality translation remains challenging. In this paper, we explore a unified framework that performs diverse, multi-domain image-to-image translation without paired data.
Recently, several cross-domain image translation methods have been proposed. For instance, Isola et al. [12] presented a conditional adversarial network for image translation, which not only learns the mapping between the source and target domains but also adaptively learns the loss function for model optimization. Building on [12], BicycleGAN [13] enforces an invertible connection between the output and the latent code, which helps to generate multiple outputs. Yet it can only produce diversity in color and illumination.
The aforementioned methods require paired data during training, which are difficult to obtain or even nonexistent for some visual tasks. To alleviate this issue, cycle consistency regularization was proposed in CycleGAN [14], DiscoGAN [15], and DualGAN [16], which learn a deterministic mapping under a reconstruction-loss constraint. In addition, UNIT [17] hypothesizes a shared latent space and adopts a weight-sharing strategy during training to guarantee a common feature space.
Nevertheless, the previous methods only consider learning the mapping between two domains. That is, for k domains, k(k−1) models need to be trained, which greatly limits their practical application. To address this concern, IcGAN [18] and Fader Networks [19] combine an encoder–decoder architecture with a GAN to translate multiple attributes. Subsequently, StarGAN [20] adds a domain label to control the translation among multiple domains. AttGAN [21] imposes attribute classification constraints on the generated images, which preserves the characteristics of attribute-independent regions to a certain extent. Building on AttGAN, STGAN [22] adds a selective transfer unit to edit attributes more precisely.
Although the above studies achieve image translation among several domains, they still learn a deterministic mapping for each domain, whereas the data distribution of a domain may contain more than one modality. Hence, different approaches have recently been explored to obtain diverse outputs. DRIT [23] and MUNIT [24] assume that an image can be decomposed into a content space and an attribute space, so that multiple outputs can be obtained by swapping the domain-specific attribute codes. SDIT [25] proposes a framework that combines multi-domain and multi-modality translation, but there are merely tiny differences between its diverse outputs. MSGAN [26] puts forward an effective regularization term in the loss function to increase diversity. Although StarGAN v2 [27] can produce diverse, visually pleasing results, its domains only cover several coarse categories, which cannot accurately control the editing of specific details.
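For context, the regularization in MSGAN [26] encourages the generator to maximize the ratio of the distance between two generated images to the distance between their latent codes, roughly of the form

\mathcal{L}_{ms} = \max_G \frac{d_I\big(G(c, z_1), G(c, z_2)\big)}{d_z(z_1, z_2)},

where d_I and d_z denote distance metrics in the image and latent space, respectively (this is a restatement of the mode-seeking term in [26], not the loss proposed in this paper).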
To sum up, a versatile framework for multi-domain and multi-modality image translation is still lacking. Noise injection is a common way to improve diversity in image translation; however, the manner and position of noise injection directly affect the diversity and quality of the generated images. Considering output quality and the risk of mode collapse, in this paper we propose a unified model to produce diverse outputs in multiple domains, as Fig. 1 shows. Latent noise sampled from a normal distribution is injected into the adaptive instance normalization layer to obtain different outputs. In addition, a straightforward and effective style regularization loss is proposed to enhance diversity. To prevent mode collapse, a separation module is embedded in the discriminator to constrain the latent noise. Besides, an attention mechanism is applied to preserve the features of attribute-independent regions as much as possible.
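To illustrate the idea of injecting latent noise through adaptive instance normalization, the following minimal PyTorch-style sketch maps a noise vector to per-channel scale and bias parameters; module and variable names are placeholders chosen for clarity, not the exact implementation used in this paper.

```python
import torch
import torch.nn as nn

class NoiseAdaIN(nn.Module):
    """Adaptive instance normalization whose affine parameters are
    predicted from a latent noise vector (illustrative sketch)."""
    def __init__(self, num_channels, noise_dim):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_channels, affine=False)
        # maps the noise vector to per-channel scale (gamma) and bias (beta)
        self.style = nn.Linear(noise_dim, num_channels * 2)

    def forward(self, feat, z):
        # feat: (N, C, H, W) feature map, z: (N, noise_dim) sampled from N(0, I)
        gamma, beta = self.style(z).chunk(2, dim=1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return (1 + gamma) * self.norm(feat) + beta

# Different noise samples yield different stylizations of the same feature map:
# feat = torch.randn(4, 256, 32, 32); z = torch.randn(4, 16)
# out = NoiseAdaIN(256, 16)(feat, z)
```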
Overall, our contributions are as follows:
- A novel and unified framework is proposed, which achieves multi-domain and multi-modality image-to-image translation without paired data.
- A straightforward and effective style regularization loss is put forward to increase the diversity of outputs.
- To avoid mode collapse, a latent code separation module is embedded in the discriminator to impose a restriction on the latent noise. Besides, an attention mechanism is used to guide the model to focus on the most attribute-relevant regions.
In the following sections, we first introduce the related work in Section 2, and then elaborate the details of the proposed method in Section 3. Section 4 describes the experimental setup, including datasets, network structure, and training details. In Section 5, extensive experiments are presented to verify the effectiveness of our method.
Generative adversarial networks
Generative adversarial networks (GANs) [28] have flourished as an effective image generation model since their inception. The vanilla GAN consists of two parts: a generator G and a discriminator D. G is a network that takes a random noise vector z as input and generates an image G(z); D is a network that discriminates whether its input is real. G and D evolve in a minimax game to reach a balance between them. Numerous derivatives of GAN have been proposed [29], [30], [31], [32].
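For reference, the vanilla GAN objective [28] is the minimax game

\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big].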
DMDIT: Diverse multi-domain image-to-image translation
The purpose of this paper is to explore a unified model that realizes multi-domain and multi-modality image-to-image translation without paired data. Multi-domain means that the translation of multiple attributes can be performed simultaneously with a single generator. Multi-modality means that the mapping for each domain is one-to-many, so the translation results in each domain are varied rather than deterministic. As illustrated in Fig. 2, our model consists of a generator, into which latent noise is injected through adaptive instance normalization, and a discriminator with an embedded latent code separation module.
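The one-to-many mapping can be sketched as follows; the generator interface (taking an image, a target domain label, and a noise vector, and returning a translated image with an attention mask) and the mask-based blending are illustrative assumptions, not the exact implementation of this paper.

```python
import torch

def diverse_translate(generator, x, target_label, num_samples=3, noise_dim=16):
    """Illustration of one-to-many translation: for the same input image and
    target domain label, sampling different latent noise vectors yields
    diverse outputs. `generator` is assumed to be a callable taking
    (image, label, noise) and returning a translated image and an attention
    mask -- a hypothetical interface for this sketch."""
    outputs = []
    for _ in range(num_samples):
        z = torch.randn(x.size(0), noise_dim, device=x.device)  # z ~ N(0, I)
        y_fake, attn_mask = generator(x, target_label, z)
        # a common use of such a mask: keep attribute-independent regions from x
        outputs.append(attn_mask * y_fake + (1 - attn_mask) * x)
    return outputs
```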
Datasets
In this paper, all experiments are conducted on the CelebFaces Attributes (CelebA) dataset, which contains 202,599 face images of 10,177 identities. Each image is annotated with a face bounding box, 5 landmark positions, and 40 binary attribute labels. The dataset is divided into three parts for training, validation, and testing. For data augmentation, all images are horizontally flipped, randomly cropped to 178 × 178, and finally resized to 128 × 128. In this paper, 13 attributes are selected for the experiments.
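A minimal sketch of the preprocessing described above, assuming standard torchvision transforms (the normalization to [-1, 1] is a common GAN convention and an assumption here, not stated in the paper):

```python
from torchvision import transforms

# horizontal flip, random 178x178 crop, then resize to 128x128, as described above
celeba_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(178),
    transforms.Resize(128),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),  # scale to [-1, 1]
])
```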
Experimental results
In this section, we demonstrate the effectiveness of the proposed DMDIT in three ways. In each part, quantitative and qualitative experiments are conducted to compare with state-of-the-art methods. Finally, an ablation study is presented to analyze the effect of the different components of our method.
Conclusion
In this paper, DMDIT is introduced to perform multi-domain and multi-modality image translation within a unified framework. Multi-domain translation is achieved by conditioning the generator on the target domain label, and multi-modality by applying adaptive instance normalization with noise injection. Furthermore, an attention module is used to generate a mask, which better preserves the underlying characteristics. To further increase diversity, a novel style regularization loss and a latent code separation module are introduced.
CRediT authorship contribution statement
Mingwen Shao: Supervision, Validation. Youcai Zhang: Investigation, Writing – original draft. Huan Liu: Methodology, Data curation. Chao Wang: Formal analysis, Writing – review & editing. Le Li: Conceptualization, Formal analysis. Xun Shao: Writing – review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
The authors are very indebted to the anonymous referees for their critical comments and suggestions for the improvement of this paper. This work was supported by grants from the National Natural Science Foundation of China (Nos. 61673396, 61976245). All authors approved the version of the manuscript to be published.
References (41)
- et al., Multi-scale generative adversarial inpainting network based on cross-layer attention transfer mechanism, Knowl.-Based Syst. (2020)
- et al., Toward multimodal image-to-image translation
- et al., IIT-GAT: Instance-level image transformation via unsupervised generative attention networks with disentangled representations, Knowl.-Based Syst. (2021)
- et al., Image super-resolution using deep convolutional networks, IEEE Trans. Pattern Anal. Mach. Intell. (2015)
- X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, C. Change Loy, ESRGAN: Enhanced super-resolution generative...
- W. Zhang, Y. Liu, C. Dong, Y. Qiao, RankSRGAN: Generative adversarial networks with ranker for image super-resolution,...
- T. Karras, S. Laine, T. Aila, A style-based generator architecture for generative adversarial networks, in: Proceedings...
- H. Chang, J. Lu, F. Yu, A. Finkelstein, PairedCycleGAN: Asymmetric style transfer for applying and removing makeup, in:...
- et al., Unsupervised image-to-image translation using intra-domain reconstruction loss, Int. J. Mach. Learn. Cybern. (2020)
- et al., Face attribute editing based on generative adversarial networks, Signal Image Video Process. (2020)
- Unsupervised image-to-image translation networks
- Fader networks: Manipulating images by sliding attributes