Abstract
Existing conditional image synthesis frameworks generate images based on user inputs in a single modality, such as text, segmentation, or sketch. They do not allow users to simultaneously use inputs in multiple modalities to control the image synthesis output. This reduces their practicality as multimodal inputs are more expressive and complement each other. To address this limitation, we propose the Product-of-Experts Generative Adversarial Networks (PoE-GAN) framework, which can synthesize images conditioned on multiple input modalities or any subset of them, even the empty set. We achieve this capability with a single trained model. PoE-GAN consists of a product-of-experts generator and a multimodal multiscale projection discriminator. Through our carefully designed training scheme, PoE-GAN learns to synthesize images with high quality and diversity. Besides advancing the state of the art in multimodal conditional image synthesis, PoE-GAN also outperforms the best existing unimodal conditional image synthesis approaches when tested in the unimodal setting. The project website is available at this link.
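The generator combines the conditional latent distributions predicted from each observed input modality through a product of experts, together with a prior expert, which is what lets a single model handle any subset of modalities, including the empty set. The snippet below is a minimal sketch of the generic Gaussian product-of-experts fusion this relies on, not the paper's actual implementation: it assumes each modality encoder outputs a diagonal Gaussian (mean and log-variance), and the function name `poe_fuse`, the tensor shapes, and the PyTorch framing are illustrative choices of ours.

```python
import torch

def poe_fuse(mus, logvars, shape=(1, 8)):
    """Fuse diagonal-Gaussian experts (one per observed modality) via a
    product of experts. A standard-Gaussian prior expert is always included,
    so passing empty lists falls back to unconditional sampling from N(0, I).
    """
    # Prior expert N(0, I): zero mean, zero log-variance.
    mus = [torch.zeros(shape)] + list(mus)
    logvars = [torch.zeros(shape)] + list(logvars)

    # Product of Gaussians: precisions add; the mean is precision-weighted.
    precisions = [torch.exp(-lv) for lv in logvars]   # 1 / sigma_i^2
    fused_var = 1.0 / sum(precisions)
    fused_mu = fused_var * sum(p * m for p, m in zip(precisions, mus))

    # Reparameterized sample from the fused Gaussian.
    z = fused_mu + fused_var.sqrt() * torch.randn(shape)
    return z, fused_mu, fused_var

# Text and segmentation experts observed; a sketch expert is simply omitted.
t_mu, t_lv = torch.randn(4, 8), torch.zeros(4, 8)
s_mu, s_lv = torch.randn(4, 8), torch.zeros(4, 8)
z, mu, var = poe_fuse([t_mu, s_mu], [t_lv, s_lv], shape=(4, 8))

# No modalities observed: the fused distribution is just the prior N(0, I).
z_uncond, _, _ = poe_fuse([], [], shape=(4, 8))
```

Because absent modalities are simply left out of the product rather than replaced by placeholder inputs, the same trained fusion handles unimodal, multimodal, and unconditional sampling.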
Notes
1. The conditional independence assumption is sound in our setting, since an image alone contains sufficient information to infer a modality independently of the other modalities. For example, given an image, we do not need its caption to infer its segmentation.
2. With a slight abuse of notation, we use \(q(z|y_{i})\) (and similarly \(p(z|\mathcal{Y})\)) to denote both the “true” distribution and the estimated distribution produced by our network (see the closed-form sketch after these notes).
3. Except for \(p(z^0)\), which is simply a standard Gaussian distribution.
4. Except for the first layer, which convolves a constant feature map, and the last layer, which convolves the previous features to synthesize the output image.
5. As a result, the baseline scores differ slightly from those reported in the original papers.
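For reference, assuming each expert \(q(z|y_{i})\) in note 2 is a diagonal Gaussian \(\mathcal{N}(\mu_{i}, \sigma_{i}^{2})\) and the prior expert is a standard Gaussian (cf. note 3), the product of experts has the following closed form. This is the standard result for products of Gaussians, shown here as an illustrative sketch rather than a restatement of the paper's exact derivation:

\[
p(z \mid \mathcal{Y}) \;\propto\; p(z) \prod_{i=1}^{M} q(z \mid y_{i}), \qquad
\sigma^{-2} = 1 + \sum_{i=1}^{M} \sigma_{i}^{-2}, \qquad
\mu = \sigma^{2} \sum_{i=1}^{M} \sigma_{i}^{-2}\,\mu_{i},
\]

so the fused distribution is \(\mathcal{N}(\mu, \sigma^{2})\), and when no modality is observed (\(M = 0\)) it reduces to the standard Gaussian prior.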
Acknowledgements
We thank Jan Kautz, David Luebke, Tero Karras, Timo Aila, and Zinan Lin for their feedback on the manuscript. We thank Daniel Gifford and Andrea Gagliano for their help with data collection.