
Multimodal Conditional Image Synthesis with Product-of-Experts GANs

  • Conference paper

Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13676)


Abstract

Existing conditional image synthesis frameworks generate images based on user inputs in a single modality, such as text, segmentation, or sketch. They do not allow users to simultaneously use inputs in multiple modalities to control the image synthesis output. This reduces their practicality as multimodal inputs are more expressive and complement each other. To address this limitation, we propose the Product-of-Experts Generative Adversarial Networks (PoE-GAN) framework, which can synthesize images conditioned on multiple input modalities or any subset of them, even the empty set. We achieve this capability with a single trained model. PoE-GAN consists of a product-of-experts generator and a multimodal multiscale projection discriminator. Through our carefully designed training scheme, PoE-GAN learns to synthesize images with high quality and diversity. Besides advancing the state of the art in multimodal conditional image synthesis, PoE-GAN also outperforms the best existing unimodal conditional image synthesis approaches when tested in the unimodal setting. The project website is available at this link.


Notes

  1. The conditional independence assumption is sound in our setting since an image alone contains sufficient information to infer a modality, independent of the other modalities. For example, given an image, we do not need its caption to infer its segmentation (see the short derivation after these notes).

  2. With a slight abuse of notation, we will use \(q(z|y_{i})\) (and similarly \(p(z|\mathcal {Y})\)) to denote both the “true” distribution and the estimated distribution produced by our network.

  3. Except for \(p(z^0)\), which is simply a standard Gaussian distribution.

  4. Except for the first layer, which convolves a constant feature map, and the last layer, which convolves the previous output to synthesize the output image.

  5. As a result, the baseline scores differ slightly from those in the original papers.


Acknowledgements

We thank Jan Kautz, David Luebke, Tero Karras, Timo Aila, and Zinan Lin for their feedback on the manuscript. We thank Daniel Gifford and Andrea Gagliano for their help with data collection.

Author information


Corresponding author

Correspondence to Xun Huang.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 19,308 KB)


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Huang, X., Mallya, A., Wang, T.-C., Liu, M.-Y. (2022). Multimodal Conditional Image Synthesis with Product-of-Experts GANs. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13676. Springer, Cham. https://doi.org/10.1007/978-3-031-19787-1_6


  • DOI: https://doi.org/10.1007/978-3-031-19787-1_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19786-4

  • Online ISBN: 978-3-031-19787-1

  • eBook Packages: Computer Science, Computer Science (R0)
