Self-supervised Disentanglement of Modality-Specific and Shared Factors Improves Multimodal Generative Models

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12544)

Abstract

Multimodal generative models learn a joint distribution over multiple modalities and thus have the potential to learn richer representations than unimodal models. However, current approaches are either inefficient in dealing with more than two modalities or fail to capture both modality-specific and shared variations. We introduce a new multimodal generative model that integrates both modality-specific and shared factors and aggregates shared information across any subset of modalities efficiently. Our method partitions the latent space into disjoint subspaces for modality-specific and shared factors and learns to disentangle these in a purely self-supervised manner. Empirically, we show improvements in representation learning and generative performance compared to previous methods and showcase the disentanglement capabilities.
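
As a concrete illustration of the latent-space partitioning described above, the sketch below shows an encoder whose output is split into a modality-specific and a shared Gaussian posterior. This is a minimal sketch under illustrative assumptions (the layer sizes, names, and per-subspace Gaussian parameterization are ours, not the authors' released code):

    import torch
    import torch.nn as nn

    class PartitionedEncoder(nn.Module):
        """Hypothetical encoder mapping one modality to two disjoint
        latent subspaces: a modality-specific part and a shared part."""

        def __init__(self, input_dim=784, specific_dim=16, shared_dim=16):
            super().__init__()
            self.backbone = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
            # Separate heads parameterize a Gaussian posterior per subspace.
            self.specific_mu = nn.Linear(256, specific_dim)
            self.specific_logvar = nn.Linear(256, specific_dim)
            self.shared_mu = nn.Linear(256, shared_dim)
            self.shared_logvar = nn.Linear(256, shared_dim)

        def forward(self, x):
            h = self.backbone(x)
            return (self.specific_mu(h), self.specific_logvar(h),
                    self.shared_mu(h), self.shared_logvar(h))

With one such encoder per modality, the shared posteriors can be aggregated across any subset of modalities (e.g., by a product or mixture of experts), while each modality-specific subspace is used only by its own decoder; note 3 below discusses the choice of subspace dimensionalities.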


Notes

  1. This problem has also been observed in [21], where it is described as “averaging over inseparable individual beliefs”.

  2. https://github.com/iffsid/mmvae.

  3. The number of latent dimensions for the modality-specific and shared representations is a hyperparameter of our model. Empirically, we found the effect of changing the dimensionality to be minor, as long as neither latent space is too small.

  4. We further observed that without sampling from the posterior (i.e., without reparameterization; see the sketch after these notes) both the MVAE and MMVAE tend to generate samples with very little diversity, even if diverse input images are used.

  5. Note that the weights of the likelihood terms have been observed to be important hyperparameters in both [40] and [31]; the sketch below includes such weights.

References

  1. Alemi, A.A., Fischer, I., Dillon, J.V., Murphy, K.: Deep variational information bottleneck. In: International Conference on Learning Representations (2017)

  2. Almahairi, A., Rajeswar, S., Sordoni, A., Bachman, P., Courville, A.C.: Augmented CycleGAN: learning many-to-many mappings from unpaired data. In: International Conference on Machine Learning (2018)

  3. Baltrušaitis, T., Ahuja, C., Morency, L.P.: Multimodal machine learning: a survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 41(2), 423–443 (2019)

  4. Bouchacourt, D., Tomioka, R., Nowozin, S.: Multi-level variational autoencoder: learning disentangled representations from grouped observations. In: AAAI Conference on Artificial Intelligence (2018)

  5. Chartsias, A., et al.: Disentangled representation learning in cardiac image analysis. Med. Image Anal. 58, 101535 (2019)

  6. Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., Abbeel, P.: InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. In: Advances in Neural Information Processing Systems (2016)

  7. Choi, Y., Choi, M., Kim, M., Ha, J.W., Kim, S., Choo, J.: StarGAN: unified generative adversarial networks for multi-domain image-to-image translation. In: Conference on Computer Vision and Pattern Recognition (2018)

  8. Ghosh, P., Sajjadi, M.S.M., Vergari, A., Black, M., Schölkopf, B.: From variational to deterministic autoencoders. In: International Conference on Learning Representations (2020)

  9. Gresele, L., Rubenstein, P.K., Mehrjou, A., Locatello, F., Schölkopf, B.: The incomplete Rosetta Stone problem: identifiability results for multi-view nonlinear ICA. In: Conference on Uncertainty in Artificial Intelligence (2019)

  10. Gutmann, M., Hyvärinen, A.: Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In: International Conference on Artificial Intelligence and Statistics (2010)

  11. He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.B.: Momentum contrast for unsupervised visual representation learning. In: Conference on Computer Vision and Pattern Recognition (2020)

  12. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: Advances in Neural Information Processing Systems (2017)

  13. Higgins, I., et al.: beta-VAE: learning basic visual concepts with a constrained variational framework. In: International Conference on Learning Representations (2017)

  14. Hinton, G.E.: Training products of experts by minimizing contrastive divergence. Neural Comput. 14(8), 1771–1800 (2002)

  15. Hjelm, R.D., et al.: Learning deep representations by mutual information estimation and maximization. In: International Conference on Learning Representations (2019)

  16. Hsu, W.N., Glass, J.: Disentangling by partitioning: a representation learning framework for multimodal sensory data. arXiv preprint arXiv:1805.11264 (2018)

  17. Hsu, W.N., Zhang, Y., Glass, J.: Unsupervised learning of disentangled and interpretable representations from sequential data. In: Advances in Neural Information Processing Systems (2017)

  18. Ilse, M., Tomczak, J.M., Louizos, C., Welling, M.: DIVA: domain invariant variational autoencoders. arXiv preprint arXiv:1905.10427 (2019)

  19. Khemakhem, I., Kingma, D.P., Monti, R.P., Hyvärinen, A.: Variational autoencoders and nonlinear ICA: a unifying framework. In: International Conference on Artificial Intelligence and Statistics (2020)

  20. Kim, H., Mnih, A.: Disentangling by factorising. In: International Conference on Machine Learning (2018)

  21. Kurle, R., Guennemann, S., van der Smagt, P.: Multi-source neural variational inference. In: AAAI Conference on Artificial Intelligence (2019)

  22. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)

  23. Li, Y., Mandt, S.: Disentangled sequential autoencoder. In: International Conference on Machine Learning (2018)

  24. Liu, A.H., Liu, Y.C., Yeh, Y.Y., Wang, Y.C.F.: A unified feature disentangler for multi-domain image translation and manipulation. In: Advances in Neural Information Processing Systems (2018)

  25. Locatello, F., et al.: Challenging common assumptions in the unsupervised learning of disentangled representations. In: International Conference on Machine Learning (2019)

  26. Locatello, F., Abbati, G., Rainforth, T., Bauer, S., Schölkopf, B., Bachem, O.: On the fairness of disentangled representations. In: Advances in Neural Information Processing Systems (2019)

  27. Locatello, F., Poole, B., Rätsch, G., Schölkopf, B., Bachem, O., Tschannen, M.: Weakly-supervised disentanglement without compromises. In: International Conference on Machine Learning (2020)

  28. Nguyen, X., Wainwright, M.J., Jordan, M.I.: Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Trans. Inf. Theory 56(11), 5847–5861 (2010)

  29. Oord, A.v.d., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)

  30. Sermanet, P., Chintala, S., LeCun, Y.: Convolutional neural networks applied to house numbers digit classification. In: International Conference on Pattern Recognition, pp. 3288–3291. IEEE (2012)

  31. Shi, Y., Siddharth, N., Paige, B., Torr, P.: Variational mixture-of-experts autoencoders for multi-modal deep generative models. In: Advances in Neural Information Processing Systems (2019)

  32. Smith, N.A., Eisner, J.: Contrastive estimation: training log-linear models on unlabeled data. In: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pp. 354–362 (2005)

  33. Stein, B.E., Stanford, T.R., Rowland, B.A.: The neural basis of multisensory integration in the midbrain: its organization and maturation. Hear. Res. 258(1–2), 4–15 (2009)

  34. Sugiyama, M., Suzuki, T., Kanamori, T.: Density-ratio matching under the Bregman divergence: a unified framework of density-ratio estimation. Ann. Inst. Stat. Math. 64(5), 1009–1044 (2012)

  35. Suzuki, M., Nakayama, K., Matsuo, Y.: Joint multimodal learning with deep generative models. arXiv preprint arXiv:1611.01891 (2016)

  36. Tian, Y., Engel, J.: Latent translation: crossing modalities by bridging generative models. arXiv preprint arXiv:1902.08261 (2019)

  37. Träuble, F., et al.: Is independence all you need? On the generalization of representations learned from correlated data. arXiv preprint arXiv:2006.07886 (2020)

  38. Tsai, Y.H.H., Liang, P.P., Zadeh, A., Morency, L.P., Salakhutdinov, R.: Learning factorized multimodal representations. In: International Conference on Learning Representations (2019)

  39. Wieser, M., Parbhoo, S., Wieczorek, A., Roth, V.: Inverse learning of symmetry transformations. In: Advances in Neural Information Processing Systems (2020)

  40. Wu, M., Goodman, N.: Multimodal generative models for scalable weakly-supervised learning. In: Advances in Neural Information Processing Systems (2018)

  41. Yildirim, I.: From perception to conception: learning multisensory representations. Ph.D. thesis, University of Rochester (2014)

  42. Yildirim, I., Jacobs, R.A.: Transfer of object category knowledge across visual and haptic modalities: experimental and computational studies. Cognition 126(2), 135–148 (2013)


Acknowledgements

Thanks to Mario Wieser for discussions on learning invariant subspaces, to Yuge Shi for providing code, and to Francesco Locatello for sharing his views on disentanglement in a multimodal setting. ID is supported by the SNSF grant #200021_188466.

Author information

Corresponding author

Correspondence to Imant Daunhawer.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 3914 KB)


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Daunhawer, I., Sutter, T.M., Marcinkevičs, R., Vogt, J.E. (2021). Self-supervised Disentanglement of Modality-Specific and Shared Factors Improves Multimodal Generative Models. In: Akata, Z., Geiger, A., Sattler, T. (eds) Pattern Recognition. DAGM GCPR 2020. Lecture Notes in Computer Science, vol. 12544. Springer, Cham. https://doi.org/10.1007/978-3-030-71278-5_33

  • DOI: https://doi.org/10.1007/978-3-030-71278-5_33

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-71277-8

  • Online ISBN: 978-3-030-71278-5

  • eBook Packages: Computer Science (R0)
