Abstract
Multimodal generative models learn a joint distribution over multiple modalities and thus have the potential to learn richer representations than unimodal models. However, current approaches are either inefficient in dealing with more than two modalities or fail to capture both modality-specific and shared variations. We introduce a new multimodal generative model that integrates both modality-specific and shared factors and aggregates shared information across any subset of modalities efficiently. Our method partitions the latent space into disjoint subspaces for modality-specific and shared factors and learns to disentangle these in a purely self-supervised manner. Empirically, we show improvements in representation learning and generative performance compared to previous methods and showcase the disentanglement capabilities.
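The abstract does not spell out the aggregation mechanism, but a standard choice in this line of work (e.g., the product of experts used by the MVAE [56], building on [30]) is to combine the per-modality Gaussian posteriors over the shared factors by precision weighting, which handles any subset of modalities. The following is an illustrative numpy sketch under that assumption, not the authors' implementation; the names `poe` and `d_spec` are hypothetical.

```python
import numpy as np

def poe(mus, logvars):
    """Product of Gaussian experts: combine per-modality posteriors
    q(z_shared | x_m) into one joint Gaussian by precision weighting.
    Accepts any subset of modalities (any number of rows)."""
    mus = np.asarray(mus)                      # shape (M, D)
    precisions = np.exp(-np.asarray(logvars))  # 1 / sigma_m^2, shape (M, D)
    joint_var = 1.0 / precisions.sum(axis=0)
    joint_mu = joint_var * (precisions * mus).sum(axis=0)
    return joint_mu, np.log(joint_var)

# Partitioned latent space as described in the abstract: the first
# d_spec dimensions are modality-specific, the remainder are shared.
z = np.arange(6.0)
d_spec = 2
z_specific, z_shared = z[:d_spec], z[d_spec:]
```

Because precisions add, confident (low-variance) modalities dominate the joint estimate, and dropping a modality simply removes its row.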
Notes
1. This problem has also been observed in [21], where it is described as "averaging over inseparable individual beliefs".
3. The size of the latent dimensions for modality-specific and shared representations is a hyperparameter of our model. Empirically, we found the effect of changing the dimensionality to be minor, as long as neither latent space is too small.
4. We further observed that without sampling from the posterior (i.e., reparameterization), both the MVAE and MMVAE tend to generate samples with very little diversity, even if diverse input images are used.
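Note 4 above refers to the reparameterization trick of the VAE framework. As a reminder of what that sampling step computes, here is a minimal numpy sketch (plain numpy for illustration; in practice this runs inside an autodiff framework so gradients flow through `mu` and `logvar`):

```python
import numpy as np

def reparameterize(mu, logvar, rng):
    """Draw z ~ N(mu, diag(sigma^2)) as z = mu + sigma * eps with
    eps ~ N(0, I); the noise is external, so the sample stays a
    deterministic, differentiable function of mu and logvar."""
    eps = rng.standard_normal(np.shape(mu))
    return mu + np.exp(0.5 * logvar) * eps

rng = np.random.default_rng(0)
sample = reparameterize(np.zeros(3), np.zeros(3), rng)  # one stochastic draw
```

Decoding the posterior mean instead of such samples collapses this noise source, which is consistent with the low sample diversity reported in note 4.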
References
Alemi, A.A., Fischer, I., Dillon, J.V., Murphy, K.: Deep variational information bottleneck. In: International Conference on Learning Representations (2017)
Almahairi, A., Rajeswar, S., Sordoni, A., Bachman, P., Courville, A.C.: Augmented CycleGAN: learning many-to-many mappings from unpaired data. In: International Conference on Machine Learning (2018)
Baltrušaitis, T., Ahuja, C., Morency, L.P.: Multimodal machine learning: a survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 41(2), 423–443 (2019)
Bouchacourt, D., Tomioka, R., Nowozin, S.: Multi-level variational autoencoder: learning disentangled representations from grouped observations. In: AAAI Conference on Artificial Intelligence (2018)
Chartsias, A., et al.: Disentangled representation learning in cardiac image analysis. Med. Image Anal. 58, 101535 (2019)
Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., Abbeel, P.: InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. In: Advances in Neural Information Processing Systems (2016)
Choi, Y., Choi, M., Kim, M., Ha, J.W., Kim, S., Choo, J.: StarGAN: unified generative adversarial networks for multi-domain image-to-image translation. In: Conference on Computer Vision and Pattern Recognition (2018)
Ghosh, P., Sajjadi, M.S.M., Vergari, A., Black, M., Schölkopf, B.: From variational to deterministic autoencoders. In: International Conference on Learning Representations (2020)
Gresele, L., Rubenstein, P.K., Mehrjou, A., Locatello, F., Schölkopf, B.: The incomplete Rosetta Stone problem: identifiability results for multi-view nonlinear ICA. In: Conference on Uncertainty in Artificial Intelligence (2019)
Gutmann, M., Hyvärinen, A.: Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In: International Conference on Artificial Intelligence and Statistics (2010)
He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.B.: Momentum contrast for unsupervised visual representation learning. In: Conference on Computer Vision and Pattern Recognition (2020)
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: Advances in Neural Information Processing Systems (2017)
Higgins, I., et al.: beta-VAE: learning basic visual concepts with a constrained variational framework. In: International Conference on Learning Representations (2017)
Hinton, G.E.: Training products of experts by minimizing contrastive divergence. Neural Comput. 14(8), 1771–1800 (2002)
Hjelm, R.D., et al.: Learning deep representations by mutual information estimation and maximization. In: International Conference on Learning Representations (2019)
Hsu, W.N., Glass, J.: Disentangling by partitioning: a representation learning framework for multimodal sensory data. arXiv preprint arXiv:1805.11264 (2018)
Hsu, W.N., Zhang, Y., Glass, J.: Unsupervised learning of disentangled and interpretable representations from sequential data. In: Advances in Neural Information Processing Systems (2017)
Ilse, M., Tomczak, J.M., Louizos, C., Welling, M.: DIVA: domain invariant variational autoencoders. arXiv preprint arXiv:1905.10427 (2019)
Khemakhem, I., Kingma, D.P., Monti, R.P., Hyvärinen, A.: Variational autoencoders and nonlinear ICA: a unifying framework. In: International Conference on Artificial Intelligence and Statistics (2020)
Kim, H., Mnih, A.: Disentangling by factorising. In: International Conference on Machine Learning (2018)
Kurle, R., Günnemann, S., van der Smagt, P.: Multi-source neural variational inference. In: AAAI Conference on Artificial Intelligence (2019)
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
Li, Y., Mandt, S.: Disentangled sequential autoencoder. In: International Conference on Machine Learning (2018)
Liu, A.H., Liu, Y.C., Yeh, Y.Y., Wang, Y.C.F.: A unified feature disentangler for multi-domain image translation and manipulation. In: Advances in Neural Information Processing Systems (2018)
Locatello, F., et al.: Challenging common assumptions in the unsupervised learning of disentangled representations. In: International Conference on Machine Learning (2019)
Locatello, F., Abbati, G., Rainforth, T., Bauer, S., Schölkopf, B., Bachem, O.: On the fairness of disentangled representations. In: Advances in Neural Information Processing Systems (2019)
Locatello, F., Poole, B., Rätsch, G., Schölkopf, B., Bachem, O., Tschannen, M.: Weakly-supervised disentanglement without compromises. In: International Conference on Machine Learning (2020)
Nguyen, X., Wainwright, M.J., Jordan, M.I.: Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Trans. Inf. Theory 56(11), 5847–5861 (2010)
van den Oord, A., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)
Sermanet, P., Chintala, S., LeCun, Y.: Convolutional neural networks applied to house numbers digit classification. In: International Conference on Pattern Recognition, pp. 3288–3291. IEEE (2012)
Shi, Y., Siddharth, N., Paige, B., Torr, P.: Variational mixture-of-experts autoencoders for multi-modal deep generative models. In: Advances in Neural Information Processing Systems (2019)
Smith, N.A., Eisner, J.: Contrastive estimation: training log-linear models on unlabeled data. In: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pp. 354–362 (2005)
Stein, B.E., Stanford, T.R., Rowland, B.A.: The neural basis of multisensory integration in the midbrain: its organization and maturation. Hear. Res. 258(1–2), 4–15 (2009)
Sugiyama, M., Suzuki, T., Kanamori, T.: Density-ratio matching under the Bregman divergence: a unified framework of density-ratio estimation. Ann. Inst. Stat. Math. 64(5), 1009–1044 (2012)
Suzuki, M., Nakayama, K., Matsuo, Y.: Joint multimodal learning with deep generative models. arXiv preprint arXiv:1611.01891 (2016)
Tian, Y., Engel, J.: Latent translation: crossing modalities by bridging generative models. arXiv preprint arXiv:1902.08261 (2019)
Träuble, F., et al.: Is independence all you need? On the generalization of representations learned from correlated data. arXiv preprint arXiv:2006.07886 (2020)
Tsai, Y.H.H., Liang, P.P., Zadeh, A., Morency, L.P., Salakhutdinov, R.: Learning factorized multimodal representations. In: International Conference on Learning Representations (2019)
Wieser, M., Parbhoo, S., Wieczorek, A., Roth, V.: Inverse learning of symmetry transformations. In: Advances in Neural Information Processing Systems (2020)
Wu, M., Goodman, N.: Multimodal generative models for scalable weakly-supervised learning. In: Advances in Neural Information Processing Systems (2018)
Yildirim, I.: From perception to conception: learning multisensory representations. Ph.D. thesis, University of Rochester (2014)
Yildirim, I., Jacobs, R.A.: Transfer of object category knowledge across visual and haptic modalities: experimental and computational studies. Cognition 126(2), 135–148 (2013)
Acknowledgements
Thanks to Mario Wieser for discussions on learning invariant subspaces, to Yuge Shi for providing code, and to Francesco Locatello for sharing his views on disentanglement in a multimodal setting. ID is supported by the SNSF grant #200021_188466.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Daunhawer, I., Sutter, T.M., Marcinkevičs, R., Vogt, J.E. (2021). Self-supervised Disentanglement of Modality-Specific and Shared Factors Improves Multimodal Generative Models. In: Akata, Z., Geiger, A., Sattler, T. (eds) Pattern Recognition. DAGM GCPR 2020. Lecture Notes in Computer Science, vol 12544. Springer, Cham. https://doi.org/10.1007/978-3-030-71278-5_33
Print ISBN: 978-3-030-71277-8
Online ISBN: 978-3-030-71278-5