Sequential Representation Learning via Static-Dynamic Conditional Disentanglement

  • Conference paper
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

This paper explores self-supervised disentangled representation learning within sequential data, focusing on separating time-independent and time-varying factors in videos. We propose a new model that breaks the usual independence assumption between those factors by explicitly accounting for the causal relationship between the static and dynamic variables, and that improves model expressivity through additional Normalizing Flows. A formal definition of the factors is proposed. This formalism leads to the derivation of sufficient conditions for the ground-truth factors to be identifiable, and to the introduction of a novel, theoretically grounded disentanglement constraint that can be directly and efficiently incorporated into our new framework. The experiments show that the proposed approach outperforms previous complex state-of-the-art techniques in scenarios where the dynamics of a scene are influenced by its content.
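
To make the setup concrete, below is a minimal, illustrative PyTorch sketch (not the authors' implementation) of a sequential VAE in which a per-sequence static latent s conditions the prior of the per-frame dynamic latents d_t through a small normalizing flow, so that the usual static/dynamic independence assumption is dropped. The GRU encoder, the single conditional affine flow step, the latent dimensions, and the Monte-Carlo KL estimate are all assumptions chosen for brevity.

# Minimal sketch of a sequential VAE with a static latent s (one per sequence)
# and dynamic latents d_t (one per frame), where the prior over d_t is
# conditioned on s through a small normalizing flow. Illustrative only.
import math
import torch
import torch.nn as nn


class ConditionalAffineFlow(nn.Module):
    """One affine flow step d = eps * exp(a(s)) + b(s); its density follows from
    the change-of-variables formula applied to a standard Gaussian base."""

    def __init__(self, dim, cond_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(cond_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 2 * dim))

    def log_prob(self, d, s):
        log_scale, shift = self.net(s).chunk(2, dim=-1)
        eps = (d - shift) * torch.exp(-log_scale)              # invert the flow
        log_base = -0.5 * (eps ** 2 + math.log(2 * math.pi)).sum(-1)
        return log_base - log_scale.sum(-1)                    # minus log|det J|


class StaticDynamicVAE(nn.Module):
    """Sequence-level static latent s and per-frame dynamic latents d_t,
    with conditional prior p(d_t | s) given by the flow above."""

    def __init__(self, x_dim=128, s_dim=16, d_dim=8, hidden=64):
        super().__init__()
        self.enc = nn.GRU(x_dim, hidden, batch_first=True)     # shared encoder
        self.to_s = nn.Linear(hidden, 2 * s_dim)               # q(s | x_{1:T})
        self.to_d = nn.Linear(hidden, 2 * d_dim)               # q(d_t | x_{1:T})
        self.prior_d = ConditionalAffineFlow(d_dim, s_dim)     # p(d_t | s)
        self.dec = nn.Sequential(nn.Linear(s_dim + d_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, x_dim))     # frame decoder

    @staticmethod
    def gauss_log_prob(z, mu, logvar):
        return -0.5 * (logvar + (z - mu) ** 2 / logvar.exp()
                       + math.log(2 * math.pi)).sum(-1)

    def forward(self, x):                                      # x: (B, T, x_dim)
        h, _ = self.enc(x)                                     # (B, T, hidden)
        mu_s, logvar_s = self.to_s(h.mean(dim=1)).chunk(2, -1)
        s = mu_s + torch.randn_like(mu_s) * (0.5 * logvar_s).exp()
        mu_d, logvar_d = self.to_d(h).chunk(2, -1)
        d = mu_d + torch.randn_like(mu_d) * (0.5 * logvar_d).exp()

        s_rep = s.unsqueeze(1).expand(-1, x.size(1), -1)       # broadcast s over time
        x_hat = self.dec(torch.cat([s_rep, d], dim=-1))

        recon = ((x_hat - x) ** 2).sum(-1).sum(-1).mean()
        kl_s = (-0.5 * (1 + logvar_s - mu_s ** 2 - logvar_s.exp()).sum(-1)).mean()
        # Monte-Carlo KL between q(d_t | x) and the s-conditioned flow prior.
        kl_d = (self.gauss_log_prob(d, mu_d, logvar_d)
                - self.prior_d.log_prob(d, s_rep)).sum(-1).mean()
        return recon + kl_s + kl_d


# Quick shape check with random data.
model = StaticDynamicVAE()
loss = model(torch.randn(4, 10, 128))
loss.backward()

The only point this sketch is meant to convey is that p(d_t | s) is learned (here via the flow) rather than fixed to a factorized prior that is independent of the static content.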

Acknowledgements

Mathieu Cyrille Simon is a Research Fellow of the Fonds de la Recherche Scientifique - FNRS of Belgium. Computational resources have been provided by the supercomputing facilities of the Université catholique de Louvain (CISM/UCL) and the Consortium des Équipements de Calcul Intensif en Fédération Wallonie Bruxelles (CÉCI), funded by the Fonds de la Recherche Scientifique de Belgique (F.R.S.-FNRS) under convention 2.5020.11 and by the Walloon Region.

Author information

Corresponding author

Correspondence to Mathieu Cyrille Simon.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 12430 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Simon, M.C., Frossard, P., De Vleeschouwer, C. (2025). Sequential Representation Learning via Static-Dynamic Conditional Disentanglement. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15133. Springer, Cham. https://doi.org/10.1007/978-3-031-73226-3_7

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-73226-3_7

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73225-6

  • Online ISBN: 978-3-031-73226-3

  • eBook Packages: Computer Science, Computer Science (R0)
