Abstract
This paper explores self-supervised disentangled representation learning for sequential data, focusing on separating time-independent and time-varying factors in videos. We propose a new model that breaks the usual independence assumption between these factors by explicitly accounting for the causal relationship between the static and dynamic variables, and that improves model expressivity through additional Normalizing Flows. We propose a formal definition of the factors; this formalism leads to sufficient conditions under which the ground-truth factors are identifiable, and to a novel, theoretically grounded disentanglement constraint that can be directly and efficiently incorporated into our framework. Experiments show that the proposed approach outperforms previous complex state-of-the-art techniques in scenarios where the dynamics of a scene are influenced by its content.
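The Normalizing Flows mentioned in the abstract build on the standard change-of-variables formula for densities. As a minimal illustrative sketch (a hypothetical 1-D affine flow, not the paper's architecture):

```python
import numpy as np

# Change-of-variables formula behind normalizing flows, in 1-D:
# if z = f(x) with invertible f and base density p_z, then
#   log p_x(x) = log p_z(f(x)) + log |f'(x)|.
# Here f is a hypothetical affine flow f(x) = a*x + b, so f'(x) = a.

def log_standard_normal(z):
    return -0.5 * (z ** 2 + np.log(2.0 * np.pi))

def affine_flow_log_density(x, a, b):
    return log_standard_normal(a * x + b) + np.log(np.abs(a))

# Sanity check: the transformed log-density exponentiates to a proper
# density, i.e. it integrates to ~1 over a sufficiently wide grid.
xs = np.linspace(-20.0, 20.0, 200_001)
dens = np.exp(affine_flow_log_density(xs, a=2.0, b=0.5))
print(dens.sum() * (xs[1] - xs[0]))  # ≈ 1.0
```

Stacking several such invertible maps (with non-affine layers) is what lets a flow turn a simple base distribution into a more expressive one, which is the role the paper assigns to its additional flow layers.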
Acknowledgements
Mathieu Cyrille Simon is a Research Fellow of the Fonds de la Recherche Scientifique - FNRS of Belgium. Computational resources have been provided by the supercomputing facilities of the Université catholique de Louvain (CISM/UCL) and the Consortium des Équipements de Calcul Intensif en Fédération Wallonie-Bruxelles (CÉCI), funded by the Fonds de la Recherche Scientifique de Belgique (F.R.S.-FNRS) under convention 2.5020.11, and by the Walloon Region.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Simon, M.C., Frossard, P., Vleeschouwer, C.D. (2025). Sequential Representation Learning via Static-Dynamic Conditional Disentanglement. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15133. Springer, Cham. https://doi.org/10.1007/978-3-031-73226-3_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73225-6
Online ISBN: 978-3-031-73226-3
eBook Packages: Computer Science (R0)