Abstract
Physical scene understanding is a fundamental human ability. Empowering artificial systems with such understanding is an important step towards flexible and adaptive behavior in the real world. As a step in this direction, we propose a novel approach to physical scene understanding in video. We train a deep neural network for video prediction that embeds the video sequence in a low-dimensional recurrent latent representation. Within a variational recurrent auto-encoder framework, we optimize the total correlation of the latent dimensions, which encourages the representation to disentangle the latent physical factors of variation in the training data. To train and evaluate our approach, we use synthetic video sequences from three physical scenarios of varying difficulty. Our experiments demonstrate that our model disentangles several appearance-related properties without supervision. Adding supervision signals for the latent code further improves the disentanglement of dynamics-related properties.
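To make the training objective concrete, the following is a minimal, hypothetical sketch of a recurrent VAE whose loss augments the standard ELBO with a total-correlation penalty on the latent dimensions, in the spirit of beta-TCVAE. It is not the authors' implementation: the GRU encoder, MLP decoder, flattened frames, layer sizes, the minibatch estimate of the total correlation, and the weight beta are all illustrative assumptions.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class RecurrentVAE(nn.Module):
    """Encodes a frame sequence into per-step latent codes with a GRU and
    decodes each code back to a frame (illustrative architecture only)."""

    def __init__(self, frame_dim=64 * 64, latent_dim=8, hidden_dim=256):
        super().__init__()
        self.encoder = nn.GRU(frame_dim, hidden_dim, batch_first=True)
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, frame_dim), nn.Sigmoid())

    def forward(self, frames):                      # frames: (B, T, frame_dim)
        h, _ = self.encoder(frames)                  # (B, T, hidden_dim)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return self.decoder(z), mu, logvar, z


def log_gaussian(z, mu, logvar):
    """Elementwise log N(z | mu, exp(logvar))."""
    return -0.5 * (math.log(2 * math.pi) + logvar + (z - mu) ** 2 / logvar.exp())


def total_correlation(z, mu, logvar):
    """Minibatch estimate of TC = KL(q(z) || prod_j q(z_j)) over samples
    z ~ q(z|x), shape (N, latent_dim); ignores dataset-size correction."""
    n = z.size(0)
    # log q(z_i,d | x_j) for all sample pairs (i, j): (N, N, latent_dim)
    log_q_pairs = log_gaussian(z.unsqueeze(1), mu.unsqueeze(0), logvar.unsqueeze(0))
    log_qz = torch.logsumexp(log_q_pairs.sum(-1), dim=1) - math.log(n)
    log_qz_marginals = (torch.logsumexp(log_q_pairs, dim=1) - math.log(n)).sum(-1)
    return (log_qz - log_qz_marginals).mean()


def loss_fn(frames, model, beta=6.0):
    """ELBO (reconstruction + KL) plus a (beta - 1)-weighted TC penalty."""
    recon, mu, logvar, z = model(frames)
    rec = F.mse_loss(recon, frames, reduction='sum') / frames.size(0)
    kl = -0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum() / frames.size(0)
    flat = lambda t: t.reshape(-1, t.size(-1))       # treat each time step as a sample
    tc = total_correlation(flat(z), flat(mu), flat(logvar))
    return rec + kl + (beta - 1.0) * tc


# Toy usage: 4 sequences of 10 flattened 64x64 frames.
model = RecurrentVAE()
frames = torch.rand(4, 10, 64 * 64)
loss = loss_fn(frames, model)
loss.backward()
```

Weighting only the total-correlation term (by beta - 1) rather than the full KL is what pushes the aggregate posterior towards a factorized form; this is the mechanism that encourages each latent dimension to capture a separate physical factor.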
Notes
1. Dataset available from: https://github.com/TsuTikgiau/DisentPhys4VidPredict.
Acknowledgements
This work has been supported through Cyber Valley.
Electronic supplementary material
Supplementary material is available in the online version of this chapter.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Zhu, D., Munderloh, M., Rosenhahn, B., Stückler, J. (2019). Learning to Disentangle Latent Physical Factors for Video Prediction. In: Fink, G., Frintrop, S., Jiang, X. (eds) Pattern Recognition. DAGM GCPR 2019. Lecture Notes in Computer Science, vol. 11824. Springer, Cham. https://doi.org/10.1007/978-3-030-33676-9_42
DOI: https://doi.org/10.1007/978-3-030-33676-9_42
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-33675-2
Online ISBN: 978-3-030-33676-9