
PhyLoNet: Physically-Constrained Long-Term Video Prediction

  • Conference paper
  • Computer Vision – ACCV 2022 (ACCV 2022)
  • Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13847)

Abstract

Motions in videos are often governed by physical and biological laws such as gravity, collisions, and flocking. Accounting for such natural properties is an appealing way to improve realism in future frame video prediction. Nevertheless, defining and computing intricate physical and biological properties in motion videos is challenging. In this work, we introduce PhyLoNet, a PhyDNet extension that learns long-term future frame prediction and manipulation. Like PhyDNet, our network consists of a two-branch deep architecture that explicitly disentangles physical dynamics from complementary information. It uses a recurrent physical cell (PhyCell) to perform physically-constrained prediction in latent space. In contrast to PhyDNet, PhyLoNet introduces a modified encoder-decoder architecture together with a novel relative flow loss. Together, these enable longer-term future frame prediction from a short input sequence, with higher accuracy and quality. Extensive experiments show that PhyLoNet outperforms PhyDNet on challenging natural-motion datasets such as ball collisions, flocking, and pool games. Ablation studies highlight the importance of our new components. Finally, we demonstrate an application of PhyLoNet to video manipulation and editing via a novel class-label modification architecture.
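To make the two-branch idea concrete, the sketch below shows in PyTorch how a physically-constrained latent cell and a generic recurrent residual branch can share an encoder-decoder and be rolled out autoregressively for long-term prediction. This is a minimal illustration under our own assumptions: the module names (`SimplePhyCell`, `ConvGRUCell`, `TwoBranchPredictor`) and all layer sizes are hypothetical, and the paper's actual PhyCell dynamics, encoder-decoder modifications, and relative flow loss are not reproduced here.

```python
# Minimal sketch of a two-branch predictor in the spirit of PhyDNet/PhyLoNet.
# Every name and size below is an illustrative assumption, not the authors'
# implementation.
import torch
import torch.nn as nn


class SimplePhyCell(nn.Module):
    """Toy physically-constrained cell: a learned dynamics step on the
    latent state, followed by a gated correction from the observation."""

    def __init__(self, channels):
        super().__init__()
        self.dynamics = nn.Conv2d(channels, channels, 3, padding=1)
        self.gate = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, x, h):
        h_pred = h + self.dynamics(h)  # prediction: h~ = h + F(h)
        k = torch.sigmoid(self.gate(torch.cat([x, h_pred], dim=1)))
        return h_pred + k * (x - h_pred)  # Kalman-like correction step


class ConvGRUCell(nn.Module):
    """Residual branch: a plain convolutional GRU that models whatever
    the physical branch does not explain (appearance, texture, ...)."""

    def __init__(self, channels):
        super().__init__()
        self.zr = nn.Conv2d(2 * channels, 2 * channels, 3, padding=1)
        self.hh = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, x, h):
        z, r = torch.sigmoid(self.zr(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_cand = torch.tanh(self.hh(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_cand


class TwoBranchPredictor(nn.Module):
    """Encode frames, update both latent states, sum them, decode."""

    def __init__(self, channels=64):
        super().__init__()
        self.channels = channels
        self.encoder = nn.Sequential(
            nn.Conv2d(3, channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1))
        self.decoder = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(channels, 3, 4, stride=2, padding=1))
        self.phy = SimplePhyCell(channels)
        self.res = ConvGRUCell(channels)

    def step(self, frame, h_phy, h_res):
        x = self.encoder(frame)
        return self.phy(x, h_phy), self.res(x, h_res)

    def forward(self, frames, n_future):
        b, t, _, hgt, wid = frames.shape
        h_phy = torch.zeros(b, self.channels, hgt // 2, wid // 2,
                            device=frames.device)
        h_res = torch.zeros_like(h_phy)
        for i in range(t):  # warm up on the short input clip
            h_phy, h_res = self.step(frames[:, i], h_phy, h_res)
        preds = []
        for _ in range(n_future):  # autoregressive long-term rollout
            frame = self.decoder(h_phy + h_res)
            preds.append(frame)
            h_phy, h_res = self.step(frame, h_phy, h_res)
        return torch.stack(preds, dim=1)


model = TwoBranchPredictor()
clip = torch.randn(2, 4, 3, 64, 64)   # 4 input frames
future = model(clip, n_future=12)     # predict 12 frames ahead
print(future.shape)                   # torch.Size([2, 12, 3, 64, 64])
```

The design point the sketch captures is the disentanglement: the physical branch carries a prediction-correction update in latent space, while the residual branch absorbs everything the dynamics model cannot express; summing the two latents before decoding lets each branch specialize. During the rollout, predictions are fed back through the encoder, which is what allows a long horizon from a short input clip (the relative flow loss, not sketched here, is the other ingredient the abstract credits for the longer horizon).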

This research was partially supported by the Lynn and William Frankel Center for Computer Science at BGU.



Author information

Corresponding author

Correspondence to Andrei Sharf.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Zikri, N.B., Sharf, A. (2023). PhyLoNet: Physically-Constrained Long-Term Video Prediction. In: Wang, L., Gall, J., Chin, T.-J., Sato, I., Chellappa, R. (eds.) Computer Vision – ACCV 2022. ACCV 2022. Lecture Notes in Computer Science, vol 13847. Springer, Cham. https://doi.org/10.1007/978-3-031-26293-7_34

  • DOI: https://doi.org/10.1007/978-3-031-26293-7_34

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-26292-0

  • Online ISBN: 978-3-031-26293-7

  • eBook Packages: Computer Science, Computer Science (R0)
