Abstract
Video prediction aims to generate future frames from a few given past frames. It has many applications, including abnormal action recognition, future traffic prediction, long-term planning, and autonomous driving. Recently, various deep learning-based methods have been proposed to address this task. However, these methods tend to focus only on improving prediction quality and ignore their computational cost. Several methods even require two separate networks operating on two different input modalities, such as RGB together with temporal gradients or optical flow, which makes them increasingly complex and demands a very large amount of computation and memory. In this paper, we introduce a simple yet robust approach that learns appearance and motion features simultaneously in a single network, regardless of the input video modality. We also present a lightweight autoencoder network to realize this approach. Our framework is evaluated on several benchmarks, including the KTH, KITTI, and BAIR datasets. The experimental results show that our approach achieves competitive performance compared with state-of-the-art video prediction methods while requiring only 34.24 MB of memory and 2.59 GFLOPs. With a smaller model size and lower computational cost, our framework runs faster than the other methods; needing only 2.934 s to predict the next frame, it is a promising approach for real-time deployment on embedded or mobile devices without a GPU.
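The abstract describes a single lightweight autoencoder that learns appearance and motion jointly from the input frames, with no separate optical-flow or temporal-gradient branch. The PyTorch sketch below is only an illustration of that single-network, stacked-frame idea, not the authors' architecture: the class name, layer widths, 10-frame input, and 64x64 KTH-style grayscale resolution are all assumptions.

```python
# Minimal sketch (not the authors' network) of a lightweight encoder-decoder
# that maps several stacked past frames to the next frame in one network.
import torch
import torch.nn as nn

class TinyVideoPredictor(nn.Module):
    def __init__(self, in_frames=10, channels=1, hidden=64):
        super().__init__()
        # Appearance and motion are learned jointly from the stacked frames;
        # no second network for optical flow or temporal gradients.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_frames * channels, hidden, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(hidden, hidden, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(hidden, channels, 4, stride=2, padding=1),
        )

    def forward(self, frames):                 # frames: (B, T, C, H, W)
        b, t, c, h, w = frames.shape
        x = frames.reshape(b, t * c, h, w)     # stack time into channels
        return self.decoder(self.encoder(x))   # next frame: (B, C, H, W)

# Quick check on KTH-sized grayscale clips (10 past frames, 64x64).
model = TinyVideoPredictor()
next_frame = model(torch.randn(2, 10, 1, 64, 64))
print(next_frame.shape)  # torch.Size([2, 1, 64, 64])
params_mb = sum(p.numel() for p in model.parameters()) * 4 / 1e6
print(f"{params_mb:.2f} MB of float32 parameters")
```

The memory footprint and GFLOPs figures quoted in the abstract refer to the authors' actual model; the sketch above only shows how a single compact network can consume a stack of past frames and emit the next frame directly.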






Data availability
All datasets used in this work are publicly available on the Internet.
Funding
This work is supported in part by the Thai Nguyen University of Education under Grant TNUE-2022-03.
Author information
Authors and Affiliations
Contributions
(1) Quang proposed the idea. (2) All authors implemented the code, ran the experiments, and compared the results of the proposed approach with state-of-the-art methods. (3) All authors wrote the draft of the manuscript and prepared the tables. (4) Quang prepared all figures. (5) All authors revised and proofread the manuscript. (6) All authors reviewed the manuscript. (7) Trang submitted the manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Vu, DQ., Thu, T.P.T. Simultaneous context and motion learning in video prediction. SIViP 17, 3933–3942 (2023). https://doi.org/10.1007/s11760-023-02623-x