
Progressive Multi-granularity Analysis for Video Prediction

Published in: International Journal of Computer Vision

Abstract

Video prediction is challenging because real-world motion dynamics are usually multi-modally distributed. Existing stochastic methods commonly model the random noise input with a simple prior distribution, which is insufficient to capture highly complex motion dynamics. This work proposes a progressive multi-granularity analysis framework to tackle this difficulty. First, to achieve coarse alignment, the input sequence is matched to prototype motion dynamics in the training set, based on self-supervised auto-encoder learning with motion/appearance disentanglement. Second, motion dynamics are transferred from the matched prototype sequence to the input sequence via adaptively learned kernels, and the predicted frames are further refined by a motion-aware prediction model. Extensive qualitative and quantitative experiments on three widely used video prediction datasets demonstrate that (1) the proposed framework decomposes the hard task into a series of more approachable sub-tasks for which better solutions are easier to find, and (2) the proposed method performs favorably against state-of-the-art prediction methods.
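The coarse-alignment step described above (matching an input sequence to prototype motion dynamics after separating motion from appearance) can be sketched roughly as follows. This is a minimal illustration under simplifying assumptions, not the paper's implementation: `motion_feature` uses a mean absolute frame-difference descriptor as a stand-in for the learned, disentangled motion encoding, `match_prototype` is a hypothetical helper name, and cosine similarity stands in for the learned matching criterion.

```python
import numpy as np

def motion_feature(seq):
    """Crude appearance-invariant motion descriptor:
    mean absolute frame difference over time, flattened."""
    diffs = np.abs(np.diff(seq, axis=0))   # (T-1, H, W)
    return diffs.mean(axis=0).ravel()      # (H*W,)

def match_prototype(input_seq, prototypes):
    """Return the index of the prototype sequence whose motion
    descriptor is closest to the input's (cosine similarity)."""
    q = motion_feature(input_seq)
    best_idx, best_sim = -1, -np.inf
    for i, proto in enumerate(prototypes):
        f = motion_feature(proto)
        sim = q @ f / (np.linalg.norm(q) * np.linalg.norm(f) + 1e-8)
        if sim > best_sim:
            best_idx, best_sim = i, sim
    return best_idx
```

In the actual framework the descriptor would come from the motion branch of the self-supervised auto-encoder, so two sequences with different appearance but similar dynamics map to nearby features; the frame-difference heuristic above merely mimics that invariance for illustration.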



Notes

  1. https://drive.google.com/open?id=1G5gEnpSTt-xt887bJnEcWTASixsvQngK



Acknowledgements

This work was supported by the National Science Foundation of China (61976137, U1611461, 61527804, U19B2035) and STCSM (18DZ1112300). This work was also supported by the National Key Research and Development Program of China (2016YFB1001003). The authors would like to acknowledge the (partial) support from the Open Project Program (No. KBDat1604) of the Jiangsu Key Laboratory of Big Data Analysis Technology, Nanjing University of Information Science & Technology.

Author information

Corresponding author: Bingbing Ni.

Additional information

Communicated by Ivan Laptev.


About this article

Cite this article

Xu, J., Ni, B. & Yang, X. Progressive Multi-granularity Analysis for Video Prediction. Int J Comput Vis 129, 601–618 (2021). https://doi.org/10.1007/s11263-020-01389-w
