
Progressive Multi-granularity Analysis for Video Prediction

Published in: International Journal of Computer Vision

Abstract

Video prediction is challenging because real-world motion dynamics are usually multi-modally distributed. Existing stochastic methods commonly model the random noise input with a simple prior distribution, which is insufficient to capture highly complex motion dynamics. This work proposes a progressive multi-granularity analysis framework to tackle this difficulty. First, to achieve coarse alignment, the input sequence is matched to prototype motion dynamics in the training set, based on self-supervised auto-encoder learning with motion/appearance disentanglement. Second, motion dynamics are transferred from the matched prototype sequence to the input sequence via adaptively learned kernels, and the predicted frames are further refined by a motion-aware prediction model. Extensive qualitative and quantitative experiments on three widely used video prediction datasets demonstrate that (1) the proposed framework decomposes the hard task into a series of more approachable sub-tasks for which better solutions are easier to find, and (2) the proposed method performs favorably against state-of-the-art prediction methods.
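The coarse-alignment step described above (matching an input sequence to prototype motion dynamics after separating motion from appearance) can be sketched roughly as follows. This is a minimal illustration under simplifying assumptions, not the paper's implementation: `motion_feature` uses a mean absolute frame-difference descriptor as a stand-in for the learned, disentangled motion encoding, `match_prototype` is a hypothetical helper name, and cosine similarity stands in for the learned matching criterion.

```python
import numpy as np

def motion_feature(seq):
    """Crude appearance-invariant motion descriptor:
    mean absolute frame difference over time, flattened."""
    diffs = np.abs(np.diff(seq, axis=0))   # (T-1, H, W)
    return diffs.mean(axis=0).ravel()      # (H*W,)

def match_prototype(input_seq, prototypes):
    """Return the index of the prototype sequence whose motion
    descriptor is closest to the input's (cosine similarity)."""
    q = motion_feature(input_seq)
    best_idx, best_sim = -1, -np.inf
    for i, proto in enumerate(prototypes):
        f = motion_feature(proto)
        sim = q @ f / (np.linalg.norm(q) * np.linalg.norm(f) + 1e-8)
        if sim > best_sim:
            best_idx, best_sim = i, sim
    return best_idx
```

In the actual framework the descriptor would come from the motion branch of the self-supervised auto-encoder, so two sequences with different appearance but similar dynamics map to nearby features; the frame-difference heuristic above merely mimics that invariance for illustration.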



Notes

  1. https://drive.google.com/open?id=1G5gEnpSTt-xt887bJnEcWTASixsvQngK



Acknowledgements

This work was supported by the National Science Foundation of China (61976137, U1611461, 61527804, U19B2035) and STCSM (18DZ1112300). This work was also supported by the National Key Research and Development Program of China (2016YFB1001003). The authors would like to acknowledge the (partial) support from the Open Project Program (No. KBDat1604) of the Jiangsu Key Laboratory of Big Data Analysis Technology, Nanjing University of Information Science & Technology.

Author information

Corresponding author: Bingbing Ni.

Additional information

Communicated by Ivan Laptev.


About this article

Cite this article

Xu, J., Ni, B. & Yang, X. Progressive Multi-granularity Analysis for Video Prediction. Int J Comput Vis 129, 601–618 (2021). https://doi.org/10.1007/s11263-020-01389-w
