Abstract
Video prediction is challenging because real-world motion dynamics are typically multi-modally distributed. Existing stochastic methods commonly model uncertainty by injecting random noise drawn from a simple prior distribution, which is insufficient to capture highly complex motion dynamics. This work proposes a progressive multi-granularity analysis framework to address this difficulty. First, to achieve coarse alignment, the input sequence is matched against prototype motion dynamics from the training set, based on a self-supervised auto-encoder that disentangles motion from appearance. Second, motion dynamics are transferred from the matched prototype sequence to the input sequence via adaptively learned kernels, and the predicted frames are further refined by a motion-aware prediction model. Extensive qualitative and quantitative experiments on three widely used video prediction datasets demonstrate that (1) the proposed framework decomposes the hard prediction task into a series of more tractable sub-tasks for which better solutions are easier to find, and (2) the proposed method performs favorably against state-of-the-art prediction methods.
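The abstract outlines a coarse-to-fine pipeline: retrieve a prototype motion from the training set, transfer its dynamics onto the input via adaptively learned kernels, then refine with a motion-aware predictor. A minimal PyTorch sketch of that idea follows; the module names (`Encoder`, `retrieve_prototype`, `KernelTransfer`), feature dimensions, cosine-similarity retrieval, and dynamic per-channel filtering are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the coarse-to-fine idea; names and shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Encoder(nn.Module):
    """Small conv encoder; in the paper's setting separate instances would
    encode appearance (a single frame) and motion (frame differences)."""
    def __init__(self, in_ch: int, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())

    def forward(self, x):
        return self.net(x)            # (B, dim)


def retrieve_prototype(motion_feat, prototype_bank):
    """Coarse stage: match each input motion feature to the nearest prototype
    motion dynamics from the training set (cosine similarity assumed here)."""
    sims = F.cosine_similarity(motion_feat.unsqueeze(1),
                               prototype_bank.unsqueeze(0), dim=-1)   # (B, N)
    return prototype_bank[sims.argmax(dim=1)]                          # (B, dim)


class KernelTransfer(nn.Module):
    """Fine stage: predict an adaptive kernel from the matched prototype and
    apply it to the last observed frame (dynamic-filter-style transfer)."""
    def __init__(self, feat_dim: int, ksize: int = 5):
        super().__init__()
        self.ksize = ksize
        self.to_kernel = nn.Linear(feat_dim, ksize * ksize)

    def forward(self, frame, proto_feat):
        b, c, h, w = frame.shape
        k = self.to_kernel(proto_feat).softmax(dim=-1)       # (B, k*k)
        k = k.view(b, 1, self.ksize, self.ksize)
        k = k.repeat_interleave(c, dim=0)                     # (B*C, 1, k, k)
        out = F.conv2d(frame.reshape(1, b * c, h, w), k,
                       padding=self.ksize // 2, groups=b * c)
        return out.view(b, c, h, w)


# Example: transfer retrieved prototype dynamics onto the last observed frame.
motion_enc = Encoder(in_ch=3)
transfer = KernelTransfer(feat_dim=64)
frames = torch.randn(2, 3, 64, 64)        # last observed frames
bank = torch.randn(100, 64)               # precomputed prototype motion features
proto = retrieve_prototype(motion_enc(frames), bank)
coarse_pred = transfer(frames, proto)     # would then go to a motion-aware refiner
```

In the full framework described by the abstract, the output of this kernel-based transfer would be passed to a motion-aware prediction model (e.g., a recurrent or U-Net-style refiner) to produce the final frames.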
Acknowledgements
This work was supported by the National Natural Science Foundation of China (61976137, U1611461, 61527804, U19B2035) and STCSM (18DZ1112300). This work was also supported by the National Key Research and Development Program of China (2016YFB1001003). The authors would also like to acknowledge partial support from the Open Project Program (No. KBDat1604) of the Jiangsu Key Laboratory of Big Data Analysis Technology, Nanjing University of Information Science & Technology.
Additional information
Communicated by Ivan Laptev.
Cite this article
Xu, J., Ni, B. & Yang, X. Progressive Multi-granularity Analysis for Video Prediction. Int J Comput Vis 129, 601–618 (2021). https://doi.org/10.1007/s11263-020-01389-w