Abstract
Human motion prediction aims to automatically forecast a future motion sequence from an observed human motion sequence. In this paper, we propose a novel skip-attention encoder–decoder (SAED) framework that models human motion dependencies in spatiotemporal space, using the encoder to encode the observed motions and the decoder to decode the predicted motions. The framework rests on two main ideas. First, we design a new self-renewing ConvGRU as the basic unit of both the encoder and the decoder to effectively capture temporal and spatial skeleton-motion dependencies. Second, we present a new skip-attention mechanism (SAM) that aggregates the motion information of all layers according to their importance. Quantitative and qualitative results on the Human3.6M and CMU motion capture datasets demonstrate the effectiveness of the proposed SAED compared with related methods.
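The skip-attention idea described above, aggregating per-layer features according to learned importance weights, can be sketched as follows. This is a minimal NumPy illustration under our own assumptions, not the authors' implementation; the layer count, feature dimension, and dot-product scoring function are all hypothetical:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def skip_attention(layer_feats, query):
    """Aggregate features from all layers, weighted by importance.

    layer_feats: (L, D) array, one feature vector per encoder layer.
    query: (D,) vector used to score each layer's relevance.
    Returns the attention-weighted sum of layer features, shape (D,).
    """
    scores = layer_feats @ query   # (L,) raw per-layer importance scores
    weights = softmax(scores)      # normalize scores to a distribution
    return weights @ layer_feats   # (D,) weighted aggregation

rng = np.random.default_rng(0)
feats = rng.standard_normal((3, 8))  # 3 layers, 8-dim features (illustrative sizes)
q = rng.standard_normal(8)
out = skip_attention(feats, q)
```

In the paper itself the attended features come from ConvGRU layers and the weighting is learned end-to-end; the sketch only shows the aggregation step, i.e. a softmax-weighted sum over layer outputs rather than using the last layer alone.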
The work is supported by the National Key R&D Program of China (No. 2018AAA0102001) and the National Natural Science Foundation of China (Grant Nos. 62072245, and 61932020).
Cite this article
Zhang, R., Shu, X., Yan, R. et al. Skip-attention encoder–decoder framework for human motion prediction. Multimedia Systems 28, 413–422 (2022). https://doi.org/10.1007/s00530-021-00807-4