DOI: 10.1145/3595916.3626384
research-article

Learning Snippet-to-Motion Progression for Skeleton-based Human Motion Prediction

Published: 01 January 2024

Abstract

Existing Graph Convolutional Networks for human motion prediction largely adopt a one-step scheme that outputs the prediction directly from the history input and thus fails to exploit human motion patterns. We observe that human motions have transitional patterns and can be split into snippets, each representative of one transition. Each snippet can be reconstructed from its starting and ending poses, referred to as the transitional poses. We propose a snippet-to-motion multi-stage framework that breaks motion prediction into sub-tasks that are easier to accomplish. Each sub-task integrates three modules: transitional pose prediction, snippet reconstruction, and snippet-to-motion prediction. Specifically, we first predict only the transitional poses. We then use them to reconstruct the corresponding snippets, obtaining a close approximation to the true motion sequence. Finally, we refine the reconstructed snippets to produce the final prediction. To implement the network, we propose a novel unified graph modeling, which allows more direct and effective feature propagation than existing approaches that rely on separate space-time modeling. Extensive experiments on the Human3.6M, CMU Mocap, and 3DPW datasets verify the effectiveness of our method, which achieves state-of-the-art performance.
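The snippet reconstruction step can be illustrated with a small sketch. The abstract does not say how a snippet is rebuilt from its two transitional poses, so the linear interpolation below is purely an assumption made for illustration, and the names `reconstruct_snippet` and `reconstruct_motion` are hypothetical. The learned stages (transitional pose prediction and the final refinement) are not shown.

```python
from typing import List, Sequence

Pose = List[float]  # flattened joint coordinates for one frame


def reconstruct_snippet(start: Sequence[float], end: Sequence[float],
                        length: int) -> List[Pose]:
    """Build `length` frames from `start` to `end`, both endpoints included.

    Assumption: plain linear interpolation stands in for whatever
    reconstruction the paper actually uses.
    """
    if length < 2:
        raise ValueError("a snippet needs at least its two transitional poses")
    if len(start) != len(end):
        raise ValueError("start and end poses must have the same dimension")
    frames: List[Pose] = []
    for i in range(length):
        t = i / (length - 1)  # 0.0 at the start pose, 1.0 at the end pose
        frames.append([(1 - t) * s + t * e for s, e in zip(start, end)])
    return frames


def reconstruct_motion(transitional_poses: Sequence[Sequence[float]],
                       snippet_length: int) -> List[Pose]:
    """Chain snippets between consecutive transitional poses."""
    motion: List[Pose] = []
    for start, end in zip(transitional_poses, transitional_poses[1:]):
        snippet = reconstruct_snippet(start, end, snippet_length)
        # Consecutive snippets share a boundary pose; keep it only once.
        motion.extend(snippet if not motion else snippet[1:])
    return motion


# Toy input: a single 2D joint, three transitional poses, 3 frames per snippet.
poses = [[0.0, 0.0], [2.0, 2.0], [2.0, 0.0]]
coarse = reconstruct_motion(poses, snippet_length=3)
print(coarse)
```

In this toy setting the coarse sequence has five frames, and the intermediate frames are the midpoints between neighbouring transitional poses. In the framework described above, such a coarse approximation would then be passed to a refinement stage to produce the final prediction.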


Cited By

  • (2025) A Multiscale Mixed-Graph Neural Network Based on Kinematic and Dynamic Joint Features for Human Motion Prediction. Applied Sciences 15:4 (1897). DOI: 10.3390/app15041897. Online publication date: 12-Feb-2025
  • (2024) Multi-Instance Multi-Label Learning for Text-motion Retrieval. Proceedings of the 32nd ACM International Conference on Multimedia (5829-5837). DOI: 10.1145/3664647.3681444. Online publication date: 28-Oct-2024

    Published In

    MMAsia '23: Proceedings of the 5th ACM International Conference on Multimedia in Asia
    December 2023
    745 pages
    ISBN:9798400702051
    DOI:10.1145/3595916

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. graph convolution
    2. human motion prediction
    3. neural networks

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    MMAsia '23: ACM Multimedia Asia
    December 6 - 8, 2023
    Tainan, Taiwan

    Acceptance Rates

    Overall Acceptance Rate 59 of 204 submissions, 29%
