research-article

Learning Snippet-to-Motion Progression for Skeleton-based Human Motion Prediction

Authors:

Mengyuan Liu

MMAsia '23: Proceedings of the 5th ACM International Conference on Multimedia in Asia

Article No.: 15, Pages 1 - 8

https://doi.org/10.1145/3595916.3626384

Published: 01 January 2024

Abstract

Existing Graph Convolutional Network (GCN) approaches to human motion prediction largely adopt a one-step scheme that outputs the prediction directly from the history input and thus fails to exploit human motion patterns. We observe that human motions have transitional patterns and can be split into snippets, each representative of one transition. Each snippet can be reconstructed from its starting and ending poses, referred to as the transitional poses. We propose a snippet-to-motion multi-stage framework that breaks motion prediction into sub-tasks that are easier to accomplish. Each sub-task integrates three modules: transitional pose prediction, snippet reconstruction, and snippet-to-motion prediction. Specifically, we first predict only the transitional poses. We then use them to reconstruct the corresponding snippets, obtaining a close approximation to the true motion sequence. Finally, we refine the reconstructed snippets to produce the final prediction output. To implement the network, we propose a novel unified graph modeling, which allows more direct and effective feature propagation than existing approaches that rely on separate space-time modeling. Extensive experiments on the Human3.6M, CMU Mocap, and 3DPW datasets verify the effectiveness of our method, which achieves state-of-the-art performance.
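The three-step idea in the abstract (predict transitional poses, reconstruct snippets from them, then refine) can be illustrated with a minimal sketch. This is not the authors' network. It assumes linear interpolation as a stand-in for the learned snippet reconstruction module and an identity function as a stand-in for the refinement stage. All names (`Pose`, `reconstruct_snippet`, `reconstruct_motion`, `refine`) are hypothetical.

```python
from typing import List

# Hypothetical layout: one pose is a flat list of joint coordinates.
Pose = List[float]


def reconstruct_snippet(start: Pose, end: Pose, length: int) -> List[Pose]:
    """Rebuild a snippet of `length` frames from its two transitional poses.

    Assumption: linear interpolation replaces the learned reconstruction module.
    """
    if length < 2:
        raise ValueError("a snippet needs at least a start and an end frame")
    frames = []
    for t in range(length):
        w = t / (length - 1)  # 0.0 at the start pose, 1.0 at the end pose
        frames.append([(1 - w) * s + w * e for s, e in zip(start, end)])
    return frames


def reconstruct_motion(transitional: List[Pose], snippet_len: int) -> List[Pose]:
    """Chain snippets between consecutive transitional poses into one sequence."""
    motion = [list(transitional[0])]
    for start, end in zip(transitional, transitional[1:]):
        # Drop each snippet's first frame: it repeats the previous end pose.
        motion.extend(reconstruct_snippet(start, end, snippet_len)[1:])
    return motion


def refine(motion: List[Pose]) -> List[Pose]:
    """Placeholder for the snippet-to-motion refinement stage (identity copy)."""
    return [list(p) for p in motion]


# Three made-up transitional poses with two coordinates each.
transitional_poses = [[0.0, 0.0], [2.0, 4.0], [2.0, 0.0]]
approximation = reconstruct_motion(transitional_poses, snippet_len=3)
prediction = refine(approximation)
```

In this toy setting the reconstructed sequence already passes through every transitional pose, which is the sense in which the abstract calls the reconstruction a close approximation that a later stage only needs to refine.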

Cited By

Zhao R, Wei B, Han L, Cai Y, Ma Y, Li C. (2025). A Multiscale Mixed-Graph Neural Network Based on Kinematic and Dynamic Joint Features for Human Motion Prediction. Applied Sciences 15:4 (1897). Online publication date: 12-Feb-2025.
https://doi.org/10.3390/app15041897

Yang Y, Cao L, Shi H, Zhang H, Cai J, Kankanhalli M, Prabhakaran B, Boll S, Subramanian R, Zheng L, Singh V, Cesar P, Xie L, Xu D. (2024). Multi-Instance Multi-Label Learning for Text-motion Retrieval. Proceedings of the 32nd ACM International Conference on Multimedia (5829-5837). Online publication date: 28-Oct-2024.
https://dl.acm.org/doi/10.1145/3664647.3681444

Index Terms

Learning Snippet-to-Motion Progression for Skeleton-based Human Motion Prediction
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision tasks
        Activity recognition and understanding

Information & Contributors

Information

Published In

MMAsia '23: Proceedings of the 5th ACM International Conference on Multimedia in Asia

December 2023

745 pages

ISBN:9798400702051

DOI:10.1145/3595916

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

MMAsia '23

Sponsor:

SIGMM

MMAsia '23: ACM Multimedia Asia

December 6 - 8, 2023

Tainan, Taiwan

Acceptance Rates

Overall Acceptance Rate 59 of 204 submissions, 29%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 2
Total Downloads: 63
Downloads (Last 12 months): 47
Downloads (Last 6 weeks): 3

Reflects downloads up to 15 Feb 2025
