3D Human Pose Estimation via Non-causal Retentive Networks

Zheng, Kaili; Lu, Feixiang; Lv, Yihao; Zhang, Liangjun; Guo, Chenyi; Wu, Ji

doi:10.1007/978-3-031-73414-4_7

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15091)

Included in the following conference series:

European Conference on Computer Vision

Abstract

Temporal dependencies are essential in 3D human pose estimation to mitigate depth ambiguity. Previous methods typically use a fixed-length sliding window to capture these dependencies. However, they treat past and future frames equally, ignoring the fact that relying on too many future frames increases the inference latency. In this paper, we present a 3D human pose estimation model based on Retentive Networks (RetNet) that incorporates temporal information by utilizing a large number of past frames and a few future frames. The Non-Causal RetNet (NC-RetNet) is designed to allow the originally causal RetNet to be aware of future information. Additionally, we propose a knowledge transfer strategy, i.e., training the model with a larger chunk size and using a smaller chunk size during inference, to reduce latency while maintaining comparable accuracy. Extensive experiments have been conducted on the Human3.6M and MPI-INF-3DHP datasets, and the results demonstrate that our method achieves state-of-the-art performance. Code and models are available at https://github.com/Kelly510/PoseRetNet.
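
As a rough illustration of the idea in the abstract (many past frames, only a few future frames), the sketch below builds a decay mask in the style of RetNet retention and extends it with a small look-ahead. The look-ahead rule, the function names `decay_mask` and `lookahead_latency_ms`, and the parameters `gamma` and `future_frames` are assumptions made for this example; they are not taken from the paper or from the PoseRetNet repository, and the actual NC-RetNet formulation may differ.

```python
# Illustrative sketch only. The causal part follows the usual RetNet decay
# gamma ** (n - m); the look-ahead rule is an ASSUMPTION for illustration,
# not the paper's definition of NC-RetNet.

def decay_mask(num_frames, gamma, future_frames):
    """Return D where D[n][m] weights frame m when estimating frame n.

    Past and current frames (m <= n) get gamma ** (n - m).
    Up to `future_frames` frames ahead get gamma ** (m - n) (assumed rule).
    Frames further in the future are masked out with 0.0.
    """
    mask = []
    for n in range(num_frames):
        row = []
        for m in range(num_frames):
            offset = m - n  # positive offsets are future frames
            if offset <= future_frames:
                row.append(gamma ** abs(offset))
            else:
                row.append(0.0)
        mask.append(row)
    return mask


def lookahead_latency_ms(future_frames, fps):
    """Delay caused by waiting for future frames before producing an output."""
    return 1000.0 * future_frames / fps


if __name__ == "__main__":
    for row in decay_mask(num_frames=4, gamma=0.5, future_frames=1):
        print(row)
    print(lookahead_latency_ms(future_frames=2, fps=50), "ms")
```

With `future_frames=0` the mask reduces to the ordinary causal case, and the latency helper shows why the look-ahead is kept small: every extra future frame adds one frame interval of delay before a pose can be output.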

Author information

Authors and Affiliations

Department of Electronic Engineering, Tsinghua University, Beijing, China
Kaili Zheng, Chenyi Guo & Ji Wu
Baidu Research, Beijing, China
Feixiang Lu, Yihao Lv & Liangjun Zhang
College of AI, Tsinghua University, Beijing, China
Ji Wu
Beijing National Research Center for Information Science and Technology, Beijing, China
Ji Wu

Corresponding author

Correspondence to Ji Wu.

Editor information

Editors and Affiliations

University of Birmingham, Birmingham, UK
Aleš Leonardis
University of Trento, Trento, Italy
Elisa Ricci
Technical University of Darmstadt, Darmstadt, Hessen, Germany
Stefan Roth
Princeton University, Palo Alto, CA, USA
Olga Russakovsky
Czech Technical University in Prague, Prague, Czech Republic
Torsten Sattler
École des Ponts ParisTech, Marne-la-Vallée, France
Gül Varol

Electronic supplementary material

Supplementary material 1 (PDF, 1133 KB)

Cite this paper

Zheng, K., Lu, F., Lv, Y., Zhang, L., Guo, C., Wu, J. (2025). 3D Human Pose Estimation via Non-causal Retentive Networks. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15091. Springer, Cham. https://doi.org/10.1007/978-3-031-73414-4_7

DOI: https://doi.org/10.1007/978-3-031-73414-4_7
Published: 25 October 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73413-7
Online ISBN: 978-3-031-73414-4
eBook Packages: Computer Science, Computer Science (R0)
