Self-supervised action representation learning from partial consistency skeleton sequences

  • Original Article
Neural Computing and Applications

Abstract

In recent years, self-supervised representation learning for skeleton-based action recognition has achieved remarkable results, driven by advances in contrastive learning. However, existing methods often overlook the local information within skeleton data and therefore fail to learn fine-grained features efficiently. To exploit local features, enhance representation capacity, and capture discriminative representations, we design an adaptive self-supervised contrastive learning framework for action recognition called AdaSCLR. AdaSCLR introduces an adaptive spatiotemporal graph convolutional network that learns the topology of different samples and hierarchical levels, and applies an attention mask module that extracts salient and non-salient local features from the global features, emphasizing their significance and facilitating similarity-based learning. In addition, AdaSCLR extracts information from the upper and lower limbs as local features to help the model learn more discriminative representations. Experimental results show that our approach outperforms state-of-the-art methods on the NTU RGB+D, NTU RGB+D 120, and PKU-MMD datasets.
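
The similarity-based learning mentioned in the abstract can be illustrated with a generic contrastive objective. The sketch below is a minimal NumPy version of the InfoNCE loss, a common choice in contrastive learning; it is not taken from the article and is not claimed to be the loss used by AdaSCLR. The function name `info_nce_loss`, the temperature value, and the random stand-in features are all assumptions made for illustration.

```python
import numpy as np


def info_nce_loss(queries, keys, temperature=0.07):
    """Generic InfoNCE loss; queries[i] and keys[i] form the positive pair.

    Illustrative only: this is a standard contrastive objective, not the
    specific loss defined in the article.
    """
    # L2-normalise so that dot products are cosine similarities.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = q @ k.T / temperature  # (N, N) similarity matrix
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positive pairs sit on the diagonal.
    return float(-np.mean(np.diag(log_prob)))


# Hypothetical stand-ins for global features and matching local features.
rng = np.random.default_rng(0)
global_feats = rng.normal(size=(8, 16))
aligned = global_feats + 0.01 * rng.normal(size=(8, 16))
shuffled = aligned[::-1].copy()  # every positive pair is now mismatched

loss_aligned = info_nce_loss(global_feats, aligned)
loss_shuffled = info_nce_loss(global_feats, shuffled)
```

In this toy setup the loss is low when each feature is paired with its own slightly perturbed copy and high when the pairs are mismatched, which is the behaviour a contrastive objective relies on.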

Data availability

The data that support the findings of this study are available from the first author upon reasonable request.

Acknowledgements

This work is partially supported by the National Natural Science Foundation of China (NSFC) under Grant No. 62272108.

Author information

Corresponding author

Correspondence to Yinwei Zhan.

Ethics declarations

Conflict of interest

The authors declare no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Lin, B., Zhan, Y. Self-supervised action representation learning from partial consistency skeleton sequences. Neural Comput & Applic 36, 12385–12395 (2024). https://doi.org/10.1007/s00521-024-09671-5

  • DOI: https://doi.org/10.1007/s00521-024-09671-5
