Self-supervised action representation learning from partial consistency skeleton sequences

Lin, Biyun; Zhan, Yinwei

doi:10.1007/s00521-024-09671-5

Original Article
Published: 21 April 2024

Volume 36, pages 12385–12395, (2024)

Neural Computing and Applications

Abstract

In recent years, advances in contrastive learning have brought remarkable results to self-supervised representation learning for skeleton-based action recognition. However, existing methods often overlook the local information within skeleton data and therefore fail to learn fine-grained features efficiently. To leverage local features to enhance representation capacity and capture discriminative representations, we design an adaptive self-supervised contrastive learning framework for action recognition called AdaSCLR. In AdaSCLR, we introduce an adaptive spatiotemporal graph convolutional network that learns the topology of different samples and hierarchical levels, and we apply an attention mask module that extracts salient and non-salient local features from the global features, emphasizing their significance and facilitating similarity-based learning. In addition, AdaSCLR extracts information from the upper and lower limbs as local features to help the model learn more discriminative representations. Experimental results show that our approach outperforms state-of-the-art methods on the NTU RGB+D, NTU RGB+D 120, and PKU-MMD datasets.
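
The two ideas the abstract names, limb-level local views and contrastive similarity learning, can be illustrated with a small sketch. This is not the AdaSCLR implementation: the joint index sets assume a 25-joint Kinect-style ordering, the function names `limb_views` and `info_nce` are hypothetical, and the loss is a generic InfoNCE objective used here only as a stand-in, since the abstract does not specify the exact loss.

```python
import numpy as np

# Hypothetical 0-indexed joint groups for a 25-joint skeleton.
# Assumed for illustration only; not the partition used in the paper.
UPPER_LIMB = [4, 5, 6, 7, 8, 9, 10, 11, 21, 22, 23, 24]  # arms and hands
LOWER_LIMB = [12, 13, 14, 15, 16, 17, 18, 19]            # legs and feet


def limb_views(seq):
    """Split a skeleton sequence of shape (T, V, C) into limb sub-sequences."""
    return seq[:, UPPER_LIMB, :], seq[:, LOWER_LIMB, :]


def l2_normalize(x, eps=1e-12):
    """Scale each row to (approximately) unit Euclidean norm."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def info_nce(query, key, temperature=0.07):
    """Generic InfoNCE loss over a batch of embeddings of shape (N, D).

    query[i] and key[i] form the positive pair; every other key in the
    batch acts as a negative for query[i].
    """
    q = l2_normalize(query)
    k = l2_normalize(key)
    logits = q @ k.T / temperature                       # (N, N) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


if __name__ == "__main__":
    seq = np.zeros((64, 25, 3))          # 64 frames, 25 joints, 3D coordinates
    upper, lower = limb_views(seq)
    print(upper.shape, lower.shape)

    emb = np.eye(4)                      # toy embeddings, one per sample
    print(info_nce(emb, emb))                      # matched pairs: low loss
    print(info_nce(emb, np.roll(emb, 1, axis=0)))  # mismatched pairs: high loss
```

In a real pipeline, an encoder such as a graph convolutional network would map each global or limb-level view to an embedding before the loss is computed; the identity matrices above only show that the loss falls when positive pairs agree and rises when they do not.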

Data availability

The data that support the findings of this study are available from the first author upon reasonable request.

Acknowledgements

This work is supported partially by the National Natural Science Foundation of China (NSFC) Grant No. 62272108.

Author information

Authors and Affiliations

School of Computer Science and Technology, Guangdong University of Technology, Guangzhou, 510006, Guangdong, China
Biyun Lin & Yinwei Zhan

Corresponding author

Correspondence to Yinwei Zhan.

Ethics declarations

Conflict of interest

The authors declare no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Lin, B., Zhan, Y. Self-supervised action representation learning from partial consistency skeleton sequences. Neural Comput & Applic 36, 12385–12395 (2024). https://doi.org/10.1007/s00521-024-09671-5

Received: 21 July 2023
Accepted: 25 March 2024
Published: 21 April 2024
Issue Date: July 2024
DOI: https://doi.org/10.1007/s00521-024-09671-5
