Abstract
In recent years, self-supervised representation learning for skeleton-based action recognition has achieved remarkable results using skeleton sequences with the advance of contrastive learning methods. However, existing methods often overlook the local information within the skeleton data, so as to not efficiently learn fine-grained features. To leverage local features to enhance representation capacity and capture discriminative representations, we design an adaptive self-supervised contrastive learning framework for action recognition called AdaSCLR. In AdaSCLR, we introduce an adaptive spatiotemporal graph convolutional network to learn the topology of different samples and hierarchical levels and apply an attention mask module to extract salient and non-salient local features from the global features, emphasizing their significance and facilitating similarity-based learning. In addition, AdaSCLR extracts information from the upper and lower limbs as local features to assist the model in learning more discriminative representation. Experimental results show that our approach is better than the state-of-the-art methods on NTURGB+D, NTU120-RGB+D, and PKU-MMD datasets.




Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Data availability
The data that support the findings of this study are available from the first author upon reasonable request.
References
Cao Z, Simon T, Wei S-E, Sheikh Y (2017) Realtime multi-person 2d pose estimation using part affinity fields. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7291–7299
Sun K, Xiao B, Liu D, Wang J (2019) Deep high-resolution representation learning for human pose estimation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 5693–5703
Xu J, Yu Z, Ni B, Yang J, Yang X, Zhang W (2020) Deep kinematics analysis for monocular 3d human pose estimation. In: Proceedings of the IEEE/CVF Conference on computer vision and Pattern recognition, pp 899–908
Zheng C, Mendieta M, Wang P, Lu A, Chen C (2022) A lightweight graph transformer network for human mesh reconstruction from 2d human pose. In: Proceedings of the 30th ACM international conference on multimedia, pp 5496–5507
Li M, Wei F, Li Y, Zhang S, Xu G (2020) Three-dimensional pose estimation of infants lying supine using data from a kinect sensor with low training cost. IEEE Sens J 21(5):6904–6913
Wang P, Wen J, Si C, Qian Y, Wang L (2022) Contrast-reconstruction representation learning for self-supervised skeleton-based action recognition. IEEE Trans Image Process 31:6224–6238
Gao X, Yang Y, Du S (2021) Contrastive self-supervised learning for skeleton action recognition. In: NeurIPS 2020 workshop on pre-registration in machine learning, pp 51–61, PMLR
Chen Z, Liu H, Guo T, Chen Z, Song P, Tang H (2022) Contrastive learning from spatio-temporal mixed skeleton sequences for self-supervised skeleton-based action recognition, arXiv preprint arXiv:2207.03065,
Guo T, Liu H, Chen Z, Liu M, Wang T, Ding R (2022) Contrastive learning from extremely augmented skeleton sequences for self-supervised action recognition. Proc AAAI Conf AI 36:762–770
Wu W, Hua Y, Zheng C, Wu S, Chen C, Lu A (2023) Skeletonmae: spatial-temporal masked autoencoders for self-supervised skeleton action recognition. In: 2023 IEEE international conference on multimedia and expo workshops (ICMEW), pp 224–229, IEEE
Li L, Wang M, Ni B, Wang H, Yang J, Zhang W (2021) 3d human action representation learning via cross-view consistency pursuit. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 4741–4750
Zhang J, Lin L, Liu J (2023) Hierarchical consistent contrastive learning for skeleton-based action recognition with growing augmentations. Proc AAAI Conf AI 37:3427–3435
Oord Avd, Li Y, Vinyals O (2018) Representation learning with contrastive predictive coding, arXiv preprint arXiv:1807.03748,
Devlin J, Chang M-W, Lee K, Toutanova K (2018) Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805,
Ouyang L, Wu J, Jiang X, Almeida D, Wainwright C, Mishkin P, Zhang C, Agarwal S, Slama K, Ray A et al (2022) Training language models to follow instructions with human feedback. Adv Neural Inf Process Syst 35:27730–27744
Caron M, Misra I, Mairal J, Goyal P, Bojanowski P, Joulin A (2020) Unsupervised learning of visual features by contrasting cluster assignments. Adv Neural Inf Process Syst 33:9912–9924
He K, Fan H, Wu Y, Xie S, Girshick R (2020) Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 9729–9738
Chen T, Kornblith S, Norouzi M, Hinton G (2020) A simple framework for contrastive learning of visual representations. In: International conference on machine learning, pp 1597–1607, PMLR
Chen X, He K (2021) Exploring simple siamese representation learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 15750–15758
Caron M, Touvron H, Misra I, Jégou H, Mairal J, Bojanowski P, Joulin A (2021) Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 9650–9660
Chen X, Xie S, He K (2021) An empirical study of training self-supervised vision transformers. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp 9620–9629
Vemulapalli R, Arrate F, Chellappa R (2014) Human action recognition by representing 3d skeletons as points in a lie group. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 588–595
Wang J, Liu Z, Wu Y, Yuan J (2012) Mining actionlet ensemble for action recognition with depth cameras. In: 2012 IEEE conference on computer vision and pattern recognition, pp 1290–1297, IEEE
Yang X, Tian YL (2012) Eigenjoints-based action recognition using naive-bayes-nearest-neighbor. In: 2012 IEEE computer society conference on computer vision and pattern recognition workshops, pp 14–19, IEEE
Huang C-P, Hsieh C-H, Lai K-T, Huang W-Y (2011) Human action recognition using histogram of oriented gradient of motion history image. In: 2011 first international conference on instrumentation, measurement, computer, communication and control, pp 353–356
Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Adv Neural Inform Process Syst, 27
Ke Q, Bennamoun M, An S, Sohel F, Boussaid F (2017) A new representation of skeleton sequences for 3d action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3288–3297
Diba A, Sharma V, Van Gool L (2017) Deep temporal linear encoding networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2329–2338
Duan H, Zhao Y, Chen K, Lin D, Dai B (2022) Revisiting skeleton-based action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 2969–2978
Choutas V, Weinzaepfel P, Revaud J, Schmid C (2018) Potion: pose motion representation for action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7024–7033
Yan A, Wang Y, Li Z, Qiao Y (2019) Pa3d: pose-action 3d machine for video recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 7922–7931
Wang L, Xiong Y, Wang Z, Qiao Y (2015) Towards good practices for very deep two-stream convnets, arXiv preprint arXiv:1507.02159
Song S, Lan C, Xing J, Zeng W, Liu J (2018) Spatio-temporal attention-based lstm networks for 3d action recognition and detection. IEEE Trans Image Process 27(7):3459–3471
Zhang P, Lan C, Xing J, Zeng W, Xue J, Zheng N (2019) View adaptive neural networks for high performance skeleton-based human action recognition. IEEE Trans Pattern Anal Mach Intell 41(8):1963–1978
Zhang P, Lan C, Xing J, Zeng W, Xue J, Zheng N (2017) View adaptive recurrent neural networks for high performance human action recognition from skeleton data. In: Proceedings of the IEEE international conference on computer vision, pp 2117–2126
Yan S, Xiong Y, Lin D (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Proceedings of the AAAI conference on artificial intelligence, 32
Si C, Chen W, Wang W, Wang L, Tan T (2019) An attention enhanced graph convolutional lstm network for skeleton-based action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 1227–1236
Shi L, Zhang Y, Cheng J, Lu H (2020) Skeleton-based action recognition with multi-stream adaptive graph convolutional networks. IEEE Trans Image Process 29:9532–9545
Zhang X, Xu C, Tao D (2020) Context aware graph convolution for skeleton-based action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 14333–14342
Sun Z, Ke Q, Rahmani H, Bennamoun M, Wang G, Liu J (2022) Human action recognition from various data modalities: a review. IEEE Trans Pattern Anal Mach Intell
Zhu Y, Shuai H, Liu G, Liu Q (2022) Multilevel spatial-temporal excited graph network for skeleton-based action recognition. IEEE Trans Image Process 32:496–508
Davoodikakhki M,Yin K (2020) Hierarchical action classification with network pruning. In: Advances in Visual Computing: 15th International Symposium, ISVC 2020, San Diego, CA, USA, October 5–7, 2020, Proceedings, Part I 15, pp 291–305, Springer
Su K, Liu X, Shlizerman E (2020) Predict & cluster: Unsupervised skeleton based action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 9631–9640
Lin L, Song S, Yang W, Liu J (2020) Ms2l: multi-task self-supervised learning for skeleton based action recognition. In: Proceedings of the 28th ACM international conference on multimedia, pp 2490–2498
Zhan Y, Chen Y, Ren P, Sun H, Wang J, Qi Q, Liao J (2021) Spatial temporal enhanced contrastive and pretext learning for skeleton-based action representation. In: Asian conference on machine learning, pp 534–547, PMLR
Hua Y, Wu W, Zheng C, Lu A, Liu M, Chen C, Wu S (2023) Part aware contrastive learning for self-supervised action recognition. arXiv preprint arXiv:2305.00666
Shahroudy A, Liu J, Ng T-T, Wang G (2016) Ntu rgb+ d: A large scale dataset for 3d human activity analysis. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1010–1019
Liu J, Shahroudy A, Perez M, Wang G, Duan L-Y, Kot AC (2019) Ntu rgb+ d 120: a large-scale benchmark for 3d human activity understanding. IEEE Trans Pattern Anal Mach Intell 42(10):2684–2701
Liu C, Hu Y, Li Y, Song S, Liu J (2017) Pku-mmd: A large scale benchmark for skeleton-based human action understanding. In: Proceedings of the workshop on visual analysis in smart and connected communities, pp 1–8
Van der Maaten L, Hinton G (2008) Visualizing data using t-sne. J Mach Learn Res, 9(11)
Zheng N, Wen J, Liu R, Long L, Dai J, Gong Z (2018) Unsupervised representation learning with long-term dynamics for skeleton based action recognition. In: Proceedings of the AAAI conference on artificial intelligence, 32
Xu S, Rao H, Hu X, Cheng J, Hu B (2021) Prototypical contrast and reverse prediction: unsupervised skeleton based action recognition. IEEE Trans Multimed
Rao H, Xu S, Hu X, Cheng J, Hu B (2021) Augmented skeleton based contrastive action learning with momentum lstm for unsupervised action recognition. Inf Sci 569:90–109
Kundu JN, Gor M, Uppala PK, Radhakrishnan VB (2019) Unsupervised feature learning of human actions as trajectories in pose embedding manifold. In: 2019 IEEE winter conference on applications of computer vision (WACV), pp 1459–1467, IEEE
Nie Q, Liu Z, Liu Y (2020) Unsupervised 3d human pose representation with viewpoint and pose disentanglement. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIX 16, pp 102–118, Springer
Dong J, Sun S, Liu Z, Chen S, Liu B, Wang X (2023) Hierarchical contrast for unsupervised skeleton-based action representation learning. Proc AAAI Conf AI 37:525–533
Li M, Chen S,Chen X, Zhang Y, Wang Y, Tian Q (2019) Actional-structural graph convolutional networks for skeleton-based action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 3595–3603
Zhou Y, Cheng Z-Q, He J-Y, Luo B, Geng Y, Xie X, Keuper M (2023) Overcoming topology agnosticism: Enhancing skeleton-based action recognition through redefined skeletal topology awareness. arXiv preprint arXiv:2305.11468
Chen Y, Zhang Z, Yuan C, Li B, Deng Y,Hu W (2021) Channel-wise topology refinement graph convolution for skeleton-based action recognition. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 13359–13368
Thoker FM, Doughty H, Snoek CG (2021) Skeleton-contrastive 3d action representation learning. In: Proceedings of the 29th ACM international conference on multimedia, pp 1655–1663
Si C, Nie X, Wang W,Wang L, Tan T, Feng J (2020) Adversarial self-supervised learning for semi-supervised 3d action recognition. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VII 16, pp 35–51, Springer
Acknowledgements
This work is supported partially by the National Natural Science Foundation of China (NSFC) Grant No. 62272108.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
Authors declare no conflicts of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Lin, B., Zhan, Y. Self-supervised action representation learning from partial consistency skeleton sequences. Neural Comput & Applic 36, 12385–12395 (2024). https://doi.org/10.1007/s00521-024-09671-5
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00521-024-09671-5