Unsupervised skeleton-based action representation learning via relation consistency pursuit

Zhang, Wenjing; Hou, Yonghong; Zhang, Haoyuan

doi:10.1007/s00521-022-07584-9

Unsupervised skeleton-based action representation learning via relation consistency pursuit

Original Article
Published: 23 July 2022

Volume 34, pages 20327–20339, (2022)
Cite this article

Neural Computing and Applications Aims and scope Submit manuscript

357 Accesses
1 Citation
1 Altmetric
Explore all metrics

Abstract

In this paper, we propose a Skeleton-based Relation Consistency Learning scheme (SRCL) for unsupervised 3D action representation learning. By leveraging the inter-instance similarity score distribution as relation metric, SRCL is able to pursue not only the similarity but also the inter-instance relation consistency of different augmentations from same skeleton instance. The architecture of SRCL consists of two asymmetric neural networks, referred to as online and target networks. The online network is trained to mimic the inter-instance similarity score distribution inferred by the target network over a set of skeleton instances. Moreover, with the relation consistency achieved by distribution similarity learning, diversified skeleton positives can be potentially provided, which further boosts the representation learning. Experimental results verify that the proposed framework outperforms state-of-the-art methods on the challenging NTU-60 and NTU-120 datasets under unsupervised settings. Code will be available.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Cross-View Nearest Neighbor Contrastive Learning of Human Skeleton Representation

Self-supervised action representation learning from partial consistency skeleton sequences

Article 21 April 2024

Asymmetric information-regularized learning for skeleton-based action recognition

Article 02 December 2023

References

Ben Tanfous A, Drira H, Ben Amor B (2018) Coding kendall’s shape trajectories for 3d action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2840–2849
Berretti S, Daoudi M, Turaga P, Basu A (2018) Representation, analysis, and recognition of 3d humans: a survey. ACM Trans Multimed Comput Commun Appl (TOMM) 14:1–36
Google Scholar
Caetano C, Brémond F, Schwartz WR (2019) Skeleton image representation for 3d action recognition based on tree structure and reference joints. In: 2019 32nd SIBGRAPI conference on graphics, patterns and images (SIBGRAPI), IEEE. pp 16–23
Chen J, Samuel RDJ, Poovendran P (2021) Lstm with bio inspired algorithm for action recognition in sports videos. Image Vis. Comput 112:104214
Article Google Scholar
Chen T, Kornblith S, Norouzi M, Hinton G (2020) A simple framework for contrastive learning of visual representations. In: International conference on machine learning, PMLR, pp 1597–1607
Chen X, He K (2021) Exploring simple siamese representation learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 15750–15758
Fang Z, Wang J, Wang L, Zhang L, Yang Y, Liu Z (2021) Seed: self-supervised distillation for visual representation. arXiv preprint arXiv:2101.04731
Grill JB, Strub F, Altché F, Tallec C, Richemond PH, Buchatskaya E, Doersch C, Pires BA, Guo ZD, Azar MG et al (2020) Bootstrap your own latent: a new approach to self-supervised learning. arXiv preprint arXiv:2006.07733
Gui LY, Wang YX, Liang X, Moura JM (2018) Adversarial geometry-aware human motion prediction. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 786–803
Gutmann MU, Hyvärinen A (2012) Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. J Mach Learn Res 13:2
MathSciNet MATH Google Scholar
He K, Fan H, Wu Y, Xie S, Girshick R (2020) Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 9729–9738
Hochreiter S, Bengio Y, Frasconi P, Schmidhuber J et al (2001) Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. A field guide to dynamical recurrent neural networks. IEEE Press, 237–243
Holzinger A, Malle B, Saranti A, Pfeifer B (2021) Towards multi-modal causability with graph neural networks enabling information fusion for explainable ai. Inf Fusion 71:28–37
Article Google Scholar
Hou Y, Li Z, Wang P, Li W (2018) Skeleton optical spectra-based action recognition using convolutional neural networks. IEEE Trans Circuits Syst Video Technol 28:807–811
Article Google Scholar
Hu JF, Zheng WS, Ma L, Wang G, Lai J, Zhang J (2018) Early action prediction by soft regression. IEEE transact pattern anal mach intell 41:2568–2583
Article Google Scholar
Jing C, Wei P, Sun H, Zheng N (2020) Spatiotemporal neural networks for action recognition based on joint loss. Neural Comput Appl 32:4293–4302
Article Google Scholar
Ke Q, Bennamoun M, An S, Sohel F, Boussaid F (2017) A new representation of skeleton sequences for 3d action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3288–3297
Khosla P, Teterwak P, Wang C, Sarna A, Tian Y, Isola P, Maschinot A, Liu C, Krishnan D (2020) Supervised contrastive learning. arXiv preprint arXiv:2004.11362
Kong Q, Wei W, Deng Z, Yoshinaga T, Murakami T (2020) Cycle-contrast for self-supervised video representation learning. arXiv preprint arXiv:2010.14810
Li C, Hou Y, Wang P, Li W (2017) Joint distance maps based action recognition with convolutional neural networks. IEEE Signal Process Lett 24:624–628
Article Google Scholar
Li C, Zhong Q, Xie D, Pu S (2017b) Skeleton-based action recognition with convolutional neural networks. In: 2017 IEEE international conference on multimedia & expo workshops (ICMEW), IEEE. pp 597–600
Li J, Wong Y, Zhao Q, Kankanhalli MS (2018a) Unsupervised learning of view-invariant action representations. arXiv preprint arXiv:1809.01844
Li L, Wang M, Ni B, Wang H, Yang J, Zhang W (2021) 3d human action representation learning via cross-view consistency pursuit. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 4741–4750
Li M, Chen S, Chen X, Zhang Y, Wang Y, Tian Q (2019) Actional-structural graph convolutional networks for skeleton-based action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3595–3603
Li S, Li W, Cook C, Zhu C, Gao Y (2018b) Independently recurrent neural network (indrnn): Building a longer and deeper rnn, In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5457–5466
Liang D, Fan G, Lin G, Chen W, Pan X, Zhu H (2019) Three-stream convolutional neural network with multi-task and ensemble learning for 3d action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pp 934–940
Lin L, Song S, Yang W, Liu J (2020) Ms2l: Multi-task self-supervised learning for skeleton based action recognition. In: Proceedings of the 28th ACM international conference on multimedia, pp 2490–2498
Liu J, Shahroudy A, Perez M, Wang G, Duan LY, Kot AC (2019) Ntu rgb+ d 120: a large-scale benchmark for 3d human activity understanding. IEEE trans pattern anal mach intell 42:2684–2701
Article Google Scholar
Liu M, Liu H, Chen C (2017) 3d action recognition using multiscale energy-based global ternary image. IEEE Trans Circuits Syst Video Technol 28:1824–1838
Article MathSciNet Google Scholar
Liu Z, Li Z, Wang R, Zong M, Ji W (2020) Spatiotemporal saliency-based multi-stream networks with attention-aware lstm for action recognition. Neural Comput Appl 32:14593–14602
Article Google Scholar
Liu Z, Zhang H, Chen Z, Wang Z, Ouyang W (2020b) Disentangling and unifying graph convolutions for skeleton-based action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 143–152
Loshchilov I, Hutter F (2016) Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983
Luo Z, Peng B, Huang DA, Alahi A, Fei-Fei L (2017) Unsupervised learning of long-term motion dynamics for videos. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2203–2212
Van der Maaten L, Hinton G (2008) Visualizing data using t-sne. J Mach Learn Res 9:2579–2605
MATH Google Scholar
Ni B, Wang G, Moulin P (2011) Rgbd-hudaact: A color-depth video database for human daily activity recognition. In: 2011 IEEE international conference on computer vision workshops (ICCV workshops), IEEE, pp 1147–1153
Oord Avd, Li Y, Vinyals O (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748
Rao H, Xu S, Hu X, Cheng J, Hu B (2021) Augmented skeleton based contrastive action learning with momentum lstm for unsupervised action recognition. Inf Sci 569:90–109
Article Google Scholar
Shahroudy A, Liu J, Ng TT, Wang G (2016) Ntu rgb+ d: A large scale dataset for 3d human activity analysis. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1010–1019
Shi L, Zhang Y, Cheng J, Lu H (2019) Two-stream adaptive graph convolutional networks for skeleton-based action recognition, In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 12026–12035
Shi Z, Kim TK (2017) Learning and refining of privileged information-based rnns for action recognition from depth sequences. In: proceedings of the IEEE conference on computer vision and pattern recognition, pp 3461–3470
Si C, Chen W, Wang W, Wang L, Tan T (2019) An attention enhanced graph convolutional lstm network for skeleton-based action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1227–1236
Singh D, Merdivan E, Psychoula I, Kropf J, Hanke S, Geist M, Holzinger A (2017) Human activity recognition using recurrent neural networks. In: International cross-domain conference for machine learning and knowledge extraction, Springer, pp 267–274
Singh T, Vishwakarma DK (2021) A deeply coupled convnet for human activity recognition using dynamic and rgb images. Neural Comput Appl 33:469–485
Article Google Scholar
Song S, Lan C, Xing J, Zeng W, Liu J (2018) Spatio-temporal attention-based lstm networks for 3d action recognition and detection. IEEE Trans image process 27:3459–3471
Article MathSciNet Google Scholar
Srivastava N, Mansimov E, Salakhudinov R (2015) Unsupervised learning of video representations using lstms. In: International conference on machine learning, PMLR, pp 843–852
Su K, Liu X, Shlizerman E (2020) Predict & cluster: unsupervised skeleton based action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 9631–9640
Sun N, Leng L, Liu J, Han G (2021) Multi-stream slowfast graph convolutional networks for skeleton-based action recognition. Image Vis Comput 109:104141
Article Google Scholar
Thoker FM, Doughty H, Snoek CG (2021) Skeleton-contrastive 3d action representation learning. In: Proceedings of the 29th ACM international conference on multimedia, pp 1655–1663
Tian Y, Krishnan D, Isola P (2020) Contrastive multiview coding. In: Computer vision–ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16, Springer, pp 776–794
Vemulapalli R, Arrate F, Chellappa R (2014) Human action recognition by representing 3d skeletons as points in a lie group. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 588–595
Wang J, Liu Z, Wu Y, Yuan J (2012) Mining actionlet ensemble for action recognition with depth cameras. In: 2012 IEEE conference on computer vision and pattern recognition, IEEE. pp 1290–1297
Wang P, Li W, Gao Z, Zhang Y, Tang C, Ogunbona P (2017) Scene flow to action map: A new representation for rgb-d based action recognition with convolutional neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 595–604
Wu Z, Xiong Y, Yu SX, Lin D (2018) Unsupervised feature learning via non-parametric instance discrimination. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3733–3742
Xiao Y, Chen J, Wang Y, Cao Z, Zhou JT, Bai X (2019) Action recognition for depth video using multi-view dynamic images. Inf Sci 480:287–304
Article Google Scholar
Yan S, Xiong Y, Lin D (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. arXiv preprint arXiv:1801.07455
You Y, Gitman I, Ginsburg B (2017) Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888
Zbontar J, Jing L, Misra I, LeCun Y, Deny S (2021) Barlow twins: self-supervised learning via redundancy reduction. arXiv preprint arXiv:2103.03230
Zhang H, Hou Y, Wang P, Guo Z, Li W (2020) Sar-nas: skeleton-based action recognition via neural architecture searching. J Vis Commun Image Represent 73:102942
Article Google Scholar
Zhang P, Lan C, Xing J, Zeng W, Xue J, Zheng N (2017) View adaptive recurrent neural networks for high performance human action recognition from skeleton data. In: Proceedings of the IEEE International conference on computer vision, pp 2117–2126
Zhang X, Xu C, Tao D (2020b) Context aware graph convolution for skeleton-based action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 14333–14342
Zheng N, Wen J, Liu R, Long L, Dai J, Gong Z (2018) Unsupervised representation learning with long-term dynamics for skeleton based action recognition. In: Proceedings of AAAI conference on artificial intelligence, 32

Download references

Author information

Authors and Affiliations

School of Electrical and Information Engineering, Tianjin University, Tianjin, China
Wenjing Zhang, Yonghong Hou & Haoyuan Zhang

Authors

Wenjing Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Yonghong Hou
View author publications
You can also search for this author in PubMed Google Scholar
Haoyuan Zhang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Wenjing Zhang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Zhang, W., Hou, Y. & Zhang, H. Unsupervised skeleton-based action representation learning via relation consistency pursuit. Neural Comput & Applic 34, 20327–20339 (2022). https://doi.org/10.1007/s00521-022-07584-9

Download citation

Received: 09 December 2021
Accepted: 28 June 2022
Published: 23 July 2022
Issue Date: November 2022
DOI: https://doi.org/10.1007/s00521-022-07584-9

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Unsupervised skeleton-based action representation learning via relation consistency pursuit

Abstract

Access this article

Similar content being viewed by others

Cross-View Nearest Neighbor Contrastive Learning of Human Skeleton Representation

Self-supervised action representation learning from partial consistency skeleton sequences

Asymmetric information-regularized learning for skeleton-based action recognition

References

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Unsupervised skeleton-based action representation learning via relation consistency pursuit

Abstract

Access this article

Similar content being viewed by others

Cross-View Nearest Neighbor Contrastive Learning of Human Skeleton Representation

Self-supervised action representation learning from partial consistency skeleton sequences

Asymmetric information-regularized learning for skeleton-based action recognition

References

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation