
Unsupervised skeleton-based action representation learning via relation consistency pursuit

  • Original Article
  • Neural Computing and Applications

Abstract

In this paper, we propose a Skeleton-based Relation Consistency Learning scheme (SRCL) for unsupervised 3D action representation learning. Using the inter-instance similarity score distribution as a relation metric, SRCL pursues not only the similarity but also the inter-instance relation consistency between different augmentations of the same skeleton instance. The architecture of SRCL consists of two asymmetric neural networks, referred to as the online and target networks. The online network is trained to mimic the inter-instance similarity score distribution inferred by the target network over a set of skeleton instances. Moreover, the relation consistency achieved through distribution similarity learning can potentially provide diversified skeleton positives, which further boosts representation learning. Experimental results verify that the proposed framework outperforms state-of-the-art methods on the challenging NTU-60 and NTU-120 datasets under unsupervised settings. Code will be available.
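The relation consistency objective described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes cosine similarity against a set of anchor embeddings, softmax temperatures, and a cross-entropy between the target and online distributions, none of which are specified in the abstract. The names `relation_consistency_loss`, `anchors`, `tau_online`, and `tau_target` are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale each embedding to unit length so dot products are cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def log_softmax(logits, axis=-1):
    # Numerically stable log-softmax.
    z = logits - logits.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def relation_consistency_loss(online_emb, target_emb, anchors,
                              tau_online=0.1, tau_target=0.05):
    """Cross-entropy between target and online similarity distributions.

    online_emb: (B, D) embeddings of one augmentation from the online network.
    target_emb: (B, D) embeddings of another augmentation from the target network.
    anchors:    (K, D) embeddings of a set of other skeleton instances (assumed).
    The temperatures are assumed hyperparameters, not values from the paper.
    """
    q = l2_normalize(online_emb)
    k = l2_normalize(target_emb)
    a = l2_normalize(anchors)
    # Similarity score distribution over the anchor set, per sample.
    log_p_online = log_softmax(q @ a.T / tau_online)
    p_target = np.exp(log_softmax(k @ a.T / tau_target))
    # In training, p_target would be treated as a constant (no gradient).
    return float(-(p_target * log_p_online).sum(axis=1).mean())

rng = np.random.default_rng(0)
anchors = rng.normal(size=(16, 8))
online = rng.normal(size=(4, 8))
target = rng.normal(size=(4, 8))
loss = relation_consistency_loss(online, target, anchors)
```

With equal temperatures, the loss is smallest when the online distribution matches the target distribution, which is the sense in which the online network "mimics" the target network in this sketch.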



Author information

Corresponding author

Correspondence to Wenjing Zhang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhang, W., Hou, Y. & Zhang, H. Unsupervised skeleton-based action representation learning via relation consistency pursuit. Neural Comput & Applic 34, 20327–20339 (2022). https://doi.org/10.1007/s00521-022-07584-9


  • DOI: https://doi.org/10.1007/s00521-022-07584-9
