JointContrast: Skeleton-Based Mutual Action Recognition with Contrastive Learning

Jia, Xiangze; Zhang, Ji; Wang, Zhen; Luo, Yonglong; Chen, Fulong; Xiao, Jing

doi:10.1007/978-3-031-20868-3_35

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13631)

Included in the following conference series:

Pacific Rim International Conference on Artificial Intelligence

Abstract

Skeleton-based action recognition relies on skeleton sequences to recognize categories of human actions. Many scenes involve mutual actions, in which more than one subject takes part, yet existing works either process each subject independently or fuse subject features with a pooling layer, which leads to ineffective learning and fusion across subjects. In this paper, we propose a novel framework, JointContrast, for skeleton-based action recognition that addresses these challenges. JointContrast includes two innovative components. The first is a pre-training process with a fine-grained contrastive loss that effectively enhances the representation ability of the model; the second is an Interactive Graph (IG) representation for skeletal sequences that contributes to the fusion of features between subjects. We validate JointContrast on the popular SBU and NTU RGB-D datasets, and experimental results show that our model outperforms other baseline methods in terms of recognition accuracy.
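
The abstract does not give the exact form of the contrastive loss or of the Interactive Graph, so the following is only a rough illustration of the two ideas, not the authors' implementation. It assumes a generic InfoNCE-style loss over joint-level feature vectors and a two-person graph whose adjacency matrix joins each subject's skeleton edges with a few cross-subject edges. The names `two_person_adjacency` and `info_nce`, the toy three-joint skeleton, and the temperature value are all hypothetical.

```python
import math


def two_person_adjacency(num_joints, intra_edges, inter_edges):
    """Symmetric adjacency over 2 * num_joints nodes (assumed layout:
    subject A uses indices 0..num_joints-1, subject B the next block)."""
    n = 2 * num_joints
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        adj[i][i] = 1  # self-loops, as is common in graph convolutions
    # Copy the skeleton (bone) edges into each subject's own block.
    for a, b in intra_edges:
        for offset in (0, num_joints):
            adj[a + offset][b + offset] = 1
            adj[b + offset][a + offset] = 1
    # Cross-subject edges: joint a of subject A <-> joint b of subject B.
    for a, b in inter_edges:
        adj[a][b + num_joints] = 1
        adj[b + num_joints][a] = 1
    return adj


def _cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss for one anchor feature (an assumption here,
    standing in for the paper's fine-grained contrastive loss)."""
    logits = [_cosine(anchor, positive) / temperature]
    logits += [_cosine(anchor, neg) / temperature for neg in negatives]
    # Numerically stable log-sum-exp over positive and negative logits.
    top = max(logits)
    log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_denominator - logits[0]


# Toy example: 3 joints per subject (e.g. shoulder-elbow-hand), with the
# two subjects' hands (joint 2) connected across the subjects.
adjacency = two_person_adjacency(3, [(0, 1), (1, 2)], [(2, 2)])
aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

In this sketch the loss is small when the anchor matches its positive and large when it matches a negative instead, which is the behavior a contrastive pre-training objective is meant to reward. The cross-subject edges let a graph network pass information between the two skeletons rather than pooling their features only at the end.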

Acknowledgment

This research is partially supported by Zhejiang Lab (No. 2022PI0AC03 and No. 111010-AN2201) and the National Natural Science Foundation of China (61972438).

Author information

Authors and Affiliations

Nanjing University of Aeronautics and Astronautics, Nanjing, China
Xiangze Jia
University of Southern Queensland, Darling Heights, Australia
Ji Zhang
Zhejiang Lab, Hangzhou City, China
Zhen Wang
Anhui Normal University, Wuhu, China
Yonglong Luo & Fulong Chen
South China Normal University, Guangzhou, China
Jing Xiao

Corresponding author

Correspondence to Ji Zhang.

Editor information

Editors and Affiliations

CSIRO Australian e-Health Research Centre, Brisbane, QLD, Australia
Sankalp Khanna
Shanghai Jiao Tong University, Shanghai, China
Jian Cao
University of Tasmania, Hobart, TAS, Australia
Quan Bai
University of Technology Sydney, Sydney, NSW, Australia
Guandong Xu

About this paper

Cite this paper

Jia, X., Zhang, J., Wang, Z., Luo, Y., Chen, F., Xiao, J. (2022). JointContrast: Skeleton-Based Mutual Action Recognition with Contrastive Learning. In: Khanna, S., Cao, J., Bai, Q., Xu, G. (eds) PRICAI 2022: Trends in Artificial Intelligence. PRICAI 2022. Lecture Notes in Computer Science, vol 13631. Springer, Cham. https://doi.org/10.1007/978-3-031-20868-3_35

DOI: https://doi.org/10.1007/978-3-031-20868-3_35
Published: 04 November 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20867-6
Online ISBN: 978-3-031-20868-3
eBook Packages: Computer Science, Computer Science (R0)
