A Supervised Spatio-Temporal Contrastive Learning Framework with Optimal Skeleton Subgraph Topology for Human Action Recognition

Deng, Zelin; Zhou, Hao; Ouyang, Wei; He, Pei; Yun, Song; Tang, Qiang; Yu, Li

doi:10.1007/978-981-99-8141-0_13

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1964)

Included in the following conference series:

International Conference on Neural Information Processing

Abstract

Human action recognition (HAR) is an active research topic in computer vision, and models based on Graph Convolutional Networks (GCNs) show great advantages in skeleton-based HAR. However, most existing GCN-based methods neither consider the diversity of action trajectories nor highlight the key joints. To address these issues, a supervised spatio-temporal contrastive learning framework with optimal skeleton subgraph topology for HAR (SSTCL-optSST) is proposed. SSTCL-optSST uses the samples that share a label with the target action (the anchor) to build a positive sample set, in which each sample represents one trajectory of the action. The positive sample set is used to design a loss function that guides the model to recognize different poses of the action. Furthermore, the subgraphs of the original skeleton graph are used to construct a skeleton subgraph topology space; each subgraph in this space is evaluated, and the optimal one is selected to highlight the key joints. Extensive experiments on the NTU RGB+D 60 and Kinetics datasets show that the model achieves competitive performance.
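The abstract does not give the exact form of the loss, so the following is only a minimal sketch of the general idea it describes: every sample that shares the anchor's label is treated as a positive, and the loss pulls the anchor toward those positives and away from all other samples. The formulation below is the generic supervised contrastive loss in the style of Khosla et al. (2020), not the authors' SSTCL-optSST loss. The function names, the `temperature` parameter, and the toy two-dimensional embeddings are illustrative assumptions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss (assumed form, not the paper's).

    For each anchor i, the positive set holds every other sample with the
    same label. The loss is the mean negative log-probability of the
    positives under a softmax over similarities to all other samples.
    Anchors without any positive are skipped.
    """
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, anchors_used = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        # Temperature-scaled cosine similarity between the anchor and every other sample.
        logits = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n) if j != i
        }
        # Log-sum-exp with the maximum subtracted for numerical stability.
        max_logit = max(logits.values())
        log_denom = max_logit + math.log(
            sum(math.exp(s - max_logit) for s in logits.values())
        )
        total -= sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors_used += 1
    return total / anchors_used if anchors_used else 0.0


# Toy usage: four hypothetical action embeddings forming two tight clusters.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(supervised_contrastive_loss(toy_embeddings, [0, 0, 1, 1]))  # labels match the clusters
print(supervised_contrastive_loss(toy_embeddings, [0, 1, 0, 1]))  # labels cut across the clusters
```

In this toy setting the loss is lower when the labels agree with the clusters than when they cut across them, which is the behavior a positive sample set built from same-label samples is meant to encourage.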

Acknowledgement

This work was supported in part by the National Natural Science Foundation of China under Grant No. 61977018, the Natural Science Foundation of Changsha under Grant No. kq2202215, and the Practical Innovation and Entrepreneurship Enhancement Program for Professional Degree Postgraduates of Changsha University of Science and Technology (CLSJCX22114).

Author information

Authors and Affiliations

Changsha University of Science and Technology, Changsha, 410114, China
Zelin Deng, Hao Zhou, Song Yun & Qiang Tang
School of Computer Science and Cyber Engineering, Guangzhou University, Guangzhou, 510006, China
Pei He
Hunan Hkt Technology Co. Ltd., Changsha, 410000, China
Wei Ouyang & Li Yu

Corresponding author

Correspondence to Hao Zhou.

Editor information

Editors and Affiliations

School of Automation, Central South University, Changsha, China
Biao Luo
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Long Cheng
Institute of Cyber-Systems and Control, Zhejiang University, Hangzhou, China
Zheng-Guang Wu
School of Automation, Guangdong University of Technology, Guangzhou, China
Hongyi Li
School of Electrical Engineering and Telecommunications, UNSW Sydney, Sydney, NSW, Australia
Chaojie Li

About this paper

Cite this paper

Deng, Z. et al. (2024). A Supervised Spatio-Temporal Contrastive Learning Framework with Optimal Skeleton Subgraph Topology for Human Action Recognition. In: Luo, B., Cheng, L., Wu, ZG., Li, H., Li, C. (eds) Neural Information Processing. ICONIP 2023. Communications in Computer and Information Science, vol 1964. Springer, Singapore. https://doi.org/10.1007/978-981-99-8141-0_13

DOI: https://doi.org/10.1007/978-981-99-8141-0_13
Published: 26 November 2023
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8140-3
Online ISBN: 978-981-99-8141-0
eBook Packages: Computer Science, Computer Science (R0)
