Abstract
Skeleton-based Temporal Action Segmentation (STAS) aims to densely segment and classify human actions in long, untrimmed skeletal motion sequences. Existing STAS methods primarily model spatial dependencies among joints and temporal relationships among frames to generate frame-level one-hot classifications. However, these methods overlook the deep mining of semantic relations among joints as well as actions at a linguistic level, which limits the comprehensiveness of skeleton action understanding. In this work, we propose a Language-assisted Skeleton Action Understanding (LaSA) method that leverages the language modality to assist in learning semantic relationships among joints and actions. Specifically, in terms of joint relationships, the Joint Relationships Establishment (JRE) module establishes correlations among joints in the feature sequence by applying attention between joint texts and the feature sequence, and differentiates distinct joints by embedding joint texts as positional embeddings. Regarding action relationships, the Action Relationships Supervision (ARS) module enhances the discrimination across action classes through contrastive learning of single-class action-text pairs and models the semantic associations of adjacent actions by contrasting mixed-class clip-text pairs. Performance evaluation on five public datasets demonstrates that LaSA achieves state-of-the-art results. Code is available at https://github.com/HaoyuJi/LaSA.
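The two ideas outlined in the abstract can be illustrated with a minimal PyTorch sketch (not the authors' implementation; module names, tensor shapes, and the temperature value are assumptions). It shows (1) joint-name text embeddings used both as positional embeddings and as attention keys to relate joint features, and (2) a contrastive loss aligning pooled clip features with action-text embeddings; in practice the text embeddings would come from a frozen text encoder such as CLIP.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointTextAttention(nn.Module):
    """Sketch of text-guided joint relations: cross-attention between
    per-frame joint features and joint-name text embeddings."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, joint_feats: torch.Tensor, joint_text_emb: torch.Tensor):
        # joint_feats:    (B*T, J, C) per-frame joint features
        # joint_text_emb: (J, C) frozen text embeddings of the J joint names
        txt = joint_text_emb.unsqueeze(0).expand(joint_feats.size(0), -1, -1)
        x = joint_feats + txt                      # joint text as positional embedding
        out, _ = self.attn(query=x, key=txt, value=x)  # text-guided joint correlations
        return self.norm(joint_feats + out)


def clip_text_contrastive_loss(clip_feats, action_text_feats, labels, tau=0.07):
    """Sketch of action-text contrastive supervision for single-class clips."""
    # clip_feats:        (N, C) pooled features of sampled clips
    # action_text_feats: (K, C) text embeddings of the K action classes
    # labels:            (N,)   class index of each clip
    v = F.normalize(clip_feats, dim=-1)
    t = F.normalize(action_text_feats, dim=-1)
    logits = v @ t.t() / tau                       # (N, K) cosine similarities
    return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    B_T, J, C, K = 8, 25, 64, 10                   # hypothetical sizes
    jre = JointTextAttention(C)
    feats = torch.randn(B_T, J, C)
    joint_txt = torch.randn(J, C)                  # stand-in for encoded joint names
    print(jre(feats, joint_txt).shape)             # torch.Size([8, 25, 64])
    loss = clip_text_contrastive_loss(
        torch.randn(16, C), torch.randn(K, C), torch.randint(0, K, (16,)))
    print(loss.item())
```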
Acknowledgements
This work is supported by the National Key Research and Development Program of China under Grant 2022YFB4703200, by the National Natural Science Foundation of China under Grants 62261160652, 52275013, 62206075, and 61733011, and by the Guangdong Science and Technology Research Council under Grant 2020B1515120064.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Ji, H., Chen, B., Xu, X., Ren, W., Wang, Z., Liu, H. (2025). Language-Assisted Skeleton Action Understanding for Skeleton-Based Temporal Action Segmentation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15112. Springer, Cham. https://doi.org/10.1007/978-3-031-72949-2_23
DOI: https://doi.org/10.1007/978-3-031-72949-2_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72948-5
Online ISBN: 978-3-031-72949-2
eBook Packages: Computer Science, Computer Science (R0)