Abstract
Skeleton-based Temporal Action Segmentation (STAS) aims to densely segment and classify human actions in long, untrimmed skeletal motion sequences. Existing STAS methods primarily model spatial dependencies among joints and temporal relationships among frames to generate frame-level one-hot classifications. However, these methods overlook the deep mining of semantic relations among joints as well as actions at a linguistic level, which limits the comprehensiveness of skeleton action understanding. In this work, we propose a Language-assisted Skeleton Action Understanding (LaSA) method that leverages the language modality to assist in learning semantic relationships among joints and actions. Specifically, in terms of joint relationships, the Joint Relationships Establishment (JRE) module establishes correlations among joints in the feature sequence by applying attention between joint texts and the feature sequence, and differentiates distinct joints by embedding joint texts as positional embeddings. Regarding action relationships, the Action Relationships Supervision (ARS) module enhances the discrimination across action classes through contrastive learning of single-class action-text pairs and models the semantic associations of adjacent actions by contrasting mixed-class clip-text pairs. Performance evaluation on five public datasets demonstrates that LaSA achieves state-of-the-art results. Code is available at https://github.com/HaoyuJi/LaSA.
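The two ideas outlined in the abstract can be illustrated with a minimal PyTorch sketch (not the authors' implementation; module names, tensor shapes, and the temperature value are assumptions). It shows (1) joint-name text embeddings used both as positional embeddings and as attention keys to relate joint features, and (2) a contrastive loss aligning pooled clip features with action-text embeddings; in practice the text embeddings would come from a frozen text encoder such as CLIP.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointTextAttention(nn.Module):
    """Sketch of text-guided joint relations: cross-attention between
    per-frame joint features and joint-name text embeddings."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, joint_feats: torch.Tensor, joint_text_emb: torch.Tensor):
        # joint_feats:    (B*T, J, C) per-frame joint features
        # joint_text_emb: (J, C) frozen text embeddings of the J joint names
        txt = joint_text_emb.unsqueeze(0).expand(joint_feats.size(0), -1, -1)
        x = joint_feats + txt                      # joint text as positional embedding
        out, _ = self.attn(query=x, key=txt, value=x)  # text-guided joint correlations
        return self.norm(joint_feats + out)


def clip_text_contrastive_loss(clip_feats, action_text_feats, labels, tau=0.07):
    """Sketch of action-text contrastive supervision for single-class clips."""
    # clip_feats:        (N, C) pooled features of sampled clips
    # action_text_feats: (K, C) text embeddings of the K action classes
    # labels:            (N,)   class index of each clip
    v = F.normalize(clip_feats, dim=-1)
    t = F.normalize(action_text_feats, dim=-1)
    logits = v @ t.t() / tau                       # (N, K) cosine similarities
    return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    B_T, J, C, K = 8, 25, 64, 10                   # hypothetical sizes
    jre = JointTextAttention(C)
    feats = torch.randn(B_T, J, C)
    joint_txt = torch.randn(J, C)                  # stand-in for encoded joint names
    print(jre(feats, joint_txt).shape)             # torch.Size([8, 25, 64])
    loss = clip_text_contrastive_loss(
        torch.randn(16, C), torch.randn(K, C), torch.randint(0, K, (16,)))
    print(loss.item())
```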
Acknowledgements
This work is supported by the National Key Research and Development Program of China under Grant 2022YFB4703200, by the National Natural Science Foundation of China under Grants 62261160652, 52275013, 62206075, and 61733011, and by the Guangdong Science and Technology Research Council under Grant 2020B1515120064.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Ji, H., Chen, B., Xu, X., Ren, W., Wang, Z., Liu, H. (2025). Language-Assisted Skeleton Action Understanding for Skeleton-Based Temporal Action Segmentation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15112. Springer, Cham. https://doi.org/10.1007/978-3-031-72949-2_23
DOI: https://doi.org/10.1007/978-3-031-72949-2_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72948-5
Online ISBN: 978-3-031-72949-2
eBook Packages: Computer Science, Computer Science (R0)