Abstract
Temporal action segmentation (TAS) locates and classifies human action segments in minutes-long untrimmed videos using multiple action class labels. Prior work on this task typically generates an initial prediction with stacked temporal convolutional layers and then refines that prediction in stages, relying solely on RGB features. This approach has several limitations: it cannot capture inherent long-range dependencies, and it gives insufficient consideration to the intricate spatial-temporal correlations in the changing relationships between human joints. To address these constraints, we introduce a novel spatial-temporal graph transformer network (STGT) for the skeleton-based TAS task. STGT stacks skeleton graph transformer blocks (SGT blocks) within an encoder-decoder architecture. In particular, the spatial-temporal graph layer with an adaptive graph strategy enhances the graph structure, making it more flexible and robust, while the spatial-temporal transformer layer in each SGT block builds parallel attention mechanisms that model dynamic spatial and non-linear temporal correlations. Experimental results on three challenging datasets (PKU-MMD, HuGaDB, and LARa) show that the proposed framework outperforms existing TAS models (MS-TCN, ASRF, BCN, ETSN, and ASFormer). Moreover, our approach effectively alleviates over-segmentation errors and ambiguous boundaries.
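To make the abstract's architectural ideas concrete, the sketch below illustrates the general pattern an SGT block describes: an adaptive graph step that augments the fixed skeleton adjacency with a data-driven affinity matrix, followed by parallel spatial (across joints) and temporal (across frames) self-attention branches. This is a minimal NumPy illustration of the generic technique, not the authors' implementation; the function names (`sgt_block`, `self_attention`) and the single-head, residual-sum design are our own assumptions for exposition.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Single-head scaled dot-product self-attention over the second-to-last axis.

    x: array of shape (..., N, C); returns the same shape.
    """
    c = x.shape[-1]
    scores = x @ x.swapaxes(-1, -2) / np.sqrt(c)   # (..., N, N) pairwise affinities
    return softmax(scores, axis=-1) @ x

def sgt_block(x, adj):
    """Illustrative SGT-style block (our sketch, not the paper's exact layer).

    x:   (T, J, C) skeleton features (frames, joints, channels)
    adj: (J, J)    fixed skeleton adjacency with self-loops
    """
    c = x.shape[-1]
    # Adaptive graph strategy: add a learned/data-driven affinity to the fixed graph
    affinity = softmax(x @ x.swapaxes(-1, -2) / np.sqrt(c), axis=-1)   # (T, J, J)
    graph = adj[None, :, :] + affinity
    agg = (graph @ x) / graph.sum(-1, keepdims=True)                   # graph aggregation
    # Parallel attention branches: spatial (joints within a frame), temporal (frames per joint)
    spatial = self_attention(agg)                                      # attends over J
    temporal = self_attention(agg.transpose(1, 0, 2)).transpose(1, 0, 2)  # attends over T
    return agg + spatial + temporal                                    # residual sum
```

The key design point mirrored here is that the spatial and temporal branches run in parallel over the same graph-aggregated features, rather than being applied sequentially as in many factorized spatial-temporal models.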
Data Availability
Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.
References
Zhang Z, Wang W, Tian X (2023) Semantic segmentation of metal surface defects and corresponding strategies. IEEE Trans Instrum Meas 72:1–13
Casini L, Marchetti N, Montanucci A et al (2023) A human-AI collaboration workflow for archaeological sites detection. Sci Rep 13(1):8699
Kong F, Wang Y (2019) Multimodal interface interaction design model based on dynamic augmented reality. Multimedia Tools Appl 78:4623–4653
Ding G, Sener F, Yao A (2022) Temporal action segmentation: an analysis of modern techniques. arXiv:2210.10352
Rashmi M, Ashwin TS, Guddeti RMR (2021) Surveillance video analysis for student action recognition and localization inside computer laboratories of a smart campus. Multimedia Tools Appl 80:2907–2929
Tsai MF, Huang SH (2022) Enhancing accuracy of human action recognition system using skeleton point correction method. Multimedia Tools Appl 81(5):7439–7459
Carreira J, Zisserman A (2017) Quo vadis, action recognition? A new model and the Kinetics dataset. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 6299–6308
Soomro K, Zamir AR, Shah M (2012) UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv:1212.0402
Lea C, Flynn MD, Vidal R et al (2017) Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 156–165
Kuehne H, Gall J, Serre T (2016) An end-to-end generative framework for video segmentation and recognition. In: 2016 IEEE winter conference on applications of computer vision (WACV). IEEE, pp 1–8
Farha YA, Gall J (2019) MS-TCN: Multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp 3575–3584
Li SJ, AbuFarha Y, Liu Y et al (2020) MS-TCN++: Multi-stage temporal convolutional network for action segmentation. IEEE Trans Pattern Anal Mach Intell 45:6647–6658
Ishikawa Y, Kasai S, Aoki Y et al (2021) Alleviating over-segmentation errors by detecting action boundaries. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision. pp 2322–2331
Wang Z, Gao Z, Wang L et al (2020) Boundary-aware cascade networks for temporal action segmentation. In: Computer vision-ECCV 2020: 16th European conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XXV. Springer International Publishing, pp 34–51
Yang D, Cao Z, Mao L et al (2023) A temporal and channel-combined attention block for action segmentation. Appl Intell 53(3):2738–2750
Li Y, Dong Z, Liu K et al (2021) Efficient two-step networks for temporal action segmentation. Neurocomputing 454:373–381
Yi F, Wen H, Jiang T (2021) ASFormer: Transformer for action segmentation. arXiv:2110.08568
Aziere N, Todorovic S (2022) Multistage temporal convolution transformer for action segmentation. Image Vis Comput 128:104567
Tian X, Jin Y, Tang X (2023) Local-global transformer neural network for temporal action segmentation. Multimedia Syst 29(2):615–626
Tian X, Jin Y, Tang X (2023) TSRN: two-stage refinement network for temporal action segmentation. Pattern Anal Appl 26:1375–1393
Singhania D, Rahaman R, Yao A (2021) Coarse to fine multi-resolution temporal convolutional network. arXiv:2105.10859
Park J, Kim D, Huh S et al (2022) Maximization and restoration: Action segmentation through dilation passing and temporal reconstruction. Pattern Recogn 129:108764
Du D, Su B, Li Y et al (2022) Efficient U-transformer with boundary-aware loss for action segmentation. arXiv:2205.13425
Kipf TN, Welling M (2016) Semi-supervised classification with graph convolutional networks. arXiv:1609.02907
Shi L, Zhang Y, Cheng J et al (2019) Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp 12026–12035
Plizzari C, Cannici M, Matteucci M (2021) Spatial temporal transformer network for skeleton-based action recognition. In: Pattern recognition. ICPR international workshops and challenges: virtual event, January 10-15, 2021, Proceedings, Part III. Springer International Publishing, pp 694–701
Shi L, Zhang Y, Cheng J et al (2020) Decoupled spatial-temporal attention network for skeleton-based action-gesture recognition. In: Proceedings of the Asian conference on computer vision
Filtjens B, Vanrumste B, Slaets P (2022) Skeleton-based action segmentation with multi-stage spatial-temporal graph convolutional neural networks. IEEE Trans Emerg Top Comput. https://doi.org/10.1109/TETC.2022.3230912
Xu L, Wang Q, Lin X et al (2023) An efficient framework for few-shot skeleton-based temporal action segmentation. Comput Vis Image Underst 232:103707
Liu K, Li Y, Xu Y et al (2022) Spatial focus attention for fine-grained skeleton-based action task. IEEE Signal Process Lett 29:1883–1887
Chen J, Zhong M, Li J et al (2021) Effective deep attributed network representation learning with topology adapted smoothing. IEEE Trans Cybern 52(7):5935–5946
Chen J, Zhong M, Li J, Liu Y, Zhang H, Xu D et al (2022) Graph transformer network with temporal kernel attention for skeleton-based action recognition. Knowl-Based Syst 240:108146
Liu Z, Zhang H, Chen Z et al (2020) Disentangling and unifying graph convolutions for skeleton-based action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp 143–152
Niemann F, Reining C, Moya Rueda F et al (2020) LARa: Creating a dataset for human activity recognition in logistics using semantic attributes. Sensors 20(15):4083
Yan S, Xiong Y, Lin D (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. Proc AAAI Conf Artif Intell 32(1). https://doi.org/10.1609/aaai.v32i1.12328
Si C, Chen W, Wang W et al (2019) An attention enhanced graph convolutional LSTM network for skeleton-based action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp 1227–1236
Li C, Zhong Q, Xie D et al (2018) Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation. arXiv:1804.06055
Caetano C, Sena J, Brémond F et al (2019) SkeleMotion: A new representation of skeleton joint sequences based on motion information for 3D action recognition. In: 2019 16th IEEE international conference on advanced video and signal based surveillance (AVSS). IEEE, pp 1–8
Li H, Zhang Z, Zhao X et al (2022) Enhancing multi-modal features using local self-attention for 3D object detection. In: Computer vision-ECCV 2022: 17th European conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part X. Springer Nature Switzerland, Cham, pp 532–549
Li W, Huang L (2023) YOLOSA: Object detection based on 2D local feature superimposed self-attention. Pattern Recogn Lett 168:86–92
Ribeiro LFR, Saverese PHP, Figueiredo DR (2017) struc2vec: Learning node representations from structural identity. In: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining. pp 385–394
Vaswani A, Shazeer N, Parmar N et al (2017) Attention is all you need. Adv Neural Inf Process Syst 30:1–11
Liu C, Hu Y, Li Y et al (2017) PKU-MMD: A large scale benchmark for skeleton-based human action understanding. In: Proceedings of the workshop on visual analysis in smart and connected communities. pp 1–8
Chereshnev R, Kertész-Farkas A (2018) HuGaDB: Human gait database for activity recognition from wearable inertial sensor networks. In: Analysis of images, social networks and texts: 6th international conference, AIST 2017, Moscow, Russia, July 27-29, 2017, Revised Selected Papers. Springer International Publishing, pp 131–141
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant Number: 51935005), the Basic Scientific Research Project (Grant Number: JCKY20200603C010), and the Natural Science Foundation of Heilongjiang Province of China (Grant Number: LH2021F023).
Ethics declarations
Conflicts of interest
The authors declare that they have no conflict of interest.
Cite this article
Tian, X., Jin, Y., Zhang, Z. et al. Spatial-temporal graph transformer network for skeleton-based temporal action segmentation. Multimed Tools Appl 83, 44273–44297 (2024). https://doi.org/10.1007/s11042-023-17276-8