Hierarchical Spatial-Temporal Network for Skeleton-Based Temporal Action Segmentation

  • Conference paper
Pattern Recognition and Computer Vision (PRCV 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14434)

Abstract

Skeleton-based Temporal Action Segmentation (TAS) plays an important role in analyzing long videos of motion-centered human actions. Recent approaches model spatial and temporal information simultaneously in a single spatial-temporal topological graph, which incurs high computational cost because the graph is large. Moreover, multi-modal skeleton data carries rich semantic information that has not been fully explored. This paper proposes a Hierarchical Spatial-Temporal Network (HSTN) for skeleton-based TAS. In HSTN, the Multi-Branch Transfer Fusion (MBTF) module uses a multi-branch graph convolution structure with an attention mechanism to capture spatial dependencies in multi-modal skeleton data. In addition, the Multi-Scale Temporal Convolution (MSTC) module aggregates spatial information and performs multi-scale temporal modeling to capture long-range dependencies. Extensive experiments on two challenging datasets show that the proposed method outperforms State-of-the-Art (SOTA) methods.
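
The idea of multi-scale temporal modeling mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' MSTC module: the abstract does not give its layers, kernel sizes, or fusion rule, so everything below is an assumption made for illustration, including the fixed 3-tap averaging kernel, the dilations (1, 2, 4), the mean fusion of branches, and the hypothetical function names. The sketch only shows how parallel dilated convolutions over the time axis give each frame access to neighbours at several temporal ranges.

```python
import numpy as np


def dilated_temporal_conv(x, kernel, dilation):
    """1-D dilated convolution along time with 'same' zero padding.

    x: (T, C) per-frame features; kernel: (K,) taps shared by all channels.
    Assumes K is odd so the output keeps length T.
    """
    T, _ = x.shape
    K = len(kernel)
    pad = dilation * (K - 1) // 2
    padded = np.pad(x, ((pad, pad), (0, 0)))  # zero-pad the time axis only
    out = np.zeros_like(x, dtype=float)
    for k in range(K):
        start = k * dilation
        out += kernel[k] * padded[start:start + T]
    return out


def multi_scale_temporal_block(x, dilations=(1, 2, 4)):
    """Average several dilated branches (illustrative fusion, an assumption)."""
    kernel = np.ones(3) / 3.0  # fixed averaging taps; a real model would learn these
    branches = [dilated_temporal_conv(x, kernel, d) for d in dilations]
    return np.mean(branches, axis=0)


if __name__ == "__main__":
    frames = np.ones((16, 2))  # 16 frames, 2 feature channels
    fused = multi_scale_temporal_block(frames)
    print(fused.shape)
```

In this toy version, a larger dilation widens the temporal receptive field without adding taps, which is the usual motivation for combining several dilation rates when long-range dependencies matter.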

Author information

Correspondence to Tao Sun.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Tan, C., Sun, T., Fu, T., Wang, Y., Xu, M., Liu, S. (2024). Hierarchical Spatial-Temporal Network for Skeleton-Based Temporal Action Segmentation. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14434. Springer, Singapore. https://doi.org/10.1007/978-981-99-8549-4_3

  • DOI: https://doi.org/10.1007/978-981-99-8549-4_3

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8548-7

  • Online ISBN: 978-981-99-8549-4

  • eBook Packages: Computer Science, Computer Science (R0)
