Beyond coordinate attention: spatial-temporal recalibration and channel scaling for skeleton-based action recognition

  • Original Paper
Signal, Image and Video Processing

Abstract

Skeleton-based action recognition is an active research topic in computer vision. Recent lightweight attention mechanisms (e.g. coordinate attention) have proven highly effective for this task. However, because coordinate attention captures long-range dependencies along the spatial and temporal directions separately, it cannot accurately capture long-range dependencies over the entire spatio-temporal domain, which inevitably leads to inaccurate spatio-temporal localization. In this work, we propose an efficient and lightweight attention mechanism, called coordinate enhanced attention, which consists of spatial-temporal recalibration and channel scaling. Spatial-temporal recalibration aims to capture precise long-range dependencies directly in the entire spatio-temporal domain, while channel scaling is introduced to efficiently utilize the multi-channel weight information. Coordinate enhanced attention is efficient and lightweight and can be easily integrated into classical neural networks. On two large-scale datasets for skeleton-based action recognition (i.e. NTU RGB+D 60 and NTU RGB+D 120), coordinate enhanced attention obtains consistent improvements. Experiments on two popular object detection datasets (i.e. COCO and Pascal VOC) and a semantic segmentation dataset (i.e. Cityscapes) indicate that the proposed coordinate enhanced attention outperforms other lightweight attention mechanisms, which further validates its transferability.
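
To illustrate the limitation described above, the following is a minimal NumPy sketch of direction-wise (coordinate-style) gating on a skeleton feature tensor. It is not the authors' implementation and not the proposed coordinate enhanced attention: the tensor layout (channels, frames, joints), the function names, and the omission of all learned layers are assumptions made purely for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def coordinate_style_gates(x):
    """Compute direction-wise gates for features x of shape (C, T, V).

    C = channels, T = frames, V = joints (layout is an assumption).
    Learned layers (convolutions, normalisation) are omitted for brevity.
    """
    temporal = x.mean(axis=2)  # pool over joints -> (C, T)
    spatial = x.mean(axis=1)   # pool over frames -> (C, V)
    return sigmoid(temporal), sigmoid(spatial)


def apply_gates(x, gate_t, gate_v):
    """Combine the two 1-D gates into a (C, T, V) map and rescale x."""
    attn = gate_t[:, :, None] * gate_v[:, None, :]
    return x * attn, attn


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16, 25))  # hypothetical: 4 channels, 16 frames, 25 joints

gate_t, gate_v = coordinate_style_gates(x)
y, attn = apply_gates(x, gate_t, gate_v)

# Each per-channel T x V attention map is an outer product of two vectors.
ranks = [int(np.linalg.matrix_rank(attn[c])) for c in range(attn.shape[0])]
print(y.shape, ranks)
```

In this sketch every per-channel attention map over frames and joints has rank 1, because it is the outer product of a temporal gate and a joint gate. A map of that form cannot single out an arbitrary (frame, joint) position independently of the rest of its row and column, which is one way to read the abstract's remark about inaccurate spatio-temporal localization.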

Data availability

All data generated or analysed during this study are included in this published article.

Funding

This work was supported by the National Natural Science Foundation of China under Grant 62072468 and the Natural Science Foundation of Shandong Province under Grant ZR2019MF073.

Author information

Contributions

JT was involved in conceptualization, writing (original draft), and validation. SG was involved in validation. YW and BL were involved in supervision, conceptualization, and funding acquisition. CD and BG were involved in software. All authors reviewed and approved the manuscript.

Corresponding author

Correspondence to Yanjiang Wang.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Tang, J., Gong, S., Wang, Y. et al. Beyond coordinate attention: spatial-temporal recalibration and channel scaling for skeleton-based action recognition. SIViP 18, 199–206 (2024). https://doi.org/10.1007/s11760-023-02747-0

  • DOI: https://doi.org/10.1007/s11760-023-02747-0