Abstract
Spatiotemporal relationship modeling is a central and challenging problem in video action recognition. To address it, current methods typically rely on 2D or 3D CNN operations that model local spatiotemporal dependencies at fixed scales. However, most of these models fail to emphasize the keyframes and action-sensitive regions of the input video, which degrades performance. In this paper, we propose an action recognition network with local motion feature extraction and a spatiotemporal attention mechanism. The network consists of a motion capture (MC) module, which extracts detailed motion features, together with temporal attention (TA) and spatiotemporal attention (STA) modules, which learn, at the feature level, how much each frame and each spatial region contribute to the action. To evaluate the network, we construct a concrete water addition violation dataset (CWAVD), which can be used to detect illegal water addition by construction site workers and thereby improve the efficiency and quality of construction management. The proposed network achieves state-of-the-art performance on three of the most challenging benchmarks: UCF101 (97.6%), HMDB51 (77.3%), and SSV2 (67.8%).
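To make the roles of the three modules concrete, the following is a minimal PyTorch sketch of one plausible realization. The module names (MC, TA, STA) follow the abstract; the internals (frame-difference motion cues for MC, frame-level gating for TA, and per-location gating for STA) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the three components named in the abstract.
# The internals are assumptions for illustration, not the paper's code.
import torch
import torch.nn as nn

class MotionCapture(nn.Module):
    """MC (assumed form): derive local motion cues from adjacent-frame feature differences."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):  # x: (B, T, C, H, W)
        diff = x[:, 1:] - x[:, :-1]                    # temporal differences, (B, T-1, C, H, W)
        diff = torch.cat([diff, diff[:, -1:]], dim=1)  # pad back to T frames
        b, t, c, h, w = diff.shape
        motion = self.conv(diff.reshape(b * t, c, h, w)).reshape(b, t, c, h, w)
        return x + motion                              # fuse motion cues into appearance features

class TemporalAttention(nn.Module):
    """TA (assumed form): weight each frame by its estimated contribution to the action."""
    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(channels, channels // 4),
                                nn.ReLU(inplace=True),
                                nn.Linear(channels // 4, 1))

    def forward(self, x):  # x: (B, T, C, H, W)
        frame_desc = x.mean(dim=(3, 4))               # (B, T, C) global descriptor per frame
        weights = torch.sigmoid(self.fc(frame_desc))  # (B, T, 1) frame importance
        return x * weights.unsqueeze(-1).unsqueeze(-1)

class SpatiotemporalAttention(nn.Module):
    """STA (assumed form): weight each spatial location within each frame."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x):  # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape
        attn = torch.sigmoid(self.conv(x.reshape(b * t, c, h, w)))  # (B*T, 1, H, W)
        return x * attn.reshape(b, t, 1, h, w)

# Toy usage: 2 clips of 8 frames with 64-channel 14x14 feature maps.
x = torch.randn(2, 8, 64, 14, 14)
x = MotionCapture(64)(x)
x = TemporalAttention(64)(x)
x = SpatiotemporalAttention(64)(x)
print(x.shape)  # torch.Size([2, 8, 64, 14, 14])
```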
Data availability
The datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request.
Acknowledgements
This work was supported by the National Key R&D Program of China (No. 2022YFB2602203).
Author information
Contributions
XS contributed to writing—review and editing, and methodology. DZ contributed to writing—original draft, and software. LL carried out data management and analysis. MH and XH performed writing—original draft, and writing—review and editing.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Song, X., Zhang, D., Liang, L. et al. Local motion feature extraction and spatiotemporal attention mechanism for action recognition. Vis Comput 40, 7747–7759 (2024). https://doi.org/10.1007/s00371-023-03205-1