
Multi-receptive field spatiotemporal network for action recognition

  • Original Article
  • Published in: International Journal of Machine Learning and Cybernetics

Abstract

Despite the great progress that deep neural networks have brought to action recognition, visual tempo is often overlooked in the feature learning process of existing methods. Visual tempo describes the dynamics and the temporal scale variation of an action. Existing models usually understand spatiotemporal scenes with temporal and spatial convolutions whose receptive fields are fixed in both dimensions, so they cannot cope with variations in visual tempo. To address this issue, we propose a multi-receptive field spatiotemporal (MRF-ST) network that effectively models the spatial and temporal information of different receptive fields. In the proposed network, dilated convolution is utilized to obtain different receptive fields, and dynamic weighting over the different dilation rates is designed based on the attention mechanism. The MRF-ST network can thus directly capture various visual tempos in the same network layer without any additional cost, and it improves the accuracy of action recognition by learning the visual tempos of different actions. Extensive evaluations show that MRF-ST achieves state-of-the-art results on three popular action recognition benchmarks: UCF-101, HMDB-51, and Diving-48. Further analysis indicates that MRF-ST significantly improves performance in scenes with large variance in visual tempo.
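To make the described mechanism concrete, the following is a minimal sketch, assuming PyTorch, of a multi-receptive-field block: parallel 3D convolutions with different dilation rates yield features at several receptive fields, and an attention head predicts one weight per dilation rate to fuse them. The class name `MRFBlock`, the pooling-based attention head, and all hyperparameters are illustrative assumptions, not the authors' exact architecture.

```python
# A hypothetical multi-receptive-field block: dilated 3D convolutions fused by
# attention weights over dilation rates, as sketched from the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MRFBlock(nn.Module):
    """Parallel 3D convolutions with different dilation rates (hence different
    receptive fields), fused by attention weights predicted from the input."""

    def __init__(self, channels: int, dilations=(1, 2, 3)):
        super().__init__()
        # One branch per dilation rate; padding=d keeps the output size equal
        # to the input size for a 3x3x3 kernel with dilation d.
        self.branches = nn.ModuleList([
            nn.Conv3d(channels, channels, kernel_size=3,
                      padding=d, dilation=d, bias=False)
            for d in dilations
        ])
        # Attention head: global average pooling -> per-branch weight.
        self.attn = nn.Sequential(
            nn.Linear(channels, channels // 4),
            nn.ReLU(inplace=True),
            nn.Linear(channels // 4, len(dilations)),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, height, width)
        feats = torch.stack([b(x) for b in self.branches], dim=1)  # (B, K, C, T, H, W)
        ctx = x.mean(dim=(2, 3, 4))              # (B, C) global context
        w = F.softmax(self.attn(ctx), dim=1)     # (B, K) weights over dilation rates
        w = w.view(*w.shape, 1, 1, 1, 1)         # broadcast over C, T, H, W
        return (w * feats).sum(dim=1) + x        # weighted fusion + residual


if __name__ == "__main__":
    block = MRFBlock(channels=64)
    clip = torch.randn(2, 64, 8, 56, 56)   # two clips of 8 frames
    print(block(clip).shape)                # torch.Size([2, 64, 8, 56, 56])
```

The softmax makes the fusion a convex combination of the branches, so the block can smoothly favor small or large receptive fields per input clip, which is one plausible way to realize the "dynamic weighting for different dilation rates" the abstract describes.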



Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grant No. 61773117.

Author information


Corresponding author

Correspondence to Wankou Yang.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Nie, M., Yang, S., Wang, Z. et al. Multi-receptive field spatiotemporal network for action recognition. Int. J. Mach. Learn. & Cyber. 14, 2439–2453 (2023). https://doi.org/10.1007/s13042-023-01774-0

