
STAR: Efficient SpatioTemporal Modeling for Action Recognition

Circuits, Systems, and Signal Processing

Abstract

Action recognition in video has gained significant attention over the past several years. While conventional 2D CNNs have found great success in understanding images, they are less effective at capturing the temporal relationships present in video. By contrast, 3D CNNs capture spatiotemporal information well but incur a high computational cost, making deployment challenging. In video, key information is typically confined to a small number of frames, yet many current approaches decompress and process every frame, wasting resources. Others work directly in the compressed domain but require multiple input streams to understand the data. In our work, we operate directly on compressed video and extract information solely from intra-coded frames (I-frames), avoiding the use of motion vectors and residuals for motion information and making this a single-stream network. This reduces processing time and energy consumption and, by extension, makes the approach accessible to a wider range of machines and uses. We evaluate our framework extensively on the UCF101 (Soomro et al. in UCF101: a dataset of 101 human actions classes from videos in the wild, 2012) and HMDB51 (Kuehne et al. in Proceedings of the International Conference on Computer Vision (ICCV), 2011) datasets and show that computational complexity is reduced significantly while accuracy remains competitive with existing compressed-domain efforts: 92.6% top-1 accuracy on UCF-101 and 62.9% on HMDB-51 with 24.3M parameters and 4 GFLOPS, and energy savings of over 11× on the two datasets versus CoViAR (Wu et al. in Compressed video action recognition, 2018).
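To make the pipeline in the abstract concrete, the following is a minimal sketch, not the authors' code: PyAV is told to discard every non-key frame so that only I-frames are ever decoded, and per-frame scores from an off-the-shelf 2D CNN are averaged into a single video-level prediction. The ResNet-18 backbone, the ImageNet-style preprocessing, the mean-score fusion, and the file name are all illustrative assumptions standing in for the paper's actual 24.3M-parameter network and fusion scheme.

import av                            # PyAV: FFmpeg bindings (pip install av)
import torch
import torchvision.models as models
import torchvision.transforms as T

def decode_iframes(path):
    """Yield RGB arrays for only the intra-coded (key) frames of a video."""
    container = av.open(path)
    stream = container.streams.video[0]
    # Ask the decoder to discard all non-key frames: P- and B-frames are
    # never reconstructed, which is where the compute/energy savings come from.
    stream.codec_context.skip_frame = "NONKEY"
    for frame in container.decode(stream):
        yield frame.to_ndarray(format="rgb24")

# Standard ImageNet-style preprocessing (an illustrative choice, not from the paper).
preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def classify_video(path, model):
    """Single-stream inference: average per-I-frame logits into one prediction."""
    model.eval()
    with torch.no_grad():
        logits = [model(preprocess(f).unsqueeze(0)) for f in decode_iframes(path)]
    return torch.cat(logits).mean(dim=0).argmax().item()

# Hypothetical stand-in backbone (untrained here); 101 classes matches UCF-101.
backbone = models.resnet18(num_classes=101)
print(classify_video("video.mp4", backbone))

Because only keyframes are decoded, a plain 2D backbone suffices and no second stream for motion vectors or residuals is needed, which is the single-stream property the abstract emphasizes.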



References

  1. R.V. Babu, M. Tom, P. Wadekar, A survey on compressed domain video analysis techniques. Multimedia Tools Appl. 75(2), 1043–1078 (2016). https://doi.org/10.1007/s11042-014-2345-z

  2. B. Battash, H. Barad, H. Tang, A. Bleiweiss, Mimic the raw domain: accelerating action recognition in the compressed domain, in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 2926–2934 (2020)

  3. H. Cao, S. Yu, J. Feng, Compressed Video Action Recognition with Refined Motion Vector (2019)

  4. J. Carreira, A. Zisserman, Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset (2018)

  5. Y. Chen, Y. Kalantidis, J. Li, S. Yan, J. Feng, Multi-fiber networks for video recognition. arXiv:1807.11195 (2018)

  6. Cisco Visual Networking Index: Forecast and Trends, 2017–2022 White Paper—Cisco (2019)

  7. M.A. Goodale, A.D. Milner, Separate visual pathways for perception and action. Trends Neurosci. 15(1), 20–25 (1992)

  8. K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition (2015)

  9. A.G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, H. Adam, MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861 (2017)

  10. H. Hu, W. Zhou, X. Li, N. Yan, H. Li, MV2Flow: learning motion representation for fast compressed video action recognition. ACM Trans. Multimedia Comput. Commun. Appl. 16(3s), 66 (2021). https://doi.org/10.1145/3422360

  11. S. Ioffe, C. Szegedy, Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167 (2015)

  12. S. Ji, W. Xu, M. Yang, K. Yu, 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 221–231 (2013). https://doi.org/10.1109/TPAMI.2012.59

  13. A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, L. Fei-Fei, Large-scale video classification with convolutional neural networks, in 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725–1732 (2014). https://doi.org/10.1109/CVPR.2014.223

  14. W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, A. Zisserman, The Kinetics Human Action Video Dataset (2017)

  15. D.P. Kingma, J. Ba, Adam: A Method for Stochastic Optimization (2017)

  16. A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks. Commun. ACM 60(6), 84–90 (2017). https://doi.org/10.1145/3065386

  17. H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, T. Serre, HMDB: a large video database for human motion recognition, in Proceedings of the International Conference on Computer Vision (ICCV) (2011)

  18. J. Lin, C. Gan, S. Han, Temporal shift module for efficient video understanding. arXiv:1811.08383 (2018)

  19. M.K. Mandal, Digital video compression techniques, in Multimedia Signals and Systems (Springer, Boston, 2003), pp. 203–237. https://doi.org/10.1007/978-1-4615-0265-4_9

  20. S. Ranjbar Alvar, H. Choi, I.V. Bajic, Can you tell a face from a HEVC bitstream? in 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), pp. 257–261 (2018). https://doi.org/10.1109/MIPR.2018.00060

  21. Z. Shou, X. Lin, Y. Kalantidis, L. Sevilla-Lara, M. Rohrbach, S.-F. Chang, Z. Yan, DMC-Net: Generating Discriminative Motion Cues for Fast Compressed Video Action Recognition (2019)

  22. K. Simonyan, A. Zisserman, Two-Stream Convolutional Networks for Action Recognition in Videos (2014)

  23. K. Soomro, A.R. Zamir, M. Shah, UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild (2012)

  24. S. Tomar, Converting video formats with FFmpeg. Linux J. 2006(146), 10 (2006)

  25. D. Tran, L. Bourdev, R. Fergus, L. Torresani, M. Paluri, Learning Spatiotemporal Features with 3D Convolutional Networks (2015)

  26. H. Wang, B. Raj, On the Origin of Deep Learning (2017)

  27. H. Wang, C. Schmid, Action recognition with improved trajectories, in 2013 IEEE International Conference on Computer Vision, pp. 3551–3558 (2013). https://doi.org/10.1109/ICCV.2013.441

  28. L. Wang, Y. Xiong, Z. Wang, Y. Qiao, Towards Good Practices for Very Deep Two-Stream ConvNets (2015)

  29. L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, L.V. Gool, Temporal Segment Networks: Towards Good Practices for Deep Action Recognition (2016)

  30. C.-Y. Wu, M. Zaheer, H. Hu, R. Manmatha, A.J. Smola, P. Krähenbühl, Compressed Video Action Recognition (2018)

  31. Z. Xu, Y. Yang, A.G. Hauptmann, A discriminative CNN video representation for event detection. arXiv:1411.4006 (2014)

  32. G. Yao, T. Lei, J. Zhong, A review of convolutional-neural-network-based action recognition. Pattern Recognit. Lett. 118, 14–22 (2019). https://doi.org/10.1016/j.patrec.2018.05.018

  33. L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, A. Courville, Describing Videos by Exploiting Temporal Structure (2015)

  34. C. Zhou, X. Chen, P. Sun, G. Zhang, W. Zhou, Compressed video action recognition using motion vector representation, in Pattern Recognition. ICPR International Workshops and Challenges, ed. by A. Del Bimbo, R. Cucchiara, S. Sclaroff, G.M. Farinella, T. Mei, M. Bertini, H.J. Escalante, R. Vezzani (Springer, Cham, 2021), pp. 701–713

  35. Y. Zhu, X. Li, C. Liu, M. Zolfaghari, Y. Xiong, C. Wu, Z. Zhang, J. Tighe, R. Manmatha, M. Li, A Comprehensive Study of Deep Video Action Recognition (2020)


Acknowledgements

This work was supported by the Center for Brain-inspired Computing (C-BRIC) and the Center for Research in Intelligent Storage and Processing in Memory (CRISP), two centers in the Semiconductor Research Corporation (SRC) program sponsored by DARPA.

Author information


Corresponding author

Correspondence to Abhijeet Kumar.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Kumar, A., Abrams, S., Kumar, A. et al. STAR: Efficient SpatioTemporal Modeling for Action Recognition. Circuits Syst Signal Process 42, 705–723 (2023). https://doi.org/10.1007/s00034-022-02160-x

