
STAR: Efficient SpatioTemporal Modeling for Action Recognition

Circuits, Systems, and Signal Processing

Abstract

Action recognition in video has gained significant attention over the past several years. While conventional 2D CNNs have found great success in understanding images, they are less effective at capturing the temporal relationships present in video. By contrast, 3D CNNs capture spatiotemporal information well but incur a high computational cost, making deployment challenging. In video, key information is typically confined to a small number of frames, yet many current approaches decompress and process every frame, wasting resources. Others work directly in the compressed domain but require multiple input streams to understand the data. In our work, we operate directly on compressed video and extract information solely from intra-coded frames (I-frames), avoiding the use of motion vectors and residuals for motion information and making this a single-stream network. This reduces processing time and energy consumption and, by extension, makes the approach accessible to a wider range of machines and uses. We evaluate our framework extensively on the UCF101 (Soomro et al. in UCF101: a dataset of 101 human actions classes from videos in the wild, 2012) and HMDB51 (Kuehne et al. in Proceedings of the International Conference on Computer Vision (ICCV), 2011) datasets and show that computational complexity is reduced significantly while accuracy remains competitive with existing compressed-domain efforts: 92.6% top-1 accuracy on UCF-101 and 62.9% on HMDB-51 with 24.3M parameters and 4 GFLOPS, and energy savings of over 11× on the two datasets versus CoViAR (Wu et al. in Compressed video action recognition, 2018).
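To make the pipeline in the abstract concrete, the following is a minimal sketch, not the authors' code: PyAV is told to discard every non-key frame so that only I-frames are ever decoded, and per-frame scores from an off-the-shelf 2D CNN are averaged into a single video-level prediction. The ResNet-18 backbone, the ImageNet-style preprocessing, the mean-score fusion, and the file name are all illustrative assumptions standing in for the paper's actual 24.3M-parameter network and fusion scheme.

import av                            # PyAV: FFmpeg bindings (pip install av)
import torch
import torchvision.models as models
import torchvision.transforms as T

def decode_iframes(path):
    """Yield RGB arrays for only the intra-coded (key) frames of a video."""
    container = av.open(path)
    stream = container.streams.video[0]
    # Ask the decoder to discard all non-key frames: P- and B-frames are
    # never reconstructed, which is where the compute/energy savings come from.
    stream.codec_context.skip_frame = "NONKEY"
    for frame in container.decode(stream):
        yield frame.to_ndarray(format="rgb24")

# Standard ImageNet-style preprocessing (an illustrative choice, not from the paper).
preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def classify_video(path, model):
    """Single-stream inference: average per-I-frame logits into one prediction."""
    model.eval()
    with torch.no_grad():
        logits = [model(preprocess(f).unsqueeze(0)) for f in decode_iframes(path)]
    return torch.cat(logits).mean(dim=0).argmax().item()

# Hypothetical stand-in backbone (untrained here); 101 classes matches UCF-101.
backbone = models.resnet18(num_classes=101)
print(classify_video("video.mp4", backbone))

Because only keyframes are decoded, a plain 2D backbone suffices and no second stream for motion vectors or residuals is needed, which is the single-stream property the abstract emphasizes.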



References

  1. R.V. Babu, M. Tom, P. Wadekar, A survey on compressed domain video analysis techniques. Multimedia Tools Appl. 75(2), 1043–1078 (2016). https://doi.org/10.1007/s11042-014-2345-z

  2. B. Battash, H. Barad, H. Tang, A. Bleiweiss, Mimic the raw domain: accelerating action recognition in the compressed domain, in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 2926–2934 (2020)

  3. H. Cao, S. Yu, J. Feng, Compressed Video Action Recognition with Refined Motion Vector (2019)

  4. J. Carreira, A. Zisserman, Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset (2018)

  5. Y. Chen, Y. Kalantidis, J. Li, S. Yan, J. Feng, Multi-fiber networks for video recognition. arXiv:1807.11195 (2018)

  6. Cisco Visual Networking Index: Forecast and Trends, 2017–2022 White Paper—Cisco (2019)

  7. M.A. Goodale, A.D. Milner, Separate visual pathways for perception and action. Trends Neurosci. 15(1), 20–25 (1992)

  8. K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition (2015)

  9. A.G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, H. Adam, MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861 (2017)

  10. H. Hu, W. Zhou, X. Li, N. Yan, H. Li, MV2Flow: learning motion representation for fast compressed video action recognition. ACM Trans. Multimedia Comput. Commun. Appl. 16(3s), 66 (2021). https://doi.org/10.1145/3422360

  11. S. Ioffe, C. Szegedy, Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167 (2015)

  12. S. Ji, W. Xu, M. Yang, K. Yu, 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 221–231 (2013). https://doi.org/10.1109/TPAMI.2012.59

  13. A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, L. Fei-Fei, Large-scale video classification with convolutional neural networks, in 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725–1732 (2014). https://doi.org/10.1109/CVPR.2014.223

  14. W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, A. Zisserman, The Kinetics Human Action Video Dataset (2017)

  15. D.P. Kingma, J. Ba, Adam: A Method for Stochastic Optimization (2017)

  16. A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks. Commun. ACM 60(6), 84–90 (2017). https://doi.org/10.1145/3065386

  17. H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, T. Serre, HMDB: a large video database for human motion recognition, in Proceedings of the International Conference on Computer Vision (ICCV) (2011)

  18. J. Lin, C. Gan, S. Han, Temporal shift module for efficient video understanding. arXiv:1811.08383 (2018)

  19. M.K. Mandal, Digital video compression techniques, in Multimedia Signals and Systems (Springer, Boston, 2003), pp. 203–237. https://doi.org/10.1007/978-1-4615-0265-4_9

  20. S. Ranjbar Alvar, H. Choi, I.V. Bajic, Can you tell a face from a HEVC bitstream? in 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), pp. 257–261 (2018). https://doi.org/10.1109/MIPR.2018.00060

  21. Z. Shou, X. Lin, Y. Kalantidis, L. Sevilla-Lara, M. Rohrbach, S.-F. Chang, Z. Yan, DMC-Net: Generating Discriminative Motion Cues for Fast Compressed Video Action Recognition (2019)

  22. K. Simonyan, A. Zisserman, Two-Stream Convolutional Networks for Action Recognition in Videos (2014)

  23. K. Soomro, A.R. Zamir, M. Shah, UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild (2012)

  24. S. Tomar, Converting video formats with FFmpeg. Linux J. 2006(146), 10 (2006)

  25. D. Tran, L. Bourdev, R. Fergus, L. Torresani, M. Paluri, Learning Spatiotemporal Features with 3D Convolutional Networks (2015)

  26. H. Wang, B. Raj, On the Origin of Deep Learning (2017)

  27. H. Wang, C. Schmid, Action recognition with improved trajectories, in 2013 IEEE International Conference on Computer Vision, pp. 3551–3558 (2013). https://doi.org/10.1109/ICCV.2013.441

  28. L. Wang, Y. Xiong, Z. Wang, Y. Qiao, Towards Good Practices for Very Deep Two-Stream ConvNets (2015)

  29. L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, L.V. Gool, Temporal Segment Networks: Towards Good Practices for Deep Action Recognition (2016)

  30. C.-Y. Wu, M. Zaheer, H. Hu, R. Manmatha, A.J. Smola, P. Krähenbühl, Compressed Video Action Recognition (2018)

  31. Z. Xu, Y. Yang, A.G. Hauptmann, A discriminative CNN video representation for event detection. arXiv:1411.4006 (2014)

  32. G. Yao, T. Lei, J. Zhong, A review of convolutional-neural-network-based action recognition. Pattern Recognit. Lett. 118, 14–22 (2019). https://doi.org/10.1016/j.patrec.2018.05.018

  33. L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, A. Courville, Describing Videos by Exploiting Temporal Structure (2015)

  34. C. Zhou, X. Chen, P. Sun, G. Zhang, W. Zhou, Compressed video action recognition using motion vector representation, in Pattern Recognition. ICPR International Workshops and Challenges, ed. by A. Del Bimbo, R. Cucchiara, S. Sclaroff, G.M. Farinella, T. Mei, M. Bertini, H.J. Escalante, R. Vezzani (Springer, Cham, 2021), pp. 701–713

  35. Y. Zhu, X. Li, C. Liu, M. Zolfaghari, Y. Xiong, C. Wu, Z. Zhang, J. Tighe, R. Manmatha, M. Li, A Comprehensive Study of Deep Video Action Recognition (2020)


Acknowledgements

This work was supported by the Center for Brain-inspired Computing (C-BRIC) and the Center for Research in Intelligent Storage and Processing in Memory (CRISP), two centers in the Semiconductor Research Corporation (SRC) program sponsored by DARPA.

Author information


Corresponding author

Correspondence to Abhijeet Kumar.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Kumar, A., Abrams, S., Kumar, A. et al. STAR: Efficient SpatioTemporal Modeling for Action Recognition. Circuits Syst Signal Process 42, 705–723 (2023). https://doi.org/10.1007/s00034-022-02160-x

