
Using efficient group pseudo-3D network to learn spatio-temporal features

  • Original Paper

Signal, Image and Video Processing

Abstract

Action classification has been a challenging problem in computer vision in recent years, and three-dimensional (3D) convolutional neural networks play an important role in spatio-temporal feature extraction. However, 3D convolution requires expensive computation and memory resources. This paper proposes an efficient group pseudo-3D (GP3D) convolution that reduces model size and computational cost. We build GP3D on MobileNetV3, which allows the 2D pre-trained parameters to be extended directly to the 3D convolutional network, and we also use GP3D to replace the original inflated 3D convolutional network to reduce its model size. Compared with other state-of-the-art 3D convolutional networks, GP3D with the efficient MobileNetV3 backbone uses about 3 to 22 times fewer parameters while maintaining the same accuracy on the UCF-101 dataset. GP3D combined with the inflated 3D convolutional network achieves about 90% top-1 accuracy at only about half the model size of the original inflated 3D network.
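To make the idea concrete, a pseudo-3D convolution factorizes a full t × k × k 3D kernel into a spatial 1 × k × k convolution followed by a temporal t × 1 × 1 convolution, and channel grouping shrinks the parameter count further. The following is a minimal PyTorch sketch of such a block; it illustrates the factorization only, and the class name, layer ordering, and hyperparameters are illustrative assumptions rather than the paper's exact GP3D architecture.

```python
import torch
import torch.nn as nn

class GroupPseudo3D(nn.Module):
    """Illustrative group pseudo-3D block: a grouped spatial (1 x k x k)
    convolution followed by a grouped temporal (k x 1 x 1) convolution,
    approximating a full k x k x k 3D convolution with fewer parameters.
    This is a sketch of the general technique, not the paper's exact design."""

    def __init__(self, channels, kernel_size=3, groups=4):
        super().__init__()
        pad = kernel_size // 2
        # Spatial convolution: mixes H and W only (kernel 1 x k x k).
        self.spatial = nn.Conv3d(
            channels, channels, kernel_size=(1, kernel_size, kernel_size),
            padding=(0, pad, pad), groups=groups, bias=False)
        # Temporal convolution: mixes T only (kernel k x 1 x 1).
        self.temporal = nn.Conv3d(
            channels, channels, kernel_size=(kernel_size, 1, 1),
            padding=(pad, 0, 0), groups=groups, bias=False)
        self.bn = nn.BatchNorm3d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        # x has shape (N, C, T, H, W): batch, channels, frames, height, width.
        return self.act(self.bn(self.temporal(self.spatial(x))))

# Usage: two clips of 8 frames at 112 x 112 resolution with 16 channels.
x = torch.randn(2, 16, 8, 112, 112)
y = GroupPseudo3D(16)(x)
print(y.shape)  # torch.Size([2, 16, 8, 112, 112])
```

For C channels, a plain 3 × 3 × 3 convolution holds 27·C² weights, while this factorized block with g groups holds (9 + 3)·C²/g, i.e. about 9 times fewer at g = 4; savings of this kind are what make such factorizations attractive for 3D networks.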



Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grant No. 61772352 and by the Science and Technology Planning Project of Sichuan Province under Grant Nos. 2019YFG0400, 2018GZDZX0031, 2018GZDZX0004, 2017GZDZX0003, 2018JY0182, 19ZDYF1286, and 2020YFG0322.

Author information


Corresponding author

Correspondence to Bing Guo.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Chen, Y., Guo, B., Shen, Y. et al. Using efficient group pseudo-3D network to learn spatio-temporal features. SIViP 15, 361–369 (2021). https://doi.org/10.1007/s11760-020-01758-5
