
Using efficient group pseudo-3D network to learn spatio-temporal features

  • Original Paper

Signal, Image and Video Processing

Abstract

Action classification has been a challenging problem in computer vision in recent years, and three-dimensional (3D) convolutional neural networks play an important role in spatio-temporal feature extraction. However, 3D convolution requires expensive computation and memory resources. This paper proposes an efficient group pseudo-3D (GP3D) convolution that reduces model size and computational cost. We build GP3D on MobileNetV3, which allows the 2D pre-trained parameters to be extended directly to the 3D convolutional network, and we also use GP3D to replace the original inflated 3D convolutional network to reduce its model size. Compared with other state-of-the-art 3D convolutional networks, GP3D with the efficient MobileNetV3 backbone uses about 3 to 22 times fewer parameters while maintaining the same accuracy on the UCF-101 dataset. GP3D combined with the inflated 3D convolutional network achieves about 90% top-1 accuracy at only about half the model size of the original inflated 3D network.
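To make the idea concrete, a pseudo-3D convolution factorizes a full t × k × k 3D kernel into a spatial 1 × k × k convolution followed by a temporal t × 1 × 1 convolution, and channel grouping shrinks the parameter count further. The following is a minimal PyTorch sketch of such a block; it illustrates the factorization only, and the class name, layer ordering, and hyperparameters are illustrative assumptions rather than the paper's exact GP3D architecture.

```python
import torch
import torch.nn as nn

class GroupPseudo3D(nn.Module):
    """Illustrative group pseudo-3D block: a grouped spatial (1 x k x k)
    convolution followed by a grouped temporal (k x 1 x 1) convolution,
    approximating a full k x k x k 3D convolution with fewer parameters.
    This is a sketch of the general technique, not the paper's exact design."""

    def __init__(self, channels, kernel_size=3, groups=4):
        super().__init__()
        pad = kernel_size // 2
        # Spatial convolution: mixes H and W only (kernel 1 x k x k).
        self.spatial = nn.Conv3d(
            channels, channels, kernel_size=(1, kernel_size, kernel_size),
            padding=(0, pad, pad), groups=groups, bias=False)
        # Temporal convolution: mixes T only (kernel k x 1 x 1).
        self.temporal = nn.Conv3d(
            channels, channels, kernel_size=(kernel_size, 1, 1),
            padding=(pad, 0, 0), groups=groups, bias=False)
        self.bn = nn.BatchNorm3d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        # x has shape (N, C, T, H, W): batch, channels, frames, height, width.
        return self.act(self.bn(self.temporal(self.spatial(x))))

# Usage: two clips of 8 frames at 112 x 112 resolution with 16 channels.
x = torch.randn(2, 16, 8, 112, 112)
y = GroupPseudo3D(16)(x)
print(y.shape)  # torch.Size([2, 16, 8, 112, 112])
```

For C channels, a plain 3 × 3 × 3 convolution holds 27·C² weights, while this factorized block with g groups holds (9 + 3)·C²/g, i.e. about 9 times fewer at g = 4; savings of this kind are what make such factorizations attractive for 3D networks.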



Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grant No. 61772352 and by the Science and Technology Planning Project of Sichuan Province under Grant Nos. 2019YFG0400, 2018GZDZX0031, 2018GZDZX0004, 2017GZDZX0003, 2018JY0182, 19ZDYF1286, and 2020YFG0322.

Author information


Corresponding author

Correspondence to Bing Guo.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Chen, Y., Guo, B., Shen, Y. et al. Using efficient group pseudo-3D network to learn spatio-temporal features. SIViP 15, 361–369 (2021). https://doi.org/10.1007/s11760-020-01758-5
