
3D convolutional networks with multi-layer-pooling selection fusion for video classification

Published in Multimedia Tools and Applications

Abstract

C3D has been widely used for video representation and understanding. However, it operates on spatio-temporal contexts from a global view, which often weakens its capacity to learn local representations. To alleviate this problem, we introduce a concise and novel multi-layer feature-fusion network in which local and global views cooperate. The global-view branch learns the core video semantics, while the local-view branch captures contextual local semantics. Unlike the traditional C3D model, the global-view branch supplies the local-view branch with the most activated video features drawn from a broader 3D receptive field. By adding such shallow-view contexts, the local-view branch learns more robust and discriminative spatio-temporal representations for video classification. We therefore propose 3D convolutional networks with multi-layer-pooling selection fusion: the integrated deep global feature is combined with information originating from the shallow layers of the local feature-extraction network through three different pooling units (spatio-temporal pyramid pooling, adaptive pooling, and attention pooling), yielding complementary spatio-temporal features that are finally concatenated and used for classification. Experiments on the UCF-101 and HMDB-51 datasets achieve classification accuracies of 95.0% and 72.2%, respectively. The results show that the proposed network achieves better classification performance.
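The fusion step described above can be sketched in a few lines. The following is a minimal NumPy illustration, not the authors' implementation: the grid levels of the pyramid pooling and the unlearned saliency scores standing in for a learned attention module are assumptions made purely for demonstration. It shows how the three pooling units each reduce a (C, T, H, W) feature map to a vector, and how the results are concatenated into one descriptor for the classifier.

```python
import numpy as np

def pyramid_pool(feat, levels=(1, 2)):
    """Spatio-temporal pyramid pooling: average over a TxHxW grid at each level."""
    C, T, H, W = feat.shape
    cells = []
    for L in levels:
        for t in range(L):
            for h in range(L):
                for w in range(L):
                    block = feat[:,
                                 t * T // L:(t + 1) * T // L,
                                 h * H // L:(h + 1) * H // L,
                                 w * W // L:(w + 1) * W // L]
                    cells.append(block.mean(axis=(1, 2, 3)))
    return np.concatenate(cells)        # length C * sum(L**3 for L in levels)

def adaptive_pool(feat):
    """Global average pooling: collapses T, H, W regardless of their sizes."""
    return feat.mean(axis=(1, 2, 3))    # length C

def attention_pool(feat):
    """Softmax-weighted average over spatio-temporal positions.
    Here the per-position score is an unlearned channel mean (illustrative only)."""
    C = feat.shape[0]
    flat = feat.reshape(C, -1)               # (C, T*H*W)
    scores = flat.mean(axis=0)               # stand-in for a learned attention map
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return flat @ weights                    # length C

def fuse(feat):
    """Concatenate the three pooled descriptors into one feature vector."""
    return np.concatenate([pyramid_pool(feat),
                           adaptive_pool(feat),
                           attention_pool(feat)])

np.random.seed(0)
feat = np.random.rand(64, 8, 7, 7).astype(np.float32)  # e.g. a conv5-like map
v = fuse(feat)
# 64*(1+8) + 64 + 64 = 704 dimensions; v.shape == (704,)
```

In the proposed network these pooled vectors come from both the deep global branch and the shallow local branch before the final cascade; the sketch only demonstrates the pooling-and-concatenation mechanics on a single feature map.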





Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grants 61771420 and 62001413, the Natural Science Foundation of Hebei Province under Grant F2020203064, the China Postdoctoral Science Foundation under Grant 2018M641674, and the Doctoral Foundation of Yanshan University under Grant BL18033. We also thank the providers of the public video databases used in this paper.

Author information

Corresponding author

Correspondence to Zheng-ping Hu.


About this article


Cite this article

Hu, Zp., Zhang, Rx., Qiu, Y. et al. 3D convolutional networks with multi-layer-pooling selection fusion for video classification. Multimed Tools Appl 80, 33179–33192 (2021). https://doi.org/10.1007/s11042-021-11403-z
