Abstract
C3D has been widely used for video representation and understanding. However, it operates on spatio-temporal contexts in a global view, which often weakens its capacity to learn local representations. To alleviate this problem, we introduce a concise multi-layer feature fusion network in which local and global views cooperate. The global view branch learns the core video semantics, while the local view branch captures contextual local semantics. Unlike the traditional C3D model, the global view branch supplies the local view branch with the most activated video features drawn from a broader 3D receptive field. By adding such shallow-view contexts, the local view branch learns more robust and discriminative spatio-temporal representations for video classification. We therefore propose 3D convolutional networks with multi-layer-pooling selection fusion for video classification: the integrated deep global feature is combined with information originating from the shallow layers of the local feature extraction network. Three different pooling units, namely space-time pyramid pooling, adaptive pooling, and attention pooling, extract complementary spatio-temporal features, which are finally cascaded and used for classification. Experiments on the UCF-101 and HMDB-51 datasets achieve correct classification rates of 95.0% and 72.2%, respectively. The results show that the proposed network yields better classification performance.
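The cascade of the three pooling units described above can be illustrated with a toy sketch. This is a hypothetical pure-Python illustration of the general idea (pyramid, adaptive, and attention pooling applied to the same features, then concatenated), not the authors' implementation; the function names and the 1-D toy feature map are assumptions for clarity.

```python
import math

def pyramid_pool(features, levels=(1, 2)):
    """Pyramid pooling: max-pool over progressively finer bins, keep all bins."""
    pooled = []
    for level in levels:
        bin_size = max(1, len(features) // level)
        for i in range(0, len(features), bin_size):
            pooled.append(max(features[i:i + bin_size]))
    return pooled

def adaptive_pool(features, out_size=2):
    """Adaptive average pooling: average into a fixed number of output bins."""
    bin_size = max(1, len(features) // out_size)
    return [sum(features[i:i + bin_size]) / len(features[i:i + bin_size])
            for i in range(0, len(features), bin_size)]

def attention_pool(features):
    """Attention pooling: softmax weights derived from the features themselves."""
    exps = [math.exp(f) for f in features]
    total = sum(exps)
    return [sum((e / total) * f for e, f in zip(exps, features))]

# Toy 1-D "feature map"; in the paper these would be 3D conv feature volumes.
feat = [0.2, 1.5, 0.7, 0.9]

# The three pooled descriptors are cascaded (concatenated) for classification.
descriptor = pyramid_pool(feat) + adaptive_pool(feat) + attention_pool(feat)
print(len(descriptor))  # → 6 (3 pyramid bins + 2 adaptive bins + 1 attention value)
```

Each unit summarizes the same features at a different granularity, which is why concatenating them yields a richer descriptor than any single pooling scheme alone.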
Acknowledgements
This work was supported by the National Natural Science Foundation of China under Grants 61771420 and 62001413, the Natural Science Foundation of Hebei Province under Grant F2020203064, the China Postdoctoral Science Foundation under Grant 2018M641674, and the Doctoral Foundation of Yanshan University under Grant BL18033. We also thank the providers of the public video databases used in this work.
Cite this article
Hu, Zp., Zhang, Rx., Qiu, Y. et al. 3D convolutional networks with multi-layer-pooling selection fusion for video classification. Multimed Tools Appl 80, 33179–33192 (2021). https://doi.org/10.1007/s11042-021-11403-z