Abstract
C3D has been widely used for video representation and understanding. However, it operates on spatio-temporal contexts in a global view, which often weakens its capacity to learn local representations. To alleviate this problem, we introduce a concise multi-layer feature fusion network in which local and global views cooperate. The global view branch learns the core video semantics, while the local view branch captures contextual local semantics. Unlike the traditional C3D model, the global view branch supplies the local view branch with the most activated video features drawn from a broader 3D receptive field. By adding such shallow-view contexts, the local view branch learns more robust and discriminative spatio-temporal representations for video classification. We therefore propose 3D convolutional networks with multi-layer-pooling selection fusion for video classification: the integrated deep global feature is combined with information originating from the shallow layers of the local feature extraction network. Three different pooling units, namely space-time pyramid pooling, adaptive pooling, and attention pooling, extract complementary spatio-temporal features, which are finally cascaded and used for classification. Experiments on the UCF-101 and HMDB-51 datasets achieve correct classification rates of 95.0% and 72.2%, respectively. The results show that the proposed network yields better classification performance.
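The cascade of the three pooling units described above can be illustrated with a toy sketch. This is a hypothetical pure-Python illustration of the general idea (pyramid, adaptive, and attention pooling applied to the same features, then concatenated), not the authors' implementation; the function names and the 1-D toy feature map are assumptions for clarity.

```python
import math

def pyramid_pool(features, levels=(1, 2)):
    """Pyramid pooling: max-pool over progressively finer bins, keep all bins."""
    pooled = []
    for level in levels:
        bin_size = max(1, len(features) // level)
        for i in range(0, len(features), bin_size):
            pooled.append(max(features[i:i + bin_size]))
    return pooled

def adaptive_pool(features, out_size=2):
    """Adaptive average pooling: average into a fixed number of output bins."""
    bin_size = max(1, len(features) // out_size)
    return [sum(features[i:i + bin_size]) / len(features[i:i + bin_size])
            for i in range(0, len(features), bin_size)]

def attention_pool(features):
    """Attention pooling: softmax weights derived from the features themselves."""
    exps = [math.exp(f) for f in features]
    total = sum(exps)
    return [sum((e / total) * f for e, f in zip(exps, features))]

# Toy 1-D "feature map"; in the paper these would be 3D conv feature volumes.
feat = [0.2, 1.5, 0.7, 0.9]

# The three pooled descriptors are cascaded (concatenated) for classification.
descriptor = pyramid_pool(feat) + adaptive_pool(feat) + attention_pool(feat)
print(len(descriptor))  # → 6 (3 pyramid bins + 2 adaptive bins + 1 attention value)
```

Each unit summarizes the same features at a different granularity, which is why concatenating them yields a richer descriptor than any single pooling scheme alone.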
Acknowledgements
This work was supported by the National Natural Science Foundation of China under Grants 61771420 and 62001413, the Natural Science Foundation of Hebei Province under Grant F2020203064, the China Postdoctoral Science Foundation under Grant 2018M641674, and the Doctoral Foundation of Yanshan University under Grant BL18033. We also thank the providers of the public video databases used in this work.
Cite this article
Hu, Zp., Zhang, Rx., Qiu, Y. et al. 3D convolutional networks with multi-layer-pooling selection fusion for video classification. Multimed Tools Appl 80, 33179–33192 (2021). https://doi.org/10.1007/s11042-021-11403-z