Abstract
Convolutional neural networks (CNNs) are a natural structure for video modelling and have been successfully applied to action recognition. Existing 3D CNN-based action recognition methods mainly perform 3D convolutions on individual cues (e.g. appearance and motion cues) and rely on the design of subsequent networks to fuse these cues. In this paper, we propose a novel multi-cue 3D convolutional neural network (M3D) that directly integrates three individual cues: an appearance cue, a direct motion cue, and a salient motion cue. Unlike existing methods, the proposed M3D model performs 3D convolutions on multiple cues jointly rather than on a single cue, and thus obtains more discriminative and robust features by integrating the three cues as a whole. Further, we propose a residual multi-cue 3D convolution model (R-M3D) to improve representation ability and obtain more representative video features. Experimental results verify the effectiveness of the proposed M3D model, and the proposed R-M3D model (pre-trained on the Kinetics dataset) achieves competitive performance compared with state-of-the-art models on the UCF101 and HMDB51 datasets.
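To make the idea concrete, below is a minimal PyTorch sketch of one residual multi-cue 3D block. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the fusion mechanism, so we assume the appearance cue is an RGB clip (3 channels), the direct motion cue is optical flow (2 channels), and the salient motion cue is a saliency map (1 channel), concatenated along the channel axis so that a single 3D convolution operates on all cues as a whole. The class name `MultiCueResBlock3D` and all layer widths are hypothetical.

```python
# Hypothetical sketch of a residual multi-cue 3D block (not the paper's code).
# Each cue is a clip tensor of shape (batch, channels, frames, height, width).
import torch
import torch.nn as nn

class MultiCueResBlock3D(nn.Module):
    """Residual 3D convolution over channel-concatenated cues (illustrative)."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv1 = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm3d(out_ch)
        self.conv2 = nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm3d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        # 1x1x1 projection so the skip path matches the output width
        self.proj = nn.Conv3d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, appearance, flow, saliency):
        # Fuse the three cues along the channel axis, then convolve jointly,
        # so the 3D kernels see appearance and motion information together.
        x = torch.cat([appearance, flow, saliency], dim=1)
        identity = self.proj(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)  # residual connection (R-M3D style)

# Usage: an 8-frame 112x112 clip with RGB (3), flow (2), saliency (1) cues.
block = MultiCueResBlock3D(in_ch=3 + 2 + 1, out_ch=64)
rgb = torch.randn(1, 3, 8, 112, 112)
flow = torch.randn(1, 2, 8, 112, 112)
sal = torch.randn(1, 1, 8, 112, 112)
features = block(rgb, flow, sal)  # -> torch.Size([1, 64, 8, 112, 112])
```

The identity path uses a 1x1x1 projection so the residual addition matches the output width, mirroring standard residual design; stacking several such blocks would give one possible reading of the R-M3D architecture.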
Acknowledgements
This work was supported in part by the National Key Research and Development Program of China (No. 2018YFB1404102), the Fundamental Research Funds for the Central Universities (No. 2002B02181), the Natural Science Foundation of China (No. 51979085), the Natural Science Foundation of Jiangsu Province (No. BK2020022539), the Major Basic Research Program of the Shandong Natural Science Foundation (No. ZR2019ZD10), the Key Research and Development Plan of Shandong Province (No. 2019GGX101050), the Major Agricultural Application Technology Innovation Project of Shandong Province (No. SD2019NJ007), the China Scholarship Council (CSC), and the New Zealand China Doctoral Research Scholarships Programme. Finally, we thank Professor Chunhua Shen and the anonymous reviewers for their constructive comments, which significantly improved the quality of this paper.
Cite this article
Zong, M., Wang, R., Chen, Z. et al. Multi-cue based 3D residual network for action recognition. Neural Comput & Applic 33, 5167–5181 (2021). https://doi.org/10.1007/s00521-020-05313-8