Abstract
Convolutional neural networks (CNNs) are a natural structure for video modelling and have been successfully applied to action recognition. Existing 3D CNN-based action recognition methods mainly perform 3D convolutions on individual cues (e.g. appearance and motion cues) and rely on the design of subsequent networks to fuse these cues. In this paper, we propose a novel multi-cue 3D convolutional neural network (M3D) that directly integrates three individual cues: an appearance cue, a direct motion cue, and a salient motion cue. Unlike existing methods, the proposed M3D model performs 3D convolutions on multiple cues jointly rather than on a single cue, and thus obtains more discriminative and robust features by integrating the three cues as a whole. Further, we propose a residual multi-cue 3D convolution model (R-M3D) to improve representation ability and obtain more representative video features. Experimental results verify the effectiveness of the proposed M3D model, and the proposed R-M3D model (pre-trained on the Kinetics dataset) achieves competitive performance compared with state-of-the-art models on the UCF101 and HMDB51 datasets.
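To make the idea concrete, below is a minimal PyTorch sketch of one residual multi-cue 3D block. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the fusion mechanism, so we assume the appearance cue is an RGB clip (3 channels), the direct motion cue is optical flow (2 channels), and the salient motion cue is a saliency map (1 channel), concatenated along the channel axis so that a single 3D convolution operates on all cues as a whole. The class name `MultiCueResBlock3D` and all layer widths are hypothetical.

```python
# Hypothetical sketch of a residual multi-cue 3D block (not the paper's code).
# Each cue is a clip tensor of shape (batch, channels, frames, height, width).
import torch
import torch.nn as nn

class MultiCueResBlock3D(nn.Module):
    """Residual 3D convolution over channel-concatenated cues (illustrative)."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv1 = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm3d(out_ch)
        self.conv2 = nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm3d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        # 1x1x1 projection so the skip path matches the output width
        self.proj = nn.Conv3d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, appearance, flow, saliency):
        # Fuse the three cues along the channel axis, then convolve jointly,
        # so the 3D kernels see appearance and motion information together.
        x = torch.cat([appearance, flow, saliency], dim=1)
        identity = self.proj(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)  # residual connection (R-M3D style)

# Usage: an 8-frame 112x112 clip with RGB (3), flow (2), saliency (1) cues.
block = MultiCueResBlock3D(in_ch=3 + 2 + 1, out_ch=64)
rgb = torch.randn(1, 3, 8, 112, 112)
flow = torch.randn(1, 2, 8, 112, 112)
sal = torch.randn(1, 1, 8, 112, 112)
features = block(rgb, flow, sal)  # -> torch.Size([1, 64, 8, 112, 112])
```

The identity path uses a 1x1x1 projection so the residual addition matches the output width, mirroring standard residual design; stacking several such blocks would give one possible reading of the R-M3D architecture.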
Acknowledgements
This work was supported in part by the National Key Research and Development Program of China (No. 2018YFB1404102), the Fundamental Research Funds for the Central Universities (No. 2002B02181), the Natural Science Foundation of China (No. 51979085), the Natural Science Foundation of Jiangsu Province (No. BK2020022539), the Major Basic Research Program of the Shandong Natural Science Foundation (No. ZR2019ZD10), the Key Research and Development Plan of Shandong Province (No. 2019GGX101050), the Major Agricultural Application Technology Innovation Project of Shandong Province (No. SD2019NJ007), the China Scholarship Council (CSC), and the New Zealand China Doctoral Research Scholarships Programme. Finally, we thank Professor Chunhua Shen and the anonymous reviewers for their constructive comments, which significantly improved the quality of this paper.
Cite this article
Zong, M., Wang, R., Chen, Z. et al. Multi-cue based 3D residual network for action recognition. Neural Comput & Applic 33, 5167–5181 (2021). https://doi.org/10.1007/s00521-020-05313-8