
Multi-cue based 3D residual network for action recognition

  • Original Article
  • Published in Neural Computing and Applications

Abstract

The convolutional neural network (CNN) is a natural structure for video modelling and has been successfully applied to action recognition. Existing 3D CNN-based action recognition methods mainly perform 3D convolutions on individual cues (e.g. appearance and motion cues) and rely on subsequent network layers to fuse these cues. In this paper, we propose a novel multi-cue 3D convolutional neural network (M3D) that directly integrates three individual cues: an appearance cue, a direct motion cue, and a salient motion cue. Unlike existing methods, the proposed M3D model performs 3D convolutions on multiple cues jointly rather than on a single cue, and can therefore obtain more discriminative and robust features by integrating the three cues as a whole. Further, we propose a novel residual multi-cue 3D convolution model (R-M3D) to improve representation ability and obtain more representative video features. Experimental results verify the effectiveness of the proposed M3D model, and the proposed R-M3D model (pre-trained on the Kinetics dataset) achieves competitive performance compared with state-of-the-art models on the UCF101 and HMDB51 datasets.
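To make the idea concrete, below is a minimal PyTorch sketch (not the authors' released code) of the core notion described in the abstract: stacking the three cues along the channel axis so that a single 3D residual block convolves over appearance (RGB), direct motion (e.g. optical flow), and salient motion (e.g. a motion-saliency map) jointly. The class name MultiCueResBlock3D, the layer widths, and the clip dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn


class MultiCueResBlock3D(nn.Module):
    """A 3D residual block applied to a multi-cue input volume (sketch)."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.conv1 = nn.Conv3d(in_channels, out_channels, kernel_size=3,
                               padding=1, bias=False)
        self.bn1 = nn.BatchNorm3d(out_channels)
        self.conv2 = nn.Conv3d(out_channels, out_channels, kernel_size=3,
                               padding=1, bias=False)
        self.bn2 = nn.BatchNorm3d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # 1x1x1 projection so the shortcut matches the output channel width.
        self.proj = (nn.Conv3d(in_channels, out_channels, kernel_size=1,
                               bias=False)
                     if in_channels != out_channels else nn.Identity())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.proj(x))  # residual connection


# Hypothetical example: batch of 2 clips, 8 frames, 112x112 crops.
rgb = torch.randn(2, 3, 8, 112, 112)        # appearance cue (RGB)
flow = torch.randn(2, 2, 8, 112, 112)       # direct motion cue (x/y flow)
saliency = torch.randn(2, 1, 8, 112, 112)   # salient motion cue
clip = torch.cat([rgb, flow, saliency], dim=1)  # 6-channel multi-cue volume

block = MultiCueResBlock3D(in_channels=6, out_channels=64)
features = block(clip)  # -> shape (2, 64, 8, 112, 112)
```

The key design choice the sketch illustrates is that the 3D convolution sees all cues at once, so cross-cue interactions are learned from the first layer rather than deferred to a late fusion stage.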




Acknowledgements

This work was supported in part by the National Key Research and Development Program of China (No. 2018YFB1404102), the Fundamental Research Funds for the Central Universities (No. 2002B02181), the Natural Science Foundation of China (51979085), the Natural Science Foundation of Jiangsu Province (BK2020022539), the Major Basic Research Program of the Shandong Natural Science Foundation (ZR2019ZD10), the Key Research and Development Plan of Shandong Province (2019GGX101050), the Major Agricultural Application Technology Innovation Project of Shandong Province (SD2019NJ007), the China Scholarship Council (CSC), and the New Zealand China Doctoral Research Scholarships Program. Finally, we also thank Professor Chunhua Shen and the anonymous reviewers for their constructive comments, which significantly improved the quality of this paper.

Author information


Corresponding author

Correspondence to Maoli Wang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Zong, M., Wang, R., Chen, Z. et al. Multi-cue based 3D residual network for action recognition. Neural Comput & Applic 33, 5167–5181 (2021). https://doi.org/10.1007/s00521-020-05313-8
