
Multi-modality learning for human action recognition

Published in: Multimedia Tools and Applications

Abstract

Multi-modality-based human action recognition is a growing research topic, since multiple modalities provide richer, complementary information than a single modality. However, it is difficult for multi-modality learning to capture spatial-temporal information effectively from entire RGB and depth sequences. In this paper, to obtain a better representation of spatial-temporal information, we propose a bidirectional rank pooling method to construct RGB Visual Dynamic Images (VDIs) and Depth Dynamic Images (DDIs). Furthermore, we design an effective segmentation convolutional network (ConvNet) architecture based on a multi-modality hierarchical fusion strategy for human action recognition. The proposed method has been verified on the widely used NTU RGB+D, SYSU 3D HOI and UWA3D II datasets, where it achieves state-of-the-art results.
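As background on the dynamic-image construction mentioned above: rank pooling collapses a video into a single image via a weighted combination of its frames, and a bidirectional variant applies this both forward and backward in time. The sketch below is illustrative only; it uses the simple linear weights α_t = 2t − T − 1 (a common approximation to rank pooling, rather than the paper's exact construction), and the function names are ours.

```python
import numpy as np

def approx_rank_pooling(frames):
    """Collapse a video into one 'dynamic image' via approximate rank pooling.

    frames: array of shape (T, H, W, C). Each frame t receives the linear
    weight alpha_t = 2t - T - 1 (t = 1..T), a simplification of the
    harmonic-number coefficients used in approximate rank pooling.
    """
    T = frames.shape[0]
    t = np.arange(1, T + 1, dtype=np.float64)
    alpha = 2.0 * t - T - 1.0                      # weights sum to zero
    di = np.tensordot(alpha, frames.astype(np.float64), axes=(0, 0))
    # Rescale to [0, 255] so the result can be fed to an image ConvNet.
    di -= di.min()
    if di.max() > 0:
        di /= di.max()
    return (di * 255.0).astype(np.uint8)

def bidirectional_dynamic_images(frames):
    """Forward and backward dynamic images of one sequence.

    The backward image pools the time-reversed sequence, capturing how
    the action evolves in the opposite temporal direction.
    """
    return approx_rank_pooling(frames), approx_rank_pooling(frames[::-1])
```

Because the weights sum to zero, a static (constant) video maps to an all-zero dynamic image; only temporal change survives the pooling, which is the point of the representation.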




Acknowledgements

This work was supported by the National Key R&D Program of China (2018YFB1308000), the National Natural Science Foundation of China (U1913202, U1813205, U1713213, 61772508, 61801428), the Shenzhen Technology Project (JCYJ20180507182610734, JCYJ20170413152535587), the CAS Key Technology Talent Program, and the Zhejiang Provincial Natural Science Foundation of China (LY18F020034).

Author information


Corresponding author

Correspondence to Jun Cheng.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Ziliang Ren and Qieshi Zhang contributed equally to this work.


About this article


Cite this article

Ren, Z., Zhang, Q., Gao, X. et al. Multi-modality learning for human action recognition. Multimed Tools Appl 80, 16185–16203 (2021). https://doi.org/10.1007/s11042-019-08576-z

