
A Review of Dynamic Maps for 3D Human Motion Recognition Using ConvNets and Its Improvement

Neural Processing Letters

Abstract

RGB-D based action recognition is attracting increasing attention in both the research and industrial communities. Because RGB-D training data are limited, methods based on pre-training are popular in this field. This paper reviews the concept of dynamic maps for RGB-D based human motion recognition using models pretrained in the image domain. Dynamic maps recursively encode the spatial, temporal and structural information contained in a video sequence into dynamic motion images simultaneously. They make it possible to apply Convolutional Neural Networks (ConvNets) and their ImageNet-pretrained models to 3D human motion recognition. This simple, compact and effective representation achieves state-of-the-art results on various gesture, action and activity recognition datasets. Building on a review of previous methods that apply this concept to different modalities (depth, skeleton or RGB-D data), a novel encoding scheme is developed and presented in this paper. The improved method generates flow-guided dynamic maps that select the high-motion window and distinguish the order among frames with small motion. These flow-guided dynamic maps achieve state-of-the-art results on the large-scale ChaLearn LAP IsoGD and NTU RGB+D datasets.
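The abstract does not spell out the encoding itself. As a minimal sketch, assuming the dynamic maps are built with approximate rank pooling as introduced for dynamic image networks (Bilen et al., CVPR 2016), a sequence of T frames I_1, ..., I_T is collapsed into a single image d = sum_t alpha_t * I_t, with alpha_t = 2(T - t + 1) - (T + 1)(H_T - H_{t-1}), where H_t is the t-th harmonic number and H_0 = 0. The result is then min-max rescaled to 0..255 so that an ImageNet-pretrained ConvNet can consume it like an ordinary image. The helper select_high_motion_window is hypothetical: the abstract only says the flow-guided maps select the high-motion window, so mean absolute frame difference is used here as a cheap stand-in for optical-flow magnitude. All names below are illustrative, not the authors' code.

```python
from typing import List, Sequence

Frame = List[List[float]]  # one depth or grey-level frame, H rows x W columns


def harmonic(n: int) -> float:
    """H_n = 1 + 1/2 + ... + 1/n, with H_0 = 0."""
    return sum(1.0 / i for i in range(1, n + 1))


def rank_pool_coefficients(T: int) -> List[float]:
    """Approximate rank pooling weights alpha_t for t = 1..T.

    alpha_t = 2(T - t + 1) - (T + 1)(H_T - H_{t-1}); the weights sum to zero,
    so a static scene cancels out and only temporal change survives.
    """
    h_T = harmonic(T)
    return [2 * (T - t + 1) - (T + 1) * (h_T - harmonic(t - 1)) for t in range(1, T + 1)]


def dynamic_image(frames: Sequence[Frame]) -> Frame:
    """Collapse a frame sequence into one map: d = sum_t alpha_t * frame_t."""
    alphas = rank_pool_coefficients(len(frames))
    h, w = len(frames[0]), len(frames[0][0])
    out = [[0.0] * w for _ in range(h)]
    for a, frame in zip(alphas, frames):
        for i in range(h):
            for j in range(w):
                out[i][j] += a * frame[i][j]
    return out


def to_uint8(img: Frame) -> List[List[int]]:
    """Min-max rescale to 0..255 so the map can be fed like an ordinary image."""
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    scale = 255.0 / (hi - lo) if hi > lo else 0.0  # a flat map becomes all zeros
    return [[int(round((v - lo) * scale)) for v in row] for row in img]


def motion_energy(a: Frame, b: Frame) -> float:
    """Mean absolute difference between two frames (stand-in for flow magnitude)."""
    n = len(a) * len(a[0])
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n


def select_high_motion_window(frames: Sequence[Frame], win: int) -> int:
    """Hypothetical helper: start index of the `win`-frame window with the most motion."""
    if win < 2 or win > len(frames):
        raise ValueError("window must span at least 2 frames and fit in the sequence")
    # energy[k] measures the change between frame k and frame k + 1
    energy = [motion_energy(frames[k], frames[k + 1]) for k in range(len(frames) - 1)]
    best_start, best = 0, float("-inf")
    for s in range(len(frames) - win + 1):
        e = sum(energy[s:s + win - 1])  # a window of `win` frames has win - 1 differences
        if e > best:
            best_start, best = s, e
    return best_start
```

For example, dynamic_image([[[1.0]], [[2.0]], [[3.0]]]) returns approximately [[2.0]]: a pixel that brightens over time ends up with a positive value, which is how the map records temporal order. In practice the frames would be NumPy arrays and the loops vectorised; plain lists keep the sketch dependency-free.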





Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (Grant nos. 61906173, 61822701).

Author information

Correspondence to Huogen Wang.



About this article


Cite this article

Gao, Z., Wang, P., Wang, H. et al. A Review of Dynamic Maps for 3D Human Motion Recognition Using ConvNets and Its Improvement. Neural Process Lett 52, 1501–1515 (2020). https://doi.org/10.1007/s11063-020-10320-w



  • DOI: https://doi.org/10.1007/s11063-020-10320-w
