Abstract
RGB-D based action recognition is attracting increasing attention in both the research and industrial communities. Because training data are scarce, methods based on pre-training are popular in this field. This paper reviews the concept of dynamic maps for RGB-D based human motion recognition using models pretrained in the image domain. Dynamic maps recursively encode the spatial, temporal and structural information contained in a video sequence into dynamic motion images. They make it possible to use Convolutional Neural Networks and their ImageNet-pretrained models for 3D human motion recognition. This simple, compact and effective representation achieves state-of-the-art results on various gesture, action and activity recognition datasets. Building on the review of previous methods that apply this concept to different modalities (depth, skeleton or RGB-D data), a novel encoding scheme is developed and presented in this paper. The improved method generates effective flow-guided dynamic maps, which can select the high-motion window and distinguish the order of frames with small motion. The improved flow-guided dynamic maps achieve state-of-the-art results on the large Chalearn LAP IsoGD and NTU RGB+D datasets.
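As a rough illustration of how a frame sequence can be collapsed into a single image that still carries temporal order, the sketch below applies a simplified linear weighting in the spirit of approximate rank pooling (Bilen et al. 2016). It is not the flow-guided encoding proposed in this paper, whose details the abstract does not give. The function name `dynamic_image`, the weights `alpha_t = 2t - T - 1`, and the rescaling to 8-bit values are all assumptions made for the example.

```python
import numpy as np


def dynamic_image(frames):
    """Collapse a (T, H, W) or (T, H, W, C) frame stack into one 8-bit image.

    Hypothetical helper: a simplified, linear-weight form of approximate
    rank pooling, not the paper's flow-guided dynamic maps.
    """
    frames = np.asarray(frames, dtype=np.float64)
    T = frames.shape[0]
    # Assumed weights alpha_t = 2t - T - 1 for t = 1..T: early frames get
    # negative weights, late frames positive ones, so order is encoded.
    alpha = 2.0 * np.arange(1, T + 1) - T - 1
    # Weighted sum over the time axis.
    dyn = np.tensordot(alpha, frames, axes=(0, 0))
    # Rescale to [0, 255] so the result can be fed to an image-domain ConvNet.
    lo, hi = dyn.min(), dyn.max()
    if hi == lo:
        # No temporal change anywhere: return an all-zero image.
        return np.zeros(dyn.shape, dtype=np.uint8)
    return np.round(255.0 * (dyn - lo) / (hi - lo)).astype(np.uint8)


# One pixel brightens over four frames; the rest of the scene is static.
clip = np.zeros((4, 2, 2))
clip[:, 0, 0] = [0, 1, 2, 3]
forward = dynamic_image(clip)
backward = dynamic_image(clip[::-1])
```

Because the weights sum to zero, static regions cancel out, and playing the same clip in reverse flips the sign of the response. That is the sense in which a single image keeps the ordering of the frames.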







Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (Grant nos. 61906173, 61822701).
Cite this article
Gao, Z., Wang, P., Wang, H. et al. A Review of Dynamic Maps for 3D Human Motion Recognition Using ConvNets and Its Improvement. Neural Process Lett 52, 1501–1515 (2020). https://doi.org/10.1007/s11063-020-10320-w
DOI: https://doi.org/10.1007/s11063-020-10320-w