Abstract
In this paper, we present an image classification approach to action recognition with 3D skeleton videos. First, we propose a video domain translation-scale invariant image mapping, which transforms the 3D skeleton videos to color images, namely skeleton images. Second, a multi-scale dilated convolutional neural network (CNN) is designed for the classification of the skeleton images. Our multi-scale dilated CNN model could effectively improve the frequency adaptiveness and exploit the discriminative temporal-spatial cues for the skeleton images. Even though the skeleton images are very different from natural images, we show that the fine-tuning strategy still works well. Furthermore, we propose different kinds of data augmentation strategies to improve the generalization and robustness of our method. Experimental results on popular benchmark datasets such as NTU RGB + D, UTD-MHAD, MSRC-12 and G3D demonstrate the superiority of our approach, which outperforms the state-of-the-art methods by a large margin.










Similar content being viewed by others
Notes
As AlexNet and VGGNet only work for input size 224 × 224, we cannot apply the multi-scale strategy in fine-tuning these models. In contrast, ResNet model could work with different input resolution and thus a multi-scale strategy is applicable.
References
Barnachon M, Bouakaz S, Boufama B, Guillou E (2014) Ongoing human action recognition with motion capture. Pattern Recogn 47(1):238–247
Bloom V, Makris D, Argyriou V (2012) G3d: a gaming action dataset and real time action recognition evaluation framework. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp 7–12
Chen C, Jafari R, Kehtarnavaz N (2015) Utd-mhad: a multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor. In: Proceedings of the IEEE international conference on image processing, pp 168–172
Du Y, Wang W, Wang L (2015) Hierarchical recurrent neural network for skeleton based action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1110–1118
Du Y, Fu Y, Wang L (2016) Representation learning of temporal dynamics for skeleton-based action recognition. IEEE Trans Image Proc 25(7):3010–3022
Fothergill S, Mentis H, Kohli P, Nowozin S (2012) Instructing people for training gestural interactive systems. In: Proceedings of the SIGCHI conference on human factors in computing systems. ACM, pp 1737–1746
Gao Y, Wang M, Ji R, Wu X, Dai Q (2013) 3-D object retrieval with hausdorff distance learning. IEEE Trans Ind Electron 61(4):2088–2098
Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 580–587
Gowayyed MA, Torki M, Hussein ME, El-Saban M (2013) Histogram of oriented displacements (hod): describing trajectories of human joints for action recognition. In: AAAI, pp 1351–1357
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
Hou Y, Li Z, Wang P, Li W (2016) Skeleton optical spectra based action recognition using convolutional neural networks. IEEE Trans Circuits Syst Video Technol PP(99):1–1
Huang Z, Wan C, Probst T, Van Gool L (2017) Deep learning on lie groups for skeleton-based action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6099–6108
Hussein ME, Torki M, Gowayyed MA, El-Saban M (2013) Human action recognition using a temporal hierarchy of covariance descriptors on 3d joint locations. In: AAAI, pp 2466–2472
Jia Y, Shelhamer E, Donahue J, Karayev S, Long J, Girshick R, Guadarrama S, Darrell T (2014) Caffe: convolutional architecture for fast feature embedding. In: Proceedings of the 22nd ACM international conference on multimedia, pp 675–678
Ke Q, An S, Bennamoun M, Sohel F, Boussaid F (2017) Skeletonnet: mining deep part features for 3-d action recognition. IEEE Signal Process Lett 24 (6):731–735
Ke Q, Bennamoun M, An S, Sohel F, Boussaid F (2017) A new representation of skeleton sequences for 3d action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3288–3297
Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. In: Proceedings of the advance neural information processing systems, pp 1097–1105
Li B, Dai Y, Cheng X, Chen H, Lin Y, He M (2017) Skeleton based action recognition using translation-scale invariant image mapping and multi-scale deep cnn. In: Proceedings of the IEEE international conference on multimedia and expo workshops, pp 1–1
Liu C, Hu Y, Li Y, Song S, Liu J Pku-mmd: a large scale benchmark for continuous multi-modal human action understanding. arXiv:1703.07475
Liu J, Shahroudy A, Xu D, Wang G (2016) Spatio-temporal lstm with trust gates for 3d human action recognition. In: Proceedings of the european conference on computer vision. Springer, pp 816–833
Lu G, Zhou Y, Li X, Kudo M (2016) Efficient action recognition via local position offset of 3d skeletal body joints. Multimedia Tools and Applications 75 (6):3479–3494
Luo J, Wang W, Qi H (2013) Group sparsity and geometry constrained dictionary learning for action recognition from depth maps. In: Proceedings of the IEEE international conference on computer vision, pp 1809–1816
Nie S, Wang Z, Ji Q (2015) A generative restricted boltzmann machine based method for high-dimensional motion data modeling. Comp Vis Image Understanding 136:14–22
Ohn-Bar E, Trivedi M (2013) Joint angles similarities and hog2 for action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp 465–470
Oquab M, Bottou L, Laptev I, Sivic J (2014) Learning and transferring mid-level image representations using convolutional neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1717–1724
Presti LL, La Cascia M (2016) 3d skeleton-based human action classification: a survey. Pattern Recogn 53:130–147
Shahroudy A, Liu J, Ng T-T, Wang G (2016) Ntu rgb + d: a large scale dataset for 3d human activity analysis. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1010–1019
Shotton J, Girshick R, Fitzgibbon A, Sharp T, Cook M, Finocchio M, Moore R, Kohli P, Criminisi A, Kipman A, et al (2013) Efficient human pose estimation from single depth images. IEEE Trans Pattern Anal Mach Intell 35 (12):2821–2840
Simonyan K, Zisserman A Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556
Song Y, Liu S, Tang J (2015) Describing trajectory of surface patch for human action recognition on rgb and depth videos. IEEE Signal Process Lett 22(4):426–429
Song S, Lan C, Xing J, Zeng W, Liu J (2017) An end-to-end spatio-temporal attention model for human action recognition from skeleton data. In: AAAI, pp 4263–4270
Veeriah V, Zhuang N, Qi G-J (2015) Differential recurrent neural networks for action recognition. In: Proceedings of the IEEE international conference on computer vision, pp 4041–4049
Vemulapalli R, Chellapa R (2016) Rolling rotations for recognizing human actions from 3d skeletal data. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4471–4479
Vemulapalli R, Arrate F, Chellappa R (2014) Human action recognition by representing 3d skeletons as points in a lie group. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 588–595
Vemulapalli R, Arrate F, Chellappa R (2016) R3dg features: relative 3d geometry-based skeletal representations for human action recognition. Comp Vis Image Understanding 152:155–166
Wang P, Li Z, Hou Y, Li W (2016) Action recognition based on joint trajectory maps using convolutional neural networks. In: Proceedings of the 2016 ACM on multimedia conference, pp 102–106
Wang D, Wang B, Zhao S, Yao H, Liu H (2017) View-based 3d object retrieval with discriminative views. Neurocomputing 252(C):58–66
Wang X, Gao L, Song J, Shen H (2017) Beyond frame-level cnn: saliency-aware 3-d cnn with lstm for video action recognition. IEEE Signal Process Lett 24(4):510–514
Wu D, Shao L (2014) Leveraging hierarchical parametric networks for skeletal joints based action segmentation and recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 724–731
Yang S, Yuan C, Hu W, Ding X (2014) A hierarchical model based on latent dirichlet allocation for action recognition. In: Proceedings of the IEEE international conference on pattern recognition, pp 2613–2618
Yong D, Yun F, Liang W (2015) Skeleton based action recognition with convolutional neural network. In: Iapr asian conference on pattern recognition, pp 579–583
Yu F, Koltun V (2016) Multi-scale context aggregation by dilated convolutions. In: Proceedings of the international conference on learning representations, pp 1–10
Zanfir M, Leordeanu M, Sminchisescu C (2013) The moving pose: an efficient 3d kinematics descriptor for low-latency action recognition and detection. In: Proceedings of the IEEE international conference on computer vision, pp 2752–2759
Zhao S, Chen L, Yao H, Zhang Y, Sun X (2015) Strategy for dynamic 3d depth data matching towards robust action retrieval. Neurocomputing 151:533–543
Zhao S, Yao H, Zhang Y, Wang Y, Liu S (2015) View-based 3d object retrieval via multi-modal graph learning. Signal Process 112(C):110–118
Zhao S, Yao H, Gao Y, Ji R, Ding G (2017) Continuous probability distribution prediction of image emotions via multitask shared sparse regression. IEEE Trans Multimedia 19(3):632–645
Zheng Y, Yao H, Sun X, Zhao S (2015) Distinctive action sketch. In: IEEE international conference on image processing, pp 576–580
Zhou L, Li W, Zhang Y, Ogunbona P, Nguyen DT, Zhang H (2014) Discriminative key pose extraction using extended lc-ksvd for action recognition. In: Proceedings of the IEEE international conference on digital lmage computing: techniques and applications, pp 1–8
Zhou L, Li W, Ogunbona P (2016) Learning a pose lexicon for semantic action recognition. In: Proceedings of the IEEE international conference on multimedia and expo, pp 1–6
Zhu W, Lan C, Xing J, Zeng W, Li Y, Shen L, Xie X (2016) Co-occurrence feature learning for skeleton based action recognition using regularized deep lstm networks. In: AAAI, pp 3697–3703
Acknowledgements
This work was supported in part by Natural Science Foundation of China grants (61420106007, 61671387) and Australian Research Council grants (DE140100180).
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Li, B., He, M., Dai, Y. et al. 3D skeleton based action recognition by video-domain translation-scale invariant mapping and multi-scale dilated CNN. Multimed Tools Appl 77, 22901–22921 (2018). https://doi.org/10.1007/s11042-018-5642-0
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11042-018-5642-0