3D skeleton based action recognition by video-domain translation-scale invariant mapping and multi-scale dilated CNN

Multimedia Tools and Applications

Abstract

In this paper, we present an image classification approach to action recognition from 3D skeleton videos. First, we propose a video-domain translation-scale invariant image mapping, which transforms 3D skeleton videos into color images, which we call skeleton images. Second, we design a multi-scale dilated convolutional neural network (CNN) to classify the skeleton images. The multi-scale dilated CNN improves frequency adaptiveness and exploits the discriminative spatio-temporal cues in the skeleton images. Even though skeleton images are very different from natural images, we show that fine-tuning a pre-trained model still works well. Furthermore, we propose several data augmentation strategies to improve the generalization and robustness of our method. Experimental results on the popular benchmark datasets NTU RGB+D, UTD-MHAD, MSRC-12 and G3D show that our approach outperforms the state-of-the-art methods by a large margin.
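The two building blocks named in the abstract can be sketched in a few lines of NumPy. This is only an illustrative reading, not the authors' implementation: the function names `skeleton_to_image` and `dilated_conv1d` are hypothetical, and the normalization (per-axis minimum, a single largest range shared by all axes, joints as rows, frames as columns) is an assumption, since the abstract does not give the formula. The dilated convolution is shown in 1D along the time axis for brevity.

```python
import numpy as np


def skeleton_to_image(video):
    """Map a skeleton video of shape (frames, joints, 3) to a uint8 color image.

    Assumption: the minimum is taken per coordinate axis over the whole video
    and one common range (the largest over x, y, z) is used for scaling, so
    translating or uniformly scaling the skeleton leaves the image unchanged.
    """
    video = np.asarray(video, dtype=np.float64)
    c_min = video.min(axis=(0, 1))                    # per-axis minimum over the video
    c_range = (video.max(axis=(0, 1)) - c_min).max()  # single scale shared by all axes
    if c_range == 0:                                  # degenerate case: all points coincide
        return np.zeros((video.shape[1], video.shape[0], 3), dtype=np.uint8)
    normalized = (video - c_min) / c_range            # values in [0, 1]
    pixels = np.floor(255.0 * normalized).astype(np.uint8)
    # Rows are joints, columns are frames, channels (x, y, z) become (R, G, B).
    return pixels.transpose(1, 0, 2)


def dilated_conv1d(signal, kernel, dilation):
    """'Valid' 1D correlation whose kernel taps are `dilation` samples apart."""
    signal = np.asarray(signal, dtype=np.float64)
    kernel = np.asarray(kernel, dtype=np.float64)
    span = (len(kernel) - 1) * dilation + 1           # receptive field of one output
    out_len = len(signal) - span + 1
    return np.array(
        [np.dot(kernel, signal[i:i + span:dilation]) for i in range(out_len)]
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.integers(-50, 50, size=(6, 4, 3)).astype(np.float64)
    moved = 2.0 * video + np.array([10.0, -3.0, 7.0])  # scaled and translated copy
    same = np.array_equal(skeleton_to_image(video), skeleton_to_image(moved))
    print("image unchanged under translation and scaling:", same)
    print(dilated_conv1d(np.arange(10), [1, 0, -1], dilation=3))
```

With the same three-tap kernel, a larger `dilation` covers a longer stretch of frames without adding parameters, which is one way to read the "multi-scale" part of the design.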

Notes

  1. Because AlexNet and VGGNet only work with an input size of 224 × 224, we cannot apply the multi-scale strategy when fine-tuning these models. In contrast, the ResNet model works with different input resolutions, so a multi-scale strategy is applicable.
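A likely reason for this difference, stated here as an assumption because the note does not spell it out, is the layer that precedes the classifier. AlexNet and VGGNet flatten the last feature map into fully connected layers, so the classifier's input length depends on the image size, whereas ResNet applies global average pooling, which yields one value per channel at any resolution. The sketch below uses hypothetical helper names and dummy feature maps to show the shape behavior only.

```python
import numpy as np


def flatten_features(feature_map):
    """Flatten a (channels, height, width) map; the length depends on height and width."""
    return feature_map.reshape(-1)


def global_average_pool(feature_map):
    """Average over the spatial axes; the length is always the channel count."""
    return feature_map.mean(axis=(1, 2))


if __name__ == "__main__":
    small = np.ones((8, 7, 7))    # dummy feature map from a lower-resolution input
    large = np.ones((8, 10, 10))  # dummy feature map from a higher-resolution input
    print(flatten_features(small).shape, flatten_features(large).shape)
    print(global_average_pool(small).shape, global_average_pool(large).shape)
```

A fully connected layer trained on the flattened vector of one size cannot accept the other, while a classifier placed after the pooled vector accepts both.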


Acknowledgements

This work was supported in part by Natural Science Foundation of China grants 61420106007 and 61671387 and by Australian Research Council grant DE140100180.

Author information

Corresponding author

Correspondence to Mingyi He.

About this article

Cite this article

Li, B., He, M., Dai, Y. et al. 3D skeleton based action recognition by video-domain translation-scale invariant mapping and multi-scale dilated CNN. Multimed Tools Appl 77, 22901–22921 (2018). https://doi.org/10.1007/s11042-018-5642-0

  • DOI: https://doi.org/10.1007/s11042-018-5642-0
