3D skeleton based action recognition by video-domain translation-scale invariant mapping and multi-scale dilated CNN

Li, Bo; He, Mingyi; Dai, Yuchao; Cheng, Xuelian; Chen, Yucheng

doi:10.1007/s11042-018-5642-0

Published: 06 February 2018

Multimedia Tools and Applications, volume 77, pages 22901–22921 (2018)

Bo Li¹, Mingyi He¹ (ORCID: orcid.org/0000-0003-2051-6955), Yuchao Dai¹, Xuelian Cheng¹ & Yucheng Chen¹

Abstract

In this paper, we present an image classification approach to action recognition with 3D skeleton videos. First, we propose a video-domain translation-scale invariant image mapping, which transforms 3D skeleton videos into color images, namely skeleton images. Second, we design a multi-scale dilated convolutional neural network (CNN) to classify the skeleton images. The multi-scale dilated CNN improves frequency adaptiveness and exploits the discriminative temporal-spatial cues in the skeleton images. Even though skeleton images are very different from natural images, we show that the fine-tuning strategy still works well. Furthermore, we propose several data augmentation strategies to improve the generalization and robustness of our method. Experimental results on popular benchmark datasets such as NTU RGB+D, UTD-MHAD, MSRC-12 and G3D demonstrate that our approach outperforms state-of-the-art methods by a large margin.
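The abstract does not give formulas, so the following is only a minimal sketch of the two ideas it names, under stated assumptions. The function `skeleton_to_image` assumes the mapping is a min-max normalization computed over the whole video (not per frame), with one scale shared by the x, y and z axes, and that joints index image rows while frames index columns. The function `dilated_conv1d` shows a generic dilated convolution along one axis; the kernel and dilation rates are hypothetical and are not the network described in the paper.

```python
import numpy as np


def skeleton_to_image(joints):
    """Map a skeleton video of shape (T, J, 3) to a uint8 image of shape (J, T, 3).

    Assumption: x, y, z become R, G, B after min-max normalization over the
    whole video, using a single scale shared by the three axes.
    """
    joints = np.asarray(joints, dtype=np.float64)
    flat = joints.reshape(-1, 3)
    c_min = flat.min(axis=0)              # per-axis minimum over all frames and joints
    c_max = flat.max(axis=0)              # per-axis maximum over all frames and joints
    scale = float((c_max - c_min).max())  # shared scale for all three axes
    if scale == 0.0:
        scale = 1.0                       # degenerate case: every joint at one point
    normalized = (joints - c_min) / scale  # subtracting c_min removes translation
    image = np.floor(255.0 * normalized).astype(np.uint8)
    return image.transpose(1, 0, 2)       # rows = joints, columns = frames


def dilated_conv1d(signal, kernel, dilation):
    """Valid-mode dilated correlation: y[t] = sum_k kernel[k] * signal[t + k * dilation]."""
    signal = np.asarray(signal, dtype=np.float64)
    kernel = np.asarray(kernel, dtype=np.float64)
    span = (len(kernel) - 1) * dilation   # receptive field minus one
    out_len = len(signal) - span
    if out_len <= 0:
        raise ValueError("signal is shorter than the dilated kernel")
    out = np.zeros(out_len)
    for k, weight in enumerate(kernel):
        start = k * dilation
        out += weight * signal[start:start + out_len]
    return out
```

Under these assumptions, shifting or uniformly rescaling all joint coordinates of a video leaves the resulting image unchanged, which is the sense in which such a mapping is translation-scale invariant. Increasing `dilation` widens the temporal span seen by a fixed-size kernel without adding weights, which is one way a multi-scale design can respond to actions performed at different speeds.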

Notes

Because AlexNet and VGGNet only work with an input size of 224 × 224, we cannot apply the multi-scale strategy when fine-tuning these models. In contrast, the ResNet model works with different input resolutions, so a multi-scale strategy is applicable.
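The note does not say why the models differ. One common explanation, offered here as an assumption, is that ResNet ends in global average pooling, which yields a fixed-length feature vector for any spatial size, whereas AlexNet and VGGNet flatten their last feature map into fully connected layers that expect a fixed length. The sketch below illustrates only that difference; the channel count and spatial sizes are hypothetical.

```python
import numpy as np


def global_average_pool(feature_map):
    """Average a (C, H, W) feature map over H and W, giving a vector of length C."""
    return np.asarray(feature_map, dtype=np.float64).mean(axis=(1, 2))


def flatten(feature_map):
    """Flatten a (C, H, W) feature map into a vector of length C * H * W."""
    return np.asarray(feature_map).reshape(-1)


# Two hypothetical feature maps produced from inputs of different resolution.
small = np.ones((8, 7, 7))
large = np.ones((8, 10, 10))

pooled_small = global_average_pool(small)
pooled_large = global_average_pool(large)
flat_small = flatten(small)
flat_large = flatten(large)
```

Here the pooled vectors have the same length for both inputs, so a single classifier layer can follow them, while the flattened vectors differ in length and would need a differently sized fully connected layer for each resolution.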

Acknowledgements

This work was supported in part by the Natural Science Foundation of China (grants 61420106007 and 61671387) and the Australian Research Council (grant DE140100180).

Author information

Authors and Affiliations

Northwestern Polytechnical University, Xi’an, 710129, China
Bo Li, Mingyi He, Yuchao Dai, Xuelian Cheng & Yucheng Chen

Corresponding author

Correspondence to Mingyi He.

About this article

Cite this article

Li, B., He, M., Dai, Y. et al. 3D skeleton based action recognition by video-domain translation-scale invariant mapping and multi-scale dilated CNN. Multimed Tools Appl 77, 22901–22921 (2018). https://doi.org/10.1007/s11042-018-5642-0

Received: 31 July 2017
Revised: 04 December 2017
Accepted: 10 January 2018
Published: 06 February 2018
Issue Date: September 2018
DOI: https://doi.org/10.1007/s11042-018-5642-0
