Two-Stream Temporal Convolutional Networks for Skeleton-Based Human Action Recognition

  • Regular Paper
  • Journal of Computer Science and Technology

Abstract

With the growing popularity of somatosensory interaction devices, human action recognition is becoming attractive in many application scenarios. Skeleton-based action recognition is effective because the skeleton can represent the positions and the structure of key points of the human body. In this paper, we leverage spatiotemporal vectors between skeleton sequences as the input feature representation of the network, which is more sensitive to changes in the human skeleton than representations based on distance and angle features. In addition, we redesign the residual blocks, giving them different strides along the depth of the network, to improve the ability of temporal convolutional networks (TCNs) to process actions with long-term temporal dependencies. We propose the two-stream temporal convolutional networks (TS-TCNs), which take full advantage of both the inter-frame and the intra-frame vector features of skeleton sequences in their spatiotemporal representations. The framework integrates the two feature representations of skeleton sequences so that they compensate for each other's shortcomings. A fusion loss function is used to supervise the training of the two branch networks. Experiments on public datasets show that our network achieves superior performance, attaining an improvement of 1.2% over the recent GCN-based method (BGC-LSTM) on the NTU RGB+D dataset.
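
The abstract describes three ingredients: inter-frame and intra-frame vector features computed from the skeleton sequence, residual temporal-convolution blocks whose stride grows with network depth, and two branches supervised jointly through a fusion loss. The sketch below is only an illustration of these ideas, not the authors' implementation: the skeleton topology (edge list), kernel sizes, channel widths, and the fusion weight alpha are assumptions chosen for the example, and PyTorch is used purely for brevity.

```python
# Minimal sketch of the ideas in the abstract (illustrative assumptions, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


def intra_frame_vectors(skel, edges):
    """Vectors between connected joints inside each frame.

    skel:  (batch, frames, joints, 3) joint coordinates
    edges: list of (parent, child) joint-index pairs (assumed skeleton topology)
    returns (batch, frames, len(edges), 3)
    """
    parents = torch.tensor([p for p, _ in edges])
    children = torch.tensor([c for _, c in edges])
    return skel[:, :, children, :] - skel[:, :, parents, :]


def inter_frame_vectors(skel):
    """Vectors between corresponding joints in consecutive frames.

    returns (batch, frames - 1, joints, 3)
    """
    return skel[:, 1:] - skel[:, :-1]


class ResidualTCNBlock(nn.Module):
    """1D temporal convolution with a residual connection.

    A larger stride deeper in the stack shortens the sequence and enlarges the
    temporal receptive field, which is the role the abstract attributes to the
    redesigned residual blocks.
    """

    def __init__(self, in_ch, out_ch, kernel_size=9, stride=1):
        super().__init__()
        pad = kernel_size // 2
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, stride=stride, padding=pad)
        self.bn = nn.BatchNorm1d(out_ch)
        # 1x1 convolution matches channels and temporal length on the skip path
        self.skip = nn.Conv1d(in_ch, out_ch, 1, stride=stride)

    def forward(self, x):  # x: (batch, channels, frames)
        return F.relu(self.bn(self.conv(x)) + self.skip(x))


class TCNBranch(nn.Module):
    """One stream: residual blocks with increasing stride, then pooling and a classifier."""

    def __init__(self, in_ch, num_classes):
        super().__init__()
        self.blocks = nn.Sequential(
            ResidualTCNBlock(in_ch, 64, stride=1),
            ResidualTCNBlock(64, 128, stride=2),
            ResidualTCNBlock(128, 256, stride=2),
        )
        self.fc = nn.Linear(256, num_classes)

    def forward(self, x):  # x: (batch, channels, frames)
        h = self.blocks(x).mean(dim=-1)  # global average pooling over time
        return self.fc(h)


def fusion_loss(logits_a, logits_b, labels, alpha=0.5):
    """Supervise both branches and their fused prediction (illustrative weighting)."""
    fused = (logits_a + logits_b) / 2
    return (F.cross_entropy(fused, labels)
            + alpha * (F.cross_entropy(logits_a, labels)
                       + F.cross_entropy(logits_b, labels)))
```

In such a setup, one TCNBranch would consume the intra-frame vectors and the other the inter-frame vectors (each flattened to a (batch, channels, frames) layout), and both branches would be trained together through fusion_loss.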

References

  1. Aggarwal J K, Xia L. Human activity recognition from 3D data: A review. Pattern Recognition Letters, 2014, 48: 70-80.

  2. Weinland D, Ronfard R, Boyer E. A survey of vision-based methods for action representation, segmentation and recognition. Computer Vision and Image Understanding, 2011, 115(2): 224-241.

  3. Han F, Reily B, Hoff W, Zhang H. Space-time representation of people based on 3D skeletal data: A review. Computer Vision and Image Understanding, 2017, 158: 85-105.

  4. Liu H, Liu B, Zhang H, Li L, Qin X, Zhang G. Crowd evacuation simulation approach based on navigation knowledge and two-layer control mechanism. Information Sciences, 2018, 436/437: 247-267.

  5. Turaga P, Chellappa R, Subrahmanian V S. Machine recognition of human activities: A survey. IEEE Transactions on Circuits and Systems for Video Technology, 2008, 18(11): 1473-1488.

  6. Herath S, Harandi M, Porikli F. Going deeper into action recognition: A survey. Image and Vision Computing, 2017, 60: 4-21.

  7. Hou J H, Chau L P, Thalmann N M, He Y. Compressing 3-D human motions via keyframe-based geometry videos. IEEE Transactions on Circuits and Systems for Video Technology, 2014, 25(1): 51-62.

  8. Sermanet P, Lynch C, Hsu J, Levine S. Time-contrastive networks: Self-supervised learning from multi-view observation. In Proc. the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, July 2017, pp.486-487.

  9. Shotton J, Sharp T, Kipman A, Fitzgibbon A, Finocchio M, Blake A, Cook M, Moore R. Real-time human pose recognition in parts from single depth images. Communications of the ACM, 2013, 56(1): 116-124.

  10. Li S, Fang Z, Song W, Hao A, Qin H. Bidirectional optimization coupled lightweight networks for efficient and robust multi-person 2D pose estimation. Journal of Computer Science and Technology, 2019, 34(3): 522-536.

  11. Shahroudy A, Liu J, Ng T T, Wang G. NTU RGB+D: A large scale dataset for 3D human activity analysis. In Proc. the 2016 IEEE Conference on Computer Vision and Pattern Recognition, June 2016, pp.1010-1019.

  12. Zhu F, Shao L, Xie J, Fang Y. From handcrafted to learned representations for human action recognition: A survey. Image and Vision Computing, 2016, 55: 42-52.

  13. Huang Z W, Wan C, Probst T, Van Gool L. Deep learning on Lie groups for skeleton-based action recognition. In Proc. the 2017 IEEE Conference on Computer Vision and Pattern Recognition, July 2017, pp.1243-1252.

  14. Ke Q, An S, Bennamoun M, Sohel F, Boussaid F. Skeleton-Net: Mining deep part features for 3-D action recognition. IEEE Signal Processing Letters, 2017, 24(6): 731-735.

  15. Weng J, Weng C, Yuan J, Liu Z. Discriminative spatiotemporal pattern discovery for 3D action recognition. IEEE Transactions on Circuits and Systems for Video Technology, 2019, 29(4): 1077-1089.

  16. Liu J, Shahroudy A, Xu D, Kot A C, Wang G. Skeleton-based action recognition using spatiotemporal LSTM network with trust gates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(12): 3007-3021.

  17. Lee I, Kim D, Kang S, Lee S. Ensemble deep learning for skeleton-based action recognition using temporal sliding LSTM networks. In Proc. the 2017 IEEE International Conference on Computer Vision, October 2017, pp.1012-1020.

  18. Zhang P, Xue J, Lan C, Zeng W, Gao Z, Zheng N. Adding attentiveness to the neurons in recurrent neural networks. In Proc. the 15th European Conference on Computer Vision, September 2018, pp.136-152.

  19. Meng F, Liu H, Liang Y, Tu J, Liu M. Sample fusion network: An end-to-end data augmentation network for skeleton-based human action recognition. IEEE Transactions on Image Processing, 2019, 28(11): 5281-5295.

  20. Yan S, Xiong Y, Lin D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proc. the 32nd AAAI Conference on Artificial Intelligence, February 2018, pp.7444-7452.

  21. Shi L, Zhang Y, Cheng J, Lu H. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proc. the 2019 IEEE Conference on Computer Vision and Pattern Recognition, June 2019, pp.12026-12035.

  22. Li M, Chen S, Chen X, Zhang Y, Wang Y, Tian Q. Actional-structural graph convolutional networks for skeleton-based action recognition. In Proc. the 2019 IEEE Conference on Computer Vision and Pattern Recognition, June 2019, pp.3595-3603.

  23. Si C, Chen W, Wang W, Wang L, Tan T. An attention enhanced graph convolutional LSTM network for skeleton-based action recognition. In Proc. the 2019 IEEE Conference on Computer Vision and Pattern Recognition, June 2019, pp.1227-1236.

  24. Lea C, Flynn M D, Vidal R, Reiter A, Hager G D. Temporal convolutional networks for action segmentation and detection. In Proc. the 2017 IEEE Conference on Computer Vision and Pattern Recognition, July 2017, pp.1003-1012.

  25. Kim T S, Reiter A. Interpretable 3D human action analysis with temporal convolutional networks. In Proc. the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, July 2017, pp.1623-1631.

  26. Liu J, Shahroudy A, Perez M, Wang G, Duan L Y, Kot A C. NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding. arXiv:1905.04757, 2019. https://arxiv.org/pdf/1905.04757.pdf, Jan. 2020.

  27. Jiang W, Nie X, Xia Y, Wu Y, Zhu S C. Cross-view action modeling, learning and recognition. In Proc. the 2014 IEEE Conference on Computer Vision and Pattern Recognition, June 2014, pp.2649-2656.

  28. Xia L, Chen C C, Aggarwal J K. View invariant human action recognition using histograms of 3D joints. In Proc. the 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, June 2012, pp.20-27.

  29. Liu Z, Zhang C, Tian Y. 3D-based deep convolutional neural network for action recognition with depth sequences. Image and Vision Computing, 2016, 55: 93-100.

  30. Wang P, Li W, Wan J, Ogunbona P, Liu X. Cooperative training of deep aggregation networks for RGB-D action recognition. In Proc. the 32nd AAAI Conference on Artificial Intelligence, February 2018, pp.7404-7411.

  31. Jiang W, Liu Z, Wu Y, Yuan J. Learning actionlet ensemble for 3D human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014, 36(5): 914-927.

  32. Zhang S, Liu X, Xiao J. On geometric features for skeleton-based action recognition using multilayer LSTM networks. In Proc. the 2017 IEEE Winter Conference on Applications of Computer Vision, March 2017, pp.148-157.

  33. Zhang P, Lan C, Xing J, Zeng W, Xue J, Zheng N. View adaptive recurrent neural networks for high performance human action recognition from skeleton data. In Proc. the 2017 IEEE International Conference on Computer Vision, October 2017, pp.2136-2145.

  34. Ke Q, Bennamoun M, An S, Sohel F, Boussaïd F. A new representation of skeleton sequences for 3D action recognition. In Proc. the 2017 IEEE Conference on Computer Vision and Pattern Recognition, July 2017, pp.4570-4579.

  35. Ghorbel E, Boonaert J, Boutteau R, Lecoeuche S, Savatier X. An extension of kernel learning methods using a modified Log-Euclidean distance for fast and accurate skeleton-based human action recognition. Computer Vision and Image Understanding, 2018, 175: 32-43.

  36. Yuan J, Liu Z, Wu Y. Discriminative subvolume search for efficient action detection. In Proc. the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, June 2009, pp.2442-2449.

  37. Liu M, Shi Y, Zheng L, Xu K, Huang H, Manocha D. Recurrent 3D attentional networks for end-to-end active object recognition. Computational Visual Media, 2019, 5(1): 91-104.

  38. Ioffe S, Szegedy C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proc. the 32nd International Conference on Machine Learning, July 2015, pp.448-456.

  39. He K, Zhang X, Ren S, Sun J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proc. the 2015 IEEE International Conference on Computer Vision, December 2015, pp.1026-1034.

  40. Abadi M, Agarwal A, Barham P et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467, 2016. https://arxiv.org/abs/1603.04467, Jan. 2020.

  41. Ofli F, Chaudhry R, Kurillo G, Vidal R, Bajcsy R. Sequence of the most informative joints (SMIJ): A new representation for human skeletal action recognition. In Proc. the 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, June 2012, pp.8-13.

  42. Zhao R, Wang K, Su H, Ji Q. Bayesian graph convolution LSTM for skeleton based action recognition. In Proc. the 2019 IEEE International Conference on Computer Vision, October 2019, pp.6881-6891.

  43. Yu Z, Chen W, Guo G. Fusing spatiotemporal features and joints for 3D action recognition. In Proc. the 2013 IEEE Conference on Computer Vision and Pattern Recognition, June 2013, pp.486-491.

Author information

Corresponding author

Correspondence to Yuan-Feng Zhou.

Electronic supplementary material

ESM 1

(PDF 222 kb)

About this article

Cite this article

Jia, JG., Zhou, YF., Hao, XW. et al. Two-Stream Temporal Convolutional Networks for Skeleton-Based Human Action Recognition. J. Comput. Sci. Technol. 35, 538–550 (2020). https://doi.org/10.1007/s11390-020-0405-6

