Two-Stream Temporal Convolutional Networks for Skeleton-Based Human Action Recognition

  • Regular Paper
  • Journal of Computer Science and Technology

Abstract

With the growing popularity of somatosensory interaction devices, human action recognition is becoming attractive in many application scenarios. Skeleton-based action recognition is effective because the skeleton can represent the positions and the structure of key points of the human body. In this paper, we leverage spatiotemporal vectors between skeleton sequences as the input feature representation of the network, which is more sensitive to changes in the human skeleton than representations based on distance and angle features. In addition, we redesign the residual blocks, giving them different strides along the depth of the network, to improve the ability of temporal convolutional networks (TCNs) to process actions with long-term temporal dependencies. We propose the two-stream temporal convolutional networks (TS-TCNs), which take full advantage of both the inter-frame and the intra-frame vector features of skeleton sequences in their spatiotemporal representations. The framework integrates the two feature representations of skeleton sequences so that they compensate for each other's shortcomings. A fusion loss function is used to supervise the training of the two branch networks. Experiments on public datasets show that our network achieves superior performance, attaining an improvement of 1.2% over the recent GCN-based method (BGC-LSTM) on the NTU RGB+D dataset.
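
The abstract describes three ingredients: inter-frame and intra-frame vector features computed from the skeleton sequence, residual temporal-convolution blocks whose stride grows with network depth, and two branches supervised jointly through a fusion loss. The sketch below is only an illustration of these ideas, not the authors' implementation: the skeleton topology (edge list), kernel sizes, channel widths, and the fusion weight alpha are assumptions chosen for the example, and PyTorch is used purely for brevity.

```python
# Minimal sketch of the ideas in the abstract (illustrative assumptions, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


def intra_frame_vectors(skel, edges):
    """Vectors between connected joints inside each frame.

    skel:  (batch, frames, joints, 3) joint coordinates
    edges: list of (parent, child) joint-index pairs (assumed skeleton topology)
    returns (batch, frames, len(edges), 3)
    """
    parents = torch.tensor([p for p, _ in edges])
    children = torch.tensor([c for _, c in edges])
    return skel[:, :, children, :] - skel[:, :, parents, :]


def inter_frame_vectors(skel):
    """Vectors between corresponding joints in consecutive frames.

    returns (batch, frames - 1, joints, 3)
    """
    return skel[:, 1:] - skel[:, :-1]


class ResidualTCNBlock(nn.Module):
    """1D temporal convolution with a residual connection.

    A larger stride deeper in the stack shortens the sequence and enlarges the
    temporal receptive field, which is the role the abstract attributes to the
    redesigned residual blocks.
    """

    def __init__(self, in_ch, out_ch, kernel_size=9, stride=1):
        super().__init__()
        pad = kernel_size // 2
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, stride=stride, padding=pad)
        self.bn = nn.BatchNorm1d(out_ch)
        # 1x1 convolution matches channels and temporal length on the skip path
        self.skip = nn.Conv1d(in_ch, out_ch, 1, stride=stride)

    def forward(self, x):  # x: (batch, channels, frames)
        return F.relu(self.bn(self.conv(x)) + self.skip(x))


class TCNBranch(nn.Module):
    """One stream: residual blocks with increasing stride, then pooling and a classifier."""

    def __init__(self, in_ch, num_classes):
        super().__init__()
        self.blocks = nn.Sequential(
            ResidualTCNBlock(in_ch, 64, stride=1),
            ResidualTCNBlock(64, 128, stride=2),
            ResidualTCNBlock(128, 256, stride=2),
        )
        self.fc = nn.Linear(256, num_classes)

    def forward(self, x):  # x: (batch, channels, frames)
        h = self.blocks(x).mean(dim=-1)  # global average pooling over time
        return self.fc(h)


def fusion_loss(logits_a, logits_b, labels, alpha=0.5):
    """Supervise both branches and their fused prediction (illustrative weighting)."""
    fused = (logits_a + logits_b) / 2
    return (F.cross_entropy(fused, labels)
            + alpha * (F.cross_entropy(logits_a, labels)
                       + F.cross_entropy(logits_b, labels)))
```

In such a setup, one TCNBranch would consume the intra-frame vectors and the other the inter-frame vectors (each flattened to a (batch, channels, frames) layout), and both branches would be trained together through fusion_loss.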

References

  1. Aggarwal J K, Xia L. Human activity recognition from 3D data: A review. Pattern Recognition Letters, 2014, 48: 70-80.

  2. Weinland D, Ronfard R, Boyer E. A survey of vision-based methods for action representation, segmentation and recognition. Computer Vision and Image Understanding, 2011, 115(2): 224-241.

  3. Han F, Reily B, Hoff W, Zhang H. Space-time representation of people based on 3D skeletal data: A review. Computer Vision and Image Understanding, 2017, 158: 85-105.

  4. Liu H, Liu B, Zhang H, Li L, Qin X, Zhang G. Crowd evacuation simulation approach based on navigation knowledge and two-layer control mechanism. Information Sciences, 2018, 436/437: 247-267.

  5. Turaga P, Chellappa R, Subrahmanian V S. Machine recognition of human activities: A survey. IEEE Transactions on Circuits and Systems for Video Technology, 2008, 18(11): 1473-1488.

  6. Herath S, Harandi M, Porikli F. Going deeper into action recognition: A survey. Image and Vision Computing, 2017, 60: 4-21.

  7. Hou J H, Chau L P, Thalmann N M, He Y. Compressing 3-D human motions via keyframe-based geometry videos. IEEE Transactions on Circuits and Systems for Video Technology, 2014, 25(1): 51-62.

  8. Sermanet P, Lynch C, Hsu J, Levine S. Time-contrastive networks: Self-supervised learning from multi-view observation. In Proc. the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, July 2017, pp.486-487.

  9. Shotton J, Sharp T, Kipman A, Fitzgibbon A, Finocchio M, Blake A, Cook M, Moore R. Real-time human pose recognition in parts from single depth images. Communications of the ACM, 2013, 56(1): 116-124.

  10. Li S, Fang Z, Song W, Hao A, Qin H. Bidirectional optimization coupled lightweight networks for efficient and robust multi-person 2D pose estimation. Journal of Computer Science and Technology, 2019, 34(3): 522-536.

  11. Shahroudy A, Liu J, Ng T T, Wang G. NTU RGB+D: A large scale dataset for 3D human activity analysis. In Proc. the 2016 IEEE Conference on Computer Vision and Pattern Recognition, June 2016, pp.1010-1019.

  12. Zhu F, Shao L, Xie J, Fang Y. From handcrafted to learned representations for human action recognition: A survey. Image and Vision Computing, 2016, 55: 42-52.

  13. Huang Z W, Wan C, Probst T, Van Gool L. Deep learning on Lie groups for skeleton-based action recognition. In Proc. the 2017 IEEE Conference on Computer Vision and Pattern Recognition, July 2017, pp.1243-1252.

  14. Ke Q, An S, Bennamoun M, Sohel F, Boussaid F. Skeleton-Net: Mining deep part features for 3-D action recognition. IEEE Signal Processing Letters, 2017, 24(6): 731-735.

  15. Weng J, Weng C, Yuan J, Liu Z. Discriminative spatiotemporal pattern discovery for 3D action recognition. IEEE Transactions on Circuits and Systems for Video Technology, 2019, 29(4): 1077-1089.

  16. Liu J, Shahroudy A, Xu D, Kot A C, Wang G. Skeleton-based action recognition using spatiotemporal LSTM network with trust gates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(12): 3007-3021.

  17. Lee I, Kim D, Kang S, Lee S. Ensemble deep learning for skeleton-based action recognition using temporal sliding LSTM networks. In Proc. the 2017 IEEE International Conference on Computer Vision, October 2017, pp.1012-1020.

  18. Zhang P, Xue J, Lan C, Zeng W, Gao Z, Zheng N. Adding attentiveness to the neurons in recurrent neural networks. In Proc. the 15th European Conference on Computer Vision, September 2018, pp.136-152.

  19. Meng F, Liu H, Liang Y, Tu J, Liu M. Sample fusion network: An end-to-end data augmentation network for skeleton-based human action recognition. IEEE Transactions on Image Processing, 2019, 28(11): 5281-5295.

  20. Yan S, Xiong Y, Lin D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proc. the 32nd AAAI Conference on Artificial Intelligence, February 2018, pp.7444-7452.

  21. Shi L, Zhang Y, Cheng J, Lu H. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proc. the 2019 IEEE Conference on Computer Vision and Pattern Recognition, June 2019, pp.12026-12035.

  22. Li M, Chen S, Chen X, Zhang Y, Wang Y, Tian Q. Actional-structural graph convolutional networks for skeleton-based action recognition. In Proc. the 2019 IEEE Conference on Computer Vision and Pattern Recognition, June 2019, pp.3595-3603.

  23. Si C, Chen W, Wang W, Wang L, Tan T. An attention enhanced graph convolutional LSTM network for skeleton-based action recognition. In Proc. the 2019 IEEE Conference on Computer Vision and Pattern Recognition, June 2019, pp.1227-1236.

  24. Lea C, Flynn M D, Vidal R, Reiter A, Hager G D. Temporal convolutional networks for action segmentation and detection. In Proc. the 2017 IEEE Conference on Computer Vision and Pattern Recognition, July 2017, pp.1003-1012.

  25. Kim T S, Reiter A. Interpretable 3D human action analysis with temporal convolutional networks. In Proc. the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, July 2017, pp.1623-1631.

  26. Liu J, Shahroudy A, Perez M, Wang G, Duan L Y, Kot A C. NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding. arXiv:1905.04757, 2019. https://arxiv.org/pdf/1905.04757.pdf, Jan. 2020.

  27. Jiang W, Nie X, Xia Y, Wu Y, Zhu S C. Cross-view action modeling, learning and recognition. In Proc. the 2014 IEEE Conference on Computer Vision and Pattern Recognition, June 2014, pp.2649-2656.

  28. Xia L, Chen C C, Aggarwal J K. View invariant human action recognition using histograms of 3D joints. In Proc. the 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, June 2012, pp.20-27.

  29. Liu Z, Zhang C, Tian Y. 3D-based deep convolutional neural network for action recognition with depth sequences. Image and Vision Computing, 2016, 55: 93-100.

  30. Wang P, Li W, Wan J, Ogunbona P, Liu X. Cooperative training of deep aggregation networks for RGB-D action recognition. In Proc. the 32nd AAAI Conference on Artificial Intelligence, February 2018, pp.7404-7411.

  31. Jiang W, Liu Z, Wu Y, Yuan J. Learning actionlet ensemble for 3D human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014, 36(5): 914-927.

  32. Zhang S, Liu X, Xiao J. On geometric features for skeleton-based action recognition using multilayer LSTM networks. In Proc. the 2017 IEEE Winter Conference on Applications of Computer Vision, March 2017, pp.148-157.

  33. Zhang P, Lan C, Xing J, Zeng W, Xue J, Zheng N. View adaptive recurrent neural networks for high performance human action recognition from skeleton data. In Proc. the 2017 IEEE International Conference on Computer Vision, October 2017, pp.2136-2145.

  34. Ke Q, Bennamoun M, An S, Sohel F, Boussaïd F. A new representation of skeleton sequences for 3D action recognition. In Proc. the 2017 IEEE Conference on Computer Vision and Pattern Recognition, July 2017, pp.4570-4579.

  35. Ghorbel E, Boonaert J, Boutteau R, Lecoeuche S, Savatier X. An extension of kernel learning methods using a modified Log-Euclidean distance for fast and accurate skeleton-based human action recognition. Computer Vision and Image Understanding, 2018, 175: 32-43.

  36. Yuan J, Liu Z, Wu Y. Discriminative subvolume search for efficient action detection. In Proc. the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, June 2009, pp.2442-2449.

  37. Liu M, Shi Y, Zheng L, Xu K, Huang H, Manocha D. Recurrent 3D attentional networks for end-to-end active object recognition. Computational Visual Media, 2019, 5(1): 91-104.

  38. Ioffe S, Szegedy C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proc. the 32nd International Conference on Machine Learning, July 2015, pp.448-456.

  39. He K, Zhang X, Ren S, Sun J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proc. the 2015 IEEE International Conference on Computer Vision, December 2015, pp.1026-1034.

  40. Abadi M, Agarwal A, Barham P et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467, 2016. https://arxiv.org/abs/1603.04467, Jan. 2020.

  41. Ofli F, Chaudhry R, Kurillo G, Vidal R, Bajcsy R. Sequence of the most informative joints (SMIJ): A new representation for human skeletal action recognition. In Proc. the 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, June 2012, pp.8-13.

  42. Zhao R, Wang K, Su H, Ji Q. Bayesian graph convolution LSTM for skeleton based action recognition. In Proc. the 2019 IEEE International Conference on Computer Vision, October 2019, pp.6881-6891.

  43. Yu Z, Chen W, Guo G. Fusing spatiotemporal features and joints for 3D action recognition. In Proc. the 2013 IEEE Conference on Computer Vision and Pattern Recognition, June 2013, pp.486-491.

Author information

Corresponding author

Correspondence to Yuan-Feng Zhou.

Electronic supplementary material

ESM 1

(PDF 222 kb)

About this article

Cite this article

Jia, JG., Zhou, YF., Hao, XW. et al. Two-Stream Temporal Convolutional Networks for Skeleton-Based Human Action Recognition. J. Comput. Sci. Technol. 35, 538–550 (2020). https://doi.org/10.1007/s11390-020-0405-6

