Abstract
Skeleton-based action recognition has recently become increasingly prevalent in computer vision due to its wide range of applications, and many approaches have been proposed to address this task. Among these methods, manifold representations are widely used to model the relative geometric relationships between different body parts in human skeletons. However, existing studies treat all geometric relationships as equally important and therefore cannot focus on the most discriminative information. In addition, traditional attention mechanisms are designed mostly for Euclidean space and are not directly applicable in manifold space. To address these issues, we propose a spatial and temporal attention mechanism on Lie groups for 3D human action recognition. We build our network architecture around a generalized attention mechanism that extends the scope of attention from Euclidean space to manifold space. Furthermore, our model learns to identify significant spatial features and temporal stages with effective attention modules, which focus on discriminative transformation relationships between rigid bodies within each frame and allocate different levels of attention to different frames. Extensive experiments on standard datasets demonstrate the effectiveness of the proposed network architecture.
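To make the idea concrete, the sketch below illustrates one plausible way attention could be extended from Euclidean to manifold space as the abstract describes: relative rotations between rigid body parts (elements of the Lie group SO(3)) are mapped to the Lie algebra via the logarithm map, and softmax attention weights are computed over the resulting features. The function names, the use of rotation magnitude as a saliency score, and the overall pipeline are illustrative assumptions, not the authors' actual architecture.

```python
import numpy as np
from scipy.spatial.transform import Rotation


def log_map(R):
    """SO(3) logarithm map: rotation matrix -> axis-angle vector in so(3)."""
    return Rotation.from_matrix(R).as_rotvec()


def spatial_attention(rotations):
    """Hypothetical spatial attention over rigid-body pairs in one frame.

    rotations: array of shape (n, 3, 3), relative rotations between body parts.
    Returns softmax attention weights of shape (n,).
    """
    # Lift group elements to the flat Lie algebra, where Euclidean
    # attention machinery (dot products, softmax) applies.
    feats = np.stack([log_map(R) for R in rotations])  # (n, 3)
    # Illustrative saliency score: magnitude of the relative rotation.
    scores = np.linalg.norm(feats, axis=1)
    w = np.exp(scores - scores.max())  # numerically stable softmax
    return w / w.sum()


# Example: an identity rotation vs. a 90-degree rotation about the z-axis.
R_id = np.eye(3)
R_90 = Rotation.from_euler("z", 90, degrees=True).as_matrix()
weights = spatial_attention(np.stack([R_id, R_90]))
```

In this sketch the larger relative rotation receives the larger attention weight, mirroring the paper's goal of emphasizing discriminative transformation relationships; temporal attention over frames could reuse the same log-map features aggregated per frame.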
Acknowledgements
The authors would like to thank the National Natural Science Foundation of China and Postgraduate Innovation Fund of Xidian University for their support.
Cite this article
Ding, C., Liu, K., Cheng, F. et al. Spatio-temporal attention on manifold space for 3D human action recognition. Appl Intell 51, 560–570 (2021). https://doi.org/10.1007/s10489-020-01803-3