research-article

Part-wise Spatio-temporal Attention Driven CNN-based 3D Human Action Recognition

Published: 22 July 2021

Abstract

Recently, human activity recognition using skeleton data has been gaining attention due to its ease of acquisition and finer shape details. Still, it suffers from wide intra-class variation, inter-class similarity among actions, and view variation, which make the extraction of discriminative spatial and temporal features a challenging problem. In this regard, we present a novel Residual Inception Attention Driven CNN (RIAC-Net), which visualizes the dynamics of the action in a part-wise manner. The complete skeleton is partitioned into five key parts: Head to Spine, Left Leg, Right Leg, Left Hand, and Right Hand. For each part, a Compact Action Skeleton Sequence (CASS) is defined. Part-wise skeleton-based motion dynamics highlight discriminative local features of the skeleton, which helps to overcome the challenges of inter-class similarity and intra-class variation with improved recognition performance. The RIAC-Net architecture is inspired by the concept of inception-residual representation, unifying Attention Driven Residues (ADR) with inception-based Spatio-Temporal Convolution Features (STCF) to learn efficient salient action features. An ablation study is also carried out to analyze the effect of ADR over simple residue-based action representation. The robustness of the proposed framework is evaluated through extensive experiments on four challenging datasets: UT Kinect Action 3D, Florence 3D Action, MSR Daily Action 3D, and NTU RGB+D, which consistently demonstrate the superiority of the proposed method over other state-of-the-art methods.
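To make the two ingredients named in the abstract concrete, the sketch below shows how a per-part Compact Action Skeleton Sequence could be sliced out of a joint sequence, and how an inception-style spatio-temporal convolution block with an attention-gated skip path might be wired in PyTorch. This is a minimal illustration, not the authors' RIAC-Net: the joint indices (Kinect-v2-style ordering), branch widths, and kernel sizes are all assumptions.

```python
# Hedged sketch of part-wise CASS slicing and an attention-gated
# inception-residual block. Joint indices, branch widths, and kernel
# sizes are illustrative guesses, NOT the paper's exact configuration.
import torch
import torch.nn as nn

# Hypothetical joint indices for the five body parts (Kinect-v2-style skeleton).
PARTS = {
    "head_to_spine": [3, 2, 20, 1, 0],
    "left_hand":     [4, 5, 6, 7],
    "right_hand":    [8, 9, 10, 11],
    "left_leg":      [12, 13, 14, 15],
    "right_leg":     [16, 17, 18, 19],
}

def part_cass(skeleton: torch.Tensor, part: str) -> torch.Tensor:
    """Slice a (T, J, 3) joint sequence into a per-part sequence (CASS)."""
    return skeleton[:, PARTS[part], :]

class InceptionResidualAttention(nn.Module):
    """Inception-style spatio-temporal convolutions plus an attention-gated residue."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        b = out_ch // 4  # per-branch width; out_ch assumed divisible by 4
        self.b1 = nn.Conv2d(in_ch, b, kernel_size=1)
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, b, 1), nn.Conv2d(b, b, 3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, b, 1), nn.Conv2d(b, b, 5, padding=2))
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, b, 1))
        self.attn = nn.Sequential(nn.Conv2d(in_ch, out_ch, 1), nn.Sigmoid())
        self.skip = nn.Conv2d(in_ch, out_ch, 1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Inception branches over the (time, joint) grid -> STCF
        stcf = torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)
        # Attention-gated skip connection -> ADR
        adr = self.attn(x) * self.skip(x)
        return self.relu(stcf + adr)
```

Under these assumptions, each part's CASS would be arranged as a (batch, channels, frames, joints) tensor and passed through a stack of such blocks, with the part-wise outputs fused before classification.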



Information

    Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 17, Issue 3
    August 2021
    443 pages
    ISSN:1551-6857
    EISSN:1551-6865
    DOI:10.1145/3476118
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 22 July 2021
    Accepted: 01 December 2020
    Revised: 01 December 2020
    Received: 01 July 2020
    Published in TOMM Volume 17, Issue 3


    Author Tags

1. attention
    2. human action recognition
    3. inception
    4. residues
    5. skeleton
    6. spatio-temporal action representation

    Qualifiers

    • Research-article
    • Refereed


    Article Metrics

• Downloads (Last 12 months): 80
• Downloads (Last 6 weeks): 5
    Reflects downloads up to 03 Mar 2025


    Cited By

• (2025) Graph Convolutional Networks for multi-modal robotic martial arts leg pose recognition. Frontiers in Neurorobotics, 18. DOI: 10.3389/fnbot.2024.1520983. Online publication date: 20-Jan-2025.
• (2025) Human activity recognition: A review of deep learning-based methods. IET Computer Vision, 19(1). DOI: 10.1049/cvi2.70003. Online publication date: Feb-2025.
• (2025) Two-stream spatio-temporal GCN-transformer networks for skeleton-based action recognition. Scientific Reports, 15(1). DOI: 10.1038/s41598-025-87752-8. Online publication date: 10-Feb-2025.
• (2025) Object pose tracking using multimodal knowledge from RGB images and quaternion-based rotation contexts. Applied Soft Computing, 170, 112699. DOI: 10.1016/j.asoc.2025.112699. Online publication date: Feb-2025.
• (2024) A Review of State-of-the-Art Methodologies and Applications in Action Recognition. Electronics, 13(23), 4733. DOI: 10.3390/electronics13234733. Online publication date: 29-Nov-2024.
• (2024) MultiWave-Net: An Optimized Spatiotemporal Network for Abnormal Action Recognition Using Wavelet-Based Channel Augmentation. AI, 5(1), 259-289. DOI: 10.3390/ai5010014. Online publication date: 24-Jan-2024.
• (2024) An integrated framework for multi-granular explanation of video summarization. Frontiers in Signal Processing, 4. DOI: 10.3389/frsip.2024.1433388. Online publication date: 24-Dec-2024.
• (2024) RL-CWtrans Net: multimodal swimming coaching driven via robot vision. Frontiers in Neurorobotics, 18. DOI: 10.3389/fnbot.2024.1439188. Online publication date: 14-Aug-2024.
• (2024) How to Improve Video Analytics with Action Recognition: A Survey. ACM Computing Surveys, 57(1), 1-36. DOI: 10.1145/3679011. Online publication date: 7-Oct-2024.
• (2024) Two-stream Multi-level Dynamic Point Transformer for Two-person Interaction Recognition. ACM Transactions on Multimedia Computing, Communications, and Applications, 20(5), 1-22. DOI: 10.1145/3639470. Online publication date: 7-Feb-2024.
