Skeleton-based human action recognition by fusing attention based three-stream convolutional neural network and SVM

Ren, Fang; Tang, Chao; Tong, Anyang; Wang, Wenjian

doi:10.1007/s11042-023-15334-9

Published: 23 May 2023

Volume 83, pages 6273–6295 (2024)

Multimedia Tools and Applications

Fang Ren¹,
Chao Tang ORCID: orcid.org/0000-0002-8934-9537¹,
Anyang Tong¹ &
…
Wenjian Wang²

Abstract

This work proposes a method for human action recognition from 3D skeleton sequences that fuses an attention-based three-stream convolutional neural network with a support vector machine. Traditional action recognition methods primarily take RGB video as input. However, RGB video suffers from large data volume and low semantic content, and it leaves the model prone to interference from irrelevant information such as the background. The 3D skeleton sequence, by contrast, carries compact, high-level information about human motion, which facilitates action recognition. First, the 3D coordinates, temporal-difference information, and spatial-difference information of the joints are extracted from the raw skeleton data, and each is fed into its own convolutional neural network for pre-training. Then, the pre-trained networks extract features containing spatial-temporal information. Finally, the fused feature vectors are input into the support vector machine for training and classification. On the public NTU RGB+D dataset, the method reaches accuracies of 92.6% and 86.7% under the X-View and X-Sub benchmarks, respectively, indicating that combining multistream feature learning, feature fusion, and a hybrid model can improve recognition accuracy.
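The three input streams named in the abstract can be illustrated with a small sketch. The abstract does not define the difference operators, so the code below assumes that the temporal difference is the per-joint displacement between consecutive frames and that the spatial difference is the offset of each joint from a parent joint in the same frame. The function names and the `parents` list are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Joint = Tuple[float, float, float]   # (x, y, z) coordinates of one joint
Frame = List[Joint]                  # all joints in one frame


def temporal_difference(seq: List[Frame]) -> List[Frame]:
    """Per-joint displacement between consecutive frames (assumed definition).

    A sequence of T frames yields T - 1 difference frames.
    """
    return [
        [tuple(c - p for c, p in zip(cur_j, prev_j))
         for cur_j, prev_j in zip(cur, prev)]
        for prev, cur in zip(seq, seq[1:])
    ]


def spatial_difference(seq: List[Frame], parents: List[int]) -> List[Frame]:
    """Offset of each joint from its parent joint within a frame (assumed definition).

    parents[j] is the index of joint j's parent; a root joint points to itself.
    """
    return [
        [tuple(a - b for a, b in zip(frame[j], frame[parents[j]]))
         for j in range(len(frame))]
        for frame in seq
    ]


def three_streams(seq: List[Frame], parents: List[int]):
    """Return the raw coordinates plus the two difference streams."""
    return seq, temporal_difference(seq), spatial_difference(seq, parents)


# Toy sequence: 3 frames, 2 joints. Joint 0 is the root, joint 1 is its child.
seq = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 1.0, 0.0), (1.0, 2.0, 0.0)],
    [(0.0, 2.0, 0.0), (1.0, 4.0, 0.0)],
]
parents = [0, 0]
coords, t_diff, s_diff = three_streams(seq, parents)
```

In the described pipeline each stream would feed its own convolutional network, and the features extracted by the three pre-trained networks would then be combined and passed to a support vector machine. Those stages are not sketched here because the abstract gives no architectural details.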

Data Availability

The datasets used or analyzed during the current study are available from the corresponding author on reasonable request.

Acknowledgements

This work was supported in part by the National Natural Science Foundation (62076154, U21A20513), the Anhui Provincial Natural Science Foundation (2008085MF202), the University Natural Sciences Research Project of Anhui Province (KJ2020A0660), the Open Project of Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, Anhui University (MMC202003), the Scientific Research Projects for Graduate Students in Anhui Universities (YJS20210564), and the Anhui Province Student Innovation Training Project (S202111059266, S202111059016).

Author information

Authors and Affiliations

School of Artificial Intelligence and Big Data, Hefei University, Hefei, China
Fang Ren, Chao Tang & Anyang Tong
School of Computer and Information Technology, Shanxi University, Taiyuan, China
Wenjian Wang

Corresponding author

Correspondence to Chao Tang.

Ethics declarations

Conflict of Interests

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Ren, F., Tang, C., Tong, A. et al. Skeleton-based human action recognition by fusing attention based three-stream convolutional neural network and SVM. Multimed Tools Appl 83, 6273–6295 (2024). https://doi.org/10.1007/s11042-023-15334-9

Received: 10 November 2022
Revised: 27 March 2023
Accepted: 06 April 2023
Published: 23 May 2023
Issue Date: January 2024
DOI: https://doi.org/10.1007/s11042-023-15334-9
