
Skeleton-based human action recognition using LSTM and depthwise separable convolutional neural network

Published in: Applied Intelligence

Abstract

In the field of computer vision, human action recognition (HAR) remains challenging because of the difficulty of capturing nuanced human movements from video data. To address this issue, researchers have developed various algorithms. In this study, a novel two-stream architecture is developed that combines LSTM with a depthwise separable convolutional neural network (DSConV) and skeleton information, with the aim of enhancing the accuracy of HAR. The 3D coordinates of each skeleton joint are extracted with the MediaPipe library, and the 2D coordinates are obtained with MoveNet. The proposed method comprises two streams, called the temporal LSTM module and the joint-motion module, and was designed to overcome the limitations of prior two-stream RNN models, such as the vanishing gradient problem and the difficulty of effectively extracting temporal-spatial information. A performance evaluation on the benchmark datasets JHMDB (73.31%), Florence-3D Action (97.67%), SBU Interaction (95.2%), and Penn Action (94.0%) demonstrates the effectiveness of the proposed model, and a comparison with state-of-the-art methods shows its superior performance on these datasets. This study contributes to advancing the field of HAR, with potential applications in surveillance and robotics.
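As a rough illustration of the pipeline described in the abstract, the sketch below extracts per-frame 3D joint coordinates with MediaPipe Pose and feeds them to a minimal two-stream Keras network: an LSTM branch over the joint coordinates (temporal stream) and depthwise separable 1D convolutions over frame-to-frame joint displacements (joint-motion stream), fused by concatenation before classification. The abstract does not specify the architecture, so the layer sizes, the fusion scheme, the use of SeparableConv1D, and names such as build_two_stream_model and sample_clip.mp4 are assumptions for demonstration only; the paper's actual model and its MoveNet-based 2D stream are not reproduced here.

# Illustrative sketch only; architecture details are assumptions, not the paper's method.
import numpy as np
import cv2
import mediapipe as mp
import tensorflow as tf
from tensorflow.keras import layers, Model

def extract_3d_skeleton(video_path, max_frames=32):
    """Extract per-frame 3D coordinates of the 33 MediaPipe Pose landmarks."""
    pose = mp.solutions.pose.Pose(static_image_mode=False)
    cap = cv2.VideoCapture(video_path)
    frames = []
    while len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if result.pose_landmarks:
            joints = [(lm.x, lm.y, lm.z) for lm in result.pose_landmarks.landmark]
            frames.append(np.array(joints, dtype=np.float32))  # shape (33, 3)
    cap.release()
    pose.close()
    return np.stack(frames) if frames else np.zeros((max_frames, 33, 3), np.float32)

def build_two_stream_model(num_frames=32, num_joints=33, num_classes=21):
    """Hypothetical two-stream network: LSTM over joint coordinates plus
    depthwise separable convolutions over joint displacements."""
    coords_in = layers.Input(shape=(num_frames, num_joints * 3), name="joint_coords")
    motion_in = layers.Input(shape=(num_frames - 1, num_joints * 3), name="joint_motion")

    # Temporal stream: stacked LSTMs model long-range temporal dependencies.
    t = layers.LSTM(128, return_sequences=True)(coords_in)
    t = layers.LSTM(128)(t)

    # Joint-motion stream: depthwise separable 1D convolutions over displacements.
    m = layers.SeparableConv1D(64, 3, padding="same", activation="relu")(motion_in)
    m = layers.SeparableConv1D(128, 3, padding="same", activation="relu")(m)
    m = layers.GlobalAveragePooling1D()(m)

    fused = layers.Concatenate()([t, m])
    out = layers.Dense(num_classes, activation="softmax")(fused)
    return Model([coords_in, motion_in], out)

# Usage: skeleton sequence -> two input tensors -> class probabilities.
skel = extract_3d_skeleton("sample_clip.mp4")      # (T, 33, 3); hypothetical file
coords = skel.reshape(1, skel.shape[0], -1)        # (1, T, 99)
motion = np.diff(coords, axis=1)                   # (1, T-1, 99) joint displacements
model = build_two_stream_model(num_frames=skel.shape[0])
probs = model.predict([coords, motion])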


Data availability

The data and materials presented in this study are available on request from the corresponding author.

Code availability

The code supporting this research is available on request from the corresponding author (cklu@ntnu.edu.tw) or the first author (lehoangcongk16spkt@gmail.com).


Acknowledgements

This work was financially supported by the “Chinese Language and Technology Center” of National Taiwan Normal University (NTNU) through the Featured Areas Research Center Program within the framework of the Higher Education Sprout Project by the Ministry of Education (MOE) in Taiwan, and by the National Science and Technology Council, Taiwan, under Grant Nos. NSTC 112-2221-E-003-007, NSTC 112-2221-E-003-008, and NSTC 112-2221-E-003-010.

Author information

Corresponding author

Correspondence to Cheng-Kai Lu.

Ethics declarations

Competing interests

All authors declare that they have no conflicts of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Le, H., Lu, CK., Hsu, CC. et al. Skeleton-based human action recognition using LSTM and depthwise separable convolutional neural network. Appl Intell 55, 298 (2025). https://doi.org/10.1007/s10489-024-06082-w

