
Skeleton-based human action recognition using LSTM and depthwise separable convolutional neural network

Published in: Applied Intelligence

Abstract

In the field of computer vision, human action recognition (HAR) remains challenging because of the difficulty of capturing nuanced human movements from video data. To address this issue, researchers have developed various algorithms. In this study, a novel two-stream architecture is developed that combines LSTM with a depthwise separable convolutional neural network (DSConV) and skeleton information, with the aim of enhancing the accuracy of HAR. The 3D coordinates of each skeleton joint are extracted with the MediaPipe library, and the 2D coordinates are obtained with MoveNet. The proposed method comprises two streams, called the temporal LSTM module and the joint-motion module, and was designed to overcome the limitations of prior two-stream RNN models, such as the vanishing gradient problem and the difficulty of effectively extracting temporal-spatial information. A performance evaluation on the benchmark datasets JHMDB (73.31%), Florence-3D Action (97.67%), SBU Interaction (95.2%), and Penn Action (94.0%) demonstrates the effectiveness of the proposed model, and a comparison with state-of-the-art methods shows its superior performance on these datasets. This study contributes to advancing the field of HAR, with potential applications in surveillance and robotics.
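As a rough illustration of the pipeline described in the abstract, the sketch below extracts per-frame 3D joint coordinates with MediaPipe Pose and feeds them to a minimal two-stream Keras network: an LSTM branch over the joint coordinates (temporal stream) and depthwise separable 1D convolutions over frame-to-frame joint displacements (joint-motion stream), fused by concatenation before classification. The abstract does not specify the architecture, so the layer sizes, the fusion scheme, the use of SeparableConv1D, and names such as build_two_stream_model and sample_clip.mp4 are assumptions for demonstration only; the paper's actual model and its MoveNet-based 2D stream are not reproduced here.

# Illustrative sketch only; architecture details are assumptions, not the paper's method.
import numpy as np
import cv2
import mediapipe as mp
import tensorflow as tf
from tensorflow.keras import layers, Model

def extract_3d_skeleton(video_path, max_frames=32):
    """Extract per-frame 3D coordinates of the 33 MediaPipe Pose landmarks."""
    pose = mp.solutions.pose.Pose(static_image_mode=False)
    cap = cv2.VideoCapture(video_path)
    frames = []
    while len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if result.pose_landmarks:
            joints = [(lm.x, lm.y, lm.z) for lm in result.pose_landmarks.landmark]
            frames.append(np.array(joints, dtype=np.float32))  # shape (33, 3)
    cap.release()
    pose.close()
    return np.stack(frames) if frames else np.zeros((max_frames, 33, 3), np.float32)

def build_two_stream_model(num_frames=32, num_joints=33, num_classes=21):
    """Hypothetical two-stream network: LSTM over joint coordinates plus
    depthwise separable convolutions over joint displacements."""
    coords_in = layers.Input(shape=(num_frames, num_joints * 3), name="joint_coords")
    motion_in = layers.Input(shape=(num_frames - 1, num_joints * 3), name="joint_motion")

    # Temporal stream: stacked LSTMs model long-range temporal dependencies.
    t = layers.LSTM(128, return_sequences=True)(coords_in)
    t = layers.LSTM(128)(t)

    # Joint-motion stream: depthwise separable 1D convolutions over displacements.
    m = layers.SeparableConv1D(64, 3, padding="same", activation="relu")(motion_in)
    m = layers.SeparableConv1D(128, 3, padding="same", activation="relu")(m)
    m = layers.GlobalAveragePooling1D()(m)

    fused = layers.Concatenate()([t, m])
    out = layers.Dense(num_classes, activation="softmax")(fused)
    return Model([coords_in, motion_in], out)

# Usage: skeleton sequence -> two input tensors -> class probabilities.
skel = extract_3d_skeleton("sample_clip.mp4")      # (T, 33, 3); hypothetical file
coords = skel.reshape(1, skel.shape[0], -1)        # (1, T, 99)
motion = np.diff(coords, axis=1)                   # (1, T-1, 99) joint displacements
model = build_two_stream_model(num_frames=skel.shape[0])
probs = model.predict([coords, motion])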


Data availability

The data and materials presented in this study are available on request from the corresponding author.

Code availability

The code supporting this research is available on request from the corresponding author (cklu@ntnu.edu.tw) or the first author (lehoangcongk16spkt@gmail.com).


Acknowledgements

This work was financially supported by the “Chinese Language and Technology Center” of National Taiwan Normal University (NTNU) through the Featured Areas Research Center Program within the framework of the Higher Education Sprout Project by the Ministry of Education (MOE) in Taiwan, and by the National Science and Technology Council, Taiwan, under Grant Nos. NSTC 112-2221-E-003-007, NSTC 112-2221-E-003-008, and NSTC 112-2221-E-003-010.

Author information

Corresponding author

Correspondence to Cheng-Kai Lu.

Ethics declarations

Competing interests

All authors declare that they have no conflicts of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Le, H., Lu, CK., Hsu, CC. et al. Skeleton-based human action recognition using LSTM and depthwise separable convolutional neural network. Appl Intell 55, 298 (2025). https://doi.org/10.1007/s10489-024-06082-w

