Abstract
Human action recognition (HAR) remains a challenging computer vision task because of the difficulty of capturing nuanced human movements from video data. To address this issue, researchers have developed a variety of algorithms. In this study, a novel two-stream architecture is developed that combines a long short-term memory (LSTM) network with a depthwise separable convolutional neural network (DSConV) and skeleton information, with the aim of improving HAR accuracy. The 3D coordinates of each skeleton joint are extracted with the MediaPipe library, and the 2D coordinates are obtained with MoveNet. The proposed method comprises two streams, a temporal LSTM module and a joint-motion module, designed to overcome the limitations of prior two-stream RNN models, such as the vanishing gradient problem and the difficulty of effectively extracting spatial-temporal information. Evaluations on the benchmark datasets JHMDB (73.31%), Florence-3D Action (97.67%), SBU Interaction (95.2%), and Penn Action (94.0%) demonstrate the effectiveness of the proposed model, and comparisons with state-of-the-art methods show its superior performance on these datasets. This study contributes to advancing the field of HAR, with potential applications in surveillance and robotics.
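Since the full architecture is not reproduced on this page, the following is a minimal, illustrative sketch of the two ideas the abstract names: a temporal LSTM stream over per-frame joint coordinates and a joint-motion stream built from depthwise separable convolutions, fused for classification. The layer widths, sequence length, joint-motion encoding, and late-fusion strategy below are assumptions for illustration, not the authors' published implementation.

```python
# Illustrative sketch only: hyperparameters and fusion strategy are
# assumptions, not the architecture published in the paper.
import tensorflow as tf
from tensorflow.keras import layers, Model

T, J, C = 32, 33, 3      # assumed: 32 frames, 33 MediaPipe joints, (x, y, z)
NUM_CLASSES = 21         # e.g. JHMDB defines 21 action classes

# Temporal LSTM stream: each frame is a flattened vector of joint coordinates.
pose_in = layers.Input(shape=(T, J * C), name="pose_sequence")
x1 = layers.LSTM(128, return_sequences=True)(pose_in)
x1 = layers.LSTM(128)(x1)

# Joint-motion stream: frame-to-frame coordinate differences arranged as a
# (T-1) x J x C map, processed with depthwise separable convolutions.
motion_in = layers.Input(shape=(T - 1, J, C), name="joint_motion")
x2 = layers.SeparableConv2D(64, 3, padding="same", activation="relu")(motion_in)
x2 = layers.MaxPooling2D(2)(x2)
x2 = layers.SeparableConv2D(128, 3, padding="same", activation="relu")(x2)
x2 = layers.GlobalAveragePooling2D()(x2)

# Late fusion of the two streams, followed by the action classifier.
fused = layers.Concatenate()([x1, x2])
out = layers.Dense(NUM_CLASSES, activation="softmax")(fused)
model = Model([pose_in, motion_in], out)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

In practice, `pose_sequence` would hold MediaPipe 3D landmarks (each frame yields 33 points with x, y, z via `results.pose_landmarks.landmark`) or MoveNet 2D keypoints, and `joint_motion` their first-order temporal differences, e.g. `np.diff(seq, axis=0)`.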
Data availability
The data and material presented in this study are available on request from the corresponding author.
Code availability
The code supporting this research is available on request from the corresponding author (cklu@ntnu.edu.tw) or the first author (lehoangcongk16spkt@gmail.com).
Acknowledgements
This work was financially supported by the Chinese Language and Technology Center of National Taiwan Normal University (NTNU) through the Featured Areas Research Center Program within the framework of the Higher Education Sprout Project by the Ministry of Education (MOE) in Taiwan, and by the National Science and Technology Council, Taiwan, under Grants NSTC 112-2221-E-003-007, NSTC 112-2221-E-003-008, and NSTC 112-2221-E-003-010.
Ethics declarations
Competing interests
All authors declare that they have no conflicts of interest.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Le, H., Lu, CK., Hsu, CC. et al. Skeleton-based human action recognition using LSTM and depthwise separable convolutional neural network. Appl Intell 55, 298 (2025). https://doi.org/10.1007/s10489-024-06082-w