
Towards a deep human activity recognition approach based on video to image transformation with skeleton data

Published in Multimedia Tools and Applications.

Abstract

Human Activity Recognition (HAR), which aims to analyze and detect human actions, is one of the most challenging recent tasks in computer vision, with applications in fields such as video surveillance, behavior analysis and healthcare. Several works in the literature rely on the extraction and analysis of human skeletons for action recognition. This paper introduces a new HAR approach based on the extraction of human skeletons from videos. Three feature extraction techniques are proposed. Each uses the skeletons extracted from a video's frames to construct a single image that summarizes the activity in that video. The first technique, called dynamic skeleton, builds on the concept of dynamic images introduced in the literature; the second, called skeleton superposition, superposes the extracted human skeletons in a single image; the third, called body articulations, uses only the body joints rather than the whole skeleton to recognize the ongoing activity. The images produced by these three techniques are classified using a transfer learning system that fine-tunes three well-known pre-trained CNNs (MobileNet, ResNet-50, VGG16). The designed system is validated and tested on two widely used human activity recognition datasets, RGBD-HuDaAct and KTH. The obtained results show that the implemented system outperforms state-of-the-art approaches.
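The video-to-image summarization the abstract describes can be illustrated with a minimal sketch. The following is not the authors' implementation; it assumes frames are small 2-D grids of floats and joints are (row, col) pairs, and it uses the approximate rank pooling coefficients from the dynamic-images literature (alpha_t = 2t − T − 1) for the dynamic-skeleton idea, plus a naive canvas overlay for skeleton superposition:

```python
# Hedged sketch of two video-to-image summaries: a rank-pooled "dynamic"
# image and a superposition canvas. All names here are illustrative.

def rank_pooling_weights(num_frames):
    """Approximate rank pooling coefficients alpha_t = 2t - T - 1
    for t = 1..T, as used for dynamic images in the literature."""
    T = num_frames
    return [2 * t - T - 1 for t in range(1, T + 1)]

def dynamic_image(frames):
    """Collapse a list of equally sized 2-D frames into one image by
    a weighted sum with the rank pooling coefficients."""
    alphas = rank_pooling_weights(len(frames))
    h, w = len(frames[0]), len(frames[0][0])
    out = [[0.0] * w for _ in range(h)]
    for alpha, frame in zip(alphas, frames):
        for i in range(h):
            for j in range(w):
                out[i][j] += alpha * frame[i][j]
    return out

def superpose_joints(joint_sets, h, w):
    """Mark every joint from every frame on a single h x w canvas,
    mimicking the skeleton-superposition idea."""
    canvas = [[0.0] * w for _ in range(h)]
    for joints in joint_sets:
        for (r, c) in joints:
            canvas[r][c] = 1.0
    return canvas
```

The resulting single images would then be fed to an ordinary image classifier (e.g. a fine-tuned pre-trained CNN), which is what lets the approach reuse 2-D networks for video data.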





Acknowledgements

The authors would like to acknowledge the financial support of this work by grants from the General Direction of Scientific Research (DGRST), Tunisia, under the ARUB program.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Ahmed Snoun.

Ethics declarations

Conflict of Interests

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Snoun, A., Jlidi, N., Bouchrika, T. et al. Towards a deep human activity recognition approach based on video to image transformation with skeleton data. Multimed Tools Appl 80, 29675–29698 (2021). https://doi.org/10.1007/s11042-021-11188-1

