Abstract
Human Activity Recognition (HAR) is one of the most challenging recent tasks in computer vision. It aims to analyze and detect human actions for the benefit of fields such as video surveillance, behavior analysis, and healthcare. Several works in the literature rely on extracting and analyzing human skeletons to recognize actions. This paper introduces a new HAR approach based on the extraction of human skeletons from videos. Three feature extraction techniques are proposed. Each uses the skeletons extracted from the video frames to construct a single image that summarizes the activity in the video. The first technique, called dynamic skeleton, builds on the concept of dynamic images introduced in the literature. The second, called skeleton superposition, superposes the extracted human skeletons in a single image. The third, called body articulations, uses only the body joints instead of the whole skeleton to recognize the ongoing activity. The images obtained from these three techniques are classified by a system based on transfer learning, which fine-tunes three well-known pre-trained CNNs (MobileNet, ResNet-50, and VGG16). The system is validated and tested on two well-known human activity recognition datasets, RGBD-HuDaAct and KTH. The results show that the implemented system outperforms state-of-the-art approaches.
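To make the video-to-image idea concrete, the following minimal sketch superposes the body joints of every frame onto one grayscale canvas, in the spirit of the body articulations technique. It is an illustration under stated assumptions, not the authors' implementation: the function name `superpose_joints`, the input layout (per-frame lists of normalized `(x, y)` joint positions), and the canvas size are all hypothetical, and pose estimation itself is assumed to have been done beforehand.

```python
def superpose_joints(frames, height=64, width=64):
    """Superpose the joints of all frames into one grayscale image.

    frames: list of frames; each frame is a list of (x, y) joint
    positions normalized to [0, 1]. This input layout is an assumption
    made for illustration, not the format used in the paper.
    Returns a height x width list of lists with 255 at joint pixels.
    """
    image = [[0] * width for _ in range(height)]
    for joints in frames:
        for x, y in joints:
            # Scale to pixel indices and clamp to the canvas.
            col = min(max(int(round(x * (width - 1))), 0), width - 1)
            row = min(max(int(round(y * (height - 1))), 0), height - 1)
            image[row][col] = 255
    return image


# Two frames of a hypothetical two-joint "skeleton".
video = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(0.5, 0.5), (1.0, 1.0)],
]
summary = superpose_joints(video, height=5, width=5)
```

The resulting single image could then be resized and passed to an image classifier such as a fine-tuned CNN. Drawing the bones between joints as well would give a rough analogue of skeleton superposition.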
Acknowledgements
The authors acknowledge the financial support of this work through grants from the General Direction of Scientific Research (DGRST), Tunisia, under the ARUB program.
Ethics declarations
Conflict of Interests
The authors declare that they have no conflict of interest.
About this article
Cite this article
Snoun, A., Jlidi, N., Bouchrika, T. et al. Towards a deep human activity recognition approach based on video to image transformation with skeleton data. Multimed Tools Appl 80, 29675–29698 (2021). https://doi.org/10.1007/s11042-021-11188-1
DOI: https://doi.org/10.1007/s11042-021-11188-1