
A transfer learning-based efficient spatiotemporal human action recognition framework for long and overlapping action classes

The Journal of Supercomputing

Abstract

Deep learning-based solutions for computer vision have made life easier for humans. Video data contain a wealth of hidden information and patterns that can be used for Human Action Recognition (HAR). HAR applies to many areas, such as behavior analysis, intelligent video surveillance, and robotic vision. Occlusion, viewpoint variation, and illumination changes are among the issues that make the HAR task difficult. Some action classes contain similar or overlapping sub-actions; this, among many other problems, is the largest contributor to misclassification. Traditional hand-engineered and machine learning-based solutions lack the ability to handle such overlapping actions. In this paper, we propose a deep learning-based spatiotemporal HAR framework for overlapping human actions in long videos. Transfer learning is used for deep feature extraction: fine-tuned pre-trained CNN models learn spatial relationships at the frame level. An optimized deep autoencoder compresses the high-dimensional deep features, and an RNN with Long Short-Term Memory (LSTM) units learns the long-term temporal relationships. An iterative module is added at the end to fine-tune the trained model on new videos so that it learns and adapts to changes. Our proposed framework achieves state-of-the-art performance in spatiotemporal HAR for overlapping human actions in long visual data streams from non-stationary surveillance environments.
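The article does not include source code; the following Python (TensorFlow/Keras) sketch only illustrates the pipeline the abstract describes: frame-level features from a pre-trained CNN, a deep autoencoder that compresses them, an LSTM classifier over the compressed sequence, and an iterative low-learning-rate adaptation step. The choice of InceptionV3, the layer sizes, sequence length, and class count are illustrative assumptions, not the authors' reported configuration.

import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import InceptionV3

# Illustrative hyperparameters (assumed, not taken from the paper).
SEQ_LEN, FEAT_DIM, CODE_DIM, NUM_CLASSES = 30, 2048, 256, 101

# 1) Frame-level spatial features from an ImageNet-pre-trained CNN.
cnn = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

def extract_features(frames):
    # frames: (SEQ_LEN, 299, 299, 3), already preprocessed for InceptionV3.
    return cnn.predict(frames, verbose=0)  # -> (SEQ_LEN, FEAT_DIM)

# 2) Deep autoencoder that squeezes the high-dimensional deep features.
inp = layers.Input(shape=(FEAT_DIM,))
enc = layers.Dense(1024, activation="relu")(inp)
code = layers.Dense(CODE_DIM, activation="relu")(enc)
dec = layers.Dense(1024, activation="relu")(code)
out = layers.Dense(FEAT_DIM, activation="linear")(dec)
autoencoder = models.Model(inp, out)
encoder = models.Model(inp, code)  # compression half, used at inference
autoencoder.compile(optimizer="adam", loss="mse")

# 3) LSTM classifier over the compressed frame sequence.
classifier = models.Sequential([
    layers.Input(shape=(SEQ_LEN, CODE_DIM)),
    layers.LSTM(128),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
classifier.compile(optimizer="adam", loss="categorical_crossentropy",
                   metrics=["accuracy"])

# 4) Iterative adaptation: fine-tune on newly collected videos with a small
# learning rate so previously learned weights are largely preserved.
def adapt(new_sequences, new_labels, epochs=2):
    classifier.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
                       loss="categorical_crossentropy", metrics=["accuracy"])
    classifier.fit(new_sequences, new_labels, epochs=epochs, verbose=0)

In use, features for each video would be extracted once with extract_features, compressed with encoder.predict, stacked into fixed-length sequences, and passed to classifier.fit; adapt would then be called periodically as new surveillance footage arrives.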

Acknowledgements

This research was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (NRF-2019R1F1A1060668) and by the Chung-Ang University Research Grants in 2021.

Author information

Correspondence to Muazzam Maqsood or Seungmin Rho.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This paper is an extended version of our paper published in the Proceedings of the 2020 International Conference on Artificial Intelligence (ICAI), Las Vegas, USA, 27–30 July 2020.

About this article

Cite this article

Bilal, M., Maqsood, M., Yasmin, S. et al. A transfer learning-based efficient spatiotemporal human action recognition framework for long and overlapping action classes. J Supercomput 78, 2873–2908 (2022). https://doi.org/10.1007/s11227-021-03957-4
