Abstract
Deep learning-based solutions for computer vision have made life easier for humans. Video data contain a great deal of hidden information and patterns that can be exploited for Human Action Recognition (HAR). HAR applies to many areas, such as behavior analysis, intelligent video surveillance, and robotic vision. Occlusion, viewpoint variation, and illumination changes are some of the issues that make the HAR task more difficult. Some action classes contain similar actions or overlapping parts, which, among many other problems, contributes the most to misclassification. Traditional hand-engineered and machine learning-based solutions lack the ability to handle overlapping actions. In this paper, we propose a deep learning-based spatiotemporal HAR framework for overlapping human actions in long videos. Transfer learning techniques are used for deep feature extraction: fine-tuned pre-trained CNN models learn the spatial relationships at the frame level, an optimized deep autoencoder squeezes the high-dimensional deep features, and a recurrent neural network (RNN) with long short-term memory (LSTM) learns the long-term temporal relationships. An iterative module added at the end fine-tunes the trained model on new videos so that it learns and adapts to changes. Our proposed framework achieves state-of-the-art performance in spatiotemporal HAR for overlapping human actions in long visual data streams from non-stationary surveillance environments.
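The pipeline described in the abstract — per-frame CNN features, autoencoder compression, then temporal modeling — can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the feature extractor is a random stand-in for a fine-tuned CNN, the encoder is a single linear layer standing in for the optimized deep autoencoder, and all dimensions (30 frames, 2048-d features, 256-d codes) are hypothetical choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, not taken from the paper.
N_FRAMES, FEAT_DIM, CODE_DIM = 30, 2048, 256

def extract_cnn_features(frames):
    # Stand-in for a fine-tuned pre-trained CNN: one deep feature
    # vector per frame (random here, for shape illustration only).
    return rng.standard_normal((len(frames), FEAT_DIM))

def autoencoder_encode(feats, W_enc):
    # Encoder half of a deep autoencoder: squeezes each
    # high-dimensional frame descriptor into a compact code.
    return np.tanh(feats @ W_enc)

frames = [None] * N_FRAMES                   # placeholder video frames
feats = extract_cnn_features(frames)         # (30, 2048) spatial features
W_enc = rng.standard_normal((FEAT_DIM, CODE_DIM)) * 0.01
codes = autoencoder_encode(feats, W_enc)     # (30, 256) compressed sequence
# This (frames x code_dim) sequence is what an LSTM would consume
# to model long-term temporal relationships across the video.
print(codes.shape)
```

The point of the sketch is the shape flow: the autoencoder shrinks the per-frame descriptor roughly eightfold before the recurrent stage, which is what keeps the LSTM tractable on long videos.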











Acknowledgements
This research was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (NRF-2019R1F1A1060668) and also the Chung-Ang University Research Grants in 2021.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This paper is an extended version of our paper published in the Proceedings of the 2020 International Conference on Artificial Intelligence (ICAI), Las Vegas, USA, 27–30 July 2020.
Cite this article
Bilal, M., Maqsood, M., Yasmin, S. et al. A transfer learning-based efficient spatiotemporal human action recognition framework for long and overlapping action classes. J Supercomput 78, 2873–2908 (2022). https://doi.org/10.1007/s11227-021-03957-4