Transfer learning with fine tuning for human action recognition from still images

Chakraborty, Saikat; Mondal, Riktim; Singh, Pawan Kumar; Sarkar, Ram; Bhattacharjee, Debotosh

doi:10.1007/s11042-021-10753-y

Published: 08 March 2021

Volume 80, pages 20547–20578 (2021)

Multimedia Tools and Applications

Saikat Chakraborty¹, Riktim Mondal², Pawan Kumar Singh³ (ORCID: orcid.org/0000-0002-9598-7981), Ram Sarkar² & Debotosh Bhattacharjee²,⁴

Abstract

Still image-based human action recognition (HAR) is one of the most challenging research problems in computer vision. Significant reasons for this include the small number of available datasets, the few images per action class, and the many confusing classes within these datasets; moreover, unlike video-based data, still images carry no temporal information. In this work, we train several well-known Convolutional Neural Network (CNN) architectures using transfer learning, fine-tuning them suitably to develop a model for still image-based HAR. Since the number of images per action class is small, we also apply some well-known data augmentation techniques to increase the amount of training data, which deep learning-based models typically require. We validate our model on two benchmark datasets, Stanford 40 and PPMI, which are known for their confusing action classes, occluded images, and random subject poses. Results obtained by our model on these datasets outperform some of the benchmark results reported in the literature by a considerable margin. To further explore the robustness of the proposed model, class imbalance is deliberately introduced in these datasets. The source code of the present work is available at: https://github.com/saikat021/Transfer-Learning-Based-HAR
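The abstract does not state which augmentation techniques were applied. As a minimal illustration only, the following Python sketch assumes two common geometric augmentations, horizontal flipping and random cropping, applied to an image stored as a list of pixel rows. The names `hflip`, `random_crop` and `augment` are hypothetical and are not taken from the authors' repository.

```python
import random
from typing import List, Tuple

# Assumption: an image is a list of rows of grayscale pixel values.
Image = List[List[int]]


def hflip(img: Image) -> Image:
    """Mirror the image left-to-right; for most actions this does not change the class."""
    return [row[::-1] for row in img]


def random_crop(img: Image, h: int, w: int, rng: random.Random) -> Image:
    """Cut an h x w window at a random position inside the image."""
    height, width = len(img), len(img[0])
    if h > height or w > width:
        raise ValueError("crop size larger than image")
    top = rng.randint(0, height - h)    # randint bounds are inclusive
    left = rng.randint(0, width - w)
    return [row[left:left + w] for row in img[top:top + h]]


def augment(images: List[Image], labels: List[int], copies: int,
            crop: Tuple[int, int], seed: int = 0) -> Tuple[List[Image], List[int]]:
    """Generate `copies` randomly cropped (and possibly flipped) variants per image.

    Every variant keeps the label of its source image, and all variants share
    the same crop size so they can be batched together.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    h, w = crop
    out_x: List[Image] = []
    out_y: List[int] = []
    for img, label in zip(images, labels):
        for _ in range(copies):
            variant = random_crop(img, h, w, rng)
            if rng.random() < 0.5:  # flip roughly half of the variants
                variant = hflip(variant)
            out_x.append(variant)
            out_y.append(label)
    return out_x, out_y
```

Because each variant inherits its source image's label, this kind of augmentation grows the per-class sample count without changing class proportions. In a real CNN pipeline the crops would typically be resized back to the network's input resolution before fine-tuning.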

Author information

Authors and Affiliations

Department of Electronics & Telecommunication Engineering, Jadavpur University, Kolkata, 700032, India
Saikat Chakraborty
Department of Computer Science and Engineering, Jadavpur University, Kolkata, 700032, India
Riktim Mondal, Ram Sarkar & Debotosh Bhattacharjee
Department of Information Technology, Jadavpur University, Kolkata, 700106, India
Pawan Kumar Singh
Center for Basic and Applied Science, Faculty of Informatics and Management, University of Hradec Kralove, Rokitanskeho 62, 500 03, Hradec Kralove, Czech Republic
Debotosh Bhattacharjee

Corresponding author

Correspondence to Pawan Kumar Singh.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Chakraborty, S., Mondal, R., Singh, P.K. et al. Transfer learning with fine tuning for human action recognition from still images. Multimed Tools Appl 80, 20547–20578 (2021). https://doi.org/10.1007/s11042-021-10753-y

Received: 01 November 2019
Revised: 15 February 2021
Accepted: 24 February 2021
Published: 08 March 2021
Issue Date: May 2021
DOI: https://doi.org/10.1007/s11042-021-10753-y
