Abstract
Still image-based human action recognition (HAR) is one of the most challenging research problems in computer vision. Several factors contribute to this difficulty: few datasets are available, each action class contains relatively few images, many classes are easily confused with one another, and, unlike video-based data, still images carry no temporal information. In this work, we use transfer learning to train several well-established Convolutional Neural Network (CNN) based architectures, fine-tuning them suitably to develop a model for still image-based HAR. Because the number of images per action class is small, we also apply well-known data augmentation techniques to increase the amount of data, which deep learning-based models generally require. We validate our model on two benchmark datasets, Stanford 40 and PPMI, which are known for their confusing action classes, occluded images, and random poses of subjects. The results obtained by our model on these datasets outperform some of the benchmark results reported in the literature by a considerable margin. We also deliberately introduce class imbalance into these datasets to further explore the robustness of the proposed model. The source code of the present work is available at: https://github.com/saikat021/Transfer-Learning-Based-HAR
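As a rough illustration of the data augmentation step mentioned above, the following minimal sketch expands a small labelled image set with randomly flipped and shifted copies. The abstract does not name the specific augmentation techniques used, so horizontal flipping and a small horizontal shift are assumptions chosen as common examples; the function names (`horizontal_flip`, `shift_right`, `augment`) and the nested-list image representation are hypothetical and are not taken from the authors' repository.

```python
import random
from typing import List, Tuple

# Assumption: a grayscale image is a list of rows, each row a list of pixel values.
Image = List[List[int]]


def horizontal_flip(img: Image) -> Image:
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def shift_right(img: Image, pixels: int, fill: int = 0) -> Image:
    """Shift every row right by `pixels`, padding the left edge with `fill`."""
    shifted = []
    for row in img:
        pad = min(max(pixels, 0), len(row))
        shifted.append([fill] * pad + row[: len(row) - pad])
    return shifted


def augment(
    dataset: List[Tuple[Image, str]], copies_per_image: int, seed: int = 0
) -> List[Tuple[Image, str]]:
    """Return the original samples plus `copies_per_image` augmented copies of each.

    Each copy is flipped with probability 0.5 and shifted by 0 or 1 pixel.
    Labels are preserved, since these transforms do not change the action shown.
    """
    rng = random.Random(seed)
    augmented = []
    for img, label in dataset:
        augmented.append((img, label))
        for _ in range(copies_per_image):
            new_img = img
            if rng.random() < 0.5:
                new_img = horizontal_flip(new_img)
            new_img = shift_right(new_img, rng.randint(0, 1))
            augmented.append((new_img, label))
    return augmented
```

With `copies_per_image=3`, a class holding 100 images would grow to 400 samples. In practice such transforms are usually applied on the fly by the training framework rather than materialised in memory as done here.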
Cite this article
Chakraborty, S., Mondal, R., Singh, P.K. et al. Transfer learning with fine tuning for human action recognition from still images. Multimed Tools Appl 80, 20547–20578 (2021). https://doi.org/10.1007/s11042-021-10753-y
DOI: https://doi.org/10.1007/s11042-021-10753-y