Abstract
Recognizing human actions is an active research area, and the pose of the performer is an important cue for recognition. However, using the performer's 3D landmark points for action recognition remains relatively unexplored, owing to the difficulty of extracting 3D landmark points from a single view of the performer. Recent advances in 3D landmark point detection make it worthwhile to exploit these landmark points for recognizing human actions. We propose a technique for human action recognition that learns the 3D landmark points of the human pose from a single image. We apply an autoencoder architecture followed by a regression layer to estimate pose parameters such as shape, gesture, and camera position, which are then mapped to 3D landmark points by the Skinned Multi-Person Linear (SMPL) model. The proposed method is a novel attempt to apply a CNN-based 3D pose reconstruction model (autoencoder) to action recognition. Further, instead of using the autoencoder as a classifier over 3D poses, we replace the decoder with a regressor that produces the landmark points. The 3D landmark points of the human performer(s) in each frame are then fed as features into a neural network classifier for recognizing the action.
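The pipeline described in the abstract — a CNN encoder whose decoder is replaced by a regression head predicting SMPL shape, pose (gesture), and camera parameters, followed by a classifier over the resulting 3D joints — can be sketched as below. This is a minimal illustration, not the authors' implementation: the encoder layers, layer sizes, and the `smpl_joints` linear stand-in for the real SMPL model (which requires its own learned blend shapes and skinning weights) are all assumptions.

```python
# Minimal sketch of the described pipeline (PyTorch).
# Assumptions: layer sizes, the small conv encoder, and the linear
# stand-in for the SMPL parameter-to-joint mapping are illustrative,
# not the authors' actual configuration.
import torch
import torch.nn as nn

N_JOINTS = 24          # SMPL defines 24 body joints
N_POSE = 24 * 3        # axis-angle pose (gesture) parameters
N_SHAPE = 10           # SMPL shape coefficients
N_CAM = 3              # scale + 2D translation, as in HMR-style models

class PoseRegressor(nn.Module):
    """CNN encoder + regression head predicting SMPL-style parameters."""
    def __init__(self):
        super().__init__()
        # Encoder half of the autoencoder; the decoder is replaced
        # by the regression head below.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.regressor = nn.Linear(64, N_POSE + N_SHAPE + N_CAM)
        # Hypothetical linear stand-in for SMPL: maps (pose, shape)
        # to 3D joint locations. The real SMPL model applies learned
        # blend shapes and linear blend skinning instead.
        self.smpl_joints = nn.Linear(N_POSE + N_SHAPE, N_JOINTS * 3)

    def forward(self, img):
        feat = self.encoder(img)
        params = self.regressor(feat)
        pose_shape = params[:, :N_POSE + N_SHAPE]
        cam = params[:, -N_CAM:]
        joints3d = self.smpl_joints(pose_shape).view(-1, N_JOINTS, 3)
        return joints3d, cam

class ActionClassifier(nn.Module):
    """Neural-network classifier over per-frame 3D landmark points."""
    def __init__(self, n_classes):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_JOINTS * 3, 128), nn.ReLU(),
            nn.Linear(128, n_classes),
        )

    def forward(self, joints3d):
        return self.net(joints3d.flatten(1))

# Per-frame usage: one RGB frame in, action class logits out.
frame = torch.randn(1, 3, 224, 224)
joints3d, cam = PoseRegressor()(frame)
logits = ActionClassifier(n_classes=101)(joints3d)  # e.g. UCF101
```

In this arrangement the landmark points act as a compact, view-normalized feature, so the classifier operates on 72 values per frame rather than raw pixels.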
References
Zhu, F., Shao, L., Xie, J., Fang, Y.: From handcrafted to learned representations for human action recognition: a survey. Image Vis. Comput. 55, 42–52 (2016)
Ziaeefard, M., Bergevin, R.: Semantic human activity recognition: a literature review. Pattern Recogn. 48(8), 2329–2345 (2015)
Chaudhry, R., Ravichandran, A., Hager, G., Vidal, R.: Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions. In: CVPR, pp. 1–8. IEEE (2009)
Mukherjee, S., Biswas, S.K., Mukherjee, D.P.: Recognizing human action at a distance in video by key poses. IEEE Trans. Circuits Syst. Video Technol. 21(9), 1228–1241 (2011)
Wang, H., Schmid, C.: Action recognition with improved trajectories. In: ICCV, pp. 3551–3558. IEEE (2013)
Mukherjee, S.: Human action recognition using dominant pose duplet. In: Nalpantidis, L., Krüger, V., Eklundh, J.-O., Gasteratos, A. (eds.) ICVS 2015. LNCS, vol. 9163, pp. 488–497. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-20904-3_44
Laptev, I., Marszałek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: CVPR, pp. 1–8. IEEE (2008)
Das Dawn, D., Shaikh, S.H.: A comprehensive survey of human action recognition with spatio-temporal interest point (STIP) detector. Vis. Comput. 32(3), 289–306 (2015). https://doi.org/10.1007/s00371-015-1066-2
Buddubariki, V., Tulluri, S.G., Mukherjee, S.: Event recognition in egocentric videos using a novel trajectory based feature. In: ICVGIP, pp. 76:1–76:8. ACM (2016)
Nazir, S., Yousaf, M.H., Nebel, J.-C., Velastin, S.A.: A bag of expression framework for improved human action recognition. Pattern Recogn. Lett. 103, 39–45 (2018)
Herath, S., Harandi, M.T., Porikli, F.M.: Going deeper into action recognition: a survey. Image Vis. Comput. (2017). https://doi.org/10.1016/j.imavis.2017.01.010
Feichtenhofer, C., Pinz, A., Zisserman, A.: Convolutional two-stream network fusion for video action recognition. In: CVPR, pp. 1–9. IEEE (2016)
Ji, S., Xu, W., Yang, M., Yu, K.: 3D convolutional neural networks for human action recognition. In: ICML, pp. 1–8 (2010)
Hara, K., Kataoka, H., Satoh, Y.: Can spatio-temporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In: CVPR, pp. 6546–6555. IEEE (2018)
Li, C., Zhong, Q., Xie, D., Pu, S.: Collaborative spatiotemporal feature learning for video action recognition. In: CVPR, pp. 7872–7881. IEEE (2019)
Wu, C.-Y., Zaheer, M., Hu, H., Manmatha, R., Smola, A.J., Krahenbuhl, P.: Compressed video action recognition. In: CVPR, pp. 6026–6035. IEEE (2018)
Shou, Z., et al.: DMC-Net: generating discriminative motion cues for fast compressed video action recognition. In: CVPR, pp. 1–10. IEEE (2019)
Singh, K.K., Mukherjee, S.: Recognizing human activities in videos using improved dense trajectories over LSTM. In: Rameshan, R., Arora, C., Dutta Roy, S. (eds.) NCVPRIPG 2017. CCIS, vol. 841, pp. 78–88. Springer, Singapore (2018). https://doi.org/10.1007/978-981-13-0020-2_8
Li, C., Wang, P., Wang, S., Hou, Y., Li, W.: Skeleton-based action recognition using LSTM and CNN. In: ICME Workshops, pp. 585–590. IEEE (2017)
Li, C., et al.: Deep manifold structure transfer for action recognition. IEEE Trans. Image Process. 28, 4646–4658 (2019)
Uddin, M.A., Lee, Y.-K.: Feature fusion of deep spatial features and handcrafted spatiotemporal features for human action recognition. Sensors 19(7), 1599 (2019). https://doi.org/10.3390/s19071599
Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: SMPL: a skinned multi-person linear model. ACM Trans. Graph. 34, 248:1–248:16 (2015)
Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR, pp. 1–8. IEEE (2005)
Kanazawa, A., Black, M.J., Jacobs, D.W., Malik, J.: End-to-end recovery of human shape and pose. In: CVPR, pp. 1–10. IEEE (2018)
Nagalakshmi, C., Mukherjee, S.: Classification of yoga asana from single image by learning 3D view of human pose. In: ICVGIP Workshops. Springer (2018). https://doi.org/10.1007/978-3-030-57907-4_1
Soomro, K., Zamir, A.R., Shah, M.: UCF101: a dataset of 101 human action classes from videos in the wild. Report no. CRCV-TR-12-01 (November 2012)
Zhang, P., Lan, C., Zeng, W., Xing, J., Xue, J., Zheng, N.: Semantics-guided neural networks for efficient skeleton-based human action recognition. In: CVPR, pp. 1112–1121 (2020)
Materzynska, J., Xiao, T., Herzig, R., Xu, H., Wang, X., Darrell, T.: Something-else: compositional action recognition with spatial-temporal interaction networks. In: CVPR, pp. 1049–1059 (2020)
Copyright information
© 2021 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Mukherjee, S., Nagalakshmi, C. (2021). Human Action Recognition from 3D Landmark Points of the Performer. In: Singh, S.K., Roy, P., Raman, B., Nagabhushan, P. (eds) Computer Vision and Image Processing. CVIP 2020. Communications in Computer and Information Science, vol 1377. Springer, Singapore. https://doi.org/10.1007/978-981-16-1092-9_4
DOI: https://doi.org/10.1007/978-981-16-1092-9_4
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-1091-2
Online ISBN: 978-981-16-1092-9
eBook Packages: Computer Science, Computer Science (R0)