Abstract
Action recognition in static images is challenging because a single image lacks the motion information that characterizes the relations between the human and objects. Existing works either detect the human together with related objects or transfer motion from videos to images; however, in both cases the interaction is depicted only implicitly. In this paper, we approach the problem from a different perspective: we explicitly learn the interactive information from the poses of the human and the objects. Humans take different poses in different actions, and the objects involved can have different spatial interactions with particular parts of the human body. This interaction between poses can be represented naturally by a Lie group. The Lie-group method computes the orientation and distance between key points or joints, which reveal the relation between humans and objects in different actions. In experiments, the proposed method achieves competitive classification results on several still-image action datasets, which supports recognizing still actions from poses.
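The abstract does not give the exact Lie-group formulation, so the following is only an illustrative sketch of the general idea: the relation between a human joint and an object key point is encoded as a planar rigid transform (an element of SE(2)), from which an orientation and a distance are read back. The function names, the choice of SE(2), and the 2D key-point layout are assumptions made for this example, not the authors' implementation.

```python
import numpy as np

def se2_relation(p_from, p_to):
    """Build a 3x3 SE(2) matrix for the vector p_from -> p_to.

    The rotation part is the orientation of that vector and the
    translation part is the vector itself (hypothetical encoding).
    """
    d = np.asarray(p_to, dtype=float) - np.asarray(p_from, dtype=float)
    theta = np.arctan2(d[1], d[0])
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, d[0]],
                     [s,  c, d[1]],
                     [0.0, 0.0, 1.0]])

def orientation_and_distance(T):
    """Recover (angle in radians, Euclidean distance) from an SE(2) matrix."""
    theta = np.arctan2(T[1, 0], T[0, 0])
    dist = np.hypot(T[0, 2], T[1, 2])
    return theta, dist

def pose_interaction_features(joints, object_center):
    """Concatenate (orientation, distance) of every joint w.r.t. the object."""
    feats = []
    for j in joints:
        theta, dist = orientation_and_distance(se2_relation(j, object_center))
        feats.extend([theta, dist])
    return np.array(feats)

# Two toy joints and one object key point in image coordinates.
features = pose_interaction_features([(0.0, 0.0), (1.0, 0.0)], (1.0, 1.0))
```

A feature vector of this kind could then be passed to any classifier; how the paper combines such features with appearance cues is not stated in this abstract.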
Acknowledgments
This work is supported by the National Key R&D Program of China (2018AAA0100100), the National Natural Science Foundation of China (61702095), and the Natural Science Foundation of Jiangsu Province (BK20190341).
Cite this article
Mi, S., Zhang, Y. Pose-guided action recognition in static images using lie-group. Appl Intell 52, 6760–6768 (2022). https://doi.org/10.1007/s10489-021-02760-1
DOI: https://doi.org/10.1007/s10489-021-02760-1