Patch excitation network for boxless action recognition in still images

  • Original article
  • Published in The Visual Computer

Abstract

Action recognition in still images is considered a challenging task, mainly due to the lack of temporal information. To address this issue, researchers usually need to locate the human in the image first and then combine that location with information closely related to the action, such as the human pose and surrounding objects. Such methods are effective, but they rely heavily on annotations, especially the human bounding box. To remove this limitation, we propose a novel patch excitation network in this paper, which requires only images as input in both the training and testing phases. Each image is evenly divided into patches, and "excitation" is performed at two levels. First, the action excitation module processes the whole image so that action-related regions receive a higher response. The activated feature is then assigned to patches of different sizes. Finally, the patches of different sizes are fed into the patch interaction module for mutual enhancement. Throughout the process, no step deliberately searches for a specific kind of action-related information, such as the human body or pose; instead, the network looks more broadly for all clues related to the action in the image. This idea differs from some previous boxless methods. Experiments show that the proposed solution obtains state-of-the-art results among boxless methods on widely used datasets. In particular, on the Stanford 40 dataset, its performance is comparable to that of methods using additional annotations.
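To make the pipeline described above more concrete, the following is a minimal PyTorch sketch of the two-level idea, assuming that the action excitation module behaves like squeeze-and-excitation-style channel gating and that the patch interaction module behaves like self-attention over pooled patch tokens. The module structure, patch grid sizes, and dimensions below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ActionExcitation(nn.Module):
    """Re-weights feature channels so action-related responses are amplified.
    Assumed squeeze-and-excitation-style gating; illustrative only."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                        # x: (B, C, H, W) backbone feature
        g = self.gate(x).view(x.size(0), -1, 1, 1)
        return x * g                              # "excited" feature map


class PatchInteraction(nn.Module):
    """Lets patch tokens of different sizes enhance each other.
    Assumed plain multi-head self-attention; the paper's scheme may differ."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens):                    # tokens: (B, N, dim)
        out, _ = self.attn(tokens, tokens, tokens)
        return self.norm(tokens + out)


def to_patch_tokens(feat, grid):
    """Evenly splits the excited feature map into grid x grid patches and
    average-pools each patch into one token."""
    pooled = F.adaptive_avg_pool2d(feat, grid)    # (B, C, grid, grid)
    return pooled.flatten(2).transpose(1, 2)      # (B, grid*grid, C)


# Illustrative forward pass: excite the whole-image feature, build patch tokens at
# two (hypothetical) grid sizes, let them interact, and classify from their mean.
feat = torch.randn(2, 256, 14, 14)                # stand-in for a backbone feature map
excited = ActionExcitation(256)(feat)
tokens = torch.cat([to_patch_tokens(excited, g) for g in (2, 4)], dim=1)
tokens = PatchInteraction(256)(tokens)
logits = nn.Linear(256, 40)(tokens.mean(dim=1))   # e.g. 40 action classes (Stanford 40)
```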


Availability of data and material

All data generated or analyzed during this study are included in this published article, and the datasets used in the current study are publicly available online.


Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grants 62076183, 61936014, and 61976159, in part by the Natural Science Foundation of Shanghai under Grant 20ZR1473500, in part by the Shanghai Science and Technology Innovation Action Project under Grants 20511100700 and 22511105300, in part by the Shanghai Municipal Science and Technology Major Project under Grant 2021SHZDZX0100, and in part by the Fundamental Research Funds for the Central Universities. The authors would also like to thank the anonymous reviewers for their careful work and valuable suggestions.

Author information


Corresponding author

Correspondence to Shuang Liang.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest or competing interests to declare.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Liang, S., Wang, J. & Zhuang, Z. Patch excitation network for boxless action recognition in still images. Vis Comput 40, 4099–4113 (2024). https://doi.org/10.1007/s00371-023-03071-x


  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00371-023-03071-x

Keywords