Patch excitation network for boxless action recognition in still images

  • Original article
  • Published in The Visual Computer

Abstract

Action recognition in still images is considered a challenging task, mainly due to the lack of temporal information. To address this issue, researchers usually need to locate the human in the image first and then combine that location with information closely related to the action, such as the human pose and surrounding objects. Such methods are effective, but they rely heavily on annotations, especially the human bounding box. To remove this limitation, we propose a novel patch excitation network in this paper, which requires only images as input in both the training and testing phases. Each image is evenly divided into patches, and "excitation" is performed at two levels. First, the action excitation module processes the whole image so that action-related regions receive a higher response. The activated feature is then assigned to patches of different sizes. Finally, the patches of different sizes are fed into the patch interaction module for mutual enhancement. Throughout the process, no step deliberately searches for a specific kind of action-related information, such as the human body or pose; instead, the network looks more broadly for all clues related to the action in the image. This idea differs from some previous boxless methods. Experiments show that the proposed solution obtains state-of-the-art results among boxless methods on widely used datasets. In particular, on the Stanford 40 dataset, its performance is comparable to that of methods using additional annotations.
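To make the pipeline described above more concrete, the following is a minimal PyTorch sketch of the two-level idea, assuming that the action excitation module behaves like squeeze-and-excitation-style channel gating and that the patch interaction module behaves like self-attention over pooled patch tokens. The module structure, patch grid sizes, and dimensions below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ActionExcitation(nn.Module):
    """Re-weights feature channels so action-related responses are amplified.
    Assumed squeeze-and-excitation-style gating; illustrative only."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                        # x: (B, C, H, W) backbone feature
        g = self.gate(x).view(x.size(0), -1, 1, 1)
        return x * g                              # "excited" feature map


class PatchInteraction(nn.Module):
    """Lets patch tokens of different sizes enhance each other.
    Assumed plain multi-head self-attention; the paper's scheme may differ."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens):                    # tokens: (B, N, dim)
        out, _ = self.attn(tokens, tokens, tokens)
        return self.norm(tokens + out)


def to_patch_tokens(feat, grid):
    """Evenly splits the excited feature map into grid x grid patches and
    average-pools each patch into one token."""
    pooled = F.adaptive_avg_pool2d(feat, grid)    # (B, C, grid, grid)
    return pooled.flatten(2).transpose(1, 2)      # (B, grid*grid, C)


# Illustrative forward pass: excite the whole-image feature, build patch tokens at
# two (hypothetical) grid sizes, let them interact, and classify from their mean.
feat = torch.randn(2, 256, 14, 14)                # stand-in for a backbone feature map
excited = ActionExcitation(256)(feat)
tokens = torch.cat([to_patch_tokens(excited, g) for g in (2, 4)], dim=1)
tokens = PatchInteraction(256)(tokens)
logits = nn.Linear(256, 40)(tokens.mean(dim=1))   # e.g. 40 action classes (Stanford 40)
```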


Availability of data and material

All data generated or analyzed during this study are included in this published article, and the datasets used in the current study are publicly available online.


Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grants 62076183, 61936014, and 61976159, in part by the Natural Science Foundation of Shanghai under Grant 20ZR1473500, in part by the Shanghai Science and Technology Innovation Action Project under Grants 20511100700 and 22511105300, in part by the Shanghai Municipal Science and Technology Major Project under Grant 2021SHZDZX0100, and in part by the Fundamental Research Funds for the Central Universities. The authors would also like to thank the anonymous reviewers for their careful work and valuable suggestions.

Author information


Corresponding author

Correspondence to Shuang Liang.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest or competing interests to declare.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Liang, S., Wang, J. & Zhuang, Z. Patch excitation network for boxless action recognition in still images. Vis Comput 40, 4099–4113 (2024). https://doi.org/10.1007/s00371-023-03071-x


  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00371-023-03071-x

Keywords