DOI: 10.1145/3581807.3581810
research-article

An Efficient Lightweight Spatio-temporal Attention Module for Action Recognition

Published: 22 May 2023

ABSTRACT

Effective feature learning is a key component of human action recognition algorithms. A three-dimensional convolutional neural network (3D CNN) can extract spatio-temporal features directly; however, it is insufficient to capture the most discriminative parts of an action video. Redundant spatial regions within and between temporal frames weaken the descriptive ability of the 3D CNN model. To address this problem, we propose a lightweight spatio-temporal attention module (ST-AM) composed of a spatial attention module (SAM) and a temporal attention module (TAM). SAM and TAM effectively encode the semantic spatial areas and suppress redundant temporal frames, which reduces misclassification. The proposed SAM and TAM have complementary effects and can be easily embedded into existing 3D CNN action recognition models. Experiments on the UCF-101 and HMDB-51 datasets show that the ST-AM-embedded model achieves impressive performance on the action recognition task.
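The abstract does not give the internals of SAM or TAM, so the following is only an illustrative sketch of the general idea of reweighting a 3D CNN feature map over space and then over time. Everything in it is an assumption made for illustration: the function names, the CBAM-style average and max pooling, the parameter-free sigmoid gating (a real module would likely use learned layers here), and the SAM-then-TAM ordering. It should not be read as the authors' design.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(feat):
    """Reweight each spatial location of a (C, T, H, W) feature map.

    Assumption: channel-wise average and max pooling give a (T, H, W)
    descriptor; a learned convolution would normally follow, but it is
    replaced by a plain sum here to keep the sketch parameter-free.
    """
    avg_desc = feat.mean(axis=0)            # (T, H, W)
    max_desc = feat.max(axis=0)             # (T, H, W)
    weights = sigmoid(avg_desc + max_desc)  # values in (0, 1)
    return feat * weights[None, ...]        # broadcast over channels

def temporal_attention(feat):
    """Reweight each frame of a (C, T, H, W) feature map.

    Assumption: one scalar descriptor per frame from global average
    pooling over channels and space; the learned gating layers are omitted.
    """
    frame_desc = feat.mean(axis=(0, 2, 3))  # (T,)
    weights = sigmoid(frame_desc)           # values in (0, 1)
    return feat * weights[None, :, None, None]

def st_am(feat):
    # Hypothetical composition: spatial attention first, then temporal.
    return temporal_attention(spatial_attention(feat))

# Toy feature map: 8 channels, 4 frames, 7x7 spatial grid.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 7, 7))
y = st_am(x)
print(y.shape)
```

Because both stages only multiply the input by weights in (0, 1), the output keeps the input's shape, which is what would let such a block be dropped between existing 3D CNN layers without changing the surrounding architecture.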


    • Published in

      ICCPR '22: Proceedings of the 2022 11th International Conference on Computing and Pattern Recognition
      November 2022
      683 pages
      ISBN:9781450397056
      DOI:10.1145/3581807

      Copyright © 2022 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Qualifiers

      • research-article
      • Research
      • Refereed limited