An Efficient Lightweight Spatio-temporal Attention Module for Action Recognition

Authors:
Zhonghua Sun

Faculty of Information Technology, Beijing University of Technology, China and Beijing Laboratory of Advanced Information Networks, Beijing University of Technology, China

0000-0002-6515-8859

Meng Dai

Faculty of Information Technology, Beijing University of Technology, China

0000-0003-3942-1841

Ziwen Yi

Faculty of Information Technology, Beijing University of Technology, China

0000-0003-2249-8340

Tianyi Wang

Faculty of Information Technology, Beijing University of Technology, China

0000-0003-4746-9124

Jinchao Feng

Faculty of Information Technology, Beijing University of Technology, China and Beijing Key Laboratory of Computational Intelligence and Intelligent System, Faculty of Information Technology, China

0000-0001-5603-8874

Kebin Jia

Faculty of Information Technology, Beijing University of Technology, China and Beijing Laboratory of Advanced Information Networks, Beijing University of Technology, China

0000-0001-7620-2221

ICCPR '22: Proceedings of the 2022 11th International Conference on Computing and Pattern Recognition, November 2022, Pages 14–19. https://doi.org/10.1145/3581807.3581810

Published: 22 May 2023

ABSTRACT

Effective feature learning is a key component of human action recognition algorithms. A three-dimensional convolutional neural network (3D CNN) can extract spatio-temporal features directly; however, it is insufficient for capturing the most discriminative parts of an action video. Redundant spatial regions within and between temporal frames weaken the descriptive ability of the 3D CNN model. To address this problem, we propose a lightweight spatio-temporal attention module (ST-AM), composed of a spatial attention module (SAM) and a temporal attention module (TAM). SAM and TAM effectively encode the semantic spatial areas and suppress redundant temporal frames to reduce misclassification. The proposed SAM and TAM have complementary effects and can be easily embedded into existing 3D CNN action recognition models. Experiments on the UCF-101 and HMDB-51 datasets show that the ST-AM-embedded model achieves impressive performance on the action recognition task.
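To make the idea of reweighting 3D CNN features in space and time concrete, the following is a minimal NumPy sketch. It is not the authors' ST-AM: the abstract does not specify the internals of SAM or TAM, so the pooling choices, the sequential order (spatial first, then temporal), and the scalar parameters `w_avg`, `w_max`, `w`, and `bias` are all assumptions made for illustration. A real module would presumably use small learned layers in place of these placeholder scalars.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def spatial_attention(feat, w_avg=1.0, w_max=1.0, bias=0.0):
    """Reweight each spatial location of a (C, T, H, W) feature map.

    Assumption: the mask is built from channel-wise average and max pooling,
    combined with placeholder scalar weights instead of a learned convolution.
    """
    avg = feat.mean(axis=0, keepdims=True)   # (1, T, H, W)
    mx = feat.max(axis=0, keepdims=True)     # (1, T, H, W)
    mask = sigmoid(w_avg * avg + w_max * mx + bias)
    return feat * mask, mask


def temporal_attention(feat, w=1.0, bias=0.0):
    """Reweight each frame of a (C, T, H, W) feature map.

    Assumption: one descriptor per frame, obtained by averaging over channels
    and spatial positions, then squashed to a weight in (0, 1).
    """
    desc = feat.mean(axis=(0, 2, 3), keepdims=True)  # (1, T, 1, 1)
    weights = sigmoid(w * desc + bias)
    return feat * weights, weights


def st_attention(feat):
    """Apply the spatial mask, then the per-frame weights (order is assumed)."""
    out, s_mask = spatial_attention(feat)
    out, t_weights = temporal_attention(out)
    return out, s_mask, t_weights


# Hypothetical feature map: 8 channels, 4 frames, 7x7 spatial grid.
rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 4, 7, 7))
out, s_mask, t_weights = st_attention(feat)
print(out.shape, s_mask.shape, t_weights.shape)
```

Because both steps only multiply the input by broadcastable masks, the output keeps the input shape, which is what allows a module of this kind to be dropped between existing 3D CNN layers without changing the surrounding architecture.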

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision

Published in

ICCPR '22: Proceedings of the 2022 11th International Conference on Computing and Pattern Recognition
November 2022
683 pages
ISBN: 9781450397056
DOI: 10.1145/3581807

Copyright © 2022 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
3D CNN
Action recognition
Attention module
Spatio-temporal features
Qualifiers
- research-article
- Research
- Refereed limited