ABSTRACT
Effective feature learning is a key component of human action recognition algorithms. A three-dimensional convolutional neural network (3D CNN) can extract spatio-temporal features directly, but it does not adequately capture the most discriminative parts of an action video. Redundant spatial regions within and between temporal frames weaken the descriptive ability of the 3D CNN model. To address this problem, we propose a lightweight spatio-temporal attention module (ST-AM), composed of a spatial attention module (SAM) and a temporal attention module (TAM). Together, SAM and TAM encode the semantically meaningful spatial areas and suppress redundant temporal frames, which reduces misclassification. The proposed SAM and TAM are complementary and can be easily embedded into existing 3D CNN action recognition models. Experiments on the UCF-101 and HMDB-51 datasets show that the ST-AM embedded model achieves strong performance on the action recognition task.
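The abstract does not give the internal design of SAM and TAM, so the following is only a minimal NumPy sketch of the general idea: a spatial mask and a per-frame temporal weight are computed from a 3D CNN feature map and applied as multiplicative gates. The function names, the channel pooling, the scalar weights (stand-ins for learned convolutions), and the SAM-then-TAM order are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function, used to squash scores into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def spatial_attention(feat, w_avg=1.0, w_max=1.0):
    """Hypothetical SAM: one spatial mask per frame.

    feat has shape (C, T, H, W). Channels are pooled by mean and max,
    and the scalar weights stand in for a learned convolution (assumption).
    Returns a mask of shape (1, T, H, W).
    """
    avg = feat.mean(axis=0, keepdims=True)
    mx = feat.max(axis=0, keepdims=True)
    return sigmoid(w_avg * avg + w_max * mx)


def temporal_attention(feat, w=1.0, b=0.0):
    """Hypothetical TAM: one scalar weight per frame.

    Each frame is summarised by its global average over channels and
    space; w and b stand in for learned parameters (assumption).
    Returns weights of shape (1, T, 1, 1).
    """
    desc = feat.mean(axis=(0, 2, 3), keepdims=True)
    return sigmoid(w * desc + b)


def st_attention(feat):
    """Apply the spatial gate, then the temporal gate (order is an assumption)."""
    feat = feat * spatial_attention(feat)   # broadcast over channels
    feat = feat * temporal_attention(feat)  # broadcast over C, H, W
    return feat


# Toy feature map: 8 channels, 4 frames, 7x7 spatial grid.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 7, 7))
y = st_attention(x)
print(y.shape)
```

Because both gates only rescale the input and preserve its shape, a block of this kind can be placed between existing 3D convolutional stages without changing the rest of the network, which is the sense in which the abstract calls the module easy to embed.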
Index Terms
- An Efficient Lightweight Spatio-temporal Attention Module for Action Recognition