Abstract
Local key features in video are important for improving the accuracy of human action recognition. However, most end-to-end methods learn global features from videos, and few consider enhancing the local information within a feature. In this article, we discuss how to automatically strengthen the discriminative local information in an action feature and thereby improve the accuracy of action recognition. We assume that the importance of each region to the recognition task differs across regions and does not change when region locations are shuffled. Based on this assumption, we propose a novel action recognition method called the shuffle-invariant network. In the proposed method, a shuffled video is generated by cutting frames into regular regions and randomly permuting them, which augments the input data. The network adopts a multitask framework consisting of one feature backbone network and three task branches: local critical feature shuffle-invariant learning, adversarial learning, and action classification. To enhance the local features, the feature response of each region is predicted by a local critical feature learning network. To train this network, an L1-based critical feature shuffle-invariant loss is defined to ensure that the ordered list of region feature responses remains unchanged after the region locations are shuffled. Adversarial learning is then applied to suppress the noise introduced by the region shuffle. Finally, the action classification branch combines these two tasks to jointly guide the training of the feature backbone and obtain more effective action features. In the testing phase, only the action classification network is used to identify the action category of the input video. We evaluate the proposed method on the HMDB51 and UCF101 action datasets and conduct several ablation experiments to verify the effectiveness of each module. The experimental results show that our approach achieves state-of-the-art performance.
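The two core ideas above — generating a shuffled video by cutting frames into regular regions and randomly permuting them, and an L1 loss on the ordered list of region responses — can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the grid size, the (H, W, C) frame layout, and the exact loss formulation are assumptions.

```python
import numpy as np

def shuffle_regions(frame, grid=(2, 2), rng=None):
    """Cut a frame into a regular grid of regions and randomly permute them.

    Sketch of the region-shuffle augmentation; grid size and (H, W, C)
    layout are assumptions. Returns the shuffled frame and the permutation.
    """
    rng = np.random.default_rng(rng)
    h, w = frame.shape[:2]
    gh, gw = grid
    rh, rw = h // gh, w // gw
    # Collect the grid cells in row-major order.
    cells = [frame[i * rh:(i + 1) * rh, j * rw:(j + 1) * rw]
             for i in range(gh) for j in range(gw)]
    perm = rng.permutation(len(cells))
    out = frame.copy()
    # Write cell perm[k] into destination slot k.
    for dst, src in enumerate(perm):
        i, j = divmod(dst, gw)
        out[i * rh:(i + 1) * rh, j * rw:(j + 1) * rw] = cells[src]
    return out, perm

def shuffle_invariant_l1(resp_orig, resp_shuf):
    """L1 distance between descending-sorted region-response lists.

    One plausible reading of the L1-based shuffle-invariant loss: the
    *ordered* list of region responses should be identical before and
    after the shuffle, so the loss is zero when the responses are a
    permutation of each other.
    """
    a = np.sort(np.asarray(resp_orig, dtype=float))[::-1]
    b = np.sort(np.asarray(resp_shuf, dtype=float))[::-1]
    return float(np.abs(a - b).sum())
```

For example, responses `[0.1, 0.7, 0.2]` and their permutation `[0.7, 0.2, 0.1]` give zero loss, since sorting makes both lists identical; the loss only penalizes changes in the response *values*, not in their locations.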
Index Terms
- Shuffle-invariant Network for Action Recognition in Videos