ABSTRACT
This paper proposes a novel semantic action recognition method that fuses textual descriptions with skeleton data. To obtain detailed semantic information, an action captioning method is applied to generate textual descriptions of actions, and a fusion network is built to perform action recognition with the generated descriptions. The fusion network captures the spatial and temporal relations of skeleton sequences together with the semantics of the descriptions. It consists of a spatial module that models the spatial relations of joints with a graph convolutional network, a textual module that extracts textual features, and a temporal module that extracts the temporal relations of the fused features with a convolutional network. The proposed method thereby makes full use of detailed semantics for classification. Experiments on the WorkoutUOW-18 and MSRC-12 datasets show that the proposed method outperforms both recognition from visual data alone and other semantic action recognition methods, demonstrating the effectiveness of fusing textual descriptions for action recognition.
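The three-module pipeline described above can be sketched end to end. The following is a minimal illustrative sketch only, not the paper's implementation: all dimensions, the chain-shaped skeleton graph, mean-pooled word embeddings for the textual module, and the random weights are hypothetical stand-ins, and the single GCN layer, per-frame concatenation fusion, and valid 1-D temporal convolution are simplifications of the architecture the abstract names.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical; the paper's actual sizes are not given).
T, J, C = 16, 20, 3          # frames, joints, coordinate channels
D_s, D_t, D_f = 32, 32, 64   # spatial, textual, fused feature sizes
n_classes = 12

# --- Spatial module: one graph-convolution layer over the joint graph ---
# Assume a chain-shaped skeleton graph with self-loops (illustrative only).
A = np.eye(J) + np.diag(np.ones(J - 1), 1) + np.diag(np.ones(J - 1), -1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
A_norm = D_inv_sqrt @ A @ D_inv_sqrt          # symmetric normalisation

W_s = rng.standard_normal((C, D_s)) * 0.1
X = rng.standard_normal((T, J, C))            # a skeleton sequence
H = np.maximum(A_norm @ X @ W_s, 0.0)         # ReLU(A_norm X W), shape (T, J, D_s)
spatial = H.mean(axis=1)                      # pool over joints -> (T, D_s)

# --- Textual module: mean of word embeddings of the generated caption ---
vocab_size = 100
E = rng.standard_normal((vocab_size, D_t)) * 0.1
desc_ids = np.array([3, 17, 42, 8])           # token ids of the caption (toy)
text = E[desc_ids].mean(axis=0)               # (D_t,)

# --- Fusion + temporal module: concat text to every frame, 1-D conv in time ---
fused = np.concatenate([spatial, np.tile(text, (T, 1))], axis=1)  # (T, D_s+D_t)
W_c = rng.standard_normal((3, D_s + D_t, D_f)) * 0.1              # kernel size 3
conv = np.stack([
    sum(fused[t + k] @ W_c[k] for k in range(3))
    for t in range(T - 2)
])                                            # valid convolution -> (T-2, D_f)
pooled = np.maximum(conv, 0.0).mean(axis=0)   # global average pooling over time

# --- Classifier over the fused spatio-temporal-textual feature ---
W_o = rng.standard_normal((D_f, n_classes)) * 0.1
logits = pooled @ W_o
pred = int(np.argmax(logits))
print(fused.shape, conv.shape, pred)
```

The key design point the sketch mirrors is that fusion happens before the temporal module: the caption feature is attached to every frame, so the temporal convolution sees skeleton dynamics and description semantics jointly rather than combining two independently classified streams.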