ABSTRACT
This paper proposes a novel semantic action recognition method that fuses textual descriptions with skeleton data. To obtain detailed semantic information, an action captioning method is applied to generate textual descriptions of actions, and a fusion network is built to perform action recognition with the generated descriptions. The fusion network captures the spatial and temporal relations of skeleton sequences together with the semantics of the descriptions. It consists of a spatial module that models the spatial relations of joints with a graph convolutional network, a textual module that extracts textual features, and a temporal module that extracts the temporal relations of the fused features with a convolutional network. The proposed method thereby makes full use of detailed semantics for classification. Experiments on the WorkoutUOW-18 and MSRC-12 datasets show that the proposed method outperforms both recognition from visual data alone and other semantic action recognition methods, demonstrating the effectiveness of fusing textual descriptions for action recognition.
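The three-module pipeline described above can be sketched end to end. The following is a minimal illustrative sketch only, not the paper's implementation: all dimensions, the chain-shaped skeleton graph, mean-pooled word embeddings for the textual module, and the random weights are hypothetical stand-ins, and the single GCN layer, per-frame concatenation fusion, and valid 1-D temporal convolution are simplifications of the architecture the abstract names.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical; the paper's actual sizes are not given).
T, J, C = 16, 20, 3          # frames, joints, coordinate channels
D_s, D_t, D_f = 32, 32, 64   # spatial, textual, fused feature sizes
n_classes = 12

# --- Spatial module: one graph-convolution layer over the joint graph ---
# Assume a chain-shaped skeleton graph with self-loops (illustrative only).
A = np.eye(J) + np.diag(np.ones(J - 1), 1) + np.diag(np.ones(J - 1), -1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
A_norm = D_inv_sqrt @ A @ D_inv_sqrt          # symmetric normalisation

W_s = rng.standard_normal((C, D_s)) * 0.1
X = rng.standard_normal((T, J, C))            # a skeleton sequence
H = np.maximum(A_norm @ X @ W_s, 0.0)         # ReLU(A_norm X W), shape (T, J, D_s)
spatial = H.mean(axis=1)                      # pool over joints -> (T, D_s)

# --- Textual module: mean of word embeddings of the generated caption ---
vocab_size = 100
E = rng.standard_normal((vocab_size, D_t)) * 0.1
desc_ids = np.array([3, 17, 42, 8])           # token ids of the caption (toy)
text = E[desc_ids].mean(axis=0)               # (D_t,)

# --- Fusion + temporal module: concat text to every frame, 1-D conv in time ---
fused = np.concatenate([spatial, np.tile(text, (T, 1))], axis=1)  # (T, D_s+D_t)
W_c = rng.standard_normal((3, D_s + D_t, D_f)) * 0.1              # kernel size 3
conv = np.stack([
    sum(fused[t + k] @ W_c[k] for k in range(3))
    for t in range(T - 2)
])                                            # valid convolution -> (T-2, D_f)
pooled = np.maximum(conv, 0.0).mean(axis=0)   # global average pooling over time

# --- Classifier over the fused spatio-temporal-textual feature ---
W_o = rng.standard_normal((D_f, n_classes)) * 0.1
logits = pooled @ W_o
pred = int(np.argmax(logits))
print(fused.shape, conv.shape, pred)
```

The key design point the sketch mirrors is that fusion happens before the temporal module: the caption feature is attached to every frame, so the temporal convolution sees skeleton dynamics and description semantics jointly rather than combining two independently classified streams.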