Research Article · DOI: 10.1145/3604078.3604087 · ICDIP Conference Proceedings (ACM)

Fusing Skeletons and Texts Based on GCN-CNN for Action Recognition

Published: 26 October 2023

ABSTRACT

This paper proposes a semantic action recognition method that fuses skeleton sequences with textual descriptions of actions. To obtain detailed semantic information, an action captioning method is first applied to generate textual descriptions of the actions; a fusion network is then built to recognize actions from the skeletons together with the generated descriptions. The fusion network captures the spatial and temporal relations of skeleton sequences as well as the semantics of the descriptions: a spatial module models the spatial relations of joints with a graph convolutional network, a textual module extracts features from the descriptions, and a temporal module extracts the temporal relations of the fused features with a convolutional network. The method thereby makes full use of detailed semantics for classification. In experiments on the WorkoutUOW-18 and MSRC-12 datasets, the proposed method outperformed both recognition from visual data alone and other semantic action recognition methods, demonstrating the effectiveness of fusing textual descriptions for action recognition.
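
The abstract describes the architecture only at a block-diagram level. The PyTorch sketch below is a rough illustration of the three described modules, a GCN-based spatial module over joints, a textual encoder over caption embeddings, and a 1D-convolutional temporal module over the fused features. Every concrete choice here (the learnable adjacency, the GRU text encoder, all layer sizes and defaults) is an assumption made for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SkeletonTextFusion(nn.Module):
    """Illustrative sketch of the three-module design described in the
    abstract: a GCN-based spatial module over joints, a textual module
    over caption embeddings, and a CNN-based temporal module over the
    fused features. All names and sizes here are assumptions."""

    def __init__(self, num_joints=20, in_channels=3,
                 text_dim=300, hidden=64, num_classes=12):
        super().__init__()
        # A learnable adjacency matrix stands in for the skeleton graph
        # (initialized to identity; an assumption, not the paper's graph).
        self.adj = nn.Parameter(torch.eye(num_joints))
        self.gcn = nn.Linear(in_channels, hidden)                   # spatial module: one GCN layer
        self.text_enc = nn.GRU(text_dim, hidden, batch_first=True)  # textual module
        self.temporal = nn.Conv1d(2 * hidden, hidden,
                                  kernel_size=3, padding=1)         # temporal module
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, skel, text):
        # skel: (B, T, V, C) joint coordinates; text: (B, L, text_dim) word embeddings.
        B, T, V, C = skel.shape
        # Graph convolution: aggregate each joint's neighbors, then project.
        x = torch.einsum('vw,btwc->btvc', self.adj.softmax(-1), skel)
        x = torch.relu(self.gcn(x)).mean(dim=2)            # pool joints -> (B, T, hidden)
        _, h = self.text_enc(text)                         # final GRU state: (1, B, hidden)
        t = h[-1].unsqueeze(1).expand(-1, T, -1)           # broadcast caption feature over time
        fused = torch.cat([x, t], dim=-1).transpose(1, 2)  # fuse -> (B, 2*hidden, T)
        z = torch.relu(self.temporal(fused)).mean(dim=2)   # temporal relations of fused features
        return self.fc(z)                                  # class logits

if __name__ == "__main__":
    model = SkeletonTextFusion()
    skel = torch.randn(2, 30, 20, 3)   # 2 clips, 30 frames, 20 joints, xyz
    text = torch.randn(2, 12, 300)     # 2 captions, 12 words, 300-d embeddings
    print(model(skel, text).shape)     # torch.Size([2, 12])
```

Broadcasting a single caption feature across all frames before the temporal convolution is one simple way to fuse a sequence-level description with frame-level skeleton features; the paper's actual fusion mechanism may differ.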


Published in

ICDIP '23: Proceedings of the 15th International Conference on Digital Image Processing
May 2023, 711 pages
ISBN: 9798400708237
DOI: 10.1145/3604078
Copyright © 2023 ACM

Publisher: Association for Computing Machinery, New York, NY, United States
