DOI: 10.1145/3503161.3547822

Self-Supervised Representation Learning for Skeleton-Based Group Activity Recognition

Published: 10 October 2022

ABSTRACT

Group activity recognition (GAR) is the challenging task of discerning the behavior of a group of actors. This paper aims at learning discriminative representations for GAR in a self-supervised manner based on human skeletons. Since modeling the relations between actors lies at the center of GAR, we propose an effective self-supervised pretext task built on a matching framework, in which a representation model is driven to identify subgroups within a synthetic group from the actors' skeleton sequences. Regarding backbone networks, although spatial-temporal graph convolutional networks dominate skeleton-based action recognition, they under-explore the group-relevant interactions among actors. To address this issue, we propose a novel plug-in Actor-Association Graph Convolution Module (AAGCM) based on inductive graph convolution, which can be integrated into many common backbones. It not only models interactions at different levels but also adapts to variable group sizes. The effectiveness of our approach is demonstrated by extensive experiments on three benchmark datasets: Volleyball, Collective Activity, and Mutual NTU.
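To make the two ideas above more concrete, the following is a minimal, illustrative PyTorch sketch of an inductive actor-association graph convolution and of the synthetic-group pretext construction. It is not the authors' implementation: the module name ActorAssociationConv, the mean-aggregation rule, the feature dimensions, and the pseudo-label construction are all assumptions chosen only to show how an inductive aggregation over per-actor skeleton embeddings can handle a variable number of actors.

# Illustrative sketch only -- not the authors' released code.
# (1) an inductive, plug-in graph convolution over per-actor embeddings
#     (GraphSAGE-style mean aggregation, so any group size N works), and
# (2) a synthetic-group pretext batch in which a model must recover which
#     actors came from the same original group (subgroup identification).
# All module and variable names here are hypothetical.

import torch
import torch.nn as nn


class ActorAssociationConv(nn.Module):
    """Inductive graph convolution over a fully connected actor graph."""

    def __init__(self, dim: int):
        super().__init__()
        self.self_proj = nn.Linear(dim, dim)    # transform of the actor itself
        self.neigh_proj = nn.Linear(dim, dim)   # transform of the aggregated neighbours
        self.act = nn.ReLU(inplace=True)

    def forward(self, actors: torch.Tensor) -> torch.Tensor:
        # actors: (N, D) per-actor skeleton embeddings; N may vary per sample.
        n = actors.size(0)
        # Mean over the other actors (self excluded) as the neighbourhood summary.
        total = actors.sum(dim=0, keepdim=True)          # (1, D)
        neigh_mean = (total - actors) / max(n - 1, 1)    # (N, D)
        return self.act(self.self_proj(actors) + self.neigh_proj(neigh_mean))


if __name__ == "__main__":
    dim = 64
    conv = ActorAssociationConv(dim)

    # Two real groups of different sizes (e.g. encoded by a skeleton backbone).
    group_a = torch.randn(6, dim)
    group_b = torch.randn(9, dim)

    # Pretext construction: mix the two groups into one synthetic group and
    # keep subgroup pseudo-labels; a matching head would learn to recover them.
    synthetic = torch.cat([group_a, group_b], dim=0)              # (15, D)
    subgroup = torch.cat([torch.zeros(6), torch.ones(9)]).long()  # pseudo-labels

    relation_aware = conv(synthetic)                              # (15, D) for any N
    print(relation_aware.shape, subgroup.shape)

Because the neighbourhood summary is a size-agnostic mean over the other actors, the same weights apply to groups of any size, which is the property the abstract attributes to inductive graph convolution; a matching head trained on the subgroup pseudo-labels would realize the subgroup-identification pretext task in this simplified setting.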



Published in

MM '22: Proceedings of the 30th ACM International Conference on Multimedia
      October 2022
      7537 pages
ISBN: 9781450392037
DOI: 10.1145/3503161

      Copyright © 2022 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 10 October 2022


      Qualifiers

      • research-article

      Acceptance Rates

Overall Acceptance Rate: 995 of 4,171 submissions, 24%
