ABSTRACT
Group activity recognition (GAR) is the challenging task of discerning the behavior of a group of actors. This paper aims to learn discriminative representations for GAR in a self-supervised manner from human skeletons. Because modeling relations between actors lies at the center of GAR, we propose a self-supervised pretext task built on a matching framework, in which a representation model is driven to identify subgroups in a synthetic group based on the actors' skeleton sequences. For backbone networks, although spatial-temporal graph convolution networks have dominated skeleton-based action recognition, they under-explore the group-relevant interactions among actors. To address this issue, we propose a novel plug-in Actor-Association Graph Convolution Module (AAGCM) based on inductive graph convolution, which can be integrated into many common backbones. It not only models interactions at different levels but also adapts to variable group sizes. The effectiveness of our approaches is demonstrated by extensive experiments on three benchmark datasets: Volleyball, Collective Activity, and Mutual NTU.
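The abstract does not give implementation details, so the following is only an illustrative sketch of the two ideas it names, not the authors' method. It assumes a synthetic group is formed by mixing actors from two source groups, with each actor labeled by its source subgroup, and it uses a generic mean-aggregator inductive graph convolution over a fully connected actor graph to show why such a layer is independent of group size. The names `make_synthetic_group` and `mean_aggregate_layer`, the tensor shapes, and the weight matrices are all hypothetical.

```python
import numpy as np


def make_synthetic_group(group_a, group_b, rng):
    """Mix the actors of two groups into one synthetic group (assumed construction).

    group_a, group_b: arrays of shape (num_actors, frames, joints, coords).
    Returns the shuffled actors and a 0/1 label per actor naming its source subgroup.
    """
    actors = np.concatenate([group_a, group_b], axis=0)
    labels = np.concatenate([
        np.zeros(len(group_a), dtype=int),
        np.ones(len(group_b), dtype=int),
    ])
    perm = rng.permutation(len(actors))  # hide the original ordering
    return actors[perm], labels[perm]


def mean_aggregate_layer(h, w_self, w_neigh):
    """Generic mean-aggregator inductive graph convolution over actors.

    h: (N, D) per-actor features on a fully connected actor graph.
    The weights do not depend on N, so the layer accepts any group size.
    """
    n = h.shape[0]
    if n > 1:
        # Mean of every other actor's features, computed for all actors at once.
        neigh = (h.sum(axis=0, keepdims=True) - h) / (n - 1)
    else:
        neigh = np.zeros_like(h)  # a single actor has no neighbors
    return np.maximum(h @ w_self + neigh @ w_neigh, 0.0)  # ReLU


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    group_a = rng.normal(size=(3, 10, 17, 2))  # 3 actors, 10 frames, 17 joints, 2D
    group_b = rng.normal(size=(4, 10, 17, 2))  # 4 actors
    actors, labels = make_synthetic_group(group_a, group_b, rng)
    feats = actors.reshape(len(actors), -1)    # stand-in for backbone features
    w_self = rng.normal(size=(feats.shape[1], 8)) * 0.01
    w_neigh = rng.normal(size=(feats.shape[1], 8)) * 0.01
    out = mean_aggregate_layer(feats, w_self, w_neigh)
    print(out.shape, labels)
```

In this sketch the subgroup labels come for free from the way the synthetic group is built, which is what makes the task self-supervised; a real pipeline would replace the flattened skeletons with features from a skeleton backbone.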
Index Terms
- Self-Supervised Representation Learning for Skeleton-Based Group Activity Recognition