DOI: 10.1145/3503161.3547825
Research article

Learning Action-guided Spatio-temporal Transformer for Group Activity Recognition

Published: 10 October 2022

Abstract

Learning spatial and temporal relations among people plays an important role in recognizing group activity. Recently, transformer-based methods have become popular solutions owing to the self-attention mechanism. However, person-level features are typically fed into the self-attention module without any refinement. Moreover, group activity in a clip often involves unbalanced spatio-temporal interactions, where only a few persons performing distinctive actions are critical to identifying different activities. Without elaborately modeling the action dependencies among all people, these spatio-temporal interactions are difficult to learn. In this paper, a novel Action-guided Spatio-Temporal transFormer (ASTFormer) is proposed to capture interaction relations for group activity recognition by learning action-centric aggregation and modeling spatio-temporal action dependencies. Specifically, ASTFormer first assigns all persons in each frame to latent actions and performs action-centric aggregation by computing a weighted sum of residuals for each latent action under the supervision of global action information. A dual-branch transformer then refines the inter- and intra-frame action-level features, with two self-attention encoders selecting the important tokens. Next, a semantic action graph is explicitly devised to model dynamic action-wise dependencies. Finally, fusing these cues enables the model to boost group activity recognition while requiring only video-level action labels. Extensive experiments on two popular benchmarks (Volleyball and Collective Activity) demonstrate the superior performance of our method in comparison with state-of-the-art methods, using only raw RGB frames as input.
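
The action-centric aggregation described in the abstract (soft-assigning person features to latent actions, then weighting the sum of residuals per action) reads like a NetVLAD-style aggregation. Below is a minimal PyTorch sketch of one plausible interpretation; the module name, the choice of a linear soft-assignment layer, and the hyperparameters are assumptions for illustration, not the paper's verified implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionCentricAggregation(nn.Module):
    """Hypothetical NetVLAD-style aggregation of person features into
    action-level tokens, loosely following the abstract's description."""

    def __init__(self, feat_dim: int = 512, num_actions: int = 8):
        super().__init__()
        # One learnable prototype per latent action (assumed representation).
        self.prototypes = nn.Parameter(torch.randn(num_actions, feat_dim))
        # Soft-assignment logits of each person to each latent action.
        self.assign = nn.Linear(feat_dim, num_actions)

    def forward(self, person_feats: torch.Tensor) -> torch.Tensor:
        # person_feats: (B, N, D) -- N person-level features in a frame.
        weights = F.softmax(self.assign(person_feats), dim=-1)    # (B, N, K)
        # Residual of every person feature from every latent action.
        residuals = person_feats.unsqueeze(2) - self.prototypes   # (B, N, K, D)
        # Weighted sum of residuals per latent action -> K action tokens.
        tokens = (weights.unsqueeze(-1) * residuals).sum(dim=1)   # (B, K, D)
        return F.normalize(tokens, dim=-1)

# Example: 12 players per frame, 512-D features, 8 latent actions.
agg = ActionCentricAggregation(feat_dim=512, num_actions=8)
tokens = agg(torch.randn(2, 12, 512))  # -> (2, 8, 512) action-level tokens
```

In the paper's pipeline these action-level tokens would then feed the dual-branch transformer, with one encoder attending across frames (inter-frame) and one within each frame (intra-frame); the supervision by global action information mentioned in the abstract is not reproduced in this sketch.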

Supplementary Material

MP4 File (mm22-fp0350.mp4)
Presentation video


Published In

MM '22: Proceedings of the 30th ACM International Conference on Multimedia
October 2022
7537 pages
ISBN: 9781450392037
DOI: 10.1145/3503161

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. deep learning
  2. graph neural networks
  3. group activity recognition
  4. transformer

Conference

MM '22

Acceptance Rates

Overall Acceptance Rate: 2,145 of 8,556 submissions (25%)


Cited By

  • (2025) Human activity recognition: A review of deep learning-based methods. IET Computer Vision, 19:1. DOI: 10.1049/cvi2.70003
  • (2024) Knowledge Augmented Relation Inference for Group Activity Recognition. IEEE Transactions on Circuits and Systems for Video Technology, 34:11, 11644-11656. DOI: 10.1109/TCSVT.2024.3425856
  • (2024) Spatial Formation-Guided Network for Group Activity Recognition. ICASSP 2024 - IEEE International Conference on Acoustics, Speech and Signal Processing, 4250-4254. DOI: 10.1109/ICASSP48485.2024.10447784
  • (2024) Masked Autoencoders for Spatial-Temporal Relationship in Video-Based Group Activity Recognition. IEEE Access, 12, 132084-132095. DOI: 10.1109/ACCESS.2024.3457024
  • (2024) MLP-AIR: An effective MLP-based module for actor interaction relation learning in group activity recognition. Knowledge-Based Systems, 304, 112453. DOI: 10.1016/j.knosys.2024.112453
  • (2024) MA-VLAD: A fine-grained local feature aggregation scheme for action recognition. Multimedia Systems, 30:3. DOI: 10.1007/s00530-024-01341-9
  • (2024) Rethinking group activity recognition under the open set condition. The Visual Computer, 41:2, 1351-1366. DOI: 10.1007/s00371-024-03424-0
  • (2024) Towards More Practical Group Activity Detection: A New Benchmark and Model. Computer Vision - ECCV 2024, 240-258. DOI: 10.1007/978-3-031-72970-6_14
  • (2023) FlexIcon: Flexible Icon Colorization via Guided Images and Palettes. Proceedings of the 31st ACM International Conference on Multimedia, 8662-8673. DOI: 10.1145/3581783.3612182
  • (2023) Self-Supervised Global Spatio-Temporal Interaction Pre-Training for Group Activity Recognition. IEEE Transactions on Circuits and Systems for Video Technology, 33:9, 5076-5088. DOI: 10.1109/TCSVT.2023.3249906
