Learning Action-guided Spatio-temporal Transformer for Group Activity Recognition

Published: 10 October 2022


Learning spatial and temporal relations among people plays an important role in recognizing group activity. Recently, transformer-based methods have become popular solutions owing to the self-attention mechanism. However, these methods feed person-level features directly into the self-attention module without any refinement. Moreover, group activity in a clip often involves unbalanced spatio-temporal interactions, where only a few persons performing distinctive actions are critical to identifying the activity. Learning these spatio-temporal interactions is difficult without explicitly modeling the action dependencies among all people. In this paper, a novel Action-guided Spatio-Temporal transFormer (ASTFormer) is proposed to capture the interaction relations for group activity recognition by learning action-centric aggregation and modeling spatio-temporal action dependencies. Specifically, ASTFormer first assigns all persons in each frame to latent actions, and an action-centric aggregation strategy computes a weighted sum of residuals for each latent action under the supervision of global action information. Then, a dual-branch transformer is proposed to refine the inter- and intra-frame action-level features, where two encoders with the self-attention mechanism are employed to select important tokens. Next, a semantic action graph is explicitly devised to model the dynamic action-wise dependencies. Finally, the model fuses these cues to improve group activity recognition while requiring only video-level action labels. Extensive experiments on two popular benchmarks (Volleyball and Collective Activity) demonstrate that our method outperforms state-of-the-art methods using only raw RGB frames as input.
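The action-centric aggregation step mentioned in the abstract (assigning persons to latent actions, then taking a weighted sum of residuals per latent action) can be illustrated with a minimal sketch. The abstract does not specify how the assignment weights are computed, so the softmax over negative squared distances used here is an assumption in the style of VLAD-like aggregation, not the authors' implementation. The names `soft_assign`, `action_centric_aggregate`, `centers`, and `temperature` are hypothetical, and the supervision by global action information is not modeled.

```python
import math

def soft_assign(person_feats, centers, temperature=1.0):
    """Assumed assignment rule: softmax over negative squared distances.

    Returns one row of weights per person, one weight per latent action.
    """
    weights = []
    for x in person_feats:
        logits = [
            -sum((xi - ci) ** 2 for xi, ci in zip(x, c)) / temperature
            for c in centers
        ]
        top = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - top) for l in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

def action_centric_aggregate(person_feats, centers, temperature=1.0):
    """One token per latent action k: sum_i a_ik * (x_i - c_k)."""
    weights = soft_assign(person_feats, centers, temperature)
    dim = len(centers[0])
    tokens = []
    for k, c in enumerate(centers):
        token = [0.0] * dim
        for x, w in zip(person_feats, weights):
            for d in range(dim):
                token[d] += w[k] * (x[d] - c[d])  # weighted residual
        tokens.append(token)
    return tokens

# Toy frame: three persons with 2-D features, two hypothetical latent actions.
persons = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
centers = [[0.5, 0.5], [4.0, 4.0]]
tokens = action_centric_aggregate(persons, centers)
print(len(tokens), len(tokens[0]))
```

The output has one vector per latent action rather than one per person, which is the property that lets later stages operate on action-level tokens. In a trained model the centers and assignment parameters would be learned rather than fixed as they are in this toy example.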




Published In

MM '22: Proceedings of the 30th ACM International Conference on Multimedia
October 2022
7537 pages


Author Tags

  1. deep learning
  2. graph neural networks
  3. group activity recognition
  4. transformer


Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


Cited By

  • (2025) Human activity recognition: A review of deep learning‐based methods. IET Computer Vision 19:1. DOI: 10.1049/cvi2.70003. Online publication date: Feb-2025
  • (2024) Knowledge Augmented Relation Inference for Group Activity Recognition. IEEE Transactions on Circuits and Systems for Video Technology 34:11 (11644-11656). DOI: 10.1109/TCSVT.2024.3425856. Online publication date: Nov-2024
  • (2024) Spatial Formation-Guided Network for Group Activity Recognition. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (4250-4254). DOI: 10.1109/ICASSP48485.2024.10447784. Online publication date: 14-Apr-2024
  • (2024) Masked Autoencoders for Spatial–Temporal Relationship in Video-Based Group Activity Recognition. IEEE Access 12 (132084-132095). DOI: 10.1109/ACCESS.2024.3457024. Online publication date: 2024
  • (2024) MLP-AIR: An effective MLP-based module for actor interaction relation learning in group activity recognition. Knowledge-Based Systems 304 (112453). DOI: 10.1016/j.knosys.2024.112453. Online publication date: Nov-2024
  • (2024) MA-VLAD: a fine-grained local feature aggregation scheme for action recognition. Multimedia Systems 30:3. DOI: 10.1007/s00530-024-01341-9. Online publication date: 3-May-2024
  • (2024) Rethinking group activity recognition under the open set condition. The Visual Computer 41:2 (1351-1366). DOI: 10.1007/s00371-024-03424-0. Online publication date: 13-May-2024
  • (2024) Towards More Practical Group Activity Detection: A New Benchmark and Model. Computer Vision – ECCV 2024 (240-258). DOI: 10.1007/978-3-031-72970-6_14. Online publication date: 23-Nov-2024
  • (2023) FlexIcon: Flexible Icon Colorization via Guided Images and Palettes. Proceedings of the 31st ACM International Conference on Multimedia (8662-8673). DOI: 10.1145/3581783.3612182. Online publication date: 26-Oct-2023
  • (2023) Self-Supervised Global Spatio-Temporal Interaction Pre-Training for Group Activity Recognition. IEEE Transactions on Circuits and Systems for Video Technology 33:9 (5076-5088). DOI: 10.1109/TCSVT.2023.3249906. Online publication date: 1-Sep-2023

