DOI: 10.1145/3503161.3547825
Research article

Learning Action-guided Spatio-temporal Transformer for Group Activity Recognition

Published: 10 October 2022

Abstract

Learning spatial and temporal relations among people plays an important role in recognizing group activity. Recently, transformer-based methods have become popular solutions owing to the self-attention mechanism. However, person-level features are typically fed into the self-attention module without any refinement. Moreover, group activity in a clip often involves unbalanced spatio-temporal interactions, where only a few persons performing distinctive actions are critical to identifying different activities. Without elaborately modeling the action dependencies among all people, these spatio-temporal interactions are difficult to learn. In this paper, a novel Action-guided Spatio-Temporal transFormer (ASTFormer) is proposed to capture interaction relations for group activity recognition by learning action-centric aggregation and modeling spatio-temporal action dependencies. Specifically, ASTFormer first assigns all persons in each frame to latent actions and performs action-centric aggregation by computing a weighted sum of residuals for each latent action under the supervision of global action information. A dual-branch transformer then refines the inter- and intra-frame action-level features, with two self-attention encoders selecting the important tokens. Next, a semantic action graph is explicitly devised to model dynamic action-wise dependencies. Finally, fusing these cues enables the model to boost group activity recognition while requiring only video-level action labels. Extensive experiments on two popular benchmarks (Volleyball and Collective Activity) demonstrate the superior performance of our method in comparison with state-of-the-art methods, using only raw RGB frames as input.
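
The action-centric aggregation described in the abstract (soft-assigning person features to latent actions, then weighting the sum of residuals per action) reads like a NetVLAD-style aggregation. Below is a minimal PyTorch sketch of one plausible interpretation; the module name, the choice of a linear soft-assignment layer, and the hyperparameters are assumptions for illustration, not the paper's verified implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionCentricAggregation(nn.Module):
    """Hypothetical NetVLAD-style aggregation of person features into
    action-level tokens, loosely following the abstract's description."""

    def __init__(self, feat_dim: int = 512, num_actions: int = 8):
        super().__init__()
        # One learnable prototype per latent action (assumed representation).
        self.prototypes = nn.Parameter(torch.randn(num_actions, feat_dim))
        # Soft-assignment logits of each person to each latent action.
        self.assign = nn.Linear(feat_dim, num_actions)

    def forward(self, person_feats: torch.Tensor) -> torch.Tensor:
        # person_feats: (B, N, D) -- N person-level features in a frame.
        weights = F.softmax(self.assign(person_feats), dim=-1)    # (B, N, K)
        # Residual of every person feature from every latent action.
        residuals = person_feats.unsqueeze(2) - self.prototypes   # (B, N, K, D)
        # Weighted sum of residuals per latent action -> K action tokens.
        tokens = (weights.unsqueeze(-1) * residuals).sum(dim=1)   # (B, K, D)
        return F.normalize(tokens, dim=-1)

# Example: 12 players per frame, 512-D features, 8 latent actions.
agg = ActionCentricAggregation(feat_dim=512, num_actions=8)
tokens = agg(torch.randn(2, 12, 512))  # -> (2, 8, 512) action-level tokens
```

In the paper's pipeline these action-level tokens would then feed the dual-branch transformer, with one encoder attending across frames (inter-frame) and one within each frame (intra-frame); the supervision by global action information mentioned in the abstract is not reproduced in this sketch.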

Supplementary Material

MP4 File (mm22-fp0350.mp4)
Presentation video


Published In

MM '22: Proceedings of the 30th ACM International Conference on Multimedia
October 2022
7537 pages
ISBN: 9781450392037
DOI: 10.1145/3503161

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. deep learning
  2. graph neural networks
  3. group activity recognition
  4. transformer

Conference

MM '22

Acceptance Rates

Overall Acceptance Rate: 2,145 of 8,556 submissions (25%)


Cited By

  • (2025) Human activity recognition: A review of deep learning-based methods. IET Computer Vision, 19:1. DOI: 10.1049/cvi2.70003
  • (2024) Knowledge Augmented Relation Inference for Group Activity Recognition. IEEE Transactions on Circuits and Systems for Video Technology, 34:11, 11644-11656. DOI: 10.1109/TCSVT.2024.3425856
  • (2024) Spatial Formation-Guided Network for Group Activity Recognition. ICASSP 2024 - IEEE International Conference on Acoustics, Speech and Signal Processing, 4250-4254. DOI: 10.1109/ICASSP48485.2024.10447784
  • (2024) Masked Autoencoders for Spatial-Temporal Relationship in Video-Based Group Activity Recognition. IEEE Access, 12, 132084-132095. DOI: 10.1109/ACCESS.2024.3457024
  • (2024) MLP-AIR: An effective MLP-based module for actor interaction relation learning in group activity recognition. Knowledge-Based Systems, 304, 112453. DOI: 10.1016/j.knosys.2024.112453
  • (2024) MA-VLAD: A fine-grained local feature aggregation scheme for action recognition. Multimedia Systems, 30:3. DOI: 10.1007/s00530-024-01341-9
  • (2024) Rethinking group activity recognition under the open set condition. The Visual Computer, 41:2, 1351-1366. DOI: 10.1007/s00371-024-03424-0
  • (2024) Towards More Practical Group Activity Detection: A New Benchmark and Model. Computer Vision - ECCV 2024, 240-258. DOI: 10.1007/978-3-031-72970-6_14
  • (2023) FlexIcon: Flexible Icon Colorization via Guided Images and Palettes. Proceedings of the 31st ACM International Conference on Multimedia, 8662-8673. DOI: 10.1145/3581783.3612182
  • (2023) Self-Supervised Global Spatio-Temporal Interaction Pre-Training for Group Activity Recognition. IEEE Transactions on Circuits and Systems for Video Technology, 33:9, 5076-5088. DOI: 10.1109/TCSVT.2023.3249906
