Abstract
Group activity recognition is a challenging task in computer vision that requires comprehensively modeling the diverse spatio-temporal relations among individuals and generating a group representation. In this paper, we propose a novel group activity recognition approach, named Hierarchical Long-Short Transformer (HLSTrans). Built on the Transformer, it considers both long- and short-range relationships among individuals via Long-Short Transformer Blocks. Moreover, we build a hierarchical structure in HLSTrans by stacking such blocks to capture rich individual relations at multiple scales. By modeling long- and short-range relations hierarchically, HLSTrans enhances the representations of individuals and groups, leading to better recognition performance. We evaluate the proposed HLSTrans on the Volleyball and VolleyTactic datasets, and the experimental results demonstrate state-of-the-art performance.
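The abstract describes the architecture only at a high level, so the following PyTorch sketch is purely illustrative and not the authors' implementation: it shows one plausible way a block could combine global (long-range) and windowed (short-range) self-attention over per-person features, with several such blocks stacked and pooled into a group representation. All names (LongShortBlock, HLSTransSketch), the window size, depth, and mean pooling are assumptions made for the example.

```python
# Illustrative sketch only -- not the paper's implementation.
import torch
import torch.nn as nn


class LongShortBlock(nn.Module):
    """One block combining global (long-range) and windowed (short-range) attention."""

    def __init__(self, dim: int, num_heads: int = 4, window: int = 3):
        super().__init__()
        self.window = window  # assumed neighbourhood size for short-range attention
        self.long_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.short_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_individuals, dim) -- one feature vector per person.
        h = self.norm1(x)
        long_out, _ = self.long_attn(h, h, h)  # all-pairs (long-range) relations
        # Short-range: mask out individuals farther than `window` in index order
        # (index order standing in for spatial proximity; an assumption here).
        n = x.size(1)
        idx = torch.arange(n, device=x.device)
        mask = (idx[None, :] - idx[:, None]).abs() > self.window  # True = blocked
        short_out, _ = self.short_attn(h, h, h, attn_mask=mask)
        x = x + long_out + short_out
        return x + self.mlp(self.norm2(x))


class HLSTransSketch(nn.Module):
    """Stack blocks to mimic the hierarchical structure the abstract describes."""

    def __init__(self, dim: int = 256, depth: int = 3, num_classes: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList([LongShortBlock(dim) for _ in range(depth)])
        self.head = nn.Linear(dim, num_classes)

    def forward(self, person_feats: torch.Tensor) -> torch.Tensor:
        x = person_feats  # (batch, num_individuals, dim)
        for blk in self.blocks:
            x = blk(x)
        group = x.mean(dim=1)  # pool individuals into a group representation
        return self.head(group)  # group activity logits


# Example: 12 players with 256-d features each, batch of 2 clips.
logits = HLSTransSketch()(torch.randn(2, 12, 256))
```

The design choice illustrated here is simply that short-range attention restricts each individual to a local neighbourhood while long-range attention spans all individuals, and stacking blocks lets relations aggregate across scales.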
Acknowledgement
This work was supported by the National Natural Science Foundation of China (62176025, U21B200389), the Fundamental Research Funds for the Central Universities (2021rc38), and the National Natural Science Foundation of China (62106015).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zhuang, Y., He, Z., Kong, L., Lei, M. (2022). Hierarchical Long-Short Transformer for Group Activity Recognition. In: Yu, S., et al. Pattern Recognition and Computer Vision. PRCV 2022. Lecture Notes in Computer Science, vol 13536. Springer, Cham. https://doi.org/10.1007/978-3-031-18913-5_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-18912-8
Online ISBN: 978-3-031-18913-5