Abstract
Group activity recognition is a challenging task in computer vision that requires comprehensively modeling the diverse spatio-temporal relations among individuals and generating a group representation. In this paper, we propose a novel group activity recognition approach, named Hierarchical Long-Short Transformer (HLSTrans). Built on the Transformer, it considers both long- and short-range relationships among individuals via Long-Short Transformer Blocks. Moreover, we build a hierarchical structure in HLSTrans by stacking such blocks to capture rich individual relations at multiple scales. By modeling long- and short-range relations hierarchically, HLSTrans enhances the representations of individuals and groups, leading to better recognition performance. We evaluate the proposed HLSTrans on the Volleyball and VolleyTactic datasets, and the experimental results demonstrate state-of-the-art performance.
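The abstract describes the architecture only at a high level, so the following PyTorch sketch is purely illustrative and not the authors' implementation: it shows one plausible way a block could combine global (long-range) and windowed (short-range) self-attention over per-person features, with several such blocks stacked and pooled into a group representation. All names (LongShortBlock, HLSTransSketch), the window size, depth, and mean pooling are assumptions made for the example.

```python
# Illustrative sketch only -- not the paper's implementation.
import torch
import torch.nn as nn


class LongShortBlock(nn.Module):
    """One block combining global (long-range) and windowed (short-range) attention."""

    def __init__(self, dim: int, num_heads: int = 4, window: int = 3):
        super().__init__()
        self.window = window  # assumed neighbourhood size for short-range attention
        self.long_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.short_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_individuals, dim) -- one feature vector per person.
        h = self.norm1(x)
        long_out, _ = self.long_attn(h, h, h)  # all-pairs (long-range) relations
        # Short-range: mask out individuals farther than `window` in index order
        # (index order standing in for spatial proximity; an assumption here).
        n = x.size(1)
        idx = torch.arange(n, device=x.device)
        mask = (idx[None, :] - idx[:, None]).abs() > self.window  # True = blocked
        short_out, _ = self.short_attn(h, h, h, attn_mask=mask)
        x = x + long_out + short_out
        return x + self.mlp(self.norm2(x))


class HLSTransSketch(nn.Module):
    """Stack blocks to mimic the hierarchical structure the abstract describes."""

    def __init__(self, dim: int = 256, depth: int = 3, num_classes: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList([LongShortBlock(dim) for _ in range(depth)])
        self.head = nn.Linear(dim, num_classes)

    def forward(self, person_feats: torch.Tensor) -> torch.Tensor:
        x = person_feats  # (batch, num_individuals, dim)
        for blk in self.blocks:
            x = blk(x)
        group = x.mean(dim=1)  # pool individuals into a group representation
        return self.head(group)  # group activity logits


# Example: 12 players with 256-d features each, batch of 2 clips.
logits = HLSTransSketch()(torch.randn(2, 12, 256))
```

The design choice illustrated here is simply that short-range attention restricts each individual to a local neighbourhood while long-range attention spans all individuals, and stacking blocks lets relations aggregate across scales.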
Acknowledgement
This work was supported by the National Natural Science Foundation of China (62176025, U21B200389), the Fundamental Research Funds for the Central Universities (2021rc38), and the National Natural Science Foundation of China (62106015).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zhuang, Y., He, Z., Kong, L., Lei, M. (2022). Hierarchical Long-Short Transformer for Group Activity Recognition. In: Yu, S., et al. Pattern Recognition and Computer Vision. PRCV 2022. Lecture Notes in Computer Science, vol 13536. Springer, Cham. https://doi.org/10.1007/978-3-031-18913-5_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-18912-8
Online ISBN: 978-3-031-18913-5