Abstract
This paper aims to learn group activity representations in an unsupervised fashion, without manually annotated activity labels. To achieve this, we exploit self-supervised learning based on group predictions and propose a Transformer-based Predictive Coding approach (TransPC), which mines meaningful spatio-temporal features of group activities merely from the data itself. First, in TransPC, a Spatial Graph Transformer Encoder (SGT-Encoder) is designed to capture the diverse spatial states that lie in individual actions and group interactions. Then, a Temporal Causal Transformer Decoder (TCT-Decoder) anticipates future group states by attending to the observed state dynamics. Furthermore, because group states are complex, we consider both the distinguishability and the consistency of the predicted states and introduce a joint learning mechanism to optimize the model, enabling TransPC to learn better group activity representations. Finally, extensive experiments evaluate the learnt representation on downstream tasks on the Volleyball and Collective Activity datasets, demonstrating state-of-the-art performance over existing self-supervised learning approaches with fewer training labels.
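To make the predictive-coding idea concrete, the following is a minimal sketch and not the authors' implementation. It illustrates two ingredients named in the abstract: causal attention over observed group states, and a joint objective with a distinguishability term and a consistency term. The abstract does not specify the loss functions, so the InfoNCE-style contrastive term and the mean-squared-error term used here are assumptions, and all function names (`causal_attention`, `info_nce`, `consistency`, `joint_loss`) and the `weight` parameter are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def causal_attention(states):
    """Single-head self-attention over T state vectors in which step t
    attends only to steps <= t (no learned projections, for brevity)."""
    dim = len(states[0])
    outputs = []
    for t, query in enumerate(states):
        visible = states[: t + 1]  # future steps are masked out
        scores = [dot(query, key) / math.sqrt(dim) for key in visible]
        peak = max(scores)  # subtract the max for a stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append(
            [sum(w / total * v[d] for w, v in zip(weights, visible))
             for d in range(dim)]
        )
    return outputs


def info_nce(predicted, targets):
    """Assumed distinguishability term: each predicted state should score
    higher against its own target than against the other targets."""
    loss = 0.0
    for i, pred in enumerate(predicted):
        logits = [dot(pred, tgt) for tgt in targets]
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss += log_norm - logits[i]
    return loss / len(predicted)


def consistency(predicted, targets):
    """Assumed consistency term: mean squared error between each
    predicted state and its target state."""
    total = 0.0
    for pred, tgt in zip(predicted, targets):
        total += sum((p - q) ** 2 for p, q in zip(pred, tgt)) / len(pred)
    return total / len(predicted)


def joint_loss(predicted, targets, weight=1.0):
    """Weighted sum of the two assumed terms (weight is hypothetical)."""
    return info_nce(predicted, targets) + weight * consistency(predicted, targets)
```

In this sketch, changing a later state never alters the attention output at an earlier step, which is the property a causal decoder needs in order to anticipate future states from observed ones only.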
Acknowledgement
This work was supported by the National Natural Science Foundation of China (62176025, U21B200389), the Fundamental Research Funds for the Central Universities (2021rc38), and the National Natural Science Foundation of China (62106015).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Kong, L., He, Z., Zhang, M., Xue, Y. (2022). Group Activity Representation Learning with Self-supervised Predictive Coding. In: Yu, S., et al. Pattern Recognition and Computer Vision. PRCV 2022. Lecture Notes in Computer Science, vol 13536. Springer, Cham. https://doi.org/10.1007/978-3-031-18913-5_16
DOI: https://doi.org/10.1007/978-3-031-18913-5_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-18912-8
Online ISBN: 978-3-031-18913-5
eBook Packages: Computer Science, Computer Science (R0)