Group Activity Representation Learning with Self-supervised Predictive Coding

Kong, Longteng; He, Zhaofeng; Zhang, Man; Xue, Yunzhi

doi:10.1007/978-3-031-18913-5_16

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13536))

Included in the following conference series:

Chinese Conference on Pattern Recognition and Computer Vision (PRCV)

Abstract

This paper aims to learn group activity representations in an unsupervised fashion, without manually annotated activity labels. To achieve this, we exploit self-supervised learning based on group predictions and propose a Transformer-based Predictive Coding approach (TransPC), which mines meaningful spatio-temporal features of group activities from the data alone. First, in TransPC, a Spatial Graph Transformer Encoder (SGT-Encoder) is designed to capture the diverse spatial states that lie in individual actions and group interactions. Then, a Temporal Causal Transformer Decoder (TCT-Decoder) anticipates future group states by attending to the observed state dynamics. Furthermore, because group states are complex, we consider both the distinguishability and the consistency of the predicted states and introduce a joint learning mechanism to optimize the models, enabling TransPC to learn better group activity representations. Finally, extensive experiments evaluate the learnt representation on downstream tasks on the Volleyball and Collective Activity datasets, demonstrating state-of-the-art performance over existing self-supervised learning approaches with fewer training labels.
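
The abstract does not give the training objective or the attention details, so the following is only a minimal sketch of two generic ingredients of predictive coding with a causal decoder, not the authors' implementation. It assumes (hypothetically) that "distinguishability" is enforced with an InfoNCE-style contrastive loss, in which each predicted future state should score highest against its own target state while the other targets act as negatives. The names `causal_mask`, `info_nce`, and `temperature` are illustrative and do not come from the paper.

```python
import math


def causal_mask(num_steps):
    """mask[i][j] is True when time step i may attend to step j (only j <= i)."""
    return [[j <= i for j in range(num_steps)] for i in range(num_steps)]


def dot(a, b):
    """Inner product of two equal-length vectors given as lists."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(predicted, targets, temperature=1.0):
    """Mean InfoNCE loss over paired vectors.

    predicted[i] is treated as the positive match for targets[i];
    every other entry of `targets` serves as a negative (an assumption
    made for this sketch).
    """
    total = 0.0
    for i, p in enumerate(predicted):
        logits = [dot(p, z) / temperature for z in targets]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(v - m) for v in logits))
        total += log_denominator - logits[i]
    return total / len(predicted)


if __name__ == "__main__":
    states = [[5.0, 0.0], [0.0, 5.0]]
    print(causal_mask(3))
    print(info_nce(states, states))        # aligned predictions: small loss
    print(info_nce(states[::-1], states))  # mismatched predictions: large loss
```

In this sketch the loss is small when each prediction aligns with its own target and large when predictions are paired with the wrong targets; a consistency term, which the abstract also mentions, would be a separate addition.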

Acknowledgement

This work was supported by the National Natural Science Foundation of China (62176025, U21B200389), the Fundamental Research Funds for the Central Universities (2021rc38), and the National Natural Science Foundation of China (62106015).

Author information

Authors and Affiliations

Beijing University of Posts and Telecommunications, Beijing, 100876, China
Longteng Kong, Zhaofeng He & Man Zhang
Automobile Software Innovation Center, Chongqing, 408000, China
Yunzhi Xue

Corresponding author

Correspondence to Zhaofeng He.

Editor information

Editors and Affiliations

Southern University of Science and Technology, Shenzhen, China
Shiqi Yu
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Zhaoxiang Zhang
Hong Kong Baptist University, Hong Kong, China
Pong C. Yuen
Northwestern Polytechnical University, Xi'an, China
Junwei Han
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Tieniu Tan
Hong Kong Baptist University, Hong Kong, China
Yike Guo
Sun Yat-sen University, Guangzhou, China
Jianhuang Lai
Southern University of Science and Technology, Shenzhen, China
Jianguo Zhang

Cite this paper

Kong, L., He, Z., Zhang, M., Xue, Y. (2022). Group Activity Representation Learning with Self-supervised Predictive Coding. In: Yu, S., et al. Pattern Recognition and Computer Vision. PRCV 2022. Lecture Notes in Computer Science, vol 13536. Springer, Cham. https://doi.org/10.1007/978-3-031-18913-5_16

DOI: https://doi.org/10.1007/978-3-031-18913-5_16
Published: 27 October 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-18912-8
Online ISBN: 978-3-031-18913-5
eBook Packages: Computer Science, Computer Science (R0)
