
Group Activity Representation Learning with Self-supervised Predictive Coding

  • Conference paper

Pattern Recognition and Computer Vision (PRCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13536)


Abstract

This paper aims to learn group activity representations in an unsupervised fashion, without manually annotated activity labels. To this end, we exploit self-supervised learning based on group-state prediction and propose a Transformer-based Predictive Coding approach (TransPC), which mines meaningful spatio-temporal features of group activities from the data itself. In TransPC, a Spatial Graph Transformer Encoder (SGT-Encoder) first captures the diverse spatial states underlying individual actions and group interactions. A Temporal Causal Transformer Decoder (TCT-Decoder) then anticipates future group states by attending to the observed state dynamics. Furthermore, given the complexity of group states, we consider both the distinguishability and the consistency of the predicted states, and introduce a joint learning mechanism to optimize the models, enabling TransPC to learn better group activity representations. Finally, extensive experiments evaluate the learned representations on downstream tasks on the Volleyball and Collective Activity datasets, demonstrating state-of-the-art performance over existing self-supervised learning approaches with fewer training labels.
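The pipeline the abstract describes (encode each frame's actors into a group state, causally predict future states from observed ones, and train with a contrastive objective that keeps predictions distinguishable across time) can be sketched as a minimal NumPy toy. The single-head attention modules and InfoNCE-style loss below are illustrative stand-ins for the paper's SGT-Encoder, TCT-Decoder, and joint objective, not the authors' implementation; all shapes and weight names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_encode(frame, W):
    # frame: (actors, d). Stand-in for the SGT-Encoder: one attention
    # pass mixes actor features, then pooling yields a group state.
    att = softmax(frame @ W @ frame.T / np.sqrt(frame.shape[1]))
    return (att @ frame).mean(axis=0)           # (d,) group state

def causal_predict(states, Wq, Wk, Wv):
    # states: (T, d). Stand-in for the TCT-Decoder: causally masked
    # self-attention, so the prediction at step t sees only steps <= t.
    T, d = states.shape
    scores = (states @ Wq) @ (states @ Wk).T / np.sqrt(d)
    scores[np.triu(np.ones((T, T)), k=1).astype(bool)] = -1e9
    return softmax(scores) @ (states @ Wv)      # (T, d) predicted states

def infonce(pred, target, temp=0.1):
    # Contrastive loss: the prediction at step t should match the true
    # state at t+1 and be distinguishable from states at other steps.
    sim = pred[:-1] @ target[1:].T / temp       # (T-1, T-1)
    logp = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(logp))

d, actors, T = 8, 6, 5
W, Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(4))
clip = rng.normal(size=(T, actors, d))          # T frames of actor features
states = np.stack([spatial_encode(f, W) for f in clip])
loss = infonce(causal_predict(states, Wq, Wk, Wv), states)
print(round(float(loss), 4))
```

In this sketch, minimizing `infonce` by gradient descent over the weights would push predicted states toward the true next group state while keeping them separable across time steps, which mirrors the distinguishability/consistency trade-off the abstract highlights.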



Acknowledgement

This work was supported by the National Natural Science Foundation of China (62176025, U21B200389), the Fundamental Research Funds for the Central Universities (2021rc38), and the National Natural Science Foundation of China (62106015).

Author information

Corresponding author

Correspondence to Zhaofeng He.



Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Kong, L., He, Z., Zhang, M., Xue, Y. (2022). Group Activity Representation Learning with Self-supervised Predictive Coding. In: Yu, S., et al. Pattern Recognition and Computer Vision. PRCV 2022. Lecture Notes in Computer Science, vol 13536. Springer, Cham. https://doi.org/10.1007/978-3-031-18913-5_16


  • DOI: https://doi.org/10.1007/978-3-031-18913-5_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-18912-8

  • Online ISBN: 978-3-031-18913-5

  • eBook Packages: Computer Science (R0)
