
Actor-Centered Representations for Action Localization in Streaming Videos

  • Conference paper
  • In: Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13698)


Abstract

Event perception tasks such as recognizing and localizing actions in streaming videos are essential for scaling to real-world application contexts. We tackle the problem of learning actor-centered representations through continual hierarchical predictive learning to localize actions in streaming videos without requiring training labels or object outlines. We propose a framework, driven by hierarchical predictive learning, that constructs actor-centered features through attention-based contextualization. The key idea is that predictable features or objects do not attract attention and hence do not contribute to the action of interest. Experiments on three benchmark datasets show that the approach learns robust representations for localizing actions using only one epoch of training, i.e., a single pass through the streaming video. The proposed approach outperforms unsupervised and weakly supervised baselines while offering competitive performance to fully supervised approaches. Additionally, we extend the model to multi-actor settings to recognize group activities while localizing the multiple plausible actors, and we show that it generalizes to out-of-domain data with limited performance degradation.
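To make the key idea concrete, the sketch below illustrates prediction-error-driven attention in PyTorch. This is not the authors' implementation: the predictor architecture, feature sizes, and the softmax-over-error attention are illustrative assumptions. A per-location predictor forecasts the next frame's backbone features; locations that are well predicted receive low attention, so attention-weighted pooling concentrates on the surprising, actor-centered regions, and the same error map can be thresholded for localization.

```python
# Minimal sketch (assumed, not the paper's code) of actor-centered attention
# from prediction error: predictable regions are down-weighted, so attention
# concentrates on the unpredictable actor.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActorCenteredAttention(nn.Module):
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # Hypothetical per-location predictor: forecasts the frame t+1 feature
        # at every spatial cell from the frame t features.
        self.predictor = nn.Conv2d(feat_dim, feat_dim, kernel_size=3, padding=1)

    def forward(self, feats_t: torch.Tensor, feats_t1: torch.Tensor):
        # feats_t, feats_t1: (B, C, H, W) backbone features at frames t and t+1.
        pred_t1 = self.predictor(feats_t)
        # Per-location prediction error; predictable locations score low.
        err = (pred_t1 - feats_t1).pow(2).mean(dim=1)                 # (B, H, W)
        # Turn surprise into attention: high error -> high weight.
        attn = torch.softmax(err.flatten(1), dim=1).view_as(err)     # (B, H, W)
        # Actor-centered representation via attention-weighted pooling.
        actor_feat = (feats_t1 * attn.unsqueeze(1)).sum(dim=(2, 3))  # (B, C)
        return actor_feat, attn

# Single-pass, self-supervised training step driven only by prediction error.
model = ActorCenteredAttention()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
f_t, f_t1 = torch.randn(2, 256, 14, 14), torch.randn(2, 256, 14, 14)
actor_feat, attn = model(f_t, f_t1)
loss = F.mse_loss(model.predictor(f_t), f_t1)  # predictive-learning objective
opt.zero_grad(); loss.backward(); opt.step()
# Upsampling and thresholding `attn` would give a rough localization map.
```

The softmax normalization here is one simple choice; any monotone mapping from prediction error to weight would express the same "surprise attracts attention" principle stated in the abstract.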



Acknowledgements

This research was supported in part by the US National Science Foundation grants CNS 1513126, IIS 1956050, IIS 2143150, and IIS 1955230.

Author information

Corresponding author

Correspondence to Sathyanarayanan Aakur.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (zip 379 KB)


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Aakur, S., Sarkar, S. (2022). Actor-Centered Representations for Action Localization in Streaming Videos. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13698. Springer, Cham. https://doi.org/10.1007/978-3-031-19839-7_5

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-19839-7_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19838-0

  • Online ISBN: 978-3-031-19839-7

  • eBook Packages: Computer Science, Computer Science (R0)
