
A progressive hierarchical analysis model for collective activity recognition

  • S.I: Machine Learning based semantic representation and analytics for multimedia application
Published in Neural Computing and Applications

Abstract

We propose a progressive hierarchical analysis model for collective activity recognition. Compared with previous activity recognition work, it not only recognizes the collective activity but also perceives the location and action category of each individual. First, we perform temporal consistency detection for each individual in the collective activity: a person detection network and a conditional random field produce the bounding box sequence of each activity participant. Next, we recognize individual actions with an LSTM operating on learned spatial features and motion features. Finally, the recognized person-level action category vectors are combined with scene context features and interaction context features to recognize the collective activity. We evaluate the proposed approach on benchmark collective activity datasets, and extensive experiments demonstrate the effectiveness of the progressive hierarchical analysis model.
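The three-stage pipeline described in the abstract can be sketched as follows. This is a minimal, self-contained illustration with randomly initialized weights and hypothetical feature dimensions, not the authors' implementation: `person_action_scores` stands in for the LSTM-based individual action recognizer, and the final linear classifier stands in for the collective-activity model that fuses person-level action vectors with scene and interaction context.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen for illustration only.
N_PERSONS, T, FEAT_DIM = 5, 10, 32
N_ACTIONS, N_ACTIVITIES = 4, 3

# Stand-in weights; in the paper these would be learned networks.
W_action = rng.standard_normal((FEAT_DIM, N_ACTIONS))
W_group = rng.standard_normal((N_ACTIONS + 2 * FEAT_DIM, N_ACTIVITIES))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def person_action_scores(track_features):
    """Stand-in for the LSTM individual-action recognizer:
    temporally pools per-frame features, then classifies."""
    pooled = track_features.mean(axis=0)
    return softmax(pooled @ W_action)

# Stage 1 (assumed already done): person detection + CRF yields one
# per-frame feature sequence of shape (T, FEAT_DIM) per tracked person.
tracks = [rng.standard_normal((T, FEAT_DIM)) for _ in range(N_PERSONS)]

# Stage 2: person-level action category vectors, one per participant.
action_vecs = np.stack([person_action_scores(tr) for tr in tracks])

# Stage 3: fuse the pooled action vectors with scene-context and
# interaction-context features, then classify the collective activity.
scene_ctx = rng.standard_normal(FEAT_DIM)
interaction_ctx = rng.standard_normal(FEAT_DIM)
fused = np.concatenate([action_vecs.mean(axis=0), scene_ctx, interaction_ctx])
activity = int(np.argmax(fused @ W_group))
```

The progressive structure shows up in the data flow: each stage consumes only the outputs of the previous one, so the collective-activity classifier never sees raw pixels, only person-level action scores plus context features.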


Figures 1, 2 and 3



Acknowledgements

This work was supported by the Research Programs of Henan Science and Technology Department (192102210097, 192102210126, 212102210160, 182102210210), the National Natural Science Foundation of China (61806073) and the Open Project Foundation of Information Technology Research Base of Civil Aviation Administration of China (NO. CAAC-ITRB-201607).

Author information


Corresponding author

Correspondence to Xuezhuan Zhao.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Pei, L., Zhao, X., Li, T. et al. A progressive hierarchical analysis model for collective activity recognition. Neural Comput & Applic 34, 12415–12425 (2022). https://doi.org/10.1007/s00521-021-06585-4

