Deep multiple aggregation networks for action recognition

  • Regular Paper
  • Published in the International Journal of Multimedia Information Retrieval

Abstract

Most current action recognition algorithms are based on deep networks that stack multiple convolutional, pooling and fully connected layers. While convolutional and fully connected operations have been widely studied in the literature, the design of pooling operations suited to action recognition, where action categories exhibit different sources of temporal granularity, has received comparatively less attention; existing solutions rely mainly on max or average operations. The latter are clearly unable to fully capture the actual temporal granularity of action categories and thereby constitute a bottleneck in classification performance. In this paper, we introduce a novel hierarchical pooling design that captures different levels of temporal granularity in action recognition. Our design principle is coarse-to-fine and is achieved using a tree-structured network; as we traverse this network top-down, pooling operations become less invariant but temporally more resolute and better localized. Learning the combination of operations in this network that best fits a given ground truth is obtained by solving a constrained minimization problem whose solution corresponds to the distribution of weights capturing the contribution of each level (and thereby each temporal granularity) to the global hierarchical pooling process. Besides being principled and well grounded, the proposed hierarchical pooling is also video-length and resolution agnostic. Extensive experiments conducted on the challenging UCF-101, HMDB-51 and JHMDB-21 databases corroborate all these statements.
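
For intuition, a minimal PyTorch sketch of such a coarse-to-fine pooling tree is given below; it is our illustration, not the authors' released implementation. Level l splits the timeline into 2**l segments, each segment is average-pooled, and the per-level descriptors are combined through a learned weight distribution. The class name, the number of levels, the per-level linear projections and the softmax parametrization of the simplex constraint are all assumptions made for this sketch.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalTemporalPooling(nn.Module):
    """Coarse-to-fine temporal pooling: deeper levels are less invariant
    but temporally more resolute (hypothetical sketch, not the paper's code)."""

    def __init__(self, feat_dim: int = 512, num_levels: int = 4):
        super().__init__()
        self.num_levels = num_levels
        # Unconstrained logits; a softmax keeps the level weights nonnegative
        # and summing to one, mirroring the constrained formulation.
        self.level_logits = nn.Parameter(torch.zeros(num_levels))
        # One projection per level maps the concatenated segment descriptors
        # (feat_dim * 2**l values) back to feat_dim so levels can be summed.
        self.proj = nn.ModuleList(
            [nn.Linear(feat_dim * 2 ** l, feat_dim) for l in range(num_levels)]
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (T, feat_dim) per-frame features from any backbone; adaptive
        # pooling makes the module video-length and resolution agnostic.
        x = frames.t().unsqueeze(0)                    # (1, feat_dim, T)
        weights = F.softmax(self.level_logits, dim=0)  # distribution over levels
        out = x.new_zeros(1, self.proj[0].out_features)
        for level in range(self.num_levels):
            seg = F.adaptive_avg_pool1d(x, 2 ** level)  # (1, feat_dim, 2**level)
            flat = seg.flatten(start_dim=1)             # (1, feat_dim * 2**level)
            out = out + weights[level] * self.proj[level](flat)
        return out.squeeze(0)                           # (feat_dim,)

# Toy usage: pool 120 frames of 512-d features into one 512-d video descriptor.
descriptor = HierarchicalTemporalPooling()(torch.randn(120, 512))

Because training shapes the softmax weights, the relative contribution of each temporal granularity can be read directly off the learned distribution, which is how such a hierarchy can favor coarse invariance for some data and fine localization for others.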



Data availability

Our manuscript has no associated data; all data used in the experiments are publicly available on the web.

Notes

  1. Already available/pretrained on ImageNet to capture the appearance.

  2. Whose complexity scales quadratically w.r.t. the size of training data.

  3. These subclasses of actions are not explicitly defined in a supervised manner; they emerge implicitly, by allowing the multiple instances of temporal pyramids enough flexibility to capture different (unknown) subclasses of action dynamics (see the sketch after these notes).

  4. In order to make training cycles efficient, we only use the skeleton frames.

  5. Training of each lightweight GCN architecture lasts less than an hour on a GeForce GTX 1070 GPU (with 8 GB memory).

  6. As reported in [70], the numbers of parameters in the HAN-2S and HAN architectures are 940k and 530k, respectively.
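
To make note 3 and the weight learning described in the abstract concrete, the short numerical sketch below assumes (our assumption, not the paper's released code) that the distribution of weights over pyramid levels or instances is found by projected gradient descent on a hinge loss, with a Euclidean projection enforcing the simplex constraints; all function names and the toy data are illustrative.

import numpy as np

def project_simplex(v: np.ndarray) -> np.ndarray:
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]                       # sort descending
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u * np.arange(1, v.size + 1) > css)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def learn_level_weights(level_scores: np.ndarray, steps: int = 200,
                        lr: float = 0.1) -> np.ndarray:
    """level_scores: (L, N) correct-class margins, one row per level/instance.
    Minimizes a mean hinge loss over the fused margins, subject to w lying
    on the probability simplex (projected subgradient descent)."""
    L, _ = level_scores.shape
    w = np.full(L, 1.0 / L)                    # start from the uniform distribution
    for _ in range(steps):
        fused = w @ level_scores               # (N,) combined margins
        active = (fused < 1.0).astype(float)   # samples still violating the margin
        grad = -(level_scores * active).mean(axis=1)
        w = project_simplex(w - lr * grad)     # constrained update
    return w

# Toy usage: 4 granularity levels, 32 samples, finer levels slightly stronger.
rng = np.random.default_rng(0)
scores = rng.normal(loc=np.linspace(0.2, 1.0, 4)[:, None], scale=1.0, size=(4, 32))
print(learn_level_weights(scores))             # weights summing to one

The returned weights can then be read as the contribution of each temporal granularity, in the spirit of the distribution described in the abstract.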

References

  1. Bagautdinov T, Alahi A, Fleuret F, Fua P, Savarese S (2017) Social scene understanding: end-to-end multi-person action localization and collective activity recognition. In: IEEE conference on computer vision and pattern recognition (CVPR)

  2. Shao J, Kang K, Loy CC, Wang X (2015) Deeply learned attributes for crowded scene understanding. In: IEEE conference on computer vision and pattern recognition (CVPR)

  3. Pantic M, Pentland A, Nijholt A, Huang TS (2007) Human computing and machine understanding of human behavior: a survey. In: Human computing and machine understanding of human behavior

  4. Jiu M, Sahbi H (2016) Laplacian deep kernel learning for image annotation. In: 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 1551–1555

  5. Mabrouk AB, Zagrouba E (2018) Abnormal behavior recognition for intelligent video surveillance systems: a review. Expert Syst Appl 91:480–491

  6. Han Y, Zhang P, Zhuo T, Huang W, Zhang Y (2018) Going deeper with two-stream ConvNets for action recognition in video surveillance. Pattern Recognit Lett 107:83–90

  7. Wang B, Ma L, Zhang W, Liu W (2018) Reconstruction network for video captioning. In: IEEE conference on computer vision and pattern recognition (CVPR)

  8. Wang J, Jiang W, Ma L, Liu W, Xu Y (2018) Bidirectional attentive fusion with context gating for dense video captioning. In: IEEE conference on computer vision and pattern recognition (CVPR)

  9. Aafaq N, Akhtar N, Liu W, Gilani SZ, Mian A (2019) Spatio-temporal dynamics and semantic attribute enriched visual encoding for video captioning. In: The IEEE conference on computer vision and pattern recognition (CVPR)

  10. Lu M, Li Z-N, Wang Y, Pan G (2019) Deep attention network for egocentric action recognition. IEEE Trans Image Process 28(8):3703

  11. Mahmud T, Billah M, Hasan M, Roy-Chowdhury AK (2019) Captioning near-future activity sequences. arXiv:1908.00943

  12. Laptev I, Perez P (2007) Retrieving actions in movies. In: International conference on computer vision (ICCV)

  13. Ballan L, Bertini M, Del Bimbo A, Seidenari L, Serra G (2011) Event detection and recognition for semantic annotation of video. Multimed Tools Appl 51(1):279–302

  14. Jaimes A, Omura K, Nagamine T, Hirata K (2004) Memory cues for meeting video retrieval. In: CARPE proceedings of the 1st ACM workshop on continuous archival and retrieval of personal experiences, pp 74–85

  15. Duchenne O, Laptev I, Sivic J, Bach F, Ponce J (2009) Automatic annotation of human actions in video. In: International conference on computer vision (ICCV)

  16. Meng H, Pears N, Bailey C (2007) A human action recognition system for embedded computer vision application. In: IEEE conference on computer vision and pattern recognition (CVPR)

  17. Theodoridis T, Agapitos A, Hu H, Lucas SM (2008) Ubiquitous robotics in physical human action recognition: a comparison between dynamic ANNs and GP. In: IEEE international conference on robotics and automation

  18. Demiris Y (2007) Prediction of intent in robotics and multi-agent systems. Cognit Process 8:151. https://doi.org/10.1007/s10339-007-0168-9

  19. Nan M, Ghiţă AS, Gavril A, Trascau M, Sorici A, Cramariuc B, Florea AM (2019) Human action recognition for social robots. In: International conference on control systems and computer science

  20. Pirsiavash H, Ramanan D (2012) Detecting activities of daily living in first-person camera views. In: IEEE conference on computer vision and pattern recognition (CVPR)

  21. Chen L, Duan L, Xu D (2013) Event recognition in videos by learning from heterogeneous web sources. In: IEEE conference on computer vision and pattern recognition (CVPR)

  22. Xu D, Chang S-F (2007) Visual event recognition in news video using kernel methods with multi-level temporal alignment. In: IEEE conference on computer vision and pattern recognition (CVPR)

  23. Wang H, Yuan C, Hu W, Sun C (2012) Supervised class-specific dictionary learning for sparse modeling in action recognition. Pattern Recognit 45(11):3902–3911

  24. Schuldt C, Laptev I, Caputo B (2004) Recognizing human actions: a local SVM approach. In: IEEE international conference on pattern recognition (ICPR)

  25. Wang L, Sahbi H (2014) Bags-of-daglets for action recognition. In: 2014 IEEE international conference on image processing (ICIP). IEEE, pp 1550–1554

  26. Lu W, Little JJ (2006) Simultaneous tracking and action recognition using the PCA-HOG descriptor. In: European conference on computer vision (ECCV)

  27. Horn BKP, Schunck BG (1981) Determining optical flow. Artif Intell 17(1–3):185–203

  28. Csurka G, Dance CR, Fan L, Willamowski J, Bray C (2004) Visual categorization with bags of keypoints. In: European conference on computer vision (ECCV)

  29. Csurka G, Perronnin F (2010) Fisher vectors: beyond bag-of-visual-words image representations. In: International conference on computer vision, imaging and computer graphics

  30. Wang L, Sahbi H (2013) Directed acyclic graph kernels for action recognition. In: Proceedings of the IEEE international conference on computer vision, pp 3168–3175

  31. Wang L, Sahbi H (2015) Nonlinear cross-view sample enrichment for action recognition. In: Computer vision-ECCV 2014 workshops: Zurich, Switzerland, September 6–7 and 12, 2014, Proceedings, Part III 13. Springer, pp 47–62

  32. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: IEEE conference on computer vision and pattern recognition (CVPR)

  33. Deng J, Dong W, Socher R, Li L-J, Li K, Fei-Fei L (2009) ImageNet: a large-scale hierarchical image database. In: IEEE international conference on computer vision and pattern recognition (CVPR)

  34. Graves A, Mohamed A, Hinton G (2013) Speech recognition with deep recurrent neural networks. In: IEEE international conference on acoustics, speech and signal processing (ICASSP)

  35. Hinton G, Deng L, Yu D, Dahl G, Mohamed A, Jaitly N, Senior A, Vanhoucke V, Nguyen P, Sainath T, Kingsbury B (2012) Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Process Mag 29:82–97

  36. He K, Zhang X, Ren S, Sun J (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: IEEE international conference on computer vision (ICCV)

  37. Xiao T, Xu Y, Yang K, Zhang J, Peng Y, Zhang Z (2015) The application of two-level attention models in deep convolutional neural network for fine-grained image classification. In: IEEE conference on computer vision and pattern recognition (CVPR)

  38. Mazari A, Sahbi H (2020) Coarse-to-fine aggregation for cross-granularity action recognition. In: 2020 IEEE international conference on image processing (ICIP). IEEE, pp 1541–1545

  39. Sahbi H, Zhan H (2021) FFNB: forgetting-free neural blocks for deep continual learning. In: The British machine vision conference (BMVC)

  40. Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. In: Neural information processing systems (NeurIPS)

  41. Feichtenhofer C, Pinz A, Zisserman A (2016) Convolutional two-stream network fusion for video action recognition. In: IEEE conference on computer vision and pattern recognition (CVPR)

  42. Carreira J, Zisserman A (2017) Quo vadis, action recognition? A new model and the Kinetics dataset. In: IEEE conference on computer vision and pattern recognition (CVPR)

  43. Choutas V, Weinzaepfel P, Revaud J, Schmid C (2018) PoTion: Pose MoTion representation for action recognition. In: IEEE conference on computer vision and pattern recognition (CVPR)

  44. Feichtenhofer C, Pinz A, Wildes R-P (2016) Spatiotemporal residual networks for video action recognition. In: Neural information processing systems (NeurIPS)

  45. Feichtenhofer C, Pinz A, Wildes R-P (2017) Spatiotemporal multiplier networks for video action recognition. In: IEEE conference on computer vision and pattern recognition (CVPR)

  46. Mazari A, Sahbi H (2019) Deep temporal pyramid design for action recognition. In: IEEE international conference on acoustics, speech and signal processing (ICASSP)

  47. Martin P-E, Benois-Pineau J, Péteri R, Morlier J (2020) Fine grained sport action recognition with twin spatio-temporal convolutional neural networks: application to table tennis. Multimed Tools Appl 79:20429–20447

  48. Ullah A et al (2017) Action recognition in video sequences using deep bi-directional LSTM with CNN features. IEEE Access 6:1155–1166

  49. He K, Zhang X, Ren S, Sun J (2015) Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans Pattern Anal Mach Intell 37(9):1904

  50. Murray N, Perronnin F (2014) Generalized max pooling. In: IEEE conference on computer vision and pattern recognition (CVPR)

  51. Gao Y, Beijbom O, Zhang N, Darrell T (2016) Compact bilinear pooling. In: IEEE conference on computer vision and pattern recognition (CVPR)

  52. Obeso AM, Benois-Pineau J, Vázquez MSG, Acosta AÁR (2019) Forward–backward visual saliency propagation in deep NNs vs internal attentional mechanisms. In: 2019 Ninth international conference on image processing theory, tools and applications (IPTA). IEEE, pp 1–6

  53. Obeso AM, Benois-Pineau J, Vázquez MSG, Acosta AAR (2018) Introduction of explicit visual saliency in training of deep CNNs: application to architectural styles classification. In: 2018 International conference on content-based multimedia indexing (CBMI). IEEE, pp 1–5

  54. Li J, Liu X, Zhang W, Zhang M, Song J, Sebe N (2020) Spatio-temporal attention networks for action recognition and detection. IEEE Trans Multimed 22(11):2990–3001

  55. Piergiovanni AJ, Ryoo MS (2018) Fine-grained activity recognition in baseball videos. In: IEEE conference on computer vision and pattern recognition (CVPR), workshop on computer vision in sports

  56. Soomro K, Zamir AR, Shah M (2012) UCF101: a dataset of 101 human action classes from videos in the wild. Technical report CRCV-TR-12-01

  57. Shahroudy A, Liu J, Ng T, Wang G (2016) NTU RGB+D: a large scale dataset for 3D human activity analysis. In: IEEE conference on computer vision and pattern recognition (CVPR)

  58. Pramono RRA, Chen Y-T, Fang W-H (2019) Hierarchical self-attention network for action localization in videos. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 61–70

  59. Wang Y, Long M, Wang J, Yu PS (2017) Spatiotemporal pyramid network for video action recognition. In: IEEE conference on computer vision and pattern recognition (CVPR)

  60. Zhu J, Zou W, Zhu Z (2018) End-to-end video level representation learning for action recognition. In: International conference on learning representations (ICLR)

  61. Zheng Z, An G, Wu D, Ruan Q (2019) Spatial-temporal pyramid based Convolutional Neural Network for action recognition. Neurocomputing 358:446–455

  62. Zhang D, Dai X, Wang YF (2018) Dynamic temporal pyramid network: a closer look at multi-scale modeling for activity detection. In: Asian conference on computer vision (ACCV)

  63. Yang K, Li R, Qiao P, Wang Q, Li D, Dou Y (2018) Temporal pyramid relation network for video-based gesture recognition. In: IEEE international conference on image processing (ICIP)

  64. Jin S, Cao Z, Song X (2022) IA-FPN: interactive aggregation feature pyramid network for action detection. In: 2022 4th International conference on intelligent control, measurement and signal processing (ICMSP). IEEE, pp 1063–1068

  65. Cai J, Hu J, Li S, Lin J, Wang J (2020) Combination of temporal-channels correlation information and bilinear feature for action recognition. IET Comput Vis 14(8):634–641

  66. Li H, Huang J, Zhou M, Shi Q, Fei Q (2022) Self-attention pooling-based long-term temporal network for action recognition. IEEE Trans Cognit Dev Syst 15(1):65

  67. Kusumoseniarto RH (2020) Two-stream 3D convolution attentional network for action recognition. In: 2020 Joint 9th international conference on informatics, electronics & vision (ICIEV) and 2020 4th international conference on imaging, vision & pattern recognition (icIVPR). IEEE, pp 1–6

  68. Ha M-H, Chen OT-C (2021) Deep neural networks using residual fast–slow refined highway and global atomic spatial attention for action recognition and detection. IEEE Access 9:164887–164902

  69. Pramono RRA, Fang W-H, Chen Y-T (2021) Relational reasoning for group activity recognition via self-attention augmented conditional random field. IEEE Trans Image Process 30:8184–8199

  70. Liu J, Wang Y, Xiang S, Pan C (2021) HAN: an efficient hierarchical self-attention network for skeleton-based gesture recognition. arXiv preprint arXiv:2106.13391

  71. Yihuang J (2017) Pretrained 2D two streams network for action recognition on UCF-101 based on temporal segment network. https://github.com/jeffreyyihuang/two-stream-action-recognition

  72. Wang L, Xiong Y, Wang Z, Qiao Y, Lin D, Tang X, Van Gool L (2016) Temporal segment networks: towards good practices for deep action recognition. In: European conference on computer vision (ECCV)

  73. Shawe-Taylor J, Cristianini N (2004) Kernel methods for pattern analysis. Cambridge University Press, Cambridge

  74. Gönen M, Alpaydın E (2011) Multiple kernel learning algorithms. J Mach Learn Res 12:2211–2268

  75. Kuehne H, Jhuang H, Garrote E, Poggio T, Serre T (2011) HMDB: a large video database for human motion recognition. In: Proceedings of the international conference on computer vision (ICCV)

  76. Tran D, Bourdev L, Fergus R, Torresani L, Paluri M (2015) Learning spatiotemporal features with 3d convolutional networks. In: IEEE international conference on computer vision (ICCV)

  77. Qiu Z, Yao T, Mei T (2017) Learning spatio-temporal representation with pseudo-3d residual networks. In: Proceedings of the IEEE international conference on computer vision, pp 5533–5541

  78. Tran D, Wang H, Torresani L, Ray J, LeCun Y, Paluri M (2018) A closer look at spatiotemporal convolutions for action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6450–6459

  79. Wang L, Li W, Li W, Van Gool L (2018) Appearance-and-relation networks for video classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1430–1439

  80. Zolfaghari M, Singh K, Brox T (2018) Eco: efficient convolutional network for online video understanding. In: Proceedings of the European conference on computer vision (ECCV), pp 695–712

  81. Wu C-Y, Zaheer M, Hu H, Manmatha R, Smola AJ, Krahenbuhl P (2018) Compressed video action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6026–6035

  82. Lin J, Gan C, Han S (2019) TSM: temporal shift module for efficient video understanding. In: Proceedings of the IEEE international conference on computer vision, pp 7083–7093

  83. Li J, Wei P, Zheng N (2021) Nesting spatiotemporal attention networks for action recognition. Neurocomputing 459:338–348

  84. Ohn-Bar E, Trivedi MM (2014) Hand gesture recognition in real time for automotive interfaces: a multimodal vision-based approach and evaluations. IEEE Trans Intell Transp Syst 15(6):2368–2377

  85. Oreifej O, Liu Z (2013) HON4D: histogram of oriented 4D normals for activity recognition from depth sequences. In: CVPR, pp 716–723

  86. Rahmani H, Mian A (2016) 3D Action recognition from novel viewpoints. In: CVPR, pp 1506–1515

  87. Zhu W, Lan C, Xing J, Zeng W, Li Y, Shen L, Xie X (2016) Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks. In: AAAI, vol 2, p 6

  88. Zanfir M, Leordeanu M, Sminchisescu C (2013) The moving pose: an efficient 3D kinematics descriptor for low-latency action recognition and detection. In: ICCV, pp 2752–2759

  89. Vemulapalli R, Arrate F, Chellappa R (2014) Human action recognition by representing 3D skeletons as points in a Lie group. In: IEEE CVPR, pp 588–595

  90. Du Y, Wang W, Wang L (2015) Hierarchical recurrent neural network for skeleton based action recognition. In: IEEE CVPR, pp 1110–1118

  91. Zhang X, Wang Y, Gou M, Sznaier M, Camps O (2016) Efficient temporal sequence comparison and classification using gram matrix embeddings on a Riemannian manifold. In: CVPR, pp 4498–4507

  92. Garcia-Hernando G, Kim T-K (2017) Transition forests: learning discriminative temporal transitions for action recognition. In: CVPR, pp 407–415

  93. Hu J, Zheng W, Lai J, Zhang J (2015) Jointly learning heterogeneous features for RGB-D activity recognition. In: CVPR

  94. Huang Z, Gool LV (2017) A Riemannian network for SPD matrix learning. In: AAAI, pp 2036–2042

  95. Huang Z, Wu J, Gool LV (2018) Building deep networks on Grassmann manifolds. In: AAAI, pp 3279–3286

  96. Garcia-Hernando G, Yuan S, Baek S, Kim TK (2018) First person hand action Benchmark with RGB-D videos and 3D hand pose annotations. In: CVPR

  97. Sahbi H (2021) Learning connectivity with graph convolutional networks. In: 2020 25th International conference on pattern recognition (ICPR). IEEE, pp 9996–10003

  98. Sahbi H (2021) Lightweight connectivity in graph convolutional networks for skeleton-based recognition. In: IEEE international conference on image processing (ICIP), pp 2329–2333

  99. Sahbi H (2022) Topologically-consistent magnitude pruning for very lightweight graph convolutional networks. In: IEEE international conference on image processing (ICIP), pp 3495–3499

  100. Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980

Author information

Corresponding author

Correspondence to Hichem Sahbi.

Ethics declarations

Conflict of interest

No grant was received for conducting this study, and the authors have no competing interests relevant to the content of this article.

Human and/or animal rights

This article does not contain any studies involving human participants or animals performed by any of the authors.

Informed consent

For this type of study, informed consent is not required.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Mazari, A., Sahbi, H. Deep multiple aggregation networks for action recognition. Int J Multimed Info Retr 13, 9 (2024). https://doi.org/10.1007/s13735-023-00317-1

