GSoANet: Group Second-Order Aggregation Network for Video Action Recognition

Wang, Zhenwei; Dong, Wei; Zhang, Bingbing; Zhang, Jianxin; Liu, Xiangdong; Liu, Bin; Zhang, Qiang

doi:10.1007/s11063-023-11270-9

Published: 18 April 2023

Volume 55, pages 7493–7509, (2023)

Neural Processing Letters

Zhenwei Wang^1,2,3,
Wei Dong^1,2,3,
Bingbing Zhang^1,4,
Jianxin Zhang^1,2,3,
Xiangdong Liu^1,
Bin Liu^5 &
Qiang Zhang^6

Abstract

In video action recognition, most existing methods apply global average pooling at the end of the network to aggregate spatio-temporal features into a global video representation. Such pooling is insufficient for modeling complex spatio-temporal feature distributions and for capturing spatio-temporal dynamics. To address this issue, we propose a novel group second-order aggregation network (GSoANet), whose core is a group second-order aggregation module (GSoAM) integrated at the end of the network to aggregate video spatio-temporal features. GSoAM first adopts a grouping strategy to decompose the input features into a group of relatively low-dimensional vectors, so that the video spatio-temporal features are aggregated in the low-dimensional space. It then introduces subspaces represented by codewords; in each subspace, the differences between spatio-temporal features and codewords are aggregated with soft assignments reflecting their proximity. Finally, the nonlinear geometric structure of the fused subspaces is modeled with the iterative matrix square root normalized covariance. In addition, GSoANet adopts the high-performance convolutional network ConvNeXt as its backbone to improve accuracy at a lower computational cost. Extensive experiments on four challenging video datasets demonstrate the effectiveness of the proposed method in aggregating spatio-temporal features, as well as its competitive results.
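The three GSoAM steps named in the abstract (grouping, soft-assigned aggregation of feature-codeword differences, and iterative matrix square root normalization of a covariance) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the softmax over negative squared distances, the way subspaces are fused (a weighted sum of residuals over codewords), the small ridge term, and all sizes are assumptions made for illustration. The square root step uses the coupled Newton-Schulz iteration, which is one common way to compute an iterative matrix square root.

```python
import numpy as np


def soft_assign_residuals(x, codewords, alpha=1.0):
    """Fuse feature-codeword differences with soft assignment (assumed form).

    x: (n, d) low-dimensional vectors; codewords: (k, d).
    Returns (n, d): per-vector residuals, weighted by proximity and summed
    over codewords.
    """
    diff = x[:, None, :] - codewords[None, :, :]          # (n, k, d)
    logits = -alpha * np.sum(diff ** 2, axis=2)           # closer -> larger
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)         # softmax over codewords
    return np.sum(weights[:, :, None] * diff, axis=1)


def newton_schulz_sqrt(cov, num_iter=5):
    """Approximate the square root of an SPD matrix by Newton-Schulz iteration."""
    dim = cov.shape[0]
    trace = np.trace(cov)
    y = cov / trace            # trace pre-normalization so the iteration converges
    z = np.eye(dim)
    for _ in range(num_iter):
        t = 0.5 * (3.0 * np.eye(dim) - z @ y)
        y = y @ t
        z = t @ z
    return np.sqrt(trace) * y  # undo the pre-normalization


def gsoam_sketch(features, codewords, groups, num_iter=5):
    """Toy pipeline: group -> soft-assigned residuals -> normalized covariance."""
    n, d = features.shape
    assert d % groups == 0, "channel dimension must be divisible by groups"
    # Grouping: split each d-dim feature into `groups` vectors of size d // groups.
    sub = features.reshape(n * groups, d // groups)
    fused = soft_assign_residuals(sub, codewords)
    centered = fused - fused.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / fused.shape[0]
    cov += 1e-6 * np.eye(cov.shape[0])   # assumed ridge, keeps cov positive definite
    return newton_schulz_sqrt(cov, num_iter)


rng = np.random.default_rng(0)
features = rng.normal(size=(64, 16))   # 64 spatio-temporal positions, 16 channels (toy sizes)
codewords = rng.normal(size=(3, 4))    # 3 hypothetical codewords in the 4-dim group space
representation = gsoam_sketch(features, codewords, groups=4)
print(representation.shape)            # (4, 4)
```

In this sketch, grouping reduces the covariance from 16×16 to 4×4, which is the kind of dimensionality saving the grouping strategy is meant to provide; in a trained network the codewords would be learned parameters rather than random draws.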

Acknowledgements

This work was partially supported by the National Natural Science Foundation of China under Grants 61972062 and 61902220, the NSFC-Liaoning Province United Foundation under Grant U1908214, and the Young and Middle-aged Talents Program of the National Civil Affairs Commission.

Author information

Authors and Affiliations

School of Computer Science and Engineering, Dalian Minzu University, Dalian, 116600, China
Zhenwei Wang, Wei Dong, Bingbing Zhang, Jianxin Zhang & Xiangdong Liu
Institute of Machine Intelligence and Bio-computing, Dalian Minzu University, Dalian, 116600, China
Zhenwei Wang, Wei Dong & Jianxin Zhang
SEAC Key Laboratory of Big Data Applied Technology, Dalian Minzu University, Dalian, 116600, China
Zhenwei Wang, Wei Dong & Jianxin Zhang
School of Information and Communication Engineering, Dalian University of Technology, Dalian, 116024, China
Bingbing Zhang
DUT-RUISE, Dalian University of Technology, Dalian, 116620, China
Bin Liu
School of Computer Science and Technology, Dalian University of Technology, Dalian, 116024, China
Qiang Zhang

Corresponding authors

Correspondence to Bingbing Zhang or Jianxin Zhang.

Ethics declarations

Conflict of interest

The authors declare no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wang, Z., Dong, W., Zhang, B. et al. GSoANet: Group Second-Order Aggregation Network for Video Action Recognition. Neural Process Lett 55, 7493–7509 (2023). https://doi.org/10.1007/s11063-023-11270-9

Accepted: 24 March 2023
Published: 18 April 2023
Issue Date: December 2023
DOI: https://doi.org/10.1007/s11063-023-11270-9
