Abstract
Temporal action detection, a fundamental yet challenging task in understanding human actions, is usually divided into two stages: temporal action proposal generation and proposal classification. Proposal classification is commonly treated as a standard action recognition task and therefore receives little attention. However, compared with whole-video action classification, classifying action proposals involves larger intra-class variations and subtler inter-class differences, which makes accurate classification harder. In this paper, we propose a novel end-to-end framework called the Deep Hybrid Convolutional Network (DHCNet) to classify action proposals and achieve high-performance temporal action detection. DHCNet improves temporal action detection in three ways. First, Subnet I effectively models the temporal structure of proposals and generates discriminative proposal features. Second, Subnet II exploits graph convolution (GConv) to aggregate information from related proposals, enriching each proposal feature with semantic context. Third, DHCNet adopts a coarse-to-fine cascaded classification scheme that significantly reduces the influence of large intra-class variations and subtle inter-class differences at different granularities. In addition, we design an iterative boundary regression method based on closed-loop feedback to refine the temporal boundaries of proposals. Extensive experiments demonstrate the effectiveness of our approach; DHCNet achieves state-of-the-art performance on the THUMOS’14 dataset (59.9% mAP@0.5).
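The graph-convolution enhancement performed by Subnet II can be illustrated with a minimal sketch. This is an assumption-laden illustration rather than the paper's implementation: the adjacency construction (here a fully connected proposal graph; the paper could, for example, derive it from temporal overlap between proposals) and the `weights` matrix are hypothetical stand-ins.

```python
import numpy as np

def gconv_enhance(proposal_feats, adjacency, weights):
    """One graph-convolution layer: each proposal aggregates features
    from related proposals, then applies a linear map and ReLU."""
    n = adjacency.shape[0]
    a = adjacency + np.eye(n)              # add self-loops
    a = a / a.sum(axis=1, keepdims=True)   # row-normalize the adjacency
    return np.maximum(a @ proposal_feats @ weights, 0.0)

# Toy example: 4 proposals with 8-dim features on a fully connected graph.
rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 8))
adj = np.ones((4, 4)) - np.eye(4)
w = rng.standard_normal((8, 8))
enhanced = gconv_enhance(feats, adj, w)
print(enhanced.shape)
```

Each row of `enhanced` now mixes in features from the other proposals, which is the mechanism by which a proposal can acquire semantic information beyond its own temporal span.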
Ethics declarations
This work is supported by the National Key R&D Program of China under Grant 2020YFB1708500.
Conflict of Interests
The authors declare that they have no conflict of interest.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Gan, MG., Zhang, Y. Improving accuracy of temporal action detection by deep hybrid convolutional network. Multimed Tools Appl 82, 16127–16149 (2023). https://doi.org/10.1007/s11042-022-13962-1