Abstract
Temporal action detection, a fundamental yet challenging task in understanding human actions, is usually divided into two stages: temporal action proposal generation and proposal classification. Proposal classification is commonly treated as a standard action recognition task and therefore receives little attention. However, compared with whole-video action classification, classifying action proposals involves larger intra-class variations and subtler inter-class differences, which makes accurate classification harder. In this paper, we propose a novel end-to-end framework called the Deep Hybrid Convolutional Network (DHCNet) to classify action proposals and achieve high-performance temporal action detection. DHCNet improves temporal action detection in three ways. First, Subnet I effectively models the temporal structure of proposals and generates discriminative proposal features. Second, Subnet II exploits graph convolution (GConv) to aggregate information from related proposals, enriching each proposal feature with semantic context. Third, DHCNet adopts a coarse-to-fine cascaded classification scheme that significantly reduces the influence of large intra-class variations and subtle inter-class differences at different granularities. In addition, we design an iterative boundary regression method based on closed-loop feedback to refine the temporal boundaries of proposals. Extensive experiments demonstrate the effectiveness of our approach; DHCNet achieves state-of-the-art performance on the THUMOS’14 dataset (59.9% mAP@0.5).
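The graph-convolution enhancement performed by Subnet II can be illustrated with a minimal sketch. This is an assumption-laden illustration rather than the paper's implementation: the adjacency construction (here a fully connected proposal graph; the paper could, for example, derive it from temporal overlap between proposals) and the `weights` matrix are hypothetical stand-ins.

```python
import numpy as np

def gconv_enhance(proposal_feats, adjacency, weights):
    """One graph-convolution layer: each proposal aggregates features
    from related proposals, then applies a linear map and ReLU."""
    n = adjacency.shape[0]
    a = adjacency + np.eye(n)              # add self-loops
    a = a / a.sum(axis=1, keepdims=True)   # row-normalize the adjacency
    return np.maximum(a @ proposal_feats @ weights, 0.0)

# Toy example: 4 proposals with 8-dim features on a fully connected graph.
rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 8))
adj = np.ones((4, 4)) - np.eye(4)
w = rng.standard_normal((8, 8))
enhanced = gconv_enhance(feats, adj, w)
print(enhanced.shape)
```

Each row of `enhanced` now mixes in features from the other proposals, which is the mechanism by which a proposal can acquire semantic information beyond its own temporal span.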
Ethics declarations
This work is supported by the National Key R&D Program of China under Grant 2020YFB1708500.
Conflict of Interests
The authors declare that they have no conflict of interest.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Gan, MG., Zhang, Y. Improving accuracy of temporal action detection by deep hybrid convolutional network. Multimed Tools Appl 82, 16127–16149 (2023). https://doi.org/10.1007/s11042-022-13962-1