Improving accuracy of temporal action detection by deep hybrid convolutional network

Multimedia Tools and Applications

Abstract

Temporal action detection, a fundamental yet challenging task in understanding human actions, is usually divided into two stages: temporal action proposal generation and proposal classification. Classifying action proposals is typically treated as an action recognition task and receives little attention. However, compared with action classification, proposal classification exhibits larger intra-class variations and subtler inter-class differences, which make accurate classification harder. In this paper, we propose a novel end-to-end framework called Deep Hybrid Convolutional Network (DHCNet) to classify action proposals and achieve high-performance temporal action detection. DHCNet improves temporal action detection from three aspects. First, DHCNet uses Subnet I to model the temporal structure of proposals effectively and generate discriminative proposal features. Second, Subnet II exploits graph convolution (GConv) to gather information from other proposals, enriching each proposal feature with semantic context. Third, DHCNet adopts coarse-to-fine cascaded classification, which significantly reduces the influence of large intra-class variations and subtle inter-class differences at different granularities. In addition, we design an iterative boundary regression method based on closed-loop feedback to refine the temporal boundaries of proposals. Extensive experiments demonstrate the effectiveness of our approach, and DHCNet achieves state-of-the-art performance on the THUMOS'14 dataset (59.9% mAP@0.5).
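
To make the relation-modeling idea concrete, here is a minimal PyTorch sketch of a single graph-convolution step over proposal features, in the spirit of Subnet II: proposals are linked by temporal-IoU affinities, and each proposal feature is enhanced by aggregating those of its neighbours. The names `temporal_iou` and `ProposalGConv` are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

def temporal_iou(segments):
    """Pairwise temporal IoU between proposals given as an (N, 2) tensor of [start, end]."""
    s, e = segments[:, 0], segments[:, 1]
    inter = (torch.minimum(e[:, None], e[None, :])
             - torch.maximum(s[:, None], s[None, :])).clamp(min=0)
    union = (e - s)[:, None] + (e - s)[None, :] - inter
    return inter / union.clamp(min=1e-6)

class ProposalGConv(nn.Module):
    """One graph-convolution layer over proposal features (illustrative sketch)."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, feats, adj):
        # feats: (N, dim) proposal features; adj: (N, N) affinity matrix.
        # Row-normalise so each proposal averages over its neighbours,
        # then apply a shared linear map with a residual connection.
        adj = adj / adj.sum(dim=1, keepdim=True).clamp(min=1e-6)
        return torch.relu(self.linear(adj @ feats)) + feats

# Usage: enhance 128-d features of five proposals.
segments = torch.tensor([[0., 2.], [1., 3.], [2.5, 4.], [5., 7.], [5.5, 8.]])
feats = torch.randn(5, 128)
enhanced = ProposalGConv(128)(feats, temporal_iou(segments))  # (5, 128)
```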

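Likewise, the closed-loop boundary refinement can be pictured as the short loop below: predict start/end offsets from features of the current segment, apply them, re-extract features, and stop once the predicted correction is negligible. The stopping tolerance and step count are assumed for illustration; the paper's exact feedback criterion may differ.

```python
def refine_boundaries(start, end, extract_feat, regressor, max_steps=3, tol=0.05):
    """Iteratively refine a proposal's temporal boundaries (illustrative sketch).

    start, end: current boundaries as floats (seconds).
    extract_feat(start, end) -> feature vector for the current segment.
    regressor(feat) -> (d_start, d_end) float offsets.
    The loop "closes" when the predicted correction drops below `tol`.
    """
    for _ in range(max_steps):
        feat = extract_feat(start, end)    # re-extract features at current boundaries
        d_start, d_end = regressor(feat)   # predicted offsets act as the feedback signal
        start, end = start + d_start, end + d_end
        if max(abs(d_start), abs(d_end)) < tol:
            break
    return start, end
```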

Author information

Corresponding author

Correspondence to Yan Zhang.

Ethics declarations

This work is supported by the National Key R&D Program of China under Grant 2020YFB1708500.

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Gan, MG., Zhang, Y. Improving accuracy of temporal action detection by deep hybrid convolutional network. Multimed Tools Appl 82, 16127–16149 (2023). https://doi.org/10.1007/s11042-022-13962-1

