
MBGNet: Multi-branch boundary generation network with temporal context aggregation for temporal action detection

Published in Applied Intelligence.

Abstract

Temporal action detection is a fundamental video understanding task that aims to locate the temporal regions where human actions or events may occur and to identify the classes of those actions in untrimmed videos. The main challenge is that videos are untrimmed and usually of widely varying durations. Although existing methods have achieved strong results in recent years, several challenges remain: video context features are not fully exploited, the generated action boundaries are insufficiently accurate, and the relationships between proposals are ignored. To address these issues, this paper proposes a Multi-branch Boundary Generation Network (MBGNet) with temporal context aggregation, which improves temporal action proposal generation by exploiting rich temporal context features and complementary boundary generators. First, we propose a multi-path temporal context feature aggregation (MTCA) module that exploits both local and global contextual temporal features for generating temporal action proposals. Second, to generate accurate action boundaries, we design a multi-branch temporal boundary detector (MBG) that optimises the prediction results by exploiting the complementary relationship between two boundary detectors. In addition, to accurately predict the confidence of densely distributed proposals, we design a proposal relation-aware module (PRAM) that exploits global correlation to model the relationships between proposals. Experiments on the popular ActivityNet1.3, THUMOS14, and HACS datasets demonstrate the effectiveness of the proposed method on temporal action proposal generation: it produces action proposals with high precision and recall. Moreover, combined with existing action classifiers, it also achieves strong performance on temporal action detection. These results demonstrate that the proposed method improves the accuracy of both temporal action proposal generation and detection.
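The abstract does not specify how the proposal relation-aware module models global correlation. The sketch below shows one common way such relation modelling is done, using plain scaled dot-product self-attention over per-proposal features; the function name `proposal_self_attention` and the feature shapes are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def proposal_self_attention(feats):
    """Refine proposal features via scaled dot-product self-attention.

    feats: (N, C) array, one feature vector per candidate proposal.
    Returns an array of the same shape in which each proposal's
    feature becomes a weighted mix of all proposals' features, so
    every confidence prediction can draw on global proposal context.
    """
    n, c = feats.shape
    scores = feats @ feats.T / np.sqrt(c)   # pairwise proposal correlation
    weights = softmax(scores, axis=-1)      # each row sums to 1
    return weights @ feats                  # context-aggregated features

# Toy example: 4 candidate proposals with 8-dim features
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
y = proposal_self_attention(x)
print(y.shape)
```

The design intuition is that overlapping or nearby proposals carry evidence about one another, so letting every proposal attend to all others (rather than only its temporal neighbours) captures the global correlations the abstract refers to.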



Data Availability and access

We evaluate our proposed method on the public datasets ActivityNet and THUMOS. The ActivityNet dataset is available at http://activity-net.org/. The THUMOS dataset is available at https://www.crcv.ucf.edu/THUMOS14/download.html.


Acknowledgements

The research work was supported by the Shaanxi Key Laboratory of Network Data Analysis and Intelligent Processing.

Author information

Contributions

Xiaoying Pan (first author): Supervision, Conceptualization, Writing - Review & Editing, Formal Analysis, Methodology, Project Administration. Ni Juan Zhang: Conceptualization, Methodology, Data Curation, Writing, Validation, Software, Formal Analysis, Visualization. He Wei Xie: Data Curation, Validation. Shou Kun Li: Data Curation. Tong Feng: Visualization.

Corresponding author

Correspondence to Xiaoying Pan.

Ethics declarations

Competing Interests

No potential conflict of interest was reported by the authors.

Ethical and informed consent for data used

We guarantee that the data used are obtained from official sources and are authorized for use; otherwise, we bear the responsibility.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Notations

Below is a table that summarizes and briefly describes the important symbols in this paper:

Table 14 Description of the symbols used in this paper

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Pan, X., Zhang, N., Xie, H. et al. MBGNet: Multi-branch boundary generation network with temporal context aggregation for temporal action detection. Appl Intell 54, 9045–9066 (2024). https://doi.org/10.1007/s10489-024-05664-y
