Abstract
Temporal action detection is a fundamental video understanding task that aims to locate the temporal regions in which human actions or events occur in untrimmed videos and to identify their action classes. The main challenge is that videos are untrimmed and vary widely in duration. Although existing methods have improved in recent years, several problems remain: video context features are not fully utilised, the generated action boundaries are insufficiently accurate, and the relationships between proposals are not considered. To address these issues, this paper proposes a Multi-branch Boundary Generation Network (MBGNet) with temporal context aggregation, which improves temporal action proposal generation by exploiting rich temporal context features and complementary boundary generators. First, we propose a multi-path temporal context feature aggregation (MTCA) module that exploits both local and global temporal context features for proposal generation. Second, to generate accurate action boundaries, we design a multi-branch temporal boundary detector (MBG) that refines the predictions by exploiting the complementary relationship between two boundary detectors. In addition, to accurately predict the confidence of densely distributed proposals, we design a proposal relation-aware module (PRAM) that exploits global correlation to model the relationships between proposals. Experiments on the popular datasets ActivityNet1.3, THUMOS14, and HACS demonstrate the effectiveness of the proposed method on temporal action proposal generation: it generates action proposals with high precision and recall. Moreover, combining the method with existing action classifiers also achieves better performance in temporal action detection.
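Proposal quality of the kind described above is conventionally judged by how well proposals overlap ground-truth action segments in time. The following is a minimal sketch of temporal IoU and of recall at a tIoU threshold. It is not the evaluation code used in this paper: the function names, the segment representation, and the toy data are all assumptions made for illustration.

```python
from typing import List, Tuple

# A segment is (start_time, end_time) in seconds; this representation is an assumption.
Segment = Tuple[float, float]


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two one-dimensional temporal segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_tiou(
    proposals: List[Segment], ground_truth: List[Segment], threshold: float
) -> float:
    """Fraction of ground-truth segments covered by at least one proposal
    whose temporal IoU with that segment reaches the threshold."""
    if not ground_truth:
        return 0.0
    hits = sum(
        1
        for gt in ground_truth
        if any(temporal_iou(p, gt) >= threshold for p in proposals)
    )
    return hits / len(ground_truth)


# Hypothetical toy data: two proposals and two ground-truth actions.
proposals = [(1.0, 5.0), (10.0, 14.0)]
ground_truth = [(2.0, 6.0), (20.0, 25.0)]

recall_loose = recall_at_tiou(proposals, ground_truth, 0.5)   # one of two actions is matched
recall_strict = recall_at_tiou(proposals, ground_truth, 0.7)  # no action is matched
```

In this toy case the first proposal overlaps the first action with a tIoU of 0.6, so it counts as a hit at a threshold of 0.5 but not at 0.7. Tightening the threshold is therefore a direct probe of boundary accuracy.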









Data Availability and access
We evaluate the proposed method on the public datasets ActivityNet and THUMOS. The ActivityNet dataset is available at http://activity-net.org/. The THUMOS dataset is available at https://www.crcv.ucf.edu/THUMOS14/download.html
Acknowledgements
The research work was supported by the Shaanxi Key Laboratory of Network Data Analysis and Intelligent Processing.
Author information
Contributions
Xiao Ying Pan (First Author): Supervision, Conceptualization, Writing - Review & Editing, Formal Analysis, Methodology, Project Administration; Ni Juan Zhang: Conceptualization, Methodology, Data Curation, Writing, Validation, Software, Formal Analysis, Visualization; He Wei Xie: Data Curation, Validation; Shou Kun Li: Data Curation; Tong Feng: Visualization.
Ethics declarations
Competing Interests
No potential conflict of interest was reported by the authors.
Ethical and informed consent for data used
I guarantee that the data used is from the official source and authorized by the official. Otherwise, I will bear the responsibility.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix: Notations
Below is a table that summarizes and briefly describes the important symbols in this paper:
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Pan, X., Zhang, N., Xie, H. et al. MBGNet: Multi-branch boundary generation network with temporal context aggregation for temporal action detection. Appl Intell 54, 9045–9066 (2024). https://doi.org/10.1007/s10489-024-05664-y
DOI: https://doi.org/10.1007/s10489-024-05664-y