Abstract
Weakly supervised temporal action localization (WSTAL) classifies and localizes actions in untrimmed videos using only video-level labels. Many current methods employ feature extractors originally designed for action classification on trimmed videos. Such extractors can introduce redundant information into the localization task, which lowers localization accuracy. To address this limitation, we propose a WSTAL method based on a two-stream context aggregation network (TSCANet), which consists of two main modules: a multistage temporal feature aggregation module (MSTFA) and a feature alignment module (FA). By stacking dilated convolutional layers, MSTFA rapidly expands the receptive field and captures temporal dependencies between distant segments. This allows the model to better aggregate temporal information in the optical flow features and to reduce the redundant information in the original features. To avoid inconsistencies between the enhanced optical flow features and the RGB features, the FA module calibrates the RGB features against the optimized optical flow features through mutual learning. Extensive comparative experiments on the THUMOS14 and ActivityNet datasets show improved localization performance. In particular, at low t-IoU thresholds, TSCANet outperforms many existing WSTAL methods.
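As a rough illustration of why stacking dilated convolutions widens the receptive field quickly, the sketch below compares a plain stack with a dilated one. The kernel size of 3 and the dilation schedule 1, 2, 4, 8 are assumptions chosen for illustration and are not taken from the TSCANet configuration; the helper names `receptive_field` and `dilated_conv1d` are likewise hypothetical.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in segments) of stacked stride-1 dilated 1D convolutions."""
    rf = 1
    for d in dilations:
        # Each layer widens the field by (kernel_size - 1) * dilation segments.
        rf += (kernel_size - 1) * d
    return rf


def dilated_conv1d(x, weights, dilation):
    """Stride-1, zero-padded dilated 1D convolution over a list of floats.

    Assumes an odd kernel length so the output is centred on each position.
    """
    half = (len(weights) - 1) // 2
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(weights):
            idx = t + (j - half) * dilation
            if 0 <= idx < len(x):  # positions outside the sequence count as zero
                acc += w * x[idx]
        out.append(acc)
    return out


# Assumed configuration (not from the paper): kernel size 3, dilation doubling per layer.
dilations = [1, 2, 4, 8]
plain = receptive_field(3, [1, 1, 1, 1])
dilated = receptive_field(3, dilations)

# Empirical check: push a unit impulse through the dilated stack and count
# how many temporal positions it reaches.
signal = [0.0] * 64
signal[32] = 1.0
for d in dilations:
    signal = dilated_conv1d(signal, [1.0, 1.0, 1.0], d)
support = sum(1 for v in signal if v != 0.0)

print(plain, dilated, support)
```

Under these assumptions, four plain layers cover 9 segments while four dilated layers cover 31, and the impulse test reaches the same 31 positions, so the reach grows exponentially with depth rather than linearly.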




Data availability
The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.
Funding
This research was supported by the National Natural Science Foundation of China under Grant Nos. 62202131 and 62372145.
Author information
Contributions
Haiping Zhang, Haixiang Lin, Fuxing Zhou, Dongyang Xu, Dongjing Wang and Xujian Fang contributed to conceptualization, methodology, writing - original draft and visualization; Haiping Zhang, Haixiang Lin and Fuxing Zhou contributed to data curation, software, formal analysis and investigation; Haiping Zhang, Dongjing Wang, Dongjin Yu and Liming Guan contributed to funding acquisition and resources; and Haiping Zhang and Dongjing Wang contributed to project administration, supervision, validation and writing - review & editing.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, H., Lin, H., Wang, D. et al. TSCANet: a two-stream context aggregation network for weakly-supervised temporal action localization. J Supercomput 81, 311 (2025). https://doi.org/10.1007/s11227-024-06810-6
DOI: https://doi.org/10.1007/s11227-024-06810-6