TSCANet: a two-stream context aggregation network for weakly-supervised temporal action localization

The Journal of Supercomputing

Abstract

Weakly supervised temporal action localization (WSTAL) classifies and localizes actions in untrimmed videos using only video-level labels. Many current methods rely on feature extractors originally designed for action classification on trimmed videos. Such extractors can introduce redundant information into the localization task, which lowers localization accuracy. To overcome this limitation, we propose a WSTAL method based on a two-stream context aggregation network (TSCANet), which consists of two main modules: a multistage temporal feature aggregation (MSTFA) module and a feature alignment (FA) module. By stacking dilated convolutional layers, MSTFA rapidly expands the receptive field and captures temporal dependencies between distant segments. This allows the model to better aggregate temporal information in the optical flow features and to reduce redundant information in the original features. To avoid inconsistencies between the enhanced optical flow features and the RGB features, the FA module calibrates the RGB features against the optimized optical flow features through a mutual learning approach. Extensive comparative experiments on the THUMOS14 and ActivityNet datasets show improved localization performance. In particular, at low t-IoU thresholds, TSCANet outperforms many existing WSTAL methods.
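
Two ideas in the abstract can be made concrete with a small sketch: how stacked dilated convolutions widen the temporal receptive field, and how the temporal IoU (t-IoU) between a predicted and a ground-truth segment is computed. The code below is illustrative only. The kernel size of 3, the dilation schedule (1, 2, 4, 8), the all-ones weights and every function name are assumptions chosen for the example, not the configuration or implementation used in TSCANet.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field, in segments, of stacked stride-1 dilated 1D convolutions."""
    rf = 1
    for d in dilations:
        # Each layer adds (kernel_size - 1) * dilation segments of context.
        rf += (kernel_size - 1) * d
    return rf


def dilated_conv1d(x, weights, dilation):
    """Same-length 1D convolution over a list of floats, with zero padding."""
    center = len(weights) // 2
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(weights):
            idx = t + (j - center) * dilation  # taps are spaced `dilation` apart
            if 0 <= idx < len(x):
                acc += w * x[idx]
        out.append(acc)
    return out


def temporal_iou(pred, gt):
    """t-IoU between two (start, end) intervals given in the same time unit."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical schedule: four layers with doubling dilation vs. no dilation.
    print(receptive_field(3, [1, 2, 4, 8]))  # 31
    print(receptive_field(3, [1, 1, 1, 1]))  # 9

    # Push a unit impulse through the stack and count the segments it reaches.
    signal = [0.0] * 41
    signal[20] = 1.0
    for d in [1, 2, 4, 8]:
        signal = dilated_conv1d(signal, [1.0, 1.0, 1.0], d)
    print(sum(1 for v in signal if v != 0.0))  # 31

    # A prediction counts as correct when its t-IoU meets the chosen threshold.
    print(temporal_iou((2.0, 6.0), (4.0, 8.0)))
```

With the same number of layers, the dilated stack covers 31 segments where the undilated stack covers 9, which is the sense in which dilation expands the receptive field rapidly.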

Data availability

The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.

Funding

This research was supported by the National Natural Science Foundation of China under Grant Nos. 62202131 and 62372145.

Author information

Contributions

Haiping Zhang, Haixiang Lin, Fuxing Zhou, Dongyang Xu, Dongjing Wang and Xujian Fang contributed to conceptualization, methodology, writing (original draft) and visualization; Haiping Zhang, Haixiang Lin and Fuxing Zhou contributed to data curation, software, formal analysis and investigation; Haiping Zhang, Dongjing Wang, Dongjin Yu and Liming Guan contributed to funding acquisition and resources; and Haiping Zhang and Dongjing Wang contributed to project administration, supervision, validation and writing (review and editing).

Corresponding authors

Correspondence to Haiping Zhang or Xujian Fang.

Ethics declarations

Competing interests

The authors declare that they have no conflict of interest.

Ethical approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhang, H., Lin, H., Wang, D. et al. TSCANet: a two-stream context aggregation network for weakly-supervised temporal action localization. J Supercomput 81, 311 (2025). https://doi.org/10.1007/s11227-024-06810-6

  • DOI: https://doi.org/10.1007/s11227-024-06810-6
