
Deep cascaded action attention network for weakly-supervised temporal action localization

Published in: Multimedia Tools and Applications

Abstract

Weakly-supervised temporal action localization (W-TAL) aims to locate the boundaries of action instances in an untrimmed video and classify them, a challenging task because only video-level labels are available during training. Existing methods mainly focus on the most discriminative action snippets of a video through top-k multiple instance learning (MIL) and ignore less discriminative action snippets as well as non-action snippets, which limits further gains in localization performance. To mine the less discriminative action snippets and better distinguish the non-action snippets in a video, a novel method based on a deep cascaded action attention network is proposed. In this method, a deep cascaded action attention mechanism models not only the most discriminative action snippets but also several levels of less discriminative action snippets by introducing threshold erasing, which ensures the completeness of action instances. In addition, an entropy loss for non-action is introduced to restrict the activations of non-action snippets across all action categories; these activations are generated by aggregating the bottom-k activation scores along the temporal dimension. Action snippets can thus be better separated from non-action snippets, which makes the localized action instances more accurate and ultimately enables more precise action localization. Extensive experiments on the THUMOS14 and ActivityNet1.3 datasets show that our method outperforms state-of-the-art methods at several t-IoU thresholds.
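The full method is described in the body of the paper, but the abstract already names three concrete mechanisms: top-k MIL aggregation, threshold erasing across cascaded attention stages, and a bottom-k entropy loss for non-action snippets. The minimal PyTorch sketch below illustrates how each could look. It is a hypothetical reading, not the authors' implementation: every name and hyper-parameter (topk_mil_scores, erase_most_discriminative, bottomk_entropy_loss, k, thresh) is an assumption for illustration, and it interprets "restrict the activations of non-action snippets for all action categories" as pushing the class distribution of the bottom-k snippets toward uniform, i.e. maximizing its entropy.

```python
# Hypothetical sketch of the three mechanisms named in the abstract.
# Shapes: cas is a class activation sequence (T snippets x C classes).
import torch
import torch.nn.functional as F


def topk_mil_scores(cas: torch.Tensor, k: int) -> torch.Tensor:
    """Top-k MIL: aggregate a (T, C) class activation sequence into
    video-level class scores by averaging each class's k highest
    temporal activations."""
    topk, _ = torch.topk(cas, k=k, dim=0)   # (k, C)
    return topk.mean(dim=0)                 # (C,)


def erase_most_discriminative(features: torch.Tensor,
                              attention: torch.Tensor,
                              thresh: float) -> torch.Tensor:
    """Threshold erasing: zero out snippets whose attention exceeds
    `thresh`, so the next cascade stage must attend to the remaining,
    less discriminative action snippets."""
    keep = (attention < thresh).float().unsqueeze(-1)  # (T, 1)
    return features * keep                             # (T, D)


def bottomk_entropy_loss(cas: torch.Tensor, k: int) -> torch.Tensor:
    """Entropy loss for non-action (one plausible reading): aggregate
    the bottom-k activations per class, which presumably correspond to
    non-action snippets, and push the resulting class distribution
    toward uniform so no action class fires on background."""
    botk, _ = torch.topk(cas, k=k, dim=0, largest=False)  # (k, C)
    probs = F.softmax(botk.mean(dim=0), dim=0)            # (C,)
    entropy = -(probs * torch.log(probs + 1e-8)).sum()
    return -entropy  # minimizing this maximizes entropy


if __name__ == "__main__":
    T, C, D = 64, 20, 1024                 # snippets, classes, feature dim
    cas = torch.randn(T, C)                # class activation sequence
    print(topk_mil_scores(cas, k=8).shape) # torch.Size([20])
    attn = torch.sigmoid(torch.randn(T))
    feats = torch.randn(T, D)
    erased = erase_most_discriminative(feats, attn, thresh=0.6)
    loss_bg = bottomk_entropy_loss(cas, k=8)
```

Under this reading, a cascaded network would apply erase_most_discriminative between successive attention branches, so each later branch is forced onto the less discriminative action snippets that earlier branches have already erased, which is how the abstract's completeness argument would be realized.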





Acknowledgements

This research was supported in part by the National Natural Science Foundation of China (Grant No. 61672268).

Author information


Corresponding author

Correspondence to Yong-zhao Zhan.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Xia, Hf., Zhan, Yz. Deep cascaded action attention network for weakly-supervised temporal action localization. Multimed Tools Appl 82, 29769–29787 (2023). https://doi.org/10.1007/s11042-023-14670-0



