Abstract
Weakly-supervised temporal action localization (W-TAL) aims to locate the temporal boundaries of action instances in an untrimmed video and classify them, a challenging task because only video-level labels are available during training. Existing methods mainly focus on the most discriminative action snippets of a video via top-k multiple instance learning (MIL), ignoring both less discriminative action snippets and non-action snippets, which limits the achievable localization performance. To better mine the less discriminative action snippets and distinguish the non-action snippets in a video, a novel method based on a deep cascaded action attention network is proposed. The deep cascaded action attention mechanism models not only the most discriminative action snippets but also different levels of less discriminative action snippets by introducing threshold erasing, which ensures the completeness of action instances. In addition, an entropy loss for non-action is introduced to restrict the activations of non-action snippets over all action categories, where these activations are generated by aggregating the bottom-k activation scores along the temporal dimension. This separates action snippets from non-action snippets more effectively and makes the detected action instances more accurate, ultimately facilitating more precise action localization. Extensive experiments on the THUMOS14 and ActivityNet1.3 datasets show that our method outperforms state-of-the-art methods at several t-IoU thresholds.
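The two aggregation ideas in the abstract (top-k MIL pooling for video-level classification, and an entropy loss over bottom-k aggregated scores for non-action snippets, combined with threshold erasing across cascade stages) can be illustrated with a minimal, library-free sketch. The function names, the uniform-entropy formulation of the non-action loss, and the specific erasing rule are assumptions for illustration, not the paper's exact implementation.

```python
import math

def topk_mean(scores, k):
    """Top-k MIL pooling: average the k largest snippet scores of a class."""
    return sum(sorted(scores, reverse=True)[:k]) / k

def bottomk_mean(scores, k):
    """Average the k smallest snippet scores (presumed non-action snippets)."""
    return sum(sorted(scores)[:k]) / k

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def entropy_loss_non_action(cas, k):
    """Entropy loss for non-action: push the class distribution obtained by
    bottom-k temporal aggregation toward uniform (i.e. maximize entropy),
    so non-action snippets activate no action category strongly.
    cas: T x C list of per-snippet class activation scores."""
    num_classes = len(cas[0])
    agg = [bottomk_mean([cas[t][c] for t in range(len(cas))], k)
           for c in range(num_classes)]
    p = softmax(agg)
    entropy = -sum(pi * math.log(pi + 1e-12) for pi in p)
    return math.log(num_classes) - entropy  # zero when p is uniform

def threshold_erase(attention, tau):
    """Threshold erasing: zero out snippets whose attention exceeds tau so
    the next cascade stage must attend to less discriminative snippets."""
    return [0.0 if a > tau else a for a in attention]
```

In a cascaded setup, each stage would compute attention, contribute its top-k classification loss, and then pass the erased attention to the next stage, so that successively less discriminative action snippets are mined.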
Acknowledgements
This research was supported in part by the National Natural Science Foundation of China (Grant No. 61672268).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Xia, Hf., Zhan, Yz. Deep cascaded action attention network for weakly-supervised temporal action localization. Multimed Tools Appl 82, 29769–29787 (2023). https://doi.org/10.1007/s11042-023-14670-0