Abstract
Weakly-supervised temporal action localization (W-TAL) aims to locate the temporal boundaries of action instances in an untrimmed video and classify them, a challenging task because only video-level labels are available during training. Existing methods mainly focus on the most discriminative action snippets of a video via top-k multiple instance learning (MIL), ignoring both less discriminative action snippets and non-action snippets, which limits the achievable localization performance. To better mine the less discriminative action snippets and distinguish the non-action snippets in a video, a novel method based on a deep cascaded action attention network is proposed. The deep cascaded action attention mechanism models not only the most discriminative action snippets but also different levels of less discriminative action snippets by introducing threshold erasing, which ensures the completeness of action instances. In addition, an entropy loss for non-action is introduced to restrict the activations of non-action snippets over all action categories, where these activations are generated by aggregating the bottom-k activation scores along the temporal dimension. This separates action snippets from non-action snippets more effectively and makes the detected action instances more accurate, ultimately facilitating more precise action localization. Extensive experiments on the THUMOS14 and ActivityNet1.3 datasets show that our method outperforms state-of-the-art methods at several t-IoU thresholds.
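The two aggregation ideas in the abstract (top-k MIL pooling for video-level classification, and an entropy loss over bottom-k aggregated scores for non-action snippets, combined with threshold erasing across cascade stages) can be illustrated with a minimal, library-free sketch. The function names, the uniform-entropy formulation of the non-action loss, and the specific erasing rule are assumptions for illustration, not the paper's exact implementation.

```python
import math

def topk_mean(scores, k):
    """Top-k MIL pooling: average the k largest snippet scores of a class."""
    return sum(sorted(scores, reverse=True)[:k]) / k

def bottomk_mean(scores, k):
    """Average the k smallest snippet scores (presumed non-action snippets)."""
    return sum(sorted(scores)[:k]) / k

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def entropy_loss_non_action(cas, k):
    """Entropy loss for non-action: push the class distribution obtained by
    bottom-k temporal aggregation toward uniform (i.e. maximize entropy),
    so non-action snippets activate no action category strongly.
    cas: T x C list of per-snippet class activation scores."""
    num_classes = len(cas[0])
    agg = [bottomk_mean([cas[t][c] for t in range(len(cas))], k)
           for c in range(num_classes)]
    p = softmax(agg)
    entropy = -sum(pi * math.log(pi + 1e-12) for pi in p)
    return math.log(num_classes) - entropy  # zero when p is uniform

def threshold_erase(attention, tau):
    """Threshold erasing: zero out snippets whose attention exceeds tau so
    the next cascade stage must attend to less discriminative snippets."""
    return [0.0 if a > tau else a for a in attention]
```

In a cascaded setup, each stage would compute attention, contribute its top-k classification loss, and then pass the erased attention to the next stage, so that successively less discriminative action snippets are mined.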
Acknowledgements
This research was supported in part by the National Natural Science Foundation of China (Grant No. 61672268).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Xia, Hf., Zhan, Yz. Deep cascaded action attention network for weakly-supervised temporal action localization. Multimed Tools Appl 82, 29769–29787 (2023). https://doi.org/10.1007/s11042-023-14670-0