Abstract
Weakly supervised temporal action localization is a practical yet challenging task. Despite considerable progress in recent years, existing methods still have limited capacity to deal with the challenges of over-localization, joint-localization, and under-localization. Based on our investigation, the first two challenges arise from an insufficient ability to suppress background responses, while the third is due to a failure to discover complete action frames. To better address these challenges, we first propose the astute background response strategy. By enforcing the classification target of the background category to be zero, this strategy creates a conductive effect between video-level classification and frame-level classification, guiding each action category to suppress its responses at background frames and thereby helping to address the over-localization and joint-localization challenges. To alleviate the under-localization challenge, we introduce the self-distillation learning strategy, which simultaneously trains one master network and multiple auxiliary networks; the auxiliary networks enhance the master network's ability to discover complete action frames. Experimental results on three benchmarks demonstrate the favorable performance of the proposed method against previous counterparts and its efficacy in tackling the three challenges.
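To make the two strategies concrete, below is a minimal PyTorch sketch of how the abstract's ideas could be wired together. All module names, tensor shapes, pooling choices, and loss weights here are illustrative assumptions rather than the authors' implementation, and the self-distillation term is simplified to a one-directional match against the averaged auxiliary predictions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical frame-level classifier: maps per-frame features to
# C action classes plus one extra background class (index C).
class FrameClassifier(nn.Module):
    def __init__(self, feat_dim=2048, num_actions=20):
        super().__init__()
        self.fc = nn.Conv1d(feat_dim, num_actions + 1, kernel_size=1)

    def forward(self, x):        # x: (B, feat_dim, T)
        return self.fc(x)        # (B, C+1, T) frame-level logits

def video_loss_with_abr(cas, labels):
    """Astute background response, as described in the abstract:
    aggregate frame-level scores into video-level scores, then force
    the background category's video-level target to zero so that
    action classes learn to suppress responses on background frames.
    Mean pooling is an assumption; top-k pooling is another option."""
    video_scores = torch.sigmoid(cas.mean(dim=2))        # (B, C+1)
    bg_target = torch.zeros_like(video_scores[:, -1:])   # background target = 0
    targets = torch.cat([labels, bg_target], dim=1)      # (B, C+1)
    return F.binary_cross_entropy(video_scores, targets)

def self_distillation_loss(master_cas, aux_cas_list):
    """Simplified self-distillation: the master network matches the
    averaged (detached) auxiliary predictions so that it is pushed
    toward the action frames the auxiliaries discover. The paper's
    exact formulation may differ."""
    aux_mean = torch.stack(aux_cas_list).mean(dim=0).detach()
    return F.mse_loss(torch.softmax(master_cas, dim=1),
                      torch.softmax(aux_mean, dim=1))

# Usage with random data: 2 videos, 2048-d features, 50 frames, 20 classes.
feats = torch.randn(2, 2048, 50)
labels = torch.zeros(2, 20)
labels[0, 3] = labels[1, 7] = 1.0
master = FrameClassifier()
auxiliaries = [FrameClassifier() for _ in range(2)]
master_cas = master(feats)
loss = video_loss_with_abr(master_cas, labels) \
     + 0.5 * self_distillation_loss(master_cas, [a(feats) for a in auxiliaries])
loss.backward()
```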
Additional information
Communicated by Dong Xu.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This work was supported by the National Natural Science Foundation of China under Grants 61876140 and U1801265, the Key-Area Research and Development Program of Guangdong Province (2019B010110001), and the Research Funds for Interdisciplinary Subjects, NWPU.
Cite this article
Zhao, T., Han, J., Yang, L. et al. SODA: Weakly Supervised Temporal Action Localization Based on Astute Background Response and Self-Distillation Learning. Int J Comput Vis 129, 2474–2498 (2021). https://doi.org/10.1007/s11263-021-01473-9