DOI: 10.1145/3664647.3681571

Efficient Dual-Confounding Eliminating for Weakly-supervised Temporal Action Localization

Published: 28 October 2024

Abstract

Weakly-supervised Temporal Action Localization (WTAL) following a localization-by-classification paradigm has achieved significant results, yet it still grapples with confounding arising from ambiguous snippets. Previous works have attempted to distinguish these ambiguous snippets from action snippets without investigating the underlying causes of their formation, and thus fail to effectively eliminate the bias on both action-context and action-content. In this paper, we revisit WTAL from the perspective of a structural causal model to identify the true origins of confounding, and propose an efficient dual-confounding eliminating framework to alleviate these biases. Specifically, we construct a Substituted Confounder Set (SCS) to eliminate the confounding bias on action-context by leveraging the modal disparity between the RGB and optical-flow (FLOW) streams. Then, a Multi-level Consistency Mining (MCM) method is designed to mitigate the confounding bias on action-content by utilizing the consistency between discriminative snippets and their corresponding proposals at both the feature and label levels. Notably, SCS and MCM can be seamlessly integrated into any two-stream model without additional parameters via the Expectation-Maximization (EM) algorithm. Extensive experiments on two challenging benchmarks, THUMOS14 and ActivityNet-1.2, demonstrate the superior performance of our method.
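
For readers unfamiliar with the localization-by-classification paradigm mentioned in the abstract, the following is a minimal generic sketch, not the authors' SCS or MCM method. It assumes a per-snippet score sequence for one action class (often called a class activation sequence), pools the top-k scores into a video-level score that can be trained with video-level labels only, and thresholds the sequence to obtain temporal proposals. The function names, the example scores, and the values of `k` and `thresh` are all hypothetical.

```python
from typing import List, Tuple


def video_level_score(cas: List[float], k: int) -> float:
    """Top-k mean pooling: average the k highest snippet scores for one class."""
    top = sorted(cas, reverse=True)[:k]
    return sum(top) / len(top)


def threshold_proposals(cas: List[float], thresh: float) -> List[Tuple[int, int]]:
    """Group consecutive snippets scoring above `thresh` into (start, end) proposals.

    `end` is exclusive, so (2, 5) covers snippets 2, 3 and 4.
    """
    proposals = []
    start = None
    for t, score in enumerate(cas):
        if score > thresh and start is None:
            start = t  # a new above-threshold run begins
        elif score <= thresh and start is not None:
            proposals.append((start, t))  # the run ended at snippet t - 1
            start = None
    if start is not None:
        proposals.append((start, len(cas)))  # the run reaches the end of the video
    return proposals


# Hypothetical per-snippet scores for one action class in one video.
cas = [0.1, 0.2, 0.9, 0.8, 0.7, 0.3, 0.1, 0.6, 0.9, 0.2]
print(video_level_score(cas, k=3))
print(threshold_proposals(cas, thresh=0.5))
```

In a pipeline of this kind, a context snippet that happens to score highly is merged into a proposal in the same way as a true action snippet, which is one way to picture the ambiguous-snippet problem the abstract describes.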

Cited By

  • (2025) Summarized knowledge guidance for single-frame temporal action localization. Pattern Recognition Letters 191, 31-36. DOI: 10.1016/j.patrec.2025.02.027. Online publication date: May 2025.

    Published In

    MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
    October 2024
    11719 pages
    ISBN: 9798400706868
    DOI: 10.1145/3664647

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. consistency mining
    2. structural causal model
    3. substituted confounder set
    4. temporal action localization
    5. weakly-supervised

    Qualifiers

    • Research-article

    Conference

    MM '24
    MM '24: The 32nd ACM International Conference on Multimedia
    October 28 - November 1, 2024
    Melbourne VIC, Australia

    Acceptance Rates

    MM '24 Paper Acceptance Rate 1,150 of 4,385 submissions, 26%;
    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
