Research Article
DOI: 10.1145/3581783.3613822

Deconfounded Multimodal Learning for Spatio-temporal Video Grounding

Published: 27 October 2023

Abstract

Spatio-temporal video grounding aims to identify the spatial and temporal regions of a video that correspond to the objects or actions described in a given textual query. Current models, however, often rely heavily on spatio-temporal priors to make predictions; as a result, they may suffer from spurious correlations and generalize poorly to new or diverse scenarios. To overcome this limitation, we introduce a deconfounded multimodal learning framework that uses a structural causal model to treat dataset biases as a confounder and remove their confounding effect. Within this framework, we perform causal intervention on the multimodal input and derive an unbiased estimation formula via do-calculus. To handle confounders that are diverse and often unobservable, we further propose a novel retrieval-based approach with a causal mask mechanism, which leverages analogical reasoning to facilitate deconfounded learning and mitigate dataset biases, enabling unbiased spatio-temporal prediction without explicitly modeling the confounding factors. Extensive experiments on two challenging benchmarks verify the effectiveness and rationality of the proposed solution.
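For orientation (the paper's exact formula is not reproduced on this page, and its instantiation may differ): when dataset bias is modeled as a confounder Z acting on both the multimodal input X and the prediction Y, applying do-calculus to such a structural causal model canonically yields the backdoor-adjustment estimator:

```latex
% Backdoor adjustment: the standard result of do-calculus on an SCM in
% which a confounder Z influences both the input X and the prediction Y.
P\bigl(Y \mid \mathrm{do}(X)\bigr) \;=\; \sum_{z} P\bigl(Y \mid X,\, Z = z\bigr)\, P\bigl(Z = z\bigr)
```

Intervening with do(X) severs the Z→X edge, so each confounder stratum z is weighted by its prior P(Z = z) rather than the biased posterior P(z | X).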
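Because the confounders are diverse and often unobservable, the sum over z cannot be evaluated directly. Below is a minimal PyTorch sketch of the retrieval-plus-causal-mask idea in the spirit of the abstract; it is not the authors' implementation, and every name in it (retrieve_analogues, deconfounded_fusion, the memory bank) is a hypothetical stand-in. Retrieved analogous training features act as surrogate confounder strata, and a mask restricts attention to them so the fused feature approximates the intervened estimate instead of the biased conditional.

```python
# Hypothetical sketch: approximate P(Y|do(X)) = sum_z P(Y|X,z) P(z) by
# (i) retrieving analogous training features as surrogate strata z and
# (ii) masking attention so the query fuses only with those analogues.
import torch
import torch.nn.functional as F


def retrieve_analogues(query: torch.Tensor,
                       memory: torch.Tensor,
                       k: int = 4) -> torch.Tensor:
    """Return indices of the k memory features most similar to the query."""
    sims = F.cosine_similarity(query.unsqueeze(0), memory, dim=-1)  # (N,)
    return sims.topk(k).indices                                     # (k,)


def deconfounded_fusion(query: torch.Tensor,
                        memory: torch.Tensor,
                        k: int = 4) -> torch.Tensor:
    """Fuse the query with retrieved analogues under a causal mask.

    Non-retrieved entries get -inf scores and hence zero attention
    weight, so only the surrogate strata contribute to the adjustment.
    """
    idx = retrieve_analogues(query, memory, k)
    scores = memory @ query / query.shape[-1] ** 0.5   # scaled dot product, (N,)
    mask = torch.full_like(scores, float("-inf"))
    mask[idx] = 0.0                                    # causal mask: keep analogues only
    weights = torch.softmax(scores + mask, dim=-1)     # zero weight off the retrieved set
    return query + weights @ memory                    # residual deconfounded feature


if __name__ == "__main__":
    torch.manual_seed(0)
    bank = torch.randn(128, 256)   # hypothetical memory bank of training features
    x = torch.randn(256)           # fused multimodal feature for one sample
    print(deconfounded_fusion(x, bank).shape)  # torch.Size([256])
```

The mask is the key design choice in this sketch: it replaces an explicit enumeration of confounder values with a data-driven set of analogues, which is one plausible way to realize "deconfounded learning without explicitly modeling the confounding factors."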




Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 27 October 2023


Author Tags

  1. causal inference
  2. cross-modal retrieval
  3. deconfounded multimodal learning
  4. spatio-temporal video grounding

Qualifiers

  • Research-article


Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa, ON, Canada

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Bibliometrics & Citations

Article Metrics

  • Downloads (last 12 months): 147
  • Downloads (last 6 weeks): 6
Reflects downloads up to 15 Feb 2025

Cited By
  • (2024) Hierarchical Debiasing and Noisy Correction for Cross-domain Video Tube Retrieval. Proceedings of the 32nd ACM International Conference on Multimedia, 9271-9280. DOI: 10.1145/3664647.3681632. Online publication date: 28-Oct-2024.
  • (2024) Causal-driven Large Language Models with Faithful Reasoning for Knowledge Question Answering. Proceedings of the 32nd ACM International Conference on Multimedia, 4331-4340. DOI: 10.1145/3664647.3681263. Online publication date: 28-Oct-2024.
  • (2024) Hierarchical Perceptual and Predictive Analogy-Inference Network for Abstract Visual Reasoning. Proceedings of the 32nd ACM International Conference on Multimedia, 4841-4850. DOI: 10.1145/3664647.3681246. Online publication date: 28-Oct-2024.
  • (2024) AutoM3L: An Automated Multimodal Machine Learning Framework with Large Language Models. Proceedings of the 32nd ACM International Conference on Multimedia, 8586-8594. DOI: 10.1145/3664647.3680665. Online publication date: 28-Oct-2024.
  • (2024) Semantic Codebook Learning for Dynamic Recommendation Models. Proceedings of the 32nd ACM International Conference on Multimedia, 9611-9620. DOI: 10.1145/3664647.3680574. Online publication date: 28-Oct-2024.
