Abstract
Video reasoning typically operates within the Video Question-Answering (VQA) paradigm, which requires models to understand and reason about video content from temporal and causal perspectives. Traditional supervised VQA methods gain this capability from meticulously annotated QA datasets, while advanced vision-language models achieve remarkable performance thanks to large-scale visual-text pretraining. Nevertheless, due to potential language bias and spurious visual-text correlations in cross-modal learning, concerns about the reliability of their answers persist in real-world applications. In this paper, we focus on the grounded VQA task, which requires models to provide answers together with explicit visual evidence, i.e., specific video segments. As temporal annotations are not available during training, we propose a novel bi-directional reasoning framework that performs grounded VQA in a weakly-supervised setting. Specifically, our framework consists of two parallel yet dual reasoning paths that perform temporal grounding and answering over the video content, approaching it from two directions that are symmetric in temporal order or causal relationship. By enforcing cycle consistency between these two branches, the model provides self-guided supervision for both temporal grounding and answering. Experiments on the NExT-GQA and Env-QA datasets demonstrate that our framework achieves superior grounded-VQA performance and provides reasonable temporal locations that substantiate its answers.
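To make the bi-directional cycle-consistency idea in the abstract concrete, the following is a minimal, hypothetical PyTorch sketch of two dual reasoning branches that each predict a soft temporal grounding and an answer, trained only with answer labels plus a consistency term between the branches. All module names, tensor shapes, the choice of temporal reversal as the "dual" direction, and the exact loss form are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of weakly-supervised grounded VQA with bi-directional
# cycle consistency (assumed design; not the paper's official code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualBranch(nn.Module):
    """One reasoning path: attends over frame features given the question,
    producing (i) a temporal attention (soft grounding) and (ii) answer logits."""

    def __init__(self, dim: int = 256, num_answers: int = 5):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.cls = nn.Linear(dim, num_answers)

    def forward(self, frames: torch.Tensor, question: torch.Tensor):
        # frames: (B, T, D), question: (B, D)
        q = self.q_proj(question).unsqueeze(1)                             # (B, 1, D)
        v = self.v_proj(frames)                                            # (B, T, D)
        att = torch.softmax((q * v).sum(-1) / v.size(-1) ** 0.5, dim=-1)   # (B, T)
        pooled = torch.einsum("bt,btd->bd", att, frames)                   # grounded summary
        return att, self.cls(pooled)


def cycle_consistency_loss(att_fwd: torch.Tensor, att_bwd: torch.Tensor) -> torch.Tensor:
    """Encourage the two dual branches (forward vs. reversed temporal order)
    to ground the answer on the same video segment."""
    att_bwd_aligned = att_bwd.flip(-1)  # map the reversed branch back to forward time
    return F.kl_div(att_fwd.clamp_min(1e-8).log(), att_bwd_aligned, reduction="batchmean")


# Toy weakly-supervised training step: only answer labels are available;
# temporal grounding is supervised indirectly through the consistency term.
B, T, D, A = 2, 16, 256, 5
frames, question = torch.randn(B, T, D), torch.randn(B, D)
answers = torch.randint(0, A, (B,))

branch_fwd, branch_bwd = DualBranch(D, A), DualBranch(D, A)
att_f, logits_f = branch_fwd(frames, question)
att_b, logits_b = branch_bwd(frames.flip(1), question)   # dual branch sees reversed order

loss = (F.cross_entropy(logits_f, answers)
        + F.cross_entropy(logits_b, answers)
        + cycle_consistency_loss(att_f, att_b))
loss.backward()
```

In this sketch, the only labels are the answers; the cycle-consistency term is what couples the two branches' temporal attentions so that grounding emerges without segment annotations, mirroring the self-guided supervision described above.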
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (No. 62325109, U21B2013) and the Lenovo Academic Collaboration Project.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Liu, H., Ma, X., Zhong, C., Zhang, Y., Lin, W. (2025). TimeCraft: Navigate Weakly-Supervised Temporal Grounded Video Question Answering via Bi-directional Reasoning. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15063. Springer, Cham. https://doi.org/10.1007/978-3-031-72652-1_6
DOI: https://doi.org/10.1007/978-3-031-72652-1_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72651-4
Online ISBN: 978-3-031-72652-1
eBook Packages: Computer Science (R0)