Cascade context-oriented spatio-temporal attention network for efficient and fine-grained video-grounded dialogues

  • Research Article
  • Published in Frontiers of Computer Science

Abstract

The Video-Grounded Dialogue System (VGDS), which aims to generate reasonable responses based on multi-turn dialogue contexts and a given video, has received intensive attention recently. The key to building a superior VGDS lies in efficiently reasoning over visual and textual concepts of various granularities and achieving comprehensive visual-textual multi-modal alignment. Despite remarkable research progress, existing studies struggle to identify context-relevant video parts because they disregard the impact of redundant information in long-form, content-dynamic videos. Further, current methods usually align all semantics across modalities uniformly with a one-time cross-attention scheme, which neglects the sophisticated correspondence between visual and textual concepts of different granularities (e.g., still objects with nouns, dynamic events with verbs). To this end, we propose a novel system, the Cascade cOntext-oriented Spatio-Temporal Attention Network (COSTA), to generate reasonable responses efficiently and accurately. Specifically, COSTA first adopts a cascade attention network to localize only the most relevant video clips and regions in a coarse-to-fine manner, effectively filtering out irrelevant visual semantics. Second, we design a memory distillation-inspired iterative visual-textual cross-attention strategy that progressively integrates visual semantics with dialogue contexts across granularities, facilitating extensive multi-modal alignment. Experiments on several benchmarks demonstrate that our model significantly outperforms state-of-the-art methods across various metrics.
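To make the two mechanisms summarized above more concrete, the following is a minimal, illustrative PyTorch sketch of (i) coarse-to-fine clip-then-region selection conditioned on the dialogue context and (ii) an iterative visual-textual cross-attention loop. All module names, tensor shapes, top-k settings, and the element-wise context fusion used for scoring are assumptions made for illustration; they do not reproduce COSTA's actual architecture or hyperparameters.

```python
# Illustrative sketch only; names, shapes, and scoring are hypothetical, not COSTA's code.
import torch
import torch.nn as nn


class CascadeSelector(nn.Module):
    """Coarse-to-fine selection: score whole clips against the dialogue
    context first, then score regions only inside the retained clips."""

    def __init__(self, dim: int):
        super().__init__()
        self.clip_scorer = nn.Linear(dim, 1)
        self.region_scorer = nn.Linear(dim, 1)

    def forward(self, clip_feats, region_feats, ctx, top_clips=4, top_regions=8):
        # clip_feats:   (num_clips, dim) pooled per-clip visual features
        # region_feats: (num_clips, num_regions, dim) region features per clip
        # ctx:          (dim,) pooled multi-turn dialogue-context embedding
        clip_scores = self.clip_scorer(clip_feats * ctx).squeeze(-1)
        keep = clip_scores.topk(min(top_clips, clip_feats.size(0))).indices
        fine = region_feats[keep]                      # regions of kept clips only
        region_scores = self.region_scorer(fine * ctx).squeeze(-1)
        flat = fine.reshape(-1, fine.size(-1))
        idx = region_scores.reshape(-1).topk(min(top_regions, flat.size(0))).indices
        return flat[idx]                               # (top_regions, dim)


class IterativeCrossAttention(nn.Module):
    """Refine dialogue tokens by attending to the retained visual tokens
    over several rounds instead of a single one-time cross-attention."""

    def __init__(self, dim: int, rounds: int = 3, heads: int = 4):
        super().__init__()
        self.rounds = rounds
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens, visual_tokens):
        # text_tokens: (1, num_text, dim), visual_tokens: (1, num_vis, dim)
        out = text_tokens
        for _ in range(self.rounds):
            fused, _ = self.attn(out, visual_tokens, visual_tokens)
            out = self.norm(out + fused)               # residual refinement per round
        return out


if __name__ == "__main__":
    dim = 256
    selector, fuser = CascadeSelector(dim), IterativeCrossAttention(dim)
    clips, regions = torch.randn(10, dim), torch.randn(10, 16, dim)
    ctx, text = torch.randn(dim), torch.randn(1, 20, dim)
    visual = selector(clips, regions, ctx).unsqueeze(0)   # (1, 8, 256)
    print(fuser(text, visual).shape)                      # torch.Size([1, 20, 256])
```

In this reading of the abstract, the coarse stage discards entire clips before any region-level computation is performed, which is what makes the cascade cheaper than attending over all regions of all clips, and the iterative fuser refines the dialogue representation over several rounds rather than in a single cross-attention pass.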



Acknowledgements

This work was partially supported by the National Science Fund for Distinguished Young Scholars (62025205), the National Natural Science Foundation of China (Grant No. 62032020), and the Innovation Foundation for Doctor Dissertation of Northwestern Polytechnical University.

Author information


Corresponding author

Correspondence to Bin Guo.

Ethics declarations

Competing interests: Bin Guo is an Editorial Board member of the journal and a co-author of this article. To minimize bias, he was excluded from all editorial decision-making related to the acceptance of this article for publication. The remaining authors declare no conflict of interest.

Additional information

Hao Wang received his BE degree in computer science and technology from Northwestern Polytechnical University (NWPU), China in 2019. He is currently working toward a PhD degree at NWPU. His current research interests include natural language processing, dialog systems, and large language models.

Bin Guo is a professor and PhD supervisor at Northwestern Polytechnical University (NWPU), China. He is a senior member of the China Computer Federation. His main research interests include ubiquitous computing, social and community intelligence, urban big data mining, mobile crowdsensing, and human-computer interaction.

Mengqi Chen received her master’s degree in digital textiles from Xi’an Polytechnical University (XPU), China in 2022. She is currently working toward a PhD degree at Northwestern Polytechnical University (NWPU), China. Her current research interests include natural language processing, dialog systems, and large language models.

Qiuyun Zhang received her BE degree in computer science and technology from Northwestern Polytechnical University (NWPU), China in 2019. She is currently working toward a PhD degree at NWPU, China. Her current research interests include multimedia intelligence and computational aesthetics.

Yasan Ding received his BE degree in computer science and technology from Northwestern Polytechnical University (NWPU), China in 2018. He is currently working toward a PhD degree at NWPU, China. His current research interests include fake news detection and natural language processing.

Ying Zhang received her PhD degree in computer science from the National University of Singapore, Singapore in 2014 and her BE degree in computer science from NWPU, China in 2009. She is currently a Professor in the School of Computer Science at NWPU, China. Her research interests include deep learning, multimedia intelligence, and edge intelligence.

Zhiwen Yu is a professor and PhD supervisor at Northwestern Polytechnical University (NWPU), China. He is a senior member of the China Computer Federation. His main research interests include mobile internet, ubiquitous computing, social and community intelligence, urban big data mining, mobile crowdsensing, and human-computer interaction.



About this article


Cite this article

Wang, H., Guo, B., Chen, M. et al. Cascade context-oriented spatio-temporal attention network for efficient and fine-grained video-grounded dialogues. Front. Comput. Sci. 19, 197329 (2025). https://doi.org/10.1007/s11704-024-40387-w
