Abstract
Video Question Answering (VideoQA) is a challenging task that requires a model to understand the complex nature of video and the variety of questions that can be asked about it. Existing approaches often suffer from answer candidates that are ambiguous and only weakly relevant to the visual and auditory content of the video, which limits the performance of VideoQA systems. In this paper, we introduce a novel approach that combines gated multi-modal fusion with cross-modal contrastive learning to exploit multi-modal information and strengthen the relevance of answer candidates. First, we propose a gated multi-modal fusion network that learns to combine modalities such as vision and speech according to their relevance to the question, enriching the video representation and improving the accuracy of selecting the correct answer. Second, we apply cross-modal contrastive learning to increase the similarity between positive pairs (i.e., correct answers and their corresponding video clips) while decreasing the similarity between negative pairs (i.e., incorrect answers and unpaired video clips). Specifically, we use three-way contrastive learning between the answer and the video frames, the answer and the audio, and the answer and the fused cross-modal features. We evaluate the proposed approach on two benchmark audio-aware VideoQA datasets, AVQA and Music-AVQA, and compare it with several state-of-the-art methods. The results show that our approach substantially improves VideoQA performance and achieves new state-of-the-art results on both benchmarks.
C. Lyu and W. Li contributed equally to this work.
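To make the two components in the abstract concrete, below is a minimal PyTorch sketch of a question-conditioned gated fusion module and a three-way contrastive objective. The gate network, feature dimensions, temperature, and the simple convex combination of visual and audio features are illustrative assumptions, not the authors' exact architecture or loss.

```python
# Minimal sketch: question-gated fusion of visual/audio features plus an
# InfoNCE-style contrastive loss between answers and each modality.
# All module names, shapes, and hyper-parameters are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedFusion(nn.Module):
    """Question-conditioned gated fusion of visual and audio features."""

    def __init__(self, dim: int):
        super().__init__()
        # The gate weighs each modality by its relevance to the question.
        self.gate = nn.Sequential(nn.Linear(3 * dim, dim), nn.Sigmoid())
        self.proj = nn.Linear(dim, dim)

    def forward(self, visual: torch.Tensor, audio: torch.Tensor,
                question: torch.Tensor) -> torch.Tensor:
        # visual, audio, question: (batch, dim)
        g = self.gate(torch.cat([visual, audio, question], dim=-1))
        fused = g * visual + (1.0 - g) * audio
        return self.proj(fused)


def contrastive_loss(anchor: torch.Tensor, positive: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss: paired rows are positives, other rows in the
    batch act as negatives."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature  # (batch, batch)
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    batch, dim = 4, 256
    fuse = GatedFusion(dim)
    visual, audio, question, answer = (torch.randn(batch, dim) for _ in range(4))
    fused = fuse(visual, audio, question)
    # Three-way objective: answer vs. video frames, audio, and fused features.
    loss = (contrastive_loss(answer, visual)
            + contrastive_loss(answer, audio)
            + contrastive_loss(answer, fused))
    print(loss.item())
```

In this sketch the three contrastive terms are simply summed; how the paper weights the answer-frame, answer-audio, and answer-fused terms is not specified here, so equal weighting is an assumption.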
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Lyu, C., Li, W., Ji, T., Zhou, L., Gurrin, C. (2023). Gated Multi-modal Fusion with Cross-modal Contrastive Learning for Video Question Answering. In: Iliadis, L., Papaleonidas, A., Angelov, P., Jayne, C. (eds) Artificial Neural Networks and Machine Learning – ICANN 2023. ICANN 2023. Lecture Notes in Computer Science, vol 14260. Springer, Cham. https://doi.org/10.1007/978-3-031-44195-0_35