Abstract
Video Question Answering (VideoQA) is a challenging task that requires a model to understand the complex nature of video and the variety of questions that can be asked about it. Existing approaches often suffer from answer candidates that are ambiguous and only weakly relevant to the visual and auditory content of the video, which limits the performance of VideoQA systems. In this paper, we introduce a novel approach that combines gated multi-modal fusion with cross-modal contrastive learning to exploit multi-modal information and strengthen the relevance of answer candidates. First, we propose a gated multi-modal fusion network that learns to combine modalities such as vision and speech according to their relevance to the question, enriching the video representation and improving the accuracy of selecting the correct answer. Second, we apply cross-modal contrastive learning to increase the similarity between positive pairs (i.e., correct answers and their corresponding video clips) while decreasing the similarity between negative pairs (i.e., incorrect answers and unpaired video clips). Specifically, we use three-way contrastive learning between the answer and the video frames, the answer and the audio, and the answer and the fused cross-modal features. We evaluate the proposed approach on two benchmark audio-aware VideoQA datasets, AVQA and Music-AVQA, and compare it with several state-of-the-art methods. The results show that our approach substantially improves VideoQA performance and achieves new state-of-the-art results on both benchmarks.
C. Lyu and W. Li contributed equally to this work.
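To make the two components in the abstract concrete, below is a minimal PyTorch sketch of a question-conditioned gated fusion module and a three-way contrastive objective. The gate network, feature dimensions, temperature, and the simple convex combination of visual and audio features are illustrative assumptions, not the authors' exact architecture or loss.

```python
# Minimal sketch: question-gated fusion of visual/audio features plus an
# InfoNCE-style contrastive loss between answers and each modality.
# All module names, shapes, and hyper-parameters are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedFusion(nn.Module):
    """Question-conditioned gated fusion of visual and audio features."""

    def __init__(self, dim: int):
        super().__init__()
        # The gate weighs each modality by its relevance to the question.
        self.gate = nn.Sequential(nn.Linear(3 * dim, dim), nn.Sigmoid())
        self.proj = nn.Linear(dim, dim)

    def forward(self, visual: torch.Tensor, audio: torch.Tensor,
                question: torch.Tensor) -> torch.Tensor:
        # visual, audio, question: (batch, dim)
        g = self.gate(torch.cat([visual, audio, question], dim=-1))
        fused = g * visual + (1.0 - g) * audio
        return self.proj(fused)


def contrastive_loss(anchor: torch.Tensor, positive: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss: paired rows are positives, other rows in the
    batch act as negatives."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature  # (batch, batch)
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    batch, dim = 4, 256
    fuse = GatedFusion(dim)
    visual, audio, question, answer = (torch.randn(batch, dim) for _ in range(4))
    fused = fuse(visual, audio, question)
    # Three-way objective: answer vs. video frames, audio, and fused features.
    loss = (contrastive_loss(answer, visual)
            + contrastive_loss(answer, audio)
            + contrastive_loss(answer, fused))
    print(loss.item())
```

In this sketch the three contrastive terms are simply summed; how the paper weights the answer-frame, answer-audio, and answer-fused terms is not specified here, so equal weighting is an assumption.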
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Lyu, C., Li, W., Ji, T., Zhou, L., Gurrin, C. (2023). Gated Multi-modal Fusion with Cross-modal Contrastive Learning for Video Question Answering. In: Iliadis, L., Papaleonidas, A., Angelov, P., Jayne, C. (eds) Artificial Neural Networks and Machine Learning – ICANN 2023. ICANN 2023. Lecture Notes in Computer Science, vol 14260. Springer, Cham. https://doi.org/10.1007/978-3-031-44195-0_35