Abstract
Knowledge-based visual question answering must not only answer questions about image content but also incorporate external knowledge to reason in the joint space of vision and language. To bridge the gap between visual content and semantic cues, it is important to capture question-related and semantics-rich vision-language connections. Most existing solutions model only simple intra-modality relations, or represent cross-modality relations with a single vector, which makes it difficult to capture the complex connections between visual features and question features. We therefore propose a cross-modality multiple relations learning model that enriches cross-modality representations and constructs advanced multi-modality knowledge triplets. First, we design a simple yet effective method to generate multiple relations that capture the rich cross-modality interactions: the various cross-modality relations link the textual question to the related visual objects, and the resulting multi-modality triplets efficiently align visual objects with their corresponding textual answers. Second, to encourage the multiple relations to align with distinct semantic relations, we formulate a novel global-local loss. The global loss draws visual objects and their corresponding textual answers closer to each other through the cross-modality relations in the vision-language space, while the local loss preserves semantic diversity among the multiple relations. Experimental results on the Outside Knowledge VQA and Knowledge-Routed Visual Question Reasoning datasets demonstrate that our model outperforms state-of-the-art methods.
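To make the global-local objective concrete, here is a minimal toy sketch of one plausible reading of the abstract: a TransE-style global term pulls each (visual head + relation) translation toward the answer embedding, and a local term penalizes pairwise similarity among the relation vectors to keep them semantically diverse. The function name, the averaging scheme, and the weight `beta` are illustrative assumptions, not the paper's exact formulation.

```python
import math


def _add(u, v):
    return [a + b for a, b in zip(u, v)]


def _sub(u, v):
    return [a - b for a, b in zip(u, v)]


def _norm(u):
    return math.sqrt(sum(a * a for a in u))


def _cos(u, v):
    # Cosine similarity with a small epsilon to avoid division by zero.
    return sum(a * b for a, b in zip(u, v)) / (_norm(u) * _norm(v) + 1e-8)


def global_local_loss(head, relations, tail, beta=0.1):
    """Toy global-local loss over one multi-modality triplet.

    head      -- embedding of the visual object (list of floats)
    relations -- list of cross-modality relation vectors
    tail      -- embedding of the textual answer
    beta      -- weight of the local (diversity) term; illustrative choice

    Global term: average distance between the translated head (head + r)
    and the tail, so each relation aligns the visual object with the answer.
    Local term: average pairwise cosine similarity among relation vectors,
    penalized so the multiple relations stay diverse.
    """
    global_loss = sum(_norm(_sub(_add(head, r), tail)) for r in relations) / len(relations)
    pairs = [(i, j) for i in range(len(relations)) for j in range(i + 1, len(relations))]
    local_loss = sum(_cos(relations[i], relations[j]) for i, j in pairs) / max(len(pairs), 1)
    return global_loss + beta * local_loss
```

With this sketch, two identical relation vectors incur a higher local penalty than two orthogonal ones of the same length, matching the stated goal of preserving semantic diversity among the multiple relations.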
Cross-modality Multiple Relations Learning for Knowledge-based Visual Question Answering