Abstract:
To handle VQA tasks in complex scenes involving multiple entities and to produce reliable explanations, models need to fully understand the high-level semantics of visual and textual features. Existing VQA methods usually leave entity-relation features unexplored, which limits answer-prediction accuracy and yields explanations that are insufficiently relevant to the image and question. To address this, we use visual relational reasoning to enhance the overall understanding of the image scene and to improve the accuracy of both predicted answers and explanations. Our proposed method, Multi-Entity Relational Reasoning based Explanation (MERGE), constructs action, spatial, and attribute relations among the question-related entities in an image. The contextual visual features are encoded with a graph attention mechanism and fused with question and answer embeddings to generate more accurate textual explanations. To validate the effectiveness of our method, we conducted extensive experiments on seven datasets, including VQA-CP, VQA-X, and CLEVR-X. The results demonstrate improved answer accuracy and high-quality explanations. Furthermore, our results show that the supervisory signal provided by explanations quantitatively improves answer-prediction accuracy.
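As a rough illustration of the encoding step the abstract describes, below is a minimal PyTorch sketch of one GAT-style attention layer over a single entity-relation mask, with the pooled output concatenated to a question embedding. It is not the authors' implementation: MERGE builds three relation graphs (action, spatial, attribute) and also fuses answer embeddings, whereas all class names, dimensions, and the mean-pool fusion here are illustrative assumptions.

# Minimal sketch (assumptions only, not the MERGE code): single-head graph
# attention over question-related entities, then fusion with a question vector.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationGraphAttention(nn.Module):
    """GAT-style attention: each entity attends to its related entities."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=False)
        self.attn = nn.Linear(2 * dim, 1, bias=False)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (N, dim) entity features; adj: (N, N) 0/1 relation mask
        h = self.proj(x)
        n = h.size(0)
        # Pairwise concatenation of node i and node j features: (N, N, 2*dim)
        pairs = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                           h.unsqueeze(0).expand(n, n, -1)], dim=-1)
        e = F.leaky_relu(self.attn(pairs).squeeze(-1))      # (N, N) scores
        e = e.masked_fill(adj == 0, float('-inf'))          # keep related pairs only
        alpha = torch.softmax(e, dim=-1)                    # per-neighbor weights
        return F.elu(alpha @ h)                             # contextual features

dim, n_entities = 64, 5
entities = torch.randn(n_entities, dim)            # detected entity features
adj = (torch.rand(n_entities, n_entities) > 0.5).float()
adj.fill_diagonal_(1.0)                            # self-loops keep softmax defined
ctx = RelationGraphAttention(dim)(entities, adj)   # (5, 64) relation-aware features

# Fuse pooled visual context with a question embedding (illustrative choice);
# the fused vector would feed the answer and explanation heads.
question = torch.randn(dim)
fused = torch.cat([ctx.mean(dim=0), question])
print(fused.shape)                                 # torch.Size([128])

In the paper's setting, one such graph would be built per relation type and the resulting contextual features combined before fusion; the single mask above is a simplification.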
Date of Conference: 14-17 November 2023
Date Added to IEEE Xplore: 25 December 2023
Conference Location: Abu Dhabi, United Arab Emirates