Aligning vision-language for graph inference in visual dialog
Introduction
Cross-modal semantically-aligned understanding has become an attractive challenge in computer vision and natural language processing, motivating tasks such as image captioning [1], [2] and visual question answering (VQA) [3], [4], [5]. Although these tasks have inspired enormous efforts to merge vision and language toward smarter AI, they are single-round, whereas human conversations are mostly multi-round. Das et al. therefore introduced a continuous question-answering task, visual dialog [6], which requires an AI agent to answer a series of questions based on visually-grounded information and on contextual information from the dialog history and the current question.
Recently, a manual investigation [7] of the Visual Dialog dataset (VisDial) revealed that around 80% of the questions can be answered from the image alone, while about 20% require knowledge from the dialog history. In light of these figures, a model that directly uses only the original features [6] risks losing fine-grained information. It is therefore crucial to capture question-relevant dialog history and to ground the related visual objects in the given image. For example, to answer Q2 in Fig. 1, the agent must understand the meaning of “hers” and locate the relevant objects “woman” and “umbrella” in the image. Hence, one of the key challenges in visual dialog is how to effectively exploit the underlying content of the textual and visual inputs, i.e., the current question, the dialog history, and the image. Previous works such as RvA [8] and DAN [9] extracted the desired clues by referring back to the dialog history, but they ignored the underlying relational visual structure that supports dialog inference. More recently, researchers have tried fixed graph attention or embeddings to obtain structural representations [10], [11]; however, they generally focused on the textual modality and neglected the rich information hidden in the image. Visual dialog is a visually-grounded conversational task, so an agent must understand a series of multi-modal entities and reason over the rich information in both vision and language.
To address the aforementioned problems, we focus on visual-textual relations. Fig. 2 illustrates the architecture of the whole system, which explores the potential information in AVLGI and digs out relevant external knowledge for structural inference via CSGTC. The agent first employs attention mechanisms to obtain the attended question features, the history features attended by the question, and the visual features attended by the question-history. However, the semantics of different modalities are usually inconsistent, and the attended visual features still lack structural representations and mutual relationships. The IMA module therefore aligns the visual features and context-aware contents with their relevant counterparts in the image domain, so that the visual features carry more specific semantic information. For example, in Fig. 2, the visual feature v2 in the IMA module is linked to the text “playing game, in white shirt, what she doing”, because this module detects all related contextual information for each visual feature. To infer more reasonably and connect the individual visual features into a whole, we design the VGAT module, which constructs a visual graph whose edge weights reflect the different levels of correlation among visual features. For example, for the feature v1 in the VGAT module of Fig. 2, the thickest link, between v1 and v2, indicates the most important connection between the two features. This module learns to select the other nodes related to the current node, so that after this step each visual feature node in the graph is connected to its related nodes. Through these two modules, the final visual representations carry more related semantic information and are intra-connected in the graph, which benefits inference. Nevertheless, relationship information between image objects is still missing, so we introduce external data, scene graphs, to enhance the learning ability of the model. The relationships in the scene graph enrich the spatial and verb relationships between different objects; for example, the object nodes “woman v1” and “gamepad v2” in the CSGTC module of Fig. 2 are linked by a relational edge labeled “holding”.
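To make the VGAT step concrete, the following is a minimal sketch, assuming a PyTorch implementation with a simple dot-product affinity and hypothetical dimensions; the paper's exact projections and normalization may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualGraphAttention(nn.Module):
    """Connects every visual feature node to its related nodes with
    learned edge weights, then aggregates messages along those edges."""
    def __init__(self, v_dim: int, hid_dim: int):
        super().__init__()
        self.query = nn.Linear(v_dim, hid_dim)  # projection for the current node
        self.key = nn.Linear(v_dim, hid_dim)    # projection for candidate neighbors
        self.value = nn.Linear(v_dim, v_dim)    # message carried along an edge

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        # v: (batch, num_regions, v_dim), text-aligned visual features from IMA
        q, k = self.query(v), self.key(v)
        # Pairwise affinities: larger values correspond to the thicker
        # edges shown in Fig. 2 (e.g., the v1-v2 link).
        logits = torch.bmm(q, k.transpose(1, 2)) / (k.size(-1) ** 0.5)
        edge_weights = F.softmax(logits, dim=-1)          # (batch, N, N)
        # Each node aggregates its neighbors' messages; the residual
        # connection preserves the node's own content.
        return v + torch.bmm(edge_weights, self.value(v))

# Hypothetical usage: 36 region features of dimension 2048 per image.
vgat = VisualGraphAttention(v_dim=2048, hid_dim=512)
v_refined = vgat(torch.randn(2, 36, 2048))                # (2, 36, 2048)
```

The softmax-normalized affinity matrix plays the role of the learned graph: every node attends to every other node, and the weights express how strongly two regions should exchange information.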
In our work, the major contributions are as follows:
- The Inter-Modalities Alignment (IMA) module identifies the corresponding information between different attended regions and words (a minimal sketch of this module follows the list).
- The novel Visual Graph Attended by Text (VGAT) module refines the representation of each visual feature by building visual connections with the matching textual information.
- The Combining Scene Graph and Textual Contents (CSGTC) module compensates for the semantic information of the different relationships among visual objects by integrating scene graphs.
- Extensive experiments on the VisDial v1.0 dataset achieve promising performance compared with other methods, demonstrating the effectiveness and reliability of our proposed model.
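As referenced in the first contribution above, here is a minimal sketch of the IMA idea, assuming a PyTorch formulation in which each visual region attends over word-level features; the names and dimensions are illustrative, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterModalAlignment(nn.Module):
    """For each attended visual region, finds its relevant textual
    counterparts (question/history words) and fuses them into the region."""
    def __init__(self, v_dim: int, t_dim: int, hid_dim: int):
        super().__init__()
        self.v_proj = nn.Linear(v_dim, hid_dim)
        self.t_proj = nn.Linear(t_dim, hid_dim)
        self.fuse = nn.Linear(v_dim + t_dim, v_dim)

    def forward(self, v: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # v: (batch, N, v_dim) visual regions; t: (batch, M, t_dim) word features
        scores = torch.bmm(self.v_proj(v), self.t_proj(t).transpose(1, 2))
        attn = F.softmax(scores, dim=-1)   # each region's distribution over words
        ctx = torch.bmm(attn, t)           # (batch, N, t_dim) matched textual context
        # The aligned region now carries its relevant semantics, as when
        # v2 is linked to "playing game, in white shirt" in Fig. 2.
        return torch.tanh(self.fuse(torch.cat([v, ctx], dim=-1)))
```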
Visual dialog
The models proposed for the visual dialog task introduced by Das et al. [6] can be categorized into four groups. Fusion-based models: late fusion (LF) [6] and the hierarchical recurrent network (HRE) [6] encoded the multi-modal inputs (image, question, and history) directly and fed them into the decoder to produce the answer. Attention-based models: the memory network (MN) encoder [6], history-conditioned image attention (HCIAE) [12], and the synergistic co-attention network (Synergistic) [13] applied attention mechanisms to focus on question-relevant image regions and history.
Proposed approach
In this section, we first define the visual dialog task as in Das et al. [6]. Formally, a visual dialog agent takes an image I, a question Qt, and a dialog history Ht as inputs, where Qt is the question asked in the current round t, and Ht consists of the Q&A pairs up to round t-1 (in the first round it contains only the caption C of the image I). The agent is required to return an answer At to Qt by ranking a list of 100 candidate answers.
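This input/output contract can be summarized in a short sketch; the class and function names below are illustrative, not part of any dataset API.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class DialogRound:
    question: str
    answer: str

@dataclass
class VisDialInstance:
    image_id: int               # image I
    caption: str                # caption C (the only history in round 1)
    history: List[DialogRound]  # Q&A pairs up to round t-1
    question: str               # current question Q_t
    candidates: List[str]       # 100 candidate answers to be ranked

def answer(instance: VisDialInstance, score_fn) -> List[Tuple[str, float]]:
    """Rank the 100 candidates with a model-provided scoring function."""
    scored = [(c, score_fn(instance, c)) for c in instance.candidates]
    return sorted(scored, key=lambda x: x[1], reverse=True)
```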
We will present the language and image features, followed by the details of each module of our approach.
Dataset and evaluation metrics
We evaluate our proposed model on VisDial v1.0 [6], which is built from MS-COCO images [43] and Flickr images. The training set of VisDial v1.0 comes entirely from COCO train2014 and val2014, while the validation and test sets were collected on Flickr images in a similar manner. The training, validation, and test sets of the v1.0 dataset contain 123k, 2k, and 8k dialogs, respectively. Each dialog in the test set has a random length of up to 10 rounds, unlike the training and validation dialogs, which all have 10 rounds.
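Evaluation on VisDial conventionally follows a retrieval protocol: the rank of the ground-truth answer among the 100 candidates determines metrics such as mean reciprocal rank (MRR), recall at k (R@k), and mean rank. A minimal sketch of these standard metrics, with a hypothetical function name:

```python
from typing import Dict, List

def retrieval_metrics(gt_ranks: List[int]) -> Dict[str, float]:
    """gt_ranks: 1-based rank of the ground-truth answer among the
    100 candidates, one entry per evaluated question."""
    n = len(gt_ranks)
    return {
        "MRR":  sum(1.0 / r for r in gt_ranks) / n,
        "R@1":  sum(r <= 1 for r in gt_ranks) / n,
        "R@5":  sum(r <= 5 for r in gt_ranks) / n,
        "R@10": sum(r <= 10 for r in gt_ranks) / n,
        "Mean": sum(gt_ranks) / n,
    }
```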
Conclusion
In this paper, we propose the Aligning Vision-Language for Graph Inference (AVLGI) network to address structural representation learning in the visual dialog task. In contrast to most existing works that depend on attention maps, AVLGI aligns the information of different modalities between visual regions and textual words, and it utilizes a graph neural network to measure the connection values among different regions. On this basis, our method introduces a further innovation: external scene graphs that supply the relational semantics among visual objects.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgement
This work was partially supported by the Postgraduate Research & Practice Innovation Program of Jiangsu Province (SJCX21_1341), the National Natural Science Foundation of China (NSFC Grant No. 61773272, 61272258, 61301299, 61572085, 61170124, 61272005), the Provincial Natural Science Foundation of Jiangsu (Grant No. BK20151254, BK20151260), the Science and Education Innovation based Cloud Data Fusion Foundation of the Science and Technology Development Center of the Education Ministry (2017B03112), and the Six Talent Peaks Project of Jiangsu Province.
References (44)
- et al., Image caption model of double LSTM with scene factors, Image Vis. Comput. (2019)
- et al., Learning visual relationship and context-aware attention for image captioning, Pattern Recognition (2020)
- et al., Visual question answering model based on visual relationship detection, Signal Process. Image Commun. (2020)
- et al., Stimulus-driven and concept-driven analysis for image caption generation, Neurocomputing (2020)
- et al., Show, attend and tell: neural image caption generation with visual attention
- et al., VQA: visual question answering
- et al., Bottom-up and top-down attention for image captioning and visual question answering
- et al., Visual question answering dataset for bilingual image understanding: a study of cross-lingual transfer using attention maps
- et al., Visual dialog
- et al., Modality-balanced models for visual dialogue
- Recursive visual attention in visual dialog
- Dual attention networks for visual reference resolution in visual dialog
- Reasoning visual dialogs with structural and partial observations
- Factor graph attention
- Best of both worlds: transferring knowledge from discriminative learning to a generative visual dialog model, Advances in Neural Information Processing Systems
- Image-question-answer synergistic network for visual dialog
- Visual reference resolution using attention memory for visual dialog, Advances in Neural Information Processing Systems
- Visual coreference resolution in visual dialog using neural module networks
- Iterative context-aware graph inference for visual dialog
- Deep visual-semantic alignments for generating image descriptions
- Dual attention networks for multimodal reasoning and matching
- Bilinear attention networks