Aligning vision-language for graph inference in visual dialog

https://doi.org/10.1016/j.imavis.2021.104316

Highlights

  • Visual dialog requires constructing semantic dependencies between visual and textual content.

  • The gap between different modalities should be narrowed by aligning visual and textual knowledge.

  • A graph structure connects isolated visual objects enriched with textual semantics.

  • Introducing external visual relationship information helps the model comprehend complex relationships more easily.

Abstract

As a cross-media intelligence task, visual dialog requires answering a sequence of questions about an image, using the dialog history as context. To produce correct answers, exploring the semantic dependencies among potential visual and textual contents is vital. Prior works usually ignored the underlying knowledge hidden in internal and external textual-visual relationships, which resulted in unreasonable inference. In this paper, we propose Aligning Vision-Language for Graph Inference (AVLGI) for visual dialog, which combines internal context-aware information with external scene graph knowledge. Compared with other approaches, it compensates for the lack of structural inference in visual dialog. The whole system consists of three modules: Inter-Modalities Alignment (IMA), Visual Graph Attended by Text (VGAT) and Combining Scene Graph and Textual Contents (CSGTC). Specifically, the IMA module represents an image with a set of integrated visual regions and their corresponding textual concepts, reflecting certain semantics. The VGAT module views the visual features with semantic information as observed nodes and measures the importance weight between every two nodes in the visual graph. The CSGTC module supplements various relationships between visual objects by introducing additional scene graph information. We evaluate the model qualitatively and quantitatively on the VisDial v1.0 dataset, showing that AVLGI outperforms previous state-of-the-art models.
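As an illustration of the alignment idea behind the IMA module described above, the sketch below lets each visual region softly attend over textual concepts and absorb the most related ones. It is a minimal, hedged sketch: the layer names, dimensions and the residual fusion are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of inter-modality alignment: each region gathers the
# textual concepts most related to it via a soft alignment matrix.
# Dimensions and the residual fusion are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterModalityAlignment(nn.Module):
    def __init__(self, v_dim=2048, t_dim=512, d=512):
        super().__init__()
        self.proj_v = nn.Linear(v_dim, d)  # project region features
        self.proj_t = nn.Linear(t_dim, d)  # project word/concept features

    def forward(self, regions, words):
        # regions: (B, Nv, v_dim), words: (B, Nt, t_dim)
        v = self.proj_v(regions)                            # (B, Nv, d)
        t = self.proj_t(words)                              # (B, Nt, d)
        sim = torch.bmm(v, t.transpose(1, 2))               # region-word similarity (B, Nv, Nt)
        align = F.softmax(sim / v.size(-1) ** 0.5, dim=-1)  # each region attends over words
        aligned_text = torch.bmm(align, t)                  # textual context per region (B, Nv, d)
        return v + aligned_text                             # semantically enriched region features

# Toy usage: 36 detected regions aligned with 20 word features
ima = InterModalityAlignment()
print(ima(torch.randn(2, 36, 2048), torch.randn(2, 20, 512)).shape)  # torch.Size([2, 36, 512])
```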

Introduction

Cross-modal semantically-aligned understanding has become an attractive challenge in computer vision and natural language processing, motivating many tasks such as image captioning [1], [2] and visual question answering (VQA) [3], [4], [5]. Although these tasks have inspired enormous efforts on merging vision and language to promote smarter AI, they are single-round, whereas human conversations are mostly multi-round. Therefore, Das et al. introduced a continuous question-answering task, visual dialog [6]. This task requires an AI agent to answer a series of questions based on visually-grounded information and on contextual information from the dialog history and the current question.

Recently, a manual investigation [7] of the Visual Dialog dataset (VisDial) revealed that around 80% of the questions can be answered from the image alone, while about 20% require knowledge from the dialog history. According to these findings, a model [6] that directly utilizes only the original features suffers from the drawback of losing fine-grained information. Hence, it is important to capture question-relevant dialog history and to ground the related visual objects in the given image. For example, when answering Q2 in Fig. 1, the agent is required to understand the meaning of “hers” and locate the relevant objects “woman” and “umbrella” in the image. Therefore, one of the key challenges in visual dialog is how to effectively exploit the underlying contents of the textual and visual information, i.e., the input question, the dialog history and the input image. Previous works such as RvA [8] and DAN [9] tended to extract the desired clues by referring back to the dialog history, but they ignored the underlying relational visual structure that contributes to dialog inference. More recently, researchers have tried to use fixed graph attention or embeddings to address this obstacle with structural representations [10], [11]. They generally focused on the textual modality while neglecting the rich underlying information hidden in the image. Visual dialog, however, is a visually-grounded conversational task, which requires understanding a series of multi-modal entities and reasoning over the rich information in both vision and language.

To address the aforementioned problems, we pay more attention to visual-textual relations. The architecture of the whole system, which explores potential information in AVLGI and digs out more relevant external knowledge for structural inference by introducing CSGTC, is illustrated in Fig. 2. The agent first employs an attention mechanism to obtain the attended question features, the history features attended by the question, and the visual features attended by the question and history. However, the semantics in different modalities are usually inconsistent. Furthermore, the obtained visual features lack structural representations and mutual relationships. Thus, the IMA module aligns the visual features and context-aware contents with their relevant counterparts in each image domain, so that the visual features carry more specific semantic information. For example, in the IMA module of Fig. 2, the visual feature v2 is linked to the text “playing game, in white shirt, what she doing”, because the intention of this module is to detect all related contextual information for each visual feature. To infer more reasonably and to connect the individual visual features into a whole, we design the VGAT module to construct a visual graph whose edges reflect the different levels of importance of the correlations among visual features. For example, for the feature v1 in the VGAT module of Fig. 2, the thickest link between v1 and v2 indicates the most important connection between the two features. This module learns how to select the other nodes related to the current node, and in the last step each visual feature node is connected to its related nodes. Through these two modules, the final visual representations possess more related semantic information and are intra-connected in the graph, which benefits inference. Nevertheless, we found that relationship information between image objects was still lacking, so we introduce external data, namely scene graphs, to enhance the learning ability of the model. The relationships in the scene graph enrich, among others, the spatial position relationships and verb relationships between different objects. For example, the object nodes “woman v1” and “gamepad v2” in the CSGTC module of Fig. 2 are linked by a relational edge “holding”.
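To make the VGAT step above concrete, the following sketch treats the enriched region features as nodes of a fully connected visual graph and lets each node re-weight and aggregate its neighbours under the attended question. It is a generic, question-conditioned graph-attention layer written for exposition; the gating, projections and dimensions are assumptions rather than the paper's exact architecture.

```python
# Sketch of a text-conditioned visual graph attention step in the spirit
# of VGAT: pairwise importance weights are computed between every two
# question-aware nodes, then each node aggregates its related neighbours.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualGraphAttention(nn.Module):
    def __init__(self, d=512):
        super().__init__()
        self.q = nn.Linear(d, d)
        self.k = nn.Linear(d, d)
        self.v = nn.Linear(d, d)
        self.gate = nn.Linear(2 * d, d)  # fuse question context into each node (assumption)

    def forward(self, nodes, question):
        # nodes: (B, N, d) enriched region features; question: (B, d) attended question
        q_ctx = question.unsqueeze(1).expand_as(nodes)
        h = torch.tanh(self.gate(torch.cat([nodes, q_ctx], dim=-1)))  # question-aware nodes
        attn = torch.bmm(self.q(h), self.k(h).transpose(1, 2))        # pairwise importance (B, N, N)
        attn = F.softmax(attn / h.size(-1) ** 0.5, dim=-1)            # normalise over neighbours
        return nodes + torch.bmm(attn, self.v(h))                     # aggregate related nodes

# Toy usage: 36 region nodes updated under one question vector
vgat = VisualGraphAttention()
print(vgat(torch.randn(2, 36, 512), torch.randn(2, 512)).shape)  # torch.Size([2, 36, 512])
```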

In our work, the major contributions are as follows:

  • The Inter-Modalities Alignment (IMA) module identifies the corresponding information between different attended regions and words.

  • The novel Visual Graph Attended by Text (VGAT) module refines the representation of each visual feature by building visual connections guided by the matching textual information.

  • The Combining Scene Graph and Textual Contents (CSGTC) module compensates for the semantic information of the different relationships among visual objects by integrating scene graphs.

  • Extensive experiments on the VisDial v1.0 dataset achieve promising performance compared with other methods, demonstrating the effectiveness and reliability of our proposed model.

Section snippets

Visual dialog

The proposed models for the task of visual dialog introduced by Das et al. [6] can be categorized into four groups. Fusion-based models: late fusion (LF) [6] and the hierarchical recurrent network (HRE) [6] encoded the multi-modal inputs (image, question and history) directly and then fed them into the decoder to obtain the answer. Attention-based models: the memory network (MN) encoder [6], history-conditioned image attention (HCIAE) [12] and the synergistic co-attention network (Synergistic) [13] applied

Proposed approach

In this section, we first define the visual dialog task as in Das et al. [6]. Formally, a visual dialog agent takes an image I, a question Qt and the dialog history Ht as inputs. The question Qt is asked in the current round t, and Ht consists of the Q&A pairs up to round t-1; in the first round it contains only the caption C of the image I. The agent is required to return an answer At to Qt by ranking a list of 100 candidate answers.
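The task definition above fixes the agent's interface: fuse (I, Qt, Ht) into a dialog state and rank 100 candidate answers. The toy snippet below illustrates only that ranking step, scoring candidates by a dot product with the fused state, which is one common discriminative-decoder choice; the encoders producing these vectors are stand-ins and are not specified by this section.

```python
# Sketch of the candidate-ranking interface: score each of the 100 candidate
# answers against the fused dialog state and sort them. The dot-product
# scorer is an assumption for illustration, not the paper's decoder.
import torch

def rank_candidates(dialog_state, candidate_embs):
    # dialog_state: (d,) fused image/question/history vector
    # candidate_embs: (100, d) encoded candidate answers
    scores = candidate_embs @ dialog_state          # one score per candidate
    return torch.argsort(scores, descending=True)   # indices from best to worst

order = rank_candidates(torch.randn(512), torch.randn(100, 512))
print(order[:5])  # top-5 ranked candidate indices
```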

We will present the language features and the image

Dataset and evaluation metrics

We evaluate our proposed model on VisDial v1.0 [6], composed of MS-COCO images [43] and Flickr images. The training set of VisDial v1.0 comes entirely from COCO train2014 and val2014. The collection of the validation and test sets on Flickr images is similar to that on MS-COCO images. The training, validation and test sets of the v1.0 dataset contain 123 k, 2 k and 8 k dialogs, respectively. Each dialog in the test set has a random length within 10 rounds, which differs from the training and validation sets in v1.0
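Since the agent ranks the ground-truth answer among 100 candidates, the standard VisDial retrieval metrics are rank-based. The short sketch below computes MRR and Recall@k from the ground-truth ranks; VisDial v1.0 additionally reports NDCG over densely annotated answer relevances, which is omitted here for brevity.

```python
# Sketch of standard VisDial ranking metrics computed from the 1-based rank
# assigned to the ground-truth answer among the 100 candidates.
def mrr_and_recall(gt_ranks, ks=(1, 5, 10)):
    # gt_ranks: list of 1-based ranks of the ground-truth answer per question
    n = len(gt_ranks)
    mrr = sum(1.0 / r for r in gt_ranks) / n                       # mean reciprocal rank
    recall = {k: sum(r <= k for r in gt_ranks) / n for k in ks}    # fraction ranked in top-k
    return mrr, recall

mrr, recall = mrr_and_recall([1, 3, 12, 2, 7])
print(f"MRR={mrr:.3f}", {f"R@{k}": v for k, v in recall.items()})
```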

Conclusion

In this paper, we propose the Aligning Vision-Language for Graph Inference (AVLGI) network to address structural representations in the visual dialog task. In contrast to most existing works that depend on attention maps, AVLGI is capable of aligning information across modalities between visual regions and textual words, and it utilizes a graph neural network to measure the connection values among different regions. On this basis, our method has another innovation, namely the introduction of

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work was partially supported by Postgraduate Research & Practice Innovation Program of Jiangsu Province (SJCX21_1341), National Natural Science Foundation of China (NSFC Grant No. 61773272, 61272258, 61301299, 61572085, 61170124, 61272005), Provincial Natural Science Foundation of Jiangsu (Grant No. BK20151254, BK20151260), Science and Education Innovation based Cloud Data fusion Foundation of Science and Technology Development Center of Education Ministry (2017B03112), Six talent peaks

References (44)

  • Yulei Niu et al., Recursive visual attention in visual dialog.

  • Gi-Cheon Kang et al., Dual attention networks for visual reference resolution in visual dialog (2019).

  • Zilong Zheng et al., Reasoning visual dialogs with structural and partial observations.

  • Idan Schwartz et al., Factor graph attention.

  • Jiasen Lu et al., Best of both worlds: Transferring knowledge from discriminative learning to a generative visual dialog model, Advances in Neural Information Processing Systems (2017).

  • Dalu Guo et al., Image-question-answer synergistic network for visual dialog.

  • Paul Hongsuck Seo et al., Visual reference resolution using attention memory for visual dialog, Advances in Neural Information Processing Systems (2017).

  • Satwik Kottur et al., Visual coreference resolution in visual dialog using neural module networks.

  • Dan Guo et al., Iterative context-aware graph inference for visual dialog.

  • Andrej Karpathy et al., Deep visual-semantic alignments for generating image descriptions.

  • Hyeonseob Nam et al., Dual attention networks for multimodal reasoning and matching.

  • Jin-Hwa Kim et al., Bilinear attention networks.
