Aligning vision-language for graph inference in visual dialog

https://doi.org/10.1016/j.imavis.2021.104316

Highlights

  • Visual dialog requires constructing semantic dependencies between visual and textual content.

  • The gap between different modalities should be narrowed by aligning visual and textual knowledge.

  • A graph structure connects isolated visual objects enriched with textual semantics.

  • Introducing external visual relationship information helps the model comprehend complex relationships more easily.

Abstract

As a cross-media intelligence task, visual dialog requires answering a sequence of questions about an image, using the dialog history as context. To produce correct answers, exploring the semantic dependencies among potential visual and textual contents is vital. Prior works usually ignored the underlying knowledge hidden in internal and external textual-visual relationships, which resulted in unreasonable inference. In this paper, we propose Aligning Vision-Language for Graph Inference (AVLGI) for visual dialog, which combines internal context-aware information with external scene graph knowledge. Compared with other approaches, it compensates for the lack of structural inference in visual dialog. The whole system consists of three modules: Inter-Modalities Alignment (IMA), Visual Graph Attended by Text (VGAT) and Combining Scene Graph and Textual Contents (CSGTC). Specifically, the IMA module represents an image with a set of integrated visual regions and their corresponding textual concepts, reflecting certain semantics. The VGAT module views the visual features with semantic information as observed nodes and measures the importance weight between every two nodes in the visual graph. The CSGTC module supplements various relationships between visual objects by introducing additional scene graph information. We evaluate the model qualitatively and quantitatively on the VisDial v1.0 dataset, showing that AVLGI outperforms previous state-of-the-art models.
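As an illustration of the alignment idea behind the IMA module described above, the sketch below lets each visual region softly attend over textual concepts and absorb the most related ones. It is a minimal, hedged sketch: the layer names, dimensions and the residual fusion are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of inter-modality alignment: each region gathers the
# textual concepts most related to it via a soft alignment matrix.
# Dimensions and the residual fusion are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterModalityAlignment(nn.Module):
    def __init__(self, v_dim=2048, t_dim=512, d=512):
        super().__init__()
        self.proj_v = nn.Linear(v_dim, d)  # project region features
        self.proj_t = nn.Linear(t_dim, d)  # project word/concept features

    def forward(self, regions, words):
        # regions: (B, Nv, v_dim), words: (B, Nt, t_dim)
        v = self.proj_v(regions)                            # (B, Nv, d)
        t = self.proj_t(words)                              # (B, Nt, d)
        sim = torch.bmm(v, t.transpose(1, 2))               # region-word similarity (B, Nv, Nt)
        align = F.softmax(sim / v.size(-1) ** 0.5, dim=-1)  # each region attends over words
        aligned_text = torch.bmm(align, t)                  # textual context per region (B, Nv, d)
        return v + aligned_text                             # semantically enriched region features

# Toy usage: 36 detected regions aligned with 20 word features
ima = InterModalityAlignment()
print(ima(torch.randn(2, 36, 2048), torch.randn(2, 20, 512)).shape)  # torch.Size([2, 36, 512])
```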

Introduction

Cross-modal semantically-aligned understanding has become an attractive challenge in computer vision and natural language processing, motivating many tasks such as image captioning [1], [2] and visual question answering (VQA) [3], [4], [5]. Although these tasks have inspired enormous efforts on merging vision and language to promote smarter AI, they are single-round, whereas human conversations are mostly multi-round. Therefore, Das et al. introduced a continuous question-answering task, visual dialog [6]. This task requires an AI agent to answer a series of questions based on visually-grounded information and on contextual information from the dialog history and the current question.

Recently, a manual investigation [7] of the Visual Dialog dataset (VisDial) revealed that around 80% of the questions can be answered from the image alone, while about 20% require knowledge from the dialog history. According to these findings, a model [6] that directly utilizes only the original features suffers from the drawback of losing fine-grained information. Hence, it is important to capture question-relevant dialog history and to ground the related visual objects in the given image. For example, when answering Q2 in Fig. 1, the agent is required to understand the meaning of “hers” and locate the relevant objects “woman” and “umbrella” in the image. Therefore, one of the key challenges in visual dialog is how to effectively exploit the underlying contents of the textual and visual information, i.e., the input question, the dialog history and the input image. Previous works such as RvA [8] and DAN [9] tended to extract the desired clues by referring back to the dialog history, but they ignored the underlying relational visual structure that contributes to dialog inference. More recently, researchers have tried to use fixed graph attention or embeddings to address this obstacle with structural representations [10], [11]. They generally focused on the textual modality while neglecting the rich underlying information hidden in the image. Visual dialog, however, is a visually-grounded conversational task, which requires understanding a series of multi-modal entities and reasoning over the rich information in both vision and language.

To address the aforementioned problems, we pay more attention to visual-textual relations. The architecture of the whole system, which explores potential information in AVLGI and digs out more relevant external knowledge for structural inference by introducing CSGTC, is illustrated in Fig. 2. The agent first employs an attention mechanism to obtain the attended question features, the history features attended by the question, and the visual features attended by the question and history. However, the semantics in different modalities are usually inconsistent. Furthermore, the obtained visual features lack structural representations and mutual relationships. Thus, the IMA module aligns the visual features and context-aware contents with their relevant counterparts in each image domain, so that the visual features carry more specific semantic information. For example, in the IMA module of Fig. 2, the visual feature v2 is linked to the text “playing game, in white shirt, what she doing”, because the intention of this module is to detect all related contextual information for each visual feature. To infer more reasonably and to connect the individual visual features into a whole, we design the VGAT module to construct a visual graph whose edges reflect the different levels of importance of the correlations among visual features. For example, for the feature v1 in the VGAT module of Fig. 2, the thickest link between v1 and v2 indicates the most important connection between the two features. This module learns how to select the other nodes related to the current node, and in the last step each visual feature node is connected to its related nodes. Through these two modules, the final visual representations possess more related semantic information and are intra-connected in the graph, which benefits inference. Nevertheless, we found that relationship information between image objects was still lacking, so we introduce external data, namely scene graphs, to enhance the learning ability of the model. The relationships in the scene graph enrich, among others, the spatial position relationships and verb relationships between different objects. For example, the object nodes “woman v1” and “gamepad v2” in the CSGTC module of Fig. 2 are linked by a relational edge “holding”.
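To make the VGAT step above concrete, the following sketch treats the enriched region features as nodes of a fully connected visual graph and lets each node re-weight and aggregate its neighbours under the attended question. It is a generic, question-conditioned graph-attention layer written for exposition; the gating, projections and dimensions are assumptions rather than the paper's exact architecture.

```python
# Sketch of a text-conditioned visual graph attention step in the spirit
# of VGAT: pairwise importance weights are computed between every two
# question-aware nodes, then each node aggregates its related neighbours.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualGraphAttention(nn.Module):
    def __init__(self, d=512):
        super().__init__()
        self.q = nn.Linear(d, d)
        self.k = nn.Linear(d, d)
        self.v = nn.Linear(d, d)
        self.gate = nn.Linear(2 * d, d)  # fuse question context into each node (assumption)

    def forward(self, nodes, question):
        # nodes: (B, N, d) enriched region features; question: (B, d) attended question
        q_ctx = question.unsqueeze(1).expand_as(nodes)
        h = torch.tanh(self.gate(torch.cat([nodes, q_ctx], dim=-1)))  # question-aware nodes
        attn = torch.bmm(self.q(h), self.k(h).transpose(1, 2))        # pairwise importance (B, N, N)
        attn = F.softmax(attn / h.size(-1) ** 0.5, dim=-1)            # normalise over neighbours
        return nodes + torch.bmm(attn, self.v(h))                     # aggregate related nodes

# Toy usage: 36 region nodes updated under one question vector
vgat = VisualGraphAttention()
print(vgat(torch.randn(2, 36, 512), torch.randn(2, 512)).shape)  # torch.Size([2, 36, 512])
```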

In our work, the major contributions are as follows:

  • The Inter-Modalities Alignment (IMA) module identifies the corresponding information between different attended regions and words.

  • The novel Visual Graph Attended by Text (VGAT) module refines the representation of each visual feature by building visual connections guided by the matching textual information.

  • The Combining Scene Graph and Textual Contents (CSGTC) module compensates for the semantic information of the different relationships among visual objects by integrating scene graphs.

  • Extensive experiments on the VisDial v1.0 dataset achieve promising performance compared with other methods, demonstrating the effectiveness and reliability of our proposed model.

Section snippets

Visual dialog

The proposed models for the task of visual dialog introduced by Das et al. [6] can be categorized into four groups. Fusion-based models: late fusion (LF) [6] and the hierarchical recurrent network (HRE) [6] encoded the multi-modal inputs (image, question and history) directly and then fed them into the decoder to obtain the answer. Attention-based models: the memory network (MN) encoder [6], history-conditioned image attention (HCIAE) [12] and the synergistic co-attention network (Synergistic) [13] applied

Proposed approach

In this section, we first define the visual dialog task as in Das et al. [6]. Formally, a visual dialog agent takes an image I, a question Qt and the dialog history Ht as inputs. The question Qt is asked in the current round t, and Ht consists of the Q&A pairs up to round t-1; in the first round it contains only the caption C of the image I. The agent is required to return an answer At to Qt by ranking a list of 100 candidate answers.
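The task definition above fixes the agent's interface: fuse (I, Qt, Ht) into a dialog state and rank 100 candidate answers. The toy snippet below illustrates only that ranking step, scoring candidates by a dot product with the fused state, which is one common discriminative-decoder choice; the encoders producing these vectors are stand-ins and are not specified by this section.

```python
# Sketch of the candidate-ranking interface: score each of the 100 candidate
# answers against the fused dialog state and sort them. The dot-product
# scorer is an assumption for illustration, not the paper's decoder.
import torch

def rank_candidates(dialog_state, candidate_embs):
    # dialog_state: (d,) fused image/question/history vector
    # candidate_embs: (100, d) encoded candidate answers
    scores = candidate_embs @ dialog_state          # one score per candidate
    return torch.argsort(scores, descending=True)   # indices from best to worst

order = rank_candidates(torch.randn(512), torch.randn(100, 512))
print(order[:5])  # top-5 ranked candidate indices
```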

We will present the language features and the image

Dataset and evaluation metrics

We evaluate our proposed model on VisDial v1.0 [6], composed of MS-COCO images [43] and Flickr images. The training set of VisDial v1.0 comes entirely from COCO train2014 and val2014. The collection of the validation and test sets on Flickr images is similar to that on MS-COCO images. The training, validation and test sets of the v1.0 dataset contain 123 k, 2 k and 8 k dialogs, respectively. Each dialog in the test set has a random length within 10 rounds, which differs from the training and validation sets in v1.0
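Since the agent ranks the ground-truth answer among 100 candidates, the standard VisDial retrieval metrics are rank-based. The short sketch below computes MRR and Recall@k from the ground-truth ranks; VisDial v1.0 additionally reports NDCG over densely annotated answer relevances, which is omitted here for brevity.

```python
# Sketch of standard VisDial ranking metrics computed from the 1-based rank
# assigned to the ground-truth answer among the 100 candidates.
def mrr_and_recall(gt_ranks, ks=(1, 5, 10)):
    # gt_ranks: list of 1-based ranks of the ground-truth answer per question
    n = len(gt_ranks)
    mrr = sum(1.0 / r for r in gt_ranks) / n                       # mean reciprocal rank
    recall = {k: sum(r <= k for r in gt_ranks) / n for k in ks}    # fraction ranked in top-k
    return mrr, recall

mrr, recall = mrr_and_recall([1, 3, 12, 2, 7])
print(f"MRR={mrr:.3f}", {f"R@{k}": v for k, v in recall.items()})
```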

Conclusion

In this paper, we propose the Aligning Vision-Language for Graph Inference (AVLGI) network to address structural representations in the visual dialog task. In contrast to most existing works that depend on attention maps, AVLGI is capable of aligning information across modalities between visual regions and textual words, and it utilizes a graph neural network to measure the connection values among different regions. On this basis, our method has another innovation, namely the introduction of

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work was partially supported by Postgraduate Research & Practice Innovation Program of Jiangsu Province (SJCX21_1341), National Natural Science Foundation of China (NSFC Grant No. 61773272, 61272258, 61301299, 61572085, 61170124, 61272005), Provincial Natural Science Foundation of Jiangsu (Grant No. BK20151254, BK20151260), Science and Education Innovation based Cloud Data fusion Foundation of Science and Technology Development Center of Education Ministry (2017B03112), Six talent peaks

References (44)

  • Yulei Niu et al., Recursive visual attention in visual dialog.

  • Gi-Cheon Kang et al., Dual attention networks for visual reference resolution in visual dialog (2019).

  • Zilong Zheng et al., Reasoning visual dialogs with structural and partial observations.

  • Idan Schwartz et al., Factor graph attention.

  • Jiasen Lu et al., Best of both worlds: Transferring knowledge from discriminative learning to a generative visual dialog model, Advances in Neural Information Processing Systems (2017).

  • Dalu Guo et al., Image-question-answer synergistic network for visual dialog.

  • Paul Hongsuck Seo et al., Visual reference resolution using attention memory for visual dialog, Advances in Neural Information Processing Systems (2017).

  • Satwik Kottur et al., Visual coreference resolution in visual dialog using neural module networks.

  • Dan Guo et al., Iterative context-aware graph inference for visual dialog.

  • Andrej Karpathy et al., Deep visual-semantic alignments for generating image descriptions.

  • Hyeonseob Nam et al., Dual attention networks for multimodal reasoning and matching.

  • Jin-Hwa Kim et al., Bilinear attention networks.
