
Pattern Recognition Letters

Volume 133, May 2020, Pages 334-340

Visual question answering with attention transfer and a cross-modal gating mechanism

https://doi.org/10.1016/j.patrec.2020.02.031

Highlights

  • A multi-step attention model for Visual Question Answering is proposed.

  • It relies on attention transfer and a cross-modal gating mechanism.

  • The attention transfer model adjusts attention with question guidance.

  • The cross-modal gating mechanism filters out irrelevant information.

Abstract

Visual question answering (VQA) is challenging since it requires understanding both language information and the corresponding visual content. Considerable effort has been devoted to capturing single-step interactions between language and vision. However, answering complex questions requires multiple steps of reasoning that gradually adjust the region of interest to the most relevant part of the given image, which has not been well investigated. To integrate question-related object relations into the attention mechanism, we propose a multi-step attention architecture that facilitates the modeling of multi-modal correlations. First, an attention transfer mechanism is integrated to gradually adjust the region of interest according to the reasoning representation of the question. Second, we propose a cross-modal gating strategy to filter out irrelevant information based on multi-modal correlations. Finally, we achieve state-of-the-art performance on the VQA 1.0 dataset and favorable results on the VQA 2.0 dataset, which verifies the effectiveness of the proposed method.

Introduction

Visual question answering, which aims to answer questions about a given image, has attracted considerable research attention for its important role in linking natural language processing and computer vision. Recently, deep learning has achieved great success in several computer vision and language processing areas, e.g., image classification [1], [2], [3], [4], object detection [5], [6], [7], [8], [9] and neural machine translation [10], [11].

Compared with image captioning [12], [13] or image-text retrieval [14], [15], visual question answering is much more challenging, as it requires understanding both visual and language content and conducting complex reasoning. Much work has concentrated on the fusion of visual and language information [16], [17], [18]. For example, Fukui et al. [16] employed compact bilinear pooling to approximate full bilinear pooling, which exploits the interaction between the visual and language modalities in a quadratic manner. Recently, attention mechanisms have proven effective in VQA, helping the model concentrate on question-related image areas [18], [19].
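For context on this fusion family, the count-sketch approximation behind multimodal compact bilinear pooling [16] can be sketched as follows; this is a minimal PyTorch illustration under our own assumptions (function names, a 16,000-d sketch), not code from [16] or from this paper, and it omits the signed square-root and L2 normalization used in the original method.

```python
import torch

def count_sketch(x, h, s, d):
    """Project x of shape (batch, n) to a d-dimensional count sketch
    using fixed hash indices h and random signs s."""
    sketch = x.new_zeros(x.size(0), d)
    sketch.scatter_add_(1, h.expand(x.size(0), -1), x * s)  # signed bucket sums
    return sketch

def compact_bilinear(v, q, d=16000, seed=0):
    """Approximate the quadratic (outer-product) interaction of v and q via
    FFT-based circular convolution of their count sketches."""
    g = torch.Generator().manual_seed(seed)
    h_v = torch.randint(0, d, (v.size(1),), generator=g)
    h_q = torch.randint(0, d, (q.size(1),), generator=g)
    s_v = torch.randint(0, 2, (v.size(1),), generator=g).float() * 2 - 1
    s_q = torch.randint(0, 2, (q.size(1),), generator=g).float() * 2 - 1
    sv, sq = count_sketch(v, h_v, s_v, d), count_sketch(q, h_q, s_q, d)
    # circular convolution of two sketches equals the sketch of the outer product
    return torch.fft.irfft(torch.fft.rfft(sv) * torch.fft.rfft(sq), n=d)

# toy usage: fuse a 2048-d image feature with a 1024-d question feature
fused = compact_bilinear(torch.randn(4, 2048), torch.randn(4, 1024))
print(fused.shape)  # torch.Size([4, 16000])
```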

However, to answer a complex question such as “Does the man have his nose pierced?” in Fig. 1, a VQA model should first attend to the man and then refine its region of interest to his nose. Previous works normally apply question-based image attention only once, or perform it several times independently, which neglects the dependency between successive attention steps.

Recently, multi-head attention has proven effective in modeling sequential correlations [20]. To delve deeper into the intrinsic object relations related to the question, we propose an attention transfer algorithm that adjusts the region of interest based on the reasoning information encoded from the question. The multi-head attention mechanism infers the dependency of each pair of objects under the guidance of the question. Furthermore, a cross-modal gating mechanism is integrated to filter out irrelevant question and image channel-wise patterns under the guidance of their multi-modal correlations.
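To make the attention transfer idea concrete, the following is a minimal PyTorch sketch of one question-guided attention step as we read it; the module name QuestionGuidedAttention, the concatenation-based conditioning, and the dimensions are our own illustrative assumptions, and the exact formulation is given in Section 3.

```python
import torch
import torch.nn as nn

class QuestionGuidedAttention(nn.Module):
    """Sketch of one attention-transfer step: region features attend to each
    other, with the queries conditioned on the question representation."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)  # condition each region on the question
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, regions, question):
        # regions: (batch, num_regions, dim); question: (batch, dim)
        q_expand = question.unsqueeze(1).expand_as(regions)
        query = self.fuse(torch.cat([regions, q_expand], dim=-1))
        # multi-head attention infers pairwise object dependencies under question guidance
        attended, weights = self.attn(query, regions, regions)
        return attended, weights
```

Stacking several such layers, each re-conditioned on the question representation, gives the gradual refinement of the region of interest described above.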

In summary, the contributions of this paper are:

  • We propose a multi-step attention architecture for visual question answering, which explicitly adjusts the region of interest under the guidance of the question representation.

  • We introduce an attention transfer mechanism, which builds on a multi-head attention algorithm to capture question-related object relations.

  • A cross-modal gating algorithm is further introduced to control the channel-wise information flow guided by the multi-modal correlations (a minimal sketch is given after this list).

  • Our model achieves state-of-the-art performance on the VQA 1.0 dataset and favorable results on the VQA 2.0 dataset.

  • We perform extensive ablation analysis to verify the effectiveness of each component of our proposed model and qualitatively evaluate our multi-step attention mechanism with visualizations of attention distributions.
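As one hypothetical reading of the cross-modal gate announced above (the exact formulation appears in Section 3; the layer names and sizes below are our own), a sigmoid gate computed from the joint representation can suppress irrelevant channels in each modality:

```python
import torch
import torch.nn as nn

class CrossModalGate(nn.Module):
    """Sketch of a cross-modal gate: channel-wise gates for each modality are
    derived from the concatenated (multi-modal) representation."""

    def __init__(self, dim=512):
        super().__init__()
        self.gate_v = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.gate_q = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, v, q):
        # v, q: (batch, dim) pooled image and question features
        joint = torch.cat([v, q], dim=-1)
        v_gated = v * self.gate_v(joint)  # suppress image channels irrelevant to the question
        q_gated = q * self.gate_q(joint)  # suppress question channels irrelevant to the image
        return v_gated, q_gated
```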

The rest of the paper is organized as follows. Section 2 reviews related work on visual question answering. Section 3 introduces our multi-step attention architecture for VQA in detail. Section 4 presents experimental results on the VQA 1.0 and VQA 2.0 datasets to verify the effectiveness of the proposed model. Section 5 concludes the paper.


Related works

Visual question answering has drawn much research attention in the past few years. It can normally be regarded as a classification task: image features and language features are extracted with CNN and RNN architectures, respectively. The choice of fusion strategy plays a key role in modeling the correlations between the two modalities. A number of methods have been proposed along two main directions: attention mechanisms [21], [22], [23], [24], [25] and fusion methods [16], [18], [26], [27],

Proposed model

The overall architecture of our proposed model is shown in Fig. 2. The model extracts image and question representations with a CNN and an LSTM, respectively, then performs question attention and question-guided multi-step attention to gradually adjust its region of interest, and finally fuses the image and question information with a cross-modal gating algorithm to predict the answer distribution. In this section, we will first introduce feature extraction for image and
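A high-level skeleton of this pipeline, reusing the QuestionGuidedAttention and CrossModalGate sketches above, might look as follows; the module names, dimensions, number of steps, and the mean pooling of attended regions are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class MultiStepVQA(nn.Module):
    """Illustrative end-to-end skeleton of the described pipeline."""

    def __init__(self, vocab_size, num_answers, dim=512, steps=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 300)
        self.lstm = nn.LSTM(300, dim, batch_first=True)   # question encoder
        self.q_attn = nn.Linear(dim, 1)                    # question self-attention scores
        self.transfer = nn.ModuleList(
            [QuestionGuidedAttention(dim) for _ in range(steps)]  # multi-step attention
        )
        self.gate = CrossModalGate(dim)
        self.classifier = nn.Linear(2 * dim, num_answers)

    def forward(self, regions, question_tokens):
        # regions: (batch, num_regions, dim) pre-extracted (and projected) CNN region features
        words, _ = self.lstm(self.embed(question_tokens))
        w = torch.softmax(self.q_attn(words), dim=1)
        q = (w * words).sum(dim=1)                         # attended question representation
        for layer in self.transfer:                        # gradually refine the region of interest
            regions, _ = layer(regions, q)
        v = regions.mean(dim=1)                            # pool the attended regions
        v, q = self.gate(v, q)                             # cross-modal gating
        return self.classifier(torch.cat([v, q], dim=-1))  # answer distribution (logits)
```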

Experiments and evaluation metric

To evaluate our multi-step attention model for visual question answering, we conduct experiments on both the VQA 1.0 and VQA 2.0 datasets. We further perform an ablation analysis on the VQA 2.0 dataset to verify the effectiveness of each component of our architecture. Finally, we achieve state-of-the-art performance on the VQA 1.0 dataset and favorable results on the VQA 2.0 dataset.

Conclusion and future work

In this paper, we propose a multi-step attention mechanism for visual question answering, which gradually adjusts its region of interest with the guidance of question information. First, we integrate an attention transfer algorithm to gradually adjust the region of interest, which is based on the multi-head attention mechanism to capture object relations. Second, a cross-modal gating algorithm is further incorporated to control the channel-wise information flow of both visual and language

Declaration of Competing Interest

The authors declare that they have no financial or non-financial conflicts of interest.

References (38)

  • D. Mandal et al., Query specific re-ranking for improved cross-modal retrieval, Pattern Recognit. Lett. (2017)
  • J. Moraleda, Large scalability in document image matching using text retrieval, Pattern Recognit. Lett. (2012)
  • V. Lioutas et al., Explicit ensemble attention learning for improving visual question answering, Pattern Recognit. Lett. (2018)
  • A. Krizhevsky et al., ImageNet classification with deep convolutional neural networks, Advances in Neural Information Processing Systems (NIPS) (2012)
  • K. Simonyan et al., Very deep convolutional networks for large-scale image recognition, The International Conference on Learning Representations (ICLR) (2015)
  • C. Szegedy et al., Going deeper with convolutions, The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
  • K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, ...
  • S. Ren et al., Faster R-CNN: towards real-time object detection with region proposal networks, Advances in Neural Information Processing Systems (NIPS) (2015)
  • W. Liu et al., SSD: single shot multibox detector, The European Conference on Computer Vision (ECCV) (2016)
  • J. Redmon et al., You only look once: unified, real-time object detection, The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
  • T.-Y. Lin et al., Feature pyramid networks for object detection, The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
  • Y. Xi et al., Beyond context: exploring semantic similarity for small object detection in crowded scenes, Pattern Recognit. Lett. (2019)
  • I. Sutskever et al., Sequence to sequence learning with neural networks, Advances in Neural Information Processing Systems (NIPS) (2014)
  • D. Bahdanau et al., Neural machine translation by jointly learning to align and translate, The International Conference on Learning Representations (ICLR) (2015)
  • K. Xu et al., Show, attend and tell: neural image caption generation with visual attention, The International Conference on Machine Learning (ICML) (2015)
  • X. Chen et al., Leveraging unpaired out-of-domain data for image captioning, Pattern Recognit. Lett. (2018)
  • A. Fukui et al., Multimodal compact bilinear pooling for visual question answering and visual grounding, Conference on Empirical Methods in Natural Language Processing (EMNLP) (2016)
  • J.-H. Kim et al., Hadamard product for low-rank bilinear pooling, The International Conference on Learning Representations (ICLR) (2017)
  • Z. Yu et al., Beyond bilinear: generalized multimodal factorized high-order pooling for visual question answering, IEEE Trans. Neural Netw. Learn. Syst. (2018)