
Pattern Recognition Letters

Volume 133, May 2020, Pages 334-340

Visual question answering with attention transfer and a cross-modal gating mechanism

https://doi.org/10.1016/j.patrec.2020.02.031

Highlights

  • A multi-step attention model for Visual Question Answering is proposed.

  • It relies on attention transfer and a cross-modal gating mechanism.

  • The attention transfer model adjusts attention with question guidance.

  • The cross-modal gating mechanism filters out irrelevant information.

Abstract

Visual question answering (VQA) is challenging since it requires understanding both language information and the corresponding visual content. Considerable effort has been devoted to capturing single-step interactions between language and vision. However, answering complex questions requires multiple steps of reasoning that gradually adjust the region of interest to the most relevant part of the given image, which has not been well investigated. To integrate question-related object relations into the attention mechanism, we propose a multi-step attention architecture that facilitates the modeling of multi-modal correlations. First, an attention transfer mechanism is integrated to gradually adjust the region of interest according to the reasoning representation of the question. Second, we propose a cross-modal gating strategy to filter out irrelevant information based on multi-modal correlations. Finally, we achieve state-of-the-art performance on the VQA 1.0 dataset and favorable results on the VQA 2.0 dataset, which verifies the effectiveness of the proposed method.

Introduction

Visual question answering, which aims to answer questions about a given image, has attracted considerable research attention for its important role in linking natural language processing and computer vision. Recently, deep learning has achieved great success in several computer vision and language processing areas, e.g., image classification [1], [2], [3], [4], object detection [5], [6], [7], [8], [9] and neural machine translation [10], [11].

Compared with image captioning [12], [13] or image-text retrieval [14], [15], visual question answering is much more challenging, as it requires understanding both visual and language content and conducting complex reasoning. Much work has concentrated on the fusion of visual and language information [16], [17], [18]. For example, Fukui et al. [16] employed compact bilinear pooling to approximate full bilinear pooling, which exploits the interaction between the visual and language modalities in a quadratic manner. Recently, attention mechanisms have proven effective in VQA, helping the model concentrate on question-related image areas [18], [19].
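For context on this fusion family, the count-sketch approximation behind multimodal compact bilinear pooling [16] can be sketched as follows; this is a minimal PyTorch illustration under our own assumptions (function names, a 16,000-d sketch), not code from [16] or from this paper, and it omits the signed square-root and L2 normalization used in the original method.

```python
import torch

def count_sketch(x, h, s, d):
    """Project x of shape (batch, n) to a d-dimensional count sketch
    using fixed hash indices h and random signs s."""
    sketch = x.new_zeros(x.size(0), d)
    sketch.scatter_add_(1, h.expand(x.size(0), -1), x * s)  # signed bucket sums
    return sketch

def compact_bilinear(v, q, d=16000, seed=0):
    """Approximate the quadratic (outer-product) interaction of v and q via
    FFT-based circular convolution of their count sketches."""
    g = torch.Generator().manual_seed(seed)
    h_v = torch.randint(0, d, (v.size(1),), generator=g)
    h_q = torch.randint(0, d, (q.size(1),), generator=g)
    s_v = torch.randint(0, 2, (v.size(1),), generator=g).float() * 2 - 1
    s_q = torch.randint(0, 2, (q.size(1),), generator=g).float() * 2 - 1
    sv, sq = count_sketch(v, h_v, s_v, d), count_sketch(q, h_q, s_q, d)
    # circular convolution of two sketches equals the sketch of the outer product
    return torch.fft.irfft(torch.fft.rfft(sv) * torch.fft.rfft(sq), n=d)

# toy usage: fuse a 2048-d image feature with a 1024-d question feature
fused = compact_bilinear(torch.randn(4, 2048), torch.randn(4, 1024))
print(fused.shape)  # torch.Size([4, 16000])
```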

However, to answer a complex question such as “Does the man have his nose pierced?” in Fig. 1, a VQA model should first attend to the man and then refine its region of interest to his nose. Previous works normally apply question-based image attention only once, or perform it several times independently, which neglects the dependency between successive attention steps.

Recently, multi-head attention has proven effective in modeling sequential correlations [20]. To delve deeper into the intrinsic object relations related to the question, we propose an attention transfer algorithm that adjusts the region of interest based on the reasoning information encoded from the question. The multi-head attention mechanism infers the dependency of each pair of objects under the guidance of the question. Furthermore, a cross-modal gating mechanism is integrated to filter out irrelevant question and image channel-wise patterns under the guidance of their multi-modal correlations.
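To make the attention transfer idea concrete, the following is a minimal PyTorch sketch of one question-guided attention step as we read it; the module name QuestionGuidedAttention, the concatenation-based conditioning, and the dimensions are our own illustrative assumptions, and the exact formulation is given in Section 3.

```python
import torch
import torch.nn as nn

class QuestionGuidedAttention(nn.Module):
    """Sketch of one attention-transfer step: region features attend to each
    other, with the queries conditioned on the question representation."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)  # condition each region on the question
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, regions, question):
        # regions: (batch, num_regions, dim); question: (batch, dim)
        q_expand = question.unsqueeze(1).expand_as(regions)
        query = self.fuse(torch.cat([regions, q_expand], dim=-1))
        # multi-head attention infers pairwise object dependencies under question guidance
        attended, weights = self.attn(query, regions, regions)
        return attended, weights
```

Stacking several such layers, each re-conditioned on the question representation, gives the gradual refinement of the region of interest described above.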

In summary, the contributions of this paper are:

  • We propose a multi-step attention architecture for visual question answering, which explicitly adjusts the region of interest under the guidance of the question representation.

  • We introduce an attention transfer mechanism, which builds on a multi-head attention algorithm to capture question-related object relations.

  • A cross-modal gating algorithm is further introduced to control the channel-wise information flow guided by the multi-modal correlations (a minimal sketch is given after this list).

  • Our model achieves state-of-the-art performance on the VQA 1.0 dataset and favorable results on the VQA 2.0 dataset.

  • We perform extensive ablation analysis to verify the effectiveness of each component of our proposed model and qualitatively evaluate our multi-step attention mechanism with visualizations of attention distributions.
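As one hypothetical reading of the cross-modal gate announced above (the exact formulation appears in Section 3; the layer names and sizes below are our own), a sigmoid gate computed from the joint representation can suppress irrelevant channels in each modality:

```python
import torch
import torch.nn as nn

class CrossModalGate(nn.Module):
    """Sketch of a cross-modal gate: channel-wise gates for each modality are
    derived from the concatenated (multi-modal) representation."""

    def __init__(self, dim=512):
        super().__init__()
        self.gate_v = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.gate_q = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, v, q):
        # v, q: (batch, dim) pooled image and question features
        joint = torch.cat([v, q], dim=-1)
        v_gated = v * self.gate_v(joint)  # suppress image channels irrelevant to the question
        q_gated = q * self.gate_q(joint)  # suppress question channels irrelevant to the image
        return v_gated, q_gated
```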

The rest of the paper is organized as follows. Section 2 reviews related work on visual question answering. Section 3 introduces our multi-step attention architecture for VQA in detail. Section 4 presents experimental results on the VQA 1.0 and VQA 2.0 datasets to verify the effectiveness of the proposed model. Section 5 concludes the paper.


Related works

Visual question answering has drawn much research attention in the past few years. It can normally be regarded as a classification task: image features and language features are extracted with CNN and RNN architectures, respectively. The choice of fusion strategy plays a key role in modeling the correlations between the two modalities. A number of methods have been proposed along two main directions: attention mechanisms [21], [22], [23], [24], [25] and fusion methods [16], [18], [26], [27],

Proposed model

The overall architecture of our proposed model is shown in Fig. 2. The model extracts image and question representations with a CNN and an LSTM, respectively, then performs question attention and question-guided multi-step attention to gradually adjust its region of interest, and finally fuses the image and question information with a cross-modal gating algorithm to predict the answer distribution. In this section, we will first introduce feature extraction for image and
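A high-level skeleton of this pipeline, reusing the QuestionGuidedAttention and CrossModalGate sketches above, might look as follows; the module names, dimensions, number of steps, and the mean pooling of attended regions are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class MultiStepVQA(nn.Module):
    """Illustrative end-to-end skeleton of the described pipeline."""

    def __init__(self, vocab_size, num_answers, dim=512, steps=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 300)
        self.lstm = nn.LSTM(300, dim, batch_first=True)   # question encoder
        self.q_attn = nn.Linear(dim, 1)                    # question self-attention scores
        self.transfer = nn.ModuleList(
            [QuestionGuidedAttention(dim) for _ in range(steps)]  # multi-step attention
        )
        self.gate = CrossModalGate(dim)
        self.classifier = nn.Linear(2 * dim, num_answers)

    def forward(self, regions, question_tokens):
        # regions: (batch, num_regions, dim) pre-extracted (and projected) CNN region features
        words, _ = self.lstm(self.embed(question_tokens))
        w = torch.softmax(self.q_attn(words), dim=1)
        q = (w * words).sum(dim=1)                         # attended question representation
        for layer in self.transfer:                        # gradually refine the region of interest
            regions, _ = layer(regions, q)
        v = regions.mean(dim=1)                            # pool the attended regions
        v, q = self.gate(v, q)                             # cross-modal gating
        return self.classifier(torch.cat([v, q], dim=-1))  # answer distribution (logits)
```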

Experiments and evaluation metric

To evaluate our multi-step attention model for visual question answering, we conduct experiments on both the VQA 1.0 and VQA 2.0 datasets. We further perform an ablation analysis on the VQA 2.0 dataset to verify the effectiveness of each component of our architecture. Finally, we achieve state-of-the-art performance on the VQA 1.0 dataset and favorable results on the VQA 2.0 dataset.

Conclusion and future work

In this paper, we propose a multi-step attention mechanism for visual question answering, which gradually adjusts its region of interest with the guidance of question information. First, we integrate an attention transfer algorithm to gradually adjust the region of interest, which is based on the multi-head attention mechanism to capture object relations. Second, a cross-modal gating algorithm is further incorporated to control the channel-wise information flow of both visual and language

Declaration of Competing Interest

The authors declare that they have no financial or non-financial conflicts of interest.

References (38)

  • D. Mandal et al., Query specific re-ranking for improved cross-modal retrieval, Pattern Recognit. Lett. (2017)
  • J. Moraleda, Large scalability in document image matching using text retrieval, Pattern Recognit. Lett. (2012)
  • V. Lioutas et al., Explicit ensemble attention learning for improving visual question answering, Pattern Recognit. Lett. (2018)
  • A. Krizhevsky et al., ImageNet classification with deep convolutional neural networks, Advances in Neural Information Processing Systems (NIPS) (2012)
  • K. Simonyan et al., Very deep convolutional networks for large-scale image recognition, The International Conference on Learning Representations (ICLR) (2015)
  • C. Szegedy et al., Going deeper with convolutions, The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
  • K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, ...
  • S. Ren et al., Faster R-CNN: towards real-time object detection with region proposal networks, Advances in Neural Information Processing Systems (NIPS) (2015)
  • W. Liu et al., SSD: single shot multibox detector, The European Conference on Computer Vision (ECCV) (2016)
  • J. Redmon et al., You only look once: unified, real-time object detection, The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
  • T.-Y. Lin et al., Feature pyramid networks for object detection, The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
  • Y. Xi et al., Beyond context: exploring semantic similarity for small object detection in crowded scenes, Pattern Recognit. Lett. (2019)
  • I. Sutskever et al., Sequence to sequence learning with neural networks, Advances in Neural Information Processing Systems (NIPS) (2014)
  • D. Bahdanau et al., Neural machine translation by jointly learning to align and translate, The International Conference on Learning Representations (ICLR) (2015)
  • K. Xu et al., Show, attend and tell: neural image caption generation with visual attention, The International Conference on Machine Learning (ICML) (2015)
  • X. Chen et al., Leveraging unpaired out-of-domain data for image captioning, Pattern Recognit. Lett. (2018)
  • A. Fukui et al., Multimodal compact bilinear pooling for visual question answering and visual grounding, Conference on Empirical Methods in Natural Language Processing (EMNLP) (2016)
  • J.-H. Kim et al., Hadamard product for low-rank bilinear pooling, The International Conference on Learning Representations (ICLR) (2017)
  • Z. Yu et al., Beyond bilinear: generalized multimodal factorized high-order pooling for visual question answering, IEEE Trans. Neural Netw. Learn. Syst. (2018)