Knowledge-Based Systems

Volume 207, 5 November 2020, 106339

Visual Question Answering via Combining Inferential Attention and Semantic Space Mapping

https://doi.org/10.1016/j.knosys.2020.106339

Abstract

Visual Question Answering (VQA) has emerged and aroused widespread interest in recent years. Its purpose is to explore the close correlations between the image and question for answer inference. We have two observations about the VQA task: (1) the number of newly defined answers is ever-growing, which means that predicting answers only over pre-defined labeled answers may lead to errors, as an unlabeled answer may be the correct choice for a question–image pair; (2) in the process of answering visual questions, the gradual change of human attention plays an important guiding role in exploring the correlations between images and questions. Based on these observations, we propose a novel model for VQA that combines Inferential Attention and Semantic Space Mapping (IASSM). Specifically, our model has two salient aspects: (1) a semantic space shared by both the labeled and unlabeled answers is constructed to learn new answers, where the joint embedding of a question and the corresponding image is mapped and clustered around the answer exemplar; (2) a novel inferential attention model is designed to simulate the learning process of human attention to explore the correlations between the image and question, focusing on the more important question words and the image regions associated with the question. Both the inferential attention and semantic space mapping modules are integrated into an end-to-end framework to infer the answer. Experiments on two public VQA datasets and our newly constructed dataset show the superiority of IASSM over existing methods.

Introduction

With the great development of natural language processing and computer vision, problems combining vision and language in artificial intelligence are inspiring considerable research interest. A new task called Visual Question Answering (VQA) [1], [2], [3], [4], [5] has emerged as a promising but intractable research direction. VQA requires algorithms to output answers to natural language questions about the contents of given images. Compared with conventional multi-modal tasks such as cross-modal retrieval [6], [7], [8] and image captioning [9], [10], [11], the VQA task demands a deep understanding of the input image and question sentence to infer the answer. VQA can be applied to many scenarios in which it plays a crucial role, e.g., early education, human-machine interaction, medical assistance, and automatic customer service [12].

A great variety of VQA methods based on deep neural networks [13], [14], [15], [16] have sprung up in recent years; they concentrate on learning an effective multi-modal joint embedding of the image and question to infer the answer. Most existing methods first employ a visual attention mechanism [17], [18] to learn the joint embedding by exploring the correlations between the question words and image regions. Then, VQA is treated as a classification problem, and the learned joint embedding is fed into an answer classifier trained on a large number of labeled samples. The candidate answers of the classifier are the labeled answers in the training dataset. However, existing VQA methods have two drawbacks. On the one hand, a sufficient set of question–image pairs labeled with corresponding answers is usually unavailable in real-world applications. As a result, the performance of VQA degrades when the correct answers to the testing questions are unlabeled and lie outside the training dataset. Moreover, the number of newly defined answers is ever-growing, which means that training a specific model for each answer is unattainable. On the other hand, existing attention-based VQA methods implicitly explore the correlations between the visual content and the textual sentence, and thus cannot capture the different importance of the terms in the process of answering questions. These attention-based methods lack a reasonable derivation process, which makes them inconsistent with the progressive changes of human attention when answering visual questions. Therefore, there is an urgent need for an explicit mechanism that addresses these two shortcomings of previous VQA methods for more accurate answer prediction.
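To make this conventional pipeline concrete, the following minimal sketch shows a single-step attention-based VQA classifier of the kind discussed here: region features are attended with the question, fused into a joint embedding, and classified over the labeled answer vocabulary only. All module names and dimensions are illustrative assumptions rather than the design of any particular method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionVQAClassifier(nn.Module):
    """Illustrative single-step attention VQA baseline (not the paper's model)."""
    def __init__(self, region_dim=2048, question_dim=1024, joint_dim=1024, num_answers=3000):
        super().__init__()
        self.att_proj = nn.Linear(region_dim + question_dim, 512)
        self.att_score = nn.Linear(512, 1)
        self.fuse = nn.Linear(region_dim + question_dim, joint_dim)
        self.classifier = nn.Linear(joint_dim, num_answers)  # labeled answers only

    def forward(self, regions, question):
        # regions: (B, K, region_dim) pre-extracted image region features
        # question: (B, question_dim) encoded question sentence
        q = question.unsqueeze(1).expand(-1, regions.size(1), -1)
        scores = self.att_score(torch.tanh(self.att_proj(torch.cat([regions, q], dim=-1))))
        alpha = F.softmax(scores, dim=1)            # attention weights over K regions
        attended = (alpha * regions).sum(dim=1)     # question-guided visual feature
        joint = torch.tanh(self.fuse(torch.cat([attended, question], dim=-1)))
        return self.classifier(joint)               # logits restricted to labeled answers
```

Because the classifier head is tied to the labeled answer vocabulary, such a model cannot output an answer it has never seen during training, which is exactly the first drawback discussed above.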

In recent years, Zero-Shot Learning (ZSL) [19], [20], [21] has been proposed as an ambitious paradigm in image recognition for recognizing novel classes for which no training samples are provided. Inspired by it, we regard new answers that are unlabeled in the training set as novel classes that can be predicted with ZSL. To predict the unlabeled answers, we introduce an intermediate semantic space that is shared between the labeled and unlabeled answers. In this semantic space, semantic information can be transferred from labeled answers to unlabeled answers. As the example in Fig. 1 shows, given the question “what is on the sofa?” and the corresponding image with the labeled answer “dog”, a joint embedding of the question–image pair is first learned. Then, the joint embeddings that seek the same labeled answer are mapped into the semantic space and clustered around the corresponding answer exemplar; in this instance, the answer exemplar is the embedding of the labeled answer “dog”. When testing with a new question “what is beside the clock?” that seeks the unlabeled answer “rabbit”, the joint embedding learned from the question–image pair is mapped into the semantic space. Since this embedding is located near the labeled answer exemplar “dog”, the answer for the image and the corresponding question is obtained by searching for the best-matched answer exemplar around “dog”, and the unlabeled answer “rabbit” is inferred.
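A minimal sketch of this semantic-space idea is given below, assuming the answer exemplars are fixed word embeddings (e.g., GloVe vectors) shared by labeled and unlabeled answers; the mapping network, dimensions, and cosine-similarity matching are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticSpaceMapper(nn.Module):
    """Maps a question-image joint embedding into the shared answer semantic space."""
    def __init__(self, joint_dim=1024, semantic_dim=300):
        super().__init__()
        self.map = nn.Sequential(
            nn.Linear(joint_dim, 512),
            nn.ReLU(),
            nn.Linear(512, semantic_dim),
        )

    def forward(self, joint_embedding):
        # Training would pull each mapped embedding toward the exemplar of its labeled answer.
        return F.normalize(self.map(joint_embedding), dim=-1)

def predict_answers(mapped, answer_exemplars, answer_words):
    # answer_exemplars: (A, semantic_dim) embeddings of both labeled and unlabeled
    # answers; the nearest exemplar (by cosine similarity) gives the prediction,
    # so an unlabeled answer such as "rabbit" can still be returned at test time.
    sims = mapped @ F.normalize(answer_exemplars, dim=-1).t()
    return [answer_words[i] for i in sims.argmax(dim=-1).tolist()]
```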

As for the visual attention mechanism used to capture the correlations between the image and question, it can draw inspiration from the learning process of human attention. Taking the image and the related question “what is beside the clock?” in Fig. 1 as an example, the inferential process of human attention can be elaborated as follows. First, we focus on finding the image regions related to the noun word “clock”. Then, these regions are combined with the question sentence to learn the question-related multi-modal information, i.e., “what is located beside the focused regions?”. In the end, this multi-modal information is used to attend to the image again, and the answer “rabbit” is inferred. Therefore, it is appropriate to design a visual attention mechanism that is consistent with this learning process of human attention to infer the answer.
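The two-step reasoning described above can be sketched roughly as follows: regions are first attended with the embedding of the noun word (“clock”), the focused visual feature is fused with the question to form the question-related multi-modal cue, and that cue attends to the image a second time. This is an illustrative approximation of the idea, not the paper's exact inferential attention network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStepAttention(nn.Module):
    """Rough sketch of a human-like, two-step attention process."""
    def __init__(self, region_dim=2048, text_dim=1024):
        super().__init__()
        self.score1 = nn.Linear(region_dim + text_dim, 1)  # noun-guided attention
        self.fuse = nn.Linear(region_dim + text_dim, text_dim)
        self.score2 = nn.Linear(region_dim + text_dim, 1)  # re-attention with the fused cue

    def attend(self, regions, cue, score_layer):
        cue = cue.unsqueeze(1).expand(-1, regions.size(1), -1)
        alpha = F.softmax(score_layer(torch.cat([regions, cue], dim=-1)), dim=1)
        return (alpha * regions).sum(dim=1)

    def forward(self, regions, noun_emb, question_emb):
        focused = self.attend(regions, noun_emb, self.score1)      # step 1: find the "clock" regions
        cue = torch.tanh(self.fuse(torch.cat([focused, question_emb], dim=-1)))
        return self.attend(regions, cue, self.score2)              # step 2: look beside those regions
```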

In this paper, we propose to make the best of Zero-Shot Learning and the inferential process of human attention for visual question answering. In particular, we investigate: (1) how to transfer information from labeled answers to unlabeled answers; (2) how to effectively capture the close correlations between the image and question sentence to learn an effective multi-modal joint embedding for answer inference. To address these questions, we propose a novel VQA model that combines Inferential Attention and Semantic Space Mapping (IASSM). As shown in Fig. 2, our model mainly contains two components. Specifically, an inferential attention network is designed to imitate the learning process of human attention and capture more effective multi-modal correlations between the image and question; it attempts to learn more reasonable attention maps for the noun words and the question-related multi-modal information. To predict unlabeled answers, a semantic space shared by the labeled and unlabeled answers is designed, in which each answer exemplar corresponds to the embedding of an answer. For the question–image pairs, the joint embeddings are mapped to the semantic space and clustered around the answer exemplars. Then, the unlabeled answers to new questions can be inferred by seeking the best-matched answer exemplars in the semantic space. The major contributions of this paper are summarized as follows:

  • Unlike previous VQA works, we investigate the problem of predicting unlabeled answers. A semantic space is designed to transfer information from labeled answers to unlabeled answers.

  • Different from existing attention-based approaches, we propose a novel inferential attention network that imitates the learning process of human attention to learn more effective multi-modal correlations between the image and question.

  • We construct a zero-shot dataset for VQA from a public dataset. Extensive experiments conducted on our constructed dataset and two public VQA datasets confirm the favorable performance of our model compared with the baselines.

The remainder of this paper is organized as follows. We first review related work on VQA and Zero-Shot Learning. Then, our IASSM model is introduced in detail. Next, the experimental results and further analysis are presented. Finally, we summarize the paper and discuss future work.

Section snippets

Related work

VQA To facilitate VQA research, several datasets have been constructed and introduced in [1], [22], [23], [24], which are automatically generated or manually labeled from image caption datasets. Based on these datasets, the VQA works that have emerged in recent years mainly adopt deep neural networks to learn a joint embedding of the image and question for answer inference. Early methods detailed in [25], [26], [27] directly combine the two embeddings learned from the image and question as a joint embedding and …

IASSM for visual question answering

In this section, we first present an overview of IASSM. Then, each component of IASSM is introduced in detail.

Datasets and baselines

We conduct extensive experiments on the following three datasets to evaluate the performance of IASSM:

VQA v1.0 [1] is the most commonly used VQA dataset, constructed from the image caption dataset MS-COCO. The questions are divided into three categories: yes/no, number, and other. Each question corresponds to 10 answers created by crowd-sourced workers. This dataset consists of three splits, i.e., train (248,349 samples), val (121,512 samples), and test (244,302 samples). The test set …

Conclusion and future work

This paper focuses on the Visual Question Answering (VQA) task that has emerged in recent years, whose research has important practical implications for early education, human-machine interaction, etc. Our goal is to explore human-like attention to capture more effective multi-modal correlations and to predict unlabeled answers for VQA. We propose a novel model for the VQA task, i.e., combining Inferential Attention and Semantic Space Mapping (IASSM). Specifically, an inferential attention …

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (Nos. U1636210 and U1636211) and Beijing Natural Science Foundation of China (No. 4182037).

References (55)

  • Qi Wu, et al., Visual question answering: A survey of methods and datasets, Comput. Vis. Image Underst. (2017)
  • Jongkwang Hong, et al., Exploiting hierarchical visual features for visual question answering, Neurocomputing (2019)
  • Yun Liu, et al., Visual question answering via attention-based syntactic structure tree-LSTM, Appl. Soft Comput. (2019)
  • Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh, Vqa:...
  • Kushal Kafle, Christopher Kanan, Answer-type prediction for visual question answering, in: Proceedings of the IEEE...
  • Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, Devi Parikh, Making the v in vqa matter: Elevating the role...
  • Ramakrishna Vedantam, Karan Desai, Stefan Lee, Marcus Rohrbach, Dhruv Batra, Devi Parikh, Probabilistic neural symbolic...
  • Liang Xie, Jialie Shen, Lei Zhu, Online cross-modal hashing for web image retrieval, in: Proceedings of the Thirtieth...
  • Shuang Li, Tong Xiao, Hongsheng Li, Bolei Zhou, Dayu Yue, Xiaogang Wang, Person search with natural language...
  • Po-Yao Huang, Vaibhav, Xiaojun Chang, Alexander G. Hauptmann, Improving what cross-modal retrieval models learn...
  • Andrej Karpathy, Li Fei-Fei, Deep visual-semantic alignments for generating image descriptions, in: Proceedings of the...
  • Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan, Show and tell: A neural image caption generator, in:...
  • Qi Wu, et al., Image captioning and visual question answering based on attributes and external knowledge, IEEE Trans. Pattern Anal. Mach. Intell. (2018)
  • Pan Lu, Lei Ji, Wei Zhang, Nan Duan, Ming Zhou, Jianyong Wang, R-vqa: learning visual relation facts with semantic...
  • Dongfei Yu, Jianlong Fu, Tao Mei, Yong Rui, Multi-level attention networks for visual question answering, in:...
  • Tingting Qiao, Jianfeng Dong, Duanqing Xu, Exploring human-like attention supervision in visual question answering, in:...
  • Badri Patro, Vinay P. Namboodiri, Differential attention for visual question answering, in: Proceedings of the IEEE...
  • Junwei Liang, Lu Jiang, Liangliang Cao, Li-Jia Li, Alexander Hauptmann, Focal visual-text attention for visual question...
  • Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Tomas Mikolov, et al. Devise: A deep visual-semantic...
  • Soravit Changpinyo, Wei-Lun Chao, Fei Sha, Predicting visual exemplars of unseen classes for zero-shot learning, in:...
  • Bin Tong, Martin Klinkigt, Junwen Chen, Xiankun Cui, Quan Kong, Tomokazu Murakami, Yoshiyuki Kobayashi, Adversarial...
  • Mateusz Malinowski, Mario Fritz, A multi-world approach to question answering about real-world scenes based on...
  • Mengye Ren, Ryan Kiros, Richard S. Zemel, Exploring models and data for image question answering, in: Proceedings of...
  • Lin Ma, Zhengdong Lu, Hang Li, Learning to answer questions from image using convolutional neural network, in:...
  • Aiwen Jiang, et al., Compositional memory for visual question answering (2015)
  • Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, Wei Xu, Are you talking to a machine? dataset and methods...