Local relation network with multilevel attention for visual question answering

https://doi.org/10.1016/j.jvcir.2020.102762

Highlights

  • We proposed local relation networks (LRNs), which provide deeper semantic information, and a multilevel attention mechanism for VQA tasks.

  • We comprehensively evaluated our model on the COCO-QA dataset and the large-scale VQA v2.0 benchmark dataset, achieving competitive results on both without additional training data.

  • We performed a detailed analysis, visualizing the outputs of the different attention levels and demonstrating the effectiveness of the proposed model.

Abstract

With the tremendous success of visual question answering (VQA) tasks, visual attention mechanisms have become an indispensable part of VQA models. However, these attention-based methods do not consider any relationships among image regions, which are crucial for the model to thoroughly understand the image. We propose local relation networks (LRNs) to generate a context-aware feature for each image region that encodes its relationships with the other image regions. Furthermore, we propose a multilevel attention mechanism to combine semantic information from the LRNs and the original image regions, making the model's decisions more reasonable. With these two measures, we improve the region representation and achieve a better attentive effect and VQA performance. We conduct numerous experiments on the COCO-QA dataset and the large-scale VQA v2.0 benchmark dataset. Our model achieves competitive results, and visual demonstrations further confirm the effectiveness of the proposed LRNs and multilevel attention mechanism.

Introduction

Visual question answering (VQA) is an interesting and challenging task that produces meaningful results by processing and fusing visual and textual data, and it involves the areas of computer vision and natural language processing. In detail, it takes an image and a corresponding free-form, open-ended, natural language question as the input, and produces a natural language answer as the output [1], [2], [3]. The question can be as easy as the categorization of an object (e.g., "What kind of sheep are there?") or as difficult as tasks requiring object detection (e.g., "How many bikes are there?"), activity recognition (e.g., "Is this man crying?"), or knowledge-base reasoning (e.g., "Is this a vegetarian pizza?"). In recent years, with the tremendous advancement in computer vision (e.g., object detection [4]) and natural language processing (e.g., word representation [5]), which enables the in-depth analysis and understanding of images and language, VQA tasks have been further developed.

Most current VQA models can be divided into three categories based on their reasoning procedures. The first category comprises relation-based methods, which view the reasoning procedure as relational reasoning. They fuse the relationships of all image-region pairs into one feature representation and then infer the answer in a single step. However, the resulting relation network (RN) is highly general and is not always sufficient for complex questions. The second category comprises attention-based methods, which view the reasoning procedure as updating an attention distribution over objects (such as image regions or bounding boxes) with variants of the attention mechanism, and gradually infer the answer. The attention mechanism [7] provides a way to capture question-relevant regions and to accurately represent the desired visual information. It takes multiple region representations and a question representation as the input and calculates correlation values between them; a larger value indicates that the corresponding region is more relevant to the question. Early attention-based VQA models divide the image into multiple grids or region proposals to capture fine-grained information for an in-depth understanding of the image, and they use the attention mechanism to obtain the final image feature representation. Attention-based models then combine the question representation and the final image feature to infer the answer. However, these VQA models do not pay sufficient attention to determining which image regions should be attended to. [6] takes advantage of bottom-up region proposals (e.g., Faster R-CNN [4]) that capture more meaningful image regions, which are not limited to gridded regions and can better delineate objects with irregular boundaries, and combines top-down and bottom-up attention, achieving immense success in image captioning and VQA tasks. The third category comprises module-based methods, which view the reasoning procedure as a layout generated from manually predefined modules; this layout is used to instantiate modular networks.
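
To make the attention computation above concrete, the following PyTorch sketch scores each region feature against a question representation and pools the regions by their normalized scores. The layer sizes, the concatenation-based scoring MLP, and the module name RegionAttention are illustrative assumptions, not the exact design of [6], [7] or of the proposed model.

```python
# Minimal sketch of question-guided (top-down) attention over image regions.
# Dimensions and the scoring MLP are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionAttention(nn.Module):
    def __init__(self, region_dim=2048, question_dim=512, hidden_dim=512):
        super().__init__()
        # Scores each (region, question) pair; a larger score means the
        # region is considered more relevant to the question.
        self.score = nn.Sequential(
            nn.Linear(region_dim + question_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, regions, question):
        # regions:  (batch, num_regions, region_dim), e.g. bottom-up features
        # question: (batch, question_dim), e.g. an RNN encoding of the question
        num_regions = regions.size(1)
        q = question.unsqueeze(1).expand(-1, num_regions, -1)
        logits = self.score(torch.cat([regions, q], dim=-1)).squeeze(-1)
        weights = F.softmax(logits, dim=-1)           # attention distribution
        attended = (weights.unsqueeze(-1) * regions).sum(dim=1)
        return attended, weights                      # pooled feature + weights


# Usage with random tensors (36 bottom-up regions per image).
attn = RegionAttention()
pooled, weights = attn(torch.randn(4, 36, 2048), torch.randn(4, 512))
```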

Despite the immense success of the top-down and bottom-up approach [6], the relationships among the image regions remain neglected, although they contain crucial information for understanding the image. In contrast, the RNs used in relation-based approaches can better capture the relationships among the image regions and reason over them to answer relational questions. In view of the above, in this study we propose a network that combines an RN with multilevel attention to jointly exploit the advantages of both, enhancing the model's reasoning ability while obtaining a thorough understanding of the image. Inspired by [6], [8], we propose a local relation network (LRN), which captures the relationships among the image regions and produces a more semantic, context-aware feature for each image region. Further, we propose a multilevel attention mechanism to converge both relational and original region information. A visual comparison of bottom-up attention and the proposed multilevel attention is shown in Fig. 1. We conducted numerous experiments on the COCO-QA dataset and the large-scale VQA v2.0 benchmark dataset, and achieved competitive results without additional training data.

In summary, our model uses a local relation network and a multilevel attention mechanism to overcome the shortcoming of attention-based models that do not consider the relationships between image regions, and it achieves better performance on the VQA v2.0 and COCO-QA datasets. The main contributions of this study are summarized as follows:

  • We proposed LRNs, which provide deeper semantic information, and a multilevel attention mechanism for VQA tasks.

  • We comprehensively evaluated our model on the COCO-QA dataset and the large-scale VQA v2.0 benchmark dataset, achieving competitive results on both without additional training data.

  • We performed a detailed analysis, visualizing the outputs of the different attention levels and demonstrating the effectiveness of the proposed model.


Relation-based VQA models

Relational reasoning is a central component of intelligent behavior and is required by VQA tasks. Relation-based methods perform one-step relational reasoning to infer the answer. [8] proposed a plug-and-play module called the "relation network" (RN), which models all the interactions between the regions in an image and applies multilayer perceptrons (MLPs) to calculate all the relations. The relations are then summed and passed through further MLPs to infer the final answer.
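
The sketch below outlines this pairwise computation in the spirit of [8]: an MLP g is applied to every ordered pair of region features together with the question encoding, the outputs are summed, and further MLPs map the sum to answer scores. The layer widths and the answer-vocabulary size are assumptions for illustration.

```python
# Sketch of a relation network (RN) as described in [8]: g is applied to all
# ordered region pairs (conditioned on the question), the outputs are summed,
# and f maps the aggregate to answer logits. Layer widths are assumptions.
import torch
import torch.nn as nn

class RelationNetwork(nn.Module):
    def __init__(self, region_dim=2048, question_dim=512,
                 hidden_dim=256, num_answers=3000):
        super().__init__()
        self.g = nn.Sequential(
            nn.Linear(2 * region_dim + question_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )
        self.f = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_answers),
        )

    def forward(self, regions, question):
        # regions: (batch, n, region_dim); question: (batch, question_dim)
        b, n, d = regions.shape
        ri = regions.unsqueeze(2).expand(b, n, n, d)    # region i
        rj = regions.unsqueeze(1).expand(b, n, n, d)    # region j
        q = question.view(b, 1, 1, -1).expand(b, n, n, question.size(-1))
        pairs = torch.cat([ri, rj, q], dim=-1)          # all ordered pairs
        relations = self.g(pairs).sum(dim=(1, 2))       # sum over all pairs
        return self.f(relations)                        # answer logits
```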

Method

The proposed VQA model is an extended attention model based on the top-down and bottom-up attention models [6], [25]; however, it differs in three aspects. First, we expect the model to pay attention to the associated image regions, while noting the original image region itself, to obtain a more meaningful and context-aware image feature representation for better image understanding; we propose LRNs based on [8] for this purpose. Next, we propose multilevel attention, which has the same
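
Because the method description above is only a snippet, the following is a hypothetical sketch rather than the authors' exact formulation: an LRN built on [8] that computes pairwise relation features for each region and aggregates them per region, so that every region obtains a context-aware representation which a later attention level can combine with the original region features. The dimensions, the question conditioning, and the mean aggregation are assumptions.

```python
# Hypothetical sketch of a local relation network (LRN): unlike the global RN
# of [8], it keeps one aggregated relation feature per region, yielding a
# context-aware representation for each region. All details (dimensions, mean
# aggregation, question conditioning) are assumptions, since the method
# section here is only a snippet.
import torch
import torch.nn as nn

class LocalRelationNetwork(nn.Module):
    def __init__(self, region_dim=2048, question_dim=512, out_dim=1024):
        super().__init__()
        self.g = nn.Sequential(
            nn.Linear(2 * region_dim + question_dim, out_dim),
            nn.ReLU(),
            nn.Linear(out_dim, out_dim),
            nn.ReLU(),
        )

    def forward(self, regions, question):
        # regions: (batch, n, region_dim); question: (batch, question_dim)
        b, n, d = regions.shape
        ri = regions.unsqueeze(2).expand(b, n, n, d)
        rj = regions.unsqueeze(1).expand(b, n, n, d)
        q = question.view(b, 1, 1, -1).expand(b, n, n, question.size(-1))
        pair_feats = self.g(torch.cat([ri, rj, q], dim=-1))
        # Aggregate over partner regions j, keeping one relational feature per
        # region i; a multilevel attention stage could then attend over these
        # relational features and the original region features separately.
        return pair_feats.mean(dim=2)                   # (batch, n, out_dim)
```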

Experiments

To validate the effectiveness of our proposed model, we conducted numerous experiments on the COCO-QA [2] and VQA v2.0 [3] datasets, comparing the proposed method with existing reported results and presenting several visualization results. Note that we trained our model and conducted comparative experiments on both datasets with the same hyperparameters.

Conclusion

In this study, we proposed an LRN and a multilevel attention mechanism to address the problem that attention-based models do not pay attention to the relationships among image regions, which can be crucial for understanding the image. Empirical experiments and visual demonstrations on the COCO-QA dataset and the large-scale VQA v2.0 benchmark dataset established that the region representation was improved and that a better attentive effect and VQA performance were achieved by the proposed model.

CRediT authorship contribution statement

Bo Sun: Conceptualization, Methodology. Zeng Yao: Conceptualization, Methodology, Software, Writing - original draft. Yinghui Zhang: Writing - review & editing. Lejun Yu: .

Declaration of Competing Interest

We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and that there is no professional or other personal interest of any nature or kind in any product, service, and/or company that could be construed as influencing the position presented in, or the review of, the manuscript entitled "Local Relation Network with Multilevel Attention for Visual Question Answering."

References (38)

  • S. Antol et al., VQA: Visual question answering
  • M. Ren et al., Exploring models and data for image question answering, Advances in Neural Information Processing Systems (2015)
  • Y. Goyal et al., Making the V in VQA matter: Elevating the role of image understanding in visual question answering
  • S. Ren et al., Faster R-CNN: Towards real-time object detection with region proposal networks
  • J. Pennington et al., GloVe: Global vectors for word representation
  • P. Anderson et al., Bottom-up and top-down attention for image captioning and visual question answering
  • K. Xu et al., Show, attend and tell: Neural image caption generation with visual attention
  • A. Santoro et al., A simple neural network module for relational reasoning
  • C. Wu et al., Chain of reasoning for visual question answering
  • K. Chen, J. Wang, L.-C. Chen, H. Gao, W. Xu, R. Nevatia, ABC-CNN: An attention based convolutional neural network for...
  • Z. Yang et al., Stacked attention networks for image question answering
  • H. Xu et al., Ask, attend and answer: Exploring question-guided spatial attention for visual question answering
  • J. Lu et al., Hierarchical question-image co-attention for visual question answering
  • H. Nam et al., Dual attention networks for multimodal reasoning and matching
  • I. Schwartz et al., High-order attention models for visual question answering
  • A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, M. Rohrbach, Multimodal compact bilinear pooling for visual...
  • J.-H. Kim, K.-W. On, W. Lim, J. Kim, J.-W. Ha, B.-T. Zhang, Hadamard product for low-rank bilinear pooling, arXiv...
  • Z. Yu et al., Multi-modal factorized bilinear pooling with co-attention learning for visual question answering
  • Z. Yu et al., Beyond bilinear: Generalized multimodal factorized high-order pooling for visual question answering, IEEE Trans. Neural Networks Learn. Syst. (2018)

This paper has been recommended for acceptance by Zicheng Liu.
