DOI: 10.1145/3376067.3376082

MDAnet: Multiple Fusion Network with Double Attention for Visual Question Answering

Published: 25 February 2020

Abstract

Most existing methods for Visual Question Answering (VQA) are based on Recurrent Neural Networks (RNNs) with attention: they extract question features and output only the last hidden state for multimodal fusion. However, the last hidden state alone cannot preserve precise positional information, which may lead to semantic confusion. Our work replaces the RNN with a Multiple Encoder Block and proposes a Double Attention Module, comprising Objective Attention and Spatial Attention, to focus on the most important regions and objects. The proposed method is evaluated on VQA 2.0 and CLEVR and obtains competitive results.
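The abstract does not reproduce the model equations, but the general idea of question-guided attention over two kinds of visual features can be sketched as follows. This is a minimal PyTorch illustration under stated assumptions: the class names, feature dimensions, additive tanh scoring, and concatenation-based fusion are all hypothetical choices for exposition, not the authors' MDAnet implementation.

```python
# A minimal, hypothetical sketch of a question-guided "double attention"
# module: one branch attends over detected objects (Objective Attention),
# the other over spatial grid features (Spatial Attention). All names,
# dimensions, and the fusion strategy are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GuidedAttention(nn.Module):
    """Soft attention over a set of visual features, guided by the question."""

    def __init__(self, vis_dim: int, q_dim: int, hidden_dim: int = 512):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hidden_dim)
        self.q_proj = nn.Linear(q_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, vis: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        # vis: (B, N, vis_dim) -- N objects or spatial positions
        # q:   (B, q_dim)      -- pooled question representation
        joint = torch.tanh(self.vis_proj(vis) + self.q_proj(q).unsqueeze(1))
        alpha = F.softmax(self.score(joint), dim=1)  # (B, N, 1), sums to 1 over N
        return (alpha * vis).sum(dim=1)              # (B, vis_dim)


class DoubleAttention(nn.Module):
    """Fuses object-level and grid-level attended features with the question."""

    def __init__(self, obj_dim: int, grid_dim: int, q_dim: int, out_dim: int = 1024):
        super().__init__()
        self.obj_att = GuidedAttention(obj_dim, q_dim)   # Objective Attention
        self.spa_att = GuidedAttention(grid_dim, q_dim)  # Spatial Attention
        self.fuse = nn.Linear(obj_dim + grid_dim + q_dim, out_dim)

    def forward(self, obj_feats, grid_feats, q):
        v_obj = self.obj_att(obj_feats, q)   # attend over detected objects
        v_spa = self.spa_att(grid_feats, q)  # attend over spatial grid cells
        return self.fuse(torch.cat([v_obj, v_spa, q], dim=-1))
```

For example, with 36 Faster R-CNN object features and a 14x14 CNN grid (both 2048-dimensional) plus a 768-dimensional question vector, obj_feats would have shape (B, 36, 2048) and grid_feats (B, 196, 2048); the fused output could then feed a standard answer classifier.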

Published In

ICVIP '19: Proceedings of the 3rd International Conference on Video and Image Processing
December 2019
270 pages
ISBN: 9781450376822
DOI: 10.1145/3376067

In-Cooperation

  • Shanghai Jiao Tong University
  • Xidian University
  • Tianjin University

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Attention Mechanism
  2. Feature Fusion
  3. Multiple Feature
  4. Visual Question Answering

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICVIP 2019
