DOI: 10.1145/3376067.3376082

MDAnet: Multiple Fusion Network with Double Attention for Visual Question Answering

Published: 25 February 2020

Abstract

Most existing methods for Visual Question Answering (VQA) are based on Recurrent Neural Networks (RNNs) with attention: they extract question features and output only the last hidden state for multimodal fusion. However, the last hidden state alone cannot preserve precise positional information, which may lead to semantic confusion. Our work replaces the RNN with a Multiple Encoder Block and proposes a Double Attention Module, comprising Objective Attention and Spatial Attention, to focus on the most important regions and objects. The proposed method is evaluated on VQA 2.0 and CLEVR and obtains competitive results.
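The abstract does not reproduce the model equations, but the general idea of question-guided attention over two kinds of visual features can be sketched as follows. This is a minimal PyTorch illustration under stated assumptions: the class names, feature dimensions, additive tanh scoring, and concatenation-based fusion are all hypothetical choices for exposition, not the authors' MDAnet implementation.

```python
# A minimal, hypothetical sketch of a question-guided "double attention"
# module: one branch attends over detected objects (Objective Attention),
# the other over spatial grid features (Spatial Attention). All names,
# dimensions, and the fusion strategy are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GuidedAttention(nn.Module):
    """Soft attention over a set of visual features, guided by the question."""

    def __init__(self, vis_dim: int, q_dim: int, hidden_dim: int = 512):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hidden_dim)
        self.q_proj = nn.Linear(q_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, vis: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        # vis: (B, N, vis_dim) -- N objects or spatial positions
        # q:   (B, q_dim)      -- pooled question representation
        joint = torch.tanh(self.vis_proj(vis) + self.q_proj(q).unsqueeze(1))
        alpha = F.softmax(self.score(joint), dim=1)  # (B, N, 1), sums to 1 over N
        return (alpha * vis).sum(dim=1)              # (B, vis_dim)


class DoubleAttention(nn.Module):
    """Fuses object-level and grid-level attended features with the question."""

    def __init__(self, obj_dim: int, grid_dim: int, q_dim: int, out_dim: int = 1024):
        super().__init__()
        self.obj_att = GuidedAttention(obj_dim, q_dim)   # Objective Attention
        self.spa_att = GuidedAttention(grid_dim, q_dim)  # Spatial Attention
        self.fuse = nn.Linear(obj_dim + grid_dim + q_dim, out_dim)

    def forward(self, obj_feats, grid_feats, q):
        v_obj = self.obj_att(obj_feats, q)   # attend over detected objects
        v_spa = self.spa_att(grid_feats, q)  # attend over spatial grid cells
        return self.fuse(torch.cat([v_obj, v_spa, q], dim=-1))
```

For example, with 36 Faster R-CNN object features and a 14x14 CNN grid (both 2048-dimensional) plus a 768-dimensional question vector, obj_feats would have shape (B, 36, 2048) and grid_feats (B, 196, 2048); the fused output could then feed a standard answer classifier.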

Published In

ICVIP '19: Proceedings of the 3rd International Conference on Video and Image Processing
December 2019
270 pages
ISBN: 9781450376822
DOI: 10.1145/3376067

In-Cooperation

  • Shanghai Jiao Tong University
  • Xidian University
  • Tianjin University

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Attention Mechanism
  2. Feature Fusion
  3. Multiple Feature
  4. Visual Question Answering

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICVIP 2019
