Multi-source Multi-level Attention Networks for Visual Question Answering

Published: 19 July 2019

Abstract

In recent years, Visual Question Answering (VQA) has attracted increasing attention because it requires cross-modal understanding and reasoning over vision and language. VQA aims to automatically answer natural language questions with reference to a given image. The task is challenging because reasoning in the visual domain requires a full understanding of spatial relationships, semantic concepts, and common sense about a real image. However, most existing approaches jointly embed abstract low-level visual features and high-level question features to infer answers; their reasoning ability is limited because they do not model the rich spatial context of regions, the high-level semantics of images, or knowledge across multiple sources. To address these challenges, we propose multi-source multi-level attention networks for visual question answering that support both spatial inference, via visual attention over context-aware region representations, and semantic reasoning, via attention over concepts and external knowledge. Specifically, we learn to reason over image representations with question-guided attention at different levels across multiple sources, including region-level and concept-level representations from the image as well as sentence-level representations from an external knowledge base. First, we encode region-based mid-level outputs of Convolutional Neural Networks (CNNs) into a spatially embedded representation with a multi-directional two-dimensional recurrent neural network and then locate answer-related regions with a Multi-Layer Perceptron as visual attention. Second, we generate semantic concepts from the high-level semantics of CNNs and select question-related concepts as concept attention. Third, we query semantic knowledge from a general knowledge base using the concepts and select question-related knowledge as knowledge attention. Finally, we jointly optimize visual attention, concept attention, knowledge attention, and the question embedding with a softmax classifier to infer the final answer. Extensive experiments show that the proposed approach achieves significant improvements on two very challenging VQA datasets.
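
To make the pipeline concrete, the sketch below illustrates question-guided attention over pre-extracted region, concept, and knowledge features, followed by joint fusion and an answer classifier, as described in the abstract. It is a minimal illustration in PyTorch, not the authors' implementation: the module names, hidden sizes, and the simple MLP scoring function are assumptions, and the multi-directional two-dimensional RNN encoding and knowledge-base retrieval steps are omitted (their outputs are taken as given inputs).

# Minimal sketch (assumed names and sizes, not the authors' released code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedAttention(nn.Module):
    """Scores each candidate item (region, concept, or knowledge sentence)
    against the question embedding with a small MLP, then returns the
    attention-weighted sum of the item features."""
    def __init__(self, feat_dim, ques_dim, hidden_dim=512):
        super().__init__()
        self.proj_feat = nn.Linear(feat_dim, hidden_dim)
        self.proj_ques = nn.Linear(ques_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, feats, ques):
        # feats: (B, N, feat_dim); ques: (B, ques_dim)
        h = torch.tanh(self.proj_feat(feats) + self.proj_ques(ques).unsqueeze(1))
        alpha = F.softmax(self.score(h).squeeze(-1), dim=1)         # (B, N) attention weights
        return torch.bmm(alpha.unsqueeze(1), feats).squeeze(1)      # (B, feat_dim) attended feature

class MultiSourceVQA(nn.Module):
    """Applies question-guided attention to each source (visual regions,
    semantic concepts, external knowledge sentences), concatenates the
    attended features with the question embedding, and predicts an answer."""
    def __init__(self, region_dim, concept_dim, know_dim, ques_dim, num_answers):
        super().__init__()
        self.att_region = QuestionGuidedAttention(region_dim, ques_dim)
        self.att_concept = QuestionGuidedAttention(concept_dim, ques_dim)
        self.att_know = QuestionGuidedAttention(know_dim, ques_dim)
        fused_dim = region_dim + concept_dim + know_dim + ques_dim
        self.classifier = nn.Linear(fused_dim, num_answers)

    def forward(self, regions, concepts, knowledge, ques):
        v = self.att_region(regions, ques)       # visual attention over context-aware regions
        c = self.att_concept(concepts, ques)     # concept attention
        k = self.att_know(knowledge, ques)       # knowledge attention
        fused = torch.cat([v, c, k, ques], dim=-1)
        return self.classifier(fused)            # answer logits (softmax applied in the loss)

As a usage illustration (dimensions are placeholders only): model = MultiSourceVQA(2048, 300, 2400, 1024, num_answers=3000) followed by logits = model(torch.randn(2, 36, 2048), torch.randn(2, 10, 300), torch.randn(2, 5, 2400), torch.randn(2, 1024)) produces one score per candidate answer, which is trained with a standard cross-entropy loss.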




Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 15, Issue 2s
Special Section on Cross-Media Analysis for Visual Question Answering, Special Section on Big Data, Machine Learning and AI Technologies for Art and Design, and Special Section on MMSys/NOSSDAV 2018
April 2019
381 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3343360

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 July 2019
Accepted: 01 February 2019
Revised: 01 February 2019
Received: 01 June 2018
Published in TOMM Volume 15, Issue 2s


Author Tags

  1. Visual question answering
  2. attention model
  3. multi-modal representations
  4. visual relationship

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • National Key R&D Program of China


Cited By

  • (2025) Graph-enhanced visual representations and question-guided dual attention for visual question answering. Neurocomputing, 614, 128850. https://doi.org/10.1016/j.neucom.2024.128850. Online publication date: Jan-2025.
  • (2024) Universal Relocalizer for Weakly Supervised Referring Expression Grounding. ACM Transactions on Multimedia Computing, Communications, and Applications, 20(7), 1-23. https://doi.org/10.1145/3656045. Online publication date: 4-Apr-2024.
  • (2024) Diverse Visual Question Generation Based on Multiple Objects Selection. ACM Transactions on Multimedia Computing, Communications, and Applications, 20(6), 1-22. https://doi.org/10.1145/3640014. Online publication date: 8-Mar-2024.
  • (2023) Cross-modality Multiple Relations Learning for Knowledge-based Visual Question Answering. ACM Transactions on Multimedia Computing, Communications, and Applications, 20(3), 1-22. https://doi.org/10.1145/3618301. Online publication date: 23-Oct-2023.
  • (2023) Language-guided Residual Graph Attention Network and Data Augmentation for Visual Grounding. ACM Transactions on Multimedia Computing, Communications, and Applications, 20(1), 1-23. https://doi.org/10.1145/3604557. Online publication date: 14-Jun-2023.
  • (2023) Variational Autoencoder with CCA for Audio–Visual Cross-modal Retrieval. ACM Transactions on Multimedia Computing, Communications, and Applications, 19(3s), 1-21. https://doi.org/10.1145/3575658. Online publication date: 24-Feb-2023.
  • (2023) DisCover: Disentangled Music Representation Learning for Cover Song Identification. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 453-463. https://doi.org/10.1145/3539618.3591664. Online publication date: 19-Jul-2023.
  • (2023) Enhancing visual question answering with a two-way co-attention mechanism and integrated multimodal features. Computational Intelligence, 40(1). https://doi.org/10.1111/coin.12624. Online publication date: 21-Dec-2023.
  • (2023) MUSE: Visual Analysis of Musical Semantic Sequence. IEEE Transactions on Visualization and Computer Graphics, 29(9), 4015-4030. https://doi.org/10.1109/TVCG.2022.3175364. Online publication date: 1-Sep-2023.
  • (2023) Video-to-Music Recommendation Using Temporal Alignment of Segments. IEEE Transactions on Multimedia, 25, 2898-2911. https://doi.org/10.1109/TMM.2022.3152598. Online publication date: 1-Jan-2023.
