Deep Multimodal Reinforcement Network with Contextually Guided Recurrent Attention for Image Question Answering

Jiang, Ai-Wen; Liu, Bo; Wang, Ming-Wen

doi:10.1007/s11390-017-1755-6

Regular Paper
Published: 14 July 2017

Volume 32, pages 738–748 (2017)

Journal of Computer Science and Technology

Ai-Wen Jiang¹,
Bo Liu² &
Ming-Wen Wang¹

Abstract

Image question answering (IQA) has emerged as a promising interdisciplinary topic spanning computer vision and natural language processing. In this paper, we propose a contextually guided recurrent attention model for IQA. The model is a multimodal recurrent neural network trained with deep reinforcement learning. Guided by compositional contextual information, it recurrently decides where to look using a reinforcement learning strategy. Unlike traditional “static” soft attention, it can be regarded as a kind of “dynamic” attention whose objective is built on reinforcement rewards designed specifically for IQA. The compositional information it finally learns incorporates both global context and informative local details, which is shown to benefit answer generation. The proposed method is compared with several state-of-the-art methods on two public IQA datasets, COCO-QA and VQA, both derived from the MS COCO dataset. The experimental results show that the proposed model outperforms those methods.
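The contrast the abstract draws between “static” soft attention and reward-driven “dynamic” attention can be illustrated with a toy example. The sketch below is not the model proposed in the paper: it assumes a single glimpse over three hypothetical region features, a hypothetical reward of 1 when the sampled region is the informative one, and a plain REINFORCE update on the attention logits. All function and variable names are illustrative.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(regions, scores):
    """Soft attention: probability-weighted average of all region features."""
    probs = softmax(scores)
    dim = len(regions[0])
    return [sum(p * r[d] for p, r in zip(probs, regions)) for d in range(dim)]


def sample_region(scores, rng):
    """Hard attention: sample one region index from the attention policy."""
    probs = softmax(scores)
    idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return idx, probs


def reinforce_grad(probs, idx, reward, baseline=0.0):
    """REINFORCE gradient w.r.t. the logits for one sampled action:
    (reward - baseline) * (one_hot(idx) - probs)."""
    adv = reward - baseline
    return [adv * ((1.0 if i == idx else 0.0) - p) for i, p in enumerate(probs)]


# Toy setup (all values are assumptions made for illustration).
rng = random.Random(0)
regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # hypothetical region features
scores = [0.0, 0.0, 0.0]                        # attention logits, start uniform
informative = 1                                 # hypothetical rewarded region
lr = 0.5

soft_context = soft_attention(regions, scores)  # blends every region

for _ in range(200):
    idx, probs = sample_region(scores, rng)
    reward = 1.0 if idx == informative else 0.0
    grad = reinforce_grad(probs, idx, reward)
    scores = [s + lr * g for s, g in zip(scores, grad)]  # gradient ascent on reward

final_probs = softmax(scores)
print("soft context:", soft_context)
print("learned attention policy:", final_probs)
```

Soft attention always mixes every region, whereas the sampled policy shifts probability toward whichever region earns reward. In this toy setting the reward is given directly; in an IQA model it would presumably come from answer correctness.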

Author information

Authors and Affiliations

College of Computer and Information Engineering, Jiangxi Normal University, Nanchang, 330022, China
Ai-Wen Jiang & Ming-Wen Wang
College of Computer Science and Software Engineering, Auburn University, Auburn, AL 36849, U.S.A.
Bo Liu

Corresponding author

Correspondence to Ming-Wen Wang.

About this article

Cite this article

Jiang, AW., Liu, B. & Wang, MW. Deep Multimodal Reinforcement Network with Contextually Guided Recurrent Attention for Image Question Answering. J. Comput. Sci. Technol. 32, 738–748 (2017). https://doi.org/10.1007/s11390-017-1755-6

Received: 19 December 2016
Revised: 26 May 2017
Published: 14 July 2017
Issue Date: July 2017
DOI: https://doi.org/10.1007/s11390-017-1755-6
