Abstract
Answering open-ended questions in Visual Question Answering (VQA) is a challenging task. Because the answers are entirely free-form, the answer space for open-ended questions is theoretically infinite, which makes it difficult for algorithms to predict the correct answer. In this paper, we propose a method named answer distillation that shrinks the answer space and confines the correct answer to a small set of candidates. Specifically, we design a two-stage architecture to answer a question: first, an answer distillation network distills the answers, converting an open-ended question into a multiple-choice one with a short list of answer candidates; then, we make full use of the knowledge carried by the answer candidates to guide the visual attention and refine the prediction. Extensive experiments validate the effectiveness of our answer distillation architecture. The results show that our method effectively compresses the answer space and improves accuracy on the open-ended task, providing a new state-of-the-art performance on the COCO-VQA dataset.
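To make the two-stage idea concrete, below is a minimal PyTorch sketch of such a pipeline. It is our own illustrative reading of the abstract, not the authors' exact architecture: the module name `AnswerDistillationVQA`, all dimensions, and the top-k re-ranking scheme are assumptions. Stage 1 scores the full answer vocabulary and keeps the top k candidates (the "distillation"); stage 2 lets each candidate guide attention over image regions and re-ranks the candidates.

```python
# Hypothetical sketch of a two-stage "answer distillation" VQA model.
# Names, dimensions, and the re-ranking scheme are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnswerDistillationVQA(nn.Module):
    def __init__(self, vocab_size=3000, feat_dim=2048, hid_dim=512, k=10):
        super().__init__()
        self.k = k  # size of the distilled candidate set (assumed hyper-parameter)
        # Stage 1: coarse scorer over the full open-ended answer vocabulary.
        self.coarse = nn.Linear(feat_dim + hid_dim, vocab_size)
        # Stage 2: candidate-guided attention and re-ranking.
        self.ans_embed = nn.Embedding(vocab_size, hid_dim)
        self.att = nn.Linear(feat_dim + hid_dim, 1)
        self.img_proj = nn.Linear(feat_dim, hid_dim)

    def forward(self, img_regions, q_feat):
        # img_regions: (B, R, feat_dim) region features; q_feat: (B, hid_dim).
        img_mean = img_regions.mean(dim=1)
        # Stage 1: distill the open-ended answer space to k candidates,
        # turning the question into a multiple-choice one.
        logits = self.coarse(torch.cat([img_mean, q_feat], dim=-1))
        cand_ids = logits.topk(self.k, dim=-1).indices           # (B, k)
        cand_emb = self.ans_embed(cand_ids)                      # (B, k, hid_dim)
        # Stage 2: each candidate guides attention over image regions.
        B, R, _ = img_regions.shape
        img_exp = img_regions.unsqueeze(1).expand(B, self.k, R, -1)
        cand_exp = cand_emb.unsqueeze(2).expand(B, self.k, R, -1)
        att = F.softmax(
            self.att(torch.cat([img_exp, cand_exp], dim=-1)).squeeze(-1), dim=-1)
        attended = (att.unsqueeze(-1) * img_exp).sum(dim=2)      # (B, k, feat_dim)
        # Re-rank candidates by agreement between attended image and answer.
        scores = (self.img_proj(attended) * cand_emb).sum(-1)    # (B, k)
        return cand_ids, scores

# Usage with random features (2 images, 36 regions each):
model = AnswerDistillationVQA()
cand_ids, scores = model(torch.randn(2, 36, 2048), torch.randn(2, 512))
pred = cand_ids.gather(1, scores.argmax(dim=1, keepdim=True))   # final answer ids
```

Under this reading, stage 1 trades recall for tractability (the correct answer only needs to land in the top k), while stage 2 spends its capacity comparing a handful of candidates against the image rather than scoring the whole vocabulary.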
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grants 61872366 and 61472422).
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Fang, Z., Liu, J., Tang, Q., Li, Y., Lu, H. (2019). Answer Distillation for Visual Question Answering. In: Jawahar, C., Li, H., Mori, G., Schindler, K. (eds.) Computer Vision – ACCV 2018. Lecture Notes in Computer Science, vol. 11361. Springer, Cham. https://doi.org/10.1007/978-3-030-20887-5_5
DOI: https://doi.org/10.1007/978-3-030-20887-5_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-20886-8
Online ISBN: 978-3-030-20887-5
eBook Packages: Computer Science, Computer Science (R0)