Abstract
Answering open-ended questions in Visual Question Answering (VQA) is a challenging task. Because the answers are entirely free-form, the answer space for open-ended questions is theoretically infinite, which makes it difficult for algorithms to predict the correct answer. In this paper, we propose a method named answer distillation that shrinks the answer space and confines the correct answer to a small set of candidates. Specifically, we design a two-stage architecture to answer a question: first, an answer distillation network distills the answers, converting an open-ended question into a multiple-choice one with a short list of answer candidates; then, we make full use of the knowledge carried by the answer candidates to guide the visual attention and refine the prediction. Extensive experiments validate the effectiveness of our answer distillation architecture. The results show that our method effectively compresses the answer space and improves accuracy on the open-ended task, providing a new state-of-the-art performance on the COCO-VQA dataset.
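To make the two-stage idea concrete, below is a minimal PyTorch sketch of such a pipeline. It is our own illustrative reading of the abstract, not the authors' exact architecture: the module name `AnswerDistillationVQA`, all dimensions, and the top-k re-ranking scheme are assumptions. Stage 1 scores the full answer vocabulary and keeps the top k candidates (the "distillation"); stage 2 lets each candidate guide attention over image regions and re-ranks the candidates.

```python
# Hypothetical sketch of a two-stage "answer distillation" VQA model.
# Names, dimensions, and the re-ranking scheme are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnswerDistillationVQA(nn.Module):
    def __init__(self, vocab_size=3000, feat_dim=2048, hid_dim=512, k=10):
        super().__init__()
        self.k = k  # size of the distilled candidate set (assumed hyper-parameter)
        # Stage 1: coarse scorer over the full open-ended answer vocabulary.
        self.coarse = nn.Linear(feat_dim + hid_dim, vocab_size)
        # Stage 2: candidate-guided attention and re-ranking.
        self.ans_embed = nn.Embedding(vocab_size, hid_dim)
        self.att = nn.Linear(feat_dim + hid_dim, 1)
        self.img_proj = nn.Linear(feat_dim, hid_dim)

    def forward(self, img_regions, q_feat):
        # img_regions: (B, R, feat_dim) region features; q_feat: (B, hid_dim).
        img_mean = img_regions.mean(dim=1)
        # Stage 1: distill the open-ended answer space to k candidates,
        # turning the question into a multiple-choice one.
        logits = self.coarse(torch.cat([img_mean, q_feat], dim=-1))
        cand_ids = logits.topk(self.k, dim=-1).indices           # (B, k)
        cand_emb = self.ans_embed(cand_ids)                      # (B, k, hid_dim)
        # Stage 2: each candidate guides attention over image regions.
        B, R, _ = img_regions.shape
        img_exp = img_regions.unsqueeze(1).expand(B, self.k, R, -1)
        cand_exp = cand_emb.unsqueeze(2).expand(B, self.k, R, -1)
        att = F.softmax(
            self.att(torch.cat([img_exp, cand_exp], dim=-1)).squeeze(-1), dim=-1)
        attended = (att.unsqueeze(-1) * img_exp).sum(dim=2)      # (B, k, feat_dim)
        # Re-rank candidates by agreement between attended image and answer.
        scores = (self.img_proj(attended) * cand_emb).sum(-1)    # (B, k)
        return cand_ids, scores

# Usage with random features (2 images, 36 regions each):
model = AnswerDistillationVQA()
cand_ids, scores = model(torch.randn(2, 36, 2048), torch.randn(2, 512))
pred = cand_ids.gather(1, scores.argmax(dim=1, keepdim=True))   # final answer ids
```

Under this reading, stage 1 trades recall for tractability (the correct answer only needs to land in the top k), while stage 2 spends its capacity comparing a handful of candidates against the image rather than scoring the whole vocabulary.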
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grants 61872366 and 61472422).
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Fang, Z., Liu, J., Tang, Q., Li, Y., Lu, H. (2019). Answer Distillation for Visual Question Answering. In: Jawahar, C., Li, H., Mori, G., Schindler, K. (eds.) Computer Vision – ACCV 2018. Lecture Notes in Computer Science, vol. 11361. Springer, Cham. https://doi.org/10.1007/978-3-030-20887-5_5
DOI: https://doi.org/10.1007/978-3-030-20887-5_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-20886-8
Online ISBN: 978-3-030-20887-5
eBook Packages: Computer Science, Computer Science (R0)