Abstract
Visual conversation has recently emerged as a research area within visually grounded language modeling. It requires an intelligent agent to hold a natural-language conversation with humans about visual content. Its main difference from traditional visual question answering is that the agent must infer the answer not only by grounding the question in the image, but also from the context of the conversation history. In this paper, we propose a novel multimodal attention architecture that enables the conversation agent to focus on parts of the conversation history and on specific image regions to infer the answer from the conversation context. We evaluate our model on the VisDial dataset and demonstrate that it performs better than the current state of the art.
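The kind of multimodal attention the abstract describes can be sketched as two chained attention steps: attend over encoded dialog-history rounds conditioned on the question, then attend over image-region features conditioned on the question plus the history context. The sketch below uses plain dot-product attention and random stand-in embeddings; the paper's actual architecture, encoders, and fusion operator may differ, and all names here are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, features):
    """Dot-product attention: weight each feature vector by its
    similarity to the query, then return the weighted sum."""
    scores = features @ query            # (n,) similarity scores
    weights = softmax(scores)            # (n,) non-negative, sums to 1
    return weights @ features, weights   # context vector, attention map

rng = np.random.default_rng(0)
question = rng.standard_normal(64)        # encoded question
history = rng.standard_normal((10, 64))   # 10 encoded history rounds
regions = rng.standard_normal((49, 64))   # 7x7 grid of image-region features

# Step 1: attend over the dialog history conditioned on the question.
hist_ctx, hist_w = attend(question, history)
# Step 2: attend over image regions conditioned on question + history context.
img_ctx, img_w = attend(question + hist_ctx, regions)

# Fuse all three signals into one vector for an answer decoder/ranker.
fused = np.concatenate([question, hist_ctx, img_ctx])
print(fused.shape)  # (192,)
```

The attention maps (`hist_w`, `img_w`) are what let the model ground an answer in specific history rounds and image regions rather than in the whole input at once.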
© 2018 Springer International Publishing AG
Cite this paper
Kodra, L., Meçe, E.K. (2018). Multimodal Attention Agents in Visual Conversation. In: Barolli, L., Xhafa, F., Javaid, N., Spaho, E., Kolici, V. (eds) Advances in Internet, Data & Web Technologies. EIDWT 2018. Lecture Notes on Data Engineering and Communications Technologies, vol 17. Springer, Cham. https://doi.org/10.1007/978-3-319-75928-9_52
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-75927-2
Online ISBN: 978-3-319-75928-9
eBook Packages: Engineering (R0)