Abstract
Visual conversation has recently emerged as a research area within visually grounded language modeling. It requires an intelligent agent to hold a natural-language conversation with humans about visual content. Its main difference from traditional visual question answering is that the agent must infer the answer not only by grounding the question in the image, but also from the context of the conversation history. In this paper, we propose a novel multimodal attention architecture that enables the conversation agent to focus on parts of the conversation history and on specific image regions to infer the answer from the conversation context. We evaluate our model on the VisDial dataset and demonstrate that it performs better than the current state of the art.
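The kind of multimodal attention the abstract describes can be sketched as two chained attention steps: attend over encoded dialog-history rounds conditioned on the question, then attend over image-region features conditioned on the question plus the history context. The sketch below uses plain dot-product attention and random stand-in embeddings; the paper's actual architecture, encoders, and fusion operator may differ, and all names here are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, features):
    """Dot-product attention: weight each feature vector by its
    similarity to the query, then return the weighted sum."""
    scores = features @ query            # (n,) similarity scores
    weights = softmax(scores)            # (n,) non-negative, sums to 1
    return weights @ features, weights   # context vector, attention map

rng = np.random.default_rng(0)
question = rng.standard_normal(64)        # encoded question
history = rng.standard_normal((10, 64))   # 10 encoded history rounds
regions = rng.standard_normal((49, 64))   # 7x7 grid of image-region features

# Step 1: attend over the dialog history conditioned on the question.
hist_ctx, hist_w = attend(question, history)
# Step 2: attend over image regions conditioned on question + history context.
img_ctx, img_w = attend(question + hist_ctx, regions)

# Fuse all three signals into one vector for an answer decoder/ranker.
fused = np.concatenate([question, hist_ctx, img_ctx])
print(fused.shape)  # (192,)
```

The attention maps (`hist_w`, `img_w`) are what let the model ground an answer in specific history rounds and image regions rather than in the whole input at once.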
© 2018 Springer International Publishing AG
Cite this paper
Kodra, L., Meçe, E.K. (2018). Multimodal Attention Agents in Visual Conversation. In: Barolli, L., Xhafa, F., Javaid, N., Spaho, E., Kolici, V. (eds) Advances in Internet, Data & Web Technologies. EIDWT 2018. Lecture Notes on Data Engineering and Communications Technologies, vol 17. Springer, Cham. https://doi.org/10.1007/978-3-319-75928-9_52
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-75927-2
Online ISBN: 978-3-319-75928-9
eBook Packages: Engineering (R0)