Multimodal Attention Agents in Visual Conversation

  • Conference paper
  • In: Advances in Internet, Data & Web Technologies (EIDWT 2018)

Abstract

Visual conversation has recently emerged as a research area within visually-grounded language modeling. It requires an intelligent agent to hold a natural language conversation with humans about visual content. Its main difference from traditional visual question answering is that the agent must infer the answer not only by grounding the question in the image, but also from the context of the conversation history. In this paper, we propose a novel multimodal attention architecture that enables the conversational agent to focus on relevant parts of the conversation history and on specific image regions when inferring the answer. We evaluate our model on the VisDial dataset and show that it outperforms the current state of the art.
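The architectural details are given in the paper body; as a rough illustration of the mechanism the abstract describes (question-guided attention over both the dialog history and image regions, fused to predict the answer), the following is a minimal PyTorch sketch. All module names, dimensions, and the fusion step are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalAttention(nn.Module):
    """Illustrative sketch (not the paper's exact model): the current
    question attends over image-region features and over encodings of
    past dialog rounds; the attended contexts are fused with the
    question into a single vector for answer scoring."""

    def __init__(self, q_dim=512, img_dim=512, hist_dim=512, att_dim=256):
        super().__init__()
        # Project the question and each modality into a shared attention space.
        self.q_img = nn.Linear(q_dim, att_dim)
        self.v_img = nn.Linear(img_dim, att_dim)
        self.q_hist = nn.Linear(q_dim, att_dim)
        self.v_hist = nn.Linear(hist_dim, att_dim)
        self.w_img = nn.Linear(att_dim, 1)
        self.w_hist = nn.Linear(att_dim, 1)
        self.fuse = nn.Linear(img_dim + hist_dim + q_dim, q_dim)

    def forward(self, q, img_regions, history):
        # q: (B, q_dim) encoding of the current question
        # img_regions: (B, R, img_dim) CNN features for R image regions
        # history: (B, T, hist_dim) encodings of T previous QA rounds
        a_img = self.w_img(torch.tanh(
            self.v_img(img_regions) + self.q_img(q).unsqueeze(1)))    # (B, R, 1)
        a_hist = self.w_hist(torch.tanh(
            self.v_hist(history) + self.q_hist(q).unsqueeze(1)))      # (B, T, 1)
        # Attention-weighted sums give one context vector per modality.
        img_ctx = (F.softmax(a_img, dim=1) * img_regions).sum(dim=1)  # (B, img_dim)
        hist_ctx = (F.softmax(a_hist, dim=1) * history).sum(dim=1)    # (B, hist_dim)
        # Fused context vector, e.g. for ranking candidate answers.
        return torch.tanh(self.fuse(torch.cat([img_ctx, hist_ctx, q], dim=-1)))
```

For example, with a 7x7 convolutional feature map flattened to R = 49 regions and a 10-round history, `MultimodalAttention()(q, v, h)` maps q of shape (B, 512), v of shape (B, 49, 512), and h of shape (B, 10, 512) to a (B, 512) context vector; these shapes are, again, only assumptions for the sketch.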



Author information

Corresponding author

Correspondence to Lorena Kodra.



Copyright information

© 2018 Springer International Publishing AG

About this paper

Cite this paper

Kodra, L., Meçe, E.K. (2018). Multimodal Attention Agents in Visual Conversation. In: Barolli, L., Xhafa, F., Javaid, N., Spaho, E., Kolici, V. (eds) Advances in Internet, Data & Web Technologies. EIDWT 2018. Lecture Notes on Data Engineering and Communications Technologies, vol 17. Springer, Cham. https://doi.org/10.1007/978-3-319-75928-9_52

  • DOI: https://doi.org/10.1007/978-3-319-75928-9_52

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-75927-2

  • Online ISBN: 978-3-319-75928-9

  • eBook Packages: Engineering, Engineering (R0)
