
Word-by-Word Generation of Visual Dialog Using Reinforcement Learning

  • Conference paper

Artificial Neural Networks and Machine Learning – ICANN 2022 (ICANN 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13530)


Abstract

The task of visual dialog generation requires an agent to hold a conversation: referencing the question history, putting the current question into context, and processing visual content. While previous research focused on arranging given questions to form a dialog, we tackle the more challenging task of assembling questions from words, and the dialog from those questions. We develop our model in a simple “Guess which?” game scenario in which the agent must predict which image region an oracle has selected, by asking the oracle questions. The reinforcement learning agent learns to arrange words that refer strategically to image features, acquiring the required information from the oracle, memorizing it, and making the correct prediction with an accuracy well above 80%. Imposing a cost on the number of questions asked leads to a strategy that uses few questions, while imposing a cost on the number of words used leads to more but shorter questions. Our results are a step towards making goal-directed dialog fully generic by assembling it from words, the elementary constituents of language.
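
The approach described in the abstract amounts to a policy that emits a question one word at a time and is trained from a scalar episode reward (a correct guess, minus per-question and per-word costs), in the spirit of REINFORCE. The sketch below is a minimal illustration of such a word-by-word generation loop, not the authors' implementation (their code is linked under Notes); the toy vocabulary, GRU policy, cost values, and the constant stand-in reward are all assumptions made for this example.

```python
import torch
import torch.nn as nn

# Hypothetical toy vocabulary; the real model's vocabulary comes from the task.
VOCAB = ["is", "it", "left", "right", "top", "bottom", "<eoq>", "<guess>"]
V = len(VOCAB)

class WordPolicy(nn.Module):
    """GRU policy that emits one word per time step."""
    def __init__(self, hidden=32):
        super().__init__()
        self.hidden = hidden
        self.embed = nn.Embedding(V, hidden)
        self.gru = nn.GRUCell(hidden, hidden)
        self.out = nn.Linear(hidden, V)

    def step(self, word_id, h):
        h = self.gru(self.embed(word_id), h)
        return torch.distributions.Categorical(logits=self.out(h)), h

def run_episode(policy, word_cost=0.01, question_cost=0.1, max_words=20):
    """Sample words until <guess> (or a length cap); return the REINFORCE loss."""
    h = torch.zeros(1, policy.hidden)
    word = torch.zeros(1, dtype=torch.long)       # start token (id 0 reused for brevity)
    log_probs, penalty = [], 0.0
    for _ in range(max_words):
        dist, h = policy.step(word, h)
        word = dist.sample()
        log_probs.append(dist.log_prob(word))
        penalty += word_cost                      # per-word cost favors short questions
        if VOCAB[word.item()] == "<eoq>":
            penalty += question_cost              # per-question cost favors few questions
            # (here the oracle's answer would be fed back into the state h)
        if VOCAB[word.item()] == "<guess>":
            break                                 # agent commits to a region prediction
    reward = 1.0 - penalty                        # stand-in for "guess was correct" reward
    return -torch.stack(log_probs).sum() * reward # REINFORCE: -sum(log pi) * R

policy = WordPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss = run_episode(policy)
opt.zero_grad(); loss.backward(); opt.step()
```

In the paper's setting, the recurrent state would additionally encode the image features and the oracle's answers, and the reward would depend on whether the predicted region matches the oracle's choice; subtracting a learned baseline from the reward is a common variance-reduction step.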

The authors acknowledge support from the German Research Foundation (DFG) under project Crossmodal Learning (TRR 169). Mengdi Li provided inspiration and feedback on the document.


Notes

  1. The dataset and the code of the model implementation are available at: https://github.com/ylysa/Recurrent-Attention-Model.


Author information

Corresponding author

Correspondence to Yuliia Lysa.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Lysa, Y., Weber, C., Becker, D., Wermter, S. (2022). Word-by-Word Generation of Visual Dialog Using Reinforcement Learning. In: Pimenidis, E., Angelov, P., Jayne, C., Papaleonidas, A., Aydin, M. (eds) Artificial Neural Networks and Machine Learning – ICANN 2022. ICANN 2022. Lecture Notes in Computer Science, vol 13530. Springer, Cham. https://doi.org/10.1007/978-3-031-15931-2_11


  • DOI: https://doi.org/10.1007/978-3-031-15931-2_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-15930-5

  • Online ISBN: 978-3-031-15931-2

  • eBook Packages: Computer Science, Computer Science (R0)
