
Word-by-Word Generation of Visual Dialog Using Reinforcement Learning

  • Conference paper

Artificial Neural Networks and Machine Learning – ICANN 2022 (ICANN 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13530)


Abstract

The task of visual dialog generation requires an agent to hold a conversation: referencing the question history, putting the current question into context, and processing visual content. While previous research focused on arranging given questions to form a dialog, we tackle the more challenging task of assembling questions from words, and the dialog from those questions. We develop our model in a simple “Guess which?” game scenario in which the agent must predict which image region an oracle has selected, by asking the oracle questions. The reinforcement learning agent learns to arrange words that refer strategically to image features, acquiring the required information from the oracle, memorizing it, and making the correct prediction with an accuracy well above 80%. Imposing a cost on the number of questions asked leads to a strategy that uses few questions, while imposing a cost on the number of words used leads to more but shorter questions. Our results are a step towards making goal-directed dialog fully generic by assembling it from words, the elementary constituents of language.
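
The approach described in the abstract amounts to a policy that emits a question one word at a time and is trained from a scalar episode reward (a correct guess, minus per-question and per-word costs), in the spirit of REINFORCE. The sketch below is a minimal illustration of such a word-by-word generation loop, not the authors' implementation (their code is linked under Notes); the toy vocabulary, GRU policy, cost values, and the constant stand-in reward are all assumptions made for this example.

```python
import torch
import torch.nn as nn

# Hypothetical toy vocabulary; the real model's vocabulary comes from the task.
VOCAB = ["is", "it", "left", "right", "top", "bottom", "<eoq>", "<guess>"]
V = len(VOCAB)

class WordPolicy(nn.Module):
    """GRU policy that emits one word per time step."""
    def __init__(self, hidden=32):
        super().__init__()
        self.hidden = hidden
        self.embed = nn.Embedding(V, hidden)
        self.gru = nn.GRUCell(hidden, hidden)
        self.out = nn.Linear(hidden, V)

    def step(self, word_id, h):
        h = self.gru(self.embed(word_id), h)
        return torch.distributions.Categorical(logits=self.out(h)), h

def run_episode(policy, word_cost=0.01, question_cost=0.1, max_words=20):
    """Sample words until <guess> (or a length cap); return the REINFORCE loss."""
    h = torch.zeros(1, policy.hidden)
    word = torch.zeros(1, dtype=torch.long)       # start token (id 0 reused for brevity)
    log_probs, penalty = [], 0.0
    for _ in range(max_words):
        dist, h = policy.step(word, h)
        word = dist.sample()
        log_probs.append(dist.log_prob(word))
        penalty += word_cost                      # per-word cost favors short questions
        if VOCAB[word.item()] == "<eoq>":
            penalty += question_cost              # per-question cost favors few questions
            # (here the oracle's answer would be fed back into the state h)
        if VOCAB[word.item()] == "<guess>":
            break                                 # agent commits to a region prediction
    reward = 1.0 - penalty                        # stand-in for "guess was correct" reward
    return -torch.stack(log_probs).sum() * reward # REINFORCE: -sum(log pi) * R

policy = WordPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss = run_episode(policy)
opt.zero_grad(); loss.backward(); opt.step()
```

In the paper's setting, the recurrent state would additionally encode the image features and the oracle's answers, and the reward would depend on whether the predicted region matches the oracle's choice; subtracting a learned baseline from the reward is a common variance-reduction step.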

The authors acknowledge support from the German Research Foundation (DFG) under project Crossmodal Learning (TRR 169). Mengdi Li provided inspiration and feedback on the document.


Notes

  1. The dataset and the code of the model implementation are available at: https://github.com/ylysa/Recurrent-Attention-Model.


Author information

Corresponding author

Correspondence to Yuliia Lysa.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Lysa, Y., Weber, C., Becker, D., Wermter, S. (2022). Word-by-Word Generation of Visual Dialog Using Reinforcement Learning. In: Pimenidis, E., Angelov, P., Jayne, C., Papaleonidas, A., Aydin, M. (eds) Artificial Neural Networks and Machine Learning – ICANN 2022. ICANN 2022. Lecture Notes in Computer Science, vol 13530. Springer, Cham. https://doi.org/10.1007/978-3-031-15931-2_11


  • DOI: https://doi.org/10.1007/978-3-031-15931-2_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-15930-5

  • Online ISBN: 978-3-031-15931-2

  • eBook Packages: Computer Science, Computer Science (R0)
