Abstract
An agent that adaptively asks the user questions to gather information is a crucial element of a real-world artificial intelligence agent. In particular, goal-oriented visual dialogue, in which an agent locates an object of interest among a group of visually presented objects by asking verbal questions, requires question generation that efficiently narrows down and identifies the target. Several models based on GuessWhat?! and CLEVR Ask have been published, most of which leverage reinforcement learning to maximize the task success rate. However, existing models ask questions up to a predefined limit, which results in redundant questions. Moreover, the generated questions often refer to only a limited number of objects, preventing efficient narrowing and coverage of a wide range of attributes. This paper proposes the Two-Stream Splitter (TSS) to reduce redundant questions and generate questions efficiently. TSS applies a self-attention structure to the image features and location features of objects, enabling efficient narrowing of candidate objects by combining the information content of both streams. Experimental results on the CLEVR Ask dataset show that the proposed method reduces redundant questions and enables more efficient interaction than previous models.
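As a rough illustration of the two-stream idea, the sketch below fuses per-object image features and location features and runs single-head scaled dot-product self-attention over the objects. This is not the authors' implementation; all dimensions, weight matrices, and names here are arbitrary placeholders chosen only to make the mechanism concrete.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obj, d_img, d_loc, d_model = 5, 8, 4, 16

# Hypothetical per-object features (stand-ins for real extractors):
img_feats = rng.standard_normal((n_obj, d_img))  # e.g. CNN crop features
loc_feats = rng.standard_normal((n_obj, d_loc))  # e.g. normalized bounding boxes

# Fuse both streams by projecting their concatenation into a shared space.
W_in = rng.standard_normal((d_img + d_loc, d_model))
x = np.concatenate([img_feats, loc_feats], axis=1) @ W_in  # (n_obj, d_model)

# Single-head scaled dot-product self-attention over the candidate objects.
W_q = rng.standard_normal((d_model, d_model))
W_k = rng.standard_normal((d_model, d_model))
W_v = rng.standard_normal((d_model, d_model))
q, k, v = x @ W_q, x @ W_k, x @ W_v

scores = q @ k.T / np.sqrt(d_model)  # (n_obj, n_obj) pairwise attention logits
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)  # row-wise softmax
out = attn @ v  # contextualized object representations, (n_obj, d_model)

print(out.shape)  # (5, 16)
```

Because each object attends to every other object, the contextualized representations can encode which attributes best split the remaining candidates, which is the kind of information a question generator needs to avoid redundant questions.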
References
De Vries, H., Strub, F., Chandar, S., Pietquin, O., Larochelle, H., Courville, A.: GuessWhat?! visual object discovery through multi-modal dialogue. In: CVPR, pp. 5503–5512 (2017). https://hal.inria.fr/hal-01549641
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778 (2016)
Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Zitnick, C.L., Girshick, R.: CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning. In: CVPR, pp. 1988–1997 (2017). https://doi.org/10.1109/CVPR.2017.215
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Lin, T.Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Matsumori, S., Shingyouchi, K., Abe, Y., Fukuchi, Y., Sugiura, K., Imai, M.: Unified questioner transformer for descriptive question generation in goal-oriented visual dialogue. arXiv preprint arXiv:2106.15550 (2021)
Pang, W., Wang, X.: Guessing state tracking for visual dialogue. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12361, pp. 683–698. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58517-4_40
Pang, W., Wang, X.: Visual dialogue state tracking for question generation. In: AAAI, vol. 34, pp. 11831–11838 (2020)
Shekhar, R., Baumgärtner, T., Venkatesh, A., Bruni, E., Bernardi, R., Fernandez, R.: Ask no more: deciding when to guess in referential visual dialogue. In: COLING, pp. 1218–1233. Association for Computational Linguistics, Santa Fe, New Mexico, USA (2018). https://www.aclweb.org/anthology/C18-1104
Shukla, P., Elmadjian, C., Sharan, R., Kulkarni, V., Turk, M., Wang, W.Y.: What should I ask? using conversationally informative rewards for goal-oriented visual dialog. In: ACL, pp. 6442–6451. Association for Computational Linguistics, Florence, Italy, July 2019. https://doi.org/10.18653/v1/P19-1646
Strub, F., de Vries, H., Mary, J., Piot, B., Courville, A., Pietquin, O.: End-to-end optimization of goal-driven and visually grounded dialogue systems. In: IJCAI, pp. 2765–2771 (2017). https://doi.org/10.24963/ijcai.2017/385
Sutton, R.S., McAllester, D.A., Singh, S.P., Mansour, Y.: Policy gradient methods for reinforcement learning with function approximation. In: NeurIPS, pp. 1057–1063 (2000)
Vaswani, A., et al.: Attention is all you need. In: NeurIPS, pp. 5998–6008 (2017)
Williams, R.J.: Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 8(3–4), 229–256 (1992)
Yu, L., Tan, H., Bansal, M., Berg, T.L.: A joint speaker-listener-reinforcer model for referring expressions. In: CVPR, pp. 7282–7290 (2017)
Acknowledgments
This work was supported by JSPS KAKENHI Grant Number JP21J13789 and JST CREST Grant Number JPMJCR19A1, Japan.
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Kanazawa, S., Matsumori, S., Imai, M.: Improving goal-oriented visual dialogue by asking fewer questions. In: Mantoro, T., Lee, M., Ayu, M.A., Wong, K.W., Hidayanto, A.N. (eds.) Neural Information Processing. ICONIP 2021. LNCS, vol. 13109. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-92270-2_14
Print ISBN: 978-3-030-92269-6
Online ISBN: 978-3-030-92270-2