Abstract
GuessWhat?! is a task-oriented visual dialogue task with two players, a guesser and an oracle. The guesser aims to locate the object chosen by the oracle by asking a series of Yes/No questions, which the oracle answers. Asking appropriate questions is crucial to achieving the final goal of the task. Previous methods generally use a word-level generator, which struggles to grasp a dialogue-level questioning strategy and often generates repeated or useless questions. This paper proposes a sentence-level category-based strategy-driven question generator (CSQG), which explicitly provides a category-based questioning strategy to the generator. We first encode the image and the dialogue history to decide the category of the next question to be generated. The question is then generated with the help of the category-based dialogue strategy as well as the encodings of both the image and the dialogue history. Evaluation on the large-scale visual dialogue dataset GuessWhat?! shows that our method helps the guesser achieve a 51.71% success rate, which is the state of the art among supervised training methods.
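The two-stage pipeline described in the abstract (encode image and history, predict the next question's category, then generate a category-conditioned question) can be illustrated with a minimal sketch. All names here (the category set, the linear scorer, the template-based decoder) are hypothetical stand-ins for the paper's learned components, not the actual CSQG implementation:

```python
# Hypothetical question categories for a CSQG-style generator.
CATEGORIES = ["object", "color", "position", "action"]

def encode(image_feats, history_feats):
    # Toy encoder: fuse pooled image and dialogue-history features.
    # In the paper this is a learned encoder over both modalities.
    return [i + h for i, h in zip(image_feats, history_feats)]

def predict_category(state, weights):
    # Linear scorer over categories; argmax selects the category of
    # the next question (the "strategy" step).
    scores = {c: sum(w * s for w, s in zip(weights[c], state))
              for c in CATEGORIES}
    return max(scores, key=scores.get)

def generate_question(category):
    # Stand-in for the sentence-level decoder: here, a fixed
    # category-conditioned template instead of a learned generator.
    templates = {
        "object": "Is it a person?",
        "color": "Is it red?",
        "position": "Is it on the left side?",
        "action": "Is it moving?",
    }
    return templates[category]

# One generation step with illustrative feature vectors and weights.
state = encode([0.2, 0.5, 0.1], [0.3, 0.0, 0.4])
weights = {
    "object": [1.0, 0.0, 0.0],
    "color": [0.0, 1.0, 0.0],
    "position": [0.0, 0.0, 1.0],
    "action": [0.3, 0.3, 0.3],
}
category = predict_category(state, weights)
question = generate_question(category)
```

The point of the sketch is the decomposition: the category decision is made at the dialogue level before any words are produced, so the generator is steered by an explicit strategy rather than drifting word by word.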
Acknowledgments
We would like to thank the anonymous reviewers for their suggestions and comments. The work was supported by the National Natural Science Foundation of China (NSFC62076032) and the Cooperation Project with Beijing SanKuai Technology Co., Ltd.
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Shi, Y., Tan, Y., Feng, F., Zheng, C., Wang, X. (2021). Category-Based Strategy-Driven Question Generator for Visual Dialogue. In: Li, S., et al. Chinese Computational Linguistics. CCL 2021. Lecture Notes in Computer Science(), vol 12869. Springer, Cham. https://doi.org/10.1007/978-3-030-84186-7_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-84185-0
Online ISBN: 978-3-030-84186-7