Category-Based Strategy-Driven Question Generator for Visual Dialogue

  • Conference paper
Chinese Computational Linguistics (CCL 2021)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12869)

Abstract

GuessWhat?! is a task-oriented visual dialogue task with two players, a guesser and an oracle. The guesser aims to locate the object the oracle has in mind by asking a series of Yes/No questions, which the oracle answers. Asking proper questions is crucial to achieving the final goal of the whole task. Previous methods generally use a word-level generator, which struggles to grasp a dialogue-level questioning strategy and often generates repeated or useless questions. This paper proposes a sentence-level, category-based strategy-driven question generator (CSQG) that explicitly provides a category-based questioning strategy for the generator. First, we encode the image and the dialogue history to decide the category of the next question to be generated. Then the question is generated with the help of the category-based dialogue strategy as well as the encodings of both the image and the dialogue history. Evaluation on the large-scale visual dialogue dataset GuessWhat?! shows that our method helps the guesser achieve a 51.71% success rate, the state of the art among supervised training methods.
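
To make the two-stage pipeline in the abstract concrete, here is a minimal PyTorch sketch written from the description alone: encode the image and dialogue history, predict the category of the next question, then decode the question conditioned on that category. All names (CSQGSketch, cat_head, etc.), the choice of GRU encoders, and the feature sizes are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    class CSQGSketch(nn.Module):
        """Hypothetical two-stage generator: (1) fuse image and
        dialogue-history encodings, (2) predict the category of the next
        question, (3) decode the question conditioned on that category."""

        def __init__(self, vocab_size, n_categories, img_dim=2048, hid=512):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, hid)
            self.hist_enc = nn.GRU(hid, hid, batch_first=True)     # dialogue-history encoder
            self.img_proj = nn.Linear(img_dim, hid)                # image features -> hidden space
            self.cat_head = nn.Linear(2 * hid, n_categories)       # stage 1: next-question category
            self.cat_embed = nn.Embedding(n_categories, hid)
            self.decoder = nn.GRU(2 * hid, hid, batch_first=True)  # stage 2: conditioned decoder
            self.out = nn.Linear(hid, vocab_size)

        def forward(self, img_feat, hist_tokens, question_in):
            # Encode the dialogue so far and fuse it with the image encoding.
            _, h = self.hist_enc(self.embed(hist_tokens))              # h: (1, B, hid)
            ctx = torch.cat([self.img_proj(img_feat), h[-1]], dim=-1)  # (B, 2*hid)
            # Stage 1: decide the category of the next question.
            cat_logits = self.cat_head(ctx)
            category = cat_logits.argmax(dim=-1)                       # hard choice at inference
            # Stage 2: feed the category embedding at every decoding step.
            cond = self.cat_embed(category).unsqueeze(1)               # (B, 1, hid)
            emb = self.embed(question_in)                              # (B, T, hid)
            dec_in = torch.cat([emb, cond.expand(-1, emb.size(1), -1)], dim=-1)
            dec_out, _ = self.decoder(dec_in, h)
            return cat_logits, self.out(dec_out)                       # category + token logits

At training time one would presumably supervise both heads, e.g. cross-entropy on cat_logits against annotated question categories and teacher-forced cross-entropy on the token logits; at inference the predicted category drives the decoder.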

Acknowledgments

We would like to thank the anonymous reviewers for their suggestions and comments. The work was supported by the National Natural Science Foundation of China (NSFC 62076032) and the Cooperation Project with Beijing SanKuai Technology Co., Ltd.

Author information

Corresponding author

Correspondence to Xiaojie Wang.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Shi, Y., Tan, Y., Feng, F., Zheng, C., Wang, X. (2021). Category-Based Strategy-Driven Question Generator for Visual Dialogue. In: Li, S., et al. (eds.) Chinese Computational Linguistics. CCL 2021. Lecture Notes in Computer Science (LNAI), vol. 12869. Springer, Cham. https://doi.org/10.1007/978-3-030-84186-7_12

  • DOI: https://doi.org/10.1007/978-3-030-84186-7_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-84185-0

  • Online ISBN: 978-3-030-84186-7

  • eBook Packages: Computer Science, Computer Science (R0)
