
Multi-round Dialogue State Tracking by Object-Entity Alignment in Visual Dialog

  • Conference paper
Artificial Intelligence (CICAI 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14473)


Abstract

Visual Dialog (VD) is a task in which an agent answers a series of image-related questions based on a multi-round dialog history. However, previous VD methods often treat the entire dialog history as a simple text input, disregarding the round-level conversational information flow. In this paper, we introduce the Multi-round Dialogue State Tracking model (MDST), a framework that addresses this limitation by leveraging the dialogue state learned from the dialog history to answer questions. MDST captures each round of dialog history, constructing internal dialogue state representations defined as 2-tuples of vision-language representations. These representations effectively ground the current question, enabling the generation of accurate answers. Experimental results on the VisDial v1.0 dataset demonstrate that MDST achieves new state-of-the-art performance in the generative setting. Furthermore, through a series of human studies, we validate that MDST generates long, consistent, and human-like answers while consistently answering a series of questions correctly.
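To make the abstract's description concrete, the sketch below illustrates one plausible reading of round-level state tracking: a dialogue state held as a 2-tuple of (vision, language) representations, updated after each past question-answer round and then used to ground the current question. All module choices, names, and dimensions here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DialogueStateTracker(nn.Module):
    """Hypothetical sketch of MDST-style tracking: the dialogue state is a
    2-tuple (vision state, language state), updated once per dialog round
    and used to ground the current question. Update rules and dimensions
    are assumptions for illustration only."""

    def __init__(self, dim: int = 512):
        super().__init__()
        # Cross-attention grounds the current question in the tracked state.
        self.ground = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # GRU cells carry the round-by-round state updates.
        self.v_update = nn.GRUCell(dim, dim)
        self.l_update = nn.GRUCell(dim, dim)

    def forward(self, question, rounds, v_feats):
        # question: (B, Lq, dim) current-question token features
        # rounds:   list of (B, dim) pooled QA-pair features, one per past round
        # v_feats:  (B, R, dim) object-level image features
        B, dim = question.size(0), question.size(2)
        v_state = v_feats.mean(dim=1)          # initial vision state
        l_state = question.new_zeros(B, dim)   # initial language state
        for qa in rounds:                      # round-level information flow
            v_state = self.v_update(qa, v_state)
            l_state = self.l_update(qa, l_state)
        # Ground the current question in the (vision, language) 2-tuple.
        state = torch.stack([v_state, l_state], dim=1)  # (B, 2, dim)
        grounded, _ = self.ground(question, state, state)
        return grounded  # would be fed to an answer decoder

# Illustrative usage with random tensors (batch of 2, 10-token question,
# 3 past rounds, 36 object features):
tracker = DialogueStateTracker(dim=512)
q = torch.randn(2, 10, 512)
history = [torch.randn(2, 512) for _ in range(3)]
objs = torch.randn(2, 36, 512)
out = tracker(q, history, objs)  # (2, 10, 512)
```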



Acknowledgements

We thank the reviewers for their comments and suggestions. This paper was partially supported by the National Natural Science Foundation of China (NSFC 62076032), Huawei Noah's Ark Lab, the MoE-CMCC "Artificial Intelligence" Project (No. MCM20190701), the Beijing Natural Science Foundation (Grant No. 4204100), and the BUPT Excellent Ph.D. Students Foundation (No. CX2020309).

Author information

Corresponding author: Wei Pang.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Pang, W. (2024). Multi-round Dialogue State Tracking by Object-Entity Alignment in Visual Dialog. In: Fang, L., Pei, J., Zhai, G., Wang, R. (eds) Artificial Intelligence. CICAI 2023. Lecture Notes in Computer Science, vol 14473. Springer, Singapore. https://doi.org/10.1007/978-981-99-8850-1_44


  • DOI: https://doi.org/10.1007/978-981-99-8850-1_44

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8849-5

  • Online ISBN: 978-981-99-8850-1

  • eBook Packages: Computer Science, Computer Science (R0)
