
Multi-round Dialogue State Tracking by Object-Entity Alignment in Visual Dialog

  • Conference paper
Artificial Intelligence (CICAI 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14473)


Abstract

Visual Dialog (VD) is a task in which an agent answers a series of image-related questions based on a multi-round dialog history. However, previous VD methods often treat the entire dialog history as a simple text input, disregarding the round-level conversational information flow. In this paper, we introduce the Multi-round Dialogue State Tracking model (MDST), a framework that addresses this limitation by leveraging the dialogue state learned from the dialog history to answer questions. MDST captures each round of dialog history, constructing internal dialogue state representations defined as 2-tuples of vision-language representations. These representations effectively ground the current question, enabling the generation of accurate answers. Experimental results on the VisDial v1.0 dataset demonstrate that MDST achieves new state-of-the-art performance in the generative setting. Furthermore, through a series of human studies, we validate that MDST generates long, consistent, and human-like answers while consistently answering a series of questions correctly.
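To make the abstract's description concrete, the sketch below illustrates one plausible reading of round-level state tracking: a dialogue state held as a 2-tuple of (vision, language) representations, updated after each past question-answer round and then used to ground the current question. All module choices, names, and dimensions here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DialogueStateTracker(nn.Module):
    """Hypothetical sketch of MDST-style tracking: the dialogue state is a
    2-tuple (vision state, language state), updated once per dialog round
    and used to ground the current question. Update rules and dimensions
    are assumptions for illustration only."""

    def __init__(self, dim: int = 512):
        super().__init__()
        # Cross-attention grounds the current question in the tracked state.
        self.ground = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # GRU cells carry the round-by-round state updates.
        self.v_update = nn.GRUCell(dim, dim)
        self.l_update = nn.GRUCell(dim, dim)

    def forward(self, question, rounds, v_feats):
        # question: (B, Lq, dim) current-question token features
        # rounds:   list of (B, dim) pooled QA-pair features, one per past round
        # v_feats:  (B, R, dim) object-level image features
        B, dim = question.size(0), question.size(2)
        v_state = v_feats.mean(dim=1)          # initial vision state
        l_state = question.new_zeros(B, dim)   # initial language state
        for qa in rounds:                      # round-level information flow
            v_state = self.v_update(qa, v_state)
            l_state = self.l_update(qa, l_state)
        # Ground the current question in the (vision, language) 2-tuple.
        state = torch.stack([v_state, l_state], dim=1)  # (B, 2, dim)
        grounded, _ = self.ground(question, state, state)
        return grounded  # would be fed to an answer decoder

# Illustrative usage with random tensors (batch of 2, 10-token question,
# 3 past rounds, 36 object features):
tracker = DialogueStateTracker(dim=512)
q = torch.randn(2, 10, 512)
history = [torch.randn(2, 512) for _ in range(3)]
objs = torch.randn(2, 36, 512)
out = tracker(q, history, objs)  # (2, 10, 512)
```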



Acknowledgements

We thank the reviewers for their comments and suggestions. This paper was partially supported by the National Natural Science Foundation of China (NSFC 62076032), Huawei Noah's Ark Lab, the MoE-CMCC "Artificial Intelligence" Project (No. MCM20190701), the Beijing Natural Science Foundation (Grant No. 4204100), and the BUPT Excellent Ph.D. Students Foundation (No. CX2020309).

Author information

Corresponding author: Wei Pang.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Pang, W. (2024). Multi-round Dialogue State Tracking by Object-Entity Alignment in Visual Dialog. In: Fang, L., Pei, J., Zhai, G., Wang, R. (eds) Artificial Intelligence. CICAI 2023. Lecture Notes in Computer Science, vol 14473. Springer, Singapore. https://doi.org/10.1007/978-981-99-8850-1_44


  • DOI: https://doi.org/10.1007/978-981-99-8850-1_44

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8849-5

  • Online ISBN: 978-981-99-8850-1

  • eBook Packages: Computer Science, Computer Science (R0)
