Abstract
This paper addresses visual dialog, the task of answering multi-round questions based on the dialog history and the image content. The task is challenging because a question may need to be answered in relation to any previous dialog turn and any visual clue in the image. Existing methods mainly focus on the discriminative setting, designing various attention mechanisms to model the interaction between answer candidates and the multi-modal context. Despite the impressive results of attention-based models for visual dialog, a universal encoder-decoder for both answer understanding and answer generation remains challenging. In this paper, we propose UED, a unified framework that exploits answer candidates to jointly train the discriminative and generative tasks. UED is unified in that (1) it fully exploits the interaction between the different modalities to support answer ranking and answer generation in a single transformer-based model, and (2) it uses the answers as anchors to facilitate both settings. We evaluate the proposed UED on the VisDial dataset, where our model outperforms the state-of-the-art.
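To make the "single model, two tasks" idea concrete, the following PyTorch sketch shows one plausible way a shared transformer encoder could serve both a discriminative head (scoring an answer candidate) and a generative head (predicting answer tokens). This is a minimal illustration under our own assumptions: the class name, the 2048-d region features, and the two linear heads are hypothetical and not the paper's actual UED implementation, which additionally uses answers as anchors between the two settings.

import torch
import torch.nn as nn

class UnifiedDialogTransformer(nn.Module):
    """Hypothetical sketch: one shared transformer, two task heads."""

    def __init__(self, vocab_size=30522, d_model=768, n_heads=12, n_layers=6):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        # Project visual region features (e.g., from an object detector)
        # into the same space as the text token embeddings.
        self.visual_proj = nn.Linear(2048, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.rank_head = nn.Linear(d_model, 1)         # discriminative: candidate score
        self.lm_head = nn.Linear(d_model, vocab_size)  # generative: next-token logits

    def forward(self, vis_feats, text_ids, attn_mask=None):
        # Concatenate image regions and dialog/candidate tokens into one
        # sequence, so cross-modal interaction happens in every
        # self-attention layer. A causal mask over the answer span would
        # be passed via attn_mask when training the generative task.
        x = torch.cat([self.visual_proj(vis_feats), self.token_emb(text_ids)], dim=1)
        h = self.encoder(x, mask=attn_mask)
        n_vis = vis_feats.size(1)
        score = self.rank_head(h[:, n_vis, :]).squeeze(-1)  # first text position as [CLS]
        logits = self.lm_head(h[:, n_vis:, :])              # logits over the text span
        return score, logits

# Toy forward pass: 36 region features per image, 40 text tokens.
model = UnifiedDialogTransformer()
score, logits = model(torch.randn(2, 36, 2048), torch.randint(0, 30522, (2, 40)))

Sharing all encoder parameters this way lets the ranking loss and the generation loss regularize each other, which is the motivation the abstract gives for training the two settings jointly.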
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China under Grant 61771145.
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Chen, C., Gu, X. (2021). UED: A Unified Encoder Decoder Network for Visual Dialog. In: Mantoro, T., Lee, M., Ayu, M.A., Wong, K.W., Hidayanto, A.N. (eds) Neural Information Processing. ICONIP 2021. Communications in Computer and Information Science, vol 1517. Springer, Cham. https://doi.org/10.1007/978-3-030-92310-5_12
DOI: https://doi.org/10.1007/978-3-030-92310-5_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-92309-9
Online ISBN: 978-3-030-92310-5