Abstract
As a challenging cross-media task, visual dialog assesses whether an AI agent can converse in natural language based on its understanding of visual content. A critical issue is therefore coreference: not only coreference within the dialog text, but also coreference between the text and the visual content. In this paper, we propose the multi-aware coreference relation network (MACR-Net), which resolves coreference from both the textual and visual perspectives and fuses the two views in a complementary manner. Specifically, its textual coreference relation module identifies textual coreference relations from the textual view based on a multi-aware textual representation. The visual coreference relation module adaptively adjusts visual coreference relations from the visual view based on a context-aware relation representation. Finally, the multi-modal fusion module fuses the multi-aware relations to obtain an aligned representation. Extensive experiments on the VisDial v1.0 benchmark show that MACR-Net achieves state-of-the-art performance.
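To make the module layout described above concrete, the following is a minimal, self-contained sketch of how a textual coreference module, a visual coreference module, and a multi-modal fusion module could be wired together. All class names, dimensions, and the simple attention-style scoring are illustrative assumptions for exposition, not the MACR-Net implementation itself.

```python
# Hypothetical sketch of a three-module pipeline (textual coreference,
# visual coreference, multi-modal fusion). Shapes and scoring are assumptions.
import torch
import torch.nn as nn


class TextualCoreferenceModule(nn.Module):
    """Relates question tokens to dialog-history tokens via attention."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj_q = nn.Linear(dim, dim)
        self.proj_h = nn.Linear(dim, dim)

    def forward(self, question: torch.Tensor, history: torch.Tensor) -> torch.Tensor:
        # question: (B, Lq, D), history: (B, Lh, D)
        scores = self.proj_q(question) @ self.proj_h(history).transpose(1, 2)  # (B, Lq, Lh)
        attn = scores.softmax(dim=-1)
        return attn @ history  # coreference-aware textual representation (B, Lq, D)


class VisualCoreferenceModule(nn.Module):
    """Re-weights visual region features conditioned on the textual context."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj_v = nn.Linear(dim, dim)
        self.proj_t = nn.Linear(dim, dim)

    def forward(self, regions: torch.Tensor, text_ctx: torch.Tensor) -> torch.Tensor:
        # regions: (B, R, D); text_ctx: (B, Lq, D) pooled to a single context vector
        ctx = text_ctx.mean(dim=1, keepdim=True)                               # (B, 1, D)
        scores = self.proj_v(regions) @ self.proj_t(ctx).transpose(1, 2)       # (B, R, 1)
        attn = scores.softmax(dim=1)
        return (attn * regions).sum(dim=1)  # context-aware visual summary (B, D)


class MultiModalFusion(nn.Module):
    """Fuses the textual and visual coreference-aware representations."""
    def __init__(self, dim: int):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, text_repr: torch.Tensor, vis_repr: torch.Tensor) -> torch.Tensor:
        pooled_text = text_repr.mean(dim=1)                                    # (B, D)
        return torch.tanh(self.fuse(torch.cat([pooled_text, vis_repr], dim=-1)))


if __name__ == "__main__":
    B, Lq, Lh, R, D = 2, 8, 32, 36, 512
    q, h, v = torch.randn(B, Lq, D), torch.randn(B, Lh, D), torch.randn(B, R, D)
    text_repr = TextualCoreferenceModule(D)(q, h)
    vis_repr = VisualCoreferenceModule(D)(v, text_repr)
    fused = MultiModalFusion(D)(text_repr, vis_repr)
    print(fused.shape)  # torch.Size([2, 512])
```

In this toy layout the fused vector would then feed an answer decoder or ranker; the actual model's attention design and fusion strategy are described in the paper itself.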




Research data policy and data availability statements
Data sharing is not applicable to this article, as no datasets were generated or analyzed during the current study.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Nos. 61972059, 61773272, 61602332); the Natural Science Foundation of the Jiangsu Higher Education Institutions of China (No. 19KJA230001); the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (No. 93K172016K08); and the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD).
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose and no competing interests relevant to the content of this article. All authors certify that they have no affiliations with or involvement in any organization or entity with any financial or non-financial interest in the subject matter or materials discussed in this manuscript.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, Z., Jiang, T., Liu, C. et al. Multi-aware coreference relation network for visual dialog. Int J Multimed Info Retr 11, 567–576 (2022). https://doi.org/10.1007/s13735-022-00257-2