
Multi-aware coreference relation network for visual dialog

  • Regular paper
International Journal of Multimedia Information Retrieval

Abstract

As a challenging cross-media task, visual dialog assesses whether an AI agent can converse in human language based on its understanding of visual content. The critical issue is therefore coreference: not only coreference among visual entities, but also coreference within language and between language and vision. In this paper, we propose the multi-aware coreference relation network (MACR-Net), which addresses coreference from both the textual and the visual perspective and fuses the two views in a complementary manner. Specifically, its textual coreference relation module identifies textual coreference relations from a multi-aware textual representation. The visual coreference relation module then adaptively adjusts visual coreference relations based on a context-aware relation representation. Finally, the multi-modal fusion module fuses the multi-aware relations into an aligned representation. Extensive experiments on the VisDial v1.0 benchmark show that MACR-Net achieves state-of-the-art performance.
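The abstract describes a three-module pipeline: a textual coreference relation module, a visual coreference relation module, and a multi-modal fusion module. As a rough structural illustration only, the PyTorch-style sketch below shows how such a composition could be wired; every class name, dimension, and attention choice is an assumption made for readability and is not taken from the paper's implementation.

```python
# Minimal sketch of the three-module design outlined in the abstract.
# All names, dimensions, and layer choices are illustrative assumptions,
# NOT the authors' MACR-Net implementation.
import torch
import torch.nn as nn


class TextualCoreferenceModule(nn.Module):
    """Relates the current question to the dialog history (textual view)."""
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, question, history):
        # Attend from question tokens to history tokens to resolve pronouns/ellipsis.
        out, _ = self.attn(question, history, history)
        return out


class VisualCoreferenceModule(nn.Module):
    """Adjusts relations among visual regions conditioned on the text (visual view)."""
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, regions, text_context):
        out, _ = self.attn(regions, text_context, text_context)
        return out


class MultiModalFusion(nn.Module):
    """Fuses the two relation-aware streams into one aligned representation."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, text_feat, vis_feat):
        pooled = torch.cat([text_feat.mean(dim=1), vis_feat.mean(dim=1)], dim=-1)
        return torch.tanh(self.proj(pooled))


class MACRNetSketch(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.textual = TextualCoreferenceModule(dim)
        self.visual = VisualCoreferenceModule(dim)
        self.fusion = MultiModalFusion(dim)

    def forward(self, question, history, regions):
        q_ctx = self.textual(question, history)   # textual coreference relations
        v_ctx = self.visual(regions, q_ctx)       # visual coreference relations
        return self.fusion(q_ctx, v_ctx)          # aligned multi-modal representation


# Toy shapes: batch of 2, 10 question tokens, 40 history tokens, 36 image regions.
model = MACRNetSketch(dim=512)
rep = model(torch.randn(2, 10, 512), torch.randn(2, 40, 512), torch.randn(2, 36, 512))
print(rep.shape)  # torch.Size([2, 512])
```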


Research data policy and data availability statements

Data sharing is not applicable to this article, as no datasets were generated or analyzed during the current study.


Acknowledgements

This work was supported by the National Natural Science Foundation of China (Nos. 61972059, 61773272, 61602332); the Natural Science Foundation of the Jiangsu Higher Education Institutions of China (No. 19KJA230001); the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (No. 93K172016K08); and the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD).

Author information


Corresponding authors

Correspondence to Chunping Liu or Yi Ji.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose and no competing interests relevant to the content of this article. All authors certify that they have no affiliations with, or involvement in, any organization or entity with any financial or non-financial interest in the subject matter or materials discussed in this manuscript.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhang, Z., Jiang, T., Liu, C. et al. Multi-aware coreference relation network for visual dialog. Int J Multimed Info Retr 11, 567–576 (2022). https://doi.org/10.1007/s13735-022-00257-2
