Abstract
Benefiting from the powerful representations learned through large-scale image-text pre-training, pre-trained vision-language models have brought significant improvements to the visual dialog task. However, these works face two main challenges: 1) how to incorporate the sequential nature of multi-turn dialog to better capture the temporal dependencies of visual dialog; 2) how to align the semantics of modality-specific features for better multi-modal interaction and understanding. To address these issues, we propose a recurrent multi-modal transformer (named RecFormer) that captures temporal dependencies between utterances by encoding dialog utterances and interacting with visual information turn by turn. Specifically, we equip a pre-trained transformer with a recurrent function that maintains a cross-modal history encoding for the dialog agent, so the agent can make better predictions by considering temporal dependencies. In addition, we propose history-aware contrastive learning as an auxiliary task that aligns visual features with dialog history features to improve visual dialog understanding. Experimental results demonstrate that RecFormer achieves new state-of-the-art performance on both the VisDial v0.9 (72.52 MRR and 60.47 R@1 on the val split) and VisDial v1.0 (69.29 MRR and 55.90 R@1 on the test-std split) datasets.
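The abstract names two mechanisms: a recurrent function that updates a cross-modal history state turn by turn, and a contrastive objective that aligns visual features with dialog-history features. As a rough, hypothetical sketch (not the authors' implementation; all function names, the tanh-gated update, and the InfoNCE-style loss are illustrative assumptions), the two ideas can be outlined as:

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Project feature vectors onto the unit sphere before computing similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def recurrent_history_update(history, utterance_enc, visual_enc, W_h, W_u, W_v):
    """Toy recurrent function: fold the current turn's utterance and visual
    encodings into the running cross-modal history state.
    In the paper this role is played by a pre-trained transformer; here a
    single tanh-gated linear update stands in for it."""
    return np.tanh(history @ W_h + utterance_enc @ W_u + visual_enc @ W_v)

def info_nce_loss(visual_feats, history_feats, temperature=0.07):
    """InfoNCE-style contrastive loss: the i-th visual feature and the i-th
    dialog-history feature form a positive pair; all other pairings in the
    batch are negatives."""
    v = l2_normalize(visual_feats)
    h = l2_normalize(history_feats)
    logits = v @ h.T / temperature                      # (B, B) similarity matrix
    # log-softmax over each row; positives sit on the diagonal
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(v))
    return -log_probs[idx, idx].mean()

# Turn-by-turn encoding: the history state carries temporal dependencies forward.
rng = np.random.default_rng(0)
d = 8
W_h, W_u, W_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
history = np.zeros((1, d))
for turn in range(3):                                   # three dialog turns
    utterance = rng.normal(size=(1, d))                 # stand-in utterance encoding
    visual = rng.normal(size=(1, d))                    # stand-in visual encoding
    history = recurrent_history_update(history, utterance, visual, W_h, W_u, W_v)
```

Matched visual/history pairs should yield a lower contrastive loss than mismatched ones, which is what drives the alignment during training.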
Acknowledgment
This work was supported in part by the National Key R&D Program of China under Grant No. 2020AAA0109700, the National Natural Science Foundation of China (NSFC) under Grant No. 62206314, the Guangdong Outstanding Youth Fund (Grant No. 2021B1515020061), the Guangdong Basic and Applied Basic Research Foundation under Grant No. 2022A1515011835, a China Postdoctoral Science Foundation funded project under Grant No. 2021M703687, the Shenzhen Science and Technology Program (Grant Nos. RCYX20200714114642083 and GJHZ20220913142600001), the Shenzhen Fundamental Research Program (Grant No. JCYJ20190807154211365), the Nansha Key R&D Program under Grant No. 2022ZD014, and Sun Yat-sen University under Grant Nos. 22lgqb38 and 76160-12220011. We also thank MindSpore (https://www.mindspore.cn/), a deep learning computing framework, for partial support of this work.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Lu, L., Qin, J., Jie, Z., Ma, L., Lin, L., Liang, X. (2024). RecFormer: Recurrent Multi-modal Transformer with History-Aware Contrastive Learning for Visual Dialog. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14425. Springer, Singapore. https://doi.org/10.1007/978-981-99-8429-9_13
DOI: https://doi.org/10.1007/978-981-99-8429-9_13
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8428-2
Online ISBN: 978-981-99-8429-9
eBook Packages: Computer Science, Computer Science (R0)