Abstract
Benefiting from the powerful representations learned through large-scale image-text pre-training, pre-trained vision-language models have brought significant improvements to the visual dialog task. However, these works face two main challenges: 1) how to incorporate the sequential nature of multi-turn dialog to better capture the temporal dependencies of visual dialog; 2) how to align the semantics of modality-specific features for better multi-modal interaction and understanding. To address these issues, we propose a recurrent multi-modal transformer (named RecFormer) that captures temporal dependencies between utterances by encoding dialog utterances and interacting with visual information turn by turn. Specifically, we equip a pre-trained transformer with a recurrent function that maintains a cross-modal history encoding for the dialog agent, so the agent can make better predictions by considering temporal dependencies. In addition, we propose history-aware contrastive learning as an auxiliary task that aligns visual features with dialog history features to improve visual dialog understanding. Experimental results demonstrate that RecFormer achieves new state-of-the-art performance on both the VisDial v0.9 (72.52 MRR and 60.47 R@1 on the val split) and VisDial v1.0 (69.29 MRR and 55.90 R@1 on the test-std split) datasets.
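The abstract names two mechanisms: a recurrent function that updates a cross-modal history state turn by turn, and a contrastive objective that aligns visual features with dialog-history features. As a rough, hypothetical sketch (not the authors' implementation; all function names, the tanh-gated update, and the InfoNCE-style loss are illustrative assumptions), the two ideas can be outlined as:

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Project feature vectors onto the unit sphere before computing similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def recurrent_history_update(history, utterance_enc, visual_enc, W_h, W_u, W_v):
    """Toy recurrent function: fold the current turn's utterance and visual
    encodings into the running cross-modal history state.
    In the paper this role is played by a pre-trained transformer; here a
    single tanh-gated linear update stands in for it."""
    return np.tanh(history @ W_h + utterance_enc @ W_u + visual_enc @ W_v)

def info_nce_loss(visual_feats, history_feats, temperature=0.07):
    """InfoNCE-style contrastive loss: the i-th visual feature and the i-th
    dialog-history feature form a positive pair; all other pairings in the
    batch are negatives."""
    v = l2_normalize(visual_feats)
    h = l2_normalize(history_feats)
    logits = v @ h.T / temperature                      # (B, B) similarity matrix
    # log-softmax over each row; positives sit on the diagonal
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(v))
    return -log_probs[idx, idx].mean()

# Turn-by-turn encoding: the history state carries temporal dependencies forward.
rng = np.random.default_rng(0)
d = 8
W_h, W_u, W_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
history = np.zeros((1, d))
for turn in range(3):                                   # three dialog turns
    utterance = rng.normal(size=(1, d))                 # stand-in utterance encoding
    visual = rng.normal(size=(1, d))                    # stand-in visual encoding
    history = recurrent_history_update(history, utterance, visual, W_h, W_u, W_v)
```

Matched visual/history pairs should yield a lower contrastive loss than mismatched ones, which is what drives the alignment during training.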
Acknowledgment
This work was supported in part by the National Key R&D Program of China under Grant No. 2020AAA0109700, the National Natural Science Foundation of China (NSFC) under Grant No. 62206314, the Guangdong Outstanding Youth Fund (Grant No. 2021B1515020061), the Guangdong Basic and Applied Basic Research Foundation under Grant No. 2022A1515011835, a China Postdoctoral Science Foundation funded project under Grant No. 2021M703687, the Shenzhen Science and Technology Program (Grant Nos. RCYX20200714114642083 and GJHZ20220913142600001), the Shenzhen Fundamental Research Program (Grant No. JCYJ20190807154211365), the Nansha Key R&D Program under Grant No. 2022ZD014, and Sun Yat-sen University under Grant Nos. 22lgqb38 and 76160-12220011. We also thank MindSpore (https://www.mindspore.cn/), a deep learning computing framework, for partial support of this work.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Lu, L., Qin, J., Jie, Z., Ma, L., Lin, L., Liang, X. (2024). RecFormer: Recurrent Multi-modal Transformer with History-Aware Contrastive Learning for Visual Dialog. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14425. Springer, Singapore. https://doi.org/10.1007/978-981-99-8429-9_13
DOI: https://doi.org/10.1007/978-981-99-8429-9_13
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8428-2
Online ISBN: 978-981-99-8429-9
eBook Packages: Computer Science, Computer Science (R0)