Research article · DOI: 10.1145/3573942.3574091 · AIPR Conference Proceedings

Multimodal Dialogue Generation Based on Transformer and Collaborative Attention

Published: 16 May 2023

ABSTRACT

Current multimodal dialogue generation models typically condition question-and-answer generation on a single image, so image information is not deeply integrated into the sentences. As a result, they cannot generate semantically coherent, informative responses grounded in the visual context, which limits their application in real-world scenarios. This paper proposes a Deep Collaborative Attention Model (DCAN) for multimodal dialogue generation. First, the method globally encodes the dialogue context and its corresponding visual context. Second, to jointly learn the interactions between image and text representations, the visual context features are fused with the dialogue context features through a collaborative attention mechanism, and a Hadamard product then fuses the multimodal features a second time to improve network performance. Finally, the fused features are fed into a Transformer-based decoder to generate coherent, informative responses. To address continuous, multi-turn dialogue in the multimodal setting, experiments are conducted on the OpenViDial 2.0 dataset. The results show that the responses generated by this model are more relevant and diverse than those of existing comparison models, and that it effectively integrates visual context information.
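To make the pipeline in the abstract concrete, the sketch below shows one plausible reading of the collaborative-attention fusion followed by a Hadamard-product re-fusion, written in PyTorch. It is an illustration only, not the authors' released implementation: the layer widths, the mean pooling of the attended image stream, and the module names are assumptions made for the example.

# Illustrative sketch (not the authors' code): a simplified DCAN-style fusion,
# assuming pre-extracted visual region features and text embeddings of equal width.
import torch
import torch.nn as nn


class CoAttentionFusion(nn.Module):
    """Fuse dialogue-context and visual-context features with cross (collaborative)
    attention, then re-fuse them with a Hadamard (element-wise) product, as the
    abstract describes at a high level. All sizes are illustrative assumptions."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Text tokens attend to image regions, and image regions attend to text tokens.
        self.text_to_image = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.image_to_text = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # text_feats: (batch, n_tokens, d_model); image_feats: (batch, n_regions, d_model)
        text_attended, _ = self.text_to_image(text_feats, image_feats, image_feats)
        image_attended, _ = self.image_to_text(image_feats, text_feats, text_feats)
        # Pool the attended image stream, then take the Hadamard product with the
        # attended text stream for a second, element-wise fusion pass.
        pooled_image = image_attended.mean(dim=1, keepdim=True)   # (batch, 1, d_model)
        fused = text_attended * pooled_image                      # broadcast Hadamard product
        return self.proj(fused)                                   # (batch, n_tokens, d_model)


if __name__ == "__main__":
    fusion = CoAttentionFusion()
    text = torch.randn(2, 20, 512)    # encoded dialogue context
    image = torch.randn(2, 36, 512)   # encoded visual context (e.g. 36 region features)
    memory = fusion(text, image)
    print(memory.shape)               # torch.Size([2, 20, 512])

In the full model, the fused features would then serve as the memory of a Transformer decoder (for example nn.TransformerDecoder) that generates the response tokens autoregressively, matching the final step the abstract describes.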


Published in

AIPR '22: Proceedings of the 2022 5th International Conference on Artificial Intelligence and Pattern Recognition, September 2022, 1221 pages. ISBN: 9781450396899. DOI: 10.1145/3573942. Copyright © 2022 ACM. Publisher: Association for Computing Machinery, New York, NY, United States.
