DOI: 10.1145/3573942.3574091

Multimodal Dialogue Generation Based on Transformer and Collaborative Attention

Published: 16 May 2023

Abstract

Current multimodal dialogue generation models typically generate question-and-answer dialogue from a single image, so image information cannot be deeply integrated into the sentences. As a result, these models fail to generate semantically coherent, informative responses grounded in visual context, which limits their application in real-world scenarios. This paper proposes a Deep Collaborative Attention Model (DCAN) for multimodal dialogue generation. First, the method globally encodes the dialogue context and its corresponding visual context. Second, to guide the joint learning of interactions between image and text representations, the visual context features are fused with the dialogue context features through a collaborative attention mechanism, and the Hadamard product is then applied to fully fuse the multimodal features again and improve network performance. Finally, the fused features are fed into a Transformer-based decoder to generate coherent, informative responses. To address continuous dialogue in the multimodal setting, experiments are conducted on the OpenViDial 2.0 dataset. The results show that the responses generated by this model achieve higher relevance and diversity than those of existing baseline models, and that the model effectively integrates visual context information.
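The abstract describes the fusion pipeline only at a high level, and the paper's exact layer configuration is not reproduced on this page. As a rough illustration of the described flow (co-attention between dialogue and visual features, Hadamard-product re-fusion, then a Transformer-based decoder), the following PyTorch sketch uses assumed dimensions and standard library modules; the class name `DCANFusion` and the mean-pooling alignment step are hypothetical choices, not details from the paper.

```python
import torch
import torch.nn as nn

class DCANFusion(nn.Module):
    """Illustrative sketch (not the paper's implementation): co-attention
    between text and image features, Hadamard-product fusion, and a
    Transformer decoder. All dimensions are assumed."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, vocab: int = 30000):
        super().__init__()
        # Co-attention: each modality attends to the other.
        self.txt2img = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.img2txt = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # A standard Transformer decoder produces the response representation.
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, vocab)

    def forward(self, txt, img, tgt):
        # txt: (B, Lt, d) dialogue-context features; img: (B, Li, d) visual
        # features; tgt: (B, Lo, d) embedded response prefix.
        t_att, _ = self.txt2img(txt, img, img)  # text enriched by image
        i_att, _ = self.img2txt(img, txt, txt)  # image enriched by text
        # Hadamard (element-wise) product re-fuses the attended streams;
        # mean-pooling the image side is one simple way to align lengths.
        i_pooled = i_att.mean(dim=1, keepdim=True).expand_as(t_att)
        fused = t_att * i_pooled
        h = self.decoder(tgt, memory=fused)     # decode over the fused memory
        return self.out(h)                      # (B, Lo, vocab) logits

if __name__ == "__main__":
    model = DCANFusion()
    logits = model(torch.randn(2, 12, 512),   # dialogue context
                   torch.randn(2, 5, 512),    # visual context
                   torch.randn(2, 7, 512))    # response prefix
    print(logits.shape)  # torch.Size([2, 7, 30000])
```

In the full model, the fused memory would be built from the OpenViDial 2.0 visual context rather than a single pooled vector; the sketch only illustrates the data flow named in the abstract.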



    Published In

    AIPR '22: Proceedings of the 2022 5th International Conference on Artificial Intelligence and Pattern Recognition
    September 2022
    1221 pages
    ISBN: 9781450396899
    DOI: 10.1145/3573942

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. Collaborative attention
    2. Feature fusion
    3. Multimodal dialog generation
    4. Transformer

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    AIPR 2022

