ABSTRACT
Existing multimodal dialogue generation models typically generate question-and-answer exchanges from a single image, so the image information cannot be deeply integrated into the sentences. As a result, they fail to produce semantically coherent, informative responses grounded in visual context, which limits the application of multimodal dialogue generation models in real-world scenarios. This paper proposes a Deep Collaborative Attention Model (DCAN) for the multimodal dialogue generation task. First, the method globally encodes the dialogue context and its corresponding visual context. Second, to guide the joint learning of interactions between image and text representations, the visual context features are fused with the dialogue context features through a collaborative attention mechanism, and the Hadamard product is then applied to fuse the multimodal features a second time, which further improves network performance. Finally, the fused features are fed into a Transformer-based decoder to generate coherent, informative responses. To address continuous dialogue in the multimodal setting, experiments are conducted on the OpenViDial 2.0 dataset. The results show that the responses generated by this model have higher relevance and diversity than those of existing baseline models, and that the model effectively integrates visual context information.
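The fusion pipeline described above can be illustrated with a short sketch. The PyTorch code below is a minimal, hypothetical rendering of the stated stages: cross-modal collaborative attention between the encoded dialogue context and visual context, a Hadamard (element-wise) product fusion, and a Transformer-based decoder. The module names, layer sizes, gating projection, and mean-pooling step are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DCANFusion(nn.Module):
    """Sketch of the fusion stage: co-attention between dialogue-context
    and visual-context features, then a Hadamard (element-wise) product.
    Layer sizes and the pooling step are illustrative assumptions."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        # Text queries attend over image region features.
        self.text_to_image = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Image queries attend over dialogue token features.
        self.image_to_text = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, text_feats, image_feats):
        # text_feats:  (B, T, d) encoded dialogue context
        # image_feats: (B, R, d) encoded visual context
        attended_text, _ = self.text_to_image(text_feats, image_feats, image_feats)
        attended_image, _ = self.image_to_text(image_feats, text_feats, text_feats)
        # Pool the image branch to one vector so it broadcasts over tokens,
        # then fuse the two modalities again with a Hadamard product.
        pooled_image = attended_image.mean(dim=1, keepdim=True)      # (B, 1, d)
        fused = attended_text * torch.tanh(self.proj(pooled_image))  # (B, T, d)
        return fused

# Usage: the fused features serve as decoder memory for response generation.
fusion = DCANFusion()
decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

text = torch.randn(2, 20, 512)   # placeholder dialogue-context encodings
image = torch.randn(2, 36, 512)  # placeholder visual region features
memory = fusion(text, image)
tgt = torch.randn(2, 15, 512)    # shifted response embeddings
out = decoder(tgt, memory)       # (2, 15, 512), then projected to vocabulary logits
```

Broadcasting a pooled image vector over the token dimension is one simple way to make the Hadamard product dimensionally valid; the paper's actual fusion may resolve the dimension mismatch differently.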