ABSTRACT
Existing multimodal dialogue generation models typically generate question-and-answer exchanges from a single image, so the image information cannot be deeply integrated into the sentences. As a result, they fail to produce semantically coherent, informative responses grounded in visual context, which limits the application of multimodal dialogue generation models in real-world scenarios. This paper proposes a Deep Collaborative Attention Model (DCAN) for the multimodal dialogue generation task. First, the method globally encodes the dialogue context and its corresponding visual context. Second, to guide the joint learning of interactions between image and text representations, the visual context features are fused with the dialogue context features through a collaborative attention mechanism, and the Hadamard product is then applied to fuse the multimodal features a second time, which further improves network performance. Finally, the fused features are fed into a Transformer-based decoder to generate coherent, informative responses. To address continuous dialogue in the multimodal setting, experiments are conducted on the OpenViDial 2.0 dataset. The results show that the responses generated by this model have higher relevance and diversity than those of existing baseline models, and that the model effectively integrates visual context information.
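The fusion pipeline described above can be illustrated with a short sketch. The PyTorch code below is a minimal, hypothetical rendering of the stated stages: cross-modal collaborative attention between the encoded dialogue context and visual context, a Hadamard (element-wise) product fusion, and a Transformer-based decoder. The module names, layer sizes, gating projection, and mean-pooling step are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DCANFusion(nn.Module):
    """Sketch of the fusion stage: co-attention between dialogue-context
    and visual-context features, then a Hadamard (element-wise) product.
    Layer sizes and the pooling step are illustrative assumptions."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        # Text queries attend over image region features.
        self.text_to_image = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Image queries attend over dialogue token features.
        self.image_to_text = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, text_feats, image_feats):
        # text_feats:  (B, T, d) encoded dialogue context
        # image_feats: (B, R, d) encoded visual context
        attended_text, _ = self.text_to_image(text_feats, image_feats, image_feats)
        attended_image, _ = self.image_to_text(image_feats, text_feats, text_feats)
        # Pool the image branch to one vector so it broadcasts over tokens,
        # then fuse the two modalities again with a Hadamard product.
        pooled_image = attended_image.mean(dim=1, keepdim=True)      # (B, 1, d)
        fused = attended_text * torch.tanh(self.proj(pooled_image))  # (B, T, d)
        return fused

# Usage: the fused features serve as decoder memory for response generation.
fusion = DCANFusion()
decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

text = torch.randn(2, 20, 512)   # placeholder dialogue-context encodings
image = torch.randn(2, 36, 512)  # placeholder visual region features
memory = fusion(text, image)
tgt = torch.randn(2, 15, 512)    # shifted response embeddings
out = decoder(tgt, memory)       # (2, 15, 512), then projected to vocabulary logits
```

Broadcasting a pooled image vector over the token dimension is one simple way to make the Hadamard product dimensionally valid; the paper's actual fusion may resolve the dimension mismatch differently.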