DOI:10.1145/3474085.3475234
research-article

Multimodal Dialog System: Relational Graph-based Context-aware Question Understanding

Published: 17 October 2021

Abstract

Multimodal dialog systems have attracted increasing attention from both academia and industry in recent years. Although existing methods have made some progress, they still face challenges in question understanding (i.e., user intention comprehension). In this paper, we present a relational graph-based context-aware question understanding scheme, which enhances user intention comprehension from the local to the global level. Specifically, we first use multiple attribute matrices as guidance information to fully exploit the product-related keywords in each textual sentence, strengthening the local representation of user intentions. Afterwards, we design a sparse graph attention network to adaptively aggregate effective context information for each utterance, understanding the user intentions comprehensively from a global perspective. Extensive experiments on a benchmark dataset show the superiority of our model over several state-of-the-art baselines.
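
The abstract names the two steps but does not give their formulation, so the following is only a minimal sketch of the general idea, not the authors' model. It assumes dot-product scoring, a softmax over scores, and top-k masking as the source of sparsity; the function names `attribute_guided_pooling` and `sparse_context_attention` and the `top_k` parameter are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    # Entries equal to -inf receive a weight of exactly 0.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights: List[float], vectors: List[Vector]) -> Vector:
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def attribute_guided_pooling(words: List[Vector], attributes: List[Vector]) -> Vector:
    """Local step (assumed form): weight each word by its best match
    against any attribute vector, then pool the word vectors."""
    scores = [max(dot(w, a) for a in attributes) for w in words]
    return weighted_sum(softmax(scores), words)


def sparse_context_attention(utterances: List[Vector], top_k: int) -> List[Vector]:
    """Global step (assumed form): each utterance aggregates only its
    top_k highest-scoring utterances; all other edges are masked out."""
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    outputs = []
    for query in utterances:
        scores = [dot(query, u) for u in utterances]
        order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        keep = set(order[:top_k])
        masked = [s if j in keep else float("-inf") for j, s in enumerate(scores)]
        outputs.append(weighted_sum(softmax(masked), utterances))
    return outputs


if __name__ == "__main__":
    # Toy 2-dimensional vectors, chosen only for illustration.
    sentence = attribute_guided_pooling([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]])
    print(sentence)
    print(sparse_context_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], top_k=2))
```

In this sketch, setting `top_k` to the number of utterances recovers ordinary dense attention, while smaller values drop low-scoring context utterances entirely. A learned model would replace the raw dot products with trained projections.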

Supplementary Material

MP4 File (MM21-fp0411.mp4)
Multimodal dialog systems have attracted increasing research interest due to their significance in retail, travel, and other domains. Although existing methods have made some progress, they still face challenges in user intention comprehension. To this end, we present a relational graph-based context-aware question understanding scheme, which enhances user intention comprehension from the local to the global level. Concretely, we use multiple attribute matrices as guidance information to fully exploit the product-related keywords in each textual sentence, strengthening the local representation of user intentions. In addition, we design a sparse graph attention network to adaptively aggregate effective context information for each utterance, understanding the user intentions comprehensively from a global perspective. Extensive experiments on a benchmark dataset show the superiority of our model over several state-of-the-art baselines.

    Information

    Published In

    MM '21: Proceedings of the 29th ACM International Conference on Multimedia
    October 2021
    5796 pages
    ISBN:9781450386517
    DOI:10.1145/3474085
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 17 October 2021

    Author Tags

    1. attribute-enhanced text representation
    2. multimodal dialog system
    3. sparse relational context modeling

    Qualifiers

    • Research-article

    Conference

    MM '21
    Sponsor:
    MM '21: ACM Multimedia Conference
    October 20 - 24, 2021
    Virtual Event, China

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 61
    • Downloads (Last 6 weeks): 5
    Reflects downloads up to 20 Jan 2025

    Citations

    Cited By

    • (2025) Dynamic Strategy Prompt Reasoning for Emotional Support Conversation. IEEE Transactions on Multimedia 27, 108-119. DOI:10.1109/TMM.2024.3521669. Online publication date: 2025
    • (2024) Multi-factor adaptive vision selection for egocentric video question answering. Proceedings of the 41st International Conference on Machine Learning, 59310-59328. DOI:10.5555/3692070.3694520. Online publication date: 21-Jul-2024
    • (2024) SCREEN: A Benchmark for Situated Conversational Recommendation. Proceedings of the 32nd ACM International Conference on Multimedia, 9591-9600. DOI:10.1145/3664647.3681651. Online publication date: 28-Oct-2024
    • (2024) Sample Efficiency Matters: Training Multimodal Conversational Recommendation Systems in a Small Data Setting. Proceedings of the 32nd ACM International Conference on Multimedia, 2223-2232. DOI:10.1145/3664647.3681217. Online publication date: 28-Oct-2024
    • (2024) AutoGraph: Enabling Visual Context via Graph Alignment in Open Domain Multi-Modal Dialogue Generation. Proceedings of the 32nd ACM International Conference on Multimedia, 2079-2088. DOI:10.1145/3664647.3681012. Online publication date: 28-Oct-2024
    • (2024) Multimodal Dialogue Systems via Capturing Context-aware Dependencies and Ordinal Information of Semantic Elements. ACM Transactions on Intelligent Systems and Technology 15:3, 1-25. DOI:10.1145/3645099. Online publication date: 15-Apr-2024
    • (2024) LOIS: Looking Out of Instance Semantics for Visual Question Answering. IEEE Transactions on Multimedia 26, 6202-6214. DOI:10.1109/TMM.2023.3347093. Online publication date: 2024
    • (2024) Resolving Zero-Shot and Fact-Based Visual Question Answering via Enhanced Fact Retrieval. IEEE Transactions on Multimedia 26, 1790-1800. DOI:10.1109/TMM.2023.3289729. Online publication date: 1-Jan-2024
    • (2023) Multimodal Dialog Systems with Dual Knowledge-enhanced Generative Pretrained Language Model. ACM Transactions on Information Systems 42:2, 1-25. DOI:10.1145/3606368. Online publication date: 6-Oct-2023
    • (2023) Enhancing Product Representation with Multi-form Interactions for Multimodal Conversational Recommendation. Proceedings of the 31st ACM International Conference on Multimedia, 6491-6500. DOI:10.1145/3581783.3613755. Online publication date: 26-Oct-2023
