Abstract
We present a new solution for building a crowd-knowledge-enhanced multimodal conversational system for travel. It aims to assist users in completing various travel-related tasks, such as searching for restaurants or things to do, through multimodal conversations involving both text and images. To achieve this goal, we ground this research in a combination of multimodal understanding and recommendation techniques, exploring the possibility of a more convenient information-seeking paradigm. Specifically, we build the system in a modular manner, where the construction of each module is enriched with crowd knowledge from social sites. To the best of our knowledge, this is the first work that attempts to build an intelligent multimodal conversational system for travel, and it moves an important step towards developing human-like assistants for the completion of daily-life tasks. Several current challenges are also pointed out as future directions.
References
Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
Bordes, A., Weston, J.: Learning end-to-end goal-oriented dialog. In: The 3rd International Conference on Learning Representations, pp. 1–14 (2016)
Budzianowski, P., et al.: MultiWOZ - a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. In: EMNLP, pp. 5016–5026 (2018)
Chaudhuri, S., Ganjam, K., Ganti, V., Motwani, R.: Robust and efficient fuzzy match for online data cleaning. In: SIGMOD, pp. 313–324. ACM (2003)
Chen, Y.N., Wang, W.Y., Rudnicky, A.I.: Leveraging frame semantics and distributional semantics for unsupervised semantic slot induction in spoken dialogue systems. In: 2014 IEEE Spoken Language Technology Workshop, pp. 584–589 (2014)
Ester, M., et al.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD, vol. 96, pp. 226–231 (1996)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778 (2016)
Li, R., Kahou, S.E., Schulz, H., Michalski, V., Charlin, L., Pal, C.: Towards deep conversational recommendations. In: NIPS, pp. 9748–9758 (2018)
Liao, L., He, X., Ren, Z., Nie, L., Xu, H., Chua, T.S.: Representativeness-aware aspect analysis for brand monitoring in social media. In: IJCAI, pp. 310–316 (2017)
Liao, L., Takanobu, R., Ma, Y., Yang, X., Huang, M., Chua, T.S.: Deep conversational recommender in travel. arXiv preprint arXiv:1907.00710 (2019)
Liu, B., Lane, I.: Attention-based recurrent neural network models for joint intent detection and slot filling. arXiv preprint arXiv:1609.01454 (2016)
Madotto, A., Wu, C.S., Fung, P.: Mem2seq: effectively incorporating knowledge bases into end-to-end task-oriented dialog systems. In: ACL, pp. 1468–1478 (2018)
Rieser, V., Lemon, O.: Natural language generation as planning under uncertainty for spoken dialogue systems. In: Krahmer, E., Theune, M. (eds.) EACL/ENLG -2009. LNCS (LNAI), vol. 5790, pp. 105–120. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15573-4_6
Sukhbaatar, S., et al.: End-to-end memory networks. In: NIPS, pp. 2440–2448 (2015)
Sun, Y., Zhang, Y.: Conversational recommender system. In: SIGIR, pp. 235–244 (2018)
Tur, G., Jeong, M., Wang, Y.Y., Hakkani-Tür, D., Heck, L.: Exploiting the semantic web for unsupervised natural language semantic parsing. In: Thirteenth Annual Conference of the International Speech Communication Association (2012)
Wen, T.H., et al.: A network-based end-to-end trainable task-oriented dialogue system. In: EACL, pp. 438–449 (2017)
Yan, Z., Duan, N., Chen, P., Zhou, M., Zhou, J., Li, Z.: Building task-oriented dialogue systems for online shopping. In: AAAI, pp. 4618–4625 (2017)
Appendices
Appendix A: State Tracking
State tracking refers to the maintenance of the dialogue state \(\mathcal {S}_t\), which represents the conversation session up to time t. Based on the state \(\mathcal {S}_{t-1}\) at the previous time step and the multimodal understanding result \(\mathcal {U}_t\) for the utterance at time step t, the dialogue state is obtained as follows:

$$\mathcal {S}_t = \mathcal {G}(\mathcal {S}_{t-1}, \mathcal {U}_t),$$

where \(\mathcal {G}\) refers to a set of rules. We generally summarize the rules as below:
- (1) if \(\mathcal {M}_t = Chitchat\), then \(\mathcal {S}_t = \mathcal {S}_{t-1}\);
- (2) if the domain \(\mathcal {D}_t\) has changed, \(\mathcal {S}_t\) is rebuilt entirely from \(\mathcal {U}_t\);
- (3) if the domain \(\mathcal {D}_t\) is unchanged and \(\mathcal {M}_t \ne Negation\), \(\mathcal {S}_t\) inherits the information stored in \(\mathcal {S}_{t-1}\);
- (4) if the domain \(\mathcal {D}_t\) is unchanged and \(\mathcal {M}_t = Negation\), \(\mathcal {S}_t\) inherits the information stored in \(\mathcal {S}_{t-1}\) while updating the negated parts according to \(\mathcal {U}_t\);
- (5) if the time interval between two consecutive utterances exceeds a pre-defined length at time t, then \(\mathcal {S}_t\) is cleared.
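The five rules above can be sketched as a small state-update function. This is a minimal illustration, not the authors' implementation; the `DialogueState` container, the slot-dictionary representation of \(\mathcal {U}_t\), and the `SESSION_TIMEOUT` value are all assumptions made for the example. In particular, rule (3) is interpreted here as inheriting old slot values while admitting newly mentioned slots, and rule (4) as overwriting the negated slots.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

SESSION_TIMEOUT = 300  # assumed "pre-defined length" between utterances, in seconds


@dataclass
class DialogueState:
    """Hypothetical container for S_t: active domain plus slot constraints."""
    domain: Optional[str] = None
    slots: Dict[str, str] = field(default_factory=dict)


def track_state(prev: DialogueState, understanding: dict,
                elapsed: float) -> DialogueState:
    """Apply rules (1)-(5): S_t = G(S_{t-1}, U_t)."""
    # Rule (5): a stale session is cleared before anything else.
    if elapsed > SESSION_TIMEOUT:
        prev = DialogueState()

    intent = understanding.get("intent")
    # Rule (1): chitchat leaves the state untouched.
    if intent == "Chitchat":
        return prev

    new_domain = understanding.get("domain")
    # Rule (2): a domain switch rebuilds the state entirely from U_t.
    if new_domain is not None and new_domain != prev.domain:
        return DialogueState(domain=new_domain,
                             slots=dict(understanding.get("slots", {})))

    slots = dict(prev.slots)
    if intent == "Negation":
        # Rule (4): inherit S_{t-1} but overwrite the negated parts from U_t.
        slots.update(understanding.get("slots", {}))
    else:
        # Rule (3): inherit S_{t-1}; new slots from U_t fill only empty entries.
        for key, value in understanding.get("slots", {}).items():
            slots.setdefault(key, value)
    return DialogueState(domain=prev.domain or new_domain, slots=slots)
```

For instance, a `Negation` utterance such as "no, Japanese instead" would overwrite only the `cuisine` slot while keeping the rest of the session state intact.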
Tracking dialogue states is key to improving the user experience in multi-turn conversations. The main reason we do not follow previous works in learning statistical tracking models is the lack of dialogue data to train them. We leave leveraging session-level labeled dialogue data to improve state tracking as future work.
Appendix B: Action Decision
At each turn of the conversation between user and agent, the dialogue management module takes the current state tracking result as input and outputs the corresponding actions. Due to the lack of large-scale dialogue training data, we again resort to a set of rules. The main action types considered and the conditions for triggering them are as below:
- Proactive Questioning. This action is triggered when (a) a recommendation intent is detected, (b) a domain is detected, and (c) not enough constraints or attributes are detected in \(\mathcal {S}_t\). It is used to obtain more constraints or attributes to narrow down the search space.
- Candidate Listing. This action is triggered when recommendation results are obtained or the Show more intent is detected. As each venue in our dataset is associated with Foursquare photos, we implement candidate listing via a list of images, where each image corresponds to a venue. In the interface, the user can conveniently choose a venue by simply clicking its corresponding image.
- Venue Recommendation. This action is triggered when the intent \(\mathcal {I}_t\) in \(\mathcal {S}_t\) is Recommendation; it retrieves results from the recommendation module.
- Question Answering. This action is triggered when a venue is selected and one of its slot names is detected in \(\mathcal {U}_t\) without an accompanying value. It returns the missing attribute value by looking up the venue database.
- Review Summary. This action is triggered when a venue is selected and the intent \(\mathcal {I}_t\) in \(\mathcal {S}_t\) is Ask opinion. It summarizes the reviews of the target venue and presents them in an organized form.
- API Call. This action is triggered when a venue is selected and the Map direction intent is present in \(\mathcal {I}_t\); the Google Map API is then called with the start position and destination. Currently, only the Map API is integrated, but other APIs such as weather reports could be integrated with proper modifications.
- Chitchat. This action is triggered when no travel venue seeking related intent is detected. As pointed out by [18], nearly 80% of utterances to e-commerce bots are chitchat queries; if the system cannot reply to them, the conversation may not be able to continue. This action therefore activates the chitchat response generation to obtain a reply.
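The trigger conditions above amount to a priority-ordered dispatch. The sketch below illustrates one way to encode them; the state-dictionary keys (`selected_venue`, `queried_slot`, `results`), the precedence among rules, and the `REQUIRED_SLOTS` threshold for "enough constraints" are assumptions for illustration, not the paper's actual implementation.

```python
REQUIRED_SLOTS = 2  # assumed minimum number of constraints before searching


def decide_action(state: dict) -> str:
    """Map a tracked dialogue state to one of the agent's action types (sketch)."""
    intent = state.get("intent")
    venue = state.get("selected_venue")

    # Venue-anchored actions: a venue must already be selected.
    if venue:
        if intent == "Map direction":
            return "api_call"            # e.g. call a map-directions API
        if intent == "Ask opinion":
            return "review_summary"      # summarize reviews of the venue
        if state.get("queried_slot"):    # slot name mentioned without a value
            return "question_answering"  # look up the venue database

    # Listing: results are available, or the user asked for more options.
    if intent == "Show more" or state.get("results"):
        return "candidate_listing"

    # Recommendation: search if constrained enough, otherwise ask for more.
    if intent == "Recommendation":
        if state.get("domain") and len(state.get("slots", {})) >= REQUIRED_SLOTS:
            return "venue_recommendation"
        return "proactive_questioning"

    # Fallback for utterances with no travel venue seeking related intent.
    return "chitchat"
```

The ordering matters: venue-anchored actions are checked first so that a follow-up question about a selected venue is not mistaken for a fresh search.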
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this paper
Liao, L., Kennedy, L., Wilcox, L., Chua, T.S. (2020). Crowd Knowledge Enhanced Multimodal Conversational Assistant in Travel Domain. In: Ro, Y., et al. (eds.) MultiMedia Modeling. MMM 2020. Lecture Notes in Computer Science, vol. 11961. Springer, Cham. https://doi.org/10.1007/978-3-030-37731-1_33
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-37730-4
Online ISBN: 978-3-030-37731-1