Abstract
We present a new solution for building a crowd-knowledge-enhanced multimodal conversational system for travel. It aims to assist users in completing various travel-related tasks, such as searching for restaurants or things to do, through multimodal conversations involving both text and images. To achieve this goal, we ground this research in a combination of multimodal understanding and recommendation techniques, exploring the possibility of a more convenient information-seeking paradigm. Specifically, we build the system in a modular manner, where the construction of each module is enriched with crowd knowledge from social sites. To the best of our knowledge, this is the first work that attempts to build an intelligent multimodal conversational system for travel, and it moves an important step towards developing human-like assistants for the completion of daily-life tasks. Several current challenges are also pointed out as future directions.
References
Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
Bordes, A., Weston, J.: Learning end-to-end goal-oriented dialog. In: The 3rd International Conference on Learning Representations, pp. 1–14 (2016)
Budzianowski, P., et al.: MultiWOZ - a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. In: EMNLP, pp. 5016–5026 (2018)
Chaudhuri, S., Ganjam, K., Ganti, V., Motwani, R.: Robust and efficient fuzzy match for online data cleaning. In: SIGMOD, pp. 313–324. ACM (2003)
Chen, Y.N., Wang, W.Y., Rudnicky, A.I.: Leveraging frame semantics and distributional semantics for unsupervised semantic slot induction in spoken dialogue systems. In: 2014 IEEE Spoken Language Technology Workshop, pp. 584–589 (2014)
Ester, M., et al.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD, vol. 96, pp. 226–231 (1996)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778 (2016)
Li, R., Kahou, S.E., Schulz, H., Michalski, V., Charlin, L., Pal, C.: Towards deep conversational recommendations. In: NIPS, pp. 9748–9758 (2018)
Liao, L., He, X., Ren, Z., Nie, L., Xu, H., Chua, T.S.: Representativeness-aware aspect analysis for brand monitoring in social media. In: IJCAI, pp. 310–316 (2017)
Liao, L., Takanobu, R., Ma, Y., Yang, X., Huang, M., Chua, T.S.: Deep conversational recommender in travel. arXiv preprint arXiv:1907.00710 (2019)
Liu, B., Lane, I.: Attention-based recurrent neural network models for joint intent detection and slot filling. arXiv preprint arXiv:1609.01454 (2016)
Madotto, A., Wu, C.S., Fung, P.: Mem2seq: effectively incorporating knowledge bases into end-to-end task-oriented dialog systems. In: ACL, pp. 1468–1478 (2018)
Rieser, V., Lemon, O.: Natural language generation as planning under uncertainty for spoken dialogue systems. In: Krahmer, E., Theune, M. (eds.) EACL/ENLG -2009. LNCS (LNAI), vol. 5790, pp. 105–120. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15573-4_6
Sukhbaatar, S., et al.: End-to-end memory networks. In: NIPS, pp. 2440–2448 (2015)
Sun, Y., Zhang, Y.: Conversational recommender system. In: SIGIR, pp. 235–244 (2018)
Tur, G., Jeong, M., Wang, Y.Y., Hakkani-Tür, D., Heck, L.: Exploiting the semantic web for unsupervised natural language semantic parsing. In: Thirteenth Annual Conference of the International Speech Communication Association (2012)
Wen, T.H., et al.: A network-based end-to-end trainable task-oriented dialogue system. In: EACL, pp. 438–449 (2017)
Yan, Z., Duan, N., Chen, P., Zhou, M., Zhou, J., Li, Z.: Building task-oriented dialogue systems for online shopping. In: AAAI, pp. 4618–4625 (2017)
Appendices
Appendix A: State Tracking
State tracking refers to the maintenance of the dialogue state \(\mathcal {S}_t\), which represents the conversation session up to time t. Based on the state \(\mathcal {S}_{t-1}\) at the previous time step and the multimodal understanding result \(\mathcal {U}_t\) for the utterance at time step t, the dialogue state is obtained as follows:

$$\mathcal {S}_t = \mathcal {G}(\mathcal {S}_{t-1}, \mathcal {U}_t),$$

where \(\mathcal {G}\) refers to a set of rules. We generally summarize the rules as below:
- (1) if \(\mathcal {M}_t = Chitchat\), then \(\mathcal {S}_t = \mathcal {S}_{t-1}\);
- (2) if the domain \(\mathcal {D}_t\) has changed, \(\mathcal {S}_t\) is rebuilt entirely from \(\mathcal {U}_t\);
- (3) if the domain \(\mathcal {D}_t\) is unchanged and \(\mathcal {M}_t \ne Negation\), \(\mathcal {S}_t\) inherits the information stored in \(\mathcal {S}_{t-1}\);
- (4) if the domain \(\mathcal {D}_t\) is unchanged and \(\mathcal {M}_t = Negation\), \(\mathcal {S}_t\) inherits the information stored in \(\mathcal {S}_{t-1}\) while updating the negated parts according to \(\mathcal {U}_t\);
- (5) if the time interval between two consecutive utterances exceeds a pre-defined length at time t, then \(\mathcal {S}_t\) is cleared.
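The five rules above can be sketched as a small state-update function. This is a minimal illustration, not the authors' implementation; the `DialogueState` container, the slot-dictionary representation of \(\mathcal {U}_t\), and the `SESSION_TIMEOUT` value are all assumptions made for the example. In particular, rule (3) is interpreted here as inheriting old slot values while admitting newly mentioned slots, and rule (4) as overwriting the negated slots.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

SESSION_TIMEOUT = 300  # assumed "pre-defined length" between utterances, in seconds


@dataclass
class DialogueState:
    """Hypothetical container for S_t: active domain plus slot constraints."""
    domain: Optional[str] = None
    slots: Dict[str, str] = field(default_factory=dict)


def track_state(prev: DialogueState, understanding: dict,
                elapsed: float) -> DialogueState:
    """Apply rules (1)-(5): S_t = G(S_{t-1}, U_t)."""
    # Rule (5): a stale session is cleared before anything else.
    if elapsed > SESSION_TIMEOUT:
        prev = DialogueState()

    intent = understanding.get("intent")
    # Rule (1): chitchat leaves the state untouched.
    if intent == "Chitchat":
        return prev

    new_domain = understanding.get("domain")
    # Rule (2): a domain switch rebuilds the state entirely from U_t.
    if new_domain is not None and new_domain != prev.domain:
        return DialogueState(domain=new_domain,
                             slots=dict(understanding.get("slots", {})))

    slots = dict(prev.slots)
    if intent == "Negation":
        # Rule (4): inherit S_{t-1} but overwrite the negated parts from U_t.
        slots.update(understanding.get("slots", {}))
    else:
        # Rule (3): inherit S_{t-1}; new slots from U_t fill only empty entries.
        for key, value in understanding.get("slots", {}).items():
            slots.setdefault(key, value)
    return DialogueState(domain=prev.domain or new_domain, slots=slots)
```

For instance, a `Negation` utterance such as "no, Japanese instead" would overwrite only the `cuisine` slot while keeping the rest of the session state intact.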
Tracking dialogue states is key to improving the user experience in multi-turn conversations. The main reason we do not follow previous works in learning statistical tracking models is the lack of dialogue data to train them. We leave leveraging session-level labeled dialogue data to improve state tracking as future work.
Appendix B: Action Decision
At each turn of the conversation between user and agent, the dialogue management module takes the current state tracking result as input and outputs the corresponding actions. Due to the lack of large-scale dialogue training data, we again resort to a set of rules. The main action types considered and the conditions for triggering them are as below:
- Proactive Questioning. This action is triggered when (a) a recommendation intent is detected, (b) a domain is detected, and (c) not enough constraints or attributes are detected in \(\mathcal {S}_t\). It is used to obtain more constraints or attributes to narrow down the search space.
- Candidate Listing. This action is triggered when recommendation results are obtained or the Show more intent is detected. As each venue in our dataset is associated with Foursquare photos, we implement candidate listing via a list of images, where each image corresponds to a venue. In the interface, the user can conveniently choose a venue by simply clicking its corresponding image.
- Venue Recommendation. This action is triggered when the intent \(\mathcal {I}_t\) in \(\mathcal {S}_t\) is Recommendation; it retrieves results from the recommendation module.
- Question Answering. This action is triggered when a venue is selected and one of its slot names is detected in \(\mathcal {U}_t\) without an accompanying value. It returns the missing attribute value by looking up the venue database.
- Review Summary. This action is triggered when a venue is selected and the intent \(\mathcal {I}_t\) in \(\mathcal {S}_t\) is Ask opinion. It summarizes the reviews of the target venue and presents them in an organized form.
- API Call. This action is triggered when a venue is selected and the Map direction intent is present in \(\mathcal {I}_t\); the Google Map API is then called with the start position and destination. Currently, only the Map API is integrated, but other APIs such as weather reports could be integrated with proper modifications.
- Chitchat. This action is triggered when no travel venue seeking related intent is detected. As pointed out by [18], nearly 80% of utterances to e-commerce bots are chitchat queries; if the system cannot reply to them, the conversation may not be able to continue. This action therefore activates the chitchat response generation to obtain a reply.
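The trigger conditions above amount to a priority-ordered dispatch. The sketch below illustrates one way to encode them; the state-dictionary keys (`selected_venue`, `queried_slot`, `results`), the precedence among rules, and the `REQUIRED_SLOTS` threshold for "enough constraints" are assumptions for illustration, not the paper's actual implementation.

```python
REQUIRED_SLOTS = 2  # assumed minimum number of constraints before searching


def decide_action(state: dict) -> str:
    """Map a tracked dialogue state to one of the agent's action types (sketch)."""
    intent = state.get("intent")
    venue = state.get("selected_venue")

    # Venue-anchored actions: a venue must already be selected.
    if venue:
        if intent == "Map direction":
            return "api_call"            # e.g. call a map-directions API
        if intent == "Ask opinion":
            return "review_summary"      # summarize reviews of the venue
        if state.get("queried_slot"):    # slot name mentioned without a value
            return "question_answering"  # look up the venue database

    # Listing: results are available, or the user asked for more options.
    if intent == "Show more" or state.get("results"):
        return "candidate_listing"

    # Recommendation: search if constrained enough, otherwise ask for more.
    if intent == "Recommendation":
        if state.get("domain") and len(state.get("slots", {})) >= REQUIRED_SLOTS:
            return "venue_recommendation"
        return "proactive_questioning"

    # Fallback for utterances with no travel venue seeking related intent.
    return "chitchat"
```

The ordering matters: venue-anchored actions are checked first so that a follow-up question about a selected venue is not mistaken for a fresh search.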
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this paper
Liao, L., Kennedy, L., Wilcox, L., Chua, T.S. (2020). Crowd Knowledge Enhanced Multimodal Conversational Assistant in Travel Domain. In: Ro, Y., et al. (eds.) MultiMedia Modeling. MMM 2020. Lecture Notes in Computer Science, vol. 11961. Springer, Cham. https://doi.org/10.1007/978-3-030-37731-1_33
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-37730-4
Online ISBN: 978-3-030-37731-1