MCIC: Multimodal Conversational Intent Classification for E-commerce Customer Service

Yuan, Shaozu; Shen, Xin; Zhao, Yuming; Liu, Hang; Yan, Zhiling; Liu, Ruixue; Chen, Meng

doi:10.1007/978-3-031-17120-8_58

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13551)

Included in the following conference series:

CCF International Conference on Natural Language Processing and Chinese Computing

Abstract

Conversational intent classification (CIC) plays a significant role in dialogue understanding, but most previous work focuses only on the text modality. In real E-commerce customer service conversations, however, users often send images (screenshots and photos) alongside text, which makes multimodal CIC a challenging task for customer service systems. Understanding the intent of a multimodal conversation requires understanding the content of both the text and the images. In this paper, we construct a large-scale dataset for multimodal CIC in the Chinese E-commerce scenario, named MCIC, which contains more than 30,000 multimodal dialogues annotated with image categories, OCR text (the text contained in images), and intent labels. To fuse visual and textual information effectively, we design two vision-language baselines that integrate either images or OCR text with the dialogue utterances. Experimental results verify that both text and images are important for CIC in E-commerce customer service.
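
As a rough illustration of the OCR-text route mentioned in the abstract, the sketch below shows one way a multimodal dialogue could be flattened into a single text sequence for a text-only intent classifier. It is a minimal sketch under stated assumptions, not the authors' implementation: the `Utterance` fields, the speaker labels, the `[IMG:...]` marker, and the `[SEP]` separator are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Utterance:
    """One dialogue utterance; an image is treated as an utterance too."""
    speaker: str                          # hypothetical labels: "user" or "staff"
    text: Optional[str] = None            # None when the utterance is an image
    image_category: Optional[str] = None  # hypothetical value, e.g. "screenshot"
    ocr_text: Optional[str] = None        # text recognised inside the image


def flatten_dialogue(utterances: List[Utterance], sep: str = " [SEP] ") -> str:
    """Join utterances into one string, replacing images by category + OCR text."""
    parts = []
    for u in utterances:
        if u.text is not None:
            content = u.text
        else:
            # Image utterance: stand in its category marker and any OCR text.
            marker = f"[IMG:{u.image_category or 'unknown'}]"
            content = f"{marker} {u.ocr_text or ''}".strip()
        parts.append(f"{u.speaker}: {content}")
    return sep.join(parts)


dialogue = [
    Utterance("user", text="Where is my order?"),
    Utterance("user", image_category="screenshot", ocr_text="Order 123 shipped"),
    Utterance("staff", text="Let me check."),
]
print(flatten_dialogue(dialogue))
```

The flattened string could then be passed to any text classifier; a baseline that consumes the images directly would need visual features in place of the OCR substitution shown here.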

Notes

1. https://www.jd.com/
2. An image in a session is regarded as an utterance in our multimodal dataset.
3. A “turn” in a conversation is one back-and-forth interaction: the user speaks and the staff follows, or vice versa.
4. Because of space limitations, we show only part of the context in the figure.

Author information

Authors and Affiliations

JD AI, Beijing, China
Shaozu Yuan, Xin Shen, Yuming Zhao, Hang Liu, Zhiling Yan, Ruixue Liu & Meng Chen
Australian National University, Canberra, Australia
Xin Shen

Corresponding author

Correspondence to Meng Chen.

Editor information

Editors and Affiliations

Singapore University of Technology and Design, Singapore, Singapore
Wei Lu
Nanjing University, Nanjing, China
Shujian Huang
Soochow University, Suzhou, China
Yu Hong
Soochow University, Suzhou, China
Xiabing Zhou

Cite this paper

Yuan, S. et al. (2022). MCIC: Multimodal Conversational Intent Classification for E-commerce Customer Service. In: Lu, W., Huang, S., Hong, Y., Zhou, X. (eds) Natural Language Processing and Chinese Computing. NLPCC 2022. Lecture Notes in Computer Science, vol 13551. Springer, Cham. https://doi.org/10.1007/978-3-031-17120-8_58

DOI: https://doi.org/10.1007/978-3-031-17120-8_58
Published: 24 September 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-17119-2
Online ISBN: 978-3-031-17120-8
eBook Packages: Computer Science, Computer Science (R0)
