Abstract
Most conversational AI agents in today's marketplace are unimodal: only text is exchanged between the user and the bot. However, employing additional modalities (e.g., images) in the interaction improves the customer experience and can increase efficiency and profits in applications such as online shopping. Most existing techniques extract features from the multimodal inputs, but very few apply the multi-headed attention of transformers to conversational AI. In this work, we propose a novel architecture, the Cross-modal Multi-headed Hierarchical Encoder-Decoder with Sentence Embeddings (CMHRED-SE), to improve the quality of natural-language responses by better capturing features such as color, sentence structure, and the continuity of the conversation. CMHRED-SE combines multi-headed attention with image representations from the VGGNet19 and ResNet50 architectures to improve effectiveness in fashion-domain conversations. CMHRED-SE is compared with two similar models, M-HRED and MHRED-attn, and the quality of the responses generated by the models is evaluated using BLEU-4, ROUGE-L, and cosine similarity scores. The evaluation shows improvements of 5% in cosine similarity, 9% in ROUGE-L F1 score, and 11% in BLEU-4 score over the baseline models. The results also show that, by leveraging sentence embeddings, our approach better understands the conversation and generates clearer textual responses.
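To illustrate the kind of cross-modal fusion the abstract describes, the sketch below shows one way text-side dialogue representations (sentence embeddings) can attend over CNN image features with multi-head attention. This is a minimal PyTorch sketch under assumed dimensions (768-dimensional sentence embeddings, 2048-dimensional ResNet50 features) and assumed module names; it is not the authors' implementation of CMHRED-SE.

```python
# Illustrative sketch only (not the paper's code): cross-modal multi-head
# attention in which the text-side dialogue context attends over CNN image
# features, in the spirit of CMHRED-SE. Dimensions, module names, and the
# use of nn.MultiheadAttention are assumptions made for illustration.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, text_dim=768, image_dim=2048, hidden_dim=512, num_heads=8):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden_dim)    # e.g. sentence embeddings (SBERT-sized)
        self.image_proj = nn.Linear(image_dim, hidden_dim)  # e.g. pooled ResNet50 / VGGNet19 features
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

    def forward(self, text_emb, image_feats):
        # text_emb:    (batch, n_utterances, text_dim)  -- dialogue context utterances
        # image_feats: (batch, n_images, image_dim)     -- product images in the context
        q = self.text_proj(text_emb)
        kv = self.image_proj(image_feats)
        fused, _ = self.attn(query=q, key=kv, value=kv)  # text queries attend over image keys/values
        return fused                                     # cross-modal context passed on to the decoder

# Example shapes: 5 context utterances attending over 3 product images.
text = torch.randn(2, 5, 768)
imgs = torch.randn(2, 3, 2048)
out = CrossModalAttention()(text, imgs)   # -> (2, 5, 512)
```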
Data availability
The datasets analyzed during the current study are publicly available in the MMD repository at https://amritasaha1812.github.io/MMD/_pages/dataset.html.
References
Antol S, Agrawal A, Lu J, Mitchell M, Batra D, Zitnick C L, Parikh D (2015) VQA: Visual question answering. 2015 IEEE International Conference on Computer Vision (ICCV 2015), Santiago, Chile, 2425–2433
Bell S, Bala K (2015) Learning Visual Similarity for Product Design with Convolutional Neural Networks. ACM Trans Graph (TOG) 34(4):1–10
Bojanowski P, Grave E, Joulin A, Mikolov T (2016) Enriching Word Vectors with Subword Information. Trans Assoc Comput Linguist 5:135–146
Joulin A, Grave E, Bojanowski P, Mikolov T (2016) Bag of Tricks for Efficient Text Classification. Proc. of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Valencia, Spain. ACL 2:427–431
Chauhan H, Firdaus M, Ekbal A, Bhattacharyya P (2019) Ordinal and Attribute Aware Response Generation in a Multimodal Dialogue System. Proc. of the 57th Annual Meeting of the Association for Computational Linguistics 5437–5447.
Chen W, Wang W, Liu L, Lew MS (2021) New Ideas and Trends in Deep Multimodal Content Understanding: A Review. Neurocomputing 426:195–215
Das A, Kottur S, Gupta K, Singh A, Yadav D, Moura J M F, Parikh D, Batra D (2017) Visual dialog. Proc. of the IEEE Computer Vision and Pattern Recognition (CVPR), IEEE Xplore Honolulu, HI, USA, 326–335
Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), 1, 4171–4186.
Fatigante M, Zucchermaglio C, Alby F (2021) Being in Place: A Multimodal Analysis of the Contribution of the Patient’s Companion to “First Time” Oncological Visits. Front Psychol 12:57–79. https://doi.org/10.3389/fpsyg.2021.664747
Griol D, Molina JM, de Miguel AS (2014) Developing multimodal conversational agents for an enhanced e-learning experience. Adv Distrib Comput Artif Intell J 3(8):1–13. https://doi.org/10.14201/ADCAIJ2014381326
Han X, Wu Z, Huang P X, Zhang X, Zhu M, Li Y, Zhao Y, Davis L S (2017) Automatic Spatially-Aware Fashion Concept Discovery. 2017 IEEE International Conference on Computer Vision (ICCV), 1472–1480.
He K, Zhang X, Ren S, Sun J (2016) Deep Residual Learning for Image Recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 770–778
Hsiao J -H, Li L -J (2014) On Visual Similarity based Interactive Product Recommendation for Online Shopping. 2014 IEEE International Conference on Image Processing (ICIP) 3038–3041
Jiang S, Rijke M de (2018) Why are sequence-to-sequence models so dull? Understanding the low-diversity problem of chatbots. Proc. of the 2018 EMNLP Workshop on Search-Oriented Conversational AI (SCAI), Brussels, Belgium. 81–86
Kerly A, Hall P, Bull S (2007) Bringing chatbots into education: Towards natural language negotiation of open learner models. Knowl Based Syst 20:177–185
Kingma D P, Ba J (2015) Adam: A method for stochastic optimization. 3rd International Conference on Learning Representations (ICLR 2015), San Diego
Laenen K, Zoghbi S, Moens M-F (2018) Web Search of Fashion Items with Multimodal Querying. Proc. of 11th ACM International Conference on Web Search and Data Mining (WSDM 2018), Marina Del Rey, CA, USA.
Lin C-Y (2004) ROUGE: A Package for Automatic Evaluation of Summaries. In: Text Summarization Branches Out, Barcelona, Spain. ACL, pp 74–81
Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient Estimation of Word Representations in Vector Space. Proc. of Workshop at ICLR. arXiv:1301.3781v1
Mostafazadeh N, Brockett C, Dolan B, Galley M, Gao J, Spithourakis G P, Vanderwende L (2017) Image grounded conversations: Multimodal context for natural question and response generation. Proc. of the Eighth International Joint Conference on Natural Language Processing (IJCNLP), Taipei, Taiwan. 1, 462–472.
Reimers N, Gurevych I (2019) Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks. arXiv preprint, 27 Aug 2019.
Papineni K, Roukos S, Ward T, Zhu W J (2002) BLEU: A method for automatic evaluation of machine translation. Proc. of the 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002), 311–318.
Paranjape A, See A, Kenealy K, Li H, Hardy A, Qi P, Sadagopan K R, Phu N M, Soylu D, Manning C D (2020) Neural generation meets real people: Towards emotionally engaging mixed-initiative conversations. Stanford NLP, 3rd Proceedings of Alexa Prize. arXiv:2008.12348
Pennington J, Socher R, Manning C (2014) GloVe: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar. ACL, 1532–1543.
Rajpurkar P, Zhang J, Lopyrev K, Liang P (2016) SQuAD: 100,000+ questions for machine comprehension of text. Proc. of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), Texas, US. ACL, 2383–2392.
Roccetti M, Marfia G, Salomoni P, Prandi C, Zagari R M, Kengni FLG, Bazzoli F, Montagnani M (2017) Attitudes of Crohn's Disease Patients: Infodemiology Case Study and Sentiment Analysis of Facebook and Twitter Posts. JMIR Public Health Surveill. 3(3) https://doi.org/10.2196/publichealth.7004
Saha A, Khapra M M, Sankaranarayanan K (2018) Towards building large scale multimodal domain-aware conversation systems. Proc. of 32nd AAAI Conference on Artificial Intelligence 696–704.
Sapna C R, Anagha M, Vats K, Baradia K, Khan T, Sarkar S, Roychowdhury S (2019) Recommendence and Fashionsence: online fashion advisor for offline experience. ACM International Conference Proceeding Series, 256–259.
Schaffer S, Reithinger N (2019) Conversation is multimodal: thus conversational user interfaces should be as well. Proc. of the 1st International Conference on Conversational User Interfaces (CUI '19). ACM, New York, NY, USA. Article 12, 1–3.
Serban I V, Sordoni A, Lowe R, Charlin L, Pineau J, Courville A C, Bengio Y (2017) A hierarchical latent variable encoder-decoder model for generating dialogues. Proc. of AAAI, 3295–3301
Agarwal S, Dusek O, Konstas I, Rieser V (2018) Improving context modeling in multimodal dialogue generation. Proc. of 11th International Conference on Natural Language Generation 129–134
Simonyan K, Zisserman A (2015) Very Deep Convolutional Networks for Large-Scale Image Recognition. Proc. of 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA.
Tao C, Gao S, Shang M, Wu W, Zhao D, Yan R (2018) Get the point of my utterance! Learning towards effective responses with a multi-head attention mechanism. Proc. of the 27th International Joint Conference on Artificial Intelligence 4418–4424.
Thomas NT (2016) An e-business chatbot using AIML and LSA, Proc. Int. Conf. Adv. Computing Commun. Informat. (ICACCI), 2740–2742
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser L, Polosukhin I (2017) Attention is All you Need. In Advances in Neural Information Processing Systems 30 (NIPS 2017), 1–11.
Vries H de, Strub F, Chandar S, Pietquin O, Larochelle H, Courville AC (2017) GuessWhat?! Visual object discovery through multimodal dialogue. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 4466–4475.
Xu A, Liu Z, Guo Y, Sinha V, Akkiraju R (2017) A new chatbot for customer service on social media, Proc. CHI Conf. Human Factors Comput. Syst. (CHI) 3506–3510
Zhao B, Feng J, Wu X, Yan S (2017) Memory-Augmented Attribute Manipulation Networks for Interactive Fashion Search. IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017) 6156–6164
Zoghbi S, Heyman G, Gomez JC, Moens M-F (2016) Fashion Meets Computer Vision and NLP at e-Commerce Search. Int J Comput Elec Eng (IJCEE) 8(1):31–43
Funding
No funding was received for conducting this study.
Ethics declarations
Conflict of interests
The authors have no competing interests to declare that are relevant to the content of this article.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Belagur, H., Reddy, N.S., Krishna, P.R. et al. Cross-modal multi-headed attention for long multimodal conversations. Multimed Tools Appl 82, 45679–45697 (2023). https://doi.org/10.1007/s11042-023-15606-4