Abstract
Most conversational AI agents in today's marketplace are unimodal: only text is exchanged between the user and the bot. However, employing additional modalities (e.g., images) in the interaction improves the customer experience and can increase efficiency and profits in applications such as online shopping. Most existing techniques extract features from the multimodal inputs, but very few apply the multi-headed attention of transformers to conversational AI. In this work, we propose a novel architecture, the Cross-modal Multi-headed Hierarchical Encoder-Decoder with Sentence Embeddings (CMHRED-SE), to improve the quality of natural-language responses by better capturing features such as color, sentence structure, and the continuity of the conversation. CMHRED-SE combines multi-headed attention with image representations from the VGGNet19 and ResNet50 architectures to improve effectiveness in fashion-domain conversations. CMHRED-SE is compared with two similar models, M-HRED and MHRED-attn, and the quality of the responses generated by the models is evaluated using BLEU-4, ROUGE-L, and cosine similarity scores. The evaluation shows improvements of 5% in cosine similarity, 9% in ROUGE-L F1 score, and 11% in BLEU-4 score over the baseline models. The results also show that, by leveraging sentence embeddings, our approach better understands the conversation and generates clearer textual responses.
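To illustrate the kind of cross-modal fusion the abstract describes, the sketch below shows one way text-side dialogue representations (sentence embeddings) can attend over CNN image features with multi-head attention. This is a minimal PyTorch sketch under assumed dimensions (768-dimensional sentence embeddings, 2048-dimensional ResNet50 features) and assumed module names; it is not the authors' implementation of CMHRED-SE.

```python
# Illustrative sketch only (not the paper's code): cross-modal multi-head
# attention in which the text-side dialogue context attends over CNN image
# features, in the spirit of CMHRED-SE. Dimensions, module names, and the
# use of nn.MultiheadAttention are assumptions made for illustration.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, text_dim=768, image_dim=2048, hidden_dim=512, num_heads=8):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden_dim)    # e.g. sentence embeddings (SBERT-sized)
        self.image_proj = nn.Linear(image_dim, hidden_dim)  # e.g. pooled ResNet50 / VGGNet19 features
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

    def forward(self, text_emb, image_feats):
        # text_emb:    (batch, n_utterances, text_dim)  -- dialogue context utterances
        # image_feats: (batch, n_images, image_dim)     -- product images in the context
        q = self.text_proj(text_emb)
        kv = self.image_proj(image_feats)
        fused, _ = self.attn(query=q, key=kv, value=kv)  # text queries attend over image keys/values
        return fused                                     # cross-modal context passed on to the decoder

# Example shapes: 5 context utterances attending over 3 product images.
text = torch.randn(2, 5, 768)
imgs = torch.randn(2, 3, 2048)
out = CrossModalAttention()(text, imgs)   # -> (2, 5, 512)
```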
Data availability
The datasets analyzed during the current study are publicly available in the MMD repository at https://amritasaha1812.github.io/MMD/_pages/dataset.html.
References
Antol S, Agrawal A, Lu J, Mitchell M, Batra D, Zitnick C L, Parikh D (2015) VQA: Visual question answering. 2015 IEEE International Conference on Computer Vision (ICCV 2015), Santiago, Chile, 2425–2433
Bell S, Bala K (2015) Learning Visual Similarity for Product Design with Convolutional Neural Networks. ACM Trans Graph (TOG) 34(4):1–10
Bojanowski P, Grave E, Joulin A, Mikolov T (2016) Enriching Word Vectors with Subword Information. Trans Assoc Comput Linguist 5:135–146
Joulin A, Grave E, Bojanowski P, Mikolov T (2016) Bag of Tricks for Efficient Text Classification. Proc. of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Valencia, Spain. ACL 2:427–431
Chauhan H, Firdaus M, Ekbal A, Bhattacharyya P (2019) Ordinal and Attribute Aware Response Generation in a Multimodal Dialogue System. Proc. of the 57th Annual Meeting of the Association for Computational Linguistics 5437–5447.
Chen W, Wang W, Liu L, Lew MS (2021) New Ideas and Trends in Deep Multimodal Content Understanding: A Review. Neurocomputing 426:195–215
Das A, Kottur S, Gupta K, Singh A, Yadav D, Moura J M F, Parikh D, Batra D (2017) Visual dialog. Proc. of the IEEE Computer Vision and Pattern Recognition (CVPR), IEEE Xplore Honolulu, HI, USA, 326–335
Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), 1, 4171–4186.
Fatigante M, Zucchermaglio C, Alby F (2021) Being in Place: A Multimodal Analysis of the Contribution of the Patient’s Companion to “First Time” Oncological Visits. Front Psychol 12:57–79. https://doi.org/10.3389/fpsyg.2021.664747
Griol D, Molina JM, de Miguel AS (2014) Developing multimodal conversational agents for an enhanced e-learning experience. Adv Distrib Comput Artif Intell J 3(8):1–13. https://doi.org/10.14201/ADCAIJ2014381326
Han X, Wu Z, Huang P X, Zhang X, Zhu M, Li Y, Zhao Y, Davis L S (2017) Automatic Spatially-Aware Fashion Concept Discovery. 2017 IEEE International Conference on Computer Vision (ICCV), 1472–1480.
He K, Zhang X, Ren S, Sun J (2016) Deep Residual Learning for Image Recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 770–778
Hsiao J -H, Li L -J (2014) On Visual Similarity based Interactive Product Recommendation for Online Shopping. 2014 IEEE International Conference on Image Processing (ICIP) 3038–3041
Jiang S, Rijke M de (2018) Why are sequence-to-sequence models so dull? Understanding the low-diversity problem of chatbots. Proc. of the 2018 EMNLP Workshop on Search-Oriented Conversational AI (SCAI), Brussels, Belgium. 81–86
Kerly A, Hall P, Bull S (2007) Bringing chatbots into education: Towards natural language negotiation of open learner models. Knowl Based Syst 20:177–185
Kingma D P, Ba J (2015) Adam: A method for stochastic optimization. 3rd International Conference on Learning Representations (ICLR 2015), San Diego
Laenen K, Zoghbi S, Moens M-F (2018) Web Search of Fashion Items with Multimodal Querying. Proc. of 11th ACM International Conference on Web Search and Data Mining (WSDM 2018), Marina Del Rey, CA, USA.
Lin C-Y (2004) ROUGE: A Package for Automatic Evaluation of Summaries. In: Text Summarization Branches Out, Barcelona, Spain. ACL, pp 74–81
Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient Estimation of Word Representations in Vector Space. Proc. of Workshop at ICLR. arXiv:1301.3781v1
Mostafazadeh N, Brockett C, Dolan B, Galley M, Gao J, Spithourakis G P, Vanderwende L (2017) Image grounded conversations: Multimodal context for natural question and response generation. Proc. of the Eighth International Joint Conference on Natural Language Processing (IJCNLP), Taipei, Taiwan. 1, 462–472.
Reimers N, Gurevych I (2019) Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks. arXiv preprint, 27 Aug 2019.
Papineni K, Roukos S, Ward T, Zhu W J (2002) BLEU: A method for automatic evaluation of machine translation. Proc. of the 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002), 311–318.
Paranjape A, See A, Kenealy K, Li H, Hardy A, Qi P, Sadagopan K R, Phu N M, Soylu D, Manning C D (2020) Neural generation meets real people: Towards emotionally engaging mixed-initiative conversations. Stanford NLP, 3rd Proceedings of Alexa Prize. arXiv:2008.12348
Pennington J, Socher R, Manning C (2014) GloVe: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar. ACL, 1532–1543.
Rajpurkar P, Zhang J, Lopyrev K, Liang P (2016) SQuAD: 100,000+ questions for machine comprehension of text. Proc. of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), Texas, US. ACL, 2383–2392.
Roccetti M, Marfia G, Salomoni P, Prandi C, Zagari R M, Kengni FLG, Bazzoli F, Montagnani M (2017) Attitudes of Crohn's Disease Patients: Infodemiology Case Study and Sentiment Analysis of Facebook and Twitter Posts. JMIR Public Health Surveill. 3(3) https://doi.org/10.2196/publichealth.7004
Saha A, Khapra M M, Sankaranarayanan K (2018) Towards building large scale multimodal domain-aware conversation systems. Proc. of 32nd AAAI Conference on Artificial Intelligence 696–704.
Sapna C R, Anagha M, Vats K, Baradia K, Khan T, Sarkar S, Roychowdhury S (2019) Recommendence and Fashionsence: online fashion advisor for offline experience. ACM International Conference Proceeding Series, 256–259.
Schaffer S, Reithinger N (2019) Conversation is multimodal: thus conversational user interfaces should be as well. Proc. of the 1st International Conference on Conversational User Interfaces (CUI '19). ACM, New York, NY, USA. Article 12, 1–3.
Serban I V, Sordoni A, Lowe R, Charlin L, Pineau J, Courville A C, Bengio Y (2017) A hierarchical latent variable encoder-decoder model for generating dialogues. Proc. of AAAI, 3295–3301
Agarwal S, Dusek O, Konstas I, Rieser V (2018) Improving context modeling in multimodal dialogue generation. Proc. of 11th International Conference on Natural Language Generation 129–134
Simonyan K, Zisserman A (2015) Very Deep Convolutional Networks for Large-Scale Image Recognition. Proc. of 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA.
Tao C, Gao S, Shang M, Wu W, Zhao D, Yan R (2018) Get the point of my utterance! Learning towards effective responses with a multi-head attention mechanism. Proc. of the 27th International Joint Conference on Artificial Intelligence 4418–4424.
Thomas NT (2016) An e-business chatbot using AIML and LSA, Proc. Int. Conf. Adv. Computing Commun. Informat. (ICACCI), 2740–2742
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser L, Polosukhin I (2017) Attention is All you Need. In Advances in Neural Information Processing Systems 30 (NIPS 2017), 1–11.
Vries H de, Strub F, Chandar S, Pietquin O, Larochelle H, Courville AC (2017) GuessWhat?! Visual object discovery through multimodal dialogue. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 4466–4475.
Xu A, Liu Z, Guo Y, Sinha V, Akkiraju R (2017) A new chatbot for customer service on social media, Proc. CHI Conf. Human Factors Comput. Syst. (CHI) 3506–3510
Zhao B, Feng J, Wu X, Yan S (2017) Memory-Augmented Attribute Manipulation Networks for Interactive Fashion Search. IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017) 6156–6164
Zoghbi S, Heyman G, Gomez JC, Moens M-F (2016) Fashion Meets Computer Vision and NLP at e-Commerce Search. Int J Comput Elec Eng (IJCEE) 8(1):31–43
Funding
No funding was received for conducting this study.
Ethics declarations
Conflict of interests
The authors have no competing interests to declare that are relevant to the content of this article.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Belagur, H., Reddy, N.S., Krishna, P.R. et al. Cross-modal multi-headed attention for long multimodal conversations. Multimed Tools Appl 82, 45679–45697 (2023). https://doi.org/10.1007/s11042-023-15606-4