Abstract
Multimodal emotional expressions shape the progress of conversations in complex ways. For multimodal emotion recognition in conversation (ERC), previous studies focus on modeling partial influences of speaker and modality to infer emotional states from the historical context, relying on traditional modeling units. However, despite the tremendous success of the Transformer across broad fields, how to effectively model intra- and inter-speaker as well as intra- and inter-modal influences in the historical dialogue context with a Transformer has not yet been tackled. In this paper, we propose a novel methodology, HAAN-ERC, which hierarchically uses dialogue context information to model intra-speaker, inter-speaker, intra-modal, and inter-modal influences to infer the emotional state of speakers. Meanwhile, we propose an adaptive attention mechanism, trainable in an end-to-end manner, that automatically makes a unique decision for each speaker to omit redundant or valueless utterances from the historical context at multiple hierarchies for adaptive fusion. The performance of HAAN-ERC is comprehensively evaluated on two popular multimodal ERC datasets, IEMOCAP and MELD, where it achieves new state-of-the-art results. These encouraging results demonstrate the validity of HAAN-ERC. Our source code will be publicly available at https://github.com/TAN-OpenLab/HAAN-ERC.
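The adaptive gating idea in the abstract, attending over dialogue history while learning end to end which utterances to drop, can be made concrete with a short sketch. The snippet below is a minimal illustration, not the authors' implementation: the module name, the dimensions, and the use of the straight-through Gumbel-softmax trick to make the discrete keep/drop decision differentiable are all assumptions introduced here for illustration.

```python
# A minimal, illustrative sketch (not the HAAN-ERC implementation) of
# attention over historical utterances combined with a learnable binary
# gate that can omit redundant context. All names and hyperparameters
# are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveContextAttention(nn.Module):
    def __init__(self, dim: int, tau: float = 1.0):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        # Per-utterance logits over two classes: {keep, drop}.
        self.gate = nn.Linear(dim, 2)
        self.tau = tau  # Gumbel-softmax temperature.

    def forward(self, query, context):
        # query:   (batch, 1, dim)  current utterance representation
        # context: (batch, T, dim)  historical utterance representations
        gate_logits = self.gate(context)                       # (batch, T, 2)
        # Hard discrete decision with straight-through gradients.
        keep = F.gumbel_softmax(gate_logits, tau=self.tau, hard=True)[..., 0]
        # For simplicity, dropped utterances are zeroed out; a fuller
        # version could exclude them via an attention mask instead.
        gated = context * keep.unsqueeze(-1)                   # (batch, T, dim)
        out, _ = self.attn(query, gated, gated)                # (batch, 1, dim)
        return out

# Usage: fuse a current utterance with its gated dialogue history.
x = torch.randn(8, 1, 256)       # current utterance, one per dialogue
hist = torch.randn(8, 12, 256)   # 12 historical utterances each
fused = AdaptiveContextAttention(256)(x, hist)
print(fused.shape)               # torch.Size([8, 1, 256])
```

In the paper's full model, such gating would be applied at multiple hierarchies (speaker-level and modality-level) rather than once as shown here.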
Code availability
The code is available at https://github.com/TAN-OpenLab/HAAN-ERC.
Acknowledgements
This work is partially supported by the National Key Research and Development Program of China under Grant No. 2019YFB1405803, and the National Natural Science Foundation of China under Grant No. 61772125.
Author information
Contributions
Not applicable.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Ethics approval
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, T., Tan, Z. & Wu, X. HAAN-ERC: hierarchical adaptive attention network for multimodal emotion recognition in conversation. Neural Comput & Applic 35, 17619–17632 (2023). https://doi.org/10.1007/s00521-023-08638-2