Abstract
Multimodal emotional expressions shape the progress of conversations in complex ways. For multimodal emotion recognition in conversation (ERC), previous studies focus on modeling partial influences of speaker and modality to infer emotional states from the historical context, relying on traditional modeling units. However, despite the tremendous success of the Transformer across broad fields, how to effectively model intra- and inter-speaker as well as intra- and inter-modal influences in the historical dialogue context with a Transformer has not yet been tackled. In this paper, we propose a novel methodology, HAAN-ERC, which hierarchically uses dialogue context information to model intra-speaker, inter-speaker, intra-modal, and inter-modal influences to infer the emotional state of speakers. Meanwhile, we propose an adaptive attention mechanism, trainable in an end-to-end manner, that automatically makes a unique decision for each speaker to omit redundant or valueless utterances from the historical context at multiple hierarchies for adaptive fusion. The performance of HAAN-ERC is comprehensively evaluated on two popular multimodal ERC datasets, IEMOCAP and MELD, where it achieves new state-of-the-art results. These encouraging results demonstrate the validity of HAAN-ERC. Our source code will be publicly available at https://github.com/TAN-OpenLab/HAAN-ERC.
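The adaptive gating idea in the abstract, attending over dialogue history while learning end to end which utterances to drop, can be made concrete with a short sketch. The snippet below is a minimal illustration, not the authors' implementation: the module name, the dimensions, and the use of the straight-through Gumbel-softmax trick to make the discrete keep/drop decision differentiable are all assumptions introduced here for illustration.

```python
# A minimal, illustrative sketch (not the HAAN-ERC implementation) of
# attention over historical utterances combined with a learnable binary
# gate that can omit redundant context. All names and hyperparameters
# are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveContextAttention(nn.Module):
    def __init__(self, dim: int, tau: float = 1.0):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        # Per-utterance logits over two classes: {keep, drop}.
        self.gate = nn.Linear(dim, 2)
        self.tau = tau  # Gumbel-softmax temperature.

    def forward(self, query, context):
        # query:   (batch, 1, dim)  current utterance representation
        # context: (batch, T, dim)  historical utterance representations
        gate_logits = self.gate(context)                       # (batch, T, 2)
        # Hard discrete decision with straight-through gradients.
        keep = F.gumbel_softmax(gate_logits, tau=self.tau, hard=True)[..., 0]
        # For simplicity, dropped utterances are zeroed out; a fuller
        # version could exclude them via an attention mask instead.
        gated = context * keep.unsqueeze(-1)                   # (batch, T, dim)
        out, _ = self.attn(query, gated, gated)                # (batch, 1, dim)
        return out

# Usage: fuse a current utterance with its gated dialogue history.
x = torch.randn(8, 1, 256)       # current utterance, one per dialogue
hist = torch.randn(8, 12, 256)   # 12 historical utterances each
fused = AdaptiveContextAttention(256)(x, hist)
print(fused.shape)               # torch.Size([8, 1, 256])
```

In the paper's full model, such gating would be applied at multiple hierarchies (speaker-level and modality-level) rather than once as shown here.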
Code availability
The code is available at https://github.com/TAN-OpenLab/HAAN-ERC.
Acknowledgements
This work is partially supported by the National Key Research and Development Program of China under Grant No. 2019YFB1405803, and the National Natural Science Foundation of China under Grant No. 61772125.
Author information
Contributions
Not applicable.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Ethics approval
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, T., Tan, Z. & Wu, X. HAAN-ERC: hierarchical adaptive attention network for multimodal emotion recognition in conversation. Neural Comput & Applic 35, 17619–17632 (2023). https://doi.org/10.1007/s00521-023-08638-2