HAAN-ERC: hierarchical adaptive attention network for multimodal emotion recognition in conversation

  • Original Article
  • Published in: Neural Computing and Applications

Abstract

Multimodal emotional expressions affect the progress of conversations in complex ways. For multimodal emotion recognition in conversation (ERC), previous studies have focused on modeling only partial speaker and modality influences when inferring emotional states from the historical context, relying on traditional modeling units. However, despite the tremendous success of the Transformer across a broad range of fields, effectively modeling intra- and inter-speaker as well as intra- and inter-modal influences in the historical dialogue context with Transformers has not yet been tackled. In this paper, we propose HAAN-ERC, a novel method that hierarchically exploits dialogue context to model intra-speaker, inter-speaker, intra-modal, and inter-modal influences when inferring the emotional states of speakers. In addition, we propose an adaptive attention mechanism that can be trained in an end-to-end manner and that automatically makes a speaker-specific decision to omit redundant or uninformative utterances from the historical context at multiple hierarchies for adaptive fusion. The performance of HAAN-ERC is comprehensively evaluated on two popular multimodal ERC datasets, IEMOCAP and MELD, where it achieves new state-of-the-art results, demonstrating its effectiveness. Our source code will be made publicly available at https://github.com/TAN-OpenLab/HAAN-ERC.
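To make the adaptive attention idea concrete, the sketch below shows one plausible way a trainable keep/omit decision over historical utterances could be implemented. It is a minimal illustration only, not the authors' implementation (see the linked repository): the module name, tensor shapes, and gating form are our assumptions, and the straight-through Gumbel-softmax relaxation (cf. [28] in the reference list) is used here merely as a standard device for keeping the discrete omission decision differentiable end to end.

```python
# Hedged sketch of a per-speaker keep/omit gate over historical utterances.
# NOT the authors' implementation; names, shapes, and the gating form are
# illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveUtteranceGate(nn.Module):
    def __init__(self, d_model: int, tau: float = 1.0):
        super().__init__()
        # Scores each (historical utterance, current utterance) pair with
        # two logits: [omit, keep].
        self.scorer = nn.Linear(2 * d_model, 2)
        self.tau = tau  # Gumbel-softmax temperature

    def forward(self, history: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # history: (batch, seq_len, d_model) features of past utterances
        # query:   (batch, d_model) feature of the current speaker's utterance
        q = query.unsqueeze(1).expand_as(history)
        logits = self.scorer(torch.cat([history, q], dim=-1))
        # Straight-through Gumbel-softmax: a hard 0/1 keep decision in the
        # forward pass, with soft gradients for end-to-end training.
        keep = F.gumbel_softmax(logits, tau=self.tau, hard=True)[..., 1:]
        # Omitted utterances are zeroed out before attention-based fusion.
        return history * keep


# Usage sketch: gate 10 historical utterances for a batch of 2 speakers.
gate = AdaptiveUtteranceGate(d_model=128)
history = torch.randn(2, 10, 128)
query = torch.randn(2, 128)
filtered = gate(history, query)  # same shape as history, some rows zeroed
```

Because the decision is discrete in the forward pass, each speaker keeps a genuinely different subset of the context, while the relaxed backward pass lets the gate be learned jointly with the rest of the network.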

Data availability statement

The datasets generated or analyzed during the current study are available at https://github.com/TAN-OpenLab/HAAN-ERC, and the original datasets are published in related works [9, 10].

Code availability

The code is available at https://github.com/TAN-OpenLab/HAAN-ERC.

References

  1. Chen F, Sun Z, Ouyang D, Liu X, Shao J (2021) Learning what and when to drop: adaptive multimodal and contextual dynamics for emotion recognition in conversation. In: Proceedings of the 29th ACM international conference on multimedia, Association for Computing Machinery, New York, pp 1064–1073

  2. Hazarika D, Poria S, Mihalcea R, Cambria E, Zimmermann R (2018) ICON: interactive conversational memory network for multimodal emotion detection. In: Proceedings of the 2018 conference on empirical methods in natural language processing, Association for Computational Linguistics, Brussels, pp 2594–2604

  3. Hazarika D, Poria S, Zadeh A, Cambria E, Morency L-P, Zimmermann R (2018) Conversational memory network for emotion recognition in dyadic dialogue videos. In: Proceedings of the 2018 Conference of the North American chapter of the association for computational linguistics: human language technologies, vol 1 (Long Papers), Association for Computational Linguistics, New Orleans, pp 2122–2132

  4. Majumder N, Poria S, Hazarika D, Mihalcea R, Gelbukh AF, Cambria E (2019) Dialoguernn: an attentive RNN for emotion detection in conversations. In: The thirty-third AAAI conference on artificial intelligence, AAAI 2019, The thirty-first innovative applications of artificial intelligence conference, IAAI 2019, The ninth AAAI symposium on educational advances in artificial intelligence, EAAI 2019, Honolulu, January 27–February 1, 2019, pp 6818–6825

  5. Hsu C-C, Chen S-Y, Kuo C-C, Huang T-H, Ku L-W (2018) Emotionlines: an emotion corpus of multi-party conversations. In: Proceedings of the eleventh international conference on language resources and evaluation (LREC 2018). European Language Resources Association (ELRA), Miyazaki

  6. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. Adv Neural Inf Process Syst 30

  7. Han K, Wang Y, Chen H, Chen X, Guo J, Liu Z, Tang Y, Xiao A, Xu C, Xu Y, Yang Z, Zhang Y, Tao D (2022) A survey on vision transformer. IEEE Trans Pattern Anal Mach Intell

  8. Kalyan KS, Rajasekharan A, Sangeetha S (2022) AMMU: a survey of transformer-based biomedical pretrained language models. J Biomed Inf 126:103982

  9. Busso C, Bulut M, Lee C-C, Kazemzadeh A, Mower E, Kim S, Chang JN, Lee S, Narayanan SS (2008) IEMOCAP: interactive emotional dyadic motion capture database. Lang Resour Eval 42(4):335–359

  10. Poria S, Hazarika D, Majumder N, Naik G, Cambria E, Mihalcea R (2019) MELD: a multimodal multi-party dataset for emotion recognition in conversations. In: Korhonen A, Traum DR, Màrquez L (eds) Proceedings of the 57th conference of the association for computational linguistics, ACL 2019, Florence, July 28–August 2, 2019, vol 1 (Long Papers), pp 527–536

  11. Ghosal D, Majumder N, Poria S, Chhaya N, Gelbukh A (2019) Dialoguegcn: a graph convolutional neural network for emotion recognition in conversation. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP)

  12. Zhang D, Wu L, Sun C, Li S, Zhu Q, Zhou G (2019) Modeling both context- and speaker-sensitive dependence for emotion detection in multi-speaker conversations. In: Kraus S (ed) Proceedings of the twenty-eighth international joint conference on artificial intelligence, IJCAI 2019, Macao, August 10–16, 2019, pp 5415–5421

  13. Shen W, Chen J, Quan X, Xie Z (2021) Dialogxl: all-in-one xlnet for multi-party conversation emotion recognition. In: The thirty-fifth AAAI conference on artificial intelligence, AAAI 2021, The thirty-third conference on innovative applications of artificial intelligence, IAAI 2021, The eleventh symposium on educational advances in artificial intelligence, EAAI 2021, Virtual Event, February 2–9, 2021, pp 13789–13797

  14. Hazarika D, Poria S, Zimmermann R, Mihalcea R (2021) Conversational transfer learning for emotion recognition. Inf Fusion 65:1–12

  15. Ghosal D, Majumder N, Gelbukh AF, Mihalcea R, Poria S (2020) COSMIC: commonsense knowledge for emotion identification in conversations. In: Findings of the association for computational linguistics: EMNLP 2020, Online Event, 16–20 November 2020, pp 2470–2481

  16. Jiao W, Lyu MR, King I (2020) Real-time emotion recognition via attention gated hierarchical memory network. In: The thirty-fourth AAAI conference on artificial intelligence, AAAI 2020, The thirty-second innovative applications of artificial intelligence conference, IAAI 2020, The tenth AAAI symposium on educational advances in artificial intelligence, EAAI 2020, New York, February 7–12, 2020, pp 8002–8009

  17. Guo Y, Shi H, Kumar A, Grauman K, Feris R (2019) Spottune: Transfer learning through adaptive fine-tuning. In: 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR)

  18. Ahn C, Kim E, Oh S (2019) Deep elastic networks with model selection for multi-task learning. In: 2019 IEEE/CVF international conference on computer vision (ICCV)

  19. Rosenbaum C, Klinger T, Riemer M (2017) Routing networks: adaptive selection of non-linear functions for multi-task learning. Preprint arXiv:1711.01239

  20. Sun X, Panda R, Feris R, Saenko K (2020) Adashare: learning what to share for efficient deep multi-task learning. Adv Neural Inf Process Syst 33:8728–8740

  21. Zhang T, Huang M, Zhao L (2018) Learning structured representation for text classification via reinforcement learning. In: Proceedings of the AAAI conference on artificial intelligence, vol 32

  22. Veit A, Belongie S (2018) Convolutional networks with adaptive inference graphs. In: Proceedings of the European conference on computer vision (ECCV), pp 3–18

  23. Wu Z, Nagarajan T, Kumar A, Rennie S, Davis LS, Grauman K, Feris R (2018) Blockdrop: dynamic inference paths in residual networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 8817–8826

  24. Zhang D, Li S, Zhu Q, Zhou G (2019) Effective sentiment-relevant word selection for multi-modal sentiment analysis in spoken language. In: Proceedings of the 27th ACM international conference on multimedia, pp 148–156

  25. Panda R, Chen C-FR, Fan Q, Sun X, Saenko K, Oliva A, Feris R (2021) Adamml: adaptive multi-modal learning for efficient video recognition. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 7576–7585

  26. Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V (2019) Roberta: a robustly optimized bert pretraining approach. Preprint arXiv:1907.11692

  27. Eyben F, Wöllmer M, Schuller B (2010) Opensmile: the Munich versatile and fast open-source audio feature extractor. In: Proceedings of the 18th ACM international conference on multimedia, pp 1459–1462

  28. Jang E, Gu S, Poole B (2016) Categorical reparameterization with gumbel-softmax. Preprint arXiv:1611.01144

  29. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR), pp 770–778

  30. Ba JL, Kiros JR, Hinton GE (2016) Layer normalization. Preprint arXiv:1607.06450

  31. Ghosal D, Majumder N, Poria S, Chhaya N, Gelbukh A (2019) Dialoguegcn: a graph convolutional neural network for emotion recognition in conversation. Preprint arXiv:1908.11540

  32. Kingma D, Ba J (2014) Adam: a method for stochastic optimization. Preprint arXiv:1412.6980

  33. Loshchilov I, Hutter F (2017) Decoupled weight decay regularization. Preprint arXiv:1711.05101

Acknowledgements

This work is partially supported by the National Key Research and Development Program of China under Grant No. 2019YFB1405803, and the National Natural Science Foundation of China under Grant No. 61772125.

Author information

Contributions

Not applicable.

Corresponding author

Correspondence to Zhenhua Tan.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Ethics approval

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhang, T., Tan, Z. & Wu, X. HAAN-ERC: hierarchical adaptive attention network for multimodal emotion recognition in conversation. Neural Comput & Applic 35, 17619–17632 (2023). https://doi.org/10.1007/s00521-023-08638-2
