
Hierarchical Multimodal Transformer with Localness and Speaker Aware Attention for Emotion Recognition in Conversations

  • Conference paper
  • First Online:
Natural Language Processing and Chinese Computing (NLPCC 2020)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 12431))

Abstract

Emotion Recognition in Conversations (ERC) aims to predict the emotion of each utterance in a given conversation. Existing approaches to the ERC task mainly suffer from two drawbacks: (1) failing to pay enough attention to the emotional impact of the local context; and (2) ignoring the effect of speakers' emotional inertia. To address these limitations, we first propose a Hierarchical Multimodal Transformer as our base model, and then carefully design a localness-aware attention mechanism and a speaker-aware attention mechanism to capture the impact of the local context and of emotional inertia, respectively. Extensive evaluations on a benchmark dataset demonstrate the superiority of our proposed model over existing multimodal methods for ERC.
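To make the two mechanisms concrete, the following minimal Python sketch shows one plausible way to inject a localness bias and a speaker bias into scaled dot-product attention over a conversation's utterance representations. It is an illustration under assumed forms (a Gaussian distance penalty for localness, in the spirit of Yang et al. [16], and an additive same-speaker bonus for emotional inertia), not the authors' implementation; the function name biased_attention and the parameters sigma and speaker_bias are hypothetical.

# A minimal sketch (not the authors' code): localness-aware and speaker-aware
# biases added to scaled dot-product attention over utterance representations.
# The Gaussian window width `sigma` and the same-speaker bonus `speaker_bias`
# are illustrative assumptions, not values from the paper.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def biased_attention(utt_repr, speakers, sigma=2.0, speaker_bias=1.0):
    """utt_repr: (T, d) utterance vectors; speakers: (T,) speaker ids."""
    T, d = utt_repr.shape
    # Standard scaled dot-product scores (queries = keys = utterances).
    scores = utt_repr @ utt_repr.T / np.sqrt(d)              # (T, T)

    # Localness-aware bias: a Gaussian penalty that grows with the distance
    # between query utterance i and key utterance j, so the local context
    # receives more attention (cf. Yang et al. [16]).
    pos = np.arange(T)
    dist = pos[:, None] - pos[None, :]
    local_bias = -(dist ** 2) / (2.0 * sigma ** 2)           # (T, T), <= 0

    # Speaker-aware bias: an additive bonus for utterances from the same
    # speaker, a simple proxy for that speaker's emotional inertia.
    same_speaker = (speakers[:, None] == speakers[None, :]).astype(float)

    weights = softmax(scores + local_bias + speaker_bias * same_speaker, axis=-1)
    return weights @ utt_repr                                 # (T, d) context vectors

# Toy usage: 5 utterances of dimension 8 from two alternating speakers.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    utts = rng.standard_normal((5, 8))
    spk = np.array([0, 1, 0, 1, 0])
    print(biased_attention(utts, spk).shape)  # (5, 8)

In the full model, biases of this kind would presumably operate inside the conversation-level layers of the hierarchical Transformer rather than as a stand-alone function; the sketch only illustrates how the two biases modify the attention weights.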


Notes

  1. An utterance is typically defined as a unit of speech bounded by breaths or pauses [10].

References

  1. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)

  2. Chen, S.Y., Hsu, C.C., Kuo, C.C., Ku, L.W., et al.: EmotionLines: an emotion corpus of multi-party conversations. arXiv preprint arXiv:1802.08379 (2018)

  3. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)

  4. Ekman, P.: An argument for basic emotions. Cogn. Emotion 6(3–4), 169–200 (1992)


  5. Eyben, F., Wöllmer, M., Schuller, B.: openSMILE: the Munich versatile and fast open-source audio feature extractor. In: Proceedings of the 18th ACM International Conference on Multimedia, pp. 1459–1462 (2010)


  6. Ghosal, D., Majumder, N., Poria, S., Chhaya, N., Gelbukh, A.: DialogueGCN: a graph convolutional neural network for emotion recognition in conversation. arXiv preprint arXiv:1908.11540 (2019)

  7. Hazarika, D., Poria, S., Mihalcea, R., Cambria, E., Zimmermann, R.: ICON: interactive conversational memory network for multimodal emotion detection. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2594–2604 (2018)


  8. Jiao, W., Lyu, M.R., King, I.: Real-time emotion recognition via attention gated hierarchical memory network. arXiv preprint arXiv:1911.09075 (2019)

  9. Majumder, N., Poria, S., Hazarika, D., Mihalcea, R., Gelbukh, A., Cambria, E.: DialogueRNN: an attentive RNN for emotion detection in conversations. Proc. AAAI Conf. Artif. Intell. 33, 6818–6825 (2019)


  10. Olson, D.: From utterance to text: the bias of language in speech and writing. Harvard Educ. Rev. 47(3), 257–281 (1977)


  11. Poria, S., Cambria, E., Hazarika, D., Majumder, N., Zadeh, A., Morency, L.P.: Context-dependent sentiment analysis in user-generated videos. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 873–883 (2017)


  12. Poria, S., Hazarika, D., Majumder, N., Naik, G., Cambria, E., Mihalcea, R.: MELD: a multimodal multi-party dataset for emotion recognition in conversations. arXiv preprint arXiv:1810.02508 (2018)

  13. Tsai, Y.H.H., Bai, S., Liang, P.P., Kolter, J.Z., Morency, L.P., Salakhutdinov, R.: Multimodal transformer for unaligned multimodal language sequences. arXiv preprint arXiv:1906.00295 (2019)

  14. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)


  15. Yang, B., Li, J., Wong, D.F., Chao, L.S., Wang, X., Tu, Z.: Context-aware self-attention networks. Proc. AAAI Conf. Artif. Intell. 33, 387–394 (2019)


  16. Yang, B., Tu, Z., Wong, D.F., Meng, F., Chao, L.S., Zhang, T.: Modeling localness for self-attention networks. arXiv preprint arXiv:1810.10182 (2018)

  17. Yuan, J., Liberman, M.: Speaker identification on the SCOTUS corpus. J. Acoust. Soc. Am. 123(5), 3878 (2008)


  18. Zhang, D., Wu, L., Sun, C., Li, S., Zhu, Q., Zhou, G.: Modeling both context- and speaker-sensitive dependence for emotion detection in multi-speaker conversations. In: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI), pp. 10–16 (2019)


  19. Zhong, P., Wang, D., Miao, C.: Knowledge-enriched transformer for emotion detection in textual conversations. arXiv preprint arXiv:1909.10681 (2019)


Acknowledgments

We would like to thank the three anonymous reviewers for their valuable comments. This work was supported by the Natural Science Foundation of China (No. 61672288). Xiao Jin and Jianfei Yu contributed equally to this paper.

Author information


Corresponding author

Correspondence to Rui Xia.



Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Jin, X., Yu, J., Ding, Z., Xia, R., Zhou, X., Tu, Y. (2020). Hierarchical Multimodal Transformer with Localness and Speaker Aware Attention for Emotion Recognition in Conversations. In: Zhu, X., Zhang, M., Hong, Y., He, R. (eds) Natural Language Processing and Chinese Computing. NLPCC 2020. Lecture Notes in Computer Science, vol 12431. Springer, Cham. https://doi.org/10.1007/978-3-030-60457-8_4

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-60457-8_4

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-60456-1

  • Online ISBN: 978-3-030-60457-8

  • eBook Packages: Computer Science, Computer Science (R0)
