Abstract
Labeled datasets for speech emotion recognition are scarce, because emotion is subjective and expert annotation of emotion categories is time-consuming. The wav2vec 2.0 model, in contrast, learns general speech representations through self-supervised training, so we apply it to the speech emotion recognition task. We propose a multimodal dual-branch transformer network. In the speech branch, we first extract speech features with wav2vec 2.0, applying a fine-tuning strategy and a self-attention-based interlayer feature fusion strategy; a fully convolutional network then performs emotion classification. In the text branch, we use RoBERTa for text emotion recognition, and the two modalities are fused by an improved weighted Dempster–Shafer (DS) evidence combination strategy. In addition, we propose an accuracy-weighted label smoothing method that further improves recognition accuracy. Comprehensive experiments on two benchmarks, the English IEMOCAP corpus and the Chinese CASIA corpus, show that the proposed method achieves higher accuracy than state-of-the-art methods.
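The abstract does not give formulas for the weighted DS fusion or the accuracy-weighted label smoothing, so the following Python sketch shows one plausible reading of each idea. All function names, the log-linear source weighting, and the numeric values (eps, w_speech, w_text) are assumptions for illustration, not the paper's exact method.

```python
import numpy as np


def accuracy_weighted_label_smoothing(true_class, class_accuracies, eps=0.1):
    """One plausible reading of accuracy-weighted label smoothing:
    spread the smoothing mass eps over classes in proportion to their
    error rate instead of uniformly, so the soft target hedges more
    toward classes the model recognizes poorly. (Assumed variant,
    not the paper's exact formulation.)"""
    error = np.clip(1.0 - np.asarray(class_accuracies, dtype=float), 1e-8, None)
    target = eps * error / error.sum()     # smoothing mass, per class
    target[true_class] += 1.0 - eps        # bulk of the mass on the true class
    return target


def weighted_ds_fusion(p_speech, p_text, w_speech=0.6, w_text=0.4):
    """Combine the two branches' class probabilities with Dempster's rule
    over singleton hypotheses. The source weighting here is log-linear
    pooling (p ** w); the paper's weighting scheme may differ."""
    m1 = np.asarray(p_speech, dtype=float) ** w_speech
    m2 = np.asarray(p_text, dtype=float) ** w_text
    m1, m2 = m1 / m1.sum(), m2 / m2.sum()  # renormalize the weighted masses
    joint = m1 * m2                        # mass on which both sources agree
    conflict = 1.0 - joint.sum()           # mass assigned to conflicting pairs
    return joint / (1.0 - conflict)        # Dempster's normalization


# Example with four emotion classes: the speech branch is confident,
# the text branch less so; fusion sharpens where the two agree.
p_speech = np.array([0.70, 0.10, 0.10, 0.10])
p_text = np.array([0.40, 0.30, 0.20, 0.10])
print(weighted_ds_fusion(p_speech, p_text))
print(accuracy_weighted_label_smoothing(0, [0.9, 0.6, 0.7, 0.8]))
```

Under these assumptions, Dempster's rule discards the mass assigned to conflicting class pairs and renormalizes, so the fused distribution concentrates on classes both branches support.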



Data Availability Statement
Data cannot be made available.
Funding
This study was supported by the National Key R&D Program of China under Grant 2020YFC0833102.
Author information
Contributions
YY, CH, YT, and YX contributed to the conception of the study. YY, CH, and YF performed the experiment. YY, YT, CH, and YX contributed significantly to the analysis and manuscript preparation. YY and CH performed the data analyses and wrote the manuscript. YT, YX, and XH helped perform the analysis with constructive discussions. All authors reviewed the manuscript.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
This declaration is not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yi, Y., Tian, Y., He, C. et al. DBT: multimodal emotion recognition based on dual-branch transformer. J Supercomput 79, 8611–8633 (2023). https://doi.org/10.1007/s11227-022-05001-5