Abstract
Labeled datasets for speech emotion recognition are scarce, because emotion is subjective and expert annotation of emotion categories is time-consuming. The wav2vec 2.0 model, in contrast, learns general speech representations through self-supervised training, so we apply it to the speech emotion recognition task. We propose a multimodal dual-branch transformer network. In the speech branch, we first extract speech features with wav2vec 2.0, applying a fine-tuning strategy and a self-attention-based interlayer feature fusion strategy; a fully convolutional network then performs emotion classification. In the text branch, we use RoBERTa for text emotion recognition, and the two modalities are fused by an improved weighted Dempster–Shafer (DS) evidence combination strategy. In addition, we propose an accuracy-weighted label smoothing method that further improves recognition accuracy. Comprehensive experiments on two benchmarks, the English IEMOCAP corpus and the Chinese CASIA corpus, show that the proposed method achieves higher accuracy than state-of-the-art methods.
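The abstract does not give formulas for the weighted DS fusion or the accuracy-weighted label smoothing, so the following Python sketch shows one plausible reading of each idea. All function names, the log-linear source weighting, and the numeric values (eps, w_speech, w_text) are assumptions for illustration, not the paper's exact method.

```python
import numpy as np


def accuracy_weighted_label_smoothing(true_class, class_accuracies, eps=0.1):
    """One plausible reading of accuracy-weighted label smoothing:
    spread the smoothing mass eps over classes in proportion to their
    error rate instead of uniformly, so the soft target hedges more
    toward classes the model recognizes poorly. (Assumed variant,
    not the paper's exact formulation.)"""
    error = np.clip(1.0 - np.asarray(class_accuracies, dtype=float), 1e-8, None)
    target = eps * error / error.sum()     # smoothing mass, per class
    target[true_class] += 1.0 - eps        # bulk of the mass on the true class
    return target


def weighted_ds_fusion(p_speech, p_text, w_speech=0.6, w_text=0.4):
    """Combine the two branches' class probabilities with Dempster's rule
    over singleton hypotheses. The source weighting here is log-linear
    pooling (p ** w); the paper's weighting scheme may differ."""
    m1 = np.asarray(p_speech, dtype=float) ** w_speech
    m2 = np.asarray(p_text, dtype=float) ** w_text
    m1, m2 = m1 / m1.sum(), m2 / m2.sum()  # renormalize the weighted masses
    joint = m1 * m2                        # mass on which both sources agree
    conflict = 1.0 - joint.sum()           # mass assigned to conflicting pairs
    return joint / (1.0 - conflict)        # Dempster's normalization


# Example with four emotion classes: the speech branch is confident,
# the text branch less so; fusion sharpens where the two agree.
p_speech = np.array([0.70, 0.10, 0.10, 0.10])
p_text = np.array([0.40, 0.30, 0.20, 0.10])
print(weighted_ds_fusion(p_speech, p_text))
print(accuracy_weighted_label_smoothing(0, [0.9, 0.6, 0.7, 0.8]))
```

Under these assumptions, Dempster's rule discards the mass assigned to conflicting class pairs and renormalizes, so the fused distribution concentrates on classes both branches support.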



Data Availability Statement
Data cannot be made available.
Funding
This study was supported by the National Key R&D Program of China under Grant 2020YFC0833102.
Author information
Contributions
YY, CH, YT, and YX contributed to the conception of the study. YY, CH, and YF performed the experiment. YY, YT, CH, and YX contributed significantly to the analysis and manuscript preparation. YY and CH performed the data analyses and wrote the manuscript. YT, YX, and XH helped perform the analysis with constructive discussions. All authors reviewed the manuscript.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
This declaration is not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yi, Y., Tian, Y., He, C. et al. DBT: multimodal emotion recognition based on dual-branch transformer. J Supercomput 79, 8611–8633 (2023). https://doi.org/10.1007/s11227-022-05001-5