
Multimodal speech emotion recognition based on multi-scale MFCCs and multi-view attention mechanism

Published in: Multimedia Tools and Applications

Abstract

In recent years, speech emotion recognition (SER) has attracted increasing attention because it is a key component of intelligent human-computer interaction and sophisticated dialog systems. To obtain richer emotional information, many SER studies turn to multimodal systems that use other modalities, such as text and facial expression, to assist speech emotion recognition. However, it is difficult to design a fusion mechanism that can selectively extract abundant emotion-related features from the different modalities. To tackle this issue, we develop a multimodal speech emotion recognition model based on multi-scale MFCCs and a multi-view attention mechanism, which extracts rich audio emotional features and comprehensively fuses emotion-related features from four aspects (i.e., audio self-attention, textual self-attention, audio attention based on textual content, and textual attention based on audio content). Across different audio input conditions and attention configurations, the best emotion recognition accuracy is achieved by jointly using the four attention modules and three different scales of MFCCs. In addition, based on multi-task learning, we treat gender recognition as an auxiliary task so that the model also learns gender information. To further improve emotion recognition accuracy, a joint loss function combining softmax cross-entropy loss and center loss is used. Experiments are conducted on two datasets, IEMOCAP and MSP-IMPROV. The results demonstrate that the proposed model outperforms previous models on the IEMOCAP dataset and achieves competitive performance on the MSP-IMPROV dataset.
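The abstract states that the training objective combines softmax cross-entropy with center loss, but it does not give the exact form of the combination. A common way to write such a joint objective, following the standard center-loss formulation, is sketched below; the weighting factor λ is an assumption for illustration, not a value reported by the authors:

    L = L_{CE} + \lambda \, L_{C}, \qquad
    L_{C} = \frac{1}{2} \sum_{i=1}^{m} \left\lVert \mathbf{x}_i - \mathbf{c}_{y_i} \right\rVert_2^2

Here L_{CE} is the softmax cross-entropy over the labels, x_i is the embedding of the i-th sample, and c_{y_i} is the running center of its class; the center-loss term pulls same-class embeddings toward their class center, complementing the inter-class separation encouraged by cross-entropy.

Similarly, "multi-scale MFCCs" can plausibly be read as MFCCs extracted with several analysis-window lengths. The following is a minimal sketch of that reading using librosa; the sampling rate, window lengths, and coefficient count are illustrative assumptions rather than the paper's settings:

    import librosa

    def multi_scale_mfcc(path, sr=16000, n_mfcc=40,
                         win_lengths=(400, 800, 1600)):
        # Load audio at a fixed sampling rate (assumed, not taken from the paper).
        y, sr = librosa.load(path, sr=sr)
        feats = []
        for win in win_lengths:
            # One MFCC matrix per analysis-window length ("scale").
            mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                        n_fft=win, win_length=win,
                                        hop_length=win // 2)
            feats.append(mfcc.T)  # shape: (frames, n_mfcc)
        # One feature matrix per scale, to be consumed by the audio branch.
        return feats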


Data Availability

The code used in this study is available at https://github.com/1LuYaoLiu/Mutimodal-Speech-Emotion-Recognition-Based-on-Multi-view-Attention-Mechanism.

The datasets used in this study were provided by third parties. Requests for these materials may be directed to the providers as follows:

IEMOCAP: https://sail.usc.edu/iemocap/index.html;

MSP-IMPROV: http://ecs.utdallas.edu/research/researchlabs/msp-lab/MSPImprov.html.


Acknowledgements

This work was supported by Fundamental Research Funds for the Central Universities (Grants 2019RC29 and DUT19RC(3)012), by National Natural Science Foundation (NNSF) of China (Grant 61972064), by the Gansu Provincial First-Class Discipline Program of Northwest Minzu University (Grant 11080305), and by LiaoNing Revitalization Talents Program (Grant XLYC1806006).

Author information


Corresponding author

Correspondence to Lu-Yao Liu.

Ethics declarations

Conflict of Interests

The authors declare that they have no competing interests regarding the publication of this paper.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix


Table 9 Hyper-parameter settings used in the experiments
Table 10 Settings of the convolution layers

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Feng, L., Liu, LY., Liu, SL. et al. Multimodal speech emotion recognition based on multi-scale MFCCs and multi-view attention mechanism. Multimed Tools Appl 82, 28917–28935 (2023). https://doi.org/10.1007/s11042-023-14600-0

