Abstract
In recent years, speech emotion recognition (SER) has attracted increasing attention because it is a key component of intelligent human-computer interaction and sophisticated dialog systems. To obtain richer emotional information, many SER studies focus on multimodal systems that utilize other modalities, such as text and facial expression, to assist speech emotion recognition. However, it is difficult to design a fusion mechanism that can selectively extract abundant emotion-related features from different modalities. To tackle this issue, we develop a multimodal speech emotion recognition model based on multi-scale MFCCs and a multi-view attention mechanism, which extracts rich audio emotional features and comprehensively fuses emotion-related features from four views (i.e., audio self-attention, textual self-attention, audio attention based on textual content, and textual attention based on audio content). Across different audio input conditions and attention configurations, the best recognition accuracy is achieved by jointly utilizing all four attention modules and three different scales of MFCCs. In addition, based on multi-task learning, we treat gender recognition as an auxiliary task to learn gender information. To further improve the accuracy of emotion recognition, a joint loss function combining softmax cross-entropy loss and center loss is used. Experiments are conducted on two datasets (IEMOCAP and MSP-IMPROV). The results demonstrate that the proposed model outperforms previous models on the IEMOCAP dataset, while achieving competitive performance on the MSP-IMPROV dataset.
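The joint objective described above combines a softmax cross-entropy term with a center-loss term that pulls each utterance embedding toward its class center. The following is a minimal NumPy sketch of such a joint loss; the weighting hyperparameter `lam` and the function names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean softmax cross-entropy over a batch (numerically stable)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def center_loss(features, labels, centers):
    """Mean half squared distance between each feature and its class center."""
    diffs = features - centers[labels]
    return 0.5 * (diffs ** 2).sum(axis=1).mean()

def joint_loss(logits, features, labels, centers, lam=0.5):
    # Total objective: cross-entropy plus a lambda-weighted center-loss term
    # (lam is an assumed hyperparameter for this sketch).
    return softmax_cross_entropy(logits, labels) + lam * center_loss(features, labels, centers)
```

In training, the class centers are typically updated alongside the network parameters; this sketch only shows how the two loss terms are combined.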
Data Availability
The code used in this study is available at https://github.com/1LuYaoLiu/Mutimodal-Speech-Emotion-Recognition-Based-on-Multi-view-Attention-Mechanism.
The datasets used during the study were provided by a third party. Direct requests for these materials may be made to the provider as follows:
IEMOCAP: https://sail.usc.edu/iemocap/index.html;
MSP-IMPROV: http://ecs.utdallas.edu/research/researchlabs/msp-lab/MSPImprov.html.
Acknowledgements
This work was supported by Fundamental Research Funds for the Central Universities (Grants 2019RC29 and DUT19RC(3)012), by National Natural Science Foundation (NNSF) of China (Grant 61972064), by the Gansu Provincial First-Class Discipline Program of Northwest Minzu University (Grant 11080305), and by LiaoNing Revitalization Talents Program (Grant XLYC1806006).
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of Interests
The authors declare that they have no competing interests regarding the publication of this paper.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Feng, L., Liu, LY., Liu, SL. et al. Multimodal speech emotion recognition based on multi-scale MFCCs and multi-view attention mechanism. Multimed Tools Appl 82, 28917–28935 (2023). https://doi.org/10.1007/s11042-023-14600-0