Abstract
In recent years, speech emotion recognition (SER) has attracted increasing attention because it is a key component of intelligent human-computer interaction and sophisticated dialog systems. To obtain richer emotional information, many SER studies focus on multimodal systems that utilize other modalities, such as text and facial expression, to assist speech emotion recognition. However, it is difficult to design a fusion mechanism that can selectively extract abundant emotion-related features from different modalities. To tackle this issue, we develop a multimodal speech emotion recognition model based on multi-scale MFCCs and a multi-view attention mechanism, which extracts rich audio emotional features and comprehensively fuses emotion-related features from four views (i.e., audio self-attention, textual self-attention, audio attention based on textual content, and textual attention based on audio content). Across different audio input conditions and attention configurations, the best recognition accuracy is achieved by jointly utilizing all four attention modules and three different scales of MFCCs. In addition, based on multi-task learning, we treat gender recognition as an auxiliary task to learn gender information. To further improve the accuracy of emotion recognition, a joint loss function combining softmax cross-entropy loss and center loss is used. Experiments are conducted on two datasets (IEMOCAP and MSP-IMPROV). The results demonstrate that the proposed model outperforms previous models on the IEMOCAP dataset, while achieving competitive performance on the MSP-IMPROV dataset.
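The joint objective described above combines a softmax cross-entropy term with a center-loss term that pulls each utterance embedding toward its class center. The following is a minimal NumPy sketch of such a joint loss; the weighting hyperparameter `lam` and the function names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean softmax cross-entropy over a batch (numerically stable)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def center_loss(features, labels, centers):
    """Mean half squared distance between each feature and its class center."""
    diffs = features - centers[labels]
    return 0.5 * (diffs ** 2).sum(axis=1).mean()

def joint_loss(logits, features, labels, centers, lam=0.5):
    # Total objective: cross-entropy plus a lambda-weighted center-loss term
    # (lam is an assumed hyperparameter for this sketch).
    return softmax_cross_entropy(logits, labels) + lam * center_loss(features, labels, centers)
```

In training, the class centers are typically updated alongside the network parameters; this sketch only shows how the two loss terms are combined.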
Data Availability
The code used in this study is available at https://github.com/1LuYaoLiu/Mutimodal-Speech-Emotion-Recognition-Based-on-Multi-view-Attention-Mechanism.
The datasets used during the study were provided by a third party. Direct requests for these materials may be made to the provider as follows:
IEMOCAP: https://sail.usc.edu/iemocap/index.html;
MSP-IMPROV: http://ecs.utdallas.edu/research/researchlabs/msp-lab/MSPImprov.html.
Acknowledgements
This work was supported by Fundamental Research Funds for the Central Universities (Grants 2019RC29 and DUT19RC(3)012), by National Natural Science Foundation (NNSF) of China (Grant 61972064), by the Gansu Provincial First-Class Discipline Program of Northwest Minzu University (Grant 11080305), and by LiaoNing Revitalization Talents Program (Grant XLYC1806006).
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of Interests
The authors declare that they have no competing interests regarding the publication of this paper.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Feng, L., Liu, LY., Liu, SL. et al. Multimodal speech emotion recognition based on multi-scale MFCCs and multi-view attention mechanism. Multimed Tools Appl 82, 28917–28935 (2023). https://doi.org/10.1007/s11042-023-14600-0