Abstract
Conventional feature extraction methods for speech emotion recognition often rely on a single type of feature and fail to capture the full range of emotional cues, which limits their effectiveness. To address these challenges, this paper introduces the Multi-Modal Speech Emotion Recognition Network (MMSERNet), a model that exploits multimodal and multiscale feature fusion to improve the accuracy of speech emotion recognition. MMSERNet comprises three specialized sub-networks, each dedicated to one feature type: cepstral coefficients, spectrogram features, and textual features. It fuses audio features derived from Mel-frequency cepstral coefficients and Mel spectrograms with textual features obtained from word vectors, producing a rich, comprehensive representation of emotional content and a robust multimodal basis for emotion recognition. Extensive empirical evaluations on the benchmark datasets IEMOCAP and MELD show that MMSERNet achieves significant gains in recognition accuracy while using model parameters efficiently, supporting scalability and practical applicability.
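The abstract describes a three-branch architecture whose MFCC, Mel-spectrogram, and word-vector streams are fused before classification. The sketch below is a minimal illustration of that late-fusion idea only; it is not the authors' MMSERNet implementation, and all module names, layer sizes, and hyperparameters are assumptions made for the example.

```python
# Illustrative sketch of a three-branch, late-fusion speech emotion classifier.
# Feature extraction follows standard librosa calls; the network structure,
# names (Branch, LateFusionSER), and sizes are assumptions, not the paper's model.
import librosa
import torch
import torch.nn as nn


def extract_audio_features(wav_path, sr=16000, n_mfcc=40, n_mels=128):
    """Compute MFCC and log-Mel-spectrogram features for one utterance."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)            # (n_mfcc, T)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)    # (n_mels, T)
    mel_db = librosa.power_to_db(mel)                                  # log-scaled energies
    return mfcc, mel_db


class Branch(nn.Module):
    """1-D convolutional encoder that pools a (channels, T) sequence
    into a fixed-length embedding."""
    def __init__(self, in_channels, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(64, emb_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),     # temporal average pooling
        )

    def forward(self, x):                # x: (batch, channels, T)
        return self.net(x).squeeze(-1)   # (batch, emb_dim)


class LateFusionSER(nn.Module):
    """Three branches (MFCC, Mel spectrogram, word vectors) fused by
    concatenation and classified with a small MLP."""
    def __init__(self, n_mfcc=40, n_mels=128, word_dim=300, n_classes=4):
        super().__init__()
        self.mfcc_branch = Branch(n_mfcc)
        self.mel_branch = Branch(n_mels)
        # Text input: pretrained word vectors stacked as (batch, word_dim, num_words).
        self.text_branch = Branch(word_dim)
        self.classifier = nn.Sequential(
            nn.Linear(3 * 128, 128), nn.ReLU(), nn.Linear(128, n_classes)
        )

    def forward(self, mfcc, mel, words):
        fused = torch.cat(
            [self.mfcc_branch(mfcc), self.mel_branch(mel), self.text_branch(words)],
            dim=-1,
        )
        return self.classifier(fused)    # unnormalized class logits
```

A full system in the spirit of the paper would add multiscale convolution kernels and attention within each branch and a stronger text encoder; this sketch only shows how the three feature streams named in the abstract can be combined for classification.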
Data availability
No datasets were generated or analysed during the current study.
References
Ramakrishnan, S., Emary, E.: Speech emotion recognition approaches in human computer interaction. Telecommun. Syst. 52, 1467–1478 (2013)
Wani, T.M., Gunawan, T.S., Qadri, S.A.A., et al.: A comprehensive review of speech emotion recognition systems. IEEE Access 9, 47795–47814 (2021)
de Lope, J., Graña, M.: An ongoing review of speech emotion recognition. Neurocomputing 528, 1–11 (2023)
Pepino, L., Riera, P., Ferrer, L.: Emotion recognition from speech using wav2vec 2.0 embeddings. (2021). arXiv preprint arXiv:2104.03502
Yang, L., Zhao, H., Yu, K.: End-to-end speech emotion recognition based on multi-head attention. J. Comput. Appl. 42(6), 1869 (2022)
Mishra, S.P., Warule, P., Deb, S.: Speech emotion recognition using MFCC-based entropy feature. Signal Image Video Process. 18(1), 153–161 (2024)
Yoon, S., Byun, S., Dey, S., et al.: Speech emotion recognition using multi-hop attention mechanism. In: ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2822–2826. IEEE (2019)
Tripathi, S., Kumar, A., Ramesh, A., et al.: Deep learning based emotion recognition system using speech features and transcriptions. (2019). arXiv preprint arXiv:1906.05681
Santoso, J., Yamada, T., Ishizuka, K., et al.: Speech emotion recognition based on self-attention weight correction for acoustic and text features. IEEE Access 10, 115732–115743 (2022)
Ye, J.X., Wen, X.C., Wang, X.Z., et al.: GM-TCNet: Gated multi-scale temporal convolutional network using emotion causality for speech emotion recognition. Speech Commun. 145, 21–35 (2022)
Li, X., Lu, G., Yan, J., et al.: A multi-scale multi-task learning model for continuous dimensional emotion recognition from audio. Electronics 11(3), 417 (2022)
Chen, M., Zhao, X.: A multi-scale fusion framework for bimodal speech emotion recognition. In: Interspeech 2020, pp. 374–378 (2020)
Busso, C., Bulut, M., Lee, C.C., et al.: IEMOCAP: Interactive emotional dyadic motion capture database. Lang. Resour. Eval. 42, 335–359 (2008)
Poria, S., Hazarika, D., Majumder, N., et al.: MELD: A multimodal multi-party dataset for emotion recognition in conversations. arXiv preprint arXiv:1810.02508 (2018)
McFee, B., Raffel, C., Liang, D., et al.: librosa: Audio and music signal analysis in Python. In: Proceedings of the 14th Python in Science Conference, pp. 18–25 (2015)
Zhong, Y., Hu, Y., Huang, H., et al.: A lightweight model based on separable convolution for speech emotion recognition. In: Interspeech 2020, pp. 3331–3335 (2020)
Aftab, A., Morsali, A., Ghaemmaghami, S., et al.: Light-SERNet: A lightweight fully convolutional neural network for speech emotion recognition. In: ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6912–6916. IEEE (2022)
Ye, J., Wen, X.C., Wei, Y., et al.: Temporal modeling matters: A novel temporal emotional modeling approach for speech emotion recognition. In: ICASSP 2023–2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE (2023)
He, J., Wu, M., Li, M., Zhu, X., Ye, F.: Multilevel transformer for multimodal emotion recognition. In: ICASSP 2023–2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE (2023)
Wang, S., Ma, Y., Ding, Y.: Exploring complementary features in multi-modal speech emotion recognition. In: ICASSP 2023–2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE (2023)
Poria, S., Cambria, E., Hazarika, D., et al.: Context-dependent sentiment analysis in user-generated videos. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 873–883 (2017)
Hu, J., Liu, Y., Zhao, J., Jin, Q.: MMGCN: Multimodal fusion via deep graph convolution network for emotion recognition in conversation. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 5666–5675 (2021)
Lian, Z., Liu, B., Tao, J.: CTNet: Conversational transformer network for emotion recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 29, 985–1000 (2021)
Dou, H., Wei, L., Huai, X.: DialogueCRN: Contextual reasoning networks for emotion recognition in conversations. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 7042–7052 (2021)
Hu, D., Hou, X., Wei, L., et al.: MM-DFN: Multimodal dynamic fusion network for emotion recognition in conversations. In: ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7037–7041. IEEE (2022)
Acknowledgements
This work is supported by the Science and Technology Development Project of Jilin Province [grant numbers 20210201051GX, 20210203161SF] and the Education Department Project of Jilin Province [grant number JJKH20220686KJ].
Author information
Authors and Affiliations
Contributions
All authors contributed to the study conception and design. Conceptualization and Writing—original draft preparation: [Huangshui Hu]; Methodology: [Jie Wei]; Formal analysis and investigation: [Chuhang Wang]; Writing—review and editing: [Hongyu Sun], [Shuo Tao].
Corresponding author
Ethics declarations
Human and animal rights
This article does not contain any studies with human participants or animals performed by any of the authors. No violation of human or animal rights is involved.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Hu, H., Wei, J., Sun, H. et al. Speech emotion recognition based on multimodal and multiscale feature fusion. SIViP 19, 165 (2025). https://doi.org/10.1007/s11760-024-03773-2
Received:
Revised:
Accepted:
Published:
DOI: https://doi.org/10.1007/s11760-024-03773-2