Abstract
Conventional feature extraction methods for speech emotion recognition often rely on a single type of feature and fail to capture the full range of emotional cues, which limits their effectiveness. To address these challenges, this paper introduces the Multi-Modal Speech Emotion Recognition Network (MMSERNet), a model that exploits multimodal and multiscale feature fusion to improve the accuracy of speech emotion recognition. MMSERNet comprises three specialized sub-networks, each dedicated to one feature type: cepstral coefficients, spectrogram features, and textual features. It fuses audio features derived from Mel-frequency cepstral coefficients and Mel spectrograms with textual features obtained from word vectors, producing a rich, comprehensive representation of emotional content and a robust multimodal basis for emotion recognition. Extensive empirical evaluations on the benchmark datasets IEMOCAP and MELD show that MMSERNet achieves significant gains in recognition accuracy while using model parameters efficiently, supporting scalability and practical applicability.
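The abstract describes a three-branch architecture whose MFCC, Mel-spectrogram, and word-vector streams are fused before classification. The sketch below is a minimal illustration of that late-fusion idea only; it is not the authors' MMSERNet implementation, and all module names, layer sizes, and hyperparameters are assumptions made for the example.

```python
# Illustrative sketch of a three-branch, late-fusion speech emotion classifier.
# Feature extraction follows standard librosa calls; the network structure,
# names (Branch, LateFusionSER), and sizes are assumptions, not the paper's model.
import librosa
import torch
import torch.nn as nn


def extract_audio_features(wav_path, sr=16000, n_mfcc=40, n_mels=128):
    """Compute MFCC and log-Mel-spectrogram features for one utterance."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)            # (n_mfcc, T)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)    # (n_mels, T)
    mel_db = librosa.power_to_db(mel)                                  # log-scaled energies
    return mfcc, mel_db


class Branch(nn.Module):
    """1-D convolutional encoder that pools a (channels, T) sequence
    into a fixed-length embedding."""
    def __init__(self, in_channels, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(64, emb_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),     # temporal average pooling
        )

    def forward(self, x):                # x: (batch, channels, T)
        return self.net(x).squeeze(-1)   # (batch, emb_dim)


class LateFusionSER(nn.Module):
    """Three branches (MFCC, Mel spectrogram, word vectors) fused by
    concatenation and classified with a small MLP."""
    def __init__(self, n_mfcc=40, n_mels=128, word_dim=300, n_classes=4):
        super().__init__()
        self.mfcc_branch = Branch(n_mfcc)
        self.mel_branch = Branch(n_mels)
        # Text input: pretrained word vectors stacked as (batch, word_dim, num_words).
        self.text_branch = Branch(word_dim)
        self.classifier = nn.Sequential(
            nn.Linear(3 * 128, 128), nn.ReLU(), nn.Linear(128, n_classes)
        )

    def forward(self, mfcc, mel, words):
        fused = torch.cat(
            [self.mfcc_branch(mfcc), self.mel_branch(mel), self.text_branch(words)],
            dim=-1,
        )
        return self.classifier(fused)    # unnormalized class logits
```

A full system in the spirit of the paper would add multiscale convolution kernels and attention within each branch and a stronger text encoder; this sketch only shows how the three feature streams named in the abstract can be combined for classification.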
Data availability
No datasets were generated or analysed during the current study.
References
Ramakrishnan, S., Emary, E.: Speech emotion recognition approaches in human computer interaction. Telecommun. Syst. 52, 1467–1478 (2013)
Wani, T.M., Gunawan, T.S., Qadri, S.A.A., et al.: A comprehensive review of speech emotion recognition systems. IEEE Access 9, 47795–47814 (2021)
de Lope, J., Graña, M.: An ongoing review of speech emotion recognition. Neurocomputing 528, 1–11 (2023)
Pepino, L., Riera, P., Ferrer, L.: Emotion recognition from speech using wav2vec 2.0 embeddings. (2021). arXiv preprint arXiv:2104.03502
Yang, L., Zhao, H., Yu, K.: End-to-end speech emotion recognition based on multi-head attention. J. Comput. Appl. 42(6), 1869 (2022)
Mishra, S.P., Warule, P., Deb, S.: Speech emotion recognition using MFCC-based entropy feature. Signal Image Video Process. 18(1), 153–161 (2024)
Yoon, S., Byun, S., Dey, S., et al.: Speech emotion recognition using multi-hop attention mechanism. In: ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2822–2826. IEEE (2019)
Tripathi, S., Kumar, A., Ramesh, A., et al.: Deep learning based emotion recognition system using speech features and transcriptions. (2019). arXiv preprint arXiv:1906.05681
Santoso, J., Yamada, T., Ishizuka, K., et al.: Speech emotion recognition based on self-attention weight correction for acoustic and text features. IEEE Access 10, 115732–115743 (2022)
Ye, J.X., Wen, X.C., Wang, X.Z., et al.: GM-TCNet: Gated multi-scale temporal convolutional network using emotion causality for speech emotion recognition. Speech Commun. 145, 21–35 (2022)
Li, X., Lu, G., Yan, J., et al.: A multi-scale multi-task learning model for continuous dimensional emotion recognition from audio. Electronics 11(3), 417 (2022)
Chen, M., Zhao, X.: A multi-scale fusion framework for bimodal speech emotion recognition. In: Interspeech 2020, pp. 374–378 (2020)
Busso, C., Bulut, M., Lee, C.C., et al.: IEMOCAP: Interactive emotional dyadic motion capture database. Lang. Resour. Eval. 42, 335–359 (2008)
Poria, S., Hazarika, D., Majumder, N., et al.: MELD: A multimodal multi-party dataset for emotion recognition in conversations. arXiv preprint arXiv:1810.02508 (2018)
McFee, B., Raffel, C., Liang, D., et al.: librosa: Audio and music signal analysis in Python. In: Proceedings of the 14th Python in Science Conference, pp. 18–25 (2015)
Zhong, Y., Hu, Y., Huang, H., et al.: A lightweight model based on separable convolution for speech emotion recognition. In: Interspeech 2020, pp. 3331–3335 (2020)
Aftab, A., Morsali, A., Ghaemmaghami, S., et al.: Light-SERNet: A lightweight fully convolutional neural network for speech emotion recognition. In: ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6912–6916. IEEE (2022)
Ye, J., Wen, X.C., Wei, Y., et al.: Temporal modeling matters: A novel temporal emotional modeling approach for speech emotion recognition. In: ICASSP 2023–2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE (2023)
He, J., Wu, M., Li, M., Zhu, X., Ye, F.: Multilevel transformer for multimodal emotion recognition. In: ICASSP 2023–2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE (2023)
Wang, S., Ma, Y., Ding, Y.: Exploring complementary features in multi-modal speech emotion recognition. In: ICASSP 2023–2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE (2023)
Poria, S., Cambria, E., Hazarika, D., et al.: Context-dependent sentiment analysis in user-generated videos. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 873–883 (2017)
Hu, J., Liu, Y., Zhao, J., Jin, Q.: MMGCN: Multimodal fusion via deep graph convolution network for emotion recognition in conversation. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 5666–5675 (2021)
Lian, Z., Liu, B., Tao, J.: CTNet: Conversational transformer network for emotion recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 29, 985–1000 (2021)
Dou, H., Wei, L., Huai, X.: DialogueCRN: Contextual reasoning networks for emotion recognition in conversations. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 7042–7052 (2021)
Hu, D., Hou, X., Wei, L., et al.: MM-DFN: Multimodal dynamic fusion network for emotion recognition in conversations. In: ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7037–7041. IEEE (2022)
Acknowledgements
This work is supported by the Science and Technology Development Project of Jilin Province [grant numbers 20210201051GX, 20210203161SF] and the Education Department Project of Jilin Province [grant number JJKH20220686KJ].
Author information
Authors and Affiliations
Contributions
All authors contributed to the study conception and design. Conceptualization and Writing—original draft preparation: [Huangshui Hu]; Methodology: [Jie Wei]; Formal analysis and investigation: [Chuhang Wang]; Writing—review and editing: [Hongyu Sun], [Shuo Tao].
Corresponding author
Ethics declarations
Human and animal rights
This article does not contain any studies with human participants or animals performed by any of the authors. No violation of human or animal rights is involved.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Hu, H., Wei, J., Sun, H. et al. Speech emotion recognition based on multimodal and multiscale feature fusion. SIViP 19, 165 (2025). https://doi.org/10.1007/s11760-024-03773-2
Received:
Revised:
Accepted:
Published:
DOI: https://doi.org/10.1007/s11760-024-03773-2