
Speech emotion recognition based on multimodal and multiscale feature fusion

  • Original Paper
  • Published in Signal, Image and Video Processing

Abstract

Conventional feature extraction methods for speech emotion recognition are often unidimensional and fail to capture the full range of emotional cues, which limits their effectiveness. To address these challenges, this paper introduces a novel network model, the Multi-Modal Speech Emotion Recognition Network (MMSERNet), which exploits multimodal and multiscale feature fusion to improve the accuracy of speech emotion recognition. MMSERNet comprises three specialized sub-networks, each dedicated to one feature type: cepstral coefficients, spectrogram features, and textual features. It integrates audio features derived from Mel-frequency cepstral coefficients and Mel spectrograms with textual features obtained from word vectors, creating a rich, comprehensive representation of emotional content. Fusing these diverse feature sets yields a robust multimodal approach to emotion recognition. Extensive empirical evaluations on the benchmark IEMOCAP and MELD datasets demonstrate significant improvements in recognition accuracy together with efficient use of model parameters, ensuring scalability and practical applicability.
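For readers who want a concrete picture of the three-branch design described in the abstract, the following is a minimal, illustrative PyTorch sketch of a multimodal fusion model in the spirit of MMSERNet. It assumes simple convolutional encoders for the MFCC and Mel-spectrogram inputs, a BiGRU over pre-computed word vectors for the text input, and late fusion by concatenation; the layer sizes, branch depths, fusion strategy, and four-class output are illustrative assumptions, not the authors' published configuration.

# Illustrative three-branch multimodal fusion sketch (not the authors' exact model).
import torch
import torch.nn as nn

class AudioBranch(nn.Module):
    """Shared template for the MFCC and Mel-spectrogram sub-networks (assumed design)."""
    def __init__(self, in_channels: int = 1, embed_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),  # collapse time/frequency axes to a fixed grid
        )
        self.proj = nn.Linear(32 * 4 * 4, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, n_coeffs_or_mels, n_frames)
        h = self.conv(x).flatten(1)
        return self.proj(h)

class TextBranch(nn.Module):
    """Encodes pre-computed word vectors with a BiGRU and mean pooling (assumed design)."""
    def __init__(self, word_dim: int = 300, embed_dim: int = 128):
        super().__init__()
        self.rnn = nn.GRU(word_dim, embed_dim // 2, batch_first=True, bidirectional=True)

    def forward(self, w: torch.Tensor) -> torch.Tensor:
        # w: (batch, n_tokens, word_dim)
        out, _ = self.rnn(w)
        return out.mean(dim=1)

class MultimodalSER(nn.Module):
    """Late fusion by concatenation of the three branch embeddings."""
    def __init__(self, n_classes: int = 4, embed_dim: int = 128):
        super().__init__()
        self.mfcc_branch = AudioBranch(embed_dim=embed_dim)
        self.mel_branch = AudioBranch(embed_dim=embed_dim)
        self.text_branch = TextBranch(embed_dim=embed_dim)
        self.classifier = nn.Sequential(
            nn.Linear(3 * embed_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, n_classes),
        )

    def forward(self, mfcc, mel, words):
        fused = torch.cat([self.mfcc_branch(mfcc),
                           self.mel_branch(mel),
                           self.text_branch(words)], dim=1)
        return self.classifier(fused)

if __name__ == "__main__":
    model = MultimodalSER()
    mfcc = torch.randn(2, 1, 40, 300)     # 40 MFCCs over 300 frames (dummy data)
    mel = torch.randn(2, 1, 64, 300)      # 64 Mel bands over 300 frames (dummy data)
    words = torch.randn(2, 50, 300)       # 50 tokens, 300-d word vectors (dummy data)
    print(model(mfcc, mel, words).shape)  # torch.Size([2, 4])

The sketch only conveys the structure: each modality is encoded separately and the embeddings are concatenated before classification; multiscale behaviour, attention, or other fusion mechanisms from the paper are not reproduced here.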


Data availability

No datasets were generated or analysed during the current study.


Acknowledgements

This work is supported by the Science and Technology Development Project of Jilin Province [grant numbers 20210201051GX, 20210203161SF] and the Education Department Project of Jilin Province [grant number JJKH20220686KJ].

Author information

Contributions

All authors contributed to the study conception and design. Conceptualization and writing (original draft preparation): Huangshui Hu; methodology: Jie Wei; formal analysis and investigation: Chuhang Wang; writing (review and editing): Hongyu Sun, Shuo Tao.

Corresponding author

Correspondence to Jie Wei.

Ethics declarations

Human and animal rights

This article does not contain any studies with human participants or animals performed by any of the authors. No violation of human or animal rights is involved.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Hu, H., Wei, J., Sun, H. et al. Speech emotion recognition based on multimodal and multiscale feature fusion. SIViP 19, 165 (2025). https://doi.org/10.1007/s11760-024-03773-2

