
Speech emotion classification using feature-level and classifier-level fusion

  • Original Paper
  • Published in Evolving Systems 15, 541–554 (2024)

Abstract

Emotion plays a vital role in every living being. Understanding emotion is a complex task, but a system that recognizes it reliably could help solve many practical problems and even save lives. Emotion is reflected not only in gestures but also in how a person works and how efficiently they perform. Hence, recognizing emotion from speech has been a topic of interest for researchers over the last three decades. In this study, we used three features, the mel frequency cepstral coefficients (MFCC), the spectrogram, and the mel-spectrogram, as one-dimensional input vectors to a convolutional neural network (CNN) and a deep neural network (DNN) for speech emotion classification. We evaluated speech emotion recognition (SER) accuracy using each feature individually and in combination with the CNN and DNN classifiers. For both classifiers, the combined features outperformed the individual features. With the combined features, the DNN classifier achieved accuracies of 76.60%, 87.10%, 79.79%, and 100%, and the CNN classifier achieved accuracies of 75%, 84.11%, 78.13%, and 100% on the RAVDESS, EMO-DB, SAVEE, and TESS datasets, respectively. We then applied the proposed feature-level and classifier-level fusion method using the CNN and DNN to further improve performance, achieving classification accuracies of 80.42%, 87.48%, and 80.99% on the RAVDESS, EMO-DB, and SAVEE datasets, respectively. Compared with existing methods, the proposed feature- and classifier-level fusion method outperformed the state of the art.
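The pipeline described above can be summarized in code. The following is a minimal sketch, not the authors' implementation: it assumes librosa for feature extraction, averaging over time to obtain the one-dimensional input vectors, and equal-weight probability averaging for the classifier-level fusion; the paper's actual network architectures, vector lengths, and fusion weights are not specified in the abstract.

```python
# Hypothetical sketch of the three features and the two fusion steps
# named in the abstract. Assumptions (not from the paper): librosa for
# feature extraction, time-averaging to flatten each time-frequency
# map to a 1-D vector, and equal-weight probability averaging for the
# classifier-level fusion.
import numpy as np
import librosa


def extract_fused_features(path, sr=16000, n_mfcc=40, n_mels=128):
    """Feature-level fusion: concatenate MFCC, spectrogram, and
    mel-spectrogram into a single 1-D input vector."""
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)           # (n_mfcc, T)
    spec = np.abs(librosa.stft(y))                                   # (1 + n_fft//2, T)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)  # (n_mels, T)
    # Average each time-frequency map over time, then concatenate.
    return np.concatenate([m.mean(axis=1) for m in (mfcc, spec, mel)])


def fuse_classifiers(p_cnn, p_dnn, w=0.5):
    """Classifier-level fusion: weighted average of the class
    probabilities produced by the CNN and the DNN."""
    return w * np.asarray(p_cnn) + (1.0 - w) * np.asarray(p_dnn)
```

With softmax outputs `p_cnn` and `p_dnn` from the two trained networks, the fused prediction would be `np.argmax(fuse_classifiers(p_cnn, p_dnn))`. All names and parameter values here (`extract_fused_features`, `fuse_classifiers`, `n_mfcc=40`, `w=0.5`) are illustrative assumptions rather than the published configuration.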



Availability of data and materials

The data that support the findings of this study are openly available.


Funding

Not applicable.

Author information


Corresponding author

Correspondence to Siba Prasad Mishra.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships.

Ethical approval

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Mishra, S.P., Warule, P. & Deb, S. Speech emotion classification using feature-level and classifier-level fusion. Evolving Systems 15, 541–554 (2024). https://doi.org/10.1007/s12530-023-09550-9

