Abstract
Emotion plays a vital role in every living being. Understanding emotion is a complex task, but a reliable solution could help address many problems and even save lives. Emotion is reflected not only in gestures but also in how a person works and performs. Consequently, speech emotion recognition (SER) has been an active research topic for the last three decades. In this study, we used three features, the mel-frequency cepstral coefficients (MFCC), the spectrogram, and the mel-spectrogram, as one-dimensional input vectors to a convolutional neural network (CNN) and a deep neural network (DNN) for speech emotion classification. We evaluated SER accuracy using the features individually and in combination with the CNN and DNN classifiers. For both classifiers, the combined features outperformed the individual features. With the combined features, the DNN classifier achieved accuracies of 76.60%, 87.10%, 79.79%, and 100%, and the CNN classifier achieved 75%, 84.11%, 78.13%, and 100% on the RAVDESS, EMO-DB, SAVEE, and TESS datasets, respectively. We then applied the proposed feature- and classifier-level fusion method using CNN and DNN, which improved emotion classification performance to 80.42%, 87.48%, and 80.99% on the RAVDESS, EMO-DB, and SAVEE datasets, respectively. Compared with other methods, the proposed feature- and classifier-level fusion approach outperformed the state-of-the-art results.
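The pipeline described above can be illustrated with a minimal numpy-only sketch. All settings here (frame length, hop size, the use of a plain magnitude spectrogram in place of MFCC/mel-spectrogram extraction, and probability averaging as the classifier-level fusion rule) are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def spectral_feature_1d(signal, frame_len=256, hop=128, n_keep=64):
    """Summarize one utterance as a fixed-length 1-D vector (a stand-in
    for the paper's MFCC / spectrogram / mel-spectrogram features)."""
    frames = np.array([signal[i:i + frame_len]
                       for i in range(0, len(signal) - frame_len + 1, hop)])
    mags = np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1))
    return mags.mean(axis=0)[:n_keep]           # average over time frames

rng = np.random.default_rng(0)
utterance = rng.standard_normal(16000)          # 1 s of synthetic audio @ 16 kHz

# Feature-level fusion: concatenate the per-feature 1-D vectors into a
# single input vector for the CNN/DNN (the same vector is repeated here
# as a placeholder for the three distinct features).
feat = spectral_feature_1d(utterance)
fused_input = np.concatenate([feat, feat, feat])

# Classifier-level fusion: one simple scheme is averaging the class
# probabilities of the two models (the paper's exact rule may differ).
p_cnn = np.array([0.6, 0.3, 0.1])               # hypothetical CNN softmax output
p_dnn = np.array([0.5, 0.4, 0.1])               # hypothetical DNN softmax output
p_fused = (p_cnn + p_dnn) / 2
predicted_class = int(np.argmax(p_fused))

print(fused_input.shape, predicted_class)       # (192,) 0
```

The key idea is that each feature stream is reduced to a fixed-length one-dimensional vector so that streams of different time-frequency resolutions can be concatenated before classification, while the classifier-level fusion combines the two models' decisions after classification.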
Availability of data and materials
The data that support the findings of this study are openly available.
Funding
Not applicable.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships.
Ethical approval
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Mishra, S.P., Warule, P. & Deb, S. Speech emotion classification using feature-level and classifier-level fusion. Evolving Systems 15, 541–554 (2024). https://doi.org/10.1007/s12530-023-09550-9