Abstract
Emotions play a significant role in human life, and recognizing them from speech signals has numerous applications in education, health, forensics, defense, robotics, and scientific research. However, Speech Emotion Recognition (SER) is limited by data labeling, misinterpretation of speech, audio annotation, and time complexity. This work evaluates SER based on features extracted with Mel Frequency Cepstral Coefficients (MFCC) and Gammatone Frequency Cepstral Coefficients (GFCC) to study emotions across different versions of audio signals. In the feature extraction stage, the sound signals are segmented, and each frame is parametrized using MFCC, GFCC, and combined (M-GFCC) features. Building on recent advances in deep learning, this paper proposes a Deep Convolutional-Recurrent Neural Network (Deep C-RNN) for the classification stage: a fusion of Mel and Gammatone filterbank features is fed to convolutional layers, which first extract high-level spectral features, and recurrent layers then learn the long-term temporal context from those features. The proposed work also differentiates emotional from neutral speech with suitable binary-tree diagrammatic illustrations. The methodology is applied to a large dataset, the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). Finally, the proposed approach, which achieves more than 80% accuracy with lower loss, is compared with state-of-the-art approaches, and the experimental results provide evidence that the fused features outperform individual features in recognizing emotions from speech signals.
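The M-GFCC fusion described above amounts to concatenating, frame by frame, the MFCC and GFCC coefficient vectors before classification. The following is a minimal sketch of that fusion step only; the random matrices stand in for real MFCC/GFCC outputs (which in practice would come from a Mel and a gammatone filterbank front end), and all names and dimensions here are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of frame-wise M-GFCC fusion: concatenate per-frame MFCC and
# GFCC coefficient vectors into one combined feature vector per frame.
# Feature values are random placeholders standing in for a real front end.
import random

def fuse_mgfcc(mfcc_frames, gfcc_frames):
    """Concatenate per-frame MFCC and GFCC coefficients into M-GFCC vectors."""
    if len(mfcc_frames) != len(gfcc_frames):
        raise ValueError("feature streams must have the same number of frames")
    return [m + g for m, g in zip(mfcc_frames, gfcc_frames)]

random.seed(0)
n_frames, n_coeffs = 100, 13  # 13 cepstral coefficients per frame is typical
mfcc = [[random.gauss(0, 1) for _ in range(n_coeffs)] for _ in range(n_frames)]
gfcc = [[random.gauss(0, 1) for _ in range(n_coeffs)] for _ in range(n_frames)]

mgfcc = fuse_mgfcc(mfcc, gfcc)
print(len(mgfcc), len(mgfcc[0]))  # 100 frames, 26-dimensional fused vectors
```

The fused (frames × 26) matrix would then be passed to the convolutional layers of the Deep C-RNN, with the recurrent layers consuming the resulting high-level feature sequence.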
Cite this article
Kumaran, U., Radha Rammohan, S., Nagarajan, S.M. et al. Fusion of mel and gammatone frequency cepstral coefficients for speech emotion recognition using deep C-RNN. Int J Speech Technol 24, 303–314 (2021). https://doi.org/10.1007/s10772-020-09792-x