Abstract
Emotions play a significant role in human life, and recognizing them from speech signals has numerous applications in education, health, forensics, defense, robotics, and scientific research. However, Speech Emotion Recognition (SER) is limited by data labeling, misinterpretation of speech, audio annotation, and time complexity. This work evaluates SER based on features extracted with Mel Frequency Cepstral Coefficients (MFCC) and Gammatone Frequency Cepstral Coefficients (GFCC) to study emotions across different versions of audio signals. In the feature extraction stage, the sound signals are segmented, and each frame is parametrized using MFCC, GFCC, and combined (M-GFCC) features. Building on recent advances in deep learning, this paper proposes a Deep Convolutional-Recurrent Neural Network (Deep C-RNN) for the classification stage: a fusion of Mel and Gammatone filterbank features is fed to convolutional layers, which first extract high-level spectral features, and recurrent layers then learn the long-term temporal context from those features. The proposed work also differentiates emotional from neutral speech with suitable binary-tree diagrammatic illustrations. The methodology is applied to a large dataset, the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). Finally, the proposed approach, which achieves more than 80% accuracy with lower loss, is compared with state-of-the-art approaches, and the experimental results provide evidence that the fused features outperform individual features in recognizing emotions from speech signals.
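The M-GFCC fusion described above amounts to concatenating, frame by frame, the MFCC and GFCC coefficient vectors before classification. The following is a minimal sketch of that fusion step only; the random matrices stand in for real MFCC/GFCC outputs (which in practice would come from a Mel and a gammatone filterbank front end), and all names and dimensions here are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of frame-wise M-GFCC fusion: concatenate per-frame MFCC and
# GFCC coefficient vectors into one combined feature vector per frame.
# Feature values are random placeholders standing in for a real front end.
import random

def fuse_mgfcc(mfcc_frames, gfcc_frames):
    """Concatenate per-frame MFCC and GFCC coefficients into M-GFCC vectors."""
    if len(mfcc_frames) != len(gfcc_frames):
        raise ValueError("feature streams must have the same number of frames")
    return [m + g for m, g in zip(mfcc_frames, gfcc_frames)]

random.seed(0)
n_frames, n_coeffs = 100, 13  # 13 cepstral coefficients per frame is typical
mfcc = [[random.gauss(0, 1) for _ in range(n_coeffs)] for _ in range(n_frames)]
gfcc = [[random.gauss(0, 1) for _ in range(n_coeffs)] for _ in range(n_frames)]

mgfcc = fuse_mgfcc(mfcc, gfcc)
print(len(mgfcc), len(mgfcc[0]))  # 100 frames, 26-dimensional fused vectors
```

The fused (frames × 26) matrix would then be passed to the convolutional layers of the Deep C-RNN, with the recurrent layers consuming the resulting high-level feature sequence.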
Cite this article
Kumaran, U., Radha Rammohan, S., Nagarajan, S.M. et al. Fusion of mel and gammatone frequency cepstral coefficients for speech emotion recognition using deep C-RNN. Int J Speech Technol 24, 303–314 (2021). https://doi.org/10.1007/s10772-020-09792-x