Speech emotion recognition with deep convolutional neural networks
Introduction
Speech emotion recognition is an important problem that is receiving increasing interest from researchers due to its numerous applications, such as audio surveillance, e-learning, clinical studies, lie detection, entertainment, computer games, and call centers. Nevertheless, it remains a significantly challenging task even for advanced machine learning techniques. One reason for this moderate performance is the uncertainty in choosing the right features. In addition, background noise in audio recordings, such as real-world voices, can dramatically degrade the effectiveness of a machine learning model [1]. At the same time, accurate emotional speech recognition models could significantly improve the user experience in systems involving human-machine interaction, for example in Artificial Intelligence (AI) or Mobile Health (mHealth) [2]. Indeed, the ability to recognize emotions from audio samples, and therefore to imitate these emotions, could have a considerable impact on the field of AI. Virtual assistants in the mHealth domain could significantly improve their performance by employing such models. In addition, emotional speech recognition systems have modest hardware requirements.
Today, deep learning models are used to solve recognition problems such as face recognition, voice recognition, image recognition, and speech emotion recognition [3], [4], [5], [6]. One of the main advantages of deep learning techniques is automatic feature learning, which, in speech emotion recognition, can be applied to discover the attributes of a sound file that are characteristic of a particular emotion [7].
In recent years, various models based on deep neural networks for speech emotion recognition have been introduced. While one group of these models designs the neural network to detect significant features directly from raw sound samples [8], the other group feeds a single representation of a sound file as input to their models, e.g., [7], [1].
In this work, we extract five different features from a sound file and stack the resulting matrices into a one-dimensional array by taking mean values along the time axis. This array is then fed as input into a one-dimensional Convolutional Neural Network (CNN). We assert that mixing these features in the input data provides a more diverse representation of a sound file, which leads to better generalization and better classification when recognizing emotions from speech. In addition, we follow an incremental methodology for modifying our baseline model to improve its classification accuracy.
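The pooling-and-stacking step described above can be sketched as follows. This is a minimal illustration: each feature is assumed to arrive as an (n_coefficients, n_frames) matrix, and the feature dimensions used here (40, 12, 128, 7, 6) are hypothetical placeholders standing in for the five spectral representations, not necessarily the exact sizes used in the paper.

```python
import numpy as np

def pool_and_concat(feature_matrices):
    """Mean-pool each (n_coefficients, n_frames) feature matrix over the
    time axis, then concatenate the resulting vectors into one 1-D array."""
    return np.concatenate([m.mean(axis=1) for m in feature_matrices])

# Hypothetical feature matrices for one audio clip of 120 frames.
rng = np.random.default_rng(0)
feats = [rng.standard_normal((n, 120)) for n in (40, 12, 128, 7, 6)]

vec = pool_and_concat(feats)
print(vec.shape)  # (193,) -- one fixed-length vector per clip
```

Because the mean is taken over time, clips of different durations all map to the same fixed-length vector, which is what allows a single 1-D CNN input shape regardless of recording length.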
Although several speech emotion recognition frameworks in the literature combine different feature types, the proposed combination of five different spectral representations of the same sound file has not been attempted so far. Specifically, a feature mix that identifies and tracks timbre fluctuations well, but represents pitch classes and harmony poorly, is enriched with additional features to improve its representational power. As a result, our best performing model outperforms all existing frameworks that use audio features and report their classification accuracies on the same emotion classes for both the RAVDESS [9] and IEMOCAP [10] datasets, yielding a new state of the art. For the EMO-DB dataset [11], our best performing model outperforms all previous work except the study by Zhao et al. [12]. However, our model compares favorably with that one in terms of generality, simplicity, and applicability. In addition, we have noticed some inconsistencies related to [12], which we discuss in Section 4.4.
In the next section, we present a brief review of previous work on speech emotion recognition. We then present our methodology and the proposed baseline model in Section 3. The datasets, our improvements to the baseline model, and the experiments are described in the next section. After comparing our results with those of previous approaches, we draw conclusions and indicate possible future directions.
Literature review
The majority of speech emotion recognition architectures that utilize neural networks are Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM), or their combination [8], [7], [2], [13]. Combining a CNN with an RNN can detect essential patterns in audio files during feature extraction and classification [8], [7]. One of the main goals in speech emotion recognition is the identification of significant features that can then be used for classification.
Datasets and methodology
We use three different audio datasets, RAVDESS [9], EMO-DB [11], and IEMOCAP [10], which are widely employed by researchers in emotion recognition. After presenting the datasets, we describe the proposed framework, which starts with feature extraction followed by the baseline deep learning model. While the baseline model is well suited to RAVDESS, we also present additional deep learning models, generated by different hyperparameter settings of the baseline and slight modifications to its architecture.
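To make the 1-D CNN stage concrete, the sketch below shows how a single one-dimensional convolutional layer with ReLU activation slides over the pooled feature vector. The channel count (8) and kernel width (5) are hypothetical choices for illustration; the paper's actual architecture uses its own layer sizes and additional layers.

```python
import numpy as np

def conv1d_relu(x, kernels, stride=1):
    """Valid 1-D convolution followed by ReLU.
    x: input of shape (in_channels, length)
    kernels: weights of shape (out_channels, in_channels, kernel_width)"""
    out_ch, in_ch, k = kernels.shape
    out_len = (x.shape[1] - k) // stride + 1
    out = np.zeros((out_ch, out_len))
    for o in range(out_ch):
        for t in range(out_len):
            # Dot product of the kernel with one window of the input.
            out[o, t] = np.sum(kernels[o] * x[:, t * stride : t * stride + k])
    return np.maximum(out, 0.0)  # ReLU non-linearity

rng = np.random.default_rng(1)
x = rng.standard_normal((1, 193))       # pooled feature vector as a 1-channel signal
w = rng.standard_normal((8, 1, 5))      # 8 filters of width 5 (hypothetical)

h = conv1d_relu(x, w)
print(h.shape)  # (8, 189): 193 - 5 + 1 output positions per filter
```

Stacking several such layers, interleaved with dropout and pooling and followed by dense layers, yields the kind of 1-D CNN classifier the framework builds on.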
Model variations and experiments
For the classification of emotions, we implemented several incremental models using the three datasets mentioned above. We discuss these models in detail below.
Discussion and conclusion
Speech emotion recognition is a complex task involving two essential problems: feature extraction and classification. In this paper, we propose a new framework for speech emotion recognition using a one-dimensional deep CNN with a combination of five different audio features as input. Our model outperforms state-of-the-art approaches on the RAVDESS and IEMOCAP datasets. For EMO-DB, we incrementally present a set of models based on our initial framework to improve performance.
CRediT authorship contribution statement
Dias Issa: Conceptualization, Methodology, Software, Visualization, Investigation, Validation, Writing - original draft, Data curation, Writing - review & editing. M. Fatih Demirci: Methodology, Supervision, Project administration, Conceptualization, Writing - original draft, Writing - review & editing. Adnan Yazici: Methodology, Supervision, Project administration, Conceptualization, Writing - original draft, Writing - review & editing.
Declaration of Competing Interest
The authors declare no conflicts of interest.
References (40)
- et al., Speech emotion recognition using deep 1D & 2D CNN LSTM networks, Biomed. Signal Process. Control (2019)
- et al., Automatic speech emotion recognition using modulation spectral features, Speech Commun. (2011)
- et al., Speech emotion recognition using deep neural network and extreme learning machine, Interspeech (2014)
- et al., Speech emotion recognition from spectrograms with deep convolutional neural network
- et al., Real time multiple face recognition: a deep learning approach
- et al., Voice recognition based on adaptive MFCC and deep learning
- et al., Deep residual learning for image recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
- et al., Speech emotion recognition using deep neural network considering verbal and nonverbal speech sounds
- et al., Speech emotion recognition using convolutional and recurrent neural networks
- et al., Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network
- The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): a dynamic, multimodal set of facial and vocal expressions in North American English, PLOS ONE
- IEMOCAP: interactive emotional dyadic motion capture database, Lang. Resour. Eval.
- A database of German emotional speech, Ninth European Conference on Speech Communication and Technology
- Improvement on speech emotion recognition based on deep convolutional neural networks, Proceedings of the 2018 International Conference on Computing and Artificial Intelligence
- Self-attention for speech emotion recognition, Proc. Interspeech 2019
- The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for voice research and affective computing, IEEE Trans. Affect. Comput.
- Towards robust speech emotion recognition using deep residual networks for speech enhancement, Proc. Interspeech
- The Interspeech 2016 computational paralinguistics challenge: deception, sincerity & native language, Interspeech
- Recognition of emotional speech with convolutional neural networks by means of spectral estimates
- Data augmentation using GANs for speech emotion recognition, Proc. Interspeech 2019