Speech emotion recognition with deep convolutional neural networks

https://doi.org/10.1016/j.bspc.2020.101894

Highlights

  • Sound files are represented effectively by combining various features.

  • The framework sets a new state of the art on two datasets for speech emotion recognition.

  • For the third dataset (EMO-DB), the framework obtains the second-highest accuracy.

  • The advantages of the framework are its simplicity, applicability, and generality.

Abstract

Speech emotion recognition (or classification) is one of the most challenging topics in data science. In this work, we introduce a new architecture that extracts mel-frequency cepstral coefficients, chromagram, mel-scale spectrogram, Tonnetz representation, and spectral contrast features from sound files and uses them as inputs to a one-dimensional Convolutional Neural Network for identifying emotions in samples from the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), Berlin (EMO-DB), and Interactive Emotional Dyadic Motion Capture (IEMOCAP) datasets. We utilize an incremental method for modifying our initial model in order to improve its classification accuracy. Unlike some previous approaches, all of the proposed models work directly with raw sound data without the need for conversion to visual representations. Based on experimental results, our best-performing model outperforms existing frameworks for RAVDESS and IEMOCAP, thus setting a new state of the art. For the EMO-DB dataset, it outperforms all previous works except one, but compares favorably with that one in terms of generality, simplicity, and applicability. Specifically, the proposed framework obtains 71.61% accuracy for RAVDESS with 8 classes, 86.1% for EMO-DB with 535 samples in 7 classes, 95.71% for EMO-DB with 520 samples in 7 classes, and 64.3% for IEMOCAP with 4 classes in speaker-independent audio classification tasks.

Introduction

Speech emotion recognition is an important problem receiving increasing interest from researchers due to its numerous applications, such as audio surveillance, e-learning, clinical studies, lie detection, entertainment, computer games, and call centers. Nevertheless, this problem remains a significantly challenging task even for advanced machine learning techniques. One reason for this moderate performance is the uncertainty involved in choosing the right features. In addition, background noise in audio recordings, such as real-world voice recordings, can dramatically affect the effectiveness of a machine learning model [1]. At the same time, accurate speech emotion recognition models could significantly improve the user experience in systems involving human-machine interaction, for example in the areas of Artificial Intelligence (AI) and Mobile Health (mHealth) [2]. Indeed, the ability to recognize emotions from audio samples and, therefore, the ability to imitate these emotions could have a considerable impact on the field of AI. Virtual assistants in the field of mHealth, for instance, could significantly improve their performance by using such models. In addition, speech emotion recognition systems are undemanding in terms of hardware requirements.

Deep learning models are currently used to solve recognition problems such as face recognition, voice recognition, image recognition, and speech emotion recognition [3], [4], [5], [6]. One of the main advantages of deep learning techniques is the automatic selection of features, which, in the task of speech emotion recognition, can be applied to the important attributes of sound files that characterize a particular emotion [7].

In recent years, various models based on deep neural networks for speech emotion recognition have been introduced. While one group of these models designs the neural network to detect significant features directly from raw sound samples [8], the other group uses only one particular representation of a sound file as input to its models, e.g., [7], [1].

In this work, we extract five different features from a sound file and combine the resulting matrices into a one-dimensional array by taking the mean values along the time axis. This array is then fed as input into a one-dimensional Convolutional Neural Network (1-D CNN) model. We argue that mixing these features in the input data provides a more diverse representation of a sound file, which leads to better generalization and better classification when recognizing emotions from speech. In addition, we follow an incremental methodology for modifying our baseline model to improve its classification accuracy.
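For illustration, the following is a minimal sketch of this feature-extraction step using the librosa library, averaging each representation over time into a single vector. The specific library calls, the 40-coefficient MFCC setting, and the resulting 193-value layout are illustrative assumptions for this example, not an excerpt from the actual implementation.

```python
# Illustrative sketch of the feature-extraction step using librosa
# (library calls and MFCC count are assumptions for this example).
import numpy as np
import librosa


def extract_feature_vector(path, n_mfcc=40):
    """Return a single 1-D feature vector for one sound file."""
    y, sr = librosa.load(path, sr=None)
    stft = np.abs(librosa.stft(y))

    # Each feature is a (coefficients x frames) matrix; averaging along the
    # time axis collapses it into a fixed-length vector per feature type.
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc), axis=1)
    chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sr), axis=1)
    mel = np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1)
    contrast = np.mean(librosa.feature.spectral_contrast(S=stft, sr=sr), axis=1)
    tonnetz = np.mean(
        librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr), axis=1
    )

    # Concatenation: 40 MFCC + 12 chroma + 128 mel + 7 contrast + 6 Tonnetz = 193 values.
    return np.concatenate([mfcc, chroma, mel, contrast, tonnetz])
```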

Although several speech emotion recognition frameworks in the literature combine different feature types, the proposed combination of five different spectral representations of the same sound file has not been attempted before. Specifically, a feature mix that identifies and tracks timbre fluctuations well but represents pitch classes and harmony poorly is enriched with additional features to improve its representational power. As a result, our best-performing model outperforms all existing frameworks that use audio features and report classification accuracies on the same emotion classes for both the RAVDESS [9] and IEMOCAP [10] datasets, yielding a new state of the art. For the EMO-DB dataset [11], our best-performing model outperforms all previous work with the exception of the study by Zhao et al. [12]; however, our model compares favorably with theirs in terms of generality, simplicity, and applicability. In addition, we have noticed some inconsistencies related to [12], which we discuss in Section 4.4.

In the next section, we present a brief review of previous work on speech emotion recognition. We then present our methodology and the proposed baseline model in Section 3. The datasets, our improvements to the baseline model, and the experiments are described in the following section. After comparing our results with those of previous approaches, we draw conclusions and indicate possible future directions.

Section snippets

Literature review

The majority of speech emotion recognition architectures that utilize neural networks are convolutional neural networks (CNNs), recurrent neural networks (RNNs) with long short-term memory (LSTM), or their combination [8], [7], [2], [13]. The combination of CNNs and RNNs can detect essential patterns in audio files when extracting features and classifying samples [8], [7]. One of the main goals in speech emotion recognition is the identification of significant features that could then be used

Datasets and methodology

We use three different audio datasets, RAVDESS [9], EMO-DB [11], and IEMOCAP [10], which are widely employed by researchers in emotion recognition. After presenting the datasets, we describe the proposed framework, which starts with feature extraction followed by the baseline deep learning model. While the baseline model is ideal for RAVDESS, we present some additional deep learning models generated by different hyperparameter settings of the baseline and slight modifications to its architecture
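As a concrete reference point, the sketch below shows a minimal Keras 1-D CNN classifier over the concatenated feature vector described earlier; the layer sizes, optimizer, and learning rate are illustrative assumptions rather than the exact baseline architecture reported here.

```python
# Minimal sketch of a 1-D CNN over the concatenated 193-value feature vector
# (layer sizes, optimizer, and learning rate are illustrative assumptions).
from tensorflow import keras
from tensorflow.keras import layers


def build_baseline(input_dim=193, n_classes=8):
    model = keras.Sequential([
        # The 1-D feature vector is treated as a length-193 sequence with one channel.
        layers.Input(shape=(input_dim, 1)),
        layers.Conv1D(256, kernel_size=5, padding="same", activation="relu"),
        layers.Conv1D(128, kernel_size=5, padding="same", activation="relu"),
        layers.Dropout(0.1),
        layers.MaxPooling1D(pool_size=8),
        layers.Conv1D(128, kernel_size=5, padding="same", activation="relu"),
        layers.Flatten(),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(
        optimizer=keras.optimizers.RMSprop(learning_rate=1e-5),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

In such a setup, the extracted feature vectors would be reshaped to (n_samples, 193, 1) and the emotion labels one-hot encoded before training.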

Model variations and experiments

For the classification of emotions, we have implemented several incremental models using the three datasets described above. We discuss these models in detail below.
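As a hypothetical sketch of such an incremental procedure (not the actual experimental protocol), one can vary a small set of hyperparameters of the baseline, retrain, and keep the configuration with the best validation accuracy:

```python
# Hypothetical sketch of incremental model variation: vary a small grid of
# hyperparameters, retrain, and keep the best configuration on validation data.
# Assumes a build_fn(dropout=..., kernel_size=...) constructor in the spirit of
# the baseline sketch above.
import itertools


def search_variants(build_fn, x_train, y_train, x_val, y_val):
    best_acc, best_cfg = 0.0, None
    for dropout, kernel_size in itertools.product([0.1, 0.2, 0.25], [3, 5, 7]):
        model = build_fn(dropout=dropout, kernel_size=kernel_size)
        model.fit(x_train, y_train, epochs=50, batch_size=16, verbose=0)
        _, acc = model.evaluate(x_val, y_val, verbose=0)
        if acc > best_acc:
            best_acc, best_cfg = acc, {"dropout": dropout, "kernel_size": kernel_size}
    return best_cfg, best_acc
```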

Discussion and conclusion

Speech emotion recognition is a complex task, which involves two essential problems: feature extraction and classification. In this paper, we propose a new framework for speech emotion recognition using one-dimensional deep CNN with the combination of five different audio features as input data. Our model outperforms state-of-the-art approaches for the RAVDESS and IEMOCAP datasets. For EMO-DB, we incrementally present a set of models based on our initial framework to improve the performance.

CRediT authorship contribution statement

Dias Issa: Conceptualization, Methodology, Software, Visualization, Investigation, Validation, Writing - original draft, Data curation, Writing - review & editing. M. Fatih Demirci: Methodology, Supervision, Project administration, Conceptualization, Writing - original draft, Writing - review & editing. Adnan Yazici: Methodology, Supervision, Project administration, Conceptualization, Writing - original draft, Writing - review & editing.

Declaration of Competing Interest

The authors declare no conflicts of interest.

References (40)

  • J. Zhao et al.

    Speech emotion recognition using deep 1D & 2D CNN LSTM networks

    Biomed. Signal Process. Control

    (2019)
  • S. Wu et al.

    Automatic speech emotion recognition using modulation spectral features

    Speech Commun.

    (2011)
  • K. Han et al.

    Speech emotion recognition using deep neural network and extreme learning machine

    Interspeech

    (2014)
  • A.M. Badshah et al.

    Speech emotion recognition from spectrograms with deep convolutional neural network

  • S. Mittal et al.

    Real time multiple face recognition: a deep learning approach

  • H.-S. Bae et al.

    Voice recognition based on adaptive MFCC and deep learning

  • K. He et al.

    Deep residual learning for image recognition

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2016)
  • K.-Y. Huang et al.

    Speech emotion recognition using deep neural network considering verbal and nonverbal speech sounds

  • W. Lim et al.

    Speech emotion recognition using convolutional and recurrent neural networks

  • G. Trigeorgis et al.

    Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network

  • S.R. Livingstone et al.

    The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): a dynamic, multimodal set of facial and vocal expressions in North American English

    PLOS ONE

    (2018)
  • C. Busso et al.

    IEMOCAP: interactive emotional dyadic motion capture database

    Lang. Resour. Eval.

    (2008)
  • F. Burkhardt et al.

    A database of German emotional speech

    Ninth European Conference on Speech Communication and Technology

    (2005)
  • Y. Niu et al.

    Improvement on speech emotion recognition based on deep convolutional neural networks

    Proceedings of the 2018 International Conference on Computing and Artificial Intelligence

    (2018)
  • L. Tarantino et al.

    Self-attention for speech emotion recognition

    Proc. Interspeech 2019

    (2019)
  • F. Eyben et al.

    The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing

    IEEE Trans. Affect. Comput.

    (2015)
  • A. Triantafyllopoulos et al.

    Towards robust speech emotion recognition using deep residual networks for speech enhancement

    Proc. Interspeech

    (2019)
  • B.W. Schuller et al.

    The Interspeech 2016 computational paralinguistics challenge: deception, sincerity & native language

    Interspeech

    (2016)
  • N. Weißkirchen et al.

    Recognition of emotional speech with convolutional neural networks by means of spectral estimates

  • A. Chatziagapi et al.

    Data augmentation using GANs for speech emotion recognition

    Proc. Interspeech 2019

    (2019)