Knowledge-Based Systems

Volume 184, 15 November 2019, 104886

Bagged support vector machines for emotion recognition from speech

https://doi.org/10.1016/j.knosys.2019.104886

Abstract

Speech emotion recognition, a highly promising and exciting problem in the field of Human-Computer Interaction, has been studied and analyzed over several decades. It concerns the task of recognizing a speaker’s emotions from their speech recordings. Recognizing emotions from speech can go a long way in determining a person’s physical and psychological state of well-being. In this work, we performed emotion classification on three corpora: the Berlin EmoDB, the Indian Institute of Technology Kharagpur Simulated Emotion Hindi Speech Corpus (IITKGP-SEHSC), and the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). A combination of spectral features was extracted from them, then processed and reduced to the required feature set. Ensemble learning has been shown to give superior performance compared to single estimators. We propose a bagged ensemble of support vector machines with a Gaussian kernel as a viable algorithm for the problem at hand, and we report the results obtained on the three datasets mentioned above.

Introduction

Speech is one of the primary means of communication among human beings. One can convey emotions, state of mind, and more through speech, and speech-related applications have sprung up in numerous areas such as personal digital assistants, text-to-speech models, and sensors. The natural next step is thus to teach a computer to interact just as humans do, in that it could learn to understand the emotions underlying spoken language and respond appropriately. This is why it is important to train a machine to recognize people’s emotions from their speech.

The task of recognizing emotions in speech (both speaker-dependent and speaker-independent) has been a subject of considerable interest for quite some time. The problem is highly challenging and multi-dimensional, because emotions can be conveyed differently in different forms of speech. Moreover, determining which features to extract from speech in order to analyze its inherent emotions is a difficult problem in itself.

The existing approaches to this problem mostly make use of support vector machines (SVMs), hidden Markov models (HMMs) or neural networks. While SVMs provide reasonably good estimates with less effort, neural networks and hidden Markov models are difficult to build and train, and demand substantial computational power and time. A method is therefore needed to enhance the performance of support vector machines on this problem, and this is where ensemble learning comes into the picture.
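
For concreteness, a single SVM with a Gaussian (RBF) kernel can be trained in a few lines with scikit-learn. The sketch below is purely illustrative: the feature matrix X and labels y are synthetic placeholders for extracted speech features and emotion classes, and the hyperparameters are not the paper's settings.

```python
# Minimal sketch: a single Gaussian-kernel (RBF) SVM in scikit-learn.
# X and y are synthetic placeholders for speech features and emotion labels;
# C and gamma are illustrative values, not the paper's settings.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 45))                     # stand-in feature vectors
y = (X[:, 0] > 0).astype(int) + 2 * (X[:, 1] > 0)  # four synthetic "emotion" classes

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```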

Ensemble learning [1] consists of training multiple estimators and aggregating their outcomes according to particular rules. Prominent ways of building ensembles include bagging (bootstrap aggregating) and boosting. Both methods usually combine learners of the same type; bagging, however, is a parallel mechanism, while boosting is an iterative procedure.
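
To make the distinction concrete, the sketch below wraps the same RBF-kernel SVM base learner in a bagged and a boosted ensemble with scikit-learn and compares them by cross-validation. The data are synthetic placeholders, and the ensemble sizes and hyperparameters are illustrative, not the paper's configuration.

```python
# Sketch: bagging vs. boosting around the same SVM base learner.
# Bagging fits each SVM on an independent bootstrap sample (parallelizable);
# AdaBoost fits SVMs sequentially, reweighting examples after each round.
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 45))                     # placeholder features
y = (X[:, 0] > 0).astype(int) + 2 * (X[:, 1] > 0)  # four synthetic classes

base = SVC(kernel="rbf", probability=True)  # probabilities for AdaBoost variants

ensembles = {
    "bagged SVM": BaggingClassifier(estimator=base, n_estimators=10, random_state=0),
    "boosted SVM": AdaBoostClassifier(estimator=base, n_estimators=10, random_state=0),
}
for name, model in ensembles.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

(The `estimator=` keyword requires scikit-learn 1.2 or later; older releases call the same argument `base_estimator=`.)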

Our approach examines the performance of these ensemble methods on the problem of emotion recognition from speech. In particular, we assess the performance of ensembles of support vector machines. We compare bagged and boosted ensembles built from the same base learners, and observe that the bagging estimator outperforms boosting.

Section 2 summarizes previous research on speech emotion recognition and on ensemble learning methods. Section 3 gives a general overview of the system, including the model description and the feature extraction process. Section 4 gives a thorough description of the datasets used, the experimental setup and the procedure. Section 5 subsequently reports observations and compares results with some state-of-the-art systems, and Section 6 then draws conclusions from the observations.

Prior research

This section is divided into two parts: one covering research on emotion recognition from speech, and the other covering advances in ensemble learning methods.

System overview

This section covers the system overview: the nature and quantity of the features extracted, and the structure and design of the model used.
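
As an illustration of the kind of feature extraction described (the abstract and conclusion mention MFCCs and spectral centroids; the paper's full set has 455 features), the sketch below pulls frame-level MFCCs and spectral centroids from one recording with librosa and summarizes them into a fixed-length vector. The file name, feature counts, and summary statistics are placeholders, not the paper's exact recipe.

```python
# Sketch: frame-level MFCCs and spectral centroids with librosa, summarized
# into one fixed-length vector per utterance. "utterance.wav", n_mfcc=13 and
# the mean/std statistics are illustrative, not the paper's 455-feature set.
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=None)            # placeholder file

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # shape (13, n_frames)
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)  # shape (1, n_frames)

frames = np.vstack([mfcc, centroid])                      # (14, n_frames)
features = np.concatenate([frames.mean(axis=1),           # per-coefficient mean
                           frames.std(axis=1)])           # per-coefficient std
print(features.shape)                                     # (28,) per utterance
```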

Datasets description

We present results on three emotional speech corpora, the details of which are described below.

Results and discussion

We first extracted 455 features from the datasets for emotion recognition and then reduced their dimensionality using Boruta. As per Section 4.3, the recognition rate remained approximately the same while the number of required features was reduced. In the case of the EmoDB dataset, the reduction is almost 68%, and the recognition rate improved by 7% compared to the results obtained using all 455 features.
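
For readers who want to reproduce this dimensionality-reduction step, the sketch below uses BorutaPy, a common Python implementation of Boruta built around a random forest; the paper names Boruta but not a particular implementation, and the data here are synthetic placeholders for the 455 extracted features.

```python
# Sketch: reducing a 455-dimensional feature matrix with BorutaPy.
# X and y are synthetic placeholders for the extracted features and
# emotion labels; hyperparameters are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy  # pip install Boruta

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 455))          # stand-in for the 455 raw features
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # only features 0 and 1 are informative

forest = RandomForestClassifier(n_jobs=-1, max_depth=5)
selector = BorutaPy(forest, n_estimators="auto", random_state=42)
selector.fit(X, y)                       # BorutaPy expects plain numpy arrays

X_reduced = X[:, selector.support_]      # keep only the confirmed features
print(X_reduced.shape)                   # here Boruta should confirm ~2 features
```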

The proposed method was also evaluated using only MFCC features, but their recognition rate was as

Conclusion and future work

In this work, we proposed a bagged ensemble of support vector machines with a Gaussian kernel for speech emotion recognition (SER). We first extracted MFCCs along with spectral centroids to represent emotional speech, then applied a wrapper-based feature selection method to retrieve the best feature set. Experiments on the EmoDB, RAVDESS and IITKGP-SEHSC databases show the superiority of the proposed approach over the state of the art in terms of overall accuracy.

Many questions and potential avenues for

Acknowledgments

This article has received funding from the Infosys Center for AI, IIIT-Delhi, and an ECRA Grant (ECR/2018/002449) from SERB, Government of India.

References (45)

  • Park, Jeong-Sik, et al.

    Feature vector classification based speech emotion recognition for service robots

    IEEE Trans. Consum. Electron.

    (2009)
  • Kim, Eun Ho, et al.

    Improved emotion recognition with a novel speaker-independent feature

    IEEE/ASME Trans. Mechatronics

    (2009)
  • Hasan, Md Rashidul, et al.

    Speaker identification using mel frequency cepstral coefficients

    Variations

    (2004)
  • Dave, Namrata

    Feature extraction methods LPC, PLP and MFCC in speech recognition

    Int. J. Adv. Res. Eng. Technol.

    (2013)
  • Bou-Ghazale, Sahar E., et al.

    A comparative study of traditional and newly proposed features for recognition of speech under stress

    IEEE Trans. Speech Audio Process.

    (2000)
  • Liu, Gabrielle K.

    Evaluating Gammatone frequency cepstral coefficients with neural networks for emotion recognition from speech

    (2018)
  • Li, Xi, et al.

    Stress and emotion classification using jitter and shimmer features

  • Pan, Yixiong, et al.

    Speech emotion recognition using support vector machine

    Int. J. Smart Home

    (2012)
  • Schuller, Björn, et al.

    Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture

  • Fahad, Md, et al.

    DNN-HMM based speaker adaptive emotion recognition using proposed epoch and MFCC features

    (2018)
  • Sun, Xuejing, Pitch accent prediction using ensemble machine learning, in: Seventh International Conference on Spoken...

No author associated with this paper has disclosed any potential or pertinent conflicts of interest with respect to this work. For full disclosure statements refer to https://doi.org/10.1016/j.knosys.2019.104886.
