Knowledge-Based Systems

Volume 184, 15 November 2019, 104886

Bagged support vector machines for emotion recognition from speech

https://doi.org/10.1016/j.knosys.2019.104886

Abstract

Speech emotion recognition, a highly promising and exciting problem in the field of Human-Computer Interaction, has been studied and analyzed over several decades. It concerns the task of recognizing a speaker’s emotions from their speech recordings. Recognizing emotions from speech can go a long way in determining a person’s physical and psychological state of well-being. In this work, we performed emotion classification on three corpora: the Berlin EmoDB, the Indian Institute of Technology Kharagpur Simulated Emotion Hindi Speech Corpus (IITKGP-SEHSC), and the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). A combination of spectral features was extracted from them, then processed and reduced to the required feature set. Ensemble learning has been shown to give superior performance compared to single estimators. We propose a bagged ensemble of support vector machines with a Gaussian kernel as a viable algorithm for the problem at hand, and we report the results obtained on the three datasets mentioned above.

Introduction

Speech is one of the primary means of communication among human beings. One can convey emotions, state of mind, and more through speech, and speech-related applications have sprung up in numerous areas such as personal digital assistants, text-to-speech models, and sensors. The natural next step is thus to teach a computer to interact just as humans do, in that it could learn to understand the emotions underlying spoken language and respond appropriately. This is why it is important to train a machine to recognize people’s emotions from their speech.

The task of recognizing emotions in speech (both speaker-dependent and speaker-independent) has been a subject of considerable interest for quite some time. The problem is highly challenging and multi-dimensional, because emotions can be conveyed differently in different forms of speech. Moreover, determining which features to extract from speech in order to analyze its inherent emotions is a difficult problem in itself.

The existing approaches to this problem mostly make use of support vector machines (SVMs), hidden Markov models (HMMs) or neural networks. While SVMs provide reasonably good estimates with less effort, neural networks and hidden Markov models are difficult to build and train, and demand substantial computational power and time. A method is therefore needed to enhance the performance of support vector machines on this problem, and this is where ensemble learning comes into the picture.
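
For concreteness, a single SVM with a Gaussian (RBF) kernel can be trained in a few lines with scikit-learn. The sketch below is purely illustrative: the feature matrix X and labels y are synthetic placeholders for extracted speech features and emotion classes, and the hyperparameters are not the paper's settings.

```python
# Minimal sketch: a single Gaussian-kernel (RBF) SVM in scikit-learn.
# X and y are synthetic placeholders for speech features and emotion labels;
# C and gamma are illustrative values, not the paper's settings.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 45))                     # stand-in feature vectors
y = (X[:, 0] > 0).astype(int) + 2 * (X[:, 1] > 0)  # four synthetic "emotion" classes

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```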

Ensemble learning [1] consists of training multiple estimators and aggregating their outcomes according to particular rules. Prominent ways of building ensembles include bagging (bootstrap aggregating) and boosting. Both methods usually combine learners of the same type; bagging, however, is a parallel mechanism, while boosting is an iterative procedure.
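
To make the distinction concrete, the sketch below wraps the same RBF-kernel SVM base learner in a bagged and a boosted ensemble with scikit-learn and compares them by cross-validation. The data are synthetic placeholders, and the ensemble sizes and hyperparameters are illustrative, not the paper's configuration.

```python
# Sketch: bagging vs. boosting around the same SVM base learner.
# Bagging fits each SVM on an independent bootstrap sample (parallelizable);
# AdaBoost fits SVMs sequentially, reweighting examples after each round.
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 45))                     # placeholder features
y = (X[:, 0] > 0).astype(int) + 2 * (X[:, 1] > 0)  # four synthetic classes

base = SVC(kernel="rbf", probability=True)  # probabilities for AdaBoost variants

ensembles = {
    "bagged SVM": BaggingClassifier(estimator=base, n_estimators=10, random_state=0),
    "boosted SVM": AdaBoostClassifier(estimator=base, n_estimators=10, random_state=0),
}
for name, model in ensembles.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

(The `estimator=` keyword requires scikit-learn 1.2 or later; older releases call the same argument `base_estimator=`.)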

Our approach examines the performance of these ensemble methods on the problem of emotion recognition from speech. In particular, we assess the performance of ensembles of support vector machines. We compare bagged and boosted ensembles built from the same base learners, and observe that the bagging estimator outperforms boosting.

Section 2 summarizes previous research on speech emotion recognition and on ensemble learning methods. Section 3 gives a general overview of the system, including the model description and the feature extraction process. Section 4 gives a thorough description of the datasets used, the experimental setup and the procedure. Section 5 subsequently reports observations and compares results with some state-of-the-art systems, and Section 6 then draws conclusions from the observations.

Prior research

This section is divided into two parts: one covering research on emotion recognition from speech, and the other covering advances in ensemble learning methods.

System overview

This section covers the system overview: the nature and quantity of the features extracted, and the structure and design of the model used.
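
As an illustration of the kind of feature extraction described (the abstract and conclusion mention MFCCs and spectral centroids; the paper's full set has 455 features), the sketch below pulls frame-level MFCCs and spectral centroids from one recording with librosa and summarizes them into a fixed-length vector. The file name, feature counts, and summary statistics are placeholders, not the paper's exact recipe.

```python
# Sketch: frame-level MFCCs and spectral centroids with librosa, summarized
# into one fixed-length vector per utterance. "utterance.wav", n_mfcc=13 and
# the mean/std statistics are illustrative, not the paper's 455-feature set.
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=None)            # placeholder file

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # shape (13, n_frames)
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)  # shape (1, n_frames)

frames = np.vstack([mfcc, centroid])                      # (14, n_frames)
features = np.concatenate([frames.mean(axis=1),           # per-coefficient mean
                           frames.std(axis=1)])           # per-coefficient std
print(features.shape)                                     # (28,) per utterance
```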

Datasets description

We present results on three emotional speech corpora, the details of which are described below.

Results and discussion

We first extracted 455 features from the datasets for emotion recognition and then reduced their dimensionality using Boruta. As per Section 4.3, the recognition rate remained approximately the same while the number of required features was reduced. In the case of the EmoDB dataset, the reduction is almost 68%, and the recognition rate improved by 7% compared to the results obtained using all 455 features.
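
For readers who want to reproduce this dimensionality-reduction step, the sketch below uses BorutaPy, a common Python implementation of Boruta built around a random forest; the paper names Boruta but not a particular implementation, and the data here are synthetic placeholders for the 455 extracted features.

```python
# Sketch: reducing a 455-dimensional feature matrix with BorutaPy.
# X and y are synthetic placeholders for the extracted features and
# emotion labels; hyperparameters are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy  # pip install Boruta

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 455))          # stand-in for the 455 raw features
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # only features 0 and 1 are informative

forest = RandomForestClassifier(n_jobs=-1, max_depth=5)
selector = BorutaPy(forest, n_estimators="auto", random_state=42)
selector.fit(X, y)                       # BorutaPy expects plain numpy arrays

X_reduced = X[:, selector.support_]      # keep only the confirmed features
print(X_reduced.shape)                   # here Boruta should confirm ~2 features
```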

The proposed method was also evaluated using only MFCC features, but their recognition rate was as

Conclusion and future work

In this work, we proposed a bagged ensemble of support vector machines with a Gaussian kernel for speech emotion recognition (SER). We first extracted MFCCs along with spectral centroids to represent emotional speech, then applied a wrapper-based feature selection method to retrieve the best feature set. Experiments on the EmoDB, RAVDESS and IITKGP-SEHSC databases show the superiority of the proposed approach over the state of the art in terms of overall accuracy.

Many questions and potential avenues for

Acknowledgments

This article has received funding from the Infosys Center for AI, IIIT-Delhi, and an ECRA Grant (ECR/2018/002449) from SERB, Government of India.

References (45)

  • Park, Jeong-Sik, et al.

    Feature vector classification based speech emotion recognition for service robots

    IEEE Trans. Consum. Electron.

    (2009)
  • Kim, Eun Ho, et al.

    Improved emotion recognition with a novel speaker-independent feature

    IEEE/ASME Trans. Mechatronics

    (2009)
  • Hasan, Md Rashidul, et al.

    Speaker identification using mel frequency cepstral coefficients

    Variations

    (2004)
  • Dave, Namrata

    Feature extraction methods LPC, PLP and MFCC in speech recognition

    Int. J. Adv. Res. Eng. Technol.

    (2013)
  • Bou-Ghazale, Sahar E., et al.

    A comparative study of traditional and newly proposed features for recognition of speech under stress

    IEEE Trans. Speech Audio Process.

    (2000)
  • Liu, Gabrielle K.

    Evaluating Gammatone frequency cepstral coefficients with neural networks for emotion recognition from speech

    (2018)
  • Li, Xi, et al.

    Stress and emotion classification using jitter and shimmer features

  • Pan, Yixiong, et al.

    Speech emotion recognition using support vector machine

    Int. J. Smart Home

    (2012)
  • Schuller, Björn, et al.

    Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture

  • Fahad, Md, et al.

    DNN-HMM based speaker adaptive emotion recognition using proposed epoch and MFCC features

    (2018)
  • Sun, Xuejing, Pitch accent prediction using ensemble machine learning, in: Seventh International Conference on Spoken...

No author associated with this paper has disclosed any potential or pertinent conflicts of interest with respect to this work. For full disclosure statements refer to https://doi.org/10.1016/j.knosys.2019.104886.
