1 Introduction

Affective computing [1], a currently active research topic in engineering applications, aims at the automatic recognition, interpretation, and synthesis of human emotions. Speech is a main vehicle of human emotion expression, since it is one of the most powerful, natural, and immediate means for human beings to communicate their emotions. Automatic recognition of human emotions from affective speech, that is, spoken emotion recognition, plays an important role in affective computing and is attracting increasing attention in fields such as speech processing, pattern recognition, artificial intelligence, and engineering applications. A major motivation comes from the desire to improve the naturalness and efficiency of human–machine interaction [2, 3].

Generally, a basic spoken emotion recognition system consists of three steps: feature extraction, feature data preprocessing, and emotion classification. Feature extraction is concerned with efficiently extracting features that characterize different emotions from speech signals. A large amount of paralinguistic and linguistic information related to emotion expression is conveyed in speech signals. Even though there is no agreement on the best features to use, prosody features are the most widely used for spoken emotion recognition in previous work [2, 4–10], since they are considered to carry important emotional cues about the speaker. The actual number of prosodic parameters used to quantify emotion varies greatly. Some of the most prominent prosody features used in spoken emotion recognition are pitch-related, intensity-related, and duration-related attributes, such as the maximum, minimum, median, and range. Another typical type of acoustic features indicating human emotions is voice quality features, which characterize the phonation process. It has been shown that the emotional content of an utterance is strongly related to its voice quality [11–13]. In [11–13], the typical voice quality parameters used for spoken emotion recognition include the first three formants (F1, F2, and F3), spectral energy distribution, harmonics-to-noise ratio (HNR), pitch irregularity (jitter), and amplitude irregularity (shimmer). In addition to the aforementioned prosody and voice quality features, the third typical type of acoustic features used for spoken emotion recognition is spectral features, which are based on the short-term power spectrum of speech signals [14–18]. Well-known spectral features include linear prediction coefficients (LPC), linear prediction cepstral coefficients (LPCC), log frequency power coefficients (LFPC), Mel-frequency cepstral coefficients (MFCC), and so on. In recent years, the combination of different types of emotional features, such as acoustic, lexical, contextual, and discourse information, has also been explored in an effort to improve emotion detection or prediction by including nonacoustic events [3, 19–22].

Feature data preprocessing aims to reduce the size of the feature set and extract a small number of features suitable for classification. To achieve this goal, there are usually two feasible approaches: feature selection and dimensionality reduction. Feature selection is used to select the most relevant feature subsets among a large number of extracted features and to remove the irrelevant ones. In recent years, various feature selection techniques, such as forward selection (FS) [3], sequential forward selection (SFS) [23], ensemble random forest to trees (ERFTrees) [24], and the fast correlation-based filter (FCBF) [10], have been applied to spoken emotion recognition. Dimensionality reduction is used to generate a few new features containing most of the valuable speech information. So far, various linear and nonlinear dimensionality reduction techniques have been used for spoken emotion recognition. Traditional linear dimensionality reduction methods, such as principal component analysis (PCA) [25] and Fisher's linear discriminant analysis (LDA) [26], have been successfully used to reduce the dimensionality of emotional speech features [3, 27]. The recently emerged manifold learning (also called nonlinear dimensionality reduction) methods, such as locally linear embedding (LLE) [28] and isometric feature mapping (Isomap) [29], have also been used to perform nonlinear dimensionality reduction for spoken emotion recognition [30, 31].

After feature data preprocessing, the next step of a spoken emotion recognition system is emotion classification, which aims to identify the underlying emotions of speech utterances. Most current research on spoken emotion recognition has focused on this classification step since it represents the interface between the problem domain and the classification techniques. Besides, traditional classifiers can easily be used in almost all proposed spoken emotion recognition systems. So far, various types of classifiers have been employed for spoken emotion classification. In the 1990s, the representative emotion classification methods were linear discriminant classifiers (LDC) and K-nearest neighbor (KNN) [32, 33]. Another common classifier, which became popular around 2000 for spoken emotion recognition applications, was the artificial neural network (ANN) [34, 35]. After 2002, more attention was paid to several statistical pattern recognition techniques, such as support vector machines (SVMs) [36–39], Gaussian mixture models (GMM) [40, 41], and hidden Markov models (HMM) [14, 42]. Each classifier has its own advantages and limitations. In order to combine the merits of several different classifiers, multiple classifier fusion has also recently been employed for spoken emotion recognition due to its superior performance compared with a single classifier [23, 43, 44].

Most previous studies [3, 45, 46] in the spoken emotion recognition area focus on detecting emotional states in clean speech recorded in quiet environments, yet human beings are capable of perceiving emotions even in noisy environments. In recent years, robust emotion recognition from noisy speech has become an important issue, since emotional speech signals in real-world scenarios are usually disturbed by different levels of noise. So far, the efforts made toward robust spoken emotion recognition in noise have concentrated on the feature extraction step of the recognition system. To cope with noise in speech, Schuller et al. [47] extracted a large 4k acoustic feature set and then used a fast information gain ratio-based feature selection technique to find suitable feature subsets according to the noise situation. In [48, 49], to reduce the influence of noise, a feature dimensionality reduction method called enhanced Lipschitz embedding was used to embed the extracted 64 acoustic features into a low-dimensional nonlinear manifold. Yeh and Chi extracted joint spectro-temporal features from an auditory model and then applied them to detect the emotional state of noisy speech [50].

Among the aforementioned three steps in a spoken emotion recognition system, emotion classification is one of the most critical aspects of any successful system. Therefore, designing a good classifier is a crucial step in the robust spoken emotion recognition task. So far, very little work has been done on robust spoken emotion recognition in noise at the emotion classification step.

In recent years, the newly emerged compressive sensing (CS) (also called compressive sampling) theory [51–53], originally aimed at signal sensing and coding problems, has shown tremendous potential in the pattern recognition area. In particular, sparse representation in the CS theory has recently been used as a nonparametric classifier for pattern recognition and shows promising performance in face recognition [54–56] and speech recognition [57, 58]. This nonparametric classifier based on sparse representation is the so-called sparse representation classifier (SRC). In SRC, the test sample is represented as a sparse linear combination of the training samples, and the coding fidelity is measured by the l2-norm of the coding residual. Such a sparse representation model in SRC actually assumes that the coding residual follows a Gaussian distribution. However, in practice, this assumption may not hold well in noisy environments since the coding residual may not conform to a Gaussian distribution. Therefore, SRC may not exhibit its robustness and effectiveness well in noisy environments.

To improve the robustness and effectiveness of sparse representation in SRC, in this paper, an enhanced sparse representation classifier (abbreviated as enhanced-SRC) is proposed for robust spoken emotion recognition in noise. We first employ maximum likelihood estimation (MLE) to formulate a weighted sparse representation model and then construct a new classification technique based on it, that is, enhanced-SRC. The proposed enhanced-SRC is used to perform spoken emotion recognition, and its performance is investigated on both clean and noisy emotional speech.

The main contributions of this paper are as follows: (1) An enhanced sparse representation classifier is developed for robust emotion recognition in noisy speech. (2) Sparse representation is a recently emerged technique based on the CS theory. To the best of our knowledge, this work represents one of the first attempts to develop a new classification method for spoken emotion recognition based on sparse representation. (3) Very little work has been done on robust spoken emotion recognition in noise at the emotion classification step. In this paper, we present a new emotion classification method via sparse representation for robust spoken emotion recognition. (4) The influence of feature selection on spoken emotion recognition tasks is also investigated. The performance of the proposed method combined with a modern feature selection method, that is, the fast correlation-based filter (FCBF), is also reported on clean and noisy emotional speech.

The rest of this paper is organized as follows: Sect. 2 reviews compressive sensing (CS) and the sparse representation classifier (SRC). Section 3 describes the proposed enhanced-SRC method in detail. Section 4 introduces the emotional speech corpora, and Sect. 5 details the acoustic feature extraction. Experimental results and analysis are given in Sect. 6. A discussion is provided in Sect. 7. Section 8 offers the concluding remarks.

2 Review of CS and SRC

In this section, we briefly review the compressive sensing (CS) theory and then present the details of the recently emerged sparse representation classifier (SRC) based on the CS theory.

2.1 Compressive sensing (CS)

Compressed sensing (CS) [51–53] aims to recover a sparse signal from a small number of random linear measurements. Usually, the number of measurements is much lower than the number of samples needed if the signal is sampled at the Nyquist–Shannon sampling rate. The recovery procedure minimizes the l1-norm of the sparse signal by solving a convex optimization problem.

Consider the following under-determined system of equations:

$$ y_{m \times 1} = {\mathbf{A}}_{m \times n} x_{n \times 1} , \, m < n $$
(1)

It is known that Eq. (1) above has no unique solution, since the number of variables is larger than the number of equations. In signal processing terms, the length of the signal (n) is larger than the number of samples (m). However, according to the CS theory, if the signal is sufficiently sparse, the solution is necessarily unique and can be recovered by practical algorithms.

A signal x is said to be k-sparse if it is a linear combination of only k basis vectors; that is, only k entries of x are nonzero and the remainder are all zero. In this case, it is possible to find the solution to Eq. (1) by brute-force enumeration of all possible k-sparse vectors of length n. Mathematically, this problem can be expressed as

$$ \hbox{min} \left\| x \right\|_{0} , {\text{ subject to }}y = {\mathbf{A}}x $$
(2)

where \( \left\| \cdot \right\|_{0} \) is the l0-norm, which denotes the number of nonzero elements in a vector. Equation (2) is known to be an NP (nondeterministic polynomial)-hard problem and is thus not a practical route to solving Eq. (1). The CS literature [51–53] indicates that, under a certain condition on the projection matrix A known as the restricted isometry property (RIP), the sparsest solution to Eq. (1) can be obtained by replacing the l0-norm in Eq. (2) with its closest convex surrogate, the l1-norm \( \left( \left\| \cdot \right\|_{1} \right) \). Therefore, the solution to Eq. (2) is equivalent to that of the following l1-norm minimization problem

$$ \hbox{min} \left\| x \right\|_{1} , {\text{ subject to }}y = {\mathbf{A}}x $$
(3)

where the l1-norm \( \left\| \cdot \right\|_{1} \) denotes the sum of the absolute values of the elements of a vector and serves as a convex approximation of the l0-norm.

In practice, the equality \( y = {\mathbf{A}}x \) is often relaxed to take into account the existence of measurement error in the sensing process due to a small amount of noise. Suppose the measurements are inaccurate and consider the noisy model

$$ y = {\mathbf{A}}x + e $$
(4)

where e is a stochastic or deterministic error term. In particular, if the error term e is assumed to be white noise such that \( \left\| e \right\|_{2} < \varepsilon \), where \( \varepsilon \) is a small constant, a noise-robust version of Eq. (3) can be defined as follows

$$ \hbox{min} \left\| x \right\|_{1} , {\text{ subject to }}\left\| {y - {\mathbf{A}}x} \right\|_{2} < \varepsilon $$
(5)

where \( \left\| {y - {\mathbf{A}}x} \right\|_{2} \) is the coding fidelity term, which ensures that the given signal y can be faithfully represented over the coding dictionary \( {\mathbf{A}} \). Various efficient algorithms have been developed to solve the l1-minimization problems in (3) and (5). Two typical algorithms based on the interior-point idea are l1-magic [59] and l1-ls [60]. The l1-magic algorithm [59] recasts the l1-minimization problem as a second-order cone program and then applies a primal log-barrier approach. The l1-ls algorithm [60] is a specialized interior-point method for solving large-scale l1-regularized least-squares programs; it uses the preconditioned conjugate gradient algorithm to compute the search direction.
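As a small illustration of how the problems in Eqs. (3) and (5) can be solved in practice, the following Python sketch uses the cvxpy package as a convenient stand-in for the l1-magic and l1-ls solvers cited above (cvxpy is not used in this paper; the toy problem sizes, noise level, and tolerance are arbitrary):

```python
import numpy as np
import cvxpy as cp

# Toy under-determined system: m = 20 measurements, n = 50 unknowns, k = 3 nonzeros.
rng = np.random.default_rng(0)
m, n, k = 20, 50, 3
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
y = A @ x_true

# Eq. (3): basis pursuit, min ||x||_1 subject to y = Ax.
x = cp.Variable(n)
cp.Problem(cp.Minimize(cp.norm1(x)), [A @ x == y]).solve()
print("noiseless recovery error:", np.linalg.norm(x.value - x_true))

# Eq. (5): noise-tolerant version, min ||x||_1 subject to ||y - Ax||_2 <= eps.
y_noisy = y + 0.01 * rng.standard_normal(m)
x2 = cp.Variable(n)
cp.Problem(cp.Minimize(cp.norm1(x2)), [cp.norm2(y_noisy - A @ x2) <= 0.05]).solve()
print("noisy recovery error:", np.linalg.norm(x2.value - x_true))
```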

2.2 Sparse representation classifier (SRC)

Based on the CS theory, the so-called sparse representation classifier (SRC) has recently been proposed [54]. SRC is based on the assumption that the whole set of training samples forms a dictionary; the recognition problem is then cast as discriminatively finding a sparse representation of the test sample as a linear combination of the training samples by solving the optimization problem in Eq. (3) or (5). Formally, for the training samples of a single class, this assumption can be expressed as

$$ y_{k,test} = \alpha_{k,1} y_{k,1} + \alpha_{k,2} y_{k,2} + \cdots + \alpha_{{k,n_{k} }} y_{{k,n_{k} }} + \varepsilon_{k} = \sum\limits_{i = 1}^{{n_{k} }} {\alpha_{k,i} y_{k,i} } + \varepsilon_{k} $$
(6)

where y k,test is the test sample of the kth class, y k,i is the ith training sample of the kth class, α k,i is the corresponding weight coefficient, and ɛ k is the approximation error.

For the training samples from all c object classes, the aforementioned Eq. (6) can be expressed as

$$ \begin{gathered} y_{{k,{\text{test}}}} = \alpha_{1,1} y_{1,1} + \cdots + \alpha_{k,1} y_{k,1} + \cdots + \alpha_{{k,n_{k} }} y_{{k,n_{k} }} + \cdots + \alpha_{{c,n_{c} }} y_{{c,n_{c} }} + \varepsilon \hfill \\ \, = \sum\limits_{i = 1}^{{n_{1} }} {\alpha_{1,i} y_{1,i} } + \cdots + \sum\limits_{i = 1}^{{n_{k} }} {\alpha_{k,i} y_{k,i} } + \cdots + \sum\limits_{i = 1}^{{n_{c} }} {\alpha_{c,i} y_{c,i} } + \varepsilon \hfill \\ \end{gathered} $$
(7)

In matrix–vector notation, Eq. (7) can be rewritten as

$$ y_{{k,{\text{test}}}} = {\mathbf{A\alpha }} + \varepsilon $$
(8)

where \( {\mathbf{A}} = [y_{1,1} \left| \cdots \right|y_{{1,n_{1} }} \left| \cdots \right|y_{k,1} \left| \cdots \right|y_{{k,n_{k} }} \left| \cdots \right|y_{c,1} \left| \cdots \right|y_{{c,n_{c} }} ] \) and \( {\varvec{\alpha}} = [\alpha_{1,1} \cdots \alpha_{{1,n_{1} }} \cdots \alpha_{k,1} \cdots \alpha_{{k,n_{k} }} \cdots \alpha_{c,1} \cdots \alpha_{{c,n_{c} }} ]^{'} \). In Eq. (8), \( y_{{k,{\text{test}}}} \) is the feature vector of the test sample, \( {\mathbf{A}} \) is the matrix whose columns are the feature vectors of the training samples, and \( {\varvec{\alpha}} \) is the weight (coefficient) vector.

The linearity assumption in the SRC algorithm, coupled with Eq. (8), implies that the entries of the weight vector \( {\varvec{\alpha}} \) should be zero except for those associated with the correct class of the test sample. To obtain the weight vector \( {\varvec{\alpha}} \), the following l0-norm minimization problem should be solved.

$$ \mathop {\hbox{min} }\limits_{\alpha } \left\| {\varvec{\alpha}} \right\|_{0} ,{\text{ subject to }}\left\| {y_{k,test} - {\mathbf{A\alpha }}} \right\|_{2} \le \varepsilon $$
(9)

It is known that Eq. (9) is an NP-hard problem. The NP-hard l0-norm can be replaced by its closest convex surrogate, the l1-norm. Therefore, the solution of Eq. (9) is equivalent to that of the following l1-norm minimization problem.

$$ \mathop {\hbox{min} }\limits_{\alpha } \left\| {\varvec{\alpha}} \right\|_{1} ,{\text{ subject to }}\left\| {y_{{k,{\text{test}}}} - {\mathbf{A\alpha }}} \right\|_{2} \le \varepsilon $$
(10)

This is a convex optimization problem and can be solved by quadratic programming. Once a sparse solution of \( {\varvec{\alpha}} \) is obtained, the classification procedure of SRC [54] is given as follows:

  • Step 1: Solve the l 1-norm minimization problem in Eq. (10).

  • Step 2: For each class i, compute the residual between the reconstructed sample \( y_{\text{recons}} (i) = \sum\nolimits_{j = 1}^{{n_{i} }} {\alpha_{i,j} y_{i,j} } \) and the given test sample by \( r(y_{\text{test}} ,i) = \left\| {y_{\text{test}} - y_{\text{recons}} (i)} \right\|_{2} \).

  • Step 3: The class of the given test sample is determined by identity\( (y_{\text{test}} ) = \arg \min_{i} r(y_{\text{test}} ,i) \).
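The following Python sketch illustrates Steps 1–3. It is a minimal illustration rather than the authors' implementation: the constrained problem of Eq. (10) is replaced by its unconstrained Lagrangian (LASSO) form and solved with scikit-learn, and all names and the regularization value are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

def src_predict(y_test, A, labels, lam=0.01):
    """SRC decision rule: sparse coding over the training dictionary, then
    minimum class-wise reconstruction residual.

    A      : m x n matrix whose columns are training feature vectors
    labels : length-n array of class labels aligned with the columns of A
    y_test : length-m test feature vector
    """
    # Step 1: sparse coding (Lagrangian LASSO stand-in for Eq. (10)).
    alpha = Lasso(alpha=lam, fit_intercept=False, max_iter=10000).fit(A, y_test).coef_

    # Steps 2-3: per-class reconstruction residuals and minimum-residual decision.
    classes = np.unique(labels)
    residuals = [np.linalg.norm(y_test - A[:, labels == c] @ alpha[labels == c])
                 for c in classes]
    return classes[int(np.argmin(residuals))]
```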

3 The proposed enhanced-SRC method

From the viewpoint of maximum likelihood estimation (MLE), the coding fidelity term with the l2-norm in Eq. (10) actually assumes that the coding residual \( e = y - {\mathbf{A\alpha }} \) follows a Gaussian distribution. However, in practice, this assumption may not hold well in noisy environments since the coding residual may not conform to a Gaussian distribution. Therefore, the conventional sparse representation model defined in Eq. (10) may not be robust and effective enough for signal representation. This is the main drawback of SRC and explains why SRC may not exhibit its robustness and effectiveness well in noisy environments. To build a more robust and effective sparse representation model, in this paper, MLE is employed to find the MLE solution of the coding coefficient vector so as to formulate a weighted sparse representation model, which gives rise to the proposed enhanced-SRC.

The conventional sparse representation problem in Eq. (5) can be rewritten as the following LASSO [61] problem

$$ \mathop {\hbox{min} }\limits_{\alpha } \left\| {y - {\mathbf{A\alpha }}} \right\|_{2} ,{\text{ subject to }}\left\| {\varvec{\alpha}} \right\|_{1} \le \varepsilon $$
(11)

It can be seen that the sparse representation problem in Eq. (11) is essentially a sparsity-constrained least-squares estimation problem. It is known that the least-squares solution is the MLE solution only when the coding residual \( e = y - {\mathbf{A\alpha }} \) follows a Gaussian distribution.

Suppose that the coding dictionary \( {\mathbf{A}} \in {\text{R}}^{m \times n} \) is written row-wise as \( {\mathbf{A}} = [\lambda_{1} ;\lambda_{2} ; \cdots ;\lambda_{m} ] \), where \( \lambda_{i} \) (\( i = 1,2, \cdots ,m \)) is the ith row vector of \( {\mathbf{A}} \); then the coding residual \( e = y - {\mathbf{A\alpha }} \) can be expressed elementwise as

$$ e_{i} = y_{i} - \lambda_{i} {\varvec{\alpha}}, \, i = 1,2, \cdots ,m $$
(12)

If \( e_{1} , e_{2} , \ldots , e_{m} \) are independently and identically distributed with some probability density function (PDF) \( f(e_{i} \left| \theta \right.) \), where \( \theta \in \vartheta \) is a parameter that characterizes the distribution, MLE aims to maximize the likelihood function

$$ \mathop {\arg \hbox{max} }\limits_{\theta \in \vartheta } L(\theta \left| {e_{1} ,e_{2} , \ldots ,e_{m} } \right.) = \mathop {\arg \hbox{max} }\limits_{\theta \in \vartheta } \prod\limits_{i = 1}^{m} {f(e_{i} \left| \theta \right.)} $$
(13)

In practice, it is often more convenient to work with the logarithm of the likelihood function (called log-likelihood).

$$ \mathop {\arg \hbox{max} }\limits_{\theta \in \vartheta } \ln L(\theta \left| {e_{1} ,e_{2} , \ldots ,e_{m} } \right.) = \mathop {\arg \hbox{max} }\limits_{\theta \in \vartheta } \sum\limits_{i = 1}^{m} {\ln f(e_{i} \left| \theta \right.)} $$
(14)

Letting \( p_{\theta } (e_{i} ) = - \ln f(e_{i} \left| \theta \right.) \), MLE equivalently aims to minimize the following negative log-likelihood

$$ \mathop {\arg \hbox{min} }\limits_{\theta \in \vartheta } ( - \ln L(\theta \left| {e_{1} ,e_{2} , \ldots ,e_{m} } \right.)) = \mathop {\arg \hbox{min} }\limits_{\theta \in \vartheta } ( - \sum\limits_{i = 1}^{m} {\ln f(e_{i} \left| \theta \right.)} ) = \mathop {\arg \hbox{min} }\limits_{\theta \in \vartheta } \sum\limits_{i = 1}^{m} {p_{\theta } (e_{i} )} $$
(15)
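For instance (a standard special case written out here for clarity; it is not spelled out in the original text), if each residual is zero-mean Gaussian with variance \( \sigma^{2} \), then

$$ f(e_{i} \left| \theta \right.) = \frac{1}{\sqrt{2\pi }\,\sigma }\exp \left( - \frac{e_{i}^{2}}{2\sigma^{2}} \right), \quad p_{\theta } (e_{i} ) = \frac{e_{i}^{2}}{2\sigma^{2}} + \ln \left( \sqrt{2\pi }\,\sigma \right) $$

so that, for fixed \( \sigma \), minimizing \( \sum\nolimits_{i = 1}^{m} {p_{\theta } (e_{i} )} \) over the coding vector amounts to an ordinary least-squares fit of the residuals; this is the connection to Eq. (11) exploited below.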

To find the MLE solution of coding coefficient vector \( {\varvec{\alpha}} \), the sparse representation problem in Eq. (11) can be formulated as the following minimization

$$ \mathop {\hbox{min} }\limits_{\alpha } \sum\limits_{i = 1}^{m} {p_{\theta } (y_{i} - \lambda_{i} {\varvec{\alpha}})} ,{\text{ subject to }}\left\| {\varvec{\alpha}} \right\|_{1} \le \varepsilon $$
(16)

As shown in Eq. (16), the MLE-based sparse representation model is essentially a sparsity-constrained MLE problem. When the coding residual \( e = y - {\mathbf{A\alpha }} \) follows a Gaussian distribution, the sparse representation problem in Eq. (16) reduces to that in Eq. (11). Therefore, the MLE-based sparse representation model is a more general sparse representation formulation. Since \( e = y - {\mathbf{A\alpha }} \), the sparse representation problem in Eq. (16) can be approximately rewritten as the following weighted sparse representation problem

$$ \mathop {\hbox{min} }\limits_{\alpha } \left\| {W(y - {\mathbf{A\alpha }})} \right\|_{2} ,{\text{ subject to }}\left\| {\varvec{\alpha}} \right\|_{1} \le \varepsilon $$
(17)

where W is a diagonal weighting matrix whose elements are the weights assigned to the individual feature points of the samples. Intuitively, outlier feature points with high coding residuals should receive low weights. Therefore, the following Gaussian weighting function can be adopted.

$$ W_{i} = \text{e}^{{ - \frac{{\left\| {y - y_{recons} (i)} \right\|_{2} }}{{2\sigma^{2} }}}} $$
(18)

where \( y_{recons} (i) = A\alpha_{i} \) denotes the reconstructed sample based on the coding coefficients α i , and σ is a constant. In our experiments, σ is set to 1 for its satisfactory performance. \( W_{i} \in [0,1] \) is a nonnegative scalar estimated with Eq. (18). Essentially, the weighted sparse representation problem in Eq. (17) is a weighted LASSO problem. Compared with the models in Eqs. (5) and (11), the proposed weighted LASSO model in Eq. (17) has a desirable property: outlier feature points with large residuals are adaptively assigned low weights to reduce their effects, so that the sensitivity to outliers is greatly reduced. To solve the proposed weighted LASSO model, the l1-regularized least-squares method [60] is employed to solve Eq. (17).

Based on the weighted sparse representation model in Eq. (17), a new classifier called enhanced-SRC can be developed. As in SRC, the classification procedure of enhanced-SRC is as follows; a sketch of the whole procedure is given after this list:

  1. Solve the l1-norm minimization problem in Eq. (17).

  2. For each class i, repeat the following two steps:

     (a) Reconstruct a sample for class i as a linear combination of the training samples belonging to that class: \( y_{\text{recons}} (i) = \sum\nolimits_{j = 1}^{{n_{i} }} {\alpha_{i,j} y_{i,j} } \).

     (b) Compute the residual between the reconstructed sample and the given test sample: \( r(y_{\text{test}} ,i) = \left\| {y_{\text{test}} - y_{\text{recons}} (i)} \right\|_{2} \).

  3. Once the residuals for all classes have been obtained, assign the test sample to the class with the minimum residual.
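The Python sketch below illustrates this procedure under several simplifying assumptions that are not spelled out in the text: the constrained weighted LASSO of Eq. (17) is solved in its Lagrangian form via scikit-learn's Lasso, the diagonal weight matrix W is computed per feature dimension from the residual of an initial unweighted fit (one possible reading of Eq. (18)), and σ = 1 as in the experiments. All names and the regularization value are illustrative only.

```python
import numpy as np
from sklearn.linear_model import Lasso

def enhanced_src_predict(y_test, A, labels, sigma=1.0, lam=0.01):
    """Hedged sketch of enhanced-SRC: weighted sparse coding + minimum-residual decision.

    A      : m x n matrix whose columns are training feature vectors
    labels : length-n array of class labels aligned with the columns of A
    y_test : length-m test feature vector
    """
    # Initial unweighted sparse fit (assumption), used only to estimate the residual e = y - A*alpha.
    init = Lasso(alpha=lam, fit_intercept=False, max_iter=10000).fit(A, y_test)
    e = y_test - A @ init.coef_

    # Gaussian weighting in the spirit of Eq. (18): dimensions with large residuals get low weights.
    w = np.exp(-np.abs(e) / (2.0 * sigma ** 2))

    # Step 1: weighted LASSO of Eq. (17), solved on the reweighted data (W*A, W*y).
    alpha = Lasso(alpha=lam, fit_intercept=False, max_iter=10000).fit(A * w[:, None], y_test * w).coef_

    # Steps 2-3: class-wise reconstruction residuals and minimum-residual decision.
    best_class, best_r = None, np.inf
    for c in np.unique(labels):
        mask = labels == c
        r = np.linalg.norm(y_test - A[:, mask] @ alpha[mask])
        if r < best_r:
            best_class, best_r = c, r
    return best_class
```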

4 Speech emotional corpus

To evaluate the performance of the proposed method on spoken emotion recognition tasks, we used two publicly available emotional speech databases: the Berlin database of German emotional speech [62] and the Polish database of emotional speech [63].

4.1 Berlin database

The Berlin database of German emotional speech [62], also known as Emo-DB, was developed by Professor Sendlmeier and colleagues in the Department of Communication Science, Institute for Speech and Communication, Berlin Technical University. The speech corpus comprises 535 emotional utterances spoken in seven different emotions: anger, joy, sadness, neutral, boredom, disgust, and fear. The numbers of utterances for the seven emotion categories in the Berlin database are as follows: anger (127), boredom (81), disgust (46), fear (69), joy (71), neutral (79), and sadness (62). Ten professional native German-speaking actors (5 female and 5 male) simulated the emotions, producing 10 German utterances (5 short and 5 longer sentences) that could be used in everyday communication and are interpretable in all applied emotions. The actors were instructed to read these predefined sentences in the targeted seven emotions. The length of the speech samples varies from 3 to 8 s. The recordings were made in an anechoic chamber with high-quality recording equipment and are available at a sampling rate of 16 kHz with 16-bit resolution and a mono channel. A human perception test with twenty subjects, different from the speakers, was performed to benchmark the quality of the recorded data. The reported human test accuracy on this database is 84.1 %.

4.2 Polish database

The Polish database of emotional speech [63] consists of 240 WAVE files covering six emotional states: anger, joy, sadness, neutral, boredom, and fear. The number of utterances for each emotion is 40. The wave files were acquired at a sampling frequency of 44.1 kHz with 16 bits per sample. The 240 sentences in this database were uttered by four actors and four actresses. Each person uttered the same five sentences, attempting to convey each of the different emotional loads, thus producing six different sets of recordings. To assess the quality of the database material, the recordings were evaluated by thirty subjects through a procedure of classifying randomly presented samples. The average rate of correct recognition in this evaluation experiment was 72 %.

5 Acoustic feature extraction

Although there is no general agreement regarding the best features for spoken emotion recognition, the most widely used acoustic features are prosody features, voice quality features, and spectral features. In our work, the extracted prosody features cover pitch, intensity, and duration, whereas the extracted voice quality features include the first three formants (F1, F2, and F3), spectral energy distribution, harmonics-to-noise ratio (HNR), pitch irregularity (jitter), and amplitude irregularity (shimmer). For the spectral features, the well-known MFCC features are used. The software employed for all acoustic feature extraction is PRAAT, a shareware program created by Paul Boersma and David Weenink of the Institute of Phonetic Sciences of the University of Amsterdam and publicly available online at http://www.praat.org. Not only typical statistical parameters such as the mean and standard deviation are used, but also others such as the median and quartiles are taken into account.

5.1 Prosody features

Prosody refers to the stress and intonation patterns of spoken language [64]. Its importance in conveying emotional expression is intuitive, and hence, it has always been the first acoustic parameter considered when dealing with spoken emotional expression. The prosody parameters measured in this work are related to pitch, intensity, and duration, as described below:

5.1.1 Pitch-related parameters

Pitch, often referred to as fundamental frequency (F0), is an estimate of the rate of vocal fold vibration and is considered one of the most important attributes in emotion expression and detection [35, 65].

The pitch contour has been shown to vary depending on the emotional state being expressed. Regardless of the scale difference due to speaker sex, joy and anger present a higher average pitch, while the boredom and sadness means are slightly lower than that of the neutral emotion [66]. A robust version of the autocorrelation algorithm—the default in the PRAAT system—is used to calculate the pitch contour automatically for all utterances [67]. From the pitch contour of each utterance, we extracted 10 statistics: maximum, minimum, range, mean, standard deviation, first quartile, median, third quartile, inter-quartile range, and mean absolute slope.
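As a small illustration (assuming the F0 contour has already been extracted, e.g., exported from PRAAT, as an array of voiced-frame values; the 10-ms frame step and all names are illustrative, and the slope definition below is only one plausible choice), the ten statistics can be computed as follows. Dropping the slope term gives the nine intensity statistics used in the next subsection.

```python
import numpy as np

def contour_statistics(f0, frame_step=0.01):
    """Ten pitch statistics from a voiced-frame F0 contour (values in Hz)."""
    f0 = np.asarray(f0, dtype=float)
    q1, med, q3 = np.percentile(f0, [25, 50, 75])
    return {
        "max": f0.max(), "min": f0.min(), "range": f0.max() - f0.min(),
        "mean": f0.mean(), "std": f0.std(),
        "q1": q1, "median": med, "q3": q3, "iqr": q3 - q1,
        # Mean absolute slope in Hz per second, from frame-to-frame differences.
        "mean_abs_slope": np.mean(np.abs(np.diff(f0))) / frame_step,
    }
```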

5.1.2 Intensity-related parameters

Intensity, often referred to as the volume or energy of the speech, is correlated with loudness and is one of the most intuitive indicators of the voice–emotion relation [68, 69]. Even without expertise in this matter, we can more easily imagine someone angry shouting than gently whispering. The intensity contour provides information that can be used to differentiate sets of emotions. Higher intensity levels are found in emotions with high arousal such as anger, surprise, and joy, while sadness and boredom, with low arousal levels, yield lower intensity values [14, 66]. The algorithm used to calculate the intensity convolves a Kaiser-20 window with the speech signal—the default procedure in PRAAT. From global statistics directly derived from the intensity contour, we selected 9 statistics: maximum, minimum, range, mean, standard deviation, first quartile, median, third quartile, and inter-quartile range.

5.1.3 Duration-related parameters

Prosody also involves duration-related measurements. One of the most important durational measurements for discriminating among a speaker's emotional states is the speaking rate. An acoustic correlate of the speaking rate can be defined as the inverse of the average length of the voiced regions within a certain time interval. It has been noted that fear, disgust, anger, and joy often have an increased speaking rate, while sadness has a reduced articulation rate with irregular pauses [14]. Since we do not know the onset time and duration of the individual phonemes, measures of the speaking rate are obtained with respect to voiced and unvoiced regions. Statistics are calculated from the individual durations of voiced and unvoiced regions, which are extracted from the pitch contour. All measures are calculated with respect to the length of the utterance. The duration of each frame is 20 ms. For the duration-related parameters, we selected 6 statistics: total frames, voiced frames, unvoiced frames, ratio of voiced to unvoiced frames, ratio of voiced frames to total frames, and ratio of unvoiced frames to total frames.
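A minimal sketch of these six measures, assuming the frame-level pitch contour marks unvoiced frames with 0 or NaN (as PRAAT-style trackers typically do; the function name is illustrative):

```python
import numpy as np

def duration_statistics(f0_contour):
    """Six duration/voicing statistics from a frame-level F0 contour (20-ms frames)."""
    f0 = np.asarray(f0_contour, dtype=float)
    voiced = np.nan_to_num(f0) > 0          # unvoiced frames assumed marked as 0 or NaN
    total = len(f0)
    n_voiced = int(voiced.sum())
    n_unvoiced = total - n_voiced
    return {
        "total_frames": total,
        "voiced_frames": n_voiced,
        "unvoiced_frames": n_unvoiced,
        "voiced_to_unvoiced": n_voiced / max(n_unvoiced, 1),
        "voiced_to_total": n_voiced / total,
        "unvoiced_to_total": n_unvoiced / total,
    }
```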

5.2 Voice quality features

Voice quality refers to the characteristic auditory coloring of an individual's voice, derived from a variety of laryngeal and supralaryngeal features and running continuously through the individual's speech [70]. A wide range of phonetic variables contribute to the subjective impression of voice quality. Voice quality is usually changed to strengthen the impression of emotions. Voice quality measures that have been directly related to emotions include the first three formants, spectral energy distribution, harmonics-to-noise ratio (HNR), pitch irregularity (jitter), and amplitude irregularity (shimmer) [71–73].

5.2.1 Formant-related parameters

The resonant frequencies produced in the vocal tract are referred to as formant frequencies or formants [74]. Each formant is characterized by its center frequency and its bandwidth. It has been found that the first three formants (F1, F2, F3) are affected by the emotional states of speech more than the other formants [75]. It was also noticed that the amplitudes of F2 and F3 were higher with respect to that of F1 for anger and fear compared with neutral speech [76]. To estimate the formants, PRAAT applies a Gaussian-like window for each analysis window and computes the linear predictive coding (LPC) coefficients with the Burg algorithm [77], which is a recursive estimator for auto-regressive models, where each step is estimated using the results from the previous step. The following statistics are measured for the extracted formant parameters: mean of F1, std of F1, median of F1, bandwidth of median of F1, mean of F2, std of F2, median of F2, bandwidth of median of F2, mean of F3, std of F3, median of F3, bandwidth of median of F3.

5.2.2 Spectral energy distribution–related parameters

The spectral energy distribution is calculated within four different frequency bands in order to decide whether the band mainly contains harmonics of the fundamental frequency or turbulent noise [73]. There are many contradictions in identifying the best frequency band of the power spectrum in order to classify emotions.

Many investigators put high significance on the low frequency bands, such as the 0–1.5 kHz band [75, 78], whereas others suggest the opposite [14]. Here, spectral energy distribution in four different frequency bands including low and high frequency (0–5 kHz) bands is calculated directly by PRAAT. The following features are measured: band energy from 0 to 500 Hz, band energy from 500 to 1,000 Hz, band energy from 2,500 to 4,000 Hz, and band energy from 4,000 to 5,000 Hz.
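A simple FFT-based sketch of the four band energies (PRAAT's own band-energy computation may differ in detail; the function and its defaults are illustrative):

```python
import numpy as np

def band_energies(signal, sr, bands=((0, 500), (500, 1000), (2500, 4000), (4000, 5000))):
    """Spectral energy in the listed frequency bands, from the signal's power spectrum."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2           # power spectrum
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)      # bin frequencies in Hz
    return [float(spectrum[(freqs >= lo) & (freqs < hi)].sum()) for lo, hi in bands]
```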

5.2.3 Harmonics-to-noise ratio–related parameters

The harmonics-to-noise ratio (HNR) is defined as the ratio of the energy of the harmonic part to the energy of the remaining part of the signal and represents the degree of acoustic periodicity. HNR can be considered an acoustic correlate of breathiness and roughness [79]. The values of HNR in sentences expressed with anger are significantly higher than in neutral expression [79].

The algorithm performs acoustic periodicity detection on the basis of an accurate autocorrelation method [67]. The following features are measured: maximum, minimum, range, mean, and standard deviation.

5.2.4 Jitter and Shimmer

Jitter/shimmer measures have been considered in voice quality assessment to describe the kinds of irregularities associated with vocal pathology [80].

Jitter: It measures the cycle-to-cycle variation of the fundamental period by averaging the magnitude of the difference between consecutive fundamental periods, divided by the mean period.

Jitter is defined as the relative mean absolute third-order difference of the point process and is calculated here using PRAAT [81]. Jitter is computed with the following Eq. (19), in which T i is the ith peak-to-peak interval and N is the number of intervals:

$$ {\text{Jitter}}\,(\% ) = \frac{\sum\limits_{i = 2}^{N - 1} {(2T_{i} - T_{i - 1} - T_{i + 1} )} }{\sum\limits_{i = 2}^{N - 1} {T_{i} } } $$
(19)

Shimmer: It measures the cycle-to-cycle variation of amplitude by averaging the magnitude of the difference between the amplitudes of consecutive periods, divided by the mean amplitude [80].

Shimmer is calculated similarly to jitter, as shown in Eq. (20), in which E i is the ith peak-to-peak energy value and N is the number of intervals:

$$ {\text{Shimmer}}\,(\% ) = \frac{\sum\limits_{i = 2}^{N - 1} {(2E_{i} - E_{i - 1} - E_{i + 1} )} }{\sum\limits_{i = 2}^{N - 1} {E_{i} } } $$
(20)
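A minimal numpy sketch of Eqs. (19) and (20); absolute values are taken inside the sums, following the verbal "magnitude difference" definitions above, and the exact PRAAT computation may differ:

```python
import numpy as np

def jitter_percent(periods):
    """Eq. (19): relative mean absolute second difference of the period sequence T_i."""
    T = np.asarray(periods, dtype=float)
    return 100.0 * np.abs(2 * T[1:-1] - T[:-2] - T[2:]).sum() / T[1:-1].sum()

def shimmer_percent(amplitudes):
    """Eq. (20): the same measure applied to the peak-to-peak amplitude sequence E_i."""
    E = np.asarray(amplitudes, dtype=float)
    return 100.0 * np.abs(2 * E[1:-1] - E[:-2] - E[2:]).sum() / E[1:-1].sum()
```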

5.3 MFCC features

As the representative spectral features, for each utterance, the first 13 Mel-frequency cepstral coefficients (MFCC) (including log-energy), together with their first and second delta components, are extracted using a 25-ms Hamming window at intervals of 10 ms. The mean and standard deviation of the MFCCs and of their first and second delta components are computed for each utterance, giving 156 MFCC features.
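A hedged sketch of this computation using librosa (the paper extracts its features with PRAAT, so librosa is only a stand-in here and exact values will differ; treating the 0th coefficient as the log-energy term is an assumption):

```python
import numpy as np
import librosa

def mfcc_statistics(wav_path):
    """156 MFCC statistics: mean and std of 13 MFCCs and their first/second deltas."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr),       # 25-ms analysis window
                                hop_length=int(0.010 * sr),  # 10-ms step
                                window="hamming")
    d1 = librosa.feature.delta(mfcc)
    d2 = librosa.feature.delta(mfcc, order=2)
    feats = np.vstack([mfcc, d1, d2])                        # 39 coefficients per frame
    return np.concatenate([feats.mean(axis=1), feats.std(axis=1)])   # 78 + 78 = 156
```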

To summarize, for each utterance from the emotional speech corpora, 25 prosody features, 23 voice quality features, and 156 MFCC features are extracted. These 204 extracted features are summarized in Table 1.

Table 1 Acoustic feature extraction

6 Experiments and results

To verify the robustness and effectiveness of the proposed enhanced-SRC method on spoken emotion recognition tasks, six representative classifiers, including the linear discriminant classifier (LDC), K-nearest neighbor (KNN), the C4.5 decision tree, artificial neural networks (ANNs), support vector machines (SVMs), and the recently emerged SRC, are compared with the proposed enhanced-SRC method. When using the KNN classifier, the best value of K is found by an exhaustive search within the range [1, 20] with a step of 1. As a representative ANN, the radial basis function neural network (RBFNN) classifier is used for its computational simplicity and performance comparable to other types of ANN such as the multi-layer perceptron (MLP). The LIBSVM package [82] is used to implement the SVM algorithm with a simple linear kernel function and the one-versus-one strategy for the multi-class classification problem. The parameter σ in the Gaussian weighting function of enhanced-SRC is set to 1 for its satisfactory performance. Our experimental configuration is an Intel CPU at 2.10 GHz, 1 GB RAM, and MATLAB 7.0.1 (R14).

As in [3, 44, 83], a fivefold cross-validation scheme is employed in the emotion classification experiments, and the average results are reported. In other words, each classification model is trained on four-fifths of the total data and tested on the remaining fifth, and this process is repeated with different partitioning seeds in order to account for variance between the partitions.

The experimental results and analysis are presented in two parts. First, spoken emotion recognition experiments are performed on the clean speech utterances from the original emotional speech databases. Second, spoken emotion recognition experiments are conducted on noisy speech obtained by adding Gaussian white noise with zero mean and unit variance to each utterance from the original databases at various signal-to-noise ratio (SNR) levels. We investigate the effect of noise addition in 5-dB steps, starting from the original clean speech, moving to slightly corrupted speech at 30 dB SNR, and terminating at heavily corrupted speech at −10 dB SNR.
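For reproducibility, a common way to realize such noisy conditions is to scale unit-variance Gaussian noise to a target SNR before adding it; the paper does not spell out its exact scaling, so the following is only one plausible construction:

```python
import numpy as np

def add_white_noise(signal, snr_db, rng=None):
    """Add zero-mean Gaussian white noise scaled so that the resulting SNR equals snr_db."""
    rng = rng or np.random.default_rng()
    noise = rng.standard_normal(len(signal))
    p_signal = np.mean(signal ** 2)
    p_noise = p_signal / (10.0 ** (snr_db / 10.0))    # SNR(dB) = 10*log10(Ps/Pn)
    return signal + noise * np.sqrt(p_noise / np.mean(noise ** 2))
```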

In addition, we investigate the performance of the proposed enhanced-SRC method after applying a feature selection scheme. As in [10], a modern feature selection method, the fast correlation-based filter (FCBF) [84], is used to select the most valuable feature subsets from the extracted high-dimensional acoustic features. Using the selected feature subsets, we then report the performance of the proposed enhanced-SRC method on clean and noisy emotional speech.

6.1 Experiments on the Berlin database

The recognition results of the different classification methods on the Berlin database, including LDC, KNN, C4.5, RBFNN, SVM, SRC, and the proposed enhanced-SRC, are given in Table 2. It can be seen in Table 2 that the proposed enhanced-SRC obtains the highest accuracy of 81.68 %, outperforming the other methods. More importantly, enhanced-SRC gives a clear improvement of 6.29 % over SRC. This shows that enhanced-SRC is well suited to spoken emotion recognition due to its good classification performance. The recognition performance of the other methods is 78.75 % for SVM, 74.39 % for SRC, 71.22 % for RBFNN, 70.84 % for KNN, 69.74 % for LDC, and 55.14 % for C4.5.

Table 2 Comparison of recognition results (%) on the Berlin database

To further explore the per-emotion recognition accuracy of enhanced-SRC, the confusion matrix of the 7-class emotion recognition results obtained with enhanced-SRC is presented in Table 3, where the bold numbers represent the recognition accuracy per emotion. The confusion matrix in Table 3 indicates that anger and sadness can be discriminated well, with a satisfactory accuracy of more than 96 %, while the other five emotions are classified with relatively low recognition accuracies (less than 90 %). In particular, joy is recognized with the lowest accuracy of 57.73 %, since joy is highly confused with anger.

Table 3 Confusion matrix of recognition results obtained with enhanced-SRC on the Berlin database

To evaluate the performance of enhanced-SRC on robust spoken emotion recognition in noise, Gaussian white noise is added to each utterance from the Berlin speech corpus at various SNR levels. Table 4 presents the recognition results at different SNR levels. From Table 4, we can see that at each SNR level, enhanced-SRC performs best among all the methods. More crucially, at SNR levels ranging from 30 to 10 dB, enhanced-SRC achieves a relatively stable performance of about 80 %. At SNR levels ranging from 5 to −10 dB, enhanced-SRC significantly outperforms the other methods. Even at SNR = −10 dB, enhanced-SRC still obtains an accuracy of 62.98 %. This demonstrates the robustness and effectiveness of enhanced-SRC for robust spoken emotion recognition in noise and can be attributed to two prominent characteristics of enhanced-SRC. First, enhanced-SRC uses MLE to obtain the coding coefficient vector. Second, enhanced-SRC adaptively assigns low weights to outlier feature points, reducing the influence of noise.

Table 4 Comparison of recognition results (%) at different SNR levels on the Berlin database

Although it is difficult to perform direct comparisons with previously published results on the Berlin database due to differences in experimental conditions, such as the acoustic features and classifiers used, the recognition accuracy of 81.68 % obtained by enhanced-SRC in our work is still highly comparable with recently reported work [44, 46, 83, 85], in which the experimental settings are similar to ours. In [44], the authors employed the mean of the log-spectrum (MLS), MFCC, and prosody features, and reported a best performance of 71.75 % with a multiple-feature hierarchical classifier. In [46], the authors extracted pitch, intensity, and MFCC to form the feature set of the segment-based approach (SBA) and reported a best accuracy of 75.5 %. In [83], based on pitch, formants (F1, F2, and F3), energy, and MFCC, the authors obtained a best accuracy of 64.78 % with SVM. Scherer et al. [85] used relative spectral transform perceptual linear prediction (RASTA-PLP) coefficients and spectral energy modulation features and achieved an accuracy of 70 % with KNN.

6.2 Experiments on the Polish database

Table 5 presents the recognition performance of the different classification methods on the Polish database. The results in Table 5 indicate that the proposed enhanced-SRC again obtains recognition performance superior to the other methods, that is, SRC, SVM, LDC, KNN, RBFNN, and C4.5. Note that, among all the methods, enhanced-SRC achieves the highest accuracy of 71.67 %, an obvious improvement of 10.84 % over SRC. Again, this demonstrates the promising performance of enhanced-SRC.

Table 5 Comparison of recognition results (%) on the Polish database

Table 6 gives the confusion matrix of the 6-class emotion recognition results obtained by enhanced-SRC. As shown in Table 6, only two emotions, sadness and joy, are identified relatively well, with accuracies of 82.5 % (sadness) and 77.5 % (joy). Table 7 gives the recognition results of the different classification methods on the Polish database at different SNR levels. The results in Table 7 show that enhanced-SRC still clearly performs better than the other methods at the various SNR levels, again demonstrating the robustness and effectiveness of the proposed enhanced-SRC method on robust spoken emotion recognition tasks.

Table 6 Confusion matrix of recognition results obtained with enhanced-SRC on the Polish database
Table 7 Comparison of recognition results (%) at different SNR levels on the Polish database

Now, we compare our reported recognition accuracy of 71.67 % with the previously published work [83, 86] on the Polish database. Fersini et al. [83] used pitch, formants (F1, F2, and F3), energy, and MFCC and yielded an accuracy of 68.7 % with SVM. In [86], based on pitch and temporal characteristics, the authors used linear discriminant analysis (LDA) to produce four-dimensional feature spaces that provide the highest recognition accuracy of 68 % with the nearest-neighbor classifier.

6.3 Influence of feature selection on emotion recognition accuracy

Since the dimensionality of the extracted acoustic features is still high, it is usually necessary to perform feature selection to reduce the feature dimensionality. To achieve this goal, fast correlation-based filter (FCBF) [84], as recently used for spoken emotion recognition in [10], was adopted in this work.

The basic idea of FCBF [84] is to find the predominant features in terms of their mutual information with the class to be predicted and to remove those whose mutual information is below a threshold ξ. In detail, FCBF consists of two steps: (1) selecting a subset of relevant features and (2) selecting predominant features from the relevant ones. In this work, a feature selection MATLAB toolbox, the FEAST software, available at http://www.cs.man.ac.uk/~gbrown/fstoolbox/, is used to implement the FCBF algorithm. The threshold ξ is set to 0.001 for FCBF.
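As a rough illustration of step (1) only (this is not the FEAST implementation: FCBF ranks features by symmetrical uncertainty rather than raw mutual information, and the redundancy-removal step (2) is omitted here; names and the estimator are illustrative):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def relevance_filter(X, y, threshold=0.001):
    """Keep features whose estimated mutual information with the class exceeds the
    threshold, ranked by decreasing relevance (a simplified stand-in for FCBF's first step)."""
    mi = mutual_info_classif(X, y, random_state=0)
    selected = np.where(mi > threshold)[0]
    return selected[np.argsort(mi[selected])[::-1]]
```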

Due to the different numbers of emotion categories and speech samples in the two databases, that is, the Berlin database and the Polish database, the number of features selected by FCBF also differs. With the threshold \( \xi = 0.001 \), the FCBF algorithm selects at most 90 predominant features from the extracted 204 features on the Berlin database and 70 predominant features on the Polish database. Using FCBF and the proposed enhanced-SRC, Figs. 1 and 2 give the recognition accuracy for different numbers of selected features on the Berlin database and the Polish database, respectively. As can be seen from Figs. 1 and 2, the size of the best selected feature set, corresponding to the highest recognition accuracy, is 87 on the Berlin database and 64 on the Polish database.

Fig. 1 Recognition accuracy versus number of selected features via FCBF and enhanced-SRC on the Berlin database

Fig. 2 Recognition accuracy versus number of selected features via FCBF and enhanced-SRC on the Polish database

With the best 87 selected features on the Berlin database, the recognition results of the different classification methods, that is, LDC, KNN, C4.5, RBFNN, SVM, SRC, and enhanced-SRC, are provided in Table 8. Table 9 gives the recognition results of all the classification methods at different SNR levels. It can be seen in Tables 8 and 9 that enhanced-SRC again performs best among all the methods. In addition, compared with the results in Tables 2 and 4 without feature selection, the results in Tables 8 and 9 indicate that all the classification methods obtain better recognition performance after feature selection via FCBF on the Berlin database. More specifically, under the clean condition, enhanced-SRC obtains an accuracy of 83.19 %, an improvement of 1.51 %. At SNR = −5 dB, enhanced-SRC achieves its largest improvement of 2.63 %.

Table 8 Comparison of recognition results (%) using 87 selected features on the Berlin database
Table 9 Comparison of recognition results (%) using 87 selected features at different SNR levels on the Berlin database

With the best 64 selected features on the Polish database, Table 10 presents the recognition results of all the classification methods, that is, LDC, KNN, C4.5, RBFNN, SVM, SRC, and enhanced-SRC. Table 11 gives the recognition performance of these classification methods at different SNR levels. The results in Tables 10 and 11 show that the proposed enhanced-SRC still outperforms the other methods. Moreover, after feature selection via FCBF, all the classification methods improve to a certain degree. Note that enhanced-SRC achieves an accuracy of 73.64 % under the clean condition, an improvement of 1.97 %.

Table 10 Comparison of recognition results (%) using 64 selected features on the Polish database
Table 11 Comparison of recognition results (%) using 64 selected features at different SNR levels on the Polish database

In summary, with the FCBF feature selection method, all the classification methods achieve some improvement on clean and noisy emotional speech. Note that the proposed enhanced-SRC obtains an improvement of up to 2.63 %. This demonstrates that performing feature selection via FCBF is effective in improving emotion recognition performance.

7 Discussion

From the aforementioned experimental results, some interesting points can be found as follows:

  1. Regardless of the level of the added noise, the proposed enhanced-SRC method always performs best on the two emotional speech databases, that is, the Berlin database and the Polish database, outperforming the other six typical classifiers, including LDC, KNN, C4.5, RBFNN, SVM, and SRC. This demonstrates the robustness of the proposed enhanced-SRC and can be attributed to the fact that enhanced-SRC is built on a weighted sparse representation model based on MLE.

  2. Among the six typical classifiers, that is, LDC, KNN, C4.5, RBFNN, SVM, and SRC, SVM obtains the best performance on both clean and noisy emotional speech. This can be explained by a good property of SVM: it has good generalization ability since it is based on the statistical learning theory of structural risk minimization [87], which aims to bound both the empirical risk on the training data and the capacity of the decision function.

  3. Due to differences between the emotional speech databases used, such as the quality of the audio recordings, the number of speech samples, and the number of emotional states, it is difficult to directly compare the emotion recognition results obtained on the Berlin and Polish databases. Nevertheless, in our work without feature selection, the recognition accuracy obtained by the proposed enhanced-SRC on the Berlin database is 81.68 %, about 10 % higher than that on the Polish database. This suggests that the quality of the audio recordings in the Berlin database is considerably higher than in the Polish database. Additionally, the different reported human test accuracies on these two databases (84.1 % for the Berlin database and 72 % for the Polish database) also suggest that the quality of the audio recordings in the Polish database still needs to be improved.

8 Conclusions and future work

This work presents a novel method for robust emotion recognition in noisy speech via sparse representation. The recently emerged sparse representation in the CS theory can be used to form SRC for pattern recognition. However, SRC does not obtain promising performance on spoken emotion recognition tasks, since the conventional sparse representation model in SRC is not robust and effective in noisy environments. To overcome this drawback of SRC, in this work, a weighted sparse representation model based on MLE is developed to construct a new classification algorithm called enhanced-SRC. Experimental results on the Berlin and Polish databases demonstrate the promising performance of enhanced-SRC on the task of robust spoken emotion recognition in noise.

Recently, the CS technique has been successfully applied to missing data imputation in noise-robust speech recognition [88]. Therefore, to further investigate the performance of the CS technique on robust spoken emotion recognition tasks, in our future work it would be interesting to study how to use the CS technique to reconstruct clean speech from noisy emotional speech so as to improve the robustness of the subsequently extracted acoustic features.