1 Introduction

Emotion recognition plays an important role in human-computer interaction and is attracting considerable attention because of its real-world applications [1]. Emotion recognition can be applied in human-robot interaction to detect the user’s emotions, or in call centers to identify the caller’s emotional state. In particular, in cases of emergency, emotion recognition can provide feedback to the operator so that he or she can respond in an appropriate way. Furthermore, the emotional state of the caller can be very informative about the level of customer satisfaction.

The current study focuses on emotion recognition based on the speech modality. A method is proposed which uses a deep convolutional neural network (DCNN) to extract informative features from each layer of the network; the extracted features are then flattened and used by extremely randomized trees [2] for emotion recognition. Extremely randomized trees are similar to random forests [3], but with randomized split thresholds. The motivation for using extremely randomized trees is their lower computational cost and their high level of performance when the number of features is small.
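For concreteness, the following minimal Python sketch shows how such a final classification stage could be realized with scikit-learn's ExtraTreesClassifier. The flattened deep features and labels (X_train, y_train, X_test) are hypothetical placeholders, and the hyperparameter values are assumptions rather than the settings used in this study.

```python
# Minimal sketch: extremely randomized trees on flattened DCNN features.
# X_train/X_test, labels, and the feature dimensionality are placeholders.
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(2950, 256)), rng.integers(0, 5, 2950)  # 5 emotions
X_test = rng.normal(size=(1495, 256))

clf = ExtraTreesClassifier(n_estimators=500, random_state=0, n_jobs=-1)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)   # predicted emotion labels
```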

A CNN [4, 5] is a special variant of conventional neural networks consisting of convolution and pooling layers. Many studies have reported results for speech emotion recognition [6], image classification [7], and sentence classification [8] based on CNNs. In particular, CNNs are very popular in image classification and most of the recent related studies are based on CNNs. In the current study, CNNs are used because of their simplicity compared to a conventional feed-forward DNN. Due to parameter sharing, computational and memory costs are lower.

In addition to the DCNN with extremely randomized trees, another method based on conventional CNNs is also experimentally investigated. In this case, instead of using extremely randomized trees for classification, a fully connected layer is added on top of the convolutional layers of the DCNN, and emotion recognition is performed using the features of the last layer. In both methods, the neural networks are fed with frame-level spectral features. Furthermore, for a more comprehensive investigation, DCNNs fed with i-vectors [9] are also applied to speech emotion recognition. In the i-vector paradigm, the spoken utterance is represented by a small number of factors, which capture the variability of speaker, channel, emotion, or language. Although i-vectors have been successfully used in speech emotion recognition [10,11,12], the integration of deep learning (DL) and i-vectors has not been investigated comprehensively so far, and only very few studies have addressed this issue [13]. As a result, DL and i-vectors for speech emotion recognition remain an open research area, and further investigations are necessary.

In a previous study [14], the authors demonstrated experimental results on far-field speech emotion recognition using a DCNN for feature extraction and extremely randomized trees for classification. In the current study, the DCNN architecture is simplified by excluding network pre-training and by using the features of all convolutional layers when selecting the learned features used in classification. The motivation for using the features from all layers is that lower-level features may also be very informative, resulting in higher classification rates when included. Furthermore, the proposed methods are also evaluated using the English IEMOCAP corpus [15] for the classification of four emotions.

Regarding the emotional data used, the proposed methods are evaluated using the state-of-the-art English IEMOCAP and German FAU Aibo [16] corpora for the classification of four and five emotions, respectively. For comparison purposes, a baseline speech emotion recognition experiment using the popular SVM classifier [17] with i-vectors was also conducted.

2 Related Work

Previously, several studies addressed the problem of emotion recognition using different modalities, classifiers, and feature extraction methods. Emotion recognition can be performed using the speech signal [18], visual/facial information [19], electroencephalography (EEG) signals [20], and also physiological signals such as blood volume pulse (BVP), electromyography (EMG), skin conductance (SC), skin temperature (SKT), and respiration (RESP) [21].

Speech emotion recognition using Gaussian mixture models (GMMs) was reported in [22, 23]. In [24], hidden Markov model (HMM)-based speech emotion recognition was presented. SVM is among the most popular classifiers used in speech emotion recognition [25, 26]. More recent studies are based on neural networks (NN) [27, 28]. Currently, speech emotion recognition using deep neural networks is being investigated [29, 30].

Mel-frequency cepstral coefficients (MFCC) [31] are among the most widely used features in speech emotion recognition. In addition to MFCC features, shifted delta cepstral (SDC) coefficients [32, 33] can also be applied. Originally, SDC coefficients were used in spoken language identification, showing superior performance compared with the sole use of MFCC features. In the current study, SDC coefficients are concatenated with MFCC features to form the basic feature vectors. In many recent studies, low-level descriptors (LLD) and functionals [34] are used as features. Motivated by the success of i-vectors in speaker recognition and spoken language identification [35], a small number of studies on speech emotion recognition using i-vectors and neural networks have also been presented. In the current study, the i-vectors used by the conventional CNN architecture are extracted from concatenated MFCC features and SDC coefficients.

3 Methods

3.1 Emotional Corpora

In the current study, the FAU Aibo and the IEMOCAP data are used. The FAU Aibo German corpus consists of 9 h of speech uttered by 51 children while interacting with Sony’s Aibo robot. The spontaneous Aibo speech was recorded using a close-talking microphone, and was annotated into 11 categories by five human annotators. However, in the current study, the 5-class task is considered, and data for the emotions angry, emphatic, joyful, neutral, and rest were used for the classification. For training, 590 utterances for each emotion were used, and for testing, 299 utterances for each emotion were used. The data were randomly selected from the entire data set.

The IEMOCAP database was collected at the SAIL lab of the University of Southern California. It contains 12 h of audiovisual data produced by 10 actors. The data were annotated into categorical labels as well as dimensional labels. In the current study, categorical labels were used to classify the emotional states of neutral, happy, angry, and sad. To avoid unbalanced data, 250 utterances for training and 70 utterances for testing randomly selected from each emotion were used.

3.2 Feature Extraction

Cepstral Features. MFCCs are the basic features used in the current study. The MFCC features are extracted every 10 ms using a window-length of 20 ms.

In addition to MFCC features, SDC coefficients are also used. The SDC feature vectors are obtained by concatenating delta cepstra across multiple frames, and they are described by four parameters: N, the number of cepstral coefficients; d, the time advance and delay; k, the number of blocks concatenated to form the feature vector; and P, the time shift between consecutive blocks. Each final SDC feature vector has kN parameters. In contrast, conventional cepstra plus delta-cepstra feature vectors have 2N parameters. The SDC is calculated as follows:

$$\begin{aligned} \Delta c(t+iP) = c(t+iP+d) - c(t+iP-d) \end{aligned}$$
(1)

The final vector at time t is given by the concatenation of \(\Delta c(t+iP)\) for all \(0 \le i< k\), where c(t) is the original feature value at time t. Figure 1 shows the computation procedure for the SDC coefficients. In modeling the emotions being classified, this study therefore used MFCC features concatenated with SDC coefficients to form feature vectors of length 112. In the case of using CNN and i-vectors, the concatenated MFCC/SDC features were used to extract the i-vectors used by the classifier. In the other two cases, the neural networks were fed with blocks of MFCC/SDC features (center frame ± 10 frames).

Fig. 1. Computation of shifted delta cepstral (SDC) coefficients.
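As an illustration of Eq. 1, the following Python sketch computes MFCC features with librosa and derives SDC coefficients from them. The parameter values (N = 14, d = 1, P = 3, k = 7) are assumptions chosen so that the concatenated MFCC/SDC vector has 14 + 7·14 = 112 dimensions, matching the feature length reported above; the exact SDC configuration used in this study is not restated here.

```python
# Sketch of MFCC + SDC extraction (Eq. 1). Parameter values are illustrative.
import numpy as np
import librosa

def sdc(cepstra, d=1, P=3, k=7):
    """cepstra: (T, N) MFCC matrix; returns (T, k*N) SDC features per Eq. 1."""
    T, N = cepstra.shape
    pad = d + (k - 1) * P
    padded = np.pad(cepstra, ((pad, pad), (0, 0)), mode="edge")
    out = np.zeros((T, k * N))
    for t in range(T):
        tc = t + pad                       # position of frame t in the padded matrix
        blocks = [padded[tc + i * P + d] - padded[tc + i * P - d] for i in range(k)]
        out[t] = np.concatenate(blocks)    # concatenation over 0 <= i < k
    return out

y, sr = librosa.load(librosa.ex("trumpet"))                  # any mono waveform
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=14,
                            hop_length=int(0.010 * sr),      # 10 ms frame shift
                            n_fft=int(0.020 * sr)).T         # 20 ms window -> (frames, 14)
features = np.concatenate([mfcc, sdc(mfcc)], axis=1)         # 14 + 98 = 112 dims per frame
```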

I-Vector Features. A widely used classification approach in speaker recognition is based on GMMs with universal background models (UBM). In this approach, each speaker model is created by adapting the UBM using maximum a posteriori (MAP) adaptation. A GMM supervector is constructed by concatenating the means of the adapted models. As in speaker recognition, GMM supervectors can also be used for emotion classification.

To overcome the limitations of the high dimensionality of GMM supervectors, the i-vector approach models the variability contained in the supervectors with a small set of factors. In this case, an input utterance can be modeled as:

$$\begin{aligned} \mathbf{M} = \mathbf{m} + \mathbf{T}\mathbf{w} \end{aligned}$$
(2)

where \(\mathbf{M}\) is the emotion-dependent supervector, \(\mathbf{m}\) is the emotion-independent supervector, \(\mathbf{T}\) is the total variability matrix, and \(\mathbf{w}\) is the i-vector. Both the total variability matrix and emotion-independent supervector are estimated from the complete set of training data.
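The toy NumPy sketch below illustrates only the role of Eq. 2: given a randomly generated total variability matrix \(\mathbf{T}\) and mean supervector \(\mathbf{m}\), a low-dimensional \(\mathbf{w}\) is recovered from a supervector by least squares. This is a deliberate simplification; actual i-vector extraction estimates \(\mathbf{w}\) as a posterior mean using Baum-Welch statistics collected against the UBM, and all dimensions below are arbitrary.

```python
# Toy illustration of Eq. 2 (M = m + T w): recover a low-dimensional w from a
# supervector via least squares. Real i-vector extractors use UBM posterior
# statistics; this only shows the dimensionality reduction idea.
import numpy as np

rng = np.random.default_rng(0)
D, R = 1024 * 13, 100            # supervector dim (1024 mixtures x 13 MFCCs), i-vector dim
m = rng.normal(size=D)           # emotion-independent (UBM) supervector
T = rng.normal(size=(D, R))      # total variability matrix (normally trained with EM)
w_true = rng.normal(size=R)
M = m + T @ w_true               # emotion-dependent supervector of one utterance

w_hat, *_ = np.linalg.lstsq(T, M - m, rcond=None)   # least-squares point estimate of w
print(np.allclose(w_hat, w_true))                   # True: w is recovered exactly here
```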

Fig. 2. The architecture of the deep feature extractor along with the classifier used during feature learning.

Proposed Feature Extraction and Selection Approach. In this paper, a DCNN is investigated for learning informative features from the speech signal, which are then used for emotion classification. The MFCC and SDC features are calculated using overlapping windows with a length of 20 ms. This generates a multidimensional time series that represents the data for each session. The proposed method is a simplified version of the method recently proposed in [36] for activity recognition using mobile sensors.

The proposed classifier consists of a DCNN followed by extremely randomized trees instead of the standard fully connected classifier. The motivation for using extremely randomized trees lies in previous observations showing their effectiveness in the case of a small number of features. The network architecture is shown in Fig. 2 and consists of a series of five blocks, each of which consists of two convolutional layers (64 filters of size \(5\times 5\)) followed by a max-pooling layer (\(2\times 2\)). The outputs from each block are then combined and flattened to represent the learned features.
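A minimal Keras sketch of this feature extractor is given below. The five blocks of two 64-filter \(5\times 5\) convolutions followed by \(2\times 2\) max-pooling follow the description above, while the \(32\times 112\times 1\) input shape is an assumption (the 21-frame MFCC/SDC context padded to 32 frames), since the exact input layout and padding are not specified here.

```python
# Sketch of the Fig. 2 feature extractor: five blocks of two 5x5 convolutions
# (64 filters each) plus 2x2 max-pooling, with every block's output flattened
# and concatenated into the learned-feature vector. The input shape is an
# assumption (21-frame MFCC/SDC context zero-padded to 32 frames).
from tensorflow.keras import layers, Model

inputs = layers.Input(shape=(32, 112, 1))
x, block_outputs = inputs, []
for _ in range(5):
    x = layers.Conv2D(64, (5, 5), padding="same", activation="relu")(x)
    x = layers.Conv2D(64, (5, 5), padding="same", activation="relu")(x)
    x = layers.MaxPooling2D((2, 2))(x)
    block_outputs.append(layers.Flatten()(x))            # bypass output of this block

learned_features = layers.Concatenate()(block_outputs)   # features passed to selection/trees
feature_extractor = Model(inputs, learned_features)
```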

The main idea behind the proposed approach is to use deep networks as feature learners only, not as classifiers, and then utilize feature selection to determine a small set of neurons that provide maximal information for an efficient emotion recognizer. This approach combines elements from standard deep learning methodology. However, it treats the problem of deriving an efficient feature extractor from an accurate, yet inefficient, deep neural network as a feature selection problem, instead of using standard neural network compression techniques like optimal brain damage [37].

Fig. 3. The proposed training process showing the three stages of training and the output of each stage.

The training process is shown in Fig. 3 and consists of two main stages: feature learning and feature selection. In the feature learning stage, labeled data are used to train a deep neural architecture as a feature extractor. The goal of this step is to produce a large set of features that are as informative about the emotion recognition problem as possible. To achieve this, no attempt is made at this stage to optimize the computational cost, thus resulting in a slow feature extractor. Feature selection is then used to keep a small fraction of the learned features in the fast feature extractor, upon which a classifier can be trained to solve the emotion recognition problem. During recognition, only the fast feature extractor and the final classifier are kept.

During training, a classifier consisting of a fully connected network (two layers with 16 and 16 ReLU neurons) followed by a sigmoid layer with one unit per target emotion takes the output of all layers in the architecture through bypass connections. These bypass connections allow the fully connected network to utilize low-level features extracted by the early convolution filters instead of having to rely only on the higher-level features learned by the topmost convolutional layers. The obvious problem with this design is that the number of inputs to the classifier increases dramatically. For this reason, we employ \(L_1\) regularization during the training process to generate a sparse representation in the classifier by setting most of the weights in its earlier layers to zero. This reduces the effective input size. Other forms of pruning can be used at this stage to reduce the computational cost of the classifier [38].
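A self-contained sketch of this training-time classifier head is shown below: two 16-unit ReLU layers with \(L_1\) regularization and a sigmoid output with one unit per emotion, taking the concatenated bypass features as input. The feature dimensionality, the regularization strength, and the loss are assumed values for illustration only.

```python
# Sketch of the training-time classifier head on the concatenated bypass
# features. feature_dim, the L1 strength, and the loss are assumptions.
from tensorflow.keras import layers, regularizers, Model

num_emotions, feature_dim = 5, 1024
bypass_features = layers.Input(shape=(feature_dim,))
h = layers.Dense(16, activation="relu",
                 kernel_regularizer=regularizers.l1(1e-4))(bypass_features)
h = layers.Dense(16, activation="relu",
                 kernel_regularizer=regularizers.l1(1e-4))(h)
outputs = layers.Dense(num_emotions, activation="sigmoid")(h)

classifier_head = Model(bypass_features, outputs)
classifier_head.compile(optimizer="adam", loss="binary_crossentropy")
```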

Furthermore, during training, neurons in early stages are subjected to multiple updates at every gradient-based weight update due to the use of bypass connections. For this reason, the learning rate used for a neuron in layer i (\(\eta _i\)) is calculated from the base learning rate (\(\eta \)) as:

$$\begin{aligned} \eta _i = \frac{\eta }{n-i}, \end{aligned}$$
(3)

where n is the total number of layers.
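For example, with an assumed base rate and layer count, the layer-wise rates of Eq. 3 can be computed as:

```python
# Layer-wise learning rates per Eq. 3: eta_i = eta / (n - i); values assumed.
base_lr, n_layers = 0.001, 10
layer_lrs = [base_lr / (n_layers - i) for i in range(n_layers)]
# Early layers (small i) receive smaller rates, compensating for the extra
# gradient contributions they get through the bypass connections.
```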

Although it is possible to simply use the described classifier for emotion recognition (i.e., a conventional DCNN), in the proposed approach the final shallow classifier is removed after this training is completed and replaced with an extremely randomized trees classifier trained on a selected subset of the neurons.

The feature extractor learned through the proposed method is impractical for real applications (e.g., applications on mobile devices) due to its large size resulting from the bypass connections from all neurons to the output. This problem can be alleviated, while improving the generalization capacity of the system, by using feature selection.

Selecting appropriate bypass connections from the slow feature extractor can be treated as a standard feature selection problem, which is solved in this paper using a multi-criteria wrapper method. Each feature (neuronal output i) is assigned a total quality \(Q\left( i\right) \) according to Eq. 4, where \(\bar{I_j}\left( i\right) \) is the z-score normalized feature importance \({I_j}\left( i\right) \) according to the j-th base feature selection method and \(w_j\) is its weight.

$$\begin{aligned} Q\left( i\right) = \sum _{j=0}^{n_f} w_j \bar{I_j}\left( i\right) , \end{aligned}$$
(4)

The raw importance measure is calculated as a weighted sum of multiple base feature selector importance measures after z-score normalization. In this work, we utilize two base selectors: randomized logistic regression [39] and extremely randomized trees. Randomized logistic regression (RLR) estimates feature importance by randomly selecting subsets of training samples and fitting them using an \(L_1\) sparsity-inducing penalty that is scaled for a random set of coefficients. The features that appear repeatedly in such selections (i.e., features with high coefficients) are assumed to be more important and are given higher scores. The second feature selector employs extremely randomized trees. When fitting decision trees, features that appear at lower depths (closer to the root) are generally more important. By fitting several such trees, feature importances can be estimated from the average depth of each feature in the trees.
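The following sketch illustrates Eq. 4 on synthetic data. Importances from extremely randomized trees are combined with the coefficient magnitudes of an \(L_1\)-penalized logistic regression, used here only as a simple stand-in for randomized logistic regression (which has been removed from recent scikit-learn releases); equal weights \(w_j\) are assumed.

```python
# Sketch of Eq. 4: weighted sum of z-score normalized importances from two
# base selectors. L1 logistic regression stands in for randomized logistic
# regression; equal weights w_j are assumed.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=100, n_informative=10,
                           random_state=0)

def zscore(v):
    return (v - v.mean()) / (v.std() + 1e-12)

trees = ExtraTreesClassifier(n_estimators=200, random_state=0).fit(X, y)
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)

importances = [trees.feature_importances_, np.abs(lasso.coef_).mean(axis=0)]
weights = [0.5, 0.5]
quality = sum(w * zscore(I) for w, I in zip(weights, importances))  # Q(i) of Eq. 4
ranking = np.argsort(quality)[::-1]          # neurons/features ranked by quality
```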

Feature selection uses n-fold cross-validation to select an appropriate number of neurons to keep in the final (i.e., fast) feature extractor. For each fold, the quality of each neuron is calculated using Eq. 4 on its training set, and then an extremely randomized trees classifier is fitted to the training set and evaluated on the validation set. The process is repeated recursively on the top half of the neurons until a single neuron is kept in the feature set. The number of features/neurons that maximizes the \(F_1\)-measure on the validation sets is finally kept.
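A hedged sketch of this selection loop is given below: the candidate set is recursively halved by quality, each subset is scored with an extremely randomized trees classifier under cross-validated macro-\(F_1\), and the best-scoring subset is kept. The fold count and tree settings are assumptions.

```python
# Sketch of the recursive halving selection driven by cross-validated macro-F1.
# Fold count and classifier settings are assumed values.
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

def select_neurons(X, y, quality, cv=5):
    """X: (samples, neurons) feature matrix; quality: Q(i) per Eq. 4."""
    ranked = np.argsort(quality)[::-1]            # best neurons first
    best_score, best_subset = -np.inf, ranked
    n = len(ranked)
    while n >= 1:
        subset = ranked[:n]
        clf = ExtraTreesClassifier(n_estimators=200, random_state=0)
        score = cross_val_score(clf, X[:, subset], y, cv=cv,
                                scoring="f1_macro").mean()
        if score > best_score:
            best_score, best_subset = score, subset
        n //= 2                                    # recurse on the top half
    return best_subset, best_score
```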

Table 1. Recalls for individual emotions when using MFCC features with/without SDC coefficients [%] (FAU Aibo).
Table 2. EERs for individual emotions when using MFCC features with/without SDC coefficients [%] (FAU Aibo).

4 Results

This section presents the results obtained using the FAU Aibo and IEMOCAP corpora. The proposed method based on DCNN and extremely randomized trees is compared with three other classifiers, namely: a DCNN with a fully-connected layer on top fed with MFCC/SDC features, a DCNN fed with i-vectors, and an SVM also fed with i-vectors. The improvements obtained when using SDC coefficients along with MFCC features, compared to the sole use of MFCC, are also described.

For evaluation, the equal error rate (EER) and the unweighted average recall (UAR) are used. The UAR is defined as the mean of the recalls of the individual classes.
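In scikit-learn terms, the UAR is simply the macro-averaged recall, e.g.:

```python
# UAR = unweighted average recall = macro-averaged recall over the classes.
from sklearn.metrics import recall_score

y_true = [0, 0, 1, 1, 2, 2]                            # toy labels
y_pred = [0, 1, 1, 1, 2, 0]
uar = recall_score(y_true, y_pred, average="macro")    # (0.5 + 1.0 + 0.5) / 3
```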

4.1 Emotion Recognition Using the German FAU Aibo Corpus

Table 1 shows the recalls and the UAR when using the DCNN fed with i-vectors extracted with and without SDC coefficients. The results show that when using MFCC only in the i-vector extraction, the UAR was as low as 39.5%. When MFCC features were concatenated with SDC coefficients, the UAR improved to 59.8%, a 20.3% absolute improvement. As shown, the emotion joyful shows the best performance, and the emotion rest shows the lowest recall. A possible reason might be the fact that the class rest consists of several emotions not belonging to the other classes. The results obtained when also using SDC are very promising and superior to the results obtained in similar studies [40]. The results also show the effectiveness of integrating i-vectors and CNN for speech emotion recognition using only 590 training i-vectors for each emotion.

Table 3. Recalls for individual emotions when using three different classifiers [%] (FAU Aibo).

Table 2 shows the EERs of the five emotions when using the FAU Aibo corpus. When using MFCC features only in the i-vector extraction, the average EER was 34.4%. When SDC coefficients were also concatenated, the EER improved to 21.8%, an absolute reduction of 12.6%. The lowest EER was obtained in the case of the emotion joyful, and the highest EER in the case of the emotion rest. Tables 1 and 2 show the effectiveness of using SDC coefficients in speech emotion recognition. Therefore, in the following experiments, MFCC features concatenated with SDC coefficients will be used.

Table 3 shows the recalls of the individual classes and the UAR obtained. As shown, using the DCNN for feature extraction and extremely randomized trees for classification, a 61.8% UAR was obtained. This is the highest UAR among the four classifiers. In the case of using the conventional DCNN with a fully connected layer on top, the UAR was 51.4%. Finally, when using SVM, a 48.8% UAR was achieved. The results show that in the two cases of using DCNN with extremely randomized trees and with a fully-connected layer, similar recalls were obtained across the five emotions. In the case of using SVM with i-vector features, the emotion joyful was classified with the highest recall, and the emotions neutral and rest showed the lowest recalls. This is similar to the case when the DCNN with i-vectors was used. Previous studies reported that when short utterances are used for speaker recognition, the extracted i-vectors become unreliable [41]. Also, in the case of using i-vectors, the optimal case is when long training and long test utterances are used. It may therefore be that, in the current study, training and test utterances of different lengths were randomly selected, resulting in higher recall variability. Note, however, that when using the DCNN with i-vectors, the second highest UAR was obtained, and i-vectors can still be considered a very effective feature extraction method in speech emotion recognition.

Table 4. Confusion matrix [%] of five emotions recognition when using DCNN with i-vectors (FAU Aibo).
Table 5. Confusion matrix [%] of five emotions recognition when using DCNN and extremely randomized trees (FAU Aibo).
Table 6. Confusion matrix [%] of five emotions recognition when using DCNN with a fully connected layer (FAU Aibo).
Table 7. Confusion matrix [%] of five emotions recognition when using SVM with i-vectors (FAU Aibo).

Tables 4, 5, 6, and 7 show the confusion matrices when using the four classifiers. As shown, a higher variability in misclassification was obtained when i-vectors were used.

4.2 Emotion Recognition Using the English IEMOCAP Corpus

Table 8 shows the recalls of the four emotions in the case of using the IEMOCAP corpus. In this case, a DCNN fed with i-vectors was used. For i-vector extraction, both MFCC features alone and MFCC features concatenated with SDC coefficients were used. As shown, when using MFCC features only, a UAR of 55.5% was obtained. When SDC coefficients were also concatenated, the UAR improved to 62.0%. The results also show that in most cases (three out of four) the SDC coefficients resulted in higher recalls. The highest recognition rates were obtained for the angry and sad emotions. In contrast, the lowest recall was achieved in the case of the emotion happy.

Table 8. Recalls for individual emotions when using MFCC features with/without SDC coefficients [%] (IEMOCAP).

Table 9 shows the EERs when using DCNN fed with i-vectors. In the case of using MFCC features only, the average EER was 26.5%. When SDC coefficients were also used, a 22.2% EER was obtained. The results show that when also using SDC coefficients, significant improvements were obtained. Therefore, in the following experiments, MFCC features concatenated with SDC coefficients will be considered.

Table 9. EER for individual emotions when using MFCC features with/without SDC coefficients [%] (IEMOCAP).
Table 10. Recalls for individual emotions when using three different classifiers [%] (IEMOCAP).

Table 10 shows the recalls and the UARs when using three different classifiers on the IEMOCAP corpus. As shown, when using the DCNN for feature extraction and extremely randomized trees for classification, a 63.9% UAR was obtained, which is the highest among the four classifiers. This result is very promising and superior to the results obtained in similar studies [42, 43]. The results also show the effectiveness of the proposed method when the DCNN is used for informative feature extraction. When using the DCNN with a fully connected layer on top, a 59.3% UAR was achieved. Finally, the UAR in the case of SVM with i-vectors was as low as 36.8%. Tables 11, 12, 13, and 14 show the confusion matrices in the case of the IEMOCAP corpus when using the four classifiers described previously. As shown, the emotion neutral is classified with the lowest recall in all cases.

Table 11. Confusion matrix [%] of four emotions recognition when using DCNN with i-vectors (IEMOCAP).
Table 12. Confusion matrix [%] of four emotions recognition when using DCNN with extremely randomized trees (IEMOCAP).
Table 13. Confusion matrix [%] of four emotions recognition when using DCNN with a fully-connected layer (IEMOCAP).
Table 14. Confusion matrix [%] of four emotions recognition when using SVM with i-vectors (IEMOCAP).

5 Discussion

A limitation of the current study is the small volume of training data used in the classification experiments. Specifically, in the case of the FAU Aibo corpus, 590 training utterances for each emotion were used, and in the case of the IEMOCAP corpus, 250 training utterances for each emotion were used. Considering that DL-based methods require a large amount of training data for accurate parameter estimation, further improvements may be possible by increasing the amount of data. The features used in the current study were based on MFCC and SDC coefficients, and also on i-vectors. Although several alternatives were considered (e.g., bottleneck features, LLD, etc.), well-known and very effective features were selected. In particular, the authors were interested in investigating the use of i-vectors with CNNs due to the very small number of studies that have addressed this issue.

6 Conclusions

The current study focused on speech emotion recognition based on deep learning. We proposed a method based on a DCNN, which extracts informative features used by extremely randomized trees for emotion recognition. When using the German FAU Aibo corpus for the recognition of five emotions, the proposed method achieved a 61.8% UAR. In the case of the IEMOCAP corpus, a 63.9% UAR was obtained. These results are very promising and show the effectiveness of the proposed method in speech emotion recognition. Additionally, several other classification and feature extraction methods were experimentally investigated; the proposed method, however, showed superior performance. Currently, speech emotion recognition in adverse environments is being investigated.