Abstract
Speech Emotion Recognition (SER) determines human emotions from linguistic and nonlinguistic features of uttered speech. The nonlinguistic approach is better suited to applications where language is not a concern. In this paper, a Capsule Network (CapsuleNet) combined with Time Distributed 2D-convolution layers is used to classify emotions from speech signals. CapsuleNets are designed to capture the spatial cues of the data but fail to capture temporal cues in time-series data such as speech. To capture temporal cues along with spatial cues, Time Distributed 2D-convolution layers are introduced before the CapsuleNets. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) and the Interactive Emotional Dyadic Motion Capture (IEMOCAP) speech datasets are used to evaluate the proposed network architecture. Log-mel spectrograms of the speech samples are extracted and used for training and testing the proposed model. The combination of CapsuleNets with Time Distributed 2D-convolution layers achieves a classification accuracy of 92.6% on the RAVDESS dataset and 93.2% on the IEMOCAP dataset. These results are compared with a plain CapsuleNets model, and a remarkable improvement is observed. The proposed system also outperforms existing models on the mentioned benchmark datasets. The confusion matrices show consistent improvement in the accuracy of every emotion, including sad and disgust in RAVDESS and angry in IEMOCAP, which are poorly classified by classifiers such as CNN, RNN, and LSTM variants.
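As an illustration of the front end described above, the following is a minimal sketch (not the authors' exact architecture) of how Time Distributed 2D-convolution layers can be placed before a capsule-style layer in Keras. The input shape, filter counts, capsule dimension, and the pooling stand-in used in place of dynamic routing are all illustrative assumptions.

```python
# Minimal sketch, assuming Keras/TensorFlow; shapes and layer sizes are illustrative.
import tensorflow as tf
from tensorflow.keras import layers, models

def squash(s, axis=-1, eps=1e-7):
    """Capsule squashing non-linearity from Sabour et al. (2017)."""
    sq_norm = tf.reduce_sum(tf.square(s), axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / tf.sqrt(sq_norm + eps)

def build_model(n_frames=5, mel_bins=128, frame_width=32, n_classes=4):
    # Each sample is a sequence of log-mel spectrogram frames:
    # (n_frames, mel_bins, frame_width, 1)
    inp = layers.Input(shape=(n_frames, mel_bins, frame_width, 1))

    # Time Distributed 2D convolutions: the same Conv2D is applied to every
    # spectrogram frame, so the temporal order of frames is preserved.
    x = layers.TimeDistributed(layers.Conv2D(32, 3, activation='relu', padding='same'))(inp)
    x = layers.TimeDistributed(layers.MaxPooling2D(2))(x)
    x = layers.TimeDistributed(layers.Conv2D(64, 3, activation='relu', padding='same'))(x)
    x = layers.TimeDistributed(layers.MaxPooling2D(2))(x)

    # Primary capsules: reshape the per-frame feature maps into 8-D capsule
    # vectors and apply the squash non-linearity.
    x = layers.Reshape((-1, 8))(x)
    primary_caps = layers.Lambda(squash)(x)

    # Stand-in for the class capsules: dynamic routing is omitted here for
    # brevity; a full CapsNet would route primary_caps to n_classes capsules.
    x = layers.GlobalAveragePooling1D()(primary_caps)
    out = layers.Dense(n_classes, activation='softmax')(x)
    return models.Model(inp, out)

model = build_model()
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()
```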
Change history
07 December 2022
A Correction to this paper has been published: https://doi.org/10.1007/s11042-022-14281-1
References
Akçay MB, Oğuz K (2020) Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers. Speech Communication 116:56–76
Anagnostopoulos CN, Iliou T, Giannoukos I (2015) Features and classifiers for emotion recognition from speech: a survey from 2000 to 2011. Artif Intell Rev 43(2):155–177
Atmaja BT, Akagi M (2019) Speech Emotion Recognition Based on Speech Segment Using LSTM with Attention Model. In: Proceedings - 2019 IEEE International Conference on Signals and Systems, ICSigSys 2019, pp 40–44
Busso C et al (2008) IEMOCAP: interactive emotional dyadic motion capture database. Lang Resour Eval 42:335–359
Chen M, He X, Yang J, Zhang H (2018) 3-D convolutional recurrent neural networks with attention model for speech emotion Recognition. IEEE Signal Processing Letters 25(10):1440–1444
Cummins N, Amiriparian S, Hagerer G, Batliner A, Steidl S, Schuller BW (2017) An image-based deep spectrum feature representation for the recognition of emotional speech. In: MM 2017 - Proceedings of the 2017 ACM Multimedia Conference, pp 478–484
Issa D, Demirci MF, Yazici A (2020) Speech emotion recognition with deep convolutional neural networks. Biomedical Signal Processing and Control 59:101894
Dzedzickis A, Kaklauskas A, Bucinskas V (2020) Human emotion recognition: Review of sensors and methods. Sensors (Basel, Switzerland) 20(3) [Online]. Available: https://europepmc.org/articles/PMC7037130
Fayek HM, Lech M, Cavedon L (2017) Evaluating deep learning architectures for speech emotion recognition. Neural Networks 92:60–68. https://doi.org/10.1016/j.neunet.2017.02.013
Hinton GE, Krizhevsky A, Wang SD (2011) Transforming auto-encoders. In: Honkela T, Duch W, Girolami M, Kaski S (eds) Artificial Neural Networks and Machine Learning – ICANN 2011. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 44–51
Hinton G, Sabour S, Frosst N (2018) Matrix capsules with EM routing. In: 6th International Conference on Learning Representations, ICLR 2018 - Conference Track Proceedings, pp 1–15
Huang CW, Narayanan SS (2016) Attention assisted discovery of sub-utterance structure in speech emotion recognition. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 08–12-Sept, pp 1387–1391
Jain R (2019) Improving Performance and Inference on Audio Classification Tasks Using Capsule Networks. arXiv
Jing S, Mao X, Chen L (2018) Prominence features: Effective emotional features for speech emotion recognition. Digital Signal Processing: A Review Journal 72:216–231. https://doi.org/10.1016/j.dsp.2017.10.016
Kuchibhotla S, Vankayalapati HD, Vaddi RS, Anne KR (2014) A comparative analysis of classifiers in emotion recognition through acoustic features. International Journal of Speech Technology 17(4):401–408
Kuchibhotla S, Vankayalapati HD, Anne KR (2016) An optimal two stage feature selection for speech emotion recognition using acoustic features. International Journal of Speech Technology 19(4):657–667
Kwabena Patrick M, Felix Adekoya A, Abra Mighty A, Edward BY (2019) Capsule networks – a survey. Journal of King Saud University - Computer and Information Sciences [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1319157819309322
Lalitha S, Tripathi S, Gupta D (2019) Enhanced speech emotion detection using deep neural networks. International Journal of Speech Technology 22(3):497–510. https://doi.org/10.1007/s10772-018-09572-8
Lim W, Jang D, Lee T (2016) Speech emotion recognition using convolutional and recurrent neural networks. In: 2016 Asia-Pacific signal and information processing association annual summit and conference (APSIPA), Jeju, Korea (South), pp 1–4. https://doi.org/10.1109/APSIPA.2016.7820699
Liu ZT, Wu M, Cao WH, Mao JW, Xu JP, Tan GZ (2018) Speech emotion recognition based on feature selection and extreme learning machine decision tree. Neurocomputing 273:271–280. https://doi.org/10.1016/j.neucom.2017.07.050
Livingstone SR, Russo FA (2018) The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): a dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 13(5):e0196391
Madhu G, Govardhan A, Srinivas BS, Sahoo KS, Jhanjhi NZ, Vardhan KS, Rohit B (2021) Imperative dynamic routing between capsules network for malaria classification. CMC-Computers Materials & Continua 68(1):903–919
Meng H, Yan T, Yuan F, Wei H (2019) Speech emotion recognition from 3D log-mel spectrograms with deep learning network. IEEE Access 7:125868–125881
Mustaqeem, Kwon S (2020) CLSTM: Deep feature-based speech emotion recognition using the hierarchical ConvLSTM network. Mathematics 8(12):2133. https://doi.org/10.3390/math8122133
Mustaqeem, Kwon S (2020) Clustering-based speech emotion recognition by incorporating learned features and deep BiLSTM. IEEE Access 8:79861–79875. https://doi.org/10.1109/ACCESS.2020.2990405
Palaz D et al (2015) Analysis of CNN-based speech recognition system using raw speech as input. INTERSPEECH
Peer D, Stabinger S, Rodríguez-Sánchez A (2021) Limitation of capsule networks. Pattern Recognition Letters 144:68–74. https://doi.org/10.1016/j.patrec.2021.01.017
Qiao H, Wang T, Wang P, Qiao S, Zhang L (2018) A time-distributed spatiotemporal feature learning method for machine health monitoring with multi-sensor time series. Sensors 18:2932. https://doi.org/10.3390/s18092932
Russell JA, Mehrabian A (1977) Evidence for a three-factor theory of emotions. Journal of Research in Personality 11(3):273–294
Sabour S, Frosst N, Hinton GE (2017) Dynamic routing between capsules. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, ser. NIPS’17. Curran Associates Inc, Red Hook, NY, USA, pp 3859–3869
Satapathy SC, Cruz M, Namburu A, Chakkaravarthy S, Pittendreigh M (2020) Skin Cancer classification using convolutional capsule network (CapsNet). Journal of Scientific and Industrial Research (JSIR) 79(11):994–1001
Satt A, Rozenberg S, Hoory R (2017) Efficient emotion recognition from speech using deep learning on spectrograms. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol 2017-August, pp 1089–1093
Wu X, Liu S, Cao Y, Li X, Yu J, Dai D, Ma X, Hu S, Wu Z, Liu X, Meng H (2019) Speech emotion recognition using capsule networks. In: ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 6695–6699
Xie Y, Liang R, Liang Z, Huang C, Zou C, Schuller B (2019) Speech emotion classification using attention-based LSTM. IEEE/ACM Transactions on Audio Speech and Language Processing 27(11):1675–1685
Zhao Z, Bao Z, Zhao Y, Zhang Z, Cummins N, Ren Z, Schuller B (2019) Exploring deep spectrum representations via attention-based recurrent and convolutional neural networks for speech emotion recognition. IEEE Access 7:97515–97525
Zhao J, Mao X, Chen L (2019) Speech emotion recognition using deep 1D & 2D CNN LSTM networks. Biomedical Signal Processing and Control 47:312–323. https://doi.org/10.1016/j.bspc.2018.08.035
Acknowledgments
The first author acknowledges the suggestions and cooperation of Anupama Namburu, Vellore Institute of Technology, Amaravathi, Andhra Pradesh.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix 1
The model is verified on the Berlin Database of Emotional Speech (EMO-DB) [8, 29]. EMO-DB is chosen for the following reasons:
1. IEMOCAP and RAVDESS are both English databases, and the model is trained on these two datasets separately. EMO-DB is in a different language (German), and testing is done on these speech signals.
2. EMO-DB is a simulated (acted) dataset like RAVDESS, whereas IEMOCAP is a semi-natural dataset [1, 29, 35]. Testing of EMO-DB is therefore performed on the model trained on IEMOCAP.
3. Many significant developments in the field have been tested on the EMO-DB dataset [3].
4. A strength of EMO-DB is that it offers a good representation of gender and emotional classes [3].
EMO-DB is a widely used dataset for SER. It contains ten German sentences (five short and five long) acted by five female and five male speakers, with every speaker expressing all ten sentences in each of seven emotions: happy, neutral, anger, sadness, disgust, fear, and boredom. The details are shown in Table 14. Only four emotions (anger, happy, sad, and neutral) are considered from the EMO-DB dataset so as to map onto the model trained on IEMOCAP, which contains only these emotion classes. The IEMOCAP-trained model is selected over the RAVDESS-trained model because its training accuracy of 98% is higher than the 94% training accuracy obtained on RAVDESS.
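For reference, a small helper along the following lines can keep only the four EMO-DB emotions that map onto the IEMOCAP-trained model. This is an illustrative sketch, not code from the paper; it assumes the standard EMO-DB file-naming convention in which the sixth character of each file name encodes the emotion (W = anger, F = happiness, T = sadness, N = neutral; the remaining codes are dropped).

```python
# Illustrative helper (an assumption, not from the paper) for selecting the
# four EMO-DB emotions used to test the IEMOCAP-trained model.
import os

EMOTION_CODES = {'W': 'angry', 'F': 'happy', 'T': 'sad', 'N': 'neutral'}

def select_emodb_files(emodb_dir):
    """Return (path, label) pairs for the four emotions used in testing."""
    selected = []
    for name in sorted(os.listdir(emodb_dir)):
        if len(name) < 7 or not name.endswith('.wav'):
            continue
        code = name[5]  # 6th character of the file name carries the emotion code
        if code in EMOTION_CODES:
            selected.append((os.path.join(emodb_dir, name), EMOTION_CODES[code]))
    return selected
```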
The EMO-DB voice samples are pre-processed: log-mel spectrograms are extracted with an overlapping hop and split into five frames. No data augmentation is performed, as there are sufficient samples for testing. The time-distributed spectrogram frames are passed through the Time Distributed 2D-convolution layers and then to the CapsuleNets for classification, using the model trained on the IEMOCAP dataset. The training and testing accuracies for IEMOCAP and EMO-DB are given in Table 15, and the confusion matrix of the testing is provided in Table 16.
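A possible pre-processing sketch is given below, assuming librosa and illustrative parameter values (sampling rate, number of mel bands, FFT size, hop length, and frame count); the paper's exact settings may differ.

```python
# Minimal pre-processing sketch: log-mel spectrogram with an overlapping hop,
# split along time into a fixed number of frames. Parameter values are assumptions.
import numpy as np
import librosa

def log_mel_frames(wav_path, sr=16000, n_mels=128, n_fft=1024,
                   hop_length=256, n_frames=5):
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)            # shape: (n_mels, time)

    # Split the spectrogram along time into n_frames equal, time-distributed chunks.
    width = log_mel.shape[1] // n_frames
    frames = [log_mel[:, i * width:(i + 1) * width] for i in range(n_frames)]
    return np.stack(frames)[..., np.newaxis]      # shape: (n_frames, n_mels, width, 1)
```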
The confusion matrix in Table 16 indicates that the proposed model performs noticeably better on anger and happy than the CapsuleNet without Time Distributed 2D-convolution layers, as was also observed for IEMOCAP. The CapsuleNet without Time Distributed 2D-convolution layers records very poor performance, below 50%, on IEMOCAP.
Validation with real data is not performed, as collecting real data requires a recording setup with appropriate hardware or a noise-free environment, which imposes many practical constraints. M. Lech et al. also explain that "Real-time processing of speech needs a continually streaming input signal, rapid processing, and steady output of data within a constrained time, which differs by milliseconds from the time when the analysed data samples were generated." [3]
References:
[1] F. Burkhardt, A. Paeschke, M. Rolfes, W. Sendlmeier, and B. Weiss, "A Database of German Emotional Speech." [Online]. Available: http://www.expressive-speech.net/emodb/
[2] B. J. Abbaschian, D. Sierra-Sosa, and A. Elmaghraby, "Deep learning techniques for speech emotion recognition, from databases to models," Sensors (Switzerland), vol. 21, no. 4, pp. 1–27, 2021, doi: 10.3390/s21041249.
[3] S. R. Livingstone and F. A. Russo, "The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English," 2018, doi: 10.5281/zenodo.1188976.
[4] C. Busso et al., "IEMOCAP: Interactive emotional dyadic motion capture database," 2007.
[5] M. Lech, M. Stolar, C. Best, and R. Bolia, "Real-Time Speech Emotion Recognition Using a Pre-trained Image Classification Network: Effects of Bandwidth Reduction and Companding," Front. Comput. Sci., vol. 2, pp. 1–14, 2020, doi: 10.3389/fcomp.2020.00014.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yalamanchili, B., Anne, K.R. & Samayamantula, S.K. Speech Emotion Recognition using Time Distributed 2D-Convolution layers for CAPSULENETS. Multimed Tools Appl 81, 16945–16966 (2022). https://doi.org/10.1007/s11042-022-12112-x
DOI: https://doi.org/10.1007/s11042-022-12112-x