Data Augmentation Techniques for Speech Emotion Recognition and Deep Learning

Nicolás, José Antonio; de Lope, Javier; Graña, Manuel

doi:10.1007/978-3-031-06527-9_27

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13259)

Included in the following conference series: International Work-Conference on the Interplay Between Natural and Artificial Computation

Abstract

This paper introduces innovations in both data augmentation and deep neural network architecture for speech emotion recognition (SER). The proposed architecture combines a series of convolutional layers with a final layer of long short-term memory (LSTM) cells to determine the emotion expressed in an audio signal. The audio signals are first converted to mel spectrograms, which serve as the inputs to the network. The paper also proposes a selected set of data augmentation techniques that reduce network overfitting. We achieve an average recognition accuracy of 86.44% on publicly distributed databases, outperforming state-of-the-art methods.
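
The abstract does not name the augmentation techniques that were selected, so the following is only a minimal sketch of three generic waveform-level augmentations often used in speech work: additive noise at a target signal-to-noise ratio, circular time shifting, and speed perturbation by linear resampling. The function names (`add_noise`, `time_shift`, `change_speed`) and all parameter values are hypothetical and are not taken from the paper; the sketch uses only the Python standard library so that it runs without audio dependencies.

```python
import math
import random


def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise at approximately the requested SNR (in dB)."""
    power = sum(x * x for x in signal) / len(signal)
    noise_power = power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def time_shift(signal, shift):
    """Circularly shift the signal by `shift` samples (positive = delay)."""
    shift %= len(signal)
    return signal[-shift:] + signal[:-shift] if shift else list(signal)


def change_speed(signal, rate):
    """Resample by linear interpolation; rate > 1 shortens the signal.

    Note: this naive resampling changes pitch together with duration.
    """
    n_out = max(1, int(round(len(signal) / rate)))
    last = len(signal) - 1
    out = []
    for i in range(n_out):
        pos = min(i * rate, last)      # fractional read position
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append((1 - frac) * signal[lo] + frac * signal[hi])
    return out


# Hypothetical example: one second of a 440 Hz tone sampled at 16 kHz.
rng = random.Random(0)
sr = 16000
tone = [math.sin(2 * math.pi * 440 * t / sr) for t in range(sr)]
augmented = [
    add_noise(tone, 20, rng),    # noisy copy at about 20 dB SNR
    time_shift(tone, 1600),      # delayed by 100 ms (wraps around)
    change_speed(tone, 1.1),     # about 10% faster
]
```

Each augmented waveform would then be converted to a mel spectrogram in the same way as the original recordings, so the network sees several variants of every training utterance.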

Acknowledgments

This work has been partially supported by FEDER funds through MINECO project PID2020-116346GB-I00.

Author information

Authors and Affiliations

Computational Cognitive Robotics Group, Department of Artificial Intelligence, Universidad Politécnica de Madrid (UPM), Madrid, Spain
José Antonio Nicolás & Javier de Lope
Computational Intelligence Group, University of the Basque Country (UPV/EHU), Leioa, Spain
Manuel Graña

Corresponding author

Correspondence to Javier de Lope.

Editor information

Editors and Affiliations

Universidad Politécnica de Cartagena, Cartagena, Spain
José Manuel Ferrández Vicente
Universidad Nacional de Educación a Distancia, Madrid, Spain
José Ramón Álvarez-Sánchez
Universidad Nacional de Educación a Distancia, Madrid, Spain
Félix de la Paz López
Ohio State University, Columbus, OH, USA
Hojjat Adeli

About this paper

Cite this paper

Nicolás, J.A., de Lope, J., Graña, M. (2022). Data Augmentation Techniques for Speech Emotion Recognition and Deep Learning. In: Ferrández Vicente, J.M., Álvarez-Sánchez, J.R., de la Paz López, F., Adeli, H. (eds) Bio-inspired Systems and Applications: from Robotics to Ambient Intelligence. IWINAC 2022. Lecture Notes in Computer Science, vol 13259. Springer, Cham. https://doi.org/10.1007/978-3-031-06527-9_27

DOI: https://doi.org/10.1007/978-3-031-06527-9_27
Published: 24 May 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-06526-2
Online ISBN: 978-3-031-06527-9
eBook Packages: Computer Science (R0)
