Abstract
This paper introduces innovations in both data augmentation and deep neural network architecture for speech emotion recognition (SER). The proposed architecture combines a series of convolutional layers with a final layer of long short-term memory (LSTM) cells to determine emotions in audio signals. The audio signals are preprocessed to generate mel spectrograms, which serve as inputs to the network. We also propose a selected set of data augmentation techniques that reduce network overfitting. We achieve an average recognition accuracy of 86.44% on publicly available databases, outperforming state-of-the-art methods.
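As a rough illustration of the mel spectrogram front end mentioned above, the following sketch shows how frequency bands can be spaced evenly on the mel scale. It is not the authors' preprocessing code: the HTK-style conversion formula, the function names, and the settings (`n_mels=8`, `f_min=0.0`, `f_max=8000.0`) are all assumptions chosen for illustration.

```python
import math
from typing import List


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to mels (HTK-style formula, assumed here)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel: convert mels back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(n_mels: int, f_min: float, f_max: float) -> List[float]:
    """Center frequencies (Hz) of n_mels bands equally spaced on the mel scale."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    # n_mels centers need n_mels + 1 equal steps between the two edges.
    step = (m_max - m_min) / (n_mels + 1)
    return [mel_to_hz(m_min + step * (i + 1)) for i in range(n_mels)]


# Hypothetical settings; the paper's actual values are not given in the abstract.
centers = mel_band_centers(n_mels=8, f_min=0.0, f_max=8000.0)
for index, center in enumerate(centers):
    print(f"band {index}: {center:.1f} Hz")
```

The bands come out narrow at low frequencies and wide at high frequencies, which is the property that makes mel spectrograms a compact, perceptually motivated input for a convolutional network.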
Acknowledgments
This work has been partially supported by FEDER funds through MINECO project PID2020-116346GB-I00.
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Nicolás, J.A., de Lope, J., Graña, M. (2022). Data Augmentation Techniques for Speech Emotion Recognition and Deep Learning. In: Ferrández Vicente, J.M., Álvarez-Sánchez, J.R., de la Paz López, F., Adeli, H. (eds) Bio-inspired Systems and Applications: from Robotics to Ambient Intelligence. IWINAC 2022. Lecture Notes in Computer Science, vol 13259. Springer, Cham. https://doi.org/10.1007/978-3-031-06527-9_27
DOI: https://doi.org/10.1007/978-3-031-06527-9_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-06526-2
Online ISBN: 978-3-031-06527-9
eBook Packages: Computer Science, Computer Science (R0)