Abstract
This paper explores transfer learning techniques for developing robust speech emotion recognition (SER) models capable of handling noise in real-world environments. Two SER frameworks are proposed: Framework-1 is a two-stage framework in which pretrained networks are first retrained on clean data and then fine-tuned on noisy data, while Framework-2 directly retrains pretrained networks on multi-conditioned noisy data. To create the multi-conditioned data, we use both natural noise recordings and trance music under a single augmentation framework. Three pretrained models (AlexNet, GoogleNet, VGG19) are evaluated on two datasets (IEMOCAP and IITKGP-SEHSC) using bottleneck features and, in the test phase only, quantized bottleneck features for noise mitigation. The experiments involve retraining the last one or two layers of each network or training an SVM classifier on the bottleneck features. The results reveal that GoogleNet and VGG19 outperform AlexNet, and that fine-tuning the final two layers of these models achieves the highest accuracy. Quantized bottleneck features improve performance further, and Framework-2 outperforms Framework-1 in most cases. Although direct comparisons with existing work are difficult because experimental settings vary widely across studies, the findings demonstrate competitive performance. The main novelty of this work lies in the variety of SNR conditions explored and in the use of trance music for creating multi-conditioned noisy data.
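The core of the multi-conditioning step described above is mixing each clean utterance with a noise recording at a controlled signal-to-noise ratio. Below is a minimal sketch of such SNR-controlled mixing; it is not the authors' implementation, and the function name `mix_at_snr`, the epsilon guard, and the SNR grid are illustrative assumptions.

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add noise to clean speech at a target SNR in dB (illustrative sketch)."""
    # Tile or trim the noise clip so it spans the whole utterance.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[: len(clean)]

    # Choose a gain so that 10 * log10(P_clean / P_scaled_noise) == snr_db.
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12  # guard against silent noise clips
    gain = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + gain * noise

# Multi-conditioned training data: each utterance appears under several noise
# types and SNRs. This grid is hypothetical, not the paper's exact settings.
snr_grid_db = [0, 5, 10, 15, 20]
```

Repeating this mixing over both noise sources (natural recordings and trance music) across an SNR grid yields the kind of multi-conditioned set that Framework-2 retrains on directly and that Framework-1 uses in its second fine-tuning stage.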




Data availability
Two speech datasets have been used in this work. One of these is the English IEMOCAP dataset, which is freely available at https://sail.usc.edu/iemocap/. The other dataset is the Hindi IITKGP-SEHSC dataset, which may be requested by following the instructions available at https://cse.iitkgp.ac.in/~ksrao/pdf/declaration.pdf. For natural noise recordings, we have used the ESC-50 dataset, available at https://github.com/karolpiczak/ESC-50.
About this article
Cite this article
Haque, A., & Rao, K. S. (2024). Speech emotion recognition with transfer learning and multi-condition training for noisy environments. International Journal of Speech Technology, 27, 353–365. https://doi.org/10.1007/s10772-024-10109-5