Abstract
In this paper, we discuss the possibility of improving the accuracy of speech emotion recognition in multi-user systems. We assume that a small corpus of emotional speech data is available for each speaker of interest. We propose to train a speaker-independent emotion classifier on arbitrary audio features, including deep embeddings, and to fine-tune it on the utterances of each speaker. As a result, every user is associated with his or her own emotion recognition model. The appropriate fine-tuned classifier may be chosen by a speaker recognition algorithm, or simply fixed if the identity of the user is known. We show experimentally that the proposed approach significantly improves the quality of a conventional speaker-independent emotion classifier.
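To make the pipeline outlined above concrete, the following is a minimal sketch in PyTorch of the three stages: training a speaker-independent classifier on precomputed audio embeddings, fine-tuning a per-speaker copy on each user's small corpus, and selecting a personal model at inference time via speaker recognition. All names, the network architecture, and the hyper-parameters are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of speaker-aware training of a speech emotion classifier.
# The feature extractor, model sizes, and identify_speaker() routine are
# assumptions for illustration; the paper's actual setup may differ.
import copy
import torch
import torch.nn as nn

EMB_DIM, NUM_EMOTIONS = 512, 8  # assumed embedding size and emotion count

# 1) Speaker-independent base classifier, trained on data pooled over all speakers.
base_model = nn.Sequential(
    nn.Linear(EMB_DIM, 256), nn.ReLU(), nn.Linear(256, NUM_EMOTIONS)
)

def train(model, embeddings, labels, epochs=10, lr=1e-3):
    """Standard cross-entropy training loop over precomputed audio embeddings."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(embeddings), labels)
        loss.backward()
        opt.step()
    return model

# 2) Fine-tune a copy of the base model on every speaker's small emotional
#    corpus, so each user gets a personal emotion recognition model.
def fine_tune_per_speaker(base, corpora, lr=1e-4):
    personal = {}
    for speaker_id, (emb, lab) in corpora.items():
        model = copy.deepcopy(base)  # start from speaker-independent weights
        personal[speaker_id] = train(model, emb, lab, epochs=5, lr=lr)
    return personal

# 3) At inference time, route the utterance to the matching fine-tuned model
#    via speaker recognition (or a fixed model if the user's identity is known).
def predict_emotion(utterance_emb, personal_models, identify_speaker):
    speaker_id = identify_speaker(utterance_emb)  # hypothetical speaker-ID step
    logits = personal_models[speaker_id](utterance_emb)
    return logits.argmax(dim=-1)
```

A small learning rate for the fine-tuning stage (here 1e-4 versus 1e-3 for base training) is the usual choice when adapting to a small per-speaker corpus, since it limits drift from the speaker-independent weights.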
Acknowledgments
This work is supported by the Russian Science Foundation (RSF), grant no. 20-71-10010.
Cite this paper
Savchenko, L., Savchenko, A.V. (2021). Speaker-Aware Training of Speech Emotion Classifier with Speaker Recognition. In: Karpov, A., Potapova, R. (eds.) Speech and Computer. SPECOM 2021. Lecture Notes in Computer Science, vol. 12997. Springer, Cham. https://doi.org/10.1007/978-3-030-87802-3_55