Abstract
In this paper, we evaluate feature extraction models for predicting speech quality. We also propose a model architecture that compares embeddings from supervised and self-supervised learning models with embeddings from speaker verification models for predicting the Mean Opinion Score (MOS). Our experiments were performed on the VCC2018 dataset and on BRSpeechMOS, a Brazilian Portuguese dataset created for this work. The results show that the Whisper model is suitable in all scenarios, with both the VCC2018 and BRSpeechMOS datasets. On BRSpeechMOS, Whisper-Small achieved the best linear correlation among the supervised and self-supervised learning models, 0.6980, while the speaker verification model SpeakerNet reached 0.6963. On VCC2018, the best supervised and self-supervised learning model, Whisper-Large, achieved a linear correlation of 0.7274, and the best speaker verification model, TitaNet, achieved 0.6933. Although the speaker verification models score slightly lower, SpeakerNet has only 5M parameters, making it suitable for real-time applications, and TitaNet produces an embedding of size 192, the smallest among all the evaluated models. The experimental results are reproducible with publicly available source code.
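The evaluation described above reduces to fitting a small regression head on top of fixed-size utterance embeddings and scoring predictions with the linear (Pearson) correlation. The sketch below is a minimal illustration of that setup, not the authors' released code: the head architecture, the 192-dimensional embedding (TitaNet's output size), and the random stand-in data are assumptions for demonstration only.

```python
# Illustrative sketch (not the paper's released code): a small regression
# head mapping a fixed-size speech embedding to a MOS value, scored with
# the linear (Pearson) correlation used in the paper. The embedding size
# of 192 (TitaNet's) and the head layout are assumed for demonstration.
import torch
import torch.nn as nn


class MOSHead(nn.Module):
    """Regression head mapping an utterance embedding to a MOS estimate."""

    def __init__(self, embedding_dim: int = 192):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embedding_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        # (batch, embedding_dim) -> (batch,) predicted MOS
        return self.net(emb).squeeze(-1)


def pearson_corr(pred: torch.Tensor, target: torch.Tensor) -> float:
    """Linear (Pearson) correlation between predicted and true MOS."""
    pred = pred - pred.mean()
    target = target - target.mean()
    return float((pred * target).sum() / (pred.norm() * target.norm() + 1e-8))


if __name__ == "__main__":
    # Stand-in data: 8 utterances with 192-dim embeddings and MOS in [1, 5].
    # In the actual pipeline these embeddings would come from a frozen
    # feature extractor (e.g. Whisper, SpeakerNet, or TitaNet).
    emb = torch.randn(8, 192)
    mos = torch.empty(8).uniform_(1.0, 5.0)
    head = MOSHead(embedding_dim=192)
    pred = head(emb)
    print("Pearson correlation:", pearson_corr(pred, mos))
```

Under this setup, comparing feature extractors amounts to swapping the embedding source while keeping the head and the correlation metric fixed.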
Acknowledgements
The authors are grateful to the Center of Excellence in Artificial Intelligence (CEIA, https://ceia.ufg.br/) at the Federal University of Goiás (UFG) for their support, and to CyberLabs (https://cyberlabs.ai) and Coqui (https://coqui.ai/) for their valuable assistance.
Cite this paper
S. Oliveira, F., Casanova, E., Junior, A.C., R. S. Gris, L., S. Soares, A., R. Galvão Filho, A. (2023). Evaluation of Speech Representations for MOS Prediction. In: Ekštein, K., Pártl, F., Konopík, M. (eds.) Text, Speech, and Dialogue. TSD 2023. Lecture Notes in Computer Science, vol. 14102. Springer, Cham. https://doi.org/10.1007/978-3-031-40498-6_24