Abstract
In this paper, we evaluate feature extraction models for predicting speech quality. We also propose a model architecture that compares embeddings from supervised and self-supervised learning models with embeddings from speaker verification models for predicting the Mean Opinion Score (MOS). Our experiments were performed on the VCC2018 dataset and on BRSpeechMOS, a Brazilian Portuguese dataset created for this work. The results show that the Whisper model is suitable in all scenarios, with both the VCC2018 and BRSpeechMOS datasets. On BRSpeechMOS, Whisper-Small achieved the best linear correlation among the supervised and self-supervised learning models, 0.6980, while the speaker verification model SpeakerNet reached 0.6963. On VCC2018, the best supervised and self-supervised learning model, Whisper-Large, achieved a linear correlation of 0.7274, and the best speaker verification model, TitaNet, achieved 0.6933. Although the speaker verification models score slightly lower, SpeakerNet has only 5M parameters, making it suitable for real-time applications, and TitaNet produces an embedding of size 192, the smallest among all the evaluated models. The experimental results are reproducible with publicly available source code.
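The evaluation described above reduces to fitting a small regression head on top of fixed-size utterance embeddings and scoring predictions with the linear (Pearson) correlation. The sketch below is a minimal illustration of that setup, not the authors' released code: the head architecture, the 192-dimensional embedding (TitaNet's output size), and the random stand-in data are assumptions for demonstration only.

```python
# Illustrative sketch (not the paper's released code): a small regression
# head mapping a fixed-size speech embedding to a MOS value, scored with
# the linear (Pearson) correlation used in the paper. The embedding size
# of 192 (TitaNet's) and the head layout are assumed for demonstration.
import torch
import torch.nn as nn


class MOSHead(nn.Module):
    """Regression head mapping an utterance embedding to a MOS estimate."""

    def __init__(self, embedding_dim: int = 192):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embedding_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        # (batch, embedding_dim) -> (batch,) predicted MOS
        return self.net(emb).squeeze(-1)


def pearson_corr(pred: torch.Tensor, target: torch.Tensor) -> float:
    """Linear (Pearson) correlation between predicted and true MOS."""
    pred = pred - pred.mean()
    target = target - target.mean()
    return float((pred * target).sum() / (pred.norm() * target.norm() + 1e-8))


if __name__ == "__main__":
    # Stand-in data: 8 utterances with 192-dim embeddings and MOS in [1, 5].
    # In the actual pipeline these embeddings would come from a frozen
    # feature extractor (e.g. Whisper, SpeakerNet, or TitaNet).
    emb = torch.randn(8, 192)
    mos = torch.empty(8).uniform_(1.0, 5.0)
    head = MOSHead(embedding_dim=192)
    pred = head(emb)
    print("Pearson correlation:", pearson_corr(pred, mos))
```

Under this setup, comparing feature extractors amounts to swapping the embedding source while keeping the head and the correlation metric fixed.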
Acknowledgements
The authors are grateful to the Center of Excellence in Artificial Intelligence (CEIA, https://ceia.ufg.br/) at the Federal University of Goiás (UFG) for their support, and to CyberLabs (https://cyberlabs.ai) and Coqui (https://coqui.ai/) for their valuable assistance.
Cite this paper
S. Oliveira, F., Casanova, E., Junior, A.C., R. S. Gris, L., S. Soares, A., R. Galvão Filho, A. (2023). Evaluation of Speech Representations for MOS Prediction. In: Ekštein, K., Pártl, F., Konopík, M. (eds.) Text, Speech, and Dialogue. TSD 2023. Lecture Notes in Computer Science, vol. 14102. Springer, Cham. https://doi.org/10.1007/978-3-031-40498-6_24