
Evaluation of Speech Representations for MOS Prediction

  • Conference paper
Text, Speech, and Dialogue (TSD 2023)

Abstract

In this paper, we evaluate feature extraction models for predicting speech quality. We also propose a model architecture for comparing the embeddings of supervised and self-supervised learning models with those of speaker verification models to predict the MOS metric. Our experiments were performed on the VCC2018 dataset and on BRSpeechMOS, a Brazilian Portuguese dataset created for this work. The results show that the Whisper model is suitable in all scenarios, with both the VCC2018 and BRSpeechMOS datasets. Among the supervised and self-supervised learning models on BRSpeechMOS, Whisper-Small achieved the best linear correlation, 0.6980, while the speaker verification model SpeakerNet achieved a linear correlation of 0.6963. On VCC2018, the best supervised and self-supervised learning model, Whisper-Large, achieved a linear correlation of 0.7274, and the best speaker verification model, TitaNet, achieved a linear correlation of 0.6933. Although the results of the speaker verification models are slightly lower, the SpeakerNet model has only 5M parameters, making it suitable for real-time applications, and the TitaNet model produces an embedding of size 192, the smallest among all evaluated models. The experimental results are reproducible with publicly available source code.
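The evaluation metric reported throughout the abstract is the linear (Pearson) correlation between predicted and ground-truth MOS. As a minimal illustration of this setup, the sketch below fits a simple ridge-regression head on fixed-size utterance embeddings and scores it with Pearson correlation. Everything here is an assumption for illustration: the `LinearMOSHead` class is not the paper's architecture, and the random 192-dimensional vectors (the TitaNet embedding size mentioned above) stand in for the output of a real embedding extractor.

```python
import numpy as np

def pearson_corr(pred, target):
    """Pearson linear correlation, the metric reported in the abstract."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    pc = pred - pred.mean()
    tc = target - target.mean()
    return float((pc @ tc) / np.sqrt((pc @ pc) * (tc @ tc)))

class LinearMOSHead:
    """Hypothetical ridge-regression head mapping a fixed-size utterance
    embedding (e.g. a 192-dim speaker-verification vector) to a scalar MOS."""
    def __init__(self, dim, l2=1e-2):
        self.dim, self.l2, self.w = dim, l2, None

    def fit(self, X, y):
        X1 = np.hstack([X, np.ones((len(X), 1))])   # append a bias column
        A = X1.T @ X1 + self.l2 * np.eye(self.dim + 1)
        self.w = np.linalg.solve(A, X1.T @ y)       # closed-form ridge solution
        return self

    def predict(self, X):
        X1 = np.hstack([X, np.ones((len(X), 1))])
        return np.clip(X1 @ self.w, 1.0, 5.0)      # MOS is rated on a 1-5 scale

# Synthetic data standing in for (embedding, MOS) pairs from a real dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 192))                    # 400 utterances, 192-dim embeddings
true_w = rng.normal(size=192) / np.sqrt(192)
y = np.clip(3.0 + X @ true_w + 0.1 * rng.normal(size=400), 1.0, 5.0)

head = LinearMOSHead(dim=192).fit(X[:300], y[:300])   # train split
r = pearson_corr(head.predict(X[300:]), y[300:])      # held-out correlation
print(round(r, 2))
```

In a real pipeline, the head would typically be a small trainable network on top of frozen extractor outputs, but the scoring step (Pearson correlation between predicted and subjective MOS on a held-out split) is the same.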


Notes

  1. https://github.com/freds0/BSpeech-MOS-Prediction.


Acknowledgements

The authors are grateful to the Center of Excellence in Artificial Intelligence (https://ceia.ufg.br/) (CEIA) at the Federal University of Goias (UFG) for their support and to CyberLabs (https://cyberlabs.ai) and Coqui (https://coqui.ai/) for their valuable assistance.

Author information

Corresponding authors

Correspondence to Frederico S. Oliveira, Edresson Casanova, Arnaldo Candido Junior, Lucas R. S. Gris, Anderson S. Soares or Arlindo R. Galvão Filho.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Oliveira, F.S., Casanova, E., Candido Junior, A., Gris, L.R.S., Soares, A.S., Galvão Filho, A.R. (2023). Evaluation of Speech Representations for MOS Prediction. In: Ekštein, K., Pártl, F., Konopík, M. (eds) Text, Speech, and Dialogue. TSD 2023. Lecture Notes in Computer Science, vol 14102. Springer, Cham. https://doi.org/10.1007/978-3-031-40498-6_24

  • DOI: https://doi.org/10.1007/978-3-031-40498-6_24

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-40497-9

  • Online ISBN: 978-3-031-40498-6

  • eBook Packages: Computer Science, Computer Science (R0)
