Quality Assurance for Speech Synthesis with ASR

Conference paper in: Intelligent Systems and Applications (IntelliSys 2022)

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 543)
Abstract

Autoregressive TTS models are still widely used. Due to their stochastic nature, their output can vary from very good to completely unusable from one inference to the next. In this publication, we propose the percentage of completely correctly transcribed sentences (PCTS) of an ASR system as a new objective quality measure for TTS inferences. PCTS is easy to measure and captures the intelligibility dimension of a typical subjective evaluation with mean opinion score (MOS). We show that PCTS yields results similar to a subjective MOS evaluation. A more detailed, semi-automatic error analysis of the differences between ASR transcripts of TTS speech and the text used to generate that speech can help identify problems in the TTS training data that are harder to find with other methods.
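
A minimal sketch of how PCTS could be computed from paired reference sentences and ASR transcripts is shown below. The abstract does not spell out a text normalization, so the lowercasing and punctuation stripping in the normalize helper are assumptions, not the paper's exact procedure.

    import re

    def normalize(text):
        # Assumed normalization (not specified in the abstract): lowercase and
        # strip punctuation so that only word identity decides correctness.
        # \w is unicode-aware, so German umlauts are preserved.
        return re.sub(r"[^\w\s]", "", text.lower()).split()

    def pcts(references, transcripts):
        # PCTS: percentage of sentences whose ASR transcript matches the
        # reference text exactly after normalization.
        assert len(references) == len(transcripts)
        correct = sum(normalize(ref) == normalize(hyp)
                      for ref, hyp in zip(references, transcripts))
        return 100.0 * correct / len(references)

    def mismatches(references, transcripts):
        # Sentence pairs that differ are the raw material for the kind of
        # semi-automatic error analysis the abstract describes.
        return [(ref, hyp) for ref, hyp in zip(references, transcripts)
                if normalize(ref) != normalize(hyp)]

    # Example: pcts(["Guten Morgen."], ["guten morgen"]) returns 100.0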


Notes

  1. https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_de_citrinet_1024.

  2. https://commonvoice.mozilla.org/en/datasets.

  3. https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-german (a minimal usage sketch follows these notes).

  4. https://de.wiktionary.org, accessed on 03.03.2022.
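
As a rough illustration of how these resources fit together, the sketch below transcribes a synthesized utterance with the German wav2vec2 model from note 3 via the Hugging Face transformers library. The file name "tts_output.wav" is a placeholder, and resampling to 16 kHz reflects the model's expected input rather than a detail taken from the paper.

    import torch
    import librosa
    from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

    # German wav2vec2 checkpoint referenced in note 3.
    MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-german"
    processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
    model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

    # Hypothetical TTS output file; wav2vec2 expects 16 kHz mono audio.
    speech, _ = librosa.load("tts_output.wav", sr=16000)
    inputs = processor(speech, sampling_rate=16000, return_tensors="pt")

    with torch.no_grad():
        logits = model(inputs.input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    transcript = processor.batch_decode(predicted_ids)[0]
    print(transcript)

A transcript produced this way can be fed directly into the PCTS computation sketched after the abstract.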


Author information

Corresponding author

Correspondence to René Peinl.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Peinl, R., Wirth, J. (2023). Quality Assurance for Speech Synthesis with ASR. In: Arai, K. (eds) Intelligent Systems and Applications. IntelliSys 2022. Lecture Notes in Networks and Systems, vol 543. Springer, Cham. https://doi.org/10.1007/978-3-031-16078-3_51
