Quality Assurance for Speech Synthesis with ASR

Peinl, René; Wirth, Johannes

doi:10.1007/978-3-031-16078-3_51

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 543)

Included in the following conference series:

Proceedings of SAI Intelligent Systems Conference

Abstract

Autoregressive TTS models are still widely used. Due to their stochastic nature, the output quality may vary from very good to completely unusable from one inference to the next. In this publication, we propose using the percentage of completely correct transcribed sentences (PCTS) of an ASR system as a new objective quality measure for TTS inferences. PCTS is easy to measure and represents the intelligibility dimensions of a typical subjective evaluation with mean opinion score (MOS). We show that PCTS yields results similar to those of subjective MOS evaluation. A more detailed, semi-automatic error analysis of the differences between the ASR transcripts of TTS speech and the text used to generate that speech can help identify problems in the TTS training data that are harder to find with other methods.
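As an illustration of the measure described in the abstract, the following is a minimal sketch of how PCTS could be computed from pairs of input text and ASR transcript. It is not the authors' implementation. The function names (`normalize`, `pcts`, `word_diffs`) are hypothetical, and the text normalization (lowercasing, removing punctuation, collapsing whitespace) is an assumption, since the abstract does not specify how transcripts are compared.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Assumed normalization: NFC, lowercase, no punctuation, single spaces."""
    text = unicodedata.normalize("NFC", text).lower()
    text = re.sub(r"[^\w\s]", " ", text)  # \w is Unicode-aware, so umlauts survive
    return " ".join(text.split())


def pcts(references: list[str], transcripts: list[str]) -> float:
    """Percentage of sentences whose ASR transcript matches the input text exactly."""
    if len(references) != len(transcripts):
        raise ValueError("references and transcripts must have the same length")
    if not references:
        raise ValueError("at least one sentence is required")
    correct = sum(
        normalize(ref) == normalize(hyp)
        for ref, hyp in zip(references, transcripts)
    )
    return 100.0 * correct / len(references)


def word_diffs(reference: str, transcript: str) -> list[tuple[str, str, str]]:
    """Word-level differences as (operation, reference words, transcript words)."""
    ref = normalize(reference).split()
    hyp = normalize(transcript).split()
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    return [
        (tag, " ".join(ref[i1:i2]), " ".join(hyp[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    # Made-up example sentences, not data from the paper.
    texts = ["Guten Morgen!", "Wie geht es dir?", "Das ist ein Test.", "Hallo Welt"]
    asr_out = ["guten morgen", "wie geht es dir", "das ist ein fest", "hallo welt"]
    print(pcts(texts, asr_out))
    print(word_diffs(texts[2], asr_out[2]))
```

The `word_diffs` helper hints at how the semi-automatic error analysis could begin: listing the word-level differences for every sentence that is not transcribed completely correctly, so that recurring error patterns can be inspected by hand.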

Notes

1. https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_de_citrinet_1024.
2. https://commonvoice.mozilla.org/en/datasets.
3. https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-german.
4. https://de.wiktionary.org, accessed on 03.03.2022.

Author information

Authors and Affiliations

Institute for Information Systems, University of Applied Sciences Hof, Alfons-Goppel-Platz 1, 95028, Hof, Germany
René Peinl & Johannes Wirth

Corresponding author

Correspondence to René Peinl.

Editor information

Editors and Affiliations

Faculty of Science and Engineering, Saga University, Saga, Japan
Kohei Arai

Cite this paper

Peinl, R., Wirth, J. (2023). Quality Assurance for Speech Synthesis with ASR. In: Arai, K. (eds) Intelligent Systems and Applications. IntelliSys 2022. Lecture Notes in Networks and Systems, vol 543. Springer, Cham. https://doi.org/10.1007/978-3-031-16078-3_51

DOI: https://doi.org/10.1007/978-3-031-16078-3_51
Published: 01 September 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-16077-6
Online ISBN: 978-3-031-16078-3
eBook Packages: Intelligent Technologies and Robotics (R0)
