Abstract
Autoregressive TTS models are still widely used. Because they are stochastic, their output can vary from very good to completely unusable from one inference to the next. In this publication, we propose using the percentage of completely correctly transcribed sentences (PCTS) of an ASR system as a new objective quality measure for TTS inferences. PCTS is easy to measure and represents the intelligibility dimensions of a typical subjective evaluation with mean opinion score (MOS). We show that PCTS yields results similar to those of a subjective MOS evaluation. A more detailed, semi-automatic error analysis of the differences between the ASR transcripts of TTS speech and the text used to generate that speech can help identify problems in the TTS training data that are harder to find with other methods.
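As an illustration of the measure described in the abstract, the following is a minimal sketch of how PCTS might be computed once the ASR transcripts are available. It is not the authors' implementation. The function names (`normalize`, `pcts`, `word_mismatches`), the text normalization (lowercasing, stripping ASCII punctuation, collapsing whitespace), and the example sentences are all assumptions made for this sketch; the abstract does not specify how transcripts are normalized before comparison.

```python
import difflib
import string

# Assumed normalization: the abstract does not say how texts are normalized.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)  # ASCII punctuation only


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    return " ".join(text.lower().translate(_PUNCT_TABLE).split())


def pcts(references: list[str], transcripts: list[str]) -> float:
    """Percentage of sentences whose ASR transcript matches the input text exactly
    (after the assumed normalization above)."""
    if len(references) != len(transcripts):
        raise ValueError("references and transcripts must have the same length")
    if not references:
        raise ValueError("at least one sentence is required")
    correct = sum(
        normalize(ref) == normalize(hyp)
        for ref, hyp in zip(references, transcripts)
    )
    return 100.0 * correct / len(references)


def word_mismatches(reference: str, transcript: str):
    """List word-level differences as (operation, reference_words, transcript_words).

    A hypothetical helper for the semi-automatic error analysis: it only
    surfaces the differences, which still need manual inspection.
    """
    ref_words = normalize(reference).split()
    hyp_words = normalize(transcript).split()
    matcher = difflib.SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    return [
        (tag, ref_words[i1:i2], hyp_words[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Made-up example data: TTS input texts and the ASR transcripts of the audio.
references = [
    "The weather is nice today.",
    "Please open the window.",
    "She sells sea shells.",
    "I would like a coffee.",
]
transcripts = [
    "the weather is nice today",
    "please open the window",
    "she sells sea shelves",
    "i would like a coffee",
]

print(pcts(references, transcripts))                    # 75.0
print(word_mismatches(references[2], transcripts[2]))   # [('replace', ['shells'], ['shelves'])]
```

Because a sentence only counts when the whole transcript matches, the score is sensitive to the chosen normalization; number formats, abbreviations, and non-ASCII punctuation would need additional handling in practice.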
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Peinl, R., Wirth, J. (2023). Quality Assurance for Speech Synthesis with ASR. In: Arai, K. (eds) Intelligent Systems and Applications. IntelliSys 2022. Lecture Notes in Networks and Systems, vol 543. Springer, Cham. https://doi.org/10.1007/978-3-031-16078-3_51
DOI: https://doi.org/10.1007/978-3-031-16078-3_51
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-16077-6
Online ISBN: 978-3-031-16078-3
eBook Packages: Intelligent Technologies and Robotics (R0)