Abstract
In this paper, we propose a novel method of evaluating text-to-speech (TTS) systems, named “Learning-Based Objective Evaluation” (LBOE), which utilises a set of selected low-level-descriptor (LLD) based features to assess the speech quality of a TTS model. We consider Unit selection speech synthesis (USS), Hidden Markov Model speech synthesis (HMM), Clustergen speech synthesis (CLU) and Deep Neural Network-based speech synthesis (DNN) methods to construct speech synthesis models on two Hindi speech datasets. The models are first evaluated with conventional objective and subjective measures, both of which have inherent limitations: subjective evaluation requires substantial manual effort as well as financial support, while objective measures are inconsistent in delivering the expected results. LBOE is explored in a twofold manner. First, we discuss how LLD features capture the perceptual quality space of synthesised speech. Second, we construct quality-prediction models over those features for various subjective dimensions, e.g. comprehensibility, naturalness and prosody. To show its efficacy on various sound units, the proposed method is analysed on features extracted at the phoneme, syllable and word levels. We observe that our learning-based objective evaluation method is capable of comparing TTS models based on the LLD features extracted at each level. It provides a reliable and accurate evaluation method comparable to subjective evaluation measures and is more consistent than conventional objective criteria.
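The quality-prediction stage described above — statistical functionals (means and variances) computed over frame-level LLDs and their derivatives (see Notes), fed to a NuSVR regressor — can be sketched as follows. This is a minimal illustration only: the synthetic data, the number of LLDs, and the hyperparameters (`nu`, `C`) are assumptions for demonstration, not the paper's exact configuration.

```python
import numpy as np
from sklearn.svm import NuSVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def functionals(lld_frames):
    """Map frame-level LLDs (n_frames, n_llds) to a fixed-length
    utterance vector: mean and variance of the LLDs and of their
    first-order deltas (frame-to-frame differences)."""
    deltas = np.diff(lld_frames, axis=0)
    return np.concatenate([
        lld_frames.mean(axis=0), lld_frames.var(axis=0),
        deltas.mean(axis=0), deltas.var(axis=0),
    ])

rng = np.random.default_rng(0)
# Synthetic stand-in: 40 utterances, each with 100 frames of 5 LLDs,
# paired with a MOS-like subjective score in [1, 5].
X = np.stack([functionals(rng.normal(size=(100, 5))) for _ in range(40)])
y = rng.uniform(1.0, 5.0, size=40)

# NuSVR (LIBSVM backend, per the Notes) after feature standardisation.
model = make_pipeline(StandardScaler(), NuSVR(nu=0.5, C=1.0))
model.fit(X, y)
pred = model.predict(X)
```

In practice the frame-level LLDs would come from an extractor such as openSMILE, and the regressor would be trained against the collected subjective ratings for each quality dimension (comprehensibility, naturalness, prosody).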
Notes
ITU: International Telecommunication Union.
MFCCs: Mel-Frequency Cepstral Coefficients.
PESQ: Perceptual Evaluation of Speech Quality.
MCD: Mel-Cepstral Distortion.
LPCC: Linear Predictive Coding Coefficients.
HMM: Hidden Markov Model.
NiQA: Non-intrusive Quality Assessment.
CMU phone-set for Indic languages (Parlikar et al., 2016).
C: Consonant, V: Vowel.
The MARY Text-to-Speech System (MaryTTS) http://mary.dfki.de/.
CART: Classification And Regression Tree.
Edinburgh Speech Tools Library http://www.cstr.ed.ac.uk/projects/speech_tools/.
Tacotron-2: https://github.com/Rayhane-mamah/Tacotron-2.
WaveNet vocoder: https://github.com/r9y9/wavenet_vocoder.
A transliteration tool for Indic-languages: https://github.com/libindic/indic-trans.
Computed through statistical functionals, i.e. means and variances, on the LLDs as well as their derivatives.
Building Indic Voices: http://festvox.org/bsv/x3528.html.
Here ORIG denotes the original human speech files.
Unified Parser: https://www.iitm.ac.in/donlab/tts/unified.php.
TextGridTools https://textgridtools.readthedocs.io/en/stable/.
openSMILE: audio feature extraction tool (https://audeering.github.io/opensmile/about.html).
sklearn.svm.NuSVR: uses the LIBSVM library at the backend (Chang & Lin, 2011).
References
Arık, S.Ö., Chrzanowski, M., Coates, A., Diamos, G., Gibiansky, A., Kang, Y. & Shoeybi, M. (2017). Deep voice: Real-time neural text-to-speech. In Proceedings of the 34th international conference on machine learning, Vol. 70, (pp. 195–204).
Baby, A., Nishanthi, N., Thomas, A. L. & Murthy, H. A. (2016a). Resources for Indian languages. In International conference on text, speech, and dialogue (pp. 514–521).
Baby, A., Nishanthi, N., Thomas, A. L. & Murthy, H. A. (2016b). A unified parser for developing Indian language text to speech synthesizers. In International conference on text, speech, and dialogue (pp. 514–521).
Beutnagel, M., Conkie, A., Schroeter, J., Stylianou, Y. & Syrdal, A. (1999). The AT&T Next-Gen TTS system. In Joint meeting of ASA, EAA, and DAGA (pp. 18–24).
Black, A. W. (n.d.). CMU INDIC speech synthesis databases. Retrieved December 15, 2021, from http://festvox.org/cmu_indic/index.html
Black, A. W. (2002). Perfect synthesis for all of the people all of the time. In Proceedings of 2002 IEEE workshop on speech synthesis, 2002. (pp. 167–170).
Black, A. W. (2006). Clustergen: A statistical parametric synthesizer using trajectory modeling. In Proceedings of Interspeech-2006, ninth international conference on spoken language processing (pp. 1762–1765).
Black, A. W. & Taylor, P. (1997). Automatically clustering similar units for unit selection in speech synthesis. In Eurospeech97 (pp. 601–604).
Cernak, M. & Rusko, M. (2005). An evaluation of synthetic speech using the PESQ measure. In Proceedings of the European congress on acoustics (pp. 2725–2728).
Chang, C. C., & Lin, C. J. (2011). Libsvm: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3), 1–27.
Chang, Y. Y. (2011). Evaluation of TTS systems in intelligibility and comprehension tasks. In Proceedings of the 23rd conference on computational linguistics and speech processing (pp. 64–78).
Chen, J. D. & Campbell, N. (1999). Objective distance measures for assessing concatenative speech synthesis. In Sixth European conference on speech communication and technology.
Choi, Y., Jung, Y. & Kim, H. (2020). Deep MOS predictor for synthetic speech using cluster-based modeling. In Proceedings of Interspeech 2020 (pp. 1743–1747).
Eyben, F., Scherer, K. R., Schuller, B. W., Sundberg, J., André, E., Busso, C., et al. (2016). The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Transactions on Affective Computing, 7(2), 190–202.
Eyben, F., Weninger, F., Gross, F. & Schuller, B. (2013). Recent developments in opensmile, the Munich open-source multimedia feature extractor. In Proceedings of the 21st ACM international conference on multimedia (pp. 835–838).
Falk, T. & Chan, W. (2004). Single ended method for objective speech quality assessment in narrowband telephony applications. In ITU-T (p. 563).
Falk, T. H., & Moller, S. (2008). Towards signal-based instrumental quality diagnosis for text-to-speech systems. IEEE Signal Processing Letters, 15, 781–784.
Fu, S. W., Tsao, Y., Hwang, H. T. & Wang, H. M. (2018). Quality-net: An end-to-end non-intrusive speech quality assessment model based on BLSTM. In Proceedings of Interspeech 2018 (pp. 1873–1877).
Gibiansky, A., Arik, S., Diamos, G., Miller, J., Peng, K., Ping, W. & Zhou, Y. (2017). Deep voice 2: Multi-speaker neural text-to-speech. In I. Guyon et al. (Eds.), Advances in neural information processing systems, Vol. 30, (pp. 2962–2970). Curran Associates Inc.
Grancharov, V., Zhao, D. Y., Lindblom, J., & Kleijn, W. B. (2006). Low-complexity, nonintrusive speech quality assessment. IEEE Transactions on Audio, Speech, and Language Processing, 14(6), 1948–1956.
Grice, M., Vagges, K. & Hirst, D. (1992). Prosodic form tests and “prosodic function tests”. SAM final report.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference, and prediction. Springer.
Heute, U. (2008). Speech-transmission quality: Aspects and assessment for wideband vs. narrowband signals. Advances in Digital Speech Transmission, 572.
Hinterleitner, F., Norrenbrock, C., Möller, S. & Heute, U. (2013). Predicting the quality of text-to-speech systems from a large-scale feature set. In Interspeech (pp. 383–387).
Huang, D. Y. (2011). Prediction of perceived sound quality of synthetic speech. In Proceedings of APSIPA.
Hunt, A.J. & Black, A.W. (1996). Unit selection in a concatenative speech synthesis system using a large speech database. In 1996 IEEE international conference on acoustics, speech, and signal processing conference proceedings, Vol. 1, (pp. 373–376).
Jekosch, U. (2006). Voice and speech quality perception: Assessment and evaluation. Springer.
Kim, D. S. (2005). Anique: An auditory model for single-ended speech quality estimation. IEEE Transactions on Speech and Audio Processing, 13(5), 821–831.
Klabbers, E., Van Santen, J. P., & Kain, A. (2007). The contribution of various sources of spectral mismatch to audible discontinuities in a diphone database. IEEE Transactions on Audio, Speech, and Language Processing, 15(3), 949–956.
Lewis, J. R. (2004). Effect of speaker and sampling rate on mos-x ratings of concatenative TTS voices. In Proceedings of the human factors and ergonomics society annual meeting, Vol. 48, (pp. 759–763).
Lo, C. C., Fu, S. W., Huang, W. C., Wang, X., Yamagishi, J., Tsao, Y. & Wang, H. M. (2019). MOSNet: Deep learning-based objective assessment for voice conversion. In Proceedings of Interspeech 2019 (pp. 1541–1545).
Loizou, P. C. (2011). Speech quality assessment. In Multimedia analysis, processing and communications (pp. 623–654). Springer.
Loizou, P. C. (2013). Speech enhancement: Theory and practice speech enhancement: Theory and practice. CRC Press.
Lorenzo-Trueba, J., Yamagishi, J., Toda, T., Saito, D., Villavicencio, F., Kinnunen, T. & Ling, Z. (2018) The voice conversion challenge 2018: Promoting development of parallel and nonparallel methods. In Proceedings of Odyssey 2018 the speaker and language recognition workshop (pp. 195–202).
Malviya, S., Mishra, R., Barnwal, S. K., & Tiwary, U. S. (2021). HDRS: Hindi dialogue restaurant search corpus for dialogue state tracking in task-oriented environment. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, 2517–2528. https://doi.org/10.1109/TASLP.2021.3065833
Malviya, S., Mishra, R. & Tiwary, U. S. (2016). Structural analysis of Hindi phonetics and a method for extraction of phonetically rich sentences from a very large Hindi text corpus. In 2016 conference of O-COCOSDA (pp. 188–193).
Mariniak, A. (1993). A global framework for the assessment of synthetic speech without subjects. In Third European conference on speech communication and technology.
Mayo, C., Clark, R. A., & King, S. (2011). Listeners’ weighting of acoustic cues to synthetic speech naturalness: A multidimensional scaling analysis. Speech Communication, 53(3), 311–326.
Mishra, R., Barnwal, S. K., Malviya, S., Mishra, P. & Tiwary, U. S. (2018). Prosodic feature selection of personality traits for job interview performance. In International conference on intelligent systems design and applications (pp. 673–682).
Möller, S. (2017). Quality engineering: Qualität kommunikationstechnischer Systeme. Springer.
Möller, S., Hinterleitner, F., Falk, T. H. & Polzehl, T. (2010). Comparison of approaches for instrumentally predicting the quality of text-to-speech systems. In Eleventh annual conference of the international speech communication association.
Monzo, C., Iriondo, I. & Socoró, J. C. (2014). Voice quality modelling for expressive speech synthesis. The Scientific World Journal.
Moore, B. C. (2012). An introduction to the psychology of hearing. Brill.
Müller, S., Chan, W., Côté, N., Falk, T. H., Raake, A., & Wältermann, M. (2011). Speech quality estimation: Models and trends. IEEE Signal Processing Magazine, 28(6), 18–28.
Norrenbrock, C. R., Hinterleitner, F., Heute, U., & Moller, S. (2012). Instrumental assessment of prosodic quality for text-to-speech signals. IEEE Signal Processing Letters, 19(5), 255–258.
Norrenbrock, C. R., Hinterleitner, F., Heute, U., & Möller, S. (2015). Quality prediction of synthesized speech based on perceptual quality dimensions. Speech Communication, 66, 17–35.
Novorita, B. (1999). Incorporation of temporal masking effects into bark spectral distortion measure. In Proceedings of ICASSP, Vol. 2, (pp. 665–668).
Pammi, S., Charfuelan, M. & Schröder, M. (2010). Multilingual voice creation toolkit for the mary TTS platform. In LREC.
Papadopoulos, P., Travadi, R. & Narayanan, S. (2017). Global SNR estimation of speech signals for unknown noise conditions using noise adapted non-linear regression. In Proceedings of Interspeech 2017 (pp. 3842–3846).
Parlikar, A., Sitaram, S., Wilkinson, A. & Black, A. W. (2016). The festvox indic frontend for grapheme to phoneme conversion. In WILDRE: Workshop on indian language data-resources and evaluation.
Parrish, W. M. (1951). The concept of “naturalness”. Quarterly Journal of Speech, 37(4), 448–454.
Ping, W., Peng, K., Gibiansky, A., Arik, S.Ö., Kannan, A. , Narang, S. & Miller, J. (2018). Deep voice 3: Scaling text-to-speech with convolutional sequence learning. In ICLR-2018. OpenReview.net.
Prakash, A., Prakash, J. J. & Murthy, H. A. (2016). Acoustic analysis of syllables across Indian languages. In INTERSPEECH (pp. 327–331).
Quackenbush, S. R., Barnwell, T. P., & Clements, M. A. (1988). Objective measures of speech quality. Prentice Hall.
ITU-T Rec. P.85. (1994). A method for subjective performance assessment of the quality of speech voice output devices. International Telecommunication Union.
ITU-T Rec. P.862. (2001). Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs. International Telecommunication Union.
Rosipal, R. & Krämer, N. (2005). Overview and recent advances in partial least squares. In International statistical and optimization perspectives workshop “Subspace, latent structure and feature selection” (pp. 34–51).
Schröder, M. & Hunecke, A. (2007). Creating German unit selection voices for the Mary TTSs platform from the Bits corpora. In Proceedings of SSW6.
Schuller, B. (2006). Automatische emotionserkennung aus sprachlicher und manueller Interaktion (Unpublished doctoral dissertation). Technische Universität München.
Schuller, B., et al. (2009). The Interspeech 2009 emotion challenge. In Proceedings 10th ISCA.
Schuller, B., Steidl, S., Batliner, A., Burkhardt, F., Devillers, L., Müller, C. & Narayanan, S. S. (2010). The Interspeech 2010 paralinguistic challenge. In Proceedings 11th ISCA.
Schuller, B., Steidl, S., Batliner, A., Nöth, E., Vinciarelli, A., Burkhardt, F., & Weiss, B (2012). The Interspeech 2012 speaker trait challenge. In Proceedings 13th ISCA.
Schuller, B., Steidl, S., Batliner, A., Schiel, F., Krajewski, J., et al. (2011). The Interspeech 2011 speaker state challenge. In Proceedings 12th ISCA.
Schuller, B., Steidl, S., Batliner, A., Vinciarelli, A., Scherer, K., Ringeval, F. & Kim, S. (2013). The Interspeech 2013 computational paralinguistics challenge: Social signals, conflict, emotion, autism. In Proceedings 14th ISCA.
Schuller, B.W., Steidl, S., Batliner, A., Hirschberg, J., Burgoon, J.K., Baird, A. & Evanini, K. (2016). The Interspeech 2016 computational paralinguistics challenge: Deception, sincerity & native language. In Interspeech, Vol. 2016, (pp. 2001–2005).
Shen, J., Pang, R., Weiss, R.J., Schuster, M., Jaitly, N., Yang, Z. & Wu, Y. (2018). Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In Proceedings of ICASSP (pp. 4779–4783). IEEE.
Stylianou, Y. & Syrdal, A. K. (2001). Perceptual and objective detection of discontinuities in concatenative speech synthesis. In 2001 IEEE international conference on acoustics, speech, and signal processing. proceedings (Cat. No. 01CH37221), Vol. 2, (pp. 837–840).
Sydeserff, H., Caley, R., Isard, S. D., Jack, M. A., Monaghan, A. I., & Verhoeven, J. (1992). Evaluation of speech synthesis techniques in a comprehension task. Speech Communication, 11(2–3), 189–194.
Taylor, P. (2009). Text-to-speech synthesis. Cambridge University Press.
Thangarajan, R., & Natarajan, A. (2008). Syllable based continuous speech recognition for Tamil. South Asian Language Review, 18(1), 72–85.
Tokuda, K., Kobayashi, T., Masuko, T. & Imai, S. (1994). Mel-generalized cepstral analysis-a unified approach to speech spectral estimation. In Third international conference on spoken language processing.
Uriel, E. (2013). Hypothesis testing in the multiple regression model. Universidad de Valencia, Department of Economics.
Valentini-Botinhao, C., Yamagishi, J. & King, S. (2011). Can objective measures predict the intelligibility of modified hmm-based synthetic speech in noise? In Twelfth annual conference of the international speech communication association.
Valstar, M., Schuller, B., Smith, K., Eyben, F., Jiang, B., Bilakhia, S. & Pantic, M. (2013). Avec 2013: The continuous audio/visual emotion and depression recognition challenge. In Proceedings of the 3rd ACM international workshop on audio/visual emotion challenge (pp. 3–10).
van Bezooijen, R., van Heuven, V., Gibbon, D., Moore, R. & Winski, R. (1997). Assessment of synthesis systems. In D. Gibbon, R. Moore, & R. Winski (Eds.) Handbook of standards and resources for spoken language systems (pp. 481–563).
van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A. & Kavukcuoglu, K. (2016). Wavenet: A generative model for raw audio. In 9th ISCA speech synthesis workshop (pp. 125–125).
van Heuven, V. J. & van Bezooijen, R. (1995). Quality evaluation of synthesized speech. In Speech coding and synthesis (pp. 707–738). Citeseer.
Vepa, J., & King, S. (2006). Subjective evaluation of join cost and smoothing methods for unit selection speech synthesis. IEEE Transactions on Audio, Speech, and Language Processing, 14(5), 1763–1771.
Viswanathan, M., & Viswanathan, M. (2005). Measuring speech quality for text-to-speech systems: Development and assessment of a modified mean opinion score (MOS) scale. Computer Speech & Language, 19(1), 55–83.
Wang, Y., Skerry-Ryan, R., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N. & Saurous, R. A. (2017). Tacotron: Towards end-to-end speech synthesis. In Proceedings of Interspeech 2017 (pp. 4006–4010).
Wei, B. & Gibson, J. D. (2001). Comparison of distance measures in discrete spectral modeling. S. M. U.
Yi, Z., Huang, W. C., Tian, X., Yamagishi, J., Das, R.K., Kinnunen, T. & Toda, T. (2020). Voice conversion challenge 2020: Intra-lingual semi-parallel and cross-lingual voice conversion. In Proceedings of the joint workshop for the blizzard challenge and voice conversion challenge 2020 (pp. 80–98).
Young, S.J., Kershaw, D., Odell, J., Ollason, D., Valtchev, V. & Woodland, P. (2006). The HTK book (Version 3.4). Cambridge University Press.
Zen, H., Tokuda, K., & Black, A. W. (2009). Statistical parametric speech synthesis. Speech Communication, 51(11), 1039–1064.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Malviya, S., Mishra, R., Barnwal, S.K. et al. A framework for quality assessment of synthesised speech using learning-based objective evaluation. Int J Speech Technol 26, 221–243 (2023). https://doi.org/10.1007/s10772-023-10021-4