
A framework for quality assessment of synthesised speech using learning-based objective evaluation

  • Published:
International Journal of Speech Technology

Abstract

In this paper, we propose a novel method for evaluating text-to-speech (TTS) systems, named Learning-Based Objective Evaluation (LBOE), which utilises a set of selected low-level-descriptor (LLD) based features to assess the speech quality of a TTS model. We consider unit selection speech synthesis (USS), hidden Markov model based synthesis (HMM), Clustergen synthesis (CLU) and deep neural network based synthesis (DNN) to construct speech synthesis models on two Hindi speech datasets. These models are first evaluated with conventional objective and subjective measures, both of which have inherent limitations: subjective evaluation requires substantial manual effort as well as financial support, while objective measures are inconsistent in delivering the expected results. The proposed LBOE method is explored in a twofold manner. First, we discuss how LLD features capture the perceptual quality space of synthesised speech. Second, we construct quality-prediction models from those features to predict various subjective dimensions, e.g. comprehensibility, naturalness and prosody. To show its efficacy on various sound units, the proposed method is analysed on features extracted at the phoneme, syllable and word levels. We observe that our learning-based objective evaluation is capable of comparing TTS models based on the LLD features extracted at each level. It provides a reliable and accurate evaluation comparable to subjective measures while being more consistent than conventional objective criteria.
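The quality-prediction step described in the abstract can be sketched as follows. This is a hedged illustration, not the authors' exact pipeline: the feature dimensionality, the synthetic data and the 1–5 rating scale are assumptions for demonstration; the regressor (sklearn.svm.NuSVR, backed by LIBSVM) is the one named in the paper's notes.

```python
# Minimal sketch of an LBOE-style quality predictor: map per-utterance LLD
# functional features to a subjective score (e.g. naturalness).
# All data below is a synthetic placeholder, not from the paper.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import NuSVR

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 88))        # 200 utterances x 88 functionals (eGeMAPS-sized; assumed)
y = rng.uniform(1.0, 5.0, size=200)   # placeholder subjective ratings on a 1-5 scale

# Standardise the features, then fit a nu-SVR with an RBF kernel.
model = make_pipeline(StandardScaler(), NuSVR(nu=0.5, C=1.0, kernel="rbf"))
model.fit(X, y)
pred = model.predict(X)               # one predicted quality score per utterance
```

In practice, `X` would hold LLD functionals extracted at the phoneme, syllable or word level and `y` the mean subjective ratings for the corresponding dimension; `nu` and `C` would be chosen by cross-validation.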


Notes

  1. ITU: International Telecommunication Union.

  2. MFCCs: Mel-Frequency Cepstral Coefficients.

  3. PESQ: Perceptual Evaluation of Speech Quality.

  4. MCD: Mel-Cepstral Distortion.

  5. LPCC: Linear Predictive Coding Coefficients.

  6. HMM: Hidden Markov Model.

  7. NiQA: Non-intrusive Quality Assessment.

  8. CMU phone-set for Indic languages (Parlikar et al., 2016).

  9. C: Consonant, V: Vowel.

  10. The MARY Text-to-Speech System (MaryTTS) http://mary.dfki.de/.

  11. CART: Classification And Regression Tree.

  12. Edinburgh Speech Tools Library http://www.cstr.ed.ac.uk/projects/speech_tools/.

  13. Tacotron-2: https://github.com/Rayhane-mamah/Tacotron-2.

  14. WaveNet vocoder: https://github.com/r9y9/wavenet_vocoder.

  15. A transliteration tool for Indic-languages: https://github.com/libindic/indic-trans.

  16. Computed through statistical functionals, i.e. means and variances of the LLDs as well as their derivatives.

  17. Building Indic Voices: http://festvox.org/bsv/x3528.html.

  18. Here ORIG denotes the original human speech files.

  19. Unified Parser: https://www.iitm.ac.in/donlab/tts/unified.php.

  20. TextGridTools https://textgridtools.readthedocs.io/en/stable/.

  21. openSMILE: audio feature extraction tool (https://audeering.github.io/opensmile/about.html).

  22. sklearn.svm.NuSVR: Uses LIBSVM library at the backend Chang and Lin (2011).
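Note 16 above explains how utterance-level features are obtained from frame-level LLDs. A minimal sketch, assuming each utterance arrives as a (frames × LLDs) matrix and approximating the derivative by first-order frame differences:

```python
# Hedged sketch of note 16: statistical functionals (mean, variance)
# over frame-level LLDs and their first-order deltas.
import numpy as np

def lld_functionals(lld):
    """lld: (n_frames, n_llds) frame-level descriptors -> 1-D functional vector."""
    delta = np.diff(lld, axis=0)                 # first-order derivative (delta) per LLD
    feats = [lld.mean(axis=0), lld.var(axis=0),  # functionals on the LLDs themselves
             delta.mean(axis=0), delta.var(axis=0)]  # and on their deltas
    return np.concatenate(feats)                 # 4 * n_llds functionals per utterance

frames = np.random.default_rng(1).normal(size=(120, 10))  # toy: 120 frames, 10 LLDs
vec = lld_functionals(frames)
print(vec.shape)  # (40,)
```

The same vector can be computed over phoneme-, syllable- or word-sized spans by slicing the frame matrix at the corresponding boundaries before applying the functionals.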

References

  • Arık, S.Ö., Chrzanowski, M., Coates, A., Diamos, G., Gibiansky, A., Kang, Y. & Shoeybi, M. (2017). Deep voice: Real-time neural text-to-speech. In Proceedings of the 34th international conference on machine learning, Vol. 70, (pp. 195–204).

  • Baby, A., Nishanthi, N., Thomas, A. L. & Murthy, H. A. (2016a). Resources for Indian languages. In International conference on text, speech, and dialogue (pp. 514–521).

  • Baby, A., Nishanthi, N., Thomas, A. L. & Murthy, H. A. (2016b). A unified parser for developing Indian language text to speech synthesizers. In International conference on text, speech, and dialogue (pp. 514–521).

  • Beutnagel, M., Conkie, A., Schroeter, J., Stylianou, Y. & Syrdal, A. (1999). The AT&T Next-Gen TTS system. In Joint meeting of ASA, EAA, and DAGA (pp. 18–24).

  • Black, A. W. (n.d.). CMU INDIC speech synthesis databases. Retrieved December 15, 2021, from http://festvox.org/cmu_indic/index.html

  • Black, A. W. (2002). Perfect synthesis for all of the people all of the time. In Proceedings of 2002 IEEE workshop on speech synthesis, 2002. (pp. 167–170).

  • Black, A. W. (2006). Clustergen: A statistical parametric synthesizer using trajectory modeling. In Proceedings of Interspeech-2006, ninth international conference on spoken language processing (pp. 1762–1765).

  • Black, A. W. & Taylor, P. (1997). Automatically clustering similar units for unit selection in speech synthesis. In Eurospeech97 (pp. 601–604).

  • Cernak, M. & Rusko, M. (2005). An evaluation of synthetic speech using the PESQ measure. In Proceedings of the European congress on acoustics (pp. 2725–2728).

  • Chang, C. C., & Lin, C. J. (2011). Libsvm: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3), 1–27.


  • Chang, Y. Y. (2011). Evaluation of TTS systems in intelligibility and comprehension tasks. In Proceedings of the 23rd conference on computational linguistics and speech processing (pp. 64–78).

  • Chen, J. D. & Campbell, N. (1999). Objective distance measures for assessing concatenative speech synthesis. In Sixth European conference on speech communication and technology.

  • Choi, Y., Jung, Y. & Kim, H. (2020). Deep MOS predictor for synthetic speech using cluster-based modeling. In Proceedings of Interspeech 2020 (pp. 1743–1747).

  • Eyben, F., Scherer, K. R., Schuller, B. W., Sundberg, J., André, E., Busso, C., et al. (2016). The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Transactions on Affective Computing, 7(2), 190–202.


  • Eyben, F., Weninger, F., Gross, F. & Schuller, B. (2013). Recent developments in opensmile, the Munich open-source multimedia feature extractor. In Proceedings of the 21st ACM international conference on multimedia (pp. 835–838).

  • Falk, T. & Chan, W. (2004). Single ended method for objective speech quality assessment in narrowband telephony applications. In ITU-T (p. 563).

  • Falk, T. H., & Moller, S. (2008). Towards signal-based instrumental quality diagnosis for text-to-speech systems. IEEE Signal Processing Letters, 15, 781–784.


  • Fu, S. W., Tsao, Y., Hwang, H. T. & Wang, H. M. (2018). Quality-net: An end-to-end non-intrusive speech quality assessment model based on BLSTM. In Proceedings of Interspeech 2018 (pp. 1873–1877).

  • Gibiansky, A., Arik, S., Diamos, G., Miller, J., Peng, K., Ping, W. & Zhou, Y. (2017). Deep voice 2: Multi-speaker neural text-to-speech. In I. Guyon et al. (Eds.), Advances in neural information processing systems, Vol. 30, (pp. 2962–2970). Curran Associates Inc.

  • Grancharov, V., Zhao, D. Y., Lindblom, J., & Kleijn, W. B. (2006). Low-complexity, nonintrusive speech quality assessment. IEEE Transactions on Audio, Speech, and Language Processing, 14(6), 1948–1956.


  • Grice, M., Vagges, K. & Hirst, D. (1992). Prosodic form tests and “prosodic function tests”. SAM final report.

  • Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference, and prediction. Springer.


  • Heute, U. (2008). Speech-transmission quality: Aspects and assessment for wideband vs. narrowband signals. Advances in Digital Speech Transmission, 572.

  • Hinterleitner, F., Norrenbrock, C., Möller, S. & Heute, U. (2013). Predicting the quality of text-to-speech systems from a large-scale feature set. In Interspeech (pp. 383–387).

  • Huang, D. Y. (2011). Prediction of perceived sound quality of synthetic speech. In Proceedings of APSIPA.

  • Hunt, A.J. & Black, A.W. (1996). Unit selection in a concatenative speech synthesis system using a large speech database. In 1996 IEEE international conference on acoustics, speech, and signal processing conference proceedings, Vol. 1, (pp. 373–376).

  • Jekosch, U. (2006). Voice and speech quality perception: Assessment and evaluation. Springer.


  • Kim, D. S. (2005). Anique: An auditory model for single-ended speech quality estimation. IEEE Transactions on Speech and Audio Processing, 13(5), 821–831.


  • Klabbers, E., Van Santen, J. P., & Kain, A. (2007). The contribution of various sources of spectral mismatch to audible discontinuities in a diphone database. IEEE Transactions on Audio, Speech, and Language Processing, 15(3), 949–956.


  • Lewis, J. R. (2004). Effect of speaker and sampling rate on mos-x ratings of concatenative TTS voices. In Proceedings of the human factors and ergonomics society annual meeting, Vol. 48, (pp. 759–763).

  • Lo, C. C., Fu, S. W., Huang, W. C., Wang, X., Yamagishi, J., Tsao, Y. & Wang, H. M. (2019). MOSNet: Deep learning-based objective assessment for voice conversion. In Proceedings of Interspeech 2019 (pp. 1541–1545).

  • Loizou, P. C. (2011). Speech quality assessment. In Multimedia analysis, processing and communications (pp. 623–654). Springer.

  • Loizou, P. C. (2013). Speech enhancement: Theory and practice speech enhancement: Theory and practice. CRC Press.


  • Lorenzo-Trueba, J., Yamagishi, J., Toda, T., Saito, D., Villavicencio, F., Kinnunen, T. & Ling, Z. (2018) The voice conversion challenge 2018: Promoting development of parallel and nonparallel methods. In Proceedings of Odyssey 2018 the speaker and language recognition workshop (pp. 195–202).

  • Malviya, S., Mishra, R., Barnwal, S. K., & Tiwary, U. S. (2021). HDRS: Hindi dialogue restaurant search corpus for dialogue state tracking in task-oriented environment. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, 2517–2528. https://doi.org/10.1109/TASLP.2021.3065833


  • Malviya, S., Mishra, R. & Tiwary, U. S. (2016). Structural analysis of Hindi phonetics and a method for extraction of phonetically rich sentences from a very large Hindi text corpus. In 2016 conference of O-COCOSDA (pp. 188–193).

  • Mariniak, A. (1993). A global framework for the assessment of synthetic speech without subjects. In Third European conference on speech communication and technology.

  • Mayo, C., Clark, R. A., & King, S. (2011). Listeners’ weighting of acoustic cues to synthetic speech naturalness: A multidimensional scaling analysis. Speech Communication, 53(3), 311–326.


  • Mishra, R., Barnwal, S. K., Malviya, S., Mishra, P. & Tiwary, U. S. (2018). Prosodic feature selection of personality traits for job interview performance. In International conference on intelligent systems design and applications (pp. 673–682).

  • Möller, S. (2017). Quality engineering: Qualität kommunikationstechnischer Systeme. Springer.


  • Möller, S., Hinterleitner, F., Falk, T. H. & Polzehl, T. (2010). Comparison of approaches for instrumentally predicting the quality of text-to-speech systems. In Eleventh annual conference of the international speech communication association.

  • Monzo, C., Iriondo, I. & Socoró, J. C. (2014). Voice quality modelling for expressive speech synthesis. The Scientific World Journal.

  • Moore, B. C. (2012). An introduction to the psychology of hearing. Brill.


  • Möller, S., Chan, W., Côté, N., Falk, T. H., Raake, A., & Wältermann, M. (2011). Speech quality estimation: Models and trends. IEEE Signal Processing Magazine, 28(6), 18–28.


  • Norrenbrock, C. R., Hinterleitner, F., Heute, U., & Moller, S. (2012). Instrumental assessment of prosodic quality for text-to-speech signals. IEEE Signal Processing Letters, 19(5), 255–258.


  • Norrenbrock, C. R., Hinterleitner, F., Heute, U., & Möller, S. (2015). Quality prediction of synthesized speech based on perceptual quality dimensions. Speech Communication, 66, 17–35.


  • Novorita, B. (1999). Incorporation of temporal masking effects into bark spectral distortion measure. In Proceedings of ICASSP, Vol. 2, (pp. 665–668).

  • Pammi, S., Charfuelan, M. & Schröder, M. (2010). Multilingual voice creation toolkit for the mary TTS platform. In LREC.

  • Papadopoulos, P., Travadi, R. & Narayanan, S. (2017). Global SNR estimation of speech signals for unknown noise conditions using noise adapted non-linear regression. In Proceedings of Interspeech 2017 (pp. 3842–3846).

  • Parlikar, A., Sitaram, S., Wilkinson, A. & Black, A. W. (2016). The festvox indic frontend for grapheme to phoneme conversion. In WILDRE: Workshop on indian language data-resources and evaluation.

  • Parrish, W. M. (1951). The concept of “naturalness’’. Quarterly Journal of Speech, 37(4), 448–454.


  • Ping, W., Peng, K., Gibiansky, A., Arik, S.Ö., Kannan, A. , Narang, S. & Miller, J. (2018). Deep voice 3: Scaling text-to-speech with convolutional sequence learning. In ICLR-2018. OpenReview.net.

  • Prakash, A., Prakash, J. J. & Murthy, H. A. (2016). Acoustic analysis of syllables across Indian languages. In INTERSPEECH (pp. 327–331).

  • Quackenbush, S. R., Barnwell, T. P., & Clements, M. A. (1988). Objective measures of speech quality. Prentice Hall.


  • ITU-T. (1994). Recommendation P.85: A method for subjective performance assessment of the quality of speech voice output devices. International Telecommunication Union.


  • ITU-T. (2001). Recommendation P.862: Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs. International Telecommunication Union.

  • Rosipal, R. & Krämer, N. (2005). Overview and recent advances in partial least squares. In International statistical and optimization perspectives workshop "subspace, latent structure and feature selection” (pp. 34–51).

  • Schröder, M. & Hunecke, A. (2007). Creating German unit selection voices for the Mary TTSs platform from the Bits corpora. In Proceedings of SSW6.

  • Schuller, B. (2006). Automatische emotionserkennung aus sprachlicher und manueller Interaktion (Unpublished doctoral dissertation). Technische Universität München.

  • Schuller, B., Steidl, S., & Batliner, A. (2009). The Interspeech 2009 emotion challenge. In Proceedings 10th ISCA.

  • Schuller, B., Steidl, S., Batliner, A., Burkhardt, F., Devillers, L., Müller, C. & Narayanan, S. S. (2010). The Interspeech 2010 paralinguistic challenge. In Proceedings 11th ISCA.

  • Schuller, B., Steidl, S., Batliner, A., Nöth, E., Vinciarelli, A., Burkhardt, F., & Weiss, B (2012). The Interspeech 2012 speaker trait challenge. In Proceedings 13th ISCA.

  • Schuller, B., Steidl, S., Batliner, A., Schiel, F., Krajewski, J., et al. (2011). The Interspeech 2011 speaker state challenge. In Proceedings 12th ISCA.

  • Schuller, B., Steidl, S., Batliner, A., Vinciarelli, A., Scherer, K., Ringeval, F. & Kim, S. (2013). The Interspeech 2013 computational paralinguistics challenge: Social signals, conflict, emotion, autism. In Proceedings 14th ISCA.

  • Schuller, B.W., Steidl, S., Batliner, A., Hirschberg, J., Burgoon, J.K., Baird, A. & Evanini, K. (2016). The Interspeech 2016 computational paralinguistics challenge: Deception, sincerity & native language. In Interspeech, Vol. 2016, (pp. 2001–2005).

  • Shen, J., Pang, R., Weiss, R.J., Schuster, M., Jaitly, N., Yang, Z. & Wu, Y. (2018). Natural TTS synthesis by conditioning wavenet on MEL spectrogram predictions. In Proceedings of ICASSP (pp. 4779–4783). IEEE.

  • Stylianou, Y. & Syrdal, A. K. (2001). Perceptual and objective detection of discontinuities in concatenative speech synthesis. In 2001 IEEE international conference on acoustics, speech, and signal processing. proceedings (Cat. No. 01CH37221), Vol. 2, (pp. 837–840).

  • Sydeserff, H., Caley, R., Isard, S. D., Jack, M. A., Monaghan, A. I., & Verhoeven, J. (1992). Evaluation of speech synthesis techniques in a comprehension task. Speech Communication, 11(2–3), 189–194.


  • Taylor, P. (2009). Text-to-speech synthesis. Cambridge University Press.


  • Thangarajan, R., & Natarajan, A. (2008). Syllable based continuous speech recognition for Tamil. South Asian Language Review, 18(1), 72–85.


  • Tokuda, K., Kobayashi, T., Masuko, T. & Imai, S. (1994). Mel-generalized cepstral analysis-a unified approach to speech spectral estimation. In Third international conference on spoken language processing.

  • Uriel, E. (2013). Hypothesis testing in the multiple regression model. Universidad de Valencia, Department of Economics.


  • Valentini-Botinhao, C., Yamagishi, J. & King, S. (2011). Can objective measures predict the intelligibility of modified hmm-based synthetic speech in noise? In Twelfth annual conference of the international speech communication association.

  • Valstar, M., Schuller, B., Smith, K., Eyben, F., Jiang, B., Bilakhia, S. & Pantic, M. (2013). Avec 2013: The continuous audio/visual emotion and depression recognition challenge. In Proceedings of the 3rd ACM international workshop on audio/visual emotion challenge (pp. 3–10).

  • van Bezooijen, R., van Heuven, V., Gibbon, D., Moore, R. & Winski, R. (1997). Assessment of synthesis systems. In D. Gibbon, R. Moore, & R. Winski (Eds.) Handbook of standards and resources for spoken language systems (pp. 481–563).

  • van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A. & Kavukcuoglu, K. (2016). Wavenet: A generative model for raw audio. In 9th ISCA speech synthesis workshop (p. 125).

  • van Heuven, V. J. & van Bezooijen, R. (1995). Quality evaluation of synthesized speech. In Speech coding and synthesis (pp. 707–738). Elsevier.

  • Vepa, J., & King, S. (2006). Subjective evaluation of join cost and smoothing methods for unit selection speech synthesis. IEEE Transactions on Audio, Speech, and Language Processing, 14(5), 1763–1771.


  • Viswanathan, M., & Viswanathan, M. (2005). Measuring speech quality for text-to-speech systems: Development and assessment of a modified mean opinion score (MOS) scale. Computer Speech & Language, 19(1), 55–83.


  • Wang, Y., Skerry-Ryan, R., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N. & Saurous, R. A. (2017). Tacotron: Towards end-to-end speech synthesis. In Proceedings of Interspeech 2017 (pp. 4006–4010).

  • Wei, B. & Gibson, J. D. (2001). Comparison of distance measures in discrete spectral modeling. Southern Methodist University.

  • Yi, Z., Huang, W. C., Tian, X., Yamagishi, J., Das, R.K., Kinnunen, T. & Toda, T. (2020). Voice conversion challenge 2020: Intra-lingual semi-parallel and cross-lingual voice conversion. In Proceedings of the joint workshop for the blizzard challenge and voice conversion challenge 2020 (pp. 80–98).

  • Young, S. J., Kershaw, D., Odell, J., Ollason, D., Valtchev, V. & Woodland, P. (2006). The HTK book (Version 3.4). Cambridge University Press.

  • Zen, H., Tokuda, K., & Black, A. W. (2009). Statistical parametric speech synthesis. Speech Communication, 51(11), 1039–1064.



Correspondence to Shrikant Malviya.


Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Malviya, S., Mishra, R., Barnwal, S.K. et al. A framework for quality assessment of synthesised speech using learning-based objective evaluation. Int J Speech Technol 26, 221–243 (2023). https://doi.org/10.1007/s10772-023-10021-4

