Abstract
In this paper, we propose a novel method of evaluating text-to-speech (TTS) systems, named “Learning-Based Objective Evaluation” (LBOE), which utilises a set of selected low-level-descriptor (LLD) based features to assess the speech quality of a TTS model. We consider Unit selection speech synthesis (USS), Hidden Markov Model speech synthesis (HMM), Clustergen speech synthesis (CLU) and Deep Neural Network-based speech synthesis (DNN) methods to construct speech synthesis models on two Hindi speech datasets. The models are first evaluated with conventional objective and subjective measures, both of which have inherent limitations: subjective evaluation requires substantial manual effort as well as financial support, while objective measures are inconsistent in delivering the expected results. LBOE is explored in a twofold manner. First, we discuss how LLD features capture the perceptual quality space of synthesised speech. Second, we construct quality-prediction models over those features for various subjective dimensions, e.g. comprehensibility, naturalness and prosody. To show its efficacy on various sound units, the proposed method is analysed on features extracted at the phoneme, syllable and word levels. We observe that our learning-based objective evaluation method is capable of comparing TTS models based on the LLD features extracted at each level. It provides a reliable and accurate evaluation method comparable to subjective evaluation measures and is more consistent than conventional objective criteria.
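The quality-prediction stage described above — statistical functionals (means and variances) computed over frame-level LLDs and their derivatives (see Notes), fed to a NuSVR regressor — can be sketched as follows. This is a minimal illustration only: the synthetic data, the number of LLDs, and the hyperparameters (`nu`, `C`) are assumptions for demonstration, not the paper's exact configuration.

```python
import numpy as np
from sklearn.svm import NuSVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def functionals(lld_frames):
    """Map frame-level LLDs (n_frames, n_llds) to a fixed-length
    utterance vector: mean and variance of the LLDs and of their
    first-order deltas (frame-to-frame differences)."""
    deltas = np.diff(lld_frames, axis=0)
    return np.concatenate([
        lld_frames.mean(axis=0), lld_frames.var(axis=0),
        deltas.mean(axis=0), deltas.var(axis=0),
    ])

rng = np.random.default_rng(0)
# Synthetic stand-in: 40 utterances, each with 100 frames of 5 LLDs,
# paired with a MOS-like subjective score in [1, 5].
X = np.stack([functionals(rng.normal(size=(100, 5))) for _ in range(40)])
y = rng.uniform(1.0, 5.0, size=40)

# NuSVR (LIBSVM backend, per the Notes) after feature standardisation.
model = make_pipeline(StandardScaler(), NuSVR(nu=0.5, C=1.0))
model.fit(X, y)
pred = model.predict(X)
```

In practice the frame-level LLDs would come from an extractor such as openSMILE, and the regressor would be trained against the collected subjective ratings for each quality dimension (comprehensibility, naturalness, prosody).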
Notes
ITU: International Telecommunication Union.
MFCCs: Mel-Frequency Cepstral Coefficients.
PESQ: Perceptual Evaluation of Speech Quality.
MCD: Mel-Cepstral Distortion.
LPCC: Linear Predictive Coding Coefficients.
HMM: Hidden Markov Model.
NiQA: Non-intrusive Quality Assessment.
CMU phone-set for Indic languages (Parlikar et al., 2016).
C: Consonant, V: Vowel.
The MARY Text-to-Speech System (MaryTTS) http://mary.dfki.de/.
CART: Classification And Regression Tree.
Edinburgh Speech Tools Library http://www.cstr.ed.ac.uk/projects/speech_tools/.
Tacotron-2: https://github.com/Rayhane-mamah/Tacotron-2.
WaveNet vocoder: https://github.com/r9y9/wavenet_vocoder.
A transliteration tool for Indic-languages: https://github.com/libindic/indic-trans.
Computed through statistical functionals, i.e. means and variances, on the LLDs as well as their derivatives.
Building Indic Voices: http://festvox.org/bsv/x3528.html.
Here ORIG denotes the original human speech files.
Unified Parser: https://www.iitm.ac.in/donlab/tts/unified.php.
TextGridTools https://textgridtools.readthedocs.io/en/stable/.
openSMILE: audio feature extraction tool (https://audeering.github.io/opensmile/about.html).
sklearn.svm.NuSVR: uses the LIBSVM library at the backend (Chang & Lin, 2011).
References
Arık, S.Ö., Chrzanowski, M., Coates, A., Diamos, G., Gibiansky, A., Kang, Y. & Shoeybi, M. (2017). Deep voice: Real-time neural text-to-speech. In Proceedings of the 34th international conference on machine learning, Vol. 70, (pp. 195–204).
Baby, A., Nishanthi, N., Thomas, A. L. & Murthy, H. A. (2016a). Resources for Indian languages. In International conference on text, speech, and dialogue (pp. 514–521).
Baby, A., Nishanthi, N., Thomas, A. L. & Murthy, H. A. (2016b). A unified parser for developing Indian language text to speech synthesizers. In International conference on text, speech, and dialogue (pp. 514–521).
Beutnagel, M., Conkie, A., Schroeter, J., Stylianou, Y. & Syrdal, A. (1999). The AT&T Next-Gen TTS system. In Joint meeting of ASA, EAA, and DAGA (pp. 18–24).
Black, A. W. (n.d.). CMU INDIC speech synthesis databases. Retrieved December 15, 2021, from http://festvox.org/cmu_indic/index.html
Black, A. W. (2002). Perfect synthesis for all of the people all of the time. In Proceedings of 2002 IEEE workshop on speech synthesis, 2002. (pp. 167–170).
Black, A. W. (2006). Clustergen: A statistical parametric synthesizer using trajectory modeling. In Proceedings of Interspeech-2006, ninth international conference on spoken language processing (pp. 1762–1765).
Black, A. W. & Taylor, P. (1997). Automatically clustering similar units for unit selection in speech synthesis. In Eurospeech97 (pp. 601–604).
Cernak, M. & Rusko, M. (2005). An evaluation of synthetic speech using the PESQ measure. In Proceedings of the European congress on acoustics (pp. 2725–2728).
Chang, C. C., & Lin, C. J. (2011). Libsvm: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3), 1–27.
Chang, Y. Y. (2011). Evaluation of TTS systems in intelligibility and comprehension tasks. In Proceedings of the 23rd conference on computational linguistics and speech processing (pp. 64–78).
Chen, J. D. & Campbell, N. (1999). Objective distance measures for assessing concatenative speech synthesis. In Sixth European conference on speech communication and technology.
Choi, Y., Jung, Y. & Kim, H. (2020). Deep MOS predictor for synthetic speech using cluster-based modeling. In Proceedings of Interspeech 2020 (pp. 1743–1747).
Eyben, F., Scherer, K. R., Schuller, B. W., Sundberg, J., André, E., Busso, C., et al. (2016). The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Transactions on Affective Computing, 7(2), 190–202.
Eyben, F., Weninger, F., Gross, F. & Schuller, B. (2013). Recent developments in opensmile, the Munich open-source multimedia feature extractor. In Proceedings of the 21st ACM international conference on multimedia (pp. 835–838).
Falk, T. & Chan, W. (2004). Single ended method for objective speech quality assessment in narrowband telephony applications. In ITU-T (p. 563).
Falk, T. H., & Moller, S. (2008). Towards signal-based instrumental quality diagnosis for text-to-speech systems. IEEE Signal Processing Letters, 15, 781–784.
Fu, S. W., Tsao, Y., Hwang, H. T. & Wang, H. M. (2018). Quality-net: An end-to-end non-intrusive speech quality assessment model based on BLSTM. In Proceedings of Interspeech 2018 (pp. 1873–1877).
Gibiansky, A., Arik, S., Diamos, G., Miller, J., Peng, K., Ping, W. & Zhou, Y. (2017). Deep voice 2: Multi-speaker neural text-to-speech. In I. Guyon et al. (Eds.), Advances in neural information processing systems, Vol. 30, (pp. 2962–2970). Curran Associates Inc.
Grancharov, V., Zhao, D. Y., Lindblom, J., & Kleijn, W. B. (2006). Low-complexity, nonintrusive speech quality assessment. IEEE Transactions on Audio, Speech, and Language Processing, 14(6), 1948–1956.
Grice, M., Vagges, K. & Hirst, D. (1992). Prosodic form tests and “prosodic function tests”. SAM final report.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference, and prediction. Springer.
Heute, U. (2008). Speech-transmission quality: Aspects and assessment for wideband vs. narrowband signals. Advances in Digital Speech Transmission, 572.
Hinterleitner, F., Norrenbrock, C., Möller, S. & Heute, U. (2013). Predicting the quality of text-to-speech systems from a large-scale feature set. In Interspeech (pp. 383–387).
Huang, D. Y. (2011). Prediction of perceived sound quality of synthetic speech. In Proceedings of APSIPA.
Hunt, A.J. & Black, A.W. (1996). Unit selection in a concatenative speech synthesis system using a large speech database. In 1996 IEEE international conference on acoustics, speech, and signal processing conference proceedings, Vol. 1, (pp. 373–376).
Jekosch, U. (2006). Voice and speech quality perception: Assessment and evaluation. Springer.
Kim, D. S. (2005). Anique: An auditory model for single-ended speech quality estimation. IEEE Transactions on Speech and Audio Processing, 13(5), 821–831.
Klabbers, E., Van Santen, J. P., & Kain, A. (2007). The contribution of various sources of spectral mismatch to audible discontinuities in a diphone database. IEEE Transactions on Audio, Speech, and Language Processing, 15(3), 949–956.
Lewis, J. R. (2004). Effect of speaker and sampling rate on mos-x ratings of concatenative TTS voices. In Proceedings of the human factors and ergonomics society annual meeting, Vol. 48, (pp. 759–763).
Lo, C. C., Fu, S. W., Huang, W. C., Wang, X., Yamagishi, J., Tsao, Y. & Wang, H. M. (2019). MOSNet: Deep learning-based objective assessment for voice conversion. In Proceedings of Interspeech 2019 (pp. 1541–1545).
Loizou, P. C. (2011). Speech quality assessment. In Multimedia analysis, processing and communications (pp. 623–654). Springer.
Loizou, P. C. (2013). Speech enhancement: Theory and practice speech enhancement: Theory and practice. CRC Press.
Lorenzo-Trueba, J., Yamagishi, J., Toda, T., Saito, D., Villavicencio, F., Kinnunen, T. & Ling, Z. (2018) The voice conversion challenge 2018: Promoting development of parallel and nonparallel methods. In Proceedings of Odyssey 2018 the speaker and language recognition workshop (pp. 195–202).
Malviya, S., Mishra, R., Barnwal, S. K., & Tiwary, U. S. (2021). HDRS: Hindi dialogue restaurant search corpus for dialogue state tracking in task-oriented environment. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, 2517–2528. https://doi.org/10.1109/TASLP.2021.3065833
Malviya, S., Mishra, R. & Tiwary, U. S. (2016). Structural analysis of Hindi phonetics and a method for extraction of phonetically rich sentences from a very large Hindi text corpus. In 2016 conference of O-COCOSDA (pp. 188–193).
Mariniak, A. (1993). A global framework for the assessment of synthetic speech without subjects. In Third European conference on speech communication and technology.
Mayo, C., Clark, R. A., & King, S. (2011). Listeners’ weighting of acoustic cues to synthetic speech naturalness: A multidimensional scaling analysis. Speech Communication, 53(3), 311–326.
Mishra, R., Barnwal, S. K., Malviya, S., Mishra, P. & Tiwary, U. S. (2018). Prosodic feature selection of personality traits for job interview performance. In International conference on intelligent systems design and applications (pp. 673–682).
Möller, S. (2017). Quality engineering: Qualität kommunikationstechnischer Systeme. Springer.
Möller, S., Hinterleitner, F., Falk, T. H. & Polzehl, T. (2010). Comparison of approaches for instrumentally predicting the quality of text-to-speech systems. In Eleventh annual conference of the international speech communication association.
Monzo, C., Iriondo, I. & Socoró, J. C. (2014). Voice quality modelling for expressive speech synthesis. The Scientific World Journal.
Moore, B. C. (2012). An introduction to the psychology of hearing. Brill.
Müller, S., Chan, W., Côté, N., Falk, T. H., Raake, A., & Wältermann, M. (2011). Speech quality estimation: Models and trends. IEEE Signal Processing Magazine, 28(6), 18–28.
Norrenbrock, C. R., Hinterleitner, F., Heute, U., & Moller, S. (2012). Instrumental assessment of prosodic quality for text-to-speech signals. IEEE Signal Processing Letters, 19(5), 255–258.
Norrenbrock, C. R., Hinterleitner, F., Heute, U., & Möller, S. (2015). Quality prediction of synthesized speech based on perceptual quality dimensions. Speech Communication, 66, 17–35.
Novorita, B. (1999). Incorporation of temporal masking effects into bark spectral distortion measure. In Proceedings of ICASSP, Vol. 2, (pp. 665–668).
Pammi, S., Charfuelan, M. & Schröder, M. (2010). Multilingual voice creation toolkit for the mary TTS platform. In LREC.
Papadopoulos, P., Travadi, R. & Narayanan, S. (2017). Global SNR estimation of speech signals for unknown noise conditions using noise adapted non-linear regression. In Proceedings of Interspeech 2017 (pp. 3842–3846).
Parlikar, A., Sitaram, S., Wilkinson, A. & Black, A. W. (2016). The festvox indic frontend for grapheme to phoneme conversion. In WILDRE: Workshop on indian language data-resources and evaluation.
Parrish, W. M. (1951). The concept of “naturalness”. Quarterly Journal of Speech, 37(4), 448–454.
Ping, W., Peng, K., Gibiansky, A., Arik, S.Ö., Kannan, A. , Narang, S. & Miller, J. (2018). Deep voice 3: Scaling text-to-speech with convolutional sequence learning. In ICLR-2018. OpenReview.net.
Prakash, A., Prakash, J. J. & Murthy, H. A. (2016). Acoustic analysis of syllables across Indian languages. In INTERSPEECH (pp. 327–331).
Quackenbush, S. R., Barnwell, T. P., & Clements, M. A. (1988). Objective measures of speech quality. Prentice Hall.
ITU-T Rec. P.85. (1994). A method for subjective performance assessment of the quality of speech voice output devices. International Telecommunication Union.
ITU-T Rec. P.862. (2001). Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs. International Telecommunication Union.
Rosipal, R. & Krämer, N. (2005). Overview and recent advances in partial least squares. In International statistical and optimization perspectives workshop “Subspace, latent structure and feature selection” (pp. 34–51).
Schröder, M. & Hunecke, A. (2007). Creating German unit selection voices for the Mary TTSs platform from the Bits corpora. In Proceedings of SSW6.
Schuller, B. (2006). Automatische emotionserkennung aus sprachlicher und manueller Interaktion (Unpublished doctoral dissertation). Technische Universität München.
Schuller, B., et al. (2009). The Interspeech 2009 emotion challenge. In Proceedings 10th ISCA.
Schuller, B., Steidl, S., Batliner, A., Burkhardt, F., Devillers, L., Müller, C. & Narayanan, S. S. (2010). The Interspeech 2010 paralinguistic challenge. In Proceedings 11th ISCA.
Schuller, B., Steidl, S., Batliner, A., Nöth, E., Vinciarelli, A., Burkhardt, F., & Weiss, B (2012). The Interspeech 2012 speaker trait challenge. In Proceedings 13th ISCA.
Schuller, B., Steidl, S., Batliner, A., Schiel, F., Krajewski, J., et al. (2011). The Interspeech 2011 speaker state challenge. In Proceedings 12th ISCA.
Schuller, B., Steidl, S., Batliner, A., Vinciarelli, A., Scherer, K., Ringeval, F. & Kim, S. (2013). The Interspeech 2013 computational paralinguistics challenge: Social signals, conflict, emotion, autism. In Proceedings 14th ISCA.
Schuller, B.W., Steidl, S., Batliner, A., Hirschberg, J., Burgoon, J.K., Baird, A. & Evanini, K. (2016). The Interspeech 2016 computational paralinguistics challenge: Deception, sincerity & native language. In Interspeech, Vol. 2016, (pp. 2001–2005).
Shen, J., Pang, R., Weiss, R.J., Schuster, M., Jaitly, N., Yang, Z. & Wu, Y. (2018). Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In Proceedings of ICASSP (pp. 4779–4783). IEEE.
Stylianou, Y. & Syrdal, A. K. (2001). Perceptual and objective detection of discontinuities in concatenative speech synthesis. In 2001 IEEE international conference on acoustics, speech, and signal processing. proceedings (Cat. No. 01CH37221), Vol. 2, (pp. 837–840).
Sydeserff, H., Caley, R., Isard, S. D., Jack, M. A., Monaghan, A. I., & Verhoeven, J. (1992). Evaluation of speech synthesis techniques in a comprehension task. Speech Communication, 11(2–3), 189–194.
Taylor, P. (2009). Text-to-speech synthesis. Cambridge University Press.
Thangarajan, R., & Natarajan, A. (2008). Syllable based continuous speech recognition for Tamil. South Asian Language Review, 18(1), 72–85.
Tokuda, K., Kobayashi, T., Masuko, T. & Imai, S. (1994). Mel-generalized cepstral analysis-a unified approach to speech spectral estimation. In Third international conference on spoken language processing.
Uriel, E. (2013). Hypothesis testing in the multiple regression model. Universidad de Valencia, Department of Economics.
Valentini-Botinhao, C., Yamagishi, J. & King, S. (2011). Can objective measures predict the intelligibility of modified hmm-based synthetic speech in noise? In Twelfth annual conference of the international speech communication association.
Valstar, M., Schuller, B., Smith, K., Eyben, F., Jiang, B., Bilakhia, S. & Pantic, M. (2013). Avec 2013: The continuous audio/visual emotion and depression recognition challenge. In Proceedings of the 3rd ACM international workshop on audio/visual emotion challenge (pp. 3–10).
van Bezooijen, R., van Heuven, V., Gibbon, D., Moore, R. & Winski, R. (1997). Assessment of synthesis systems. In D. Gibbon, R. Moore, & R. Winski (Eds.) Handbook of standards and resources for spoken language systems (pp. 481–563).
van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A. & Kavukcuoglu, K. (2016). Wavenet: A generative model for raw audio. In 9th ISCA speech synthesis workshop (pp. 125–125).
van Heuven, V. J. & van Bezooijen, R. (1995). Quality evaluation of synthesized speech. In Speech coding and synthesis (pp. 707–738). Citeseer.
Vepa, J., & King, S. (2006). Subjective evaluation of join cost and smoothing methods for unit selection speech synthesis. IEEE Transactions on Audio, Speech, and Language Processing, 14(5), 1763–1771.
Viswanathan, M., & Viswanathan, M. (2005). Measuring speech quality for text-to-speech systems: Development and assessment of a modified mean opinion score (MOS) scale. Computer Speech & Language, 19(1), 55–83.
Wang, Y., Skerry-Ryan, R., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N. & Saurous, R. A. (2017). Tacotron: Towards end-to-end speech synthesis. In Proceedings of Interspeech 2017 (pp. 4006–4010).
Wei, B. & Gibson, J. D. (2001). Comparison of distance measures in discrete spectral modeling. S. M. U.
Yi, Z., Huang, W. C., Tian, X., Yamagishi, J., Das, R.K., Kinnunen, T. & Toda, T. (2020). Voice conversion challenge 2020: Intra-lingual semi-parallel and cross-lingual voice conversion. In Proceedings of the joint workshop for the blizzard challenge and voice conversion challenge 2020 (pp. 80–98).
Young, S.J., Kershaw, D., Odell, J., Ollason, D., Valtchev, V. & Woodland, P. (2006). The HTK book (Version 3.4). Cambridge University Press.
Zen, H., Tokuda, K., & Black, A. W. (2009). Statistical parametric speech synthesis. Speech Communication, 51(11), 1039–1064.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Malviya, S., Mishra, R., Barnwal, S.K. et al. A framework for quality assessment of synthesised speech using learning-based objective evaluation. Int J Speech Technol 26, 221–243 (2023). https://doi.org/10.1007/s10772-023-10021-4