Abstract
Hidden Markov model (HMM)-based text-to-speech (TTS) synthesis has become one of the most promising approaches, as it provides a particularly flexible and robust framework for generating synthetic speech. However, several factors, such as the mel-cepstral vocoder and over-smoothing, degrade the quality of the synthesized speech. This paper presents an HMM speech synthesis technique based on the modified discrete cosine transform (MDCT) representation that addresses both issues. To this end, we use an MDCT-based analysis/synthesis technique that guarantees perfect reconstruction of the signal frame from the feature vectors and allows a 50% overlap between frames without increasing the size of the data vector, in contrast to conventional mel-cepstral spectral parameters, which do not permit direct reconstruction of the speech waveform. Experimental results, assessed with both objective and subjective tests, show that speech of good quality is obtained.
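To illustrate the two properties the abstract relies on, the following is a minimal NumPy sketch, not the authors' implementation, of the Princen-Bradley MDCT/IMDCT pair with a sine window. It shows perfect reconstruction via time-domain aliasing cancellation (TDAC) and critical sampling: each frame of 2N samples yields only N coefficients despite the 50% frame overlap. All function names are illustrative.

```python
import numpy as np

def mdct_basis(N):
    """Princen-Bradley cosine basis: maps a frame of 2N samples to N coefficients."""
    n = np.arange(2 * N)
    k = np.arange(N)
    return np.cos(np.pi / N * (n[None, :] + 0.5 + N / 2) * (k[:, None] + 0.5))

def sine_window(N):
    """Sine window; satisfies the Princen-Bradley condition w[n]^2 + w[n+N]^2 = 1."""
    return np.sin(np.pi / (2 * N) * (np.arange(2 * N) + 0.5))

def mdct_analysis(x, N):
    """Slice x into 50%-overlapping frames of 2N samples; return N MDCT coefficients per frame."""
    w, B = sine_window(N), mdct_basis(N)
    n_frames = len(x) // N - 1
    return np.stack([B @ (w * x[i * N:i * N + 2 * N]) for i in range(n_frames)])

def mdct_synthesis(C, N):
    """Inverse-transform each frame, window again, and overlap-add; the time-domain
    aliasing introduced by each frame is cancelled by its neighbours (TDAC)."""
    w, B = sine_window(N), mdct_basis(N)
    y = np.zeros((C.shape[0] + 1) * N)
    for i, c in enumerate(C):
        y[i * N:i * N + 2 * N] += w * ((2.0 / N) * (B.T @ c))
    return y

# Quick check: reconstruction is exact away from the first/last half-frame edges.
N = 256
x = np.random.default_rng(0).standard_normal(16 * N)
C = mdct_analysis(x, N)   # 15 frames x 256 coefficients: no data expansion
y = mdct_synthesis(C, N)
print(np.max(np.abs(y[N:-N] - x[N:-N])))   # on the order of 1e-13
```

Note the contrast with mel-cepstral parameters: here the synthesis stage inverts the analysis stage exactly, so the feature vectors themselves carry enough information to rebuild the waveform without a vocoder in between.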