Abstract
Hidden Markov model (HMM)-based text-to-speech (TTS) synthesis has become one of the most promising approaches, as it provides a particularly flexible and robust framework for generating synthetic speech. However, several factors, such as the mel-cepstral vocoder and over-smoothing, degrade the quality of the synthesized speech. This paper presents an HMM speech synthesis technique based on the modified discrete cosine transform (MDCT) representation that addresses both issues. To this end, we use an MDCT-based analysis/synthesis technique that guarantees perfect reconstruction of the signal frame from the feature vectors and allows a 50% overlap between frames without increasing the size of the data vector, in contrast to conventional mel-cepstral spectral parameters, which do not permit direct reconstruction of the speech waveform. Experimental results, assessed with both objective and subjective tests, show that speech of good quality is obtained.
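To illustrate the two properties the abstract relies on, the following is a minimal NumPy sketch, not the authors' implementation, of the Princen-Bradley MDCT/IMDCT pair with a sine window. It shows perfect reconstruction via time-domain aliasing cancellation (TDAC) and critical sampling: each frame of 2N samples yields only N coefficients despite the 50% frame overlap. All function names are illustrative.

```python
import numpy as np

def mdct_basis(N):
    """Princen-Bradley cosine basis: maps a frame of 2N samples to N coefficients."""
    n = np.arange(2 * N)
    k = np.arange(N)
    return np.cos(np.pi / N * (n[None, :] + 0.5 + N / 2) * (k[:, None] + 0.5))

def sine_window(N):
    """Sine window; satisfies the Princen-Bradley condition w[n]^2 + w[n+N]^2 = 1."""
    return np.sin(np.pi / (2 * N) * (np.arange(2 * N) + 0.5))

def mdct_analysis(x, N):
    """Slice x into 50%-overlapping frames of 2N samples; return N MDCT coefficients per frame."""
    w, B = sine_window(N), mdct_basis(N)
    n_frames = len(x) // N - 1
    return np.stack([B @ (w * x[i * N:i * N + 2 * N]) for i in range(n_frames)])

def mdct_synthesis(C, N):
    """Inverse-transform each frame, window again, and overlap-add; the time-domain
    aliasing introduced by each frame is cancelled by its neighbours (TDAC)."""
    w, B = sine_window(N), mdct_basis(N)
    y = np.zeros((C.shape[0] + 1) * N)
    for i, c in enumerate(C):
        y[i * N:i * N + 2 * N] += w * ((2.0 / N) * (B.T @ c))
    return y

# Quick check: reconstruction is exact away from the first/last half-frame edges.
N = 256
x = np.random.default_rng(0).standard_normal(16 * N)
C = mdct_analysis(x, N)   # 15 frames x 256 coefficients: no data expansion
y = mdct_synthesis(C, N)
print(np.max(np.abs(y[N:-N] - x[N:-N])))   # on the order of 1e-13
```

Note the contrast with mel-cepstral parameters: here the synthesis stage inverts the analysis stage exactly, so the feature vectors themselves carry enough information to rebuild the waveform without a vocoder in between.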