Abstract
Speech generated by hidden Markov model (HMM)-based speech synthesis systems (HTS) suffers from a ‘buzzing’ quality caused by an over-simplified vocoding technique. This paper proposes a new excitation model that uses a pitch-scaled spectrum for the parametric representation of speech in HTS. The residual signal produced by inverse filtering retains the detailed harmonic structure of speech that is not captured by the linear prediction (LP) spectrum. Using pitch-scaled spectra, we compensate the LP spectrum with the detailed harmonic structure of the residual signal. This spectrum can be compressed into a periodic excitation parameter so that it can be used to train HTS. We define an aperiodicity measure as the harmonics-to-noise ratio, and calculate a voicing cut-off frequency by fitting the aperiodicity measure to a sigmoid function. We combine the LP coefficients, pitch-scaled spectrum, and sigmoid function to form a new parametric representation of speech. Listening tests were carried out to evaluate the effectiveness of the proposed technique. The vocoder received a mean opinion score of 4.0 in analysis-synthesis experiments, before dimensionality reduction. By integrating this vocoder into HTS, we improved the quality of the synthesized speech compared with the pulse-train excitation model, and achieved even better results than STRAIGHT-based HTS.
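The two analysis steps named in the abstract — LP inverse filtering to expose the residual's harmonic structure, and a spectrum taken over a pitch-scaled window so that harmonics align with DFT bins — can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the Hanning window, and the four-period window length are choices made here for clarity.

```python
import numpy as np

def lp_residual(x, order=12):
    """Estimate LP coefficients with the autocorrelation method
    (Levinson-Durbin) and inverse-filter the signal with A(z) to
    obtain the residual. Pre-emphasis and windowing omitted for brevity."""
    # Autocorrelation sequence r[0..order]
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                      # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    # Inverse filtering: pass x through the FIR prediction-error filter A(z)
    residual = np.convolve(x, a)[:len(x)]
    return a, residual

def pitch_scaled_spectrum(frame, f0, fs, periods=4):
    """Magnitude spectrum over a window spanning an integer number of
    pitch periods, so every harmonic of f0 falls exactly on a DFT bin."""
    n = int(round(periods * fs / f0))
    seg = frame[:n] * np.hanning(n)
    return np.abs(np.fft.rfft(seg))
```

With a four-period window, the k-th harmonic of f0 lands on DFT bin 4k, which is what makes the harmonic structure of the residual directly readable from the pitch-scaled spectrum.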
Acknowledgments
This work was supported by the NSFC-JSPS joint project (No. 61011140075), the China-Singapore Institute of Digital Media (CSIDM), and the National Science Foundation of China (Nos. 61273288, 61233009, 60873160, 90820303, and 61203258). The authors would like to thank Hideki Kawahara; the first author learned a great deal during a three-month stay in Prof. Kawahara’s lab, and his guidance significantly helped in preparing this research.
Cite this article
Wen, Z., Tao, J., Pan, S. et al. Pitch-Scaled Spectrum Based Excitation Model for HMM-based Speech Synthesis. J Sign Process Syst 74, 423–435 (2014). https://doi.org/10.1007/s11265-013-0862-z