$$\hbox {F}_{0}$$ contour generation and synthesis using Bengali Hmm-based speech synthesis system

Mukherjee, Sankar; Mandal, Shyamal Kumar Das

doi:10.1007/s10772-014-9247-3

$\hbox {F}_{0}$ contour generation and synthesis using Bengali Hmm-based speech synthesis system

Published: 17 August 2014

Volume 18, pages 25–36, (2015)
Cite this article

International Journal of Speech Technology Aims and scope Submit manuscript

Sankar Mukherjee¹ &
Shyamal Kumar Das Mandal¹

187 Accesses
Explore all metrics

Abstract

HMM based Bengali speech synthesis system (Bengali-HTS) generates highly intelligible synthesized speech but its naturalness is not adequate even though it is trained with a very good amount of speech corpus. In case of interrogative, imperative and exclamatory sentences, naturalness of the synthesized speech falls drastically. This paper proposes a method to overcome this problem by modifying the $\hbox {F}_{0}$ contour of synthetic speech based on Fujisaki model. The Fujisaki model features for different types of Bengali sentences are analyzed for the generation of $\hbox {F}_{0}$ contour. These features depend on prosodic word/phrase boundary of the sentence. So a two layer supervised classification and regression tree is trained to predict the prosodic word/phrase boundary. Fujisaki model then generates $\hbox {F}_{0}$ contour from input text using the prosodic word/phrase boundary and segmental duration information from HMM-based speech synthesis system. Moreover, for HMM training purpose, prosodic structure of sentence has been employed rather than lexical structure. From MOS and preference test it is found that proposed method significantly improved the overall quality of synthesized speech than that of Bengali-HTS.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Hidden-Markov-model based statistical parametric speech synthesis for Marathi with optimal number of hidden states

Article 05 December 2018

An efficient adaptive artificial neural network based text to speech synthesizer for Hindi language

Article 09 April 2021

Superpositional HMM-Based Intonation Synthesis Using a Functional F0 Model

Article 19 May 2015

Discover the latest articles, news and stories from top researchers in related subjects.

Artificial Intelligence

References

Anastasakos, T., McDonough, J., Schwartz, R., & Makhoul, J. (1996). A compact model for speaker-adaptive training. Proceedings of Fourth International Conference on Spoken Language Processing, 2, 1137–1140.
Article Google Scholar
Black, A. (2002). Perfect synthesis for all of the people all of the time. In Proceedings of IEEE Speech Synthesis Workshop.
Black, A., & Taylor, P. (1997). Automatically clustering similar units for unit selection in speech synthesis. In Proceedings of Eurospeech (pp. 601–604).
Bulyko, I., Ostendorf, M., & Bilmes, J. (2002). Robust splicing costs and efficient search with BMM models for concatenative speech synthesis. In Proceedings of ICASSP (pp. 461–464).
Campbell, N., & Black, A. (1996). Prosody and the selection of source units for concatenative synthesis. Progress in Speech Synthesis, 3, 279–292.
Google Scholar
Chen, C.-P., Huang, Y.-C., Wu, C.-H. & Lee, K.-D. (2012). Cross-lingual frame selection method for polyglot speech synthesis. In Proceeding of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4521–4524).
Dines, J., & Sridharan, S. (2001). Trainable speech synthesis with trended hidden Markov models. In Proceedings of ICASSP (pp. 833–837).
Donovan, R., & Woodland, P. (1995). Improvements in an HMM-based speech synthesiser. In Proceedings of Eurospeech (pp. 573–576).
Fujisaki, H., & Hirose, K. (1984). Analysis of voice fundamental frequency contours for declarative sentences of japanese. Journal of the Acoustical Society of Japan (E), 5(4), 233–242.
Article Google Scholar
Fukada, T., Tokuda, K., Kobayashi, T., & Imai, S. (1992). An adaptive algorithm for mel-cepstral analysis of speech. Proceeding of IEEE International Conference on Acoustics, Speech, and Signal Processing, 1, 137–140.
Google Scholar
Gao, B.-H., Qian, Y., Wu, Z.-Z., & Soong, F.-K. (2008). Duration refinement by jointly optimizing state and longer unit likelihood. In Proceedings of Interspeech (pp. 2266–2269).
Hsia, C.-C., Wu, C.-H., & Wu, J.-Q. (2007). Conversion function clustering and selection using linguistic and spectral information for emotional voice conversion. IEEE Transactions on Computers, 56(9), 1245–1254.
Article MathSciNet Google Scholar
Hunt, A., & Black, A. (1996). Unit selection in a concatenative speech synthesis system using a large speech database. In Proceedings of ICASSP (pp. 373–376).
Kawai, H., & Tsuzaki, M. (2002). A study on time-dependent voice quality variation in a large-scale single speaker speech corpus used for speech synthesis. In Proceedings of IEEE Speech Synthesis Workshop.
Ling, Z.-H., Wu, Y.-J., Wang, Y.-P., Qin, L., & Wang, R.-H. (2006). USTC system for Blizzard challenge 2006—an improved HMM-based speech synthesis method. In Proceedings of Blizzard Challenge Workshop.
Mandal, S. D., Saha, A., & Datta, A. (2005). Annotated speech corpora development in Indian languages. Vishwa Bharat, 6, 49–64.
Google Scholar
Mandal, S. D., Warsi, A. H., Basu, T., Hirose, K., & Fujisaki, H. (2010). Analysis and synthesis of $\text{ F }_{0}$ contours for bangla readout speech. In Proceedings of Oriental COCOSDA, Kathmandu, Nepal.
Mukherjee, S. & Mandal, S. D. (2012). A Bengali HMM based speech synthesis system. In Proceeding of International Conference on Speech Database and Assessments (Oriental COCOSDA) (pp. 225–259).
Mukherjee, S. & Mandal, S. D. (2013). Bengali parts-of-speech tagging using global linear model. In Proceeding of IEEE INDICON -2013.
Oura, K., Zen, H., Nankaku, Y., Lee, A., & Tokuda, K. (2007). Postfiltering for HMM-based speech synthesis using mel-LSPs. In Proceedings of Autumn Meeting of ASJ (pp. 367–368) (in Japanese).
Picard, R. W. (1997). Affective computing. Cambridge: MIT Press.
Google Scholar
Qian, Y., Xu, J., & Soong, F. K. (2011). A frame mapping based HMM approach to cross-lingual voice transformation. In Proceeding of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5120–5123).
Shi, Y., Chang, E., Peng, H., & Chu, M. (2002). Power spectral density based channel equalization of large speech database for concatenative TTS system. In Proceedings of ICSLP (pp. 2369–2372).
Stylianou, Y. (1999). Assessment and correction of voice quality variabilities in large speech databases for concatenative speech synthesis. In Proceedings of ICASSP (pp. 377–380).
Sun, J.-W., Ding, F., & Wu, Y.-H. (2009). Polynomial segment model based statistical parametric speech synthesis system. In Proceedings of ICASSP (pp. 4021–4024).
Toda, T., & Tokuda, K. (2007). A speech parameter generation algorithm considering global variance for HMM-based speech synthesis. IEICE Transactions on Information and Systems E Series D, 90(5), 816–824.
Article Google Scholar
Tokuda, K., Yoshimura, T., Masuko, T., Kobayashi, T., & Kitamura, T. (2000). Speech parameter generation algorithms for HMM-based speech synthesis. Proceeding of IEEE International Conference on Acoustics, Speech, and Signal Processing, 3, 1315–1318.
Google Scholar
Tokuda, K., Zen, H., & Black, A.W. (2002). An HMM-based speech synthesis system applied to english. In Proceedings of IEEE Workshop on Speech Synthesis (pp. 227–230).
Tseng, C.-Y. (2006). Higher level organization and discourse prosody. The Second International Symposium on Tonal Aspects of Languages (pp. 23–34).
Warsi, A. H., Basu, T., Hirose, K., & Fujisaki, H. (2012). Analysis and synthesis of $\text{ F }_{0}$ contours of declarative, interrogative, and imperative utterances of bangla. In Proceeding of International Conference on Speech Database and Assessments (Oriental COCOSDA) (pp. 56–61).
Yoshimura, T., Tokuda, K., Masuko, T., Kobayashi, T., & Kitamura, T. (1998). Duration modeling for HMM-based speech synthesis. In Proceedings of ICSLP (pp. 29–32).
Yoshimura, T., Tokuda, K., Masuko, T., Kobayashi, T., & Kitamura, T. (1999). Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proceedings of Eurospeech (pp. 2347–2350).
Yoshimura, T., Tokuda, K., Masuko, T., Kobayashi, T., & Kitamura, T. (2001). Mixed excitation for HMM-based speech synthesis. In Proceedings of Eurospeech (pp. 2263–2266).
Zen, H., Toda, T., & Tokuda, K. (2006). The Nitech-NAIST HMM-based speech synthesis system for the Blizzard challenge 2006. In Proceedings of Blizzard Challenge Workshop.
Zen, H., Tokuda, K., Masuko, T., Kobayashi, T., & Kitamura, T. (2007a). A hidden semi-Markov model-based speech synthesis system. IEICE Transactions on Information and Systems E Series D, 90(5), 825–834.
Article Google Scholar
Zen, H., Nose, T., Yamagishi, J., Sako, S., Masuko, T., Black, A. W., & Tokuda, K. (2007b). The HMM-based speech synthesis system (hts) version 2.0. In Proceedings of Sixth ISCA Workshop on Speech Synthesis (pp. 294–299).

Download references

Author information

Authors and Affiliations

Centre for Educational Technology, Indian Institute of Technology Kharagpur, Kharagpur , 721302, West Bengal, India
Sankar Mukherjee & Shyamal Kumar Das Mandal

Authors

Sankar Mukherjee
View author publications
You can also search for this author in PubMed Google Scholar
Shyamal Kumar Das Mandal
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Sankar Mukherjee.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Mukherjee, S., Mandal, S.K.D. $\hbox {F}_{0}$ contour generation and synthesis using Bengali Hmm-based speech synthesis system. Int J Speech Technol 18, 25–36 (2015). https://doi.org/10.1007/s10772-014-9247-3

Download citation

Received: 23 October 2013
Accepted: 31 July 2014
Published: 17 August 2014
Issue Date: March 2015
DOI: https://doi.org/10.1007/s10772-014-9247-3

Keywords

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

\(\hbox {F}_{0}\) contour generation and synthesis using Bengali Hmm-based speech synthesis system

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Hidden-Markov-model based statistical parametric speech synthesis for Marathi with optimal number of hidden states

An efficient adaptive artificial neural network based text to speech synthesizer for Hindi language

Superpositional HMM-Based Intonation Synthesis Using a Functional F0 Model

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Subscribe and save

Buy Now

Navigation

\(\hbox {F}_{0}\) contour generation and synthesis using Bengali Hmm-based speech synthesis system

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Hidden-Markov-model based statistical parametric speech synthesis for Marathi with optimal number of hidden states

An efficient adaptive artificial neural network based text to speech synthesizer for Hindi language

Superpositional HMM-Based Intonation Synthesis Using a Functional F0 Model

Explore related subjects

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now

Search

Navigation