Abstract
In this paper, speech coding techniques are integrated into a Mandarin text-to-speech (TTS) system. By exploiting the intrinsic properties of Mandarin, we encode the acoustic features of 408 syllabic utterances into templates, each containing the modeling parameters needed for speech synthesis. As a result, the developed TTS system requires only 36 Kbytes to store all syllabic templates.
In the synthesis stage, modeling parameters retrieved from the templates are modified according to the prosody estimated from a hierarchically layered model. To gauge the overall performance of this TTS system, we conduct listening tests, which yield 86.4% intelligibility and 97% comprehensibility. A simplified Mandarin TTS system is also implemented on an FPGA development board. The FPGA realization suggests that such a TTS synthesizer can be readily incorporated into other portable devices as a voice interface.
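The storage figure above implies a very small budget per syllable. As a rough illustration, the sketch below divides the stated 36 Kbytes across the 408 templates. It assumes that one Kbyte means 1024 bytes and that all templates are the same size; neither assumption is stated in the abstract, and the function and constant names are illustrative only.

```python
# Back-of-the-envelope storage budget per syllable template.
# Figures from the abstract: 408 templates, 36 Kbytes in total.
NUM_TEMPLATES = 408
TOTAL_KBYTES = 36
BYTES_PER_KBYTE = 1024  # assumption: binary kilobyte; pass 1000 for decimal


def bytes_per_template(total_kbytes=TOTAL_KBYTES,
                       num_templates=NUM_TEMPLATES,
                       bytes_per_kbyte=BYTES_PER_KBYTE):
    """Average bytes available per template, assuming equal-sized templates."""
    return total_kbytes * bytes_per_kbyte / num_templates


if __name__ == "__main__":
    avg = bytes_per_template()
    print(f"{avg:.1f} bytes (about {avg * 8:.0f} bits) per template on average")
```

Under these assumptions each template averages roughly 90 bytes, which gives a sense of how compactly the modeling parameters must be coded.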
Cite this article
Hu, HT., Wang, HM. Integrating coding techniques into LP-based Mandarin text-to-speech synthesis. Int J Speech Technol 10, 31–44 (2007). https://doi.org/10.1007/s10772-008-9015-3
DOI: https://doi.org/10.1007/s10772-008-9015-3