A waveform concatenation technique for text-to-speech synthesis

International Journal of Speech Technology

Abstract

Designing text-to-speech systems that produce natural-sounding speech in different Indian languages is a challenging and ongoing problem. Because Indian languages have a large number of possible pronunciations, a concatenative speech synthesis technique must store a large number of speech segments in its speech database to achieve highly natural speech. However, the resulting database size makes the approach unusable on small handheld devices or human-computer interactive systems with limited storage. In this paper, we propose a fraction-based waveform concatenation technique that produces intelligible speech segments from a small-footprint speech database. The results of all the experiments performed show that the proposed technique produces intelligible speech segments in different Indian languages with far less storage and computation overhead than the existing syllable-based technique.

Author information

Corresponding author

Correspondence to Soumya Priyadarsini Panda.

About this article

Cite this article

Panda, S.P., Nayak, A.K. A waveform concatenation technique for text-to-speech synthesis. Int J Speech Technol 20, 959–976 (2017). https://doi.org/10.1007/s10772-017-9463-8

  • DOI: https://doi.org/10.1007/s10772-017-9463-8
