Abstract
Designing text-to-speech systems capable of producing natural-sounding speech in different Indian languages is a challenging and ongoing problem. Because Indian languages have a large number of possible pronunciations, a concatenative speech synthesis technique must store a large number of speech segments in its speech database to achieve highly natural speech. However, the resulting database size makes such systems unusable on small handheld devices or human-computer interactive systems with limited storage resources. In this paper, we propose a fraction-based waveform concatenation technique that produces intelligible speech segments from a small-footprint speech database. The results of all the experiments performed show that the proposed technique produces intelligible speech segments in different Indian languages with far less storage and computation overhead than the existing syllable-based technique.
About this article
Cite this article
Panda, S.P., Nayak, A.K. A waveform concatenation technique for text-to-speech synthesis. Int J Speech Technol 20, 959–976 (2017). https://doi.org/10.1007/s10772-017-9463-8
DOI: https://doi.org/10.1007/s10772-017-9463-8