Abstract
This paper describes experiments on speech segmentation for text-to-speech synthesis. We used one bidirectional LSTM neural network for framewise phone classification and another for predicting phone durations. The proposed segmentation procedure combines both outputs and finds the optimal speech-phoneme alignment using dynamic programming. We introduced two modifications to increase the robustness of phoneme classification. Experiments were performed on two professional and two amateur voices. The results were compared with a reference HMM-based segmentation with additional manual corrections. Preference listening tests showed that the reference and experimental segmentations are equivalent when used in a unit selection TTS system.
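The alignment step can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the classifier and duration outputs are combined, so the sketch simply assumes that framewise log posteriors and a per-phone duration log score are added together. The names `align`, `log_post`, `log_dur`, and `max_dur` are hypothetical.

```python
import math

def align(log_post, phones, log_dur, max_dur):
    """Split T frames into len(phones) consecutive segments.

    log_post: T rows of framewise log posteriors, one value per phone class
              (assumed output of a framewise classifier).
    phones:   phone class indices taken from the known transcription.
    log_dur:  callable (n, d) -> log score of the n-th phone lasting d frames
              (assumed output of a duration model).
    Returns len(phones) + 1 boundary frame indices, from 0 to T.
    """
    T, N = len(log_post), len(phones)
    NEG = -math.inf
    # cum[n][t]: summed log posterior of the n-th phone over frames 0..t-1
    cum = [[0.0] * (T + 1) for _ in range(N)]
    for n, p in enumerate(phones):
        for t in range(T):
            cum[n][t + 1] = cum[n][t] + log_post[t][p]
    # best[n][t]: best score for the first n phones covering the first t frames
    best = [[NEG] * (T + 1) for _ in range(N + 1)]
    back = [[0] * (T + 1) for _ in range(N + 1)]
    best[0][0] = 0.0
    for n in range(1, N + 1):
        for t in range(n, T + 1):
            for d in range(1, min(max_dur, t) + 1):
                prev = best[n - 1][t - d]
                if prev == NEG:
                    continue
                score = prev + cum[n - 1][t] - cum[n - 1][t - d] + log_dur(n - 1, d)
                if score > best[n][t]:
                    best[n][t], back[n][t] = score, d
    if best[N][T] == NEG:
        raise ValueError("no alignment possible with this max_dur")
    # Trace the stored durations backwards to recover the boundaries.
    bounds, t = [T], T
    for n in range(N, 0, -1):
        t -= back[n][t]
        bounds.append(t)
    return bounds[::-1]

# Toy input: four frames favouring phone 0, then two frames favouring phone 1.
log_post = [[math.log(0.9), math.log(0.1)]] * 4 + [[math.log(0.1), math.log(0.9)]] * 2
print(align(log_post, [0, 1], lambda n, d: 0.0, max_dur=6))  # [0, 4, 6]
```

With a flat duration score the boundary follows the classifier alone. A duration score that strongly prefers particular lengths can move the boundary, which is the kind of trade-off a combined procedure has to resolve.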
This research was supported by the Technology Agency of the Czech Republic, project No. TH02010307, and by the Ministry of Education, Youth and Sports of the Czech Republic, project No. LO1506. Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the programme CESNET LM2015042, is greatly appreciated.
Notes
- 1. The shape of the histogram can be interpolated to a one-frame resolution, but this has only a marginal effect on the resulting alignment.
- 2. The pronunciation could also be distorted in various ways, or the text and speech might not match exactly. However, this problem is outside the scope of our research.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Hanzlíček, Z., Vít, J., Tihelka, D. (2019). LSTM-Based Speech Segmentation for TTS Synthesis. In: Ekštein, K. (eds) Text, Speech, and Dialogue. TSD 2019. Lecture Notes in Computer Science, vol 11697. Springer, Cham. https://doi.org/10.1007/978-3-030-27947-9_31
DOI: https://doi.org/10.1007/978-3-030-27947-9_31
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-27946-2
Online ISBN: 978-3-030-27947-9
eBook Packages: Computer Science, Computer Science (R0)