LSTM-Based Speech Segmentation for TTS Synthesis

  • Conference paper
  • In: Text, Speech, and Dialogue (TSD 2019)
  • Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11697)

Abstract

This paper describes experiments on speech segmentation for the purposes of text-to-speech synthesis. We used a bidirectional LSTM neural network for framewise phone classification and another bidirectional LSTM network for predicting the duration of particular phones. The proposed segmentation procedure combines both outputs and finds the optimal speech-phoneme alignment using a dynamic programming approach. We introduced two modifications to increase the robustness of phoneme classification. Experiments were performed on two professional and two amateur voices. A comparison with a reference HMM-based segmentation with additional manual corrections was performed. Preference listening tests showed that the reference and experimental segmentations are equivalent when used in a unit selection TTS system.
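As a reading aid, the following is a minimal sketch of the kind of dynamic-programming alignment the abstract describes: framewise phone log-posteriors (here `log_post`, one column per phone of the expected phone sequence) are combined with a per-phone duration log-score (`dur_logscore`). All names, the `max_dur` limit and the scoring details are illustrative assumptions, not the authors' implementation.

    import numpy as np

    def align(log_post, dur_logscore, max_dur=100):
        """Align T frames to P phones (columns of log_post) by dynamic programming,
        maximising framewise log-posteriors plus a per-phone duration log-score.
        Returns the end frame (exclusive) of each phone."""
        T, P = log_post.shape
        # cum[t, i] = sum of log_post[0:t, i]; a span score is a difference of two entries
        cum = np.vstack([np.zeros((1, P)), np.cumsum(log_post, axis=0)])

        NEG = -1e30
        best = np.full((T + 1, P + 1), NEG)         # best[t, i]: frames 0..t-1 covered by phones 0..i-1
        back = np.zeros((T + 1, P + 1), dtype=int)  # duration chosen for phone i-1 ending at frame t
        best[0, 0] = 0.0

        for i in range(1, P + 1):
            for t in range(i, T + 1):               # at least one frame per phone
                for d in range(1, min(max_dur, t) + 1):
                    prev = best[t - d, i - 1]
                    if prev <= NEG / 2:             # unreachable predecessor state
                        continue
                    acoustic = cum[t, i - 1] - cum[t - d, i - 1]
                    score = prev + acoustic + dur_logscore(i - 1, d)
                    if score > best[t, i]:
                        best[t, i] = score
                        back[t, i] = d

        assert best[T, P] > NEG / 2, "no full alignment found"
        ends, t = [], T                             # backtrack from the final state
        for i in range(P, 0, -1):
            ends.append(t)
            t -= back[t, i]
        return ends[::-1]

    # Toy usage: 20 frames, 3 phones, flat (uninformative) duration model.
    rng = np.random.default_rng(0)
    toy_post = np.log(rng.dirichlet(np.ones(3), size=20))
    print(align(toy_post, lambda i, d: 0.0))

The search is O(T · P · max_dur); restricting candidate durations, e.g. to the support of the predicted duration histogram, keeps it tractable for whole utterances.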

This research was supported by the Technology Agency of the Czech Republic, project No. TH02010307, and by the Ministry of Education, Youth and Sports of the Czech Republic, project No. LO1506. Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the programme CESNET LM2015042, is greatly appreciated.


Notes

  1. The shape of the histogram can be interpolated to a one-frame resolution, but this has only a marginal effect on the resulting alignment (see the sketch after these notes).

  2. The pronunciation could also be distorted in various ways, or the text and speech might not match exactly. However, this problem is beyond the scope of our research.
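Note 1 above mentions interpolating the duration histogram to a one-frame resolution. A minimal sketch of such an interpolation follows; the bin layout and the 5-frame bin width are assumptions made for illustration, not values from the paper:

    import numpy as np

    # Assumed coarse duration histogram: probabilities over 5-frame-wide bins
    bin_centers = np.array([2.5, 7.5, 12.5, 17.5, 22.5])  # bin centres in frames (assumed layout)
    hist = np.array([0.05, 0.40, 0.35, 0.15, 0.05])       # predicted bin probabilities

    frames = np.arange(1, 26)                         # candidate durations of 1..25 frames
    per_frame = np.interp(frames, bin_centers, hist)  # linear interpolation of the histogram shape
    per_frame /= per_frame.sum()                      # renormalise to a proper distribution
    dur_logscore = np.log(per_frame + 1e-12)          # per-frame duration log-scores for the alignment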


Author information

Corresponding author

Correspondence to Zdeněk Hanzlíček.



Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Hanzlíček, Z., Vít, J., Tihelka, D. (2019). LSTM-Based Speech Segmentation for TTS Synthesis. In: Ekštein, K. (ed.) Text, Speech, and Dialogue. TSD 2019. Lecture Notes in Computer Science (LNAI), vol. 11697. Springer, Cham. https://doi.org/10.1007/978-3-030-27947-9_31

  • DOI: https://doi.org/10.1007/978-3-030-27947-9_31

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-27946-2

  • Online ISBN: 978-3-030-27947-9

  • eBook Packages: Computer Science (R0)
