LSTM-Based Speech Segmentation for TTS Synthesis

  • Conference paper
  • In: Text, Speech, and Dialogue (TSD 2019)
  • Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11697)

Abstract

This paper describes experiments on speech segmentation for the purposes of text-to-speech synthesis. We used a bidirectional LSTM neural network for framewise phone classification and another bidirectional LSTM network for predicting the duration of particular phones. The proposed segmentation procedure combines both outputs and finds the optimal speech-phoneme alignment using a dynamic programming approach. We introduced two modifications to increase the robustness of phoneme classification. Experiments were performed on two professional and two amateur voices. A comparison with a reference HMM-based segmentation with additional manual corrections was performed. Preference listening tests showed that the reference and experimental segmentations are equivalent when used in a unit selection TTS system.
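As a reading aid, the following is a minimal sketch of the kind of dynamic-programming alignment the abstract describes: framewise phone log-posteriors (here `log_post`, one column per phone of the expected phone sequence) are combined with a per-phone duration log-score (`dur_logscore`). All names, the `max_dur` limit and the scoring details are illustrative assumptions, not the authors' implementation.

    import numpy as np

    def align(log_post, dur_logscore, max_dur=100):
        """Align T frames to P phones (columns of log_post) by dynamic programming,
        maximising framewise log-posteriors plus a per-phone duration log-score.
        Returns the end frame (exclusive) of each phone."""
        T, P = log_post.shape
        # cum[t, i] = sum of log_post[0:t, i]; a span score is a difference of two entries
        cum = np.vstack([np.zeros((1, P)), np.cumsum(log_post, axis=0)])

        NEG = -1e30
        best = np.full((T + 1, P + 1), NEG)         # best[t, i]: frames 0..t-1 covered by phones 0..i-1
        back = np.zeros((T + 1, P + 1), dtype=int)  # duration chosen for phone i-1 ending at frame t
        best[0, 0] = 0.0

        for i in range(1, P + 1):
            for t in range(i, T + 1):               # at least one frame per phone
                for d in range(1, min(max_dur, t) + 1):
                    prev = best[t - d, i - 1]
                    if prev <= NEG / 2:             # unreachable predecessor state
                        continue
                    acoustic = cum[t, i - 1] - cum[t - d, i - 1]
                    score = prev + acoustic + dur_logscore(i - 1, d)
                    if score > best[t, i]:
                        best[t, i] = score
                        back[t, i] = d

        assert best[T, P] > NEG / 2, "no full alignment found"
        ends, t = [], T                             # backtrack from the final state
        for i in range(P, 0, -1):
            ends.append(t)
            t -= back[t, i]
        return ends[::-1]

    # Toy usage: 20 frames, 3 phones, flat (uninformative) duration model.
    rng = np.random.default_rng(0)
    toy_post = np.log(rng.dirichlet(np.ones(3), size=20))
    print(align(toy_post, lambda i, d: 0.0))

The search is O(T · P · max_dur); restricting candidate durations, e.g. to the support of the predicted duration histogram, keeps it tractable for whole utterances.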

This research was supported by the Technology Agency of the Czech Republic, project No. TH02010307, and by the Ministry of Education, Youth and Sports of the Czech Republic, project No. LO1506. Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the programme CESNET LM2015042, is greatly appreciated.


Notes

  1. The shape of the histogram can be interpolated to a one-frame resolution, but this has only a marginal effect on the resulting alignment (see the sketch after these notes).

  2. The pronunciation could also be distorted in various ways, or the text and speech might not match exactly. However, this problem is beyond the scope of our research.
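Note 1 above mentions interpolating the duration histogram to a one-frame resolution. A minimal sketch of such an interpolation follows; the bin layout and the 5-frame bin width are assumptions made for illustration, not values from the paper:

    import numpy as np

    # Assumed coarse duration histogram: probabilities over 5-frame-wide bins
    bin_centers = np.array([2.5, 7.5, 12.5, 17.5, 22.5])  # bin centres in frames (assumed layout)
    hist = np.array([0.05, 0.40, 0.35, 0.15, 0.05])       # predicted bin probabilities

    frames = np.arange(1, 26)                         # candidate durations of 1..25 frames
    per_frame = np.interp(frames, bin_centers, hist)  # linear interpolation of the histogram shape
    per_frame /= per_frame.sum()                      # renormalise to a proper distribution
    dur_logscore = np.log(per_frame + 1e-12)          # per-frame duration log-scores for the alignment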


Author information

Corresponding author

Correspondence to Zdeněk Hanzlíček.



Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Hanzlíček, Z., Vít, J., Tihelka, D. (2019). LSTM-Based Speech Segmentation for TTS Synthesis. In: Ekštein, K. (ed.) Text, Speech, and Dialogue. TSD 2019. Lecture Notes in Computer Science (LNAI), vol. 11697. Springer, Cham. https://doi.org/10.1007/978-3-030-27947-9_31

  • DOI: https://doi.org/10.1007/978-3-030-27947-9_31

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-27946-2

  • Online ISBN: 978-3-030-27947-9

  • eBook Packages: Computer Science (R0)
