Abstract
Appropriate prosodic phrasing of the input text is crucial for natural-sounding speech synthesis output. This paper focuses on using a Text-to-Text Transfer Transformer (T5) to predict phrase boundaries in text and examines whether enriching the input text with more detailed information can improve on the success rate of a phrasing model trained on plain text alone. The idea stems from our previous research on phrasing, which showed that more detailed syntactic/semantic information might lead to more accurate prediction of phrase boundaries.
This research was supported by the Czech Science Foundation (GA CR), project No. GA21-14758S, and by the grant of the University of West Bohemia, project No. SGS-2022-017.
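The abstract's framing of phrasing as a text-to-text task can be illustrated with a small sketch. The boundary marker ("|") and the word/tag enrichment format below are illustrative assumptions for this sketch, not the authors' actual annotation scheme: a T5-style model would be fine-tuned on (input, target) string pairs of roughly this shape.

```python
def make_training_pair(words, boundaries, tags=None):
    """Build (input, target) strings for a seq2seq phrasing model.

    words      -- list of tokens
    boundaries -- set of word indices after which a prosodic boundary falls
    tags       -- optional per-word annotations (e.g. POS tags) used to
                  enrich the plain-text input
    """
    # Enriched input: each word joined with its tag; plain input otherwise.
    if tags:
        source = " ".join(f"{w}/{t}" for w, t in zip(words, tags))
    else:
        source = " ".join(words)

    # Target: the same word sequence with boundary markers inserted,
    # so the model learns to copy the text and add "|" at phrase breaks.
    target_parts = []
    for i, w in enumerate(words):
        target_parts.append(w)
        if i in boundaries:
            target_parts.append("|")
    return source, " ".join(target_parts)


src, tgt = make_training_pair(
    ["the", "cat", "sat", "on", "the", "mat"], boundaries={2}
)
print(src)  # the cat sat on the mat
print(tgt)  # the cat sat | on the mat
```

Casting boundary prediction as string-to-string generation lets the same pretrained encoder-decoder handle both the plain-text and the enriched-input variants without architectural changes; only the input formatting differs.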
Notes
1. Note that only the level '4' phrases (prosodic/intonational phrases) were considered in our experiments; smaller units (e.g. intermediate phrases) were also labeled but not used.
2. The numbers in the second and third parts of the table differ slightly from those in [23], since a few manual corrections and amendments have been made in the NRS data during the last year.
References
Beckman, M.E., Ayers Elam, G.: Guidelines for ToBI Labelling, Version 3. The Ohio State University Research Foundation, Ohio State University (1997)
Bejček, E., et al.: Prague dependency treebank 3.0 (2013). http://hdl.handle.net/11858/00-097C-0000-0023-1AAF-3, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University
Cruttenden, A.: Intonation. Cambridge Textbooks in Linguistics, 2nd edn. Cambridge University Press, Cambridge (1997)
Daneš, F.: Intonace a věta ve spisovné češtině. ČSAV, Praha (1957)
Fernandez, R., Rendel, A., Ramabhadran, B., Hoory, R.: Prosody contour prediction with long short-term memory, bi-directional, deep recurrent neural networks. In: Li, H., Meng, H.M., Ma, B., Chng, E., Xie, L. (eds.) INTERSPEECH, pp. 2268–2272. ISCA (2014)
Grůber, M., Matoušek, J.: Listening-test-based annotation of communicative functions for expressive speech synthesis. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds.) TSD 2010. LNCS (LNAI), vol. 6231, pp. 283–290. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15760-8_36
Hanzlíček, Z., Vít, J., Tihelka, D.: LSTM-based speech segmentation for TTS synthesis. In: Ekštein, K. (ed.) TSD 2019. LNCS (LNAI), vol. 11697, pp. 361–372. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-27947-9_31
Jůzová, M.: Prosodic phrase boundary classification based on Czech Speech Corpora. In: Ekštein, K., Matoušek, V. (eds.) TSD 2017. LNCS (LNAI), vol. 10415, pp. 165–173. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-64206-2_19
Jůzová, M., Tihelka, D.: Speaker-dependent BiLSTM-based phrasing. In: Sojka, P., Kopeček, I., Pala, K., Horák, A. (eds.) TSD 2020. LNCS (LNAI), vol. 12284, pp. 340–347. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58323-1_37
Klimkov, V., et al.: Phrase break prediction for long-form reading TTS: exploiting text structure information. In: Proceedings of InterSpeech 2017, pp. 1064–1068 (2017)
Kunešová, M., Řezáčková, M.: Detection of prosodic boundaries in speech using Wav2Vec 2.0. In: Sojka, P., et al. (eds.) TSD 2022. LNCS, vol. 13502, pp. 376–387. Springer, Cham (2022)
Louw, J.A., Moodley, A.: Speaker specific phrase break modeling with conditional random fields for text-to-speech. In: 2016 Pattern Recognition Association of South Africa and Robotics and Mechatronics International Conference (PRASA-RobMech), pp. 1–6 (2016)
Matoušek, J., Romportl, J.: On building phonetically and prosodically rich speech corpus for text-to-speech synthesis. In: Proceedings of the 2nd IASTED International Conference on Computational Intelligence, pp. 442–447. ACTA Press, San Francisco (2006)
Matoušek, J., Tihelka, D., Psutka, J.: Experiments with automatic segmentation for Czech speech synthesis. In: Matoušek, V., Mautner, P. (eds.) TSD 2003. LNCS (LNAI), vol. 2807, pp. 287–294. Springer, Heidelberg (2003). https://doi.org/10.1007/978-3-540-39398-6_41
Prahallad, K., Raghavendra, E.V., Black, A.W.: Learning speaker-specific phrase breaks for text-to-speech systems. In: SSW (2010)
Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer (2020). arXiv:1910.10683
Read, I., Cox, S.: Stochastic and syntactic techniques for predicting phrase breaks. Comput. Speech Lang. 21(3), 519–542 (2007)
Rosenberg, A., Fernandez, R., Ramabhadran, B.: Modeling phrasing and prominence using deep recurrent learning. In: InterSpeech 2015, pp. 3066–3070. ISCA (2015)
Taylor, P.: Text-to-Speech Synthesis, 1st edn. Cambridge University Press, New York (2009)
Taylor, P., Black, A.: Assigning phrase breaks from part-of-speech sequences. Comput. Speech Lang. 12, 99–117 (1998)
Tihelka, D., Hanzlíček, Z., Jůzová, M., Vít, J., Matoušek, J., Grůber, M.: Current state of text-to-speech system ARTIC: a decade of research on the field of speech technologies. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds.) TSD 2018. LNCS (LNAI), vol. 11107, pp. 369–378. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00794-2_40
Vaswani, A., et al.: Attention is all you need (2017). arXiv:1706.03762
Volín, J., Řezáčková, M., Matoušek, J.: Human and transformer-based prosodic phrasing in two speech genres. In: Karpov, A., Potapova, R. (eds.) SPECOM 2021. LNCS (LNAI), vol. 12997, pp. 761–772. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-87802-3_68
Volín, J.: The size of prosodic phrases in native and foreign-accented read-out monologues. Acta Universitatis Carolinae - Philologica 2, 145–158 (2019)
Wolf, T., et al.: Transformers: state-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45. Association for Computational Linguistics, Online, October 2020
Švec, J.: t5s - T5 made simple (2020). http://github.com/honzas83/t5s. Accessed 02 Apr 2020
Švec, J., et al.: General framework for mining, processing and storing large amounts of electronic texts for language modeling purposes. Lang. Resour. Eval. 48(2), 227–248 (2013). https://doi.org/10.1007/s10579-013-9246-z
Acknowledgements
Computational resources were supplied by the project “e-Infrastruktura CZ” (e-INFRA CZ LM2018140) supported by the Ministry of Education, Youth and Sports of the Czech Republic.
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Řezáčková, M., Matoušek, J. (2022). Text-to-Text Transfer Transformer Phrasing Model Using Enriched Text Input. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech, and Dialogue. TSD 2022. Lecture Notes in Computer Science(), vol 13502. Springer, Cham. https://doi.org/10.1007/978-3-031-16270-1_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-16269-5
Online ISBN: 978-3-031-16270-1