Abstract
Adding expressivity to synthesized speech is one of the main strategies in speech technology. This paper summarizes our research on modeling Vietnamese prosody, with the goal of improving the naturalness of synthesized Vietnamese speech and of integrating expressivity (i.e., emotion and attitude). Based on the concept of a “rendez-vous” between linguistic levels and prosodic functions, we propose decomposing the prosody of an utterance into several components. Each component is then modeled, step by step, by an independent model: a dynamic linear segment model for tones, a relative register model for the F0 level of each syllable, a rule-based approach for phrasing, and an F0 stylization model for the expressive function. All proposed models were integrated into text-to-speech systems and evaluated in perception experiments.
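To make the idea of decomposition concrete, the following is a minimal sketch of how two of the components named above (a linear segment tone shape and a relative register per syllable) might be combined into a syllable F0 contour. It is not the authors' implementation. The names `Syllable`, `interpolate`, and `syllable_f0`, the use of semitone offsets, and every numeric value are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Syllable:
    # Hypothetical tone shape: breakpoints of (relative time in [0, 1],
    # F0 offset in semitones), joined by straight line segments.
    tone_breakpoints: List[Tuple[float, float]]
    # Hypothetical relative register: offset in semitones from the
    # speaker baseline, shared by the whole syllable.
    register: float


def interpolate(breakpoints: List[Tuple[float, float]], t: float) -> float:
    """Piecewise-linear interpolation over at least two breakpoints."""
    for (t0, v0), (t1, v1) in zip(breakpoints, breakpoints[1:]):
        if t0 <= t <= t1:
            if t1 == t0:
                return v1
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    raise ValueError("t lies outside the breakpoint range")


def syllable_f0(syl: Syllable, baseline_hz: float, n_points: int = 5) -> List[float]:
    """Sum the register and tone components in semitones, then convert to Hz."""
    contour = []
    for i in range(n_points):
        t = i / (n_points - 1)
        semitones = syl.register + interpolate(syl.tone_breakpoints, t)
        contour.append(baseline_hz * 2 ** (semitones / 12))
    return contour


# Illustrative values only: a rising shape spanning one octave, no register shift.
rising = Syllable(tone_breakpoints=[(0.0, 0.0), (1.0, 12.0)], register=0.0)
print(syllable_f0(rising, baseline_hz=200.0))
```

Keeping the components additive in a logarithmic domain means each one can be changed independently, which mirrors the step-by-step modeling described in the abstract; the phrasing and expressive components are left out of this sketch.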
Acknowledgment
We would like to thank Mrs. NGUYEN Thi Thu Trang for her contributions within the framework of this paper and of the research group.
© 2015 Springer International Publishing Switzerland
Cite this paper
Mac, DK., Tran, DD. (2015). Modeling Vietnamese Speech Prosody: A Step-by-Step Approach Towards an Expressive Speech Synthesis System. In: Li, XL., Cao, T., Lim, EP., Zhou, ZH., Ho, TB., Cheung, D. (eds) Trends and Applications in Knowledge Discovery and Data Mining. Lecture Notes in Computer Science, vol 9441. Springer, Cham. https://doi.org/10.1007/978-3-319-25660-3_23
DOI: https://doi.org/10.1007/978-3-319-25660-3_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25659-7
Online ISBN: 978-3-319-25660-3
eBook Packages: Computer Science, Computer Science (R0)