Modeling Vietnamese Speech Prosody: A Step-by-Step Approach Towards an Expressive Speech Synthesis System

Mac, Dang-Khoa; Tran, Do-Dat

doi:10.1007/978-3-319-25660-3_23

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9441)

Abstract

Adding expressivity to synthesized speech is one of the main strategies in speech technology. This paper summarizes our research on modeling Vietnamese prosody, with the goals of improving the naturalness of synthesized Vietnamese speech and of integrating expressivity (i.e., emotion and attitude). Based on the concept of a “rendez-vous” between linguistic levels and prosodic functions, we propose decomposing the prosody of an utterance into several components. Each component is then modeled, step by step, by an independent model: a dynamic linear segment model for tones, a relative register model for the F0 level of each syllable, a rule-based approach for phrasing, and an F0 stylization model for the expressive function. All proposed models were integrated into text-to-speech systems and evaluated in perception experiments.
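
To make the idea of decomposition concrete, the following is a minimal Python sketch of how a tone shape built from linear segments might be combined with a per-syllable register offset to give an F0 contour. It is an illustration only, not the models described in the chapter: the tone names, breakpoint values, semitone scale, and the functions `interpolate` and `syllable_f0` are all assumptions introduced here.

```python
from typing import Dict, List, Tuple

# Hypothetical tone templates (NOT taken from the chapter): each tone is a
# list of (normalized time, offset in semitones) breakpoints joined by
# straight lines.
TONE_SHAPES: Dict[str, List[Tuple[float, float]]] = {
    "level": [(0.0, 0.0), (1.0, 0.0)],
    "rising": [(0.0, 0.0), (0.5, 0.0), (1.0, 4.0)],
    "falling": [(0.0, 0.0), (1.0, -4.0)],
}


def interpolate(points: List[Tuple[float, float]], t: float) -> float:
    """Linearly interpolate a piecewise-linear shape at normalized time t."""
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        if t0 <= t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    raise ValueError("t must lie within the time span of the shape")


def syllable_f0(tone: str, register_st: float, base_hz: float,
                n_frames: int) -> List[float]:
    """Return an F0 contour in Hz for one syllable.

    The contour is the sum, in semitones, of a syllable-level register
    offset and a tone-level shape, converted to Hz around base_hz.
    """
    if n_frames < 2:
        raise ValueError("n_frames must be at least 2")
    shape = TONE_SHAPES[tone]
    contour = []
    for i in range(n_frames):
        t = i / (n_frames - 1)
        semitones = register_st + interpolate(shape, t)
        contour.append(base_hz * 2 ** (semitones / 12))
    return contour


if __name__ == "__main__":
    # A rising tone on a syllable placed 2 semitones above a 200 Hz base.
    for hz in syllable_f0("rising", 2.0, 200.0, 5):
        print(round(hz, 1))
```

Keeping the register and the tone shape as separate additive terms mirrors the separation of components described in the abstract: either term can be replaced without touching the other.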

Acknowledgment

We would like to thank Mrs. NGUYEN Thi Thu Trang for her contributions within the framework of this paper and of the research group.

Author information

Authors and Affiliations

International Research Institute MICA, HUST-CNRS/UMI 2954-Grenoble INP, Hanoi, Vietnam
Dang-Khoa Mac & Do-Dat Tran

Corresponding author

Correspondence to Do-Dat Tran.

Editor information

Editors and Affiliations

Institute of Infocomm Research, Singapore, Singapore
Xiao-Li Li
Ho Chi Minh City University of Technology, Ho Chi Minh City, Vietnam
Tru Cao
School of Information Systems, Singapore Management University, Singapore, Singapore
Ee-Peng Lim
Nanjing University, Nanjing, China
Zhi-Hua Zhou
Japan Advanced Institute of Science & Technology, Nomi-shi, Ishikawa, Japan
Tu-Bao Ho
The University of Hong Kong, Hong Kong, China
David Cheung

About this paper

Cite this paper

Mac, DK., Tran, DD. (2015). Modeling Vietnamese Speech Prosody: A Step-by-Step Approach Towards an Expressive Speech Synthesis System. In: Li, XL., Cao, T., Lim, EP., Zhou, ZH., Ho, TB., Cheung, D. (eds) Trends and Applications in Knowledge Discovery and Data Mining. Lecture Notes in Computer Science, vol 9441. Springer, Cham. https://doi.org/10.1007/978-3-319-25660-3_23

DOI: https://doi.org/10.1007/978-3-319-25660-3_23
Published: 26 November 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25659-7
Online ISBN: 978-3-319-25660-3
eBook Packages: Computer Science, Computer Science (R0)
