Abstract
Audiovisual text-to-speech systems convert a written text into an audiovisual speech signal. Recently, much interest has been directed toward data-driven 2D photorealistic synthesis, in which the system uses a database of pre-recorded auditory and visual speech data to construct the target output signal. In this paper we propose a synthesis technique that creates both the target auditory and the target visual speech from the same audiovisual database. To achieve this, the well-known unit selection synthesis technique is extended to work with multimodal segments containing original combinations of audio and video. This strategy results in a multimodal output signal that exhibits a high level of audiovisual correlation, which is crucial for a natural perception of the synthetic speech signal.
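The extension described in the abstract can be pictured as classical unit selection run over multimodal (audio + video) units: for each target phoneme, one candidate unit is chosen from the database so that the summed target cost (fit to the linguistic specification) and join cost (smoothness of both modalities across concatenation points) is minimal. The sketch below is an illustrative Viterbi-style search, not the authors' actual implementation; all names and cost functions are assumptions.

```python
# Hedged sketch of multimodal unit selection: a dynamic-programming
# search over candidate audiovisual units. The cost functions are
# placeholders; in a real system join_cost would combine acoustic and
# visual mismatch at the concatenation point.

def select_units(targets, candidates, target_cost, join_cost):
    """Viterbi search over candidate audiovisual units.

    targets     : list of target specifications (one per phoneme)
    candidates  : candidates[i] is the list of database units for targets[i]
    target_cost : f(target, unit) -> float
    join_cost   : f(prev_unit, unit) -> float
    """
    # best[i][j] = (cumulative cost, backpointer) for candidate j of target i
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            tc = target_cost(targets[i], u)
            cost, back = min(
                (best[i - 1][k][0] + join_cost(p, u) + tc, k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((cost, back))
        best.append(row)
    # Backtrack from the cheapest final candidate to recover the path.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path))
```

Because each selected segment keeps its original audio/video pairing, concatenating the chosen units preserves the natural audiovisual correlation the paper emphasizes.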
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
Cite this paper
Mattheyses, W., Latacz, L., Verhelst, W., Sahli, H. (2008). Multimodal Unit Selection for 2D Audiovisual Text-to-Speech Synthesis. In: Popescu-Belis, A., Stiefelhagen, R. (eds) Machine Learning for Multimodal Interaction. MLMI 2008. Lecture Notes in Computer Science, vol 5237. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-85853-9_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-85852-2
Online ISBN: 978-3-540-85853-9
eBook Packages: Computer Science (R0)