Abstract
Audiovisual text-to-speech systems convert a written text into an audiovisual speech signal. Recently, much interest has been directed toward data-driven 2D photorealistic synthesis, in which the system uses a database of pre-recorded auditory and visual speech data to construct the target output signal. In this paper we propose a synthesis technique that creates both the target auditory and the target visual speech from the same audiovisual database. To achieve this, the well-known unit selection synthesis technique is extended to work with multimodal segments containing original combinations of audio and video. This strategy results in a multimodal output signal that exhibits a high level of audiovisual correlation, which is crucial for a natural perception of the synthetic speech signal.
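The extension described in the abstract can be pictured as classical unit selection run over multimodal (audio + video) units: for each target phoneme, one candidate unit is chosen from the database so that the summed target cost (fit to the linguistic specification) and join cost (smoothness of both modalities across concatenation points) is minimal. The sketch below is an illustrative Viterbi-style search, not the authors' actual implementation; all names and cost functions are assumptions.

```python
# Hedged sketch of multimodal unit selection: a dynamic-programming
# search over candidate audiovisual units. The cost functions are
# placeholders; in a real system join_cost would combine acoustic and
# visual mismatch at the concatenation point.

def select_units(targets, candidates, target_cost, join_cost):
    """Viterbi search over candidate audiovisual units.

    targets     : list of target specifications (one per phoneme)
    candidates  : candidates[i] is the list of database units for targets[i]
    target_cost : f(target, unit) -> float
    join_cost   : f(prev_unit, unit) -> float
    """
    # best[i][j] = (cumulative cost, backpointer) for candidate j of target i
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            tc = target_cost(targets[i], u)
            cost, back = min(
                (best[i - 1][k][0] + join_cost(p, u) + tc, k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((cost, back))
        best.append(row)
    # Backtrack from the cheapest final candidate to recover the path.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path))
```

Because each selected segment keeps its original audio/video pairing, concatenating the chosen units preserves the natural audiovisual correlation the paper emphasizes.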
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
Cite this paper
Mattheyses, W., Latacz, L., Verhelst, W., Sahli, H. (2008). Multimodal Unit Selection for 2D Audiovisual Text-to-Speech Synthesis. In: Popescu-Belis, A., Stiefelhagen, R. (eds) Machine Learning for Multimodal Interaction. MLMI 2008. Lecture Notes in Computer Science, vol 5237. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-85853-9_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-85852-2
Online ISBN: 978-3-540-85853-9
eBook Packages: Computer Science (R0)