Multimodal Unit Selection for 2D Audiovisual Text-to-Speech Synthesis

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5237)

Abstract

Audiovisual text-to-speech systems convert written text into an audiovisual speech signal. Recently, much interest has focused on data-driven 2D photorealistic synthesis, in which the system uses a database of pre-recorded auditory and visual speech data to construct the target output signal. In this paper we propose a synthesis technique that creates both the target auditory and the target visual speech from the same audiovisual database. To achieve this, the well-known unit selection synthesis technique is extended to work with multimodal segments containing original combinations of audio and video. This strategy results in a multimodal output signal that displays a high level of audiovisual correlation, which is crucial for the synthetic speech to be perceived as natural.
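The idea of selecting units that keep their original audio and video together can be illustrated with a small sketch. This is not the authors' implementation: the `AVUnit` fields (`duration`, `pitch`, `mouth_open`), the cost definitions, and the weights `w_audio` and `w_video` are all hypothetical stand-ins, chosen only to show a dynamic-programming search whose join cost penalises discontinuities in both modalities at once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AVUnit:
    """One database segment holding its original audio AND video (hypothetical features)."""
    phoneme: str
    duration: float    # seconds
    pitch: float       # audio feature at the unit boundary, arbitrary units
    mouth_open: float  # visual feature at the unit boundary, 0..1


def target_cost(wanted_duration, unit):
    # Assumed target cost: mismatch between requested and recorded duration.
    return abs(wanted_duration - unit.duration)


def join_cost(a, b, w_audio=1.0, w_video=1.0):
    # Assumed multimodal join cost: audio and visual discontinuity are summed,
    # so a smooth join must be smooth in both modalities.
    return w_audio * abs(a.pitch - b.pitch) + w_video * abs(a.mouth_open - b.mouth_open)


def select_units(targets, database):
    """Viterbi-style search; targets is a list of (phoneme, duration) pairs."""
    candidates = [[u for u in database if u.phoneme == ph] for ph, _ in targets]
    if any(not c for c in candidates):
        raise ValueError("no database unit for at least one target phoneme")

    # best[i] = (cheapest cost of a path ending in candidate i, that path)
    best = [(target_cost(targets[0][1], u), [u]) for u in candidates[0]]
    for (_, dur), cands in zip(targets[1:], candidates[1:]):
        new_best = []
        for u in cands:
            cost, path = min(
                ((c + join_cost(p[-1], u), p) for c, p in best),
                key=lambda item: item[0],
            )
            new_best.append((cost + target_cost(dur, u), path + [u]))
        best = new_best
    return min(best, key=lambda item: item[0])


# Toy database: two recordings of each phoneme.
db = [
    AVUnit("h", 0.1, 100.0, 0.2),
    AVUnit("h", 0.1, 120.0, 0.8),
    AVUnit("i", 0.2, 118.0, 0.75),
    AVUnit("i", 0.2, 100.0, 0.1),
]
total_cost, chosen = select_units([("h", 0.1), ("i", 0.2)], db)
```

Because each selected unit carries its own recorded audio and video, the two output streams come from the same original moments. That is the property the abstract credits for the high audiovisual correlation.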

Editor information

Andrei Popescu-Belis, Rainer Stiefelhagen

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Mattheyses, W., Latacz, L., Verhelst, W., Sahli, H. (2008). Multimodal Unit Selection for 2D Audiovisual Text-to-Speech Synthesis. In: Popescu-Belis, A., Stiefelhagen, R. (eds) Machine Learning for Multimodal Interaction. MLMI 2008. Lecture Notes in Computer Science, vol 5237. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-85853-9_12

  • DOI: https://doi.org/10.1007/978-3-540-85853-9_12

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-85852-2

  • Online ISBN: 978-3-540-85853-9

  • eBook Packages: Computer Science, Computer Science (R0)
