ABSTRACT
Audiovisual text-to-speech (AVTTS) synthesizers generate a synthetic audiovisual speech signal from an input text. One possible approach is model-based synthesis, in which the talking head is a 3D model whose polygons are deformed in accordance with the target speech. In contrast to these rule-based systems, data-driven synthesizers construct the target speech by reusing pre-recorded natural speech samples. The system we developed at the Vrije Universiteit Brussel is a data-driven, 2D photorealistic synthesizer that creates a synthetic visual speech signal resembling standard 'newsreader-style' television recordings.
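The data-driven strategy the abstract describes is typically realized by unit selection: for each target unit, candidate recorded units are scored by a target cost (how well a candidate matches the specification) and a join cost (how smoothly it concatenates with the preceding unit), and the cheapest overall sequence is found by dynamic programming. The sketch below is illustrative only; the function names and cost functions are assumptions, not taken from the paper.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Pick one recorded unit per target, minimizing total cost.

    targets       -- list of target-unit specifications
    candidates    -- candidates[i] is the list of units usable for targets[i]
    target_cost   -- target_cost(spec, unit): mismatch to the specification
    join_cost     -- join_cost(prev_unit, unit): concatenation smoothness
    """
    # best[i][j] = (cumulative cost, back-pointer) for candidate j of target i
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for c in candidates[i]:
            tc = target_cost(targets[i], c)
            # extend the cheapest path from the previous column (Viterbi step)
            cost, back = min(
                (best[i - 1][j][0] + join_cost(p, c) + tc, j)
                for j, p in enumerate(candidates[i - 1])
            )
            row.append((cost, back))
        best.append(row)
    # backtrack the cheapest sequence of units
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path))
```

In an AVTTS setting of the kind described above, the target cost would additionally penalize audiovisual mismatch, so that the selected audio and video samples stay coherent.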
Index Terms
- Photorealistic 2D audiovisual text-to-speech synthesis using active appearance models