ABSTRACT
This paper presents a visually realistic animation system for synthesizing a talking mouth. Video synthesis is achieved by first learning generative models from recorded speech videos and then using the learned models to generate videos for novel utterances. A generative model treats the whole utterance contained in a video as a continuous process and represents it with a set of trigonometric functions embedded within a path graph. The transformation that projects the values of these functions into the image space is found through graph embedding. Such a model allows us to synthesize mouth images at arbitrary positions within the utterance. To synthesize a video for a novel utterance, the utterance is first compared with the existing ones to find the phoneme combinations that best approximate it. Based on the learned models, dense videos are synthesized, concatenated, and downsampled. A new generative model is then built on the remaining image samples for the final video synthesis.
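The abstract describes the model-learning step only at a high level. As a rough illustration, the sketch below (plain NumPy, with hypothetical helper names such as trig_basis, learn_projection, and synthesize) fits a linear map from a small cosine basis, the continuous analogue of the path-graph Laplacian eigenvectors, to the pixel space of one recorded utterance, and then renders a mouth image at an arbitrary position in that utterance. The least-squares fit stands in for the graph-embedding projection and is an assumption of this sketch, not the authors' implementation.

```python
import numpy as np

def trig_basis(positions, k):
    """Evaluate the first k trigonometric basis functions at normalized
    positions in [0, 1].  For a path graph over the frames, the
    Laplacian eigenvectors are discrete cosines, so cosines serve as
    continuous embedding coordinates here."""
    positions = np.asarray(positions, dtype=float).reshape(-1, 1)
    freqs = np.arange(1, k + 1).reshape(1, -1)
    return np.cos(np.pi * freqs * positions)            # shape (T, k)

def learn_projection(frames, k):
    """Fit a linear map W from trig coordinates to image space by
    least squares: frames ~ trig_basis @ W (a simplification of the
    paper's graph-embedding step)."""
    n = len(frames)
    positions = np.linspace(0.0, 1.0, n)                 # frame i at i/(n-1)
    B = trig_basis(positions, k)                          # (n, k)
    X = frames.reshape(n, -1)                             # flatten images
    W, *_ = np.linalg.lstsq(B, X, rcond=None)             # (k, n_pixels)
    return W

def synthesize(W, position, image_shape, k):
    """Render a mouth image at an arbitrary position in the utterance."""
    b = trig_basis([position], k)                         # (1, k)
    return (b @ W).reshape(image_shape)

# Usage sketch: 'frames' would be the registered mouth images of one
# recorded utterance; random data stands in for them here.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.random((40, 32, 48))                     # (n_frames, h, w)
    k = 8
    W = learn_projection(frames, k)
    mid_frame = synthesize(W, 0.5, frames.shape[1:], k)   # frame halfway through
    print(mid_frame.shape)
```

Because the basis functions are defined on a continuous position variable rather than on frame indices, the same fitted map can be evaluated between recorded frames, which is what allows images to be synthesized at arbitrary positions and at a density higher than the original video.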
Index Terms
- Synthesizing a talking mouth