
Synthesizing a talking mouth

Published: 12 December 2010
DOI: 10.1145/1924559.1924588

ABSTRACT

This paper presents a visually realistic animation system for synthesizing a talking mouth. Video synthesis is achieved by first learning generative models from recorded speech videos and then using the learned models to generate videos for novel utterances. A generative model treats the whole utterance contained in a video as a continuous process and represents it with a set of trigonometric functions embedded within a path graph. The transformation that projects the values of these functions into image space is found through graph embedding. Such a model allows mouth images to be synthesized at arbitrary positions in the utterance. To synthesize a video for a novel utterance, the utterance is first compared with the existing ones to find the phoneme combinations that best approximate it. Based on the learned models, dense videos are synthesized, concatenated, and downsampled. A new generative model is then built on the remaining image samples for the final video synthesis.
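The core construction in the abstract, trigonometric basis functions arising from a path graph plus a learned projection into image space, can be illustrated with a minimal sketch. The Laplacian eigenvectors of a path graph are sampled cosines, so one hypothetical reading (not the authors' implementation; the function names, component count, and the use of a plain least-squares projection are assumptions made here for illustration) looks like:

```python
import numpy as np

def path_graph_basis(n_frames, n_components):
    """Eigenvectors of the path-graph Laplacian.

    For a path graph with n_frames nodes, the Laplacian eigenvectors are
    sampled cosines, v_k[i] = cos(pi * k * (i + 0.5) / n_frames), i.e. the
    trigonometric functions embedded within a path graph mentioned above.
    """
    i = np.arange(n_frames)
    basis = [np.cos(np.pi * k * (i + 0.5) / n_frames)
             for k in range(1, n_components + 1)]
    return np.stack(basis, axis=1)  # shape (n_frames, n_components)

def fit_projection(frames, n_components=10):
    """Least-squares map from the trigonometric basis to image space.

    frames: (n_frames, n_pixels) matrix of vectorized mouth images
    (hypothetical training data for one recorded utterance).
    """
    B = path_graph_basis(frames.shape[0], n_components)
    W, *_ = np.linalg.lstsq(B, frames, rcond=None)  # (n_components, n_pixels)
    return W

def synthesize(W, positions, n_frames_train):
    """Render mouth images at arbitrary (possibly fractional) positions
    in the utterance by evaluating the cosine basis continuously."""
    k = np.arange(1, W.shape[0] + 1)
    pos = np.asarray(positions, dtype=float)
    B = np.cos(np.pi * np.outer((pos + 0.5) / n_frames_train, k))
    return B @ W  # (len(positions), n_pixels)
```

Because the basis consists of continuous cosines, `synthesize` can be evaluated at fractional positions, which is what permits rendering mouth images at arbitrary points in the utterance and supports the synthesize-concatenate-downsample-refit pipeline described above.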


Published in

ICVGIP '10: Proceedings of the Seventh Indian Conference on Computer Vision, Graphics and Image Processing
December 2010, 533 pages
ISBN: 9781450300605
DOI: 10.1145/1924559

            Copyright © 2010 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States



Acceptance Rates

Overall acceptance rate: 95 of 286 submissions, 33%
