
Synthesizing a talking mouth

Published: 12 December 2010
DOI: 10.1145/1924559.1924588

ABSTRACT

This paper presents a visually realistic animation system for synthesizing a talking mouth. Video synthesis is achieved by first learning generative models from recorded speech videos and then using the learned models to generate videos for novel utterances. A generative model treats the whole utterance contained in a video as a continuous process and represents it with a set of trigonometric functions embedded within a path graph. The transformation that projects the values of these functions into image space is found through graph embedding. Such a model allows mouth images to be synthesized at arbitrary positions in the utterance. To synthesize a video for a novel utterance, the utterance is first compared with the existing ones to find the phoneme combinations that best approximate it. Based on the learned models, dense videos are synthesized, concatenated, and downsampled. A new generative model is then built on the remaining image samples for the final video synthesis.
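The core construction in the abstract, trigonometric basis functions arising from a path graph plus a learned projection into image space, can be illustrated with a minimal sketch. The Laplacian eigenvectors of a path graph are sampled cosines, so one hypothetical reading (not the authors' implementation; the function names, component count, and the use of a plain least-squares projection are assumptions made here for illustration) looks like:

```python
import numpy as np

def path_graph_basis(n_frames, n_components):
    """Eigenvectors of the path-graph Laplacian.

    For a path graph with n_frames nodes, the Laplacian eigenvectors are
    sampled cosines, v_k[i] = cos(pi * k * (i + 0.5) / n_frames), i.e. the
    trigonometric functions embedded within a path graph mentioned above.
    """
    i = np.arange(n_frames)
    basis = [np.cos(np.pi * k * (i + 0.5) / n_frames)
             for k in range(1, n_components + 1)]
    return np.stack(basis, axis=1)  # shape (n_frames, n_components)

def fit_projection(frames, n_components=10):
    """Least-squares map from the trigonometric basis to image space.

    frames: (n_frames, n_pixels) matrix of vectorized mouth images
    (hypothetical training data for one recorded utterance).
    """
    B = path_graph_basis(frames.shape[0], n_components)
    W, *_ = np.linalg.lstsq(B, frames, rcond=None)  # (n_components, n_pixels)
    return W

def synthesize(W, positions, n_frames_train):
    """Render mouth images at arbitrary (possibly fractional) positions
    in the utterance by evaluating the cosine basis continuously."""
    k = np.arange(1, W.shape[0] + 1)
    pos = np.asarray(positions, dtype=float)
    B = np.cos(np.pi * np.outer((pos + 0.5) / n_frames_train, k))
    return B @ W  # (len(positions), n_pixels)
```

Because the basis consists of continuous cosines, `synthesize` can be evaluated at fractional positions, which is what permits rendering mouth images at arbitrary points in the utterance and supports the synthesize-concatenate-downsample-refit pipeline described above.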


Published in

ICVGIP '10: Proceedings of the Seventh Indian Conference on Computer Vision, Graphics and Image Processing
December 2010, 533 pages
ISBN: 9781450300605
DOI: 10.1145/1924559

            Copyright © 2010 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States



Acceptance Rates

Overall acceptance rate: 95 of 286 submissions, 33%
