Speaker independence in automated lip-sync for audio–video communication

https://doi.org/10.1016/S0169-7552(98)00216-5

Abstract

By analyzing the absolute value of the Fourier transform of a speaker's voice signal, we can predict the position of the mouth for English vowel sounds, without the use of text, speech recognition, or mechanical or other sensing devices attached to the speaker's mouth. This capability can reduce the time required for mouth animation considerably, and we expect it eventually to be competitive with the speech- and text-driven solutions that are becoming popular. Our technique would require much less interaction from the user and no knowledge of phonetic spelling. We discuss the problems of producing an algorithm that is speaker independent; the goal is to avoid having to measure mouth movements off video for each speaker's training sounds. We have discovered that eliminating variation due to pitch yields moments that are mouth-shape dependent but not speaker dependent. This implies that careful construction of predictor surfaces can produce speaker-independent prediction of mouth motion for English vowels.

Introduction

The ability to animate a talking mouth/face/head with parameters obtained directly from the acoustic speech signal enables a wide variety of multimedia applications beyond the obvious one of automating the animation process. Many, if not all, of these applications may appear in WWW/Internet activities, video communications, and virtual reality. Among them are: accurate face portrayals of intelligent talking agents, interactive gaming, automatic film dubbing, aids to the hearing impaired, aids to hearing in extremely noisy environments, speech therapy, language learning, and compression of facial images for video communications such as network-based video-conferencing.

We have presented our results for speech-driven lip-synching of English vowels for a single speaker in 3, 4, 7, 8, 9. By computing shape evaluators of the magnitude of the Fourier transform of sections of the digitized input sound, we show that predictor equations can be derived for the mouth parameters that describe the motion of the mouth corresponding to the input sound. This is done without text, speech recognition, or mechanical or other sensing devices attached to the speaker's mouth, and in that way is similar to the neural-net-based system of Lavagetto [5]. This capability can reduce the time required for mouth animation considerably. Our method runs very fast, requires only simple techniques, and is potentially speaker independent. We expect the method to compare with, or surpass, systems based on text, synthetic speech or speech recognition, such as those described in 6, 10, 11, 12.

We determine the fundamental frequency of the input sound at successive intervals, or, equivalently, the length of the glottal pulse (GP), by optimizing a linear combination of the first eight harmonics of a sequence of DFTs derived from the input signal. Once the GP has been detected and is being tracked accurately, the DFTs of successive intervals are scaled, clipped, smoothed and normalized to produce a probability density function. The appropriate moments are then computed, scaled, and used as the independent variables of a bivariate predictor function for each visible mouth parameter, the dependent variable. These parameters are: Flare, the vertical distance between the upper and lower lip; Jaw, the distance between the teeth; Corners, the horizontal opening between the lips; and Edges, the distance between the join points of the upper and lower lips. Predictor functions are computed using moment sequences from several training sounds.
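
As a rough illustration of this front end, the sketch below estimates the fundamental frequency of a frame by scoring candidate values against the DFT magnitude at their first eight harmonics, then scales, clips, smooths and normalizes the frame's DFT magnitude into a probability density. The function names, the clipping floor and the smoothing width are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def estimate_f0(frame, fs, f_lo=60.0, f_hi=400.0):
    """Crude stand-in for the paper's harmonic optimization: pick the
    candidate fundamental whose first eight harmonics carry the most
    DFT magnitude."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    best_f0, best_score = f_lo, -np.inf
    for f0 in np.arange(f_lo, f_hi, 1.0):
        bins = [np.argmin(np.abs(freqs - k * f0)) for k in range(1, 9)]
        score = spec[bins].sum()
        if score > best_score:
            best_f0, best_score = f0, score
    return best_f0  # the glottal pulse length is 1 / best_f0 seconds

def frame_pdf(frame, clip_frac=0.05, smooth_width=5):
    """Scale, clip, smooth and normalize the DFT magnitude so it can be
    treated as a probability density (one plausible reading of the steps
    named in the text)."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    spec = np.clip(spec, clip_frac * spec.max(), None)  # floor low-level noise
    kernel = np.ones(smooth_width) / smooth_width
    spec = np.convolve(spec, kernel, mode="same")       # moving-average smoothing
    return spec / spec.sum()                            # normalize to unit area
```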

Our method works well because it does not depend on local behavior of the input signal or its transform. We avoid trying to detect such phenomena as the location of formants, for example. Neither do we use the computationally intensive method of hidden Markov models [15], which is common in speech recognition. Since smoothing is applied at several steps of the process, our method is based on the global behavior of the GP and its transform.

The method makes successful predictions of mouth movements for cases in which the mouth measurements and the training sounds are from the same speaker, and the sounds are restricted to English vowels. When we compare the predicted mouth parameters to the actual values measured from video for non-training sound sequences, the accuracy has been good. Indeed, in some cases the technique has been sufficiently accurate to enable us to detect mouth movements missed by the measurement process.

Section snippets

Speaker independence

We describe our attempt to extend the results to treat multiple speakers. The goal is to avoid the labor intensive problem of measuring mouth movements from video of everyone who uses the system. There have been recent attempts to automate the process of observing and measuring the parameters of mouth motion 1, 2, 13, 14 but nothing has appeared that produces the accuracy needed for our application.

We seek mouth movement which can be used for applications such as lip reading and cartooning as

Intraspeaker variation

Linguistics researchers and experts in voice processing have known for years that the shape characteristics of a given sound are relatively static for a given speaker. For example, the relative locations of the first and second formants in the high front vowel /i/ are relatively constant over pitch variation for a given speaker. However, locating formants automatically is difficult and error-prone because they represent local behavior of the DFT. We have chosen to use moments, which are common
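
Treating the normalized magnitude spectrum as a probability density, such moments reflect the global shape of the spectrum rather than the exact location of any formant peak. A minimal sketch, reusing the hypothetical frame_pdf above:

```python
import numpy as np

def spectral_moments(pdf):
    """Mean and variance of the spectrum viewed as a probability density
    over bin index; the paper's pitch-dependent rescaling of the axis is
    omitted in this sketch."""
    x = np.arange(len(pdf))
    mean = float(np.sum(x * pdf))
    var = float(np.sum((x - mean) ** 2 * pdf))
    return mean, var
```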

Interspeaker variation

The variation reduction procedure described in the previous section not only had a positive result for predicting mouth motion for a single speaker, but it also produced results which were similar over several speakers. Tracks of moments versus time produced by our three speakers for the same sound were almost identical except for timing considerations and variations in mouth position. As examples, Fig. 3 and Fig. 4 show the tracks for the sound OWIE for speakers A and C, respectively. Compare them

Procedure

We used the same training sounds and mouth measurements of speaker A as in our previous papers, but we now use only the mean and the variance of the modified DFT as independent variables.

We have constructed biquadratic least-squares surfaces where we have added points along the boundary of the rectangle R=[1.3, 2]×[1.1, 6] containing the tracks of training sounds, which both reduces ill-conditioning of the problem and ensures that occurrences of (mean, variance) pairs which lie too far from the
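
Fitting such a surface is an ordinary linear least-squares problem once the basis is fixed. The sketch below takes "biquadratic" to mean quadratic in each moment separately (nine terms) and pads the training data with points on the boundary of R = [1.3, 2] × [1.1, 6]; the number of boundary points and the value assigned to them are guesses for illustration, not the authors' construction.

```python
import numpy as np

def biquadratic_design(m, v):
    """Basis of a surface quadratic in each variable:
    {1, m, v, m*v, m^2, v^2, m^2*v, m*v^2, m^2*v^2}."""
    m, v = np.asarray(m, float), np.asarray(v, float)
    return np.column_stack([np.ones_like(m), m, v, m * v, m**2, v**2,
                            m**2 * v, m * v**2, m**2 * v**2])

def fit_predictor(mean, var, mouth_param, n_edge=8):
    """Least-squares predictor surface for one mouth parameter
    (e.g. Flare, Jaw, Corners or Edges).  Extra points on the boundary
    of R = [1.3, 2] x [1.1, 6] reduce ill-conditioning; here they are
    assigned the mean of the training values, an assumption."""
    mean, var, z = map(np.asarray, (mean, var, mouth_param))
    bm = np.linspace(1.3, 2.0, n_edge)
    bv = np.linspace(1.1, 6.0, n_edge)
    edge_m = np.concatenate([bm, bm, np.full(n_edge, 1.3), np.full(n_edge, 2.0)])
    edge_v = np.concatenate([np.full(n_edge, 1.1), np.full(n_edge, 6.0), bv, bv])
    M = np.concatenate([mean, edge_m])
    V = np.concatenate([var, edge_v])
    Z = np.concatenate([z, np.full(edge_m.size, z.mean())])
    coeffs, *_ = np.linalg.lstsq(biquadratic_design(M, V), Z, rcond=None)
    return coeffs

def predict(coeffs, m, v):
    """Evaluate the fitted surface at new (mean, variance) pairs."""
    return biquadratic_design(m, v) @ coeffs
```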

Conclusions and future research

The above results suggest that we may now have speaker independence which can be described by two independent variables for predicting English vowels. We can now spend our energy on the development of universally effective predictor surfaces. We are eager to move to sounds other than vowels and glides, for example nasals and liquids, to see if similar approaches can be used. With similar success in the vowel-like (resonating) consonants, we would then move on to fricatives, stops and

References (15)

  • S. Basu, A. Pentland, Recovering 3D lip structure from 2D observations using a model trained from video, in: Proc. Eur....
  • C.M. Jones, S.S. Dlay, Automated lip synchronisation for human–computer interaction and special effect animation, in:...
  • B. Koster, R. Rodman, D. Bitzer, Automated lip-sync: direct translation of speech-sound to mouth shape, in: Proc. 28th...
  • B. Koster, Automatic lip-sync: direct translation of speech-sound to mouth-animation, Ph.D. Dissertation, Department of...
  • F. Lavagetto, Time-delay neural networks for estimating lip movements from speech analysis: a useful tool in audio–video synchronization, IEEE Trans. on Circuits and Systems for Video Technology (1997)
  • J.P. Lewis, Automated lip-sync: background and techniques, J. Visual. Comput. Animation (1991)
  • D.F. McAllister, R.D. Rodman, D.L. Bitzer, A.S. Freeman, Lip synchronization for animation, in: SIGGRAPH '97 Visual...
There are more references available in the full text version of this article.


David McAllister's primary research areas are speech processing, computer graphics and imaging. He is also interested in true 3D display and has presented several tutorials in this area for SPIE and SIGGRAPH. He has published in the areas of curve and surface representation, fault-tolerant software reliability, and most recently in the lip synchronization of speech.

Robert Rodman's research interests are in speech processing, most particularly in lip synchronization and speaker recognition. In the past he has done research in the use of computers in political campaigns, the use of computerized telephone calls to monitor elderly or frail persons living alone, the use of voice I/O as an aid to the handicapped, speech recognition of Chinese, and theoretical linguistics. He is the co-author of An Introduction to Language, now in its sixth edition. He is also a co-author of the monograph Voice Recognition and author of the book Computer Voice Technology.

Donald Bitzer's current research includes convolutional decoding for high-speed networks, high-speed error-free communication channels for satellite and land communications, a computer-based approach to teaching discrete mathematics in computer science, and speech processing, including lip-sync, speech and speaker recognition, and language identification. He is a Distinguished University Research Professor at North Carolina State University and a member of the National Academy of Sciences.

Andrew Freeman is a former graduate student at North Carolina State University, where he recently earned an M.S. in computer science. He received his Bachelor's degree in computer science in May 1994 from Johns Hopkins University. Before coming to N.C. State, he was employed as a network administrator for the Office of Management and Budget at the Executive Office of the President of the United States.
