Abstract:
This paper proposes the synthesis of talking heads with expressions from speech. Talking heads synthesis can be considered a sequence-to-sequence mapping problem with audio as input and video as output. To synthesize talking heads, we use the SAVEE database, which consists of videos of multiple spoken sentences recorded from the front of the face. Audiovisual data can be regarded as two parallel, continuous-valued sequences of audio and visual features, so the mapping between them is represented by a regression model. In this research, the regression model is a long short-term memory (LSTM) network trained by minimizing the mean squared error (MSE), with audio features as input and visual features as the target; talking heads are thereby synthesized from speech. Our method uses lower-level audio features than phonemes, which enables the synthesis of talking heads with expressions, whereas existing studies that use phonemes as audio features can only synthesize talking heads with neutral expressions. On the SAVEE database, we achieved a minimum MSE of 17.03 on our test set. In the experiments, we use mel-frequency cepstral coefficients (MFCC) with their first and second derivatives (ΔMFCC and Δ²MFCC) and energy as audio features, and active appearance model (AAM) parameters over the entire face region as visual features.
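The pipeline the abstract describes — MFCC-based audio features regressed to AAM visual parameters by an LSTM trained with MSE — can be sketched as follows. This is a minimal illustration assuming PyTorch and librosa; the feature dimensions, hyperparameters, and the AAM parameter extraction are assumptions for the example, not details from the paper.

```python
# Sketch of the audio-to-visual regression described in the abstract.
# Assumes librosa for MFCC + delta features and PyTorch for the LSTM;
# all dimensions and hyperparameters below are illustrative.
import librosa
import numpy as np
import torch
import torch.nn as nn

def audio_features(wav_path, n_mfcc=13):
    """MFCC plus delta and delta-delta coefficients, frame by frame.
    (The 0th MFCC serves as a log-energy term in this sketch.)"""
    y, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    d1 = librosa.feature.delta(mfcc)            # ΔMFCC
    d2 = librosa.feature.delta(mfcc, order=2)   # Δ²MFCC
    return np.vstack([mfcc, d1, d2]).T          # shape: (frames, 3 * n_mfcc)

class AudioToAAM(nn.Module):
    """LSTM regressor: per-frame audio features -> AAM parameters."""
    def __init__(self, in_dim=39, hidden=256, aam_dim=30):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, aam_dim)

    def forward(self, x):           # x: (batch, frames, in_dim)
        h, _ = self.lstm(x)
        return self.head(h)         # (batch, frames, aam_dim)

# One training step: minimize MSE between predicted and ground-truth
# AAM parameter sequences aligned to the audio frames.
model = AudioToAAM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.randn(4, 100, 39)   # stand-in for MFCC+delta features of 4 clips
y = torch.randn(4, 100, 30)   # stand-in for the aligned AAM parameters
opt.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
opt.step()
```

At synthesis time, the trained model would map the audio features of an unseen utterance to a sequence of AAM parameters, from which face images are rendered frame by frame.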
Date of Conference: 11-13 December 2015
Date Added to IEEE Xplore: 11 February 2016