Abstract
The face conveys a blend of verbal and nonverbal information that plays an important role in daily interaction. While speech articulation mostly affects the orofacial areas, emotional behaviors are externalized across the entire face. Considering the relation between verbal and nonverbal behaviors is important for creating naturalistic facial movements for conversational agents (CAs). Furthermore, facial muscles connect areas across the face, creating principled relationships and dependencies between movements that have to be taken into account. These relationships are ignored when movements in different facial regions are generated separately. This paper proposes speech-driven models that jointly capture the relationship not only between speech and facial movements, but also across facial movements. The inputs to the models are features extracted from speech that convey the verbal and emotional states of the speaker. We build our models with bidirectional long short-term memory (BLSTM) units, which have been shown to be very successful in modeling dependencies in sequential data. Objective and subjective evaluations of the results demonstrate the benefits of jointly modeling facial regions with this framework.
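The joint modeling idea can be illustrated with a minimal sketch, assuming a PyTorch implementation; the feature and output dimensionalities, the MSE objective, and the class name JointSpeechToFace are illustrative assumptions rather than the authors' exact configuration. The key point is the single output layer over all facial parameters, so that movements in different facial regions are predicted by one shared network instead of being generated separately.

# Minimal sketch (not the authors' implementation) of a speech-driven BLSTM
# that jointly predicts facial movements for all regions from per-frame
# acoustic features. All dimensions below are illustrative assumptions.
import torch
import torch.nn as nn

class JointSpeechToFace(nn.Module):
    def __init__(self, acoustic_dim=25, hidden_dim=128, face_dim=30, num_layers=2):
        super().__init__()
        # Bidirectional LSTM captures past and future context in the speech signal.
        self.blstm = nn.LSTM(input_size=acoustic_dim,
                             hidden_size=hidden_dim,
                             num_layers=num_layers,
                             batch_first=True,
                             bidirectional=True,
                             dropout=0.5)
        # One linear head over all facial parameters, so the different facial
        # regions share the same model and are generated jointly.
        self.head = nn.Linear(2 * hidden_dim, face_dim)

    def forward(self, speech_feats):
        # speech_feats: (batch, frames, acoustic_dim)
        context, _ = self.blstm(speech_feats)
        return self.head(context)  # (batch, frames, face_dim)

model = JointSpeechToFace()
dummy_speech = torch.randn(4, 200, 25)   # 4 utterances, 200 frames each
pred_face = model(dummy_speech)          # joint facial trajectories
loss = nn.MSELoss()(pred_face, torch.randn_like(pred_face))
loss.backward()

In practice the acoustic features would be speech descriptors such as prosodic or spectral parameters extracted per frame, and the targets would be facial motion parameters; the random tensors above only stand in for such data.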
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Sadoughi, N., Busso, C. (2017). Joint Learning of Speech-Driven Facial Motion with Bidirectional Long-Short Term Memory. In: Beskow, J., Peters, C., Castellano, G., O'Sullivan, C., Leite, I., Kopp, S. (eds) Intelligent Virtual Agents. IVA 2017. Lecture Notes in Computer Science, vol 10498. Springer, Cham. https://doi.org/10.1007/978-3-319-67401-8_49
DOI: https://doi.org/10.1007/978-3-319-67401-8_49
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67400-1
Online ISBN: 978-3-319-67401-8
eBook Packages: Computer Science, Computer Science (R0)