Abstract
We present a new listening head generation benchmark for synthesizing the responsive feedback of a listener (e.g., nods, smiles) during a face-to-face conversation. Although an indispensable complement to talking head generation, listening head generation has seldom been studied in the literature. Automatically synthesizing listening behavior that actively responds to a talking head is critical to applications such as digital humans, virtual agents, and social robots. In this work, we propose a novel dataset, “ViCo”, highlighting listening head generation during face-to-face conversations. A total of 92 identities (67 speakers and 76 listeners) appear in ViCo, across 483 clips in a paired “speaking-listening” pattern, where listeners exhibit one of three listening styles according to their attitude: positive, neutral, or negative. Unlike traditional speech-to-gesture or talking-head generation, listening head generation takes as input both the audio and visual signals from the speaker and produces non-verbal feedback (e.g., head motions, facial expressions) in real time. Our dataset supports a wide range of applications, such as human-to-human interaction, video-to-video translation, and cross-modal understanding and generation. To encourage further research, we also release a listening head generation baseline conditioned on different listening attitudes. Code & ViCo dataset: https://project.mhzhou.com/vico.
M. Zhou—This work was done at JD Explore Academy.
M. Zhou and Y. Bai—Equal contribution.
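To make the task's input-output contract concrete, below is a minimal PyTorch-style sketch of a listening head generator interface. It is not the authors' released baseline: the module names, feature dimensions, embedding-based attitude conditioning, and the GRU decoder are all illustrative assumptions; only the overall signature (speaker audio + speaker visual + attitude label in, per-frame listener motion coefficients out) follows the task described in the abstract.

```python
# A minimal sketch (not the released ViCo baseline) of the listening head
# generation interface: the model consumes the speaker's audio and visual
# streams plus a listening-attitude label, and emits the listener's
# motion/expression coefficients frame by frame. Module names, feature
# sizes, and the GRU decoder are illustrative assumptions.
import torch
import torch.nn as nn


class ListeningHeadGenerator(nn.Module):
    def __init__(self, audio_dim=128, visual_dim=128, coeff_dim=70,
                 hidden_dim=256, num_attitudes=3):
        super().__init__()
        # Per-frame encoders for the speaker's audio and visual signals.
        self.audio_enc = nn.Linear(audio_dim, hidden_dim)
        self.visual_enc = nn.Linear(visual_dim, hidden_dim)
        # Attitude embedding: positive / neutral / negative.
        self.attitude_emb = nn.Embedding(num_attitudes, hidden_dim)
        # Unidirectional recurrence so frames can be produced online.
        self.decoder = nn.GRU(3 * hidden_dim, hidden_dim, batch_first=True)
        # Map hidden states to listener motion coefficients (e.g., 3DMM-style).
        self.head = nn.Linear(hidden_dim, coeff_dim)

    def forward(self, speaker_audio, speaker_visual, attitude):
        # speaker_audio: (B, T, audio_dim); speaker_visual: (B, T, visual_dim)
        # attitude: (B,) integer labels in {0, 1, 2}
        T = speaker_audio.size(1)
        a = self.audio_enc(speaker_audio)
        v = self.visual_enc(speaker_visual)
        s = self.attitude_emb(attitude).unsqueeze(1).expand(-1, T, -1)
        h, _ = self.decoder(torch.cat([a, v, s], dim=-1))
        return self.head(h)  # (B, T, coeff_dim) listener coefficients


# Usage: one 75-frame clip, neutral attitude.
model = ListeningHeadGenerator()
coeffs = model(torch.randn(1, 75, 128), torch.randn(1, 75, 128),
               torch.tensor([1]))
print(coeffs.shape)  # torch.Size([1, 75, 70])
```

The unidirectional GRU keeps decoding causal, matching the abstract's requirement that listener feedback be produced in real time as the speaker's signals stream in.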
Acknowledgment
This work was supported by the National Key R&D Program of China under Grant No. 2020AAA0108600.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zhou, M., Bai, Y., Zhang, W., Yao, T., Zhao, T., Mei, T. (2022). Responsive Listening Head Generation: A Benchmark Dataset and Baseline. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13698. Springer, Cham. https://doi.org/10.1007/978-3-031-19839-7_8
DOI: https://doi.org/10.1007/978-3-031-19839-7_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19838-0
Online ISBN: 978-3-031-19839-7
eBook Packages: Computer Science, Computer Science (R0)