DOI: 10.1145/3581783.3612869

Hierarchical Semantic Perceptual Listener Head Video Generation: A High-performance Pipeline

Published: 27 October 2023

Abstract

In dyadic speaker-listener interactions, the listener's head reactions, together with the speaker's head movements, form an important channel of non-verbal semantic expression. The listener head generation task aims to synthesize responsive listener head videos from the speaker's audio and reference images of the listener. Compared with talking-head generation, it is more challenging to capture the correlation cues between the speaker's audio and visual information. Following the ViCo baseline scheme, we propose a high-performance solution that enhances the hierarchical semantic extraction capability of the audio encoder module and improves the decoder, renderer, and post-processing modules. Our solution achieved first place on the official leaderboard of the listening head generation track. This paper is a technical report for the ViCo@2023 Conversational Head Generation Challenge at the ACM Multimedia 2023 conference.
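
To make the task setup concrete, the sketch below shows one plausible shape of such a pipeline in PyTorch: features are extracted from the speaker's audio at several temporal scales (frame, phrase, utterance), then fused with an identity embedding of the listener's reference image to predict per-frame motion coefficients, which a portrait renderer (e.g., a PIRenderer-style module) would turn into video frames. Every class name, layer choice, and dimension here is an illustrative assumption, not the authors' implementation.

import torch
import torch.nn as nn


class HierarchicalAudioEncoder(nn.Module):
    """Encodes speaker audio at three semantic scales: frame, phrase, utterance.

    Hypothetical module illustrating "hierarchical semantic extraction";
    the paper's actual encoder is not specified in the abstract.
    """

    def __init__(self, n_mels: int = 80, hidden: int = 256):
        super().__init__()
        self.frame_conv = nn.Conv1d(n_mels, hidden, kernel_size=3, padding=1)
        self.phrase_conv = nn.Conv1d(hidden, hidden, kernel_size=5, stride=2, padding=2)
        self.utterance_pool = nn.AdaptiveAvgPool1d(1)

    def forward(self, mel):
        # mel: (batch, n_mels, T) log-mel spectrogram of the speaker
        frame = torch.relu(self.frame_conv(mel))             # fine-grained prosodic cues
        phrase = torch.relu(self.phrase_conv(frame))         # mid-level, word/phrase-scale cues
        utterance = self.utterance_pool(phrase).squeeze(-1)  # global semantic summary
        return frame, phrase, utterance


class ListenerMotionDecoder(nn.Module):
    """Maps audio features plus a listener identity code to per-frame motion coefficients."""

    def __init__(self, hidden: int = 256, id_dim: int = 128, coef_dim: int = 70):
        super().__init__()
        self.gru = nn.GRU(hidden + id_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, coef_dim)  # e.g., 3DMM expression + head pose

    def forward(self, phrase_feat, identity):
        # phrase_feat: (batch, hidden, T'); identity: (batch, id_dim)
        seq = phrase_feat.transpose(1, 2)                    # (batch, T', hidden)
        ident = identity.unsqueeze(1).expand(-1, seq.shape[1], -1)
        out, _ = self.gru(torch.cat([seq, ident], dim=-1))
        return self.head(out)                                # (batch, T', coef_dim)


if __name__ == "__main__":
    mel = torch.randn(2, 80, 100)      # dummy speaker audio features
    identity = torch.randn(2, 128)     # dummy listener reference-image embedding
    _, phrase, _ = HierarchicalAudioEncoder()(mel)
    coefs = ListenerMotionDecoder()(phrase, identity)
    print(coefs.shape)                 # torch.Size([2, 50, 70])

The multi-scale design reflects the motivation stated above: a listener reacts both to short prosodic events (nods timed to stress) and to utterance-level semantics (smiles, sustained attention), so the decoder benefits from audio features at more than one temporal granularity.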




    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783


    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 27 October 2023


    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China

    Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa, ON, Canada


Bibliometrics & Citations

Article Metrics

• Downloads (Last 12 months): 49
• Downloads (Last 6 weeks): 2

Reflects downloads up to 05 Mar 2025

Cited By

• (2024) ListenFormer: Responsive Listening Head Generation with Non-autoregressive Transformers. In Proceedings of the 32nd ACM International Conference on Multimedia, 7094-7103. https://doi.org/10.1145/3664647.3681182
• (2024) Beyond Talking – Generating Holistic 3D Human Dyadic Motion for Communication. International Journal of Computer Vision. https://doi.org/10.1007/s11263-024-02300-7
• (2024) Generation of Listener's Facial Response Using Cross-Modal Mapping of Speaker's Expression. In HCI International 2024 – Late Breaking Posters, 194-201. https://doi.org/10.1007/978-3-031-78531-3_22
• (2024) DIM: Dyadic Interaction Modeling for Social Behavior Generation. In Computer Vision – ECCV 2024, 484-503. https://doi.org/10.1007/978-3-031-72913-3_27
