DOI: 10.1145/3581783.3612869

Hierarchical Semantic Perceptual Listener Head Video Generation: A High-performance Pipeline

Published: 27 October 2023

Abstract

In dyadic speaker-listener interactions, the listener's head reactions, together with the speaker's head movements, form an important channel of non-verbal semantic expression. The listener head generation task aims to synthesize responsive listener head videos from the speaker's audio and reference images of the listener. Compared with talking-head generation, it is more challenging to capture the correlation cues between the speaker's audio and visual information. Following the ViCo baseline scheme, we propose a high-performance solution that enhances the hierarchical semantic extraction capability of the audio encoder module and improves the decoder, renderer, and post-processing modules. Our solution achieved first place on the official leaderboard of the listening head generation track. This paper is a technical report for the ViCo@2023 Conversational Head Generation Challenge at the ACM Multimedia 2023 conference.
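
To make the task setup concrete, the sketch below shows one plausible shape of such a pipeline in PyTorch: features are extracted from the speaker's audio at several temporal scales (frame, phrase, utterance), then fused with an identity embedding of the listener's reference image to predict per-frame motion coefficients, which a portrait renderer (e.g., a PIRenderer-style module) would turn into video frames. Every class name, layer choice, and dimension here is an illustrative assumption, not the authors' implementation.

import torch
import torch.nn as nn


class HierarchicalAudioEncoder(nn.Module):
    """Encodes speaker audio at three semantic scales: frame, phrase, utterance.

    Hypothetical module illustrating "hierarchical semantic extraction";
    the paper's actual encoder is not specified in the abstract.
    """

    def __init__(self, n_mels: int = 80, hidden: int = 256):
        super().__init__()
        self.frame_conv = nn.Conv1d(n_mels, hidden, kernel_size=3, padding=1)
        self.phrase_conv = nn.Conv1d(hidden, hidden, kernel_size=5, stride=2, padding=2)
        self.utterance_pool = nn.AdaptiveAvgPool1d(1)

    def forward(self, mel):
        # mel: (batch, n_mels, T) log-mel spectrogram of the speaker
        frame = torch.relu(self.frame_conv(mel))             # fine-grained prosodic cues
        phrase = torch.relu(self.phrase_conv(frame))         # mid-level, word/phrase-scale cues
        utterance = self.utterance_pool(phrase).squeeze(-1)  # global semantic summary
        return frame, phrase, utterance


class ListenerMotionDecoder(nn.Module):
    """Maps audio features plus a listener identity code to per-frame motion coefficients."""

    def __init__(self, hidden: int = 256, id_dim: int = 128, coef_dim: int = 70):
        super().__init__()
        self.gru = nn.GRU(hidden + id_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, coef_dim)  # e.g., 3DMM expression + head pose

    def forward(self, phrase_feat, identity):
        # phrase_feat: (batch, hidden, T'); identity: (batch, id_dim)
        seq = phrase_feat.transpose(1, 2)                    # (batch, T', hidden)
        ident = identity.unsqueeze(1).expand(-1, seq.shape[1], -1)
        out, _ = self.gru(torch.cat([seq, ident], dim=-1))
        return self.head(out)                                # (batch, T', coef_dim)


if __name__ == "__main__":
    mel = torch.randn(2, 80, 100)      # dummy speaker audio features
    identity = torch.randn(2, 128)     # dummy listener reference-image embedding
    _, phrase, _ = HierarchicalAudioEncoder()(mel)
    coefs = ListenerMotionDecoder()(phrase, identity)
    print(coefs.shape)                 # torch.Size([2, 50, 70])

The multi-scale design reflects the motivation stated above: a listener reacts both to short prosodic events (nods timed to stress) and to utterance-level semantics (smiles, sustained attention), so the decoder benefits from audio features at more than one temporal granularity.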




    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783


    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 27 October 2023


    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China

    Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa, ON, Canada


Bibliometrics & Citations

Article Metrics

• Downloads (Last 12 months): 49
• Downloads (Last 6 weeks): 2

Reflects downloads up to 05 Mar 2025

Cited By

• (2024) ListenFormer: Responsive Listening Head Generation with Non-autoregressive Transformers. In Proceedings of the 32nd ACM International Conference on Multimedia, 7094-7103. https://doi.org/10.1145/3664647.3681182
• (2024) Beyond Talking – Generating Holistic 3D Human Dyadic Motion for Communication. International Journal of Computer Vision. https://doi.org/10.1007/s11263-024-02300-7
• (2024) Generation of Listener's Facial Response Using Cross-Modal Mapping of Speaker's Expression. In HCI International 2024 – Late Breaking Posters, 194-201. https://doi.org/10.1007/978-3-031-78531-3_22
• (2024) DIM: Dyadic Interaction Modeling for Social Behavior Generation. In Computer Vision – ECCV 2024, 484-503. https://doi.org/10.1007/978-3-031-72913-3_27
