Abstract
Co-speech gesture video generation is an enabling technique for many digital human applications. While substantial progress has been made in creating high-quality talking-head videos, existing hand gesture video generation methods remain limited by the widely adopted 2D skeleton-based gesture representation and still struggle to produce realistic hands. We introduce an audio-driven co-speech video generation pipeline that synthesizes human speech videos from a 3D human mesh-based gesture representation. Building on this representation, we present a mesh-grounded video generator that combines a mesh texture map optimization step with a conditional GAN and outputs photorealistic gesture videos with realistic hands. Experiments on the TalkSHOW dataset demonstrate the effectiveness of our method over 2D skeleton-based baselines.
A. Mahapatra and R. Mishra—Equal contribution.
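To make the pipeline concrete, the sketch below illustrates one way the mesh-grounded conditioning could be set up: per-frame 3D body meshes (e.g., from an SMPL-X fit) carrying an optimized UV texture map are rendered with PyTorch3D, and each rendered frame is passed as the condition image to a pix2pix-style generator. This is a minimal sketch under those assumptions, not the authors' released code; `GestureGenerator` and the variable names are hypothetical placeholders.

```python
# Minimal sketch: render per-frame textured body meshes as conditioning images
# for a pix2pix-style video generator. Assumes PyTorch3D is installed and that
# per-frame vertices/faces and an optimized UV texture map are already
# available from the mesh-fitting and texture-optimization stages.
import torch
from pytorch3d.structures import Meshes
from pytorch3d.renderer import (
    FoVPerspectiveCameras, RasterizationSettings, MeshRasterizer,
    MeshRenderer, SoftPhongShader, PointLights, TexturesUV,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def render_condition(verts, faces, verts_uvs, faces_uvs, texture_map, image_size=512):
    """Render one textured mesh frame to an RGB conditioning image."""
    textures = TexturesUV(
        maps=texture_map[None],        # (1, H_tex, W_tex, 3), values in [0, 1]
        faces_uvs=faces_uvs[None],     # (1, F, 3)
        verts_uvs=verts_uvs[None],     # (1, V_uv, 2)
    )
    mesh = Meshes(verts=[verts], faces=[faces], textures=textures).to(device)
    # Camera placement is assumed to come from the mesh-fitting stage;
    # the default FoV camera is used here only to keep the sketch short.
    cameras = FoVPerspectiveCameras(device=device)
    renderer = MeshRenderer(
        rasterizer=MeshRasterizer(
            cameras=cameras,
            raster_settings=RasterizationSettings(image_size=image_size),
        ),
        shader=SoftPhongShader(device=device, cameras=cameras,
                               lights=PointLights(device=device)),
    )
    rgba = renderer(mesh)                      # (1, image_size, image_size, 4)
    return rgba[0, ..., :3].permute(2, 0, 1)   # (3, H, W) condition image

# Hypothetical usage: translate each rendered condition frame into a
# photorealistic output frame with a conditional GAN generator.
# generator = GestureGenerator().to(device)   # e.g. a pix2pixHD-style network
# cond = render_condition(verts_t, faces, verts_uvs, faces_uvs, texture_map)
# frame_t = generator(cond[None])             # (1, 3, H, W) output video frame
```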
Acknowledgments
We thank Kangle Deng, Yufei Ye, and Shubham Tulsiani for helpful discussions. This project is partly supported by Ping An Research.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Mahapatra, A. et al. (2025). Co-speech Gesture Video Generation with 3D Human Meshes. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15147. Springer, Cham. https://doi.org/10.1007/978-3-031-73024-5_11
DOI: https://doi.org/10.1007/978-3-031-73024-5_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73023-8
Online ISBN: 978-3-031-73024-5
eBook Packages: Computer Science, Computer Science (R0)