Abstract
Text-to-motion generation requires not only grounding local actions in language but also seamlessly blending these individual actions to synthesize diverse and realistic global motions. However, existing motion generation methods primarily focus on the direct synthesis of global motions while neglecting the importance of generating and controlling local actions. In this paper, we propose the local action-guided motion diffusion model, which facilitates global motion generation by utilizing local actions as fine-grained control signals. Specifically, we provide an automated method for reference local action sampling and leverage graph attention networks to assess the guiding weight of each local action in the overall motion synthesis. During the diffusion process for synthesizing global motion, we calculate the local-action gradient to provide conditional guidance. This local-to-global paradigm reduces the complexity of direct global motion generation and promotes motion diversity by sampling diverse local actions as conditions. Extensive experiments on two human motion datasets, i.e., HumanML3D and KIT, demonstrate the effectiveness of our method. Furthermore, our method offers the flexibility to seamlessly combine various local actions and continuously adjust their guiding weights, accommodating diverse user preferences, which may be valuable to the community. The project page is available at https://jpthu17.github.io/GuidedMotion-project/.
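To make the gradient-guidance step concrete, below is a minimal PyTorch sketch (not the authors' code) of a DDPM-style sampler in which the gradient of a simple weighted L2 matching term between the predicted clean motion and a set of reference local actions perturbs each denoising step. The toy denoiser, tensor shapes, the L2 energy, and the hand-fixed guiding weights are all illustrative assumptions; the paper's actual energy formulation and graph-attention weighting are not reproduced here.

```python
import torch
import torch.nn as nn


class ToyDenoiser(nn.Module):
    """Stand-in for a text-conditioned motion denoiser that predicts the added noise."""

    def __init__(self, motion_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(motion_dim + 1, 128), nn.SiLU(), nn.Linear(128, motion_dim)
        )

    def forward(self, x_t, t):
        # Crude timestep conditioning: append the normalized step index as a feature.
        t_feat = t.float().view(-1, 1).expand(x_t.size(0), 1) / 1000.0
        return self.net(torch.cat([x_t, t_feat], dim=-1))


def local_action_energy(x0_hat, local_actions, weights, masks):
    """Weighted L2 distance between the predicted clean motion and each reference
    local action, restricted (via the mask) to the frames/joints that action covers."""
    energy = x0_hat.new_zeros(())
    for action, w, mask in zip(local_actions, weights, masks):
        energy = energy + w * ((mask * (x0_hat - action)) ** 2).mean()
    return energy


@torch.no_grad()
def guided_sampling(denoiser, local_actions, weights, masks,
                    motion_dim=64, steps=50, guidance_scale=1.0):
    """DDPM-style reverse process where the gradient of the local-action energy
    nudges the mean of every denoising step (classifier-guidance style)."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x_t = torch.randn(1, motion_dim)
    for t in reversed(range(steps)):
        with torch.enable_grad():
            x_in = x_t.detach().requires_grad_(True)
            eps = denoiser(x_in, torch.tensor([t]))
            # Predict x_0 from the noisy sample (standard DDPM identity).
            x0_hat = (x_in - (1 - alpha_bars[t]).sqrt() * eps) / alpha_bars[t].sqrt()
            # Local-action gradient used as the conditional guidance signal.
            grad = torch.autograd.grad(
                local_action_energy(x0_hat, local_actions, weights, masks), x_in
            )[0]
        eps = eps.detach()

        # Reverse-step mean, shifted in the descent direction of the energy.
        mean = (x_t - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        mean = mean - guidance_scale * grad
        noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
        x_t = mean + betas[t].sqrt() * noise
    return x_t


# Illustrative usage with dummy tensors.
denoiser = ToyDenoiser()
actions = [torch.zeros(1, 64)]   # one reference local action (flattened motion)
weights = [torch.tensor(0.7)]    # guiding weight; in the paper this comes from a graph attention network
masks = [torch.ones(1, 64)]      # frames/joints covered by that action
motion = guided_sampling(denoiser, actions, weights, masks)
```

In the paper's setting, the per-action weights would be predicted by the graph attention network rather than fixed by hand, and the energy would be evaluated on the structured motion representation rather than a flat vector; the sketch only illustrates how an action-matching gradient can steer the reverse diffusion step.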
Acknowledgements
This work was supported in part by the National Key R&D Program of China (No. 2022ZD0118101), Natural Science Foundation of China (No. 61972217, 32071459, 62176249, 62006133, 62271465, 62332002, 62202014), the Shenzhen Medical Research Funds in China (No. B2302037), and AI for Science (AI4S)-Preferred Program, Peking University Shenzhen Graduate School, China.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Jin, P. et al. (2025). Local Action-Guided Motion Diffusion Model for Text-to-Motion Generation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15083. Springer, Cham. https://doi.org/10.1007/978-3-031-72698-9_23
DOI: https://doi.org/10.1007/978-3-031-72698-9_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72697-2
Online ISBN: 978-3-031-72698-9