Local Action-Guided Motion Diffusion Model for Text-to-Motion Generation

  • Conference paper
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Text-to-motion generation requires not only grounding local actions in language but also seamlessly blending these individual actions to synthesize diverse and realistic global motions. However, existing motion generation methods focus primarily on the direct synthesis of global motions and neglect the generation and control of local actions. In this paper, we propose the local action-guided motion diffusion model, which facilitates global motion generation by using local actions as fine-grained control signals. Specifically, we provide an automated method for sampling reference local actions and leverage graph attention networks to assess the guiding weight of each local action in the overall motion synthesis. During the diffusion process that synthesizes the global motion, we compute the local-action gradient to provide conditional guidance. This local-to-global paradigm reduces the complexity of direct global motion generation and promotes motion diversity by sampling diverse actions as conditions. Extensive experiments on two human motion datasets, i.e., HumanML3D and KIT, demonstrate the effectiveness of our method. Furthermore, our method can seamlessly combine various local actions and continuously adjust their guiding weights, accommodating diverse user preferences, which may hold potential significance for the community. The project page is available at https://jpthu17.github.io/GuidedMotion-project/.
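
The local-to-global paradigm described in the abstract can be read as gradient-based (classifier-guidance-style) conditioning: at each reverse-diffusion step, the gradient of a per-action matching energy, weighted by the guiding weight of each local action, nudges the denoising update toward the sampled reference actions. The sketch below is an illustrative reconstruction under that reading, not the authors' implementation; the `denoiser`, the `local_action_energy`, and the `weights` (which the paper obtains from graph attention networks) are hypothetical placeholders.

```python
# Minimal sketch (assumed interfaces, not the authors' code): reverse diffusion
# with gradient-based local-action guidance for text-to-motion generation.
import torch


def sample_with_local_action_guidance(
    denoiser,              # hypothetical: (x_t, t, text_emb) -> (posterior mean, sigma)
    local_action_energy,   # hypothetical: (x_t, action, t) -> scalar mismatch energy
    local_actions,         # reference local actions (e.g. sampled automatically)
    weights,               # per-action guiding weights (e.g. from a graph attention module)
    text_emb,              # text condition embedding
    T=1000,
    motion_shape=(196, 263),   # e.g. HumanML3D: up to 196 frames, 263-dim features
    scale=1.0,                 # guidance strength
    device="cpu",
):
    """Sample a global motion while steering each step toward the reference local actions."""
    x = torch.randn(1, *motion_shape, device=device)            # x_T ~ N(0, I)
    for t in reversed(range(T)):
        t_batch = torch.full((1,), t, device=device, dtype=torch.long)
        with torch.no_grad():
            mean, sigma = denoiser(x, t_batch, text_emb)         # text-conditioned reverse step
        # Local-action gradient: differentiate the weighted sum of per-action energies
        # with respect to the current noisy motion estimate.
        x_in = x.detach().requires_grad_(True)
        energy = sum(w * local_action_energy(x_in, a, t_batch)
                     for w, a in zip(weights, local_actions))
        grad = torch.autograd.grad(energy, x_in)[0]
        # Shift the reverse-step mean along the negative energy gradient (conditional guidance).
        mean = mean - scale * (sigma ** 2) * grad
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = (mean + sigma * noise).detach()
    return x                                                     # generated global motion x_0
```

In this reading, increasing an action's weight strengthens its pull on the sampled motion, which is consistent with the continuous guiding-weight adjustment the abstract describes; the exact energy and weighting scheme used by the paper may differ.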

Acknowledgements

This work was supported in part by the National Key R&D Program of China (No. 2022ZD0118101), Natural Science Foundation of China (No. 61972217, 32071459, 62176249, 62006133, 62271465, 62332002, 62202014), the Shenzhen Medical Research Funds in China (No. B2302037), and AI for Science (AI4S)-Preferred Program, Peking University Shenzhen Graduate School, China.

Author information

Correspondence to Chang Liu or Li Yuan.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 816 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Jin, P. et al. (2025). Local Action-Guided Motion Diffusion Model for Text-to-Motion Generation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15083. Springer, Cham. https://doi.org/10.1007/978-3-031-72698-9_23

  • DOI: https://doi.org/10.1007/978-3-031-72698-9_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72697-2

  • Online ISBN: 978-3-031-72698-9

  • eBook Packages: Computer Science, Computer Science (R0)
