Abstract
Text-to-motion generation requires not only grounding local actions in language but also seamlessly blending these individual actions to synthesize diverse and realistic global motions. However, existing motion generation methods primarily focus on the direct synthesis of global motions while neglecting the importance of generating and controlling local actions. In this paper, we propose the local action-guided motion diffusion model, which facilitates global motion generation by utilizing local actions as fine-grained control signals. Specifically, we provide an automated method for reference local action sampling and leverage graph attention networks to assess the guiding weight of each local action in the overall motion synthesis. During the diffusion process for synthesizing global motion, we calculate the local-action gradient to provide conditional guidance. This local-to-global paradigm reduces the complexity of direct global motion generation and promotes motion diversity by sampling diverse local actions as conditions. Extensive experiments on two human motion datasets, i.e., HumanML3D and KIT, demonstrate the effectiveness of our method. Furthermore, our method offers the flexibility to seamlessly combine various local actions and continuously adjust their guiding weights, accommodating diverse user preferences, which may be valuable to the community. The project page is available at https://jpthu17.github.io/GuidedMotion-project/.
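To make the gradient-guidance step concrete, below is a minimal PyTorch sketch (not the authors' code) of a DDPM-style sampler in which the gradient of a simple weighted L2 matching term between the predicted clean motion and a set of reference local actions perturbs each denoising step. The toy denoiser, tensor shapes, the L2 energy, and the hand-fixed guiding weights are all illustrative assumptions; the paper's actual energy formulation and graph-attention weighting are not reproduced here.

```python
import torch
import torch.nn as nn


class ToyDenoiser(nn.Module):
    """Stand-in for a text-conditioned motion denoiser that predicts the added noise."""

    def __init__(self, motion_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(motion_dim + 1, 128), nn.SiLU(), nn.Linear(128, motion_dim)
        )

    def forward(self, x_t, t):
        # Crude timestep conditioning: append the normalized step index as a feature.
        t_feat = t.float().view(-1, 1).expand(x_t.size(0), 1) / 1000.0
        return self.net(torch.cat([x_t, t_feat], dim=-1))


def local_action_energy(x0_hat, local_actions, weights, masks):
    """Weighted L2 distance between the predicted clean motion and each reference
    local action, restricted (via the mask) to the frames/joints that action covers."""
    energy = x0_hat.new_zeros(())
    for action, w, mask in zip(local_actions, weights, masks):
        energy = energy + w * ((mask * (x0_hat - action)) ** 2).mean()
    return energy


@torch.no_grad()
def guided_sampling(denoiser, local_actions, weights, masks,
                    motion_dim=64, steps=50, guidance_scale=1.0):
    """DDPM-style reverse process where the gradient of the local-action energy
    nudges the mean of every denoising step (classifier-guidance style)."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x_t = torch.randn(1, motion_dim)
    for t in reversed(range(steps)):
        with torch.enable_grad():
            x_in = x_t.detach().requires_grad_(True)
            eps = denoiser(x_in, torch.tensor([t]))
            # Predict x_0 from the noisy sample (standard DDPM identity).
            x0_hat = (x_in - (1 - alpha_bars[t]).sqrt() * eps) / alpha_bars[t].sqrt()
            # Local-action gradient used as the conditional guidance signal.
            grad = torch.autograd.grad(
                local_action_energy(x0_hat, local_actions, weights, masks), x_in
            )[0]
        eps = eps.detach()

        # Reverse-step mean, shifted in the descent direction of the energy.
        mean = (x_t - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        mean = mean - guidance_scale * grad
        noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
        x_t = mean + betas[t].sqrt() * noise
    return x_t


# Illustrative usage with dummy tensors.
denoiser = ToyDenoiser()
actions = [torch.zeros(1, 64)]   # one reference local action (flattened motion)
weights = [torch.tensor(0.7)]    # guiding weight; in the paper this comes from a graph attention network
masks = [torch.ones(1, 64)]      # frames/joints covered by that action
motion = guided_sampling(denoiser, actions, weights, masks)
```

In the paper's setting, the per-action weights would be predicted by the graph attention network rather than fixed by hand, and the energy would be evaluated on the structured motion representation rather than a flat vector; the sketch only illustrates how an action-matching gradient can steer the reverse diffusion step.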
Acknowledgements
This work was supported in part by the National Key R&D Program of China (No. 2022ZD0118101), Natural Science Foundation of China (No. 61972217, 32071459, 62176249, 62006133, 62271465, 62332002, 62202014), the Shenzhen Medical Research Funds in China (No. B2302037), and AI for Science (AI4S)-Preferred Program, Peking University Shenzhen Graduate School, China.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Jin, P. et al. (2025). Local Action-Guided Motion Diffusion Model for Text-to-Motion Generation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15083. Springer, Cham. https://doi.org/10.1007/978-3-031-72698-9_23
DOI: https://doi.org/10.1007/978-3-031-72698-9_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72697-2
Online ISBN: 978-3-031-72698-9