
ParCo: Part-Coordinating Text-to-Motion Synthesis

  • Conference paper
Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15114)


Abstract

We study a challenging task: text-to-motion synthesis, which aims to generate motions that align with textual descriptions and exhibit coordinated movements. Existing part-based methods introduce part partitioning into the motion synthesis process to achieve finer-grained generation. However, these methods face challenges such as a lack of coordination between the motions of different parts and difficulty for networks in understanding part concepts. Moreover, introducing finer-grained part concepts raises computational complexity. In this paper, we propose Part-Coordinating Text-to-Motion Synthesis (ParCo), endowed with enhanced capabilities for understanding part motions and for communication among different part motion generators, ensuring coordinated and fine-grained motion synthesis. Specifically, we discretize whole-body motion into multiple part motions to establish prior concepts of the different parts. We then employ multiple lightweight generators to synthesize the different part motions and coordinate them through our part coordination module. Our approach demonstrates superior performance on the common benchmarks HumanML3D and KIT-ML with economical computation, providing substantial evidence of its effectiveness. Code is available at: https://github.com/qrzou/ParCo.
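To make the described architecture concrete, the sketch below shows, in PyTorch-style pseudocode, one way the idea in the abstract could be organized: whole-body motion split into per-part streams, one lightweight text-conditioned generator per part, and a coordination module that exchanges information across parts. This is a minimal illustrative sketch only, not the authors' released implementation (see the repository linked above); the six-part split, feature sizes, GRU-based part generators, and attention-based coordination are all assumptions for illustration.

```python
# Illustrative sketch only: per-part generators plus a part coordination module.
# Module names, dimensions, and layer choices are assumptions, not ParCo's code.
import torch
import torch.nn as nn

NUM_PARTS = 6   # assumed split, e.g. torso, root, two arms, two legs
PART_DIM = 64   # assumed per-part motion feature size
TEXT_DIM = 512  # assumed text-embedding size (e.g. from a CLIP-like encoder)


class PartGenerator(nn.Module):
    """A lightweight generator for a single body part, conditioned on text."""

    def __init__(self, part_dim: int, text_dim: int, hidden: int = 128):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden)
        self.gru = nn.GRU(part_dim + hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, part_dim)

    def forward(self, part_motion: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # part_motion: (B, T, part_dim); text_emb: (B, text_dim)
        B, T, _ = part_motion.shape
        cond = self.text_proj(text_emb).unsqueeze(1).expand(B, T, -1)
        h, _ = self.gru(torch.cat([part_motion, cond], dim=-1))
        return self.out(h)  # (B, T, part_dim)


class PartCoordination(nn.Module):
    """Lets per-part features exchange information via attention over the part axis."""

    def __init__(self, part_dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(part_dim, heads, batch_first=True)

    def forward(self, part_feats: torch.Tensor) -> torch.Tensor:
        # part_feats: (B * T, NUM_PARTS, part_dim)
        coord, _ = self.attn(part_feats, part_feats, part_feats)
        return part_feats + coord  # residual coordination


class ParCoSketch(nn.Module):
    """Split into parts, generate each part, coordinate, then reassemble the body."""

    def __init__(self):
        super().__init__()
        self.generators = nn.ModuleList(
            [PartGenerator(PART_DIM, TEXT_DIM) for _ in range(NUM_PARTS)]
        )
        self.coordination = PartCoordination(PART_DIM)

    def forward(self, part_motions, text_emb: torch.Tensor) -> torch.Tensor:
        # part_motions: list of NUM_PARTS tensors, each (B, T, PART_DIM)
        per_part = [g(m, text_emb) for g, m in zip(self.generators, part_motions)]
        stacked = torch.stack(per_part, dim=2)  # (B, T, P, D)
        B, T, P, D = stacked.shape
        coordinated = self.coordination(stacked.reshape(B * T, P, D)).reshape(B, T, P, D)
        return coordinated.reshape(B, T, P * D)  # whole-body motion features


if __name__ == "__main__":
    model = ParCoSketch()
    parts = [torch.randn(2, 16, PART_DIM) for _ in range(NUM_PARTS)]
    text = torch.randn(2, TEXT_DIM)
    print(model(parts, text).shape)  # torch.Size([2, 16, 384])
```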

Q. Zou and S. Yuan—Equal contribution.



Acknowledgements

This work was supported by the National Key R&D Program of China under Grant 2018AAA0102801.

Author information

Corresponding author

Correspondence to Qiran Zou.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 3469 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Zou, Q. et al. (2025). ParCo: Part-Coordinating Text-to-Motion Synthesis. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15114. Springer, Cham. https://doi.org/10.1007/978-3-031-72992-8_8

  • DOI: https://doi.org/10.1007/978-3-031-72992-8_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72991-1

  • Online ISBN: 978-3-031-72992-8

  • eBook Packages: Computer Science, Computer Science (R0)
