Abstract
We study a challenging task: text-to-motion synthesis, which aims to generate motions that align with textual descriptions and exhibit coordinated movements. Current part-based methods introduce part partitioning into the motion synthesis process to achieve finer-grained generation. However, these methods suffer from a lack of coordination between the motions of different parts and from the difficulty networks have in understanding part concepts. Moreover, introducing finer-grained part concepts increases computational complexity. In this paper, we propose Part-Coordinating Text-to-Motion Synthesis (ParCo), endowed with enhanced capabilities for understanding part motions and for communication among different part motion generators, ensuring coordinated and fine-grained motion synthesis. Specifically, we discretize whole-body motion into multiple part motions to establish a prior concept of different parts. We then employ multiple lightweight generators, each synthesizing the motion of a different part, and coordinate them through our part coordination module. Our approach demonstrates superior performance on common benchmarks, including HumanML3D and KIT-ML, with modest computational cost, providing substantial evidence of its effectiveness. Code is available at: https://github.com/qrzou/ParCo.
Q. Zou and S. Yuan—Equal contribution.
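To make the pipeline outlined in the abstract concrete, below is a minimal, hypothetical PyTorch sketch of the overall structure: whole-body motion is split into per-part states, each part has its own lightweight generator conditioned on a text embedding, and an attention-based coordination step exchanges information across parts before each generation step. All class names, dimensions, and the generation loop here are illustrative assumptions rather than the authors' implementation; see the linked repository for the actual code.

```python
import torch
import torch.nn as nn

class PartGenerator(nn.Module):
    """Lightweight per-part generator: maps a text feature and the part's
    current state to the part's next motion state (illustrative only)."""
    def __init__(self, text_dim, part_dim, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim + part_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, part_dim),
        )

    def forward(self, text_feat, part_state):
        return self.net(torch.cat([text_feat, part_state], dim=-1))

class PartCoordination(nn.Module):
    """Exchanges information across part states so that the generated
    part motions remain mutually consistent (attention over parts)."""
    def __init__(self, part_dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(part_dim, num_heads=2, batch_first=True)

    def forward(self, part_states):  # (batch, num_parts, part_dim)
        coordinated, _ = self.attn(part_states, part_states, part_states)
        return coordinated

# Toy usage: 6 body parts, each represented by a 32-dim state per step.
num_parts, part_dim, text_dim = 6, 32, 64
generators = nn.ModuleList(PartGenerator(text_dim, part_dim) for _ in range(num_parts))
coord = PartCoordination(part_dim)

text_feat = torch.randn(1, text_dim)              # pooled text embedding (placeholder)
states = torch.zeros(1, num_parts, part_dim)      # initial per-part states
for _ in range(4):                                # unrolled generation steps
    states = coord(states)                        # coordinate across parts
    states = torch.stack(
        [g(text_feat, states[:, i]) for i, g in enumerate(generators)], dim=1
    )
```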
Acknowledgements
This work was supported by the National Key R&D Program of China under Grant 2018AAA0102801.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zou, Q. et al. (2025). ParCo: Part-Coordinating Text-to-Motion Synthesis. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15114. Springer, Cham. https://doi.org/10.1007/978-3-031-72992-8_8
DOI: https://doi.org/10.1007/978-3-031-72992-8_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72991-1
Online ISBN: 978-3-031-72992-8
eBook Packages: Computer Science, Computer Science (R0)