Abstract
We study a challenging task: text-to-motion synthesis, which aims to generate motions that align with textual descriptions and exhibit coordinated movements. Current part-based methods introduce part partitioning into the motion synthesis process to achieve finer-grained generation. However, these methods suffer from a lack of coordination between the motions of different parts and from the difficulty networks have in understanding part concepts. Moreover, introducing finer-grained part concepts increases computational complexity. In this paper, we propose Part-Coordinating Text-to-Motion Synthesis (ParCo), endowed with enhanced capabilities for understanding part motions and for communication among different part motion generators, ensuring coordinated and fine-grained motion synthesis. Specifically, we discretize whole-body motion into multiple part motions to establish a prior concept of different parts. We then employ multiple lightweight generators, each synthesizing the motion of a different part, and coordinate them through our part coordination module. Our approach demonstrates superior performance on common benchmarks, including HumanML3D and KIT-ML, with modest computational cost, providing substantial evidence of its effectiveness. Code is available at: https://github.com/qrzou/ParCo.
Q. Zou and S. Yuan—Equal contribution.
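To make the pipeline outlined in the abstract concrete, below is a minimal, hypothetical PyTorch sketch of the overall structure: whole-body motion is split into per-part states, each part has its own lightweight generator conditioned on a text embedding, and an attention-based coordination step exchanges information across parts before each generation step. All class names, dimensions, and the generation loop here are illustrative assumptions rather than the authors' implementation; see the linked repository for the actual code.

```python
import torch
import torch.nn as nn

class PartGenerator(nn.Module):
    """Lightweight per-part generator: maps a text feature and the part's
    current state to the part's next motion state (illustrative only)."""
    def __init__(self, text_dim, part_dim, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim + part_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, part_dim),
        )

    def forward(self, text_feat, part_state):
        return self.net(torch.cat([text_feat, part_state], dim=-1))

class PartCoordination(nn.Module):
    """Exchanges information across part states so that the generated
    part motions remain mutually consistent (attention over parts)."""
    def __init__(self, part_dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(part_dim, num_heads=2, batch_first=True)

    def forward(self, part_states):  # (batch, num_parts, part_dim)
        coordinated, _ = self.attn(part_states, part_states, part_states)
        return coordinated

# Toy usage: 6 body parts, each represented by a 32-dim state per step.
num_parts, part_dim, text_dim = 6, 32, 64
generators = nn.ModuleList(PartGenerator(text_dim, part_dim) for _ in range(num_parts))
coord = PartCoordination(part_dim)

text_feat = torch.randn(1, text_dim)              # pooled text embedding (placeholder)
states = torch.zeros(1, num_parts, part_dim)      # initial per-part states
for _ in range(4):                                # unrolled generation steps
    states = coord(states)                        # coordinate across parts
    states = torch.stack(
        [g(text_feat, states[:, i]) for i, g in enumerate(generators)], dim=1
    )
```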
Acknowledgements
This work was supported by the National Key R&D Program of China under Grant 2018AAA0102801.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zou, Q. et al. (2025). ParCo: Part-Coordinating Text-to-Motion Synthesis. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15114. Springer, Cham. https://doi.org/10.1007/978-3-031-72992-8_8
DOI: https://doi.org/10.1007/978-3-031-72992-8_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72991-1
Online ISBN: 978-3-031-72992-8
eBook Packages: Computer Science, Computer Science (R0)