CPoser: An Optimization-after-Parsing Approach for Text-to-Pose Generation Using Large Language Models

Authors:

Yao-Xiang Ding,

Kun Zhou

ACM Transactions on Graphics (TOG), Volume 43, Issue 6

Article No.: 196, Pages 1 - 13

https://doi.org/10.1145/3687932

Published: 19 November 2024

Abstract

Text-to-pose generation is challenging due to the complexity of natural language and human posture semantics. Utilizing large language models (LLMs) for text-to-pose generation is appealing due to their strong capabilities in text understanding and reasoning. However, as LLMs are designed for general-purpose language processing and not specifically trained for pose generation, it remains nontrivial to generate precise articulation targets for the full body using LLMs directly. To this end, we propose CPoser, a novel approach to harness the power of LLMs for text-to-pose generation, featuring a prompt parsing stage and a pose optimization stage. The parsing stage utilizes LLMs to turn text prompts into pose intermediate representations (Pose-IRs) through a set of predefined structured queries. These Pose-IRs explicitly describe specific pose conditions, such as squatting depth and knee bending angle, naturally forming an objective function that a target pose should satisfy. The optimization stage solves for expressive poses and hand gestures based on the Pose-IR objective function via robust optimization in a quantized pose prior space. The results are further refined to enhance naturalness and incorporate facial expressions. Experiments show that our approach effectively understands diverse text prompts for pose generation, surpassing existing text-to-pose methods.
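
The two-stage idea in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the `PoseCondition` record, the quantity names, the tiny `CODEBOOK`, and the exhaustive search in `solve` are all hypothetical stand-ins, chosen only to show how a set of explicit pose conditions can act as an objective function that is minimized over a discrete (quantized) set of candidate poses. The paper's actual Pose-IR schema, pose prior, and optimizer are not described on this page.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class PoseCondition:
    """Hypothetical Pose-IR entry: a named pose quantity and its target value."""
    quantity: str        # e.g. "knee_bend_deg" (illustrative name, not from the paper)
    target: float
    weight: float = 1.0


def objective(pose, conditions):
    """Weighted squared deviation of a pose from every condition."""
    return sum(c.weight * (pose[c.quantity] - c.target) ** 2 for c in conditions)


# Toy stand-in for a quantized prior: each body part may only take
# one of a few discrete codebook entries (values are made up).
CODEBOOK = {
    "legs": [
        {"knee_bend_deg": 0.0, "squat_depth": 0.0},
        {"knee_bend_deg": 45.0, "squat_depth": 0.4},
        {"knee_bend_deg": 90.0, "squat_depth": 0.8},
    ],
    "arms": [
        {"elbow_bend_deg": 0.0},
        {"elbow_bend_deg": 90.0},
    ],
}


def solve(conditions):
    """Exhaustively search code combinations for the lowest objective value."""
    parts = sorted(CODEBOOK)
    best_codes, best_pose, best_cost = None, None, float("inf")
    for codes in product(*(range(len(CODEBOOK[p])) for p in parts)):
        pose = {}
        for part, idx in zip(parts, codes):
            pose.update(CODEBOOK[part][idx])
        cost = objective(pose, conditions)
        if cost < best_cost:
            best_codes, best_pose, best_cost = dict(zip(parts, codes)), pose, cost
    return best_codes, best_pose, best_cost


# Conditions such as a parsing stage might emit for "a deep squat, arms nearly straight".
conditions = [
    PoseCondition("knee_bend_deg", 80.0),
    PoseCondition("squat_depth", 0.7),
    PoseCondition("elbow_bend_deg", 10.0, weight=0.5),
]
codes, pose, cost = solve(conditions)
print(codes, pose, round(cost, 2))
```

In this sketch the search is brute force because the codebook is tiny; a realistic quantized prior would be far too large for that, which is presumably why the abstract refers to a dedicated robust optimization procedure.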


Index Terms

1. Computing methodologies
  1. Computer graphics
    1. Animation
  2. Machine learning

Published In

ACM Transactions on Graphics Volume 43, Issue 6

December 2024

1828 pages

EISSN: 1557-7368

DOI: 10.1145/3702969

Copyright © 2024 Copyright is held by the owner/author(s). Publication rights licensed to ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Natural Science Foundation of China
