research-article

CPoser: An Optimization-after-Parsing Approach for Text-to-Pose Generation Using Large Language Models

Published: 19 November 2024

Abstract

Text-to-pose generation is challenging due to the complexity of natural language and human posture semantics. Utilizing large language models (LLMs) for text-to-pose generation is appealing due to their strong capabilities in text understanding and reasoning. However, as LLMs are designed for general-purpose language processing and not specifically trained for pose generation, it remains nontrivial to generate precise articulation targets for the full body using LLMs directly. To this end, we propose CPoser, a novel approach to harness the power of LLMs for text-to-pose generation, featuring a prompt parsing stage and a pose optimization stage. The parsing stage utilizes LLMs to turn text prompts into pose intermediate representations (Pose-IRs) through a set of predefined structured queries. These Pose-IRs explicitly describe specific pose conditions, such as squatting depth and knee bending angle, naturally forming an objective function that a target pose should satisfy. The optimization stage solves for expressive poses and hand gestures based on the Pose-IR objective function via robust optimization in a quantized pose prior space. The results are further refined to enhance naturalness and incorporate facial expressions. Experiments show that our approach effectively understands diverse text prompts for pose generation, surpassing existing text-to-pose methods.
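The optimization-after-parsing idea can be illustrated with a toy sketch. It is not the authors' implementation. The `PoseCondition` class, the parameter names (`knee_bend_deg`, `hip_height_ratio`), the numeric ranges, and the hard-coded `pose_ir` list standing in for LLM output are all hypothetical. Plain gradient descent over two scalars also replaces the robust optimization in a quantized pose prior space that the abstract describes. The sketch only shows how parsed pose conditions can form an objective function that a target pose should satisfy.

```python
from dataclasses import dataclass


# Hypothetical Pose-IR entry: one named pose parameter constrained to an
# interval [low, high]. Names and ranges are illustrative assumptions.
@dataclass(frozen=True)
class PoseCondition:
    parameter: str
    low: float
    high: float
    weight: float = 1.0


def objective(pose, conditions):
    """Sum of weighted squared violations; zero when every condition holds."""
    total = 0.0
    for c in conditions:
        value = pose[c.parameter]
        violation = max(c.low - value, 0.0) + max(value - c.high, 0.0)
        total += c.weight * violation ** 2
    return total


def gradient(pose, conditions):
    """Analytic gradient of the objective with respect to each parameter."""
    grad = {name: 0.0 for name in pose}
    for c in conditions:
        value = pose[c.parameter]
        if value < c.low:
            grad[c.parameter] += 2.0 * c.weight * (value - c.low)
        elif value > c.high:
            grad[c.parameter] += 2.0 * c.weight * (value - c.high)
    return grad


def optimize(pose, conditions, step=0.25, iterations=200):
    """Plain gradient descent, a simple stand-in for the paper's optimizer."""
    pose = dict(pose)  # copy so the caller's starting pose is not modified
    for _ in range(iterations):
        grad = gradient(pose, conditions)
        for name in pose:
            pose[name] -= step * grad[name]
    return pose


# Stand-in for what a parsing stage might return for "a deep squat".
pose_ir = [
    PoseCondition("knee_bend_deg", 100.0, 130.0),
    PoseCondition("hip_height_ratio", 0.3, 0.5),
]
rest_pose = {"knee_bend_deg": 0.0, "hip_height_ratio": 1.0}
result = optimize(rest_pose, pose_ir)
```

Each condition contributes nothing once its parameter lies inside the requested interval, so the optimizer stops at the nearest pose that satisfies the parsed description rather than pushing toward an arbitrary point value. A realistic system would optimize full-body joint rotations under a learned pose prior instead of independent scalars.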




      Published In

      ACM Transactions on Graphics, Volume 43, Issue 6
      December 2024
      1828 pages
      EISSN: 1557-7368
      DOI: 10.1145/3702969

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. human posture
      2. text-to-pose generation
      3. zero-shot learning
      4. pose priors
      5. large language models

      Qualifiers

      • Research-article

      Funding Sources

      • Natural Science Foundation of China
