Abstract
Aligning multiple modalities in a latent space, such as images and texts, has been shown to produce powerful semantic visual representations, fueling tasks like image captioning, text-to-image generation, or image grounding. In the context of human-centric vision, although CLIP-like representations encode most standard human poses relatively well (such as standing or sitting), they lack sufficient acuity to discern detailed or uncommon ones. Indeed, while 3D human poses have often been associated with images (e.g. to perform pose estimation or pose-conditioned image generation), or more recently with text (e.g. for text-to-pose generation), they have seldom been paired with both. In this work, we combine 3D poses, pictures of people, and textual pose descriptions to produce an enhanced 3D-, visual- and semantic-aware human pose representation. We introduce a new transformer-based model, trained in a retrieval fashion, which can take as input any combination of the aforementioned modalities. When composing modalities, it outperforms a standard multi-modal alignment retrieval model, making it possible to handle partial information (e.g. an image with the lower body occluded). We showcase the potential of such an embroidered pose representation for (1) SMPL regression from an image with an optional text cue, and (2) fine-grained instruction generation, i.e. generating text that describes how to move from one 3D pose to another (as a fitness coach would). Unlike prior works, our model can take any kind of input (image and/or pose) without retraining.
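To make the retrieval-style alignment concrete, here is a minimal sketch, not the authors' code, of the kind of symmetric InfoNCE contrastive objective commonly used to align the outputs of modality-specific encoders in CLIP-like training. All names (info_nce, trimodal_alignment_loss, pose_z, img_z, txt_z) are hypothetical, and the pairwise-sum formulation is an assumption rather than the paper's exact loss.

# Illustrative sketch only: pairwise contrastive alignment of pose,
# image, and text embeddings. Assumes three modality-specific encoders
# have already produced (B, D) embedding batches.
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss between two batches of paired embeddings."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                      # (B, B) similarities
    targets = torch.arange(a.size(0), device=a.device)    # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def trimodal_alignment_loss(pose_z: torch.Tensor,
                            img_z: torch.Tensor,
                            txt_z: torch.Tensor) -> torch.Tensor:
    """Sum of pairwise contrastive losses over the three modalities."""
    return (info_nce(pose_z, img_z) +
            info_nce(pose_z, txt_z) +
            info_nce(img_z, txt_z))

In such a setup, retrieval across any pair of modalities reduces to a nearest-neighbor search in the shared embedding space, which is what enables querying with arbitrary combinations of inputs.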
Acknowledgments
This work is supported by the Spanish government under project MoHuCo PID2020-120049RB-I00, and by NAVER LABS Europe under the technology transfer contract 'Text4Pose'.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Delmas, G., Weinzaepfel, P., Moreno-Noguer, F., Rogez, G. (2025). PoseEmbroider: Towards a 3D, Visual, Semantic-Aware Human Pose Representation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15129. Springer, Cham. https://doi.org/10.1007/978-3-031-73209-6_4
Print ISBN: 978-3-031-73208-9
Online ISBN: 978-3-031-73209-6